This post is by Phil.

Psychologists perform experiments on Canadian undergraduate psychology students and draw conclusions that (they believe) apply to humans in general; they publish in Science. A drug company decides to embark on additional trials that will cost tens of millions of dollars based on the results of a careful double-blind study…whose patients are all volunteers from two hospitals. A movie studio holds 9 screenings of a new movie for volunteer viewers and, based on their survey responses, decides to spend another $8 million to re-shoot the ending. A researcher interested in the effect of ventilation on worker performance conducts a months-long study in which ventilation levels are varied and worker performance is monitored…in a single building.

In almost all fields of research, most studies are based on convenience samples, or on random samples from a larger population that is itself a convenience sample. The paragraph above gives just a few examples. The benefits of carefully conducted randomized trials are well known, but so are the costs and impediments. Lucky people studying some natural phenomena like solar output or earthquakes can deal with complete datasets, but for most data analysts and applied statisticians the fact that your data are not a random sample from your population of interest is so commonplace that it usually goes without saying. This does not mean that it goes without thinking, of course: most researchers, and all good ones, think about the extent to which their results might or might not be applicable to a wider population and try to frame their conclusions accordingly. But most or all researchers are willing to extrapolate their results to wider populations to some degree. The movie studio reshoots their ending not because they want to please their 9 test audiences, but because they think that the response of those 9 test audiences tells them something about millions of other viewers, even though those 9 audiences were not selected according to a careful, randomized sampling scheme.

If you think everything I’ve said so far is so obvious as to be boring, so did I, but I was proven wrong. Read on.

I’m currently working with time series data on building electricity consumption. People want to be able to answer questions like “I changed something about my large commercial building on March 1; how much energy am I saving,” and one way to do that is to fit a statistical model using data from before March 1, use it to predict the energy use after March 1, and compare the predictions to what actually happened. There are other uses for these models too.
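As a rough illustration of the before/after approach just described, here is a minimal sketch. It assumes a deliberately simple baseline model (daily energy use as a straight-line function of outdoor temperature) and entirely made-up data; the function names `fit_line` and `estimated_savings` are hypothetical, and real tools use much richer models.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def estimated_savings(pre_temp, pre_kwh, post_temp, post_kwh):
    """Fit on pre-change data, predict the post period, return predicted minus actual."""
    a, b = fit_line(pre_temp, pre_kwh)
    predicted = [a + b * t for t in post_temp]
    return sum(predicted) - sum(post_kwh)

# Synthetic data (assumption): daily kWh rises with outdoor temperature,
# and the change made to the building cuts use by 50 kWh/day.
rng = random.Random(0)
pre_temp = [rng.uniform(10, 30) for _ in range(60)]
pre_kwh = [1000 + 20 * t + rng.gauss(0, 15) for t in pre_temp]
post_temp = [rng.uniform(10, 30) for _ in range(30)]
post_kwh = [1000 + 20 * t - 50 + rng.gauss(0, 15) for t in post_temp]

savings = estimated_savings(pre_temp, pre_kwh, post_temp, post_kwh)
print(f"Estimated savings over 30 days: {savings:.0f} kWh (simulated truth: 1500)")
```

The estimate is only as good as the baseline model's ability to predict what would have happened without the change, which is why the accuracy of these models matters.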

There are quite a few companies that offer energy modeling programs. Typically, a company develops its own proprietary tool. Everyone thinks their package is better than anyone else’s, or at least they say so, but there’s usually no evidence. A few months ago a company approached us to ask if we will compare the accuracy of their tool to some standard methods and allow them to publicize the results if they want to. (I work at a government research lab and they correctly see us as a disinterested party.) We said sure. Our chosen approach is cross-validation. We compiled electricity data from a few dozen large commercial buildings — the population of interest — blanked out big chunks of data, and gave the resulting dataset to the company. Their task is to make predictions for the missing time periods and give them to us, and we will compare their predictions to reality. We’re doing the same with some standard prediction models. The data we’ve given them are a convenience sample. There’s no practical way to get data from a random sample of buildings in the country or even from a single electric utility, in part because of privacy issues (the utilities can’t give out the data without permission of the building owners). Our data are a grab bag from different sources, and certainly not representative of the broad population of commercial buildings in many ways. Still, what else can one do? We want to find out if these guys have a model that outperforms standard models, and this is a way to find the answer.
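The scoring step of such a comparison might look something like the sketch below. The error metric (root-mean-square error), the method names, and the numbers are all assumptions made for illustration; the post does not say which metric is actually being used.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error over the blanked-out time steps."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def compare_methods(held_out):
    """held_out maps building id -> (actual values, {method name: predictions}).

    Returns {method name: [RMSE for each building, in building order]}.
    """
    scores = {}
    for actual, predictions in held_out.values():
        for method, predicted in predictions.items():
            scores.setdefault(method, []).append(rmse(actual, predicted))
    return scores

# Tiny made-up example: two buildings, two methods.
held_out = {
    "bldg_1": ([100, 120, 110], {"vendor": [102, 118, 111], "standard": [95, 130, 100]}),
    "bldg_2": ([200, 210, 190], {"vendor": [205, 200, 195], "standard": [198, 212, 191]}),
}
scores = compare_methods(held_out)
wins = sum(v < s for v, s in zip(scores["vendor"], scores["standard"]))
print(f"vendor has the lower RMSE on {wins} of {len(held_out)} buildings")
```

Keeping the scores per building, rather than pooling everything into one number, makes it possible to see whether one method wins consistently or only on a few buildings.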

To my surprise, our postdoc strongly disagrees. He asserts that since we don’t have a sampling plan, just a haphazard collection of building data, we can’t say anything at all. He says “sure, of course you can say which method performs better for these 52 buildings, but you cannot say a single thing about a 53rd building. Nothing.” At first I thought he meant that we should be cautious about making firm claims, and I certainly agree with that, but that’s not it: he really thinks that it is wrong to draw any conclusions whatsoever from our results. He thinks it’s wrong (incorrect, and borderline immoral) to say that our results are even suggestive. I offered a wager: if we find that one method performs substantially better than the others on average, I will bet you dinner that it will perform better for a 53rd building. He said sure, fine for wagering over dinner, but it is scientifically indefensible to make any statement at all about which of the methods performs better in general on the basis of anything we might find using our dataset.

To me this has some parallels to a recent post about a theoretical statistician whose work is useless in practice. I am as aware of the problems of biased datasets as anyone — I once looked into the issue of bias in indoor radon measurement datasets, and the bias turned out to be absolutely enormous — but in most fields of research if you think you can’t learn anything about the wider world unless that wider world is inside your sampling frame, you may as well quit now because you are never going to get the world into your sampling frame.

This is an interesting problem because it is sort of outside the realm of statistics, and into some sort of meta-statistical area. How can you judge whether your results can be extrapolated to the “real world,” if you can’t get a real-world sample to compare to? (And if you could get a sample to compare to, you would, and then this problem wouldn’t come up).

I would welcome thoughtful commentary on this subject.

Phil:

I agree that this is an important and general problem, but I don’t think it is outside the realm of statistics! I think that one useful statistical framework here is multilevel modeling. Suppose you are applying a procedure to J cases and want to predict case J+1 (in this case, the cases are buildings and J=52). Let the parameters be theta_1,…,theta_{J+1}, with data y_1,…,y_{J+1}, and case-level predictors X_1,…,X_{J+1}. The question is how to generalize from (theta_1,…,theta_J) to theta_{J+1}. This can be framed in a hierarchical model in which the J cases in your training set are a sample from population 1 and your new case is drawn from population 2. Now you need to model how much the thetas can vary from one population to another, but this should be possible. They’re all buildings, after all. And, as with hierarchical models in general, the more information you have in the observed X’s, the less variation you would hope to have in the thetas.
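To make the two-population idea concrete, here is a deliberately crude sketch, not a full multilevel fit. It assumes each theta_j is a single number per building (say, the advantage of one method over another), treats the thetas as measured exactly, uses a normal approximation, ignores the predictors X, and represents the population-1-to-population-2 difference by one extra standard deviation, `population_shift_sd`, which has to come from judgment rather than from the data. All names and numbers are hypothetical.

```python
import math
import statistics

def predict_new_case(thetas, population_shift_sd=0.0):
    """Approximate predictive mean and sd for theta_{J+1}.

    thetas: estimates theta_1..theta_J from the training cases.
    population_shift_sd: assumed sd of the difference between the population
        the training cases came from and the population of the new case.
    """
    j = len(thetas)
    mu = statistics.mean(thetas)
    tau = statistics.stdev(thetas)        # between-case sd within population 1
    se_mu = tau / math.sqrt(j)            # uncertainty about the population-1 mean
    predictive_sd = math.sqrt(tau ** 2 + se_mu ** 2 + population_shift_sd ** 2)
    return mu, predictive_sd

def prob_positive(mu, sd):
    """P(theta_{J+1} > 0) under a normal predictive distribution."""
    return 0.5 * (1 + math.erf(mu / (sd * math.sqrt(2))))

# Made-up per-building advantages of method A over method B.
thetas = [2.0, 1.5, 3.0, 0.5, 2.5, 1.0, 2.0, -0.5]
for shift_sd in (0.0, 1.0, 5.0):
    mu, sd = predict_new_case(thetas, shift_sd)
    print(f"shift sd {shift_sd}: P(A better for the new case) = {prob_positive(mu, sd):.2f}")
```

With no assumed shift the prediction is fairly confident; as the assumed shift grows, the probability drifts toward one half. Saying "nothing can be said" about the new case corresponds to an unboundedly large shift.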

Sander Greenland has written on related issues of bias modeling in epidemiology.

“how much the thetas can vary from one population to another”

and how much of that varying is systematic (bias) versus haphazard (random error).

The statistics discipline has been very wary of directly addressing the systematic part (and some authors have decreed that part “is not statistics!”).

But it is interesting how little has been written on Don Rubin’s suggestion of representing bias in informative priors (such as Greenland’s Multiple Bias Analysis – MBA.)

It is even more interesting how few want to even consider doing these analyses in epidemiology studies.

The challenge is that if you don’t draw some conclusion from this biased data set, someone else will. Now, that can sound like an excuse for many sorts of indefensible actions, but it’s meant as a more general statistical point. As you point out, it may not be possible to get a truly random sample. If it’s not possible, or it’s cost prohibitive, or even just expensive, there will still be conclusions drawn. They will just be based on worse models, or on no model at all. Remember: the alternative to bad statistics is not good statistics, it’s no statistics.

Having worked as a consultant for numerous Fortune 500 companies, I can say that statistical sophistication there ranges from lacking to nonexistent. They don’t know the right questions to ask (“Did you overfit the model?” is about the most challenging question we ever got, and if we had, it would have been easy to dodge). If they do have people who know the right questions–often in-house statisticians–those people have a tendency to get hung up on questions like the ones you point out above. They say that the data set is too small or non-random. So the decision-makers ignore them and go right ahead and buy the new product or make a huge investment decision.

It’s just as essential for statisticians to understand cost-benefit analysis of their work (and the time delays that go along with measuring something perfectly) as it is to understand the statistics. Caveat your work, but if you can move the needle from a 50% chance you’re right to a 75% chance, that’s quite valuable.

No statistics is better than bad statistics, because it means you don’t have the illusion of knowledge. It is better to know that you are ignorant, than to be ignorant that you are ignorant.

And you don’t even know if you can move your needle from 50% to 75%. You may very well be moving that needle backwards.

That being said, you go to war with the data you have, not the data you want. I’m going to have to side with your postdoc theoretically, and side with Phil practically. Besides, if you can be fairly certain that the 53rd building has the same characteristics as the 52 other buildings and is also non-randomly selected, that would make the data far more applicable and understandable.

On the one hand saying you have NO new information after looking at a sample is absurd, and I would call saying you can make no better prediction outside of your dataset morally indefensible, and something a tobacco company might argue. That is, since we do not have perfect data let us throw ALL of it out instead of trying to figure out how much more information it could give. I think the ancient Greeks themselves had a name for that fallacy.

On the other hand, if the data is bad enough the level of confidence goes down and at some point you just say, well it isn’t really telling me anything better than I know now, or it is not useful in testing or adjusting any worthwhile model, so spending the effort is a waste.

But really the thing is, the real world isn’t like your example. There is tons of information even in the 50 buildings. There are the things that were changed. What other buildings have them, and what kind of energy use change happened when they were changed … This is the scientific method. The data can be used in very simple models and even in a binary or qualitative way – things changed or they didn’t. Then one can look at other buildings with similar traits, and perhaps if it is feasible, do some more experiments, and get more or less confidence.

Isn’t that the whole idea of the Bayesian type analysis in modeling – to explicitly set limits on how much we know based on prior experience? Shouldn’t the new weights coming in be better, even with minimal data?

Did you ask your postdoc how he makes predictions/decisions in his day-to-day life? I assume he doesn’t follow an epsilon-first policy before making every policy decision in his life. Guy must be incapacitated in any new environment or whenever he realizes that the environment might be nonstationary. ;)

I think that speaking as a strict Frequentist, your postdoc is right. The side that believes that the information in the first 52 buildings is relevant to the 53rd (even if not from the same “population” — what is a population in a complex system anyway?), is a subjective Bayesian. I think both sides have their merits and are more reasonable in some cases versus others.

I like that you are settling this via betting though! Reminds me of when I first learned about the Dutch Book argument (and how much I like decision theory).

This is a problem that has fascinated me for some time. On the one hand, there’s no scientific method for determining the generalizability of non-random (or imperfectly random) samples. On the other hand, most of the really interesting questions we want to answer are impossible or cost-prohibitive to get random samples for.

My feeling is that there’s a certain point where you move out of the realm of science and into the realm of scholarship. Your results become one piece of evidence, that is not anything like proof, for one perspective.

I’m no expert on his work, but it’s my understanding that Charles Manski is one guy who has done a lot of work on the question of just what conclusions you can concretely draw with the limited data that you have. One paper of his I read in grad school where he talked about it was “Bounding Disagreements about Treatment Effects: A Case Study of Sentencing and Recidivism”.

I’ve hired statisticians like this guy. They go back into academia and happily write theoretical papers, I guess. Or they find an area of applied work which requires sampling from archival records or other domains without sampling issues.

Entire fields of applied statistics disappear under this viewpoint, I think. (forecasting, most obviously, since the future is only like the past by analogy).

If the process were a random one then it *would* be hard to generalize from a non-random sample. But if the process is controlled by causal elements such as predictable aspects of human behavior, thermal conduction, and fluid mechanics, and the primary uncertainty is the manner in which those causal elements affect the outcome, then observing pretty much ANY instances, regardless of how they’re chosen, can give you *some* insight into the deterministic causal portion of the response. This is the essence of science and is the difference between science + statistics, vs pure data analysis + probability theory.

well, it may be hard to get a random sample that has all of the variables you need to run your particular model but it may be possible to get a random sample that has some variables that overlap with the ones you have in your convenience sample and you can at least compare the distribution of these common variables across the two samples to get a sense of what is going on.

stated abstractly as that, this doesn’t seem useful but I would wager that in many applications it probably is. for your example: what is the set of variables that you can ask a utility to release without violating confidentiality? perhaps you could get location, square footage, number of floors, kind of industry etc. the utility could presumably give you this data at a more reasonable degree of representativeness than your convenience sample. you can then at least begin to study the selection problem by comparing these observables across the two samples. if you find that your convenience sample consisted of buildings that are much smaller than those in the more representative sample, then you would presumably have more to go on than before …
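A comparison of this kind can be as simple as a standardized difference in means for each shared variable. The sketch below uses made-up floor areas; the function name `standardized_difference` and the data are assumptions for illustration only.

```python
import math
import statistics

def standardized_difference(convenience, reference):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(convenience) + statistics.variance(reference)) / 2
    )
    return (statistics.mean(convenience) - statistics.mean(reference)) / pooled_sd

# Made-up floor areas, in thousands of square feet.
convenience_sqft = [40, 55, 60, 35, 50, 45]               # buildings in the convenience sample
reference_sqft = [120, 90, 150, 80, 110, 130, 100, 95]    # more representative sample

d = standardized_difference(convenience_sqft, reference_sqft)
print(f"standardized difference in floor area: {d:.2f}")
```

A value far from zero, as here, would say that the convenience sample differs from the reference sample on that variable, which is a reason to model the variable explicitly rather than a reason to give up.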

Some time ago, when commenting on a prominent science blog, I pointed out that reality evolves in a manner that is not entirely random. A surprising number of people, mostly claiming to be scientists, replied to inform me that I was wrong! The argument went in various unfruitful directions, and I never got round to asking them why on earth they were pursuing science, if that was their belief.

Tell that postdoc that you have a blade of grass in your hand. Ask him to make his best guess as to its colour. If he confidently guesses green, then I would argue he does not believe his own philosophy.

I think that a possible way to check if you could “extrapolate” the analysis is to randomly select a subsample of your 52 buildings, say 40 of them, and then check whether the model that predicts better in the sample of 40 buildings in fact performs better for the other buildings as well.

This can be a good starting point.
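The check suggested above could be sketched as follows, assuming one error score per building per method (lower is better) and repeating the random split many times. The data are simulated and the name `split_agreement` is hypothetical.

```python
import random
import statistics

def split_agreement(errors_a, errors_b, n_selected=40, n_repeats=500, seed=0):
    """Fraction of random splits in which the method with the lower mean error
    on the selected buildings also has the lower mean error on the left-out ones."""
    rng = random.Random(seed)
    indices = list(range(len(errors_a)))
    agreements = 0
    for _ in range(n_repeats):
        rng.shuffle(indices)
        selected, left_out = indices[:n_selected], indices[n_selected:]
        a_wins_selected = (statistics.mean([errors_a[i] for i in selected])
                           < statistics.mean([errors_b[i] for i in selected]))
        a_wins_left_out = (statistics.mean([errors_a[i] for i in left_out])
                           < statistics.mean([errors_b[i] for i in left_out]))
        agreements += (a_wins_selected == a_wins_left_out)
    return agreements / n_repeats

# Simulated errors for 52 buildings: method A is better by about 1 unit on average.
rng = random.Random(1)
errors_b = [rng.uniform(5, 15) for _ in range(52)]
errors_a = [b - 1 + rng.gauss(0, 1) for b in errors_b]

agreement = split_agreement(errors_a, errors_b)
print(f"winner on the 40 also wins on the other 12 in {agreement:.0%} of splits")
```

Note the limit of this check: the left-out buildings come from the same convenience sample, so it tests stability within that sample, not whether the result carries over to buildings collected some other way.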

To add to what Lakeland, Zbicyclist, and Tom campbell-ricketts said: Physicists not only didn’t do randomized experiments, but they went out of their way to make their experiments as non-random as possible. Against all philosophical odds, they somehow manage to make progress.

On the other hand, 47 out of 53 fundamental papers in cancer research, no doubt full of properly randomized and designed experiments, couldn’t be replicated. http://news.yahoo.com/cancer-science-many-discoveries-dont-hold-174216262.html

The gulf between the rhetoric of randomization and the reality of it couldn’t be any wider.

I would love to ask this statistician in particular the following question: if you can draw no conclusions from a non-random x% sample, then what happens as x->100%? Does he really believe, for example, that you can make no inferences from a 99% non-random sample? At what value of x can you start making inferences?
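One way to put numbers on this question is with worst-case bounds, in the spirit of the partial-identification work mentioned earlier in the thread. The sketch assumes a finite population, an outcome known to lie in a fixed range (here 0 to 1, for example "method A beat method B in this building"), and no assumptions at all about the unobserved cases. The function name and the numbers are made up.

```python
def worst_case_bounds(sample_mean, fraction_observed, lo, hi):
    """Bounds on the population mean of an outcome known to lie in [lo, hi],
    when a fraction of the population was observed (non-randomly) and
    nothing is assumed about the unobserved remainder."""
    unseen = 1.0 - fraction_observed
    lower = fraction_observed * sample_mean + unseen * lo
    upper = fraction_observed * sample_mean + unseen * hi
    return lower, upper

for x in (0.10, 0.50, 0.99):
    lower, upper = worst_case_bounds(sample_mean=0.7, fraction_observed=x, lo=0.0, hi=1.0)
    print(f"x = {x:.0%}: population mean lies in [{lower:.3f}, {upper:.3f}]")
```

Even with no assumptions, the interval is never the whole range once anything has been observed, and it shrinks steadily as x approaches 100%. There is no threshold at which inference suddenly becomes possible; assumptions about the unobserved cases only narrow the interval further.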

This discussion also had me thinking about the effective (rather than planned) alpha level in publications, as related to your mention of “47 out of 53 fundamental papers in cancer research … [which] couldn’t be replicated.”

I have lately been contemplating whether there should be an agreed-upon adjustment to alpha (or p-values) in all publications with this in mind. This adjustment might differ between fields, perhaps even individual journals, and would probably be based on Bayesian reasoning. Such an adjustment would be a conservative step towards accounting for such issues as convenience sampling as well as investigator bias.

Clark,

I don’t think fiddling with alpha will help anything. For that cancer result to happen at least one of three things has to be true:

(1) Cancer research is mostly deliberate fraud.

(2) The best cancer researchers are staggeringly more incompetent than everyone else.

(3) Despite randomization, there is almost always some unknown relevant factor correlated with the treatment. When the experiment is repeated that factor is typically no longer correlated with the treatment so the phenomenon disappears.

I’m guessing that in this case (3) is to blame. This however is in direct contradiction to the folklore of Randomization, which says something like “if you use a Random Number Generator to assign treatment groups then (3) will not happen very much”.

Adjusting alpha doesn’t help with this.

We have established methods of adjusting p-values (or conversely, alpha) to account for multiple comparisons. It does not seem a great leap to adjusting for unknown (but estimable) biases. These may well include designed versus non-designed experiments, experience of the researcher, quality of randomization, and even historical rates of non-replicability associated with particular fields or journals. Many a graduate student could be entertained by determining appropriate adjustments, yielding many citable publications. Once recommendations are in place, papers could simply include raw and adjusted p-values side-by-side, and readers could choose their own interpretation.

These were said to be pre-clinical studies, so unlikely that most involved randomisation and almost none will have first drawn a random sample from a well defined population (Phil’s point).

Clark: Actually sort of the opposite _if_ the studies are properly conducted and reported (two p_values of ~ .10 roughly give a combined p_value ~ .05).

Ingram Olkin’s publications are a good source for ideas on using combinations of p_values from independent studies and could address Phil’s concern somewhat with regard to the question of whether all the studies have been (almost) NULL versus one or more being ALT (focussing on a particular direction or not).
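The arithmetic in the parenthesis can be checked with one standard combination rule, Fisher's method. The comment does not say which rule it has in mind, so treating it as Fisher's method is an assumption; the sketch also assumes the p-values come from independent studies.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: under the joint null, -2 * sum(ln p) follows a
    chi-square distribution with 2k degrees of freedom for k independent p-values."""
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)   # the statistic divided by 2
    # The chi-square survival function has a closed form for even degrees of freedom.
    return math.exp(-half_statistic) * sum(
        half_statistic ** i / math.factorial(i) for i in range(k)
    )

print(round(fisher_combined_p([0.10, 0.10]), 3))   # 0.056
```

A single p-value passes through unchanged (`fisher_combined_p([0.3])` returns 0.3), which is a quick sanity check on the formula.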

“During a decade as head of global cancer research at Amgen, C. Glenn Begley identified 53 “landmark” publications — papers in top journals, from reputable labs — for his team to reproduce.”

these are basic science labs. i would not interpret such studies as “properly randomized and designed experiments.”

O’Rourke and Jimmy,

I stand corrected. Thank you.

Wow – this is truly an unfair dismissal of randomization, which is one of the most important innovations in statistical measurement. When the physicists are able to use their non-random methods to devise cancer treatments that are replicable, call me.

Exactly, Kaiser. Multiple comparisons may be a problem. Experimenter bias may be a problem. ‘Outlier’ handling might be a problem. Pressure for positive results by funders might be a problem. Broken ‘double blind’ might be a problem. It seems to me unlikely that randomization’s the culprit.

Of course, randomization of treatment is NOT the same concept as a random sample from a known population, which is where we started a couple of feet above.

Well, I felt randomization of whatever form could do with some character assassination if not an outright smear campaign.

Claims are made about the benefits of using Random Samples. Those claims are pretty close to being unfalsifiable. Every time random sampling leads to the wrong answer in the real world there’s a handy explanation for why the sample wasn’t “truly random”. Others have admitted as much by saying truly random samples almost never occur in practice.

In all seriousness though, I don’t think randomization is useless. I think however, it is a very weak tool. How weak is hard to say because of the reason given in the paragraph above. And to the extent that this tool pushes scientists away from other, more powerful approaches, it’s a hindrance.

Physicists don’t have a monopoly on those other more powerful approaches, incidentally. Biologists also formulate strong chemical and biological theories and then put them to definitive tests. They did so even before statisticians came along with their innovations in statistical measurement. The results of vaccination, for example, seem pretty reproducible.

Physicists can infer causality from non-randomized experiments because they are dealing with simple systems – meaning that they have enough control over the system to guarantee exchangeability. For the systems that randomization is used to study (e.g. cancer treatments), this is usually not feasible. Randomization is the best you can do in such a case because the independence conditions hold by definition (although in practice there are certainly ways to screw it up). Chapter two of Hernan/Robins has a really good discussion of this:

http://www.hsph.harvard.edu/faculty/miguel-hernan/files/hernanrobins_v1.10.15.pdf

Oh so Maxwell’s Equations and all the Electrodynamical/optical phenomena they imply are simple?

I don’t see any reason to believe that Electric, Magnetic, and Optical phenomena would have appeared to be any less complicated to someone in 1500 than cancer does to us today. If E&M seems simple to you it’s because Physicists made it simple using techniques that didn’t involve any sort of randomization.

On the other hand, if Physicists had tried to use randomized trials to discover Maxwell’s Equations I’m pretty confident they never would have made it. So even if they could have learned a little something from those trials they would have stagnated in the long run and the phenomena would never appear “simple” to anyone.

For some reason, I can’t reply to the comment below.

You’re conflating how I used simple with the colloquial use of the term. “… meaning that they have enough control over the system to guarantee exchangeability”. So no, I’m not implying that those phenomena or the experiments to characterize them were trivial.

revo11: I apologize for the misunderstanding, but the confusion wasn’t from ignoring your reference. I did actually look at chapter two as suggested and it doesn’t appear to apply in the way you think. Even the concept of a repeated experiment or trials in Electrodynamics is seriously problematic (see for example “retarded potentials”). Any statements you make about the joint probability of any quantities in Electrodynamics have to be consistent with Maxwell’s Equations. You can’t just go around claiming things are independent like you can in other fields.

Moreover, it’s worth considering the idea of “enough control over the system”. When man first started studying either electricity, magnetism, or electromagnetic radiation our “control over the system” was virtually nil, limited to common household static electricity and some rare naturally occurring magnets (this is certainly far less than our ability to control living cells currently). We can now “control the system” so well because we came to understand the system. That understanding was gained through a scientific process that didn’t include any kind of randomization.

It’s precisely because we initially had almost no control over the phenomena and because it’s so inherently complicated that you could never have discovered Maxwell’s equations through randomized trials.

You might want to read the discussion at Respectful Insolence about this paper.

First off, you cannot check the authors’ results since, from the interview: “Neither group of researchers alleges fraud, nor would they identify the research they had tried to replicate.”

So you don’t know which studies were “replicated” nor how. As far as I can see, we really are seeing nothing more than some assertions.

Also Orac’s discussion makes the point that these studies are almost certainly cherry-picked for potentially high payoffs in drug development. High potential = high risk.

Good catch! If they did not give a list of the studies that they replicated, then that means that their own meta-study cannot be replicated as well.

I think that knowledge about the phenomena is the key. If I can argue that the phenomena will be similar “out of the sample” then I would generalize it. The difficulties are: arguing that it happens, and defining a theoretical framework that encompasses both “sample” and “out of the sample”.

I agree. And that’s why extrapolation is a rhetorical skill. But a whole lot of statistics (and science) is rhetoric. That surely doesn’t make it any less valuable.

One could spend a lifetime studying epistemology and probably never come up with an explanation that is good enough for this postdoc and others like him. He fails to see the leaps of faith that he has already made to arrive at his current world view, namely, his belief in the “truth” of causality and induction.

Science has a difficult (if not impossibly contradictory) needle to thread, in its belief that there are universal, certain, and necessary truths, but that these can be found through the contingent phenomena of experience.

The postdoc is not doing a good job threading that needle. Maybe predicting the 53rd building based on the 52 in the data set is bad science, but it is probably good decision-making. Science should be a tool (a useful one, but not the only one) to assist in good decision-making, and good decision-making should not be sacrificed to do good science.

(Adam said: “…there’s no scientific method for determining the generalizability of non-random (or imperfectly random) samples.”)

Precisely correct.

“Statistical Sampling” REQUIRES a random sample.

Non-random sampling produces “data” of factually ‘unknown’ applicability to the population sampled.

Correctly measuring a sample provides scientifically factual data/information ONLY about that sample… unless it is a true random sample of the designated population. Some factual data/information is better than none… and there are many valid methods to observe/research facts about the world; however, the method of ‘statistical-sampling’ can only be executed with random sampling. Non-random statistical-sampling is a fundamental contradiction… though it is very common (e.g., in Survey-Research); such practices are claimed to be scientific… with all manner of creative analysis of non-random samples — but no one openly refutes (or can refute) random sampling as the fundamental scientific basis of statistical-sampling.

True random sampling can often be very difficult or impossible… charades are a common substitute in the reports.

Right on, Breyer.

I know it’s tangential, but this reminds me of a famous computer hacker koan (http://en.wikipedia.org/wiki/Hacker_koan):

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-tac-toe”, Sussman replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky then shut his eyes.

“Why do you close your eyes?” Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened.

With regards to the problem at hand, is it possible to utilize additional information from the following survey: Commercial Buildings Energy Consumption Survey – http://www.eia.gov/consumption/commercial/

If we can utilize some external data source, even one that may not be ideal, perhaps we can learn what the known unknowns are as opposed to the unknown unknowns.

Adam said: “…there’s no scientific method for determining the generalizability of non-random (or imperfectly random) samples.”

No doubt about that. It is the theory.

I am the one who raised the question.

I was saying this is the theory, but I know that quite often different points of view come into play when statisticians deal with non-statisticians. So I am definitely welcoming all the comments here.

I would reframe the problem at hand by resorting to the concept of Design of the experiment.

Has it been performed? I am not sure about that.

Even when a non randomized experiment comes into play, it would be great to have evidence about why a given statistical unit is included in the experiment.

A sample selection procedure should be performed according to such a crucial step.

If the design reduces to: “that is what I have”, well one might want to know a little bit more.

All the more so, the design of experiment step not only deals with how to build up the sample but includes other steps. For example, has any sort of power analysis been conducted? What is the distribution of the data? Which statistical procedures am I going to use in order to draw meaningful conclusions?

For example, mean comparison should be performed only when some distributional assumptions are met.

Another interesting point is the following: if performed, a design of experiment procedure might have suggested a Monte Carlo experiment.

It is not unreasonable to perform such a simulation process according to a set of data generating processes (i.e. REG – SARIMAX / Double Exponential Smoothing), shocked randomly according to real life situations (Level Shift, Trading Day, Holidays).

Has this procedure been considered? I am not sure about that.

Has anyone ever tried something like the following massive simulation experiment? Generate data sets from a wide range of distributions. Select samples from these data sets using sampling methods that depend on the data distributions in a variety of ways. Fit a variety of models to the samples. See how well predictions from the models hold up out of sample. Looking at various distributions of the combination of data distribution and true sampling distribution (impossible to know of course), this experiment might offer some insight into how frequently predictions can be extended out of sample.
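A very small version of such an experiment might look like the sketch below. Everything in it is an assumption chosen for illustration: one predictor, two possible true relationships (`linear` and `curved`), a straight-line model fitted in both cases, and a selection rule whose probability falls off with the predictor (`selection_strength` of 0 gives a simple random sample).

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def out_of_sample_rmse(truth, selection_strength, n_pop=20000, seed=0):
    """Fit a line to a selected sample, then score it on the unselected cases."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n_pop)]
    ys = [truth(x) + rng.gauss(0, 0.1) for x in xs]
    # Cases with small x are much more likely to be selected when strength > 0.
    selected = [rng.random() < 0.5 * math.exp(-selection_strength * x) for x in xs]
    intercept, slope = fit_line([x for x, s in zip(xs, selected) if s],
                                [y for y, s in zip(ys, selected) if s])
    sq_errors = [(y - (intercept + slope * x)) ** 2
                 for x, y, s in zip(xs, ys, selected) if not s]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

def linear(x):
    return 1 + 2 * x

def curved(x):
    return 1 + 2 * x * x

results = {(name, k): out_of_sample_rmse(f, k)
           for name, f in (("linear", linear), ("curved", curved))
           for k in (0, 10)}
for key, value in results.items():
    print(key, round(value, 3))
```

In this toy setup, selection that depends only on the predictor does little harm when the fitted model has the right form, and considerable harm when it does not. A fuller experiment along the lines proposed would vary the distributions, the selection mechanisms, and the models far more widely.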

My sense is that these sorts of simulation studies are pretty common. I did something similar (on a smaller scale) for my Masters project, and I continue to do similar studies as a “sanity check” to reassure myself that my analysis of a problem is going to achieve what I expect it to. Of course, purely mathematical approaches remain the gold standard.

Very interesting discussion; I may be too late to be read but…

I think that this boils down to general problems with mathematical modelling. Reality is always essentially different from mathematical models. There is no way out of that. The perfect situation the postdoc is dreaming of does not exist.

We do some analyses that are based on some assumptions all of which are usually violated in one way or another. We cannot formalise or model perfectly what the implications of this are. No way. Regardless of whether this is about random sampling, the Gaussian distribution, independence/exchangeability or whatever.

Now this doesn’t mean that we don’t learn anything. The tricky thing is to assess how much the violation of the assumptions will harm the analysis. Much can be done about this including all the stuff that has been suggested before (but it has to be kept in mind that everything formal that is done in order to improve matters comes with its own implicit assumptions that will be violated).

At the end of the day there is a decision to be made about whether what one has, taking into account common sense and all kinds of related considerations, is good enough to draw some conclusions from it. Of course one would use all kinds of background knowledge for this. What do we know about how precisely the convenience sample may differ from a proper random sample here? How bad do we think this is?

I think that much confusion is created by the usual way of teaching statistics stating that there are model assumptions that have to be fulfilled. No! There are model assumptions. They are all violated, and we need to decide whether the result is still useful (sometimes we may be wrong with this decision, hopefully we are right more often).

I would agree with that, but it’s very hard to do that in a formal course.

Especially folks that are mathematicians at heart are uncomfortable speaking and writing that way. For instance, David Cox after talking like that in seminars would add – but I would not want to put that in writing (I think he has softened on this recently).

Most of us learn this outside formal courses, likely on blogs like this.

Until recently – Box’s quote “All models are wrong, but some are useful” seemed surprising and notable to most.

And the last time (2010) I proposed “less wrong” terminology to “best” in a SAMSI working group document with statistical researchers (mostly Bayesian even) the response was “we would prefer to use optimal”. So we did.

Interesting stuff – but isn’t this basically a special case of the problem of induction?

Perhaps I am late for this post.

I have to disagree with the statement “there’s no scientific method for determining the generalizability of non-random (or imperfectly random) samples.” We DO have some principled ways to handle selection bias under a reasonable set of assumptions (‘no free lunch theorem’).

(There’re other problems of generalizability besides selection bias when talking about causal effects, but I will confine my note for the selection problem and perhaps can discuss other problems later.)

Let me copy part of my comment on Terence Tao’s note on selection bias. I think it’s useful because there’s some confusion between selection bias and confounding bias (from https://plus.google.com/114134834346472219368/posts/eWaetoVYivS).

In a typical clinical setup, selection bias is characterized by the treatment (X) or the outcome (Y) affecting the inclusion of subjects in the sample (indexed, for instance, by S); we have preferential exclusion of samples of the population (as Terence Tao showed).

The dataset that we obtain is always conditioned on S=1, and we somehow would like to make a claim about the whole population (unconditional on S). I.e., we infer from data the distribution P(X, Y | S=1), but what we really want is to make a claim about P(X, Y). This is a generalization of the well-known Berkson’s paradox (from the 1940s); it is also known as sample selection bias in economics and, since the 1980s, as explaining away in AI.
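A small simulation shows the gap between P(X, Y | S=1) and P(X, Y). The setup is assumed purely for illustration: X and Y are independent standard normals, and a unit is selected (S=1) only when X + Y exceeds a threshold; the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def berkson_demo(n=20000, threshold=1.0, seed=0):
    """Return (correlation in the population, correlation given S=1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of xs

    # Selection depends on both variables: S = 1 when x + y > threshold.
    selected = [(x, y) for x, y in zip(xs, ys) if x + y > threshold]
    sel_x = [x for x, _ in selected]
    sel_y = [y for _, y in selected]

    return pearson(xs, ys), pearson(sel_x, sel_y)


if __name__ == "__main__":
    corr_all, corr_selected = berkson_demo()
    print("correlation in the population:", round(corr_all, 3))
    print("correlation given S=1:", round(corr_selected, 3))
```

In the unselected data the two variables are unrelated by construction, while among the selected units a clear negative association appears. That association is created entirely by conditioning on S.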

On the other hand, the confounding bias is characterized by treatment (X) and outcome (Y) being affected by common omitted variables U, and we have uncontrolled mixing of distinct populations.

The dataset that we obtain is not conditioned on U (just marginalized over). There are some solutions for this problem, and one of them is to run a clinical trial, which will balance the populations, washing away the effect of U. The other alternative is to control for U, but this implies that we have strong qualitative assumptions about U such that U can remove this bias (part of the literature calls this mechanism “ignorable”).

The technique used to remove this bias is somehow the opposite of the one used to remove selection bias: now we have to condition on U, applying a technique called standardization (e.g. sum_u P(Y | X, U) P(U)). This technique has also long been known in the literature; I guess it can be traced back to Economics or Epidemiology in the 1940s-50s, but I will need to double check that.
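As a sketch of what standardization does, the simulation below uses an assumed binary confounder U with made-up probabilities, chosen so that the true effect of X on Y is 0.2 by construction. The function names are hypothetical, and the example assumes U is fully observed, which is exactly the strong assumption mentioned above.

```python
import random


def simulate(n=100000, seed=0):
    """Generate (u, x, y) triples in which U affects both X and Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = int(rng.random() < 0.5)
        x = int(rng.random() < (0.8 if u else 0.2))
        # True effect of X on P(Y=1) is 0.2, by construction.
        y = int(rng.random() < 0.1 + 0.2 * x + 0.5 * u)
        rows.append((u, x, y))
    return rows


def mean_y(rows, x, u=None):
    """Estimate P(Y=1 | X=x) or, if u is given, P(Y=1 | X=x, U=u)."""
    ys = [r[2] for r in rows if r[1] == x and (u is None or r[0] == u)]
    return sum(ys) / len(ys)


def naive_difference(rows):
    """P(Y=1 | X=1) - P(Y=1 | X=0), ignoring U."""
    return mean_y(rows, 1) - mean_y(rows, 0)


def standardized_difference(rows):
    """sum_u [P(Y=1 | X=1, u) - P(Y=1 | X=0, u)] * P(u)."""
    total = 0.0
    for u in (0, 1):
        p_u = sum(1 for r in rows if r[0] == u) / len(rows)
        total += p_u * (mean_y(rows, 1, u) - mean_y(rows, 0, u))
    return total


if __name__ == "__main__":
    data = simulate()
    print("naive difference:", round(naive_difference(data), 3))
    print("standardized difference:", round(standardized_difference(data), 3))
```

The naive contrast mixes the effect of X with the effect of U, while the standardized contrast averages the within-stratum differences over the marginal distribution of U.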

(There’s a whole literature on how to perform this adjustment which is not the only option to remove confounding bias (meaning, propensity score is not “complete” in the sense of removing this bias). For instance, there are some really cool cases in which there is no U such that we can remove the bias by the adjustment mentioned above, but there are other variables that will yield unbiasedness (some cases related to perfect mediation), which is quite unexpected.)

Note that it is possible to have both kinds of bias simultaneously. We study this scenario in a recent paper and explain how they can be systematically removed (or minimized) given a certain set of qualitative assumptions about nature (the data-generating model): http://ftp.cs.ucla.edu/pub/stat_ser/r381.pdf

Well, back to the more practical problem of generalizing the results from the data at hand: what are the preconditions that need to be met to separate the responses, or differences between the responses, from the sample? Isn’t this similar to the problem of (Tukey & Luce’s) conjoint measurement? In that case, I need to demonstrate that the item effect is separable from the person effect. A first step would be to show that the difference or ratio of interest (the item effect) is stable across all the subgroups (the person effect) in the data.

Generalizability is a property of theories, not statistical procedures.

1. State a general theory, test it in any sample, and if not rejected, predict out of sample.

2. Generalizability on the basis of random samples from a population is nonsense. What are the sampling frames for all possible treatment operationalizations, outcome definitions, units, and experimental settings?

Generalizability is a property of theories, but you don’t know whether a theory is generalizable until you try it out. If you try to test it with a nonrandom sample, you are more likely to conclude it is generalizable when in fact it isn’t. In the language of your point #1:

1. State a false theory, test it in a “sample” which is similar to the cases that led you to state your theory (and this can easily happen without any deliberate wrongdoing on your part), surprise! your theory is not rejected, generate bogus predictions, publish incorrect paper, delay the progress of science.

@anonymous

You deliberately misinterpret me; I assume one tests out of sample. So let me misinterpret you:

1. State a false theory. Since we do not know the population of treatments operationalizations, outcome definitions, units, and experimental settings, test it on a random sample from the limited “population” we know. Namely, the population upon which we based our theory. Surprise “your theory is not rejected, generate bogus predictions, publish incorrect paper, delay the progress of science.”

We think something called gravity causes objects to fall at a constant acceleration. We have not tested this in the (infinite) population of days (including those long into the infinite? future). Because we cannot randomly sample from the population of future days, we cannot make any predictions about the effects of gravity tomorrow.

Umh… I’d rather say gravity is a universal timeless law, and only reject that theory if the non-random evidence accumulating with the passage of time is inconsistent with it.

I was re-reading the post and we should make some distinction among the types of generalization that we wish to make.

I agree that the nature of the problem is a type of induction (generalizability), but perhaps we should partition it as internal and external validity. We DO have methodology for both classes of problems.

(Notwithstanding, the existence of some principled way to conduct our analysis does not imply that everything is always possible. For instance, in the external validity case, it might be impossible to get an unbiased estimate of a target relation if the source and target populations are completely unrelated. The natural research question is to understand the conditions under which we can perform such generalization.)

The problem of internal validity assumes the existence of a population, say Pi, and that one is interested in making inference about Pi. We might have diverse threats to internal validity; for instance, confounding bias and sampling bias (which I prefer to call selection bias) are the most common issues that can spoil the analysis. But again, they are all related to the population Pi. It seems that the vast majority of the theory that we have so far is related to this class of problems.

Sometimes, we conduct a perfect randomized trial in Pi, but what we really want is to perform inference about a different population Pi* (which is somehow related to Pi, but not the same; relatedness has to be defined formally). This problem is known as ‘external validity’. Recently, we came up with a principled mathematical (not ad hoc) way to perform this type of generalization, see http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf .

It is true that the set of qualitative assumptions encoded in the model of the underlying phenomenon implies the generalizability (or not) of the target claims (which are data-independent); still, some models have statistical implications that can be tested with the data at hand. (I recommend Pearl’s book, which discusses this paradigm and these implications extensively.)

“Recently, we came up with a principled mathematical (not ad hoc) way to perform this type of generalization”

Isn’t any algebraic system “ad hoc”? Perhaps you meant: “.. with a precise, logically consistent way to perform…”.

Mathematics is good for being precise and logically consistent, but it is not the only way. Some computer algorithms work too in this regard, without mathematics. Is logic of necessity mathematical?

I thought that was the power of DAGs: the ability to propose a new “ad hoc” language with certain logical properties that is both different from mathematics and more in tune with the way we think about causality.

But perhaps DAGs need mathematics to prove their own consistency. Put differently, can all major propositions of DAGs (e.g. the back-door criterion, etc.) be proven using DAGs? Is that self-referential? Are DAGs just a GUI to the math of causal inference?

PS I should add that DAGs, the language of counterfactuals, and causal probability statements are all equivalent, according to Pearl. The question is whether any of them is logically prior.

Elias:

See my comment here.

No, Fernando, “not ad hoc” means more than just “logically consistent”. It means a way of doing things that is accompanied with certain *guarantees*.

The method we have developed for generalization tells us that, if we follow our routine, the results that we obtain (about the target population) logically follow from the assumptions that we made (about what the two populations share in common), and no method can derive those results with weaker assumptions.

This is more than being consistent, and I do not know of any non-mathematical way of achieving such guarantees. The machine-learning algorithms that you mentioned may or may not come with performance guarantees; if they do, they are not ad-hoc.

Similarly, DAGs are not ad-hoc, because they give us guarantees. For example, if the assumptions encoded in the DAGs are correct, then the do(x) expression we estimate from the DAG will coincide with the causal effect as measured in randomized trials. This is a powerful guarantee.

As to logical priority, DAGs, potential outcomes, and Structural Equation Models are all logically equivalent. The question, however, is not whether any of them is logically prior, but which of them permits an investigator to scrutinize his/her own assumptions with greater clarity and transparency. This is a psychological issue, and the only way to answer it is to try encoding a given problem in these three languages and decide. So far, only the DAG people have tried all three (see Pearl’s book); other people are too scared to face the conclusion. As the simplest example that I can mention for this case, consider the front-door criterion, in which conditional ignorability does not hold but we do have a way to estimate the causal effects purely from observations.
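For readers who have not seen it, the front-door adjustment can be written as below. The symbol Z is an assumption introduced here for illustration: an observed mediator that intercepts every directed path from X to Y, with no unblocked back-door path from X to Z and with every back-door path from Z to Y blocked by X.

```latex
P\bigl(y \mid do(x)\bigr)
  = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```

Every term on the right-hand side is an ordinary observational quantity, so the causal effect is estimable from observations even when an unmeasured variable affects both X and Y.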

Elias,

You are talking to a convert. I like DAGs, that is why I jumped at the mathematical aspect.

If the whole point of DAGs is precisely to propose a more user-friendly language, and if the logical structure of DAGs is equivalent to math and just as powerful, why not state “we came up with a principled DAG approach to external validity”?

Such a statement would appear to me to be more consistent with Pearl’s program. Having to appeal to mathematical language in order to prop up the graphical one strikes me as a little dissonant.
