In comments here, Alexis writes,

Your post prompts me to ask you something i’ve been wondering about ever since i began learning about NON-regression-based approaches to causal inference: namely, why do virtually all statistically-oriented political scientists think that regression-based/MLE methods are giving them the correct answers in observational settings? after all, we have long known (since at least the Rubin/Cochran papers of the 1970s) that regression is often (and quite possibly *generally*) unreliable in observational settings.

Do we have a single example of a non-trivial observational dataset wherein we can show that regression analysis produces the result that would have been obtained in a randomized experiment? We have lots of examples showing that regression fails this test (here i’m thinking of dehejia/wahba/lalonde, etc.). where is the definitive empirical success story? there should be many success stories, given the universality of the methodology– but i don’t know of a single one.

In your blog, you write:

“(Parochially, I can point to this [link to gelman paper] and this [link to gelman paper] as particularly clean examples of causal inference from observational data, but lots more is out there.)”

I do not doubt that your linked papers (which i have not read) are excellent and rigorous examples of applied regression analysis. but my question is, how is it that these papers validate a regression-based approach to causal inference? what do you know of the “correct answer” in these cases, aside from your regression-based estimates?

My response:

1. Matching and regression are different methods but ultimately are doing the same thing, which is to compare outcomes on cases that differ in the treatment variable while being as similar as possible on whatever pre-treatment covariates are around. Regression relies on (and takes advantage of) linearity in the response surface, whereas matching is (potentially) nonparametric.
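To make point 1 concrete, here is a small simulated sketch (Python/NumPy; the data and numbers are invented, not from any of the papers discussed): with a single confounder and a genuinely linear response surface, regression adjustment and one-to-one nearest-neighbor matching recover essentially the same effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)                          # one pre-treatment covariate
d = rng.binomial(1, 1 / (1 + np.exp(-x)))       # treatment more likely at high x
y = 1.0 + 2.0 * d + 1.5 * x + rng.normal(0, 0.5, n)  # true effect = 2

# Regression adjustment: coefficient on d in OLS of y on (1, d, x).
X = np.column_stack([np.ones(n), d, x])
tau_reg = np.linalg.lstsq(X, y, rcond=None)[0][1]

# One-to-one nearest-neighbor matching on x, with replacement:
treated = np.flatnonzero(d == 1)
controls = np.flatnonzero(d == 0)
nearest = controls[np.abs(x[treated][:, None] - x[controls][None, :]).argmin(axis=1)]
tau_match = (y[treated] - y[nearest]).mean()

print(tau_reg, tau_match)  # both land near the true effect of 2
```

With a nonlinear response surface or poor overlap the two estimates would diverge, which is exactly the contrast drawn in points 1 and 2.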

2. Matching is particularly relevant when the treatment and control groups do not completely overlap–matching methods will exclude the points outside the overlap region, with the understanding that the causal inference applies only to this region of overlap. Rubin’s thesis discussed why matching is improved if followed by regression.

3. In each of the examples I’ve worked on (most notably, estimating the effect of incumbency and the effect of redistricting), there was essentially complete overlap of treatment and control groups.

4. In their usual formulations, matching and regression both assume ignorability and thus ignore biases due to selection based on unmeasured variables.

5. In my particular examples, I don’t have external validation; however, the results make sense when looked at from various directions (see, for example, our comparison of various estimates in our 1990 AJPS paper).

P.S. See some lively discussion in the comments.

Hi Andrew,

Since you refer to the incumbency study as a "particularly clean example of causal inference from observational data," I would be interested in a follow-up question.

What is your standard for judging that your incumbency study fits this category of a causal inference that is "particularly clean"?

Is it your point 5 that "the results make sense"?

That does not seem a very attractive criterion; a lot of estimates may make sense and yet be wrong.

Is it the plausibility of the assumptions imposed by your model? Then two questions follow:

1) How do you know the functional form is linear?

I remain unconvinced unless I see evidence that matching would give a similar answer. Regression adjustment after matching is not the same as drawing inferences from a regression fitted in an unmatched dataset. Once you have done the matching, regression is only used to purge remaining bias due to matching discrepancies; model dependence will be greatly reduced, and if you have exact matches you won't need any model at all.

2) how do you know your results are robust to hidden bias? Is it reasonable to assume that, conditional on two or three covariates, incumbents randomly choose whether to re-run or not? Presumably, they have private information at their disposal that is not reflected in your limited set of covariates, so there is selection on unobservables.

Sensitivity tests would allow you to check for this, but I don't see any in your incumbency paper, so how can I judge how much of this incumbency effect is actually just selection?

All in all, I don't quite see why this incumbency study would pass as a "particularly clean example of causal inference from observational data." Essentially all I see is a simple linear regression with three covariates and no effort whatsoever to tackle selection. This is highly problematic, because the main reason we have a hard time drawing causal inferences in observational settings is that the assignment mechanism is confounded by selection effects. The assumptions that make your simple regression model qualify as a "causal" inference are heroic. The exact same assumptions are invoked when fitting a linear regression to the infamous dehejia/wahba/lalonde example, and the answer is dead wrong. So would you also call that a "particularly clean example of causal inference from observational data"?

Thanks for your reply.

Best,

Tom

hi professor gelman,

thank you very much for replying to my comment about problems using regression for causal inference in observational studies.

your post is largely focussed on comparing and contrasting matching vs. regression-based methods. for the record, my original comment made no mention of matching, and i would characterize the comparison very differently. my original comment was limited to asking for evidence that regression can give us the right answer in real-world non-experimental settings.

you wrote that your regression results "make sense when looked at from various directions." do your results have this property because you ran lots of regressions to make them look this way? of course.

as we all know, published regression-based causal estimates are carefully crafted to *look* correct, but there is no hard evidence to demonstrate that the method can be reliable in the absence of randomized selection-to-treatment. in fact, as you noted in your prior blog post on Berk's new book, all the evidence out there suggests quite the opposite. until there is a single published paper that rigorously demonstrates that regression can be reliable in non-experimental social science, i think social scientists should stop running OLS/MLE and try something else.

there are many alternative approaches to causal inference. i prefer methods consistent with Rubin's (2001) call for honesty in research design, which dictates that observational studies should be designed the way experiments are designed: blind to outcomes.* this type of research protocol has become common in experimental physics, where it is called blind analysis (see P. F. Harrison 2002, J. Phys. G: Nucl. Part. Phys. 28, 2679-2691).

As you wrote: "…in each of the examples I've worked on (most notably, estimating the effect of incumbency and the effect of redistricting), there was essentially complete overlap of treatment and control groups." that's surprising; essentially complete overlap is very unusual in nonexperimental settings– it's even unusual in randomized experiments when there are multiple continuous covariates.

*Health Services & Outcomes Research Methodology 2: 169-188. This article explains that Rubin does not favor regression methods, except in special limited circumstances (such as after matching has already created a high degree of balance and linearity seems a reasonable assumption).

As best I can tell, Alexis is asking for examples in which:

1) A randomized experiment was performed. We, therefore, "know" what the true causal effect is.

2) A regression approach is then used to analyze the data, pretending that the data are purely observational.

3) The causal effect from 1) is the same as that from 2).

Now, there is a sense in which this is trivial. The math is such that, if you have a simple treatment with two possible values, the difference in means (the typical approach with experimental data) will be the same as the regression coefficient on a 0/1 dummy. You can use "regression" to get the same answer for any experimental setup.
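The triviality can be shown in a few lines (a minimal Python/NumPy sketch with made-up data): with a binary treatment and no covariates, the OLS coefficient on a 0/1 dummy is the difference in means, exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.binomial(1, 0.5, 200)          # randomized 0/1 treatment
y = 2.0 * d + rng.normal(size=200)     # outcome with true effect 2

# The "experimental" analysis: difference in group means.
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# The "regression" analysis: OLS of y on an intercept and the dummy.
X = np.column_stack([np.ones_like(d), d])
ols_coef = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(diff_in_means, ols_coef)  # identical to machine precision
```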

To make such a test interesting, you need some other tricky stuff going on: something that "messes up" the regression because it adjusts for a variable it ought, somehow, to ignore. With Lalonde and following, the trickiness lies in determining the relevant population (I think). If you know the subset that actually was randomized, the regression answer is the correct one.

It sure would be nice to know of other papers that do a compare and contrast in the same spirit as Lalonde. Or are we doomed to work with that data forever?

Tom and Alexis,

Thanks for the comments. We actually discuss in our 1990 paper that there is essentially complete overlap between treatment and control groups: perhaps surprisingly, the probability of an open seat in our dataset was about 10% independent of the parties' vote share in the previous election. So no matching was needed. Of course, I have no problem if anyone wants to replicate our paper, doing some matching first. (Hey, that's a great methods paper for some Ph.D. student!) It's possible this will smooth out some of the year-to-year fluctuation but I don't think it will change the main pattern.

(We also saw essentially complete overlap for redistrictings and nonredistrictings in our 1994 paper–this is perhaps less surprising, since redistricting is basically done every 10 years on a fixed schedule and is thus an externally-imposed treatment.)

Regarding other points: I see no evidence for nonlinearity and no reason to suspect it either. There were some problems with our model, however, notably that the before/after slope tends to be greater for control cases (incumbents) than treated (open seats). I discuss this, and some other before/after cases, in my paper in the 1994 Gelman/Meng volume, and Zaiying Huang and I refit the model allowing this variation in our 2006 (to appear) paper on estimating the incumbency advantage with variation.

Thus: I don't believe that least squares is the best solution to this problem, merely that it is a good and clean (if imperfect) solution in this case.

To continue . . . another confirmation of our results is that the regression estimate is consistent with the sophomore-surge/retirement-slump estimate (after correcting for selection bias), as we discuss in our paper. This cross-check is pretty important, I think, and it's probably one reason for the success of our paper. We did not just propose a new estimate, we also discussed and estimated the biases of old estimates.

And yes, selection is an issue, which we also discussed in our paper–we didn't resolve it, but we gave good reasons for bounding it. Ansolabehere and Snyder, I think, looked at this by comparing open seats arising from retirement, loss in the primary election, and death.

To step back a bit, it's fine to use various methods, and skepticism is great. Meanwhile, you have people using simple calculations like sophomore surge or the frequency of reelection and treating these as causal estimates. I think what we did in 1990 is better than that, I think what we did in 2006 is better still, and I'm sure more can be done.

For what it's worth, when Gary and I started our project in 1988, we had no desire to estimate the incumbency advantage–we just needed a correction for incumbency to use in our method for estimating seats-votes curves. We looked at the literature on incumbency and saw we could make a contribution. In fact, we had no desire to use regression, either. We started with the Rubin causal framework, carefully defined our estimand, and ended up coming around to regression as a good method to solve our problem.

If the question is: is there an alternative explanation for a regression result on observational data that undermines causality, I think one would have to admit that the answer is always "Yes." But so what? As Andrew points out (implicitly), the test of the causal explanation resides in the plausibility of the alternative explanations and their implications, not in the regression itself. Randomized experiments can be subject to the same problems… random assignment which isn't blinded can compromise causality, for example. At the (admittedly trivial) limit, imagine that your results were caused by an invisible demon whose whims happened to correspond with a regression explanation of the data (with error). Your causal explanation was wrong — it was the demon. But we rightly reject that explanation not because it may not be right, but because it's not helpful. Note that randomization doesn't exorcise the demon explanation, either. All models are forms of rhetoric. You can't prove causality any more than you can make someone believe something.

greetings,

(1) to david kane: there is a literature dedicated to LaLonde-esque experiments across various datasets. Imbens, Hotz, and Mortimer wrote a paper that may be of interest. Jas Sekhon and I wrote a paper that does Monte Carlo simulations and also looks at the LaLonde/DW datasets. Presumably Berk's book mentions other such papers. And then there is Chapter 5 of Howard Bloom's book, "Learning More From Social Experiments" which also goes through the literature.

(2) hi jonathan,

(a) my blog comment relates to methods for obtaining reliable causal estimates, not for obtaining "causal explanations" per se, by which i think you mean something thicker, akin to parsing out the causes. to wit, Rubin's Causal Framework was designed to facilitate estimation of causal effects, not to identify the causes of effects. this distinction may sound tricky or empty, but it's not: as Rubin has noted, it is unclear how to coherently pose the latter type of causal question, let alone address it with OLS.

(b) you write that "the test of the causal explanation resides in the plausibility of the alternative explanations and their implications, not in the regression itself." seems to me that results generated via regression-based causal inference are always vulnerable to the alternative explanation that the matrix algebra of least squares is producing a regression coefficient that does not answer the substantive causal question because the appropriate assumptions do not obtain. this alternative explanation is extremely plausible (given that regression has gotten it wrong 99.5% of the time it has been rigorously tested using external validation–see the evidence in the literature referred to in my response to David Kane) unless one is willing to grant that the functional form has been adequately approximated *or* that assignment to treatment was essentially random (independent of potential outcomes).

[99.5% and not 100% because there *is* one paper by Heckman and Hotz (1988) that uses real experimental data to argue that certain models producing poor estimates could be eliminated using principled diagnostic tests. but this paper is hardly justification for the catholicity of regression-based causal inference in observational settings in social science.]

Joshua Angrist highlighted another difference between (OLS) regression (without treatment-covariate interactions) and matching/subclassification/weighting in pp. 255-257 of his 1998 Econometrica paper, "Estimating the labor market impact of voluntary military service using Social Security data on military applicants":

"The essential difference between regression and matching in evaluation research is the weighting scheme used to pool estimates at different values of the covariates."

His equations leave out some steps and I think there are typos, but informally, suppose you use OLS to regress an outcome on treatment and a full set of subgroup dummies, with no treatment x subgroup interactions. OLS will estimate a weighted average of the subgroup treatment effects with weights chosen to minimize variance.

In Angrist's example, matching (or subclassification) estimates the average effect of treatment on the treated, which may be a different weighted average.
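Angrist's weighting point is easy to verify numerically. Below is a hypothetical sketch (Python/NumPy; the subgroup sizes, treatment probabilities, and effects are invented, not taken from his paper). With a full set of subgroup dummies and no treatment-by-subgroup interactions, the OLS coefficient on treatment equals a weighted average of the within-subgroup effect estimates with weights proportional to n_g * p_g * (1 - p_g), whereas the effect-on-the-treated average weights each subgroup by its number of treated units, n_g * p_g.

```python
import numpy as np

rng = np.random.default_rng(0)

n = np.array([1000, 1000])     # subgroup sizes
p = np.array([0.1, 0.5])       # treatment probability within each subgroup
tau = np.array([1.0, 3.0])     # true treatment effect within each subgroup

g_list, d_list, y_list = [], [], []
for g in range(2):
    d = rng.binomial(1, p[g], n[g])
    y = 0.5 * g + tau[g] * d + rng.normal(0, 0.1, n[g])
    g_list.append(np.full(n[g], g)); d_list.append(d); y_list.append(y)
groups, D, Y = map(np.concatenate, (g_list, d_list, y_list))

# OLS of Y on treatment plus a full set of subgroup dummies (no interactions).
X = np.column_stack([D, groups == 0, groups == 1]).astype(float)
beta_d = np.linalg.lstsq(X, Y, rcond=None)[0][0]

# Within-subgroup difference-in-means estimates and empirical treated shares.
p_hat = np.array([D[groups == g].mean() for g in range(2)])
tau_g = np.array([Y[(groups == g) & (D == 1)].mean()
                  - Y[(groups == g) & (D == 0)].mean() for g in range(2)])

# OLS implicitly weights by n_g * p_g * (1 - p_g) (variance-minimizing);
# the effect-on-the-treated average weights by n_g * p_g instead.
w_ols = n * p_hat * (1 - p_hat)
w_att = n * p_hat
tau_ols = (w_ols * tau_g).sum() / w_ols.sum()
tau_att = (w_att * tau_g).sum() / w_att.sum()

print(beta_d, tau_ols, tau_att)  # beta_d matches tau_ols; tau_att differs
```

Both are legitimate averages of the same subgroup effects; they answer different questions when effects are heterogeneous, which is Angrist's point.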

Andrew, I wonder if this might help explain the difference between the poststratification/weighting and regression estimates in your paper "Struggles with survey weighting and regression modeling". (We've also been trying to adjust for differences in "X" between survey waves, using inverse propensity score weights so far.)

Alexis: I'm sure we won't agree when I'm done with this, but a colleague of mine insists I answer you.

(1) My comments apply equally well to the estimation of causal effects and the identification of the causes of effects. My point for either one is that you compare what you get from the regression (and what the regression implies in terms of its assumptions) versus other explanations for the phenomenon under study. Since the space of alternative explanations is bounded only by human imagination, there will almost always be a better explanation for the data (or, if you will, a better estimation of causal effects) than the regression gives you. I repeat, however: so what? I have never regarded the point of regression as giving the "correct" estimate of causal effects, because it is a model, which by definition is incorrect. Only reality is correct, and reality can't be modeled without error.

(2a) Your second point is that virtually every article which tests whether regression is correct shows it to be wrong. First, the irony of your position is delicious. Published articles do not represent a random sample of all instances of causal estimation — they are purposely chosen to make their point, which makes sample selection and Rubinesque bias a huge problem. So your prior that 99.5% of all regressions are wrong is itself subject to your exact critique. You got your estimate through a naive "regression" of "correct" on the dummy variable "regression". You got a coefficient of .995 and convinced yourself that result was right!

(2b) But more seriously, my point is that even if all regressions have a 95 percent chance of being wrong (I maintain above it is 100 percent, but never mind) you still don't have anything unless you show two things: first, that the reason the regression is "wrong" isn't just a failure to properly create explanatory variables with appropriate nonlinearities where necessary or selection terms as appropriate or whatever adjustments you need; and second, that the likelihood ratio of some other method which isn't regression is higher. I'll take a 95 percent chance of regression error over a 99.9999999 percent chance of astrology error any day.

(3) I highly commend Abelson's "Statistics as Principled Argument" which makes these points, especially Chapter 9.