
Is it worth even trying to do causal inference from observational studies?

In response to my entry on whether propensity score analysis could fix the Harvard Nurses study, Joseph Delaney wrote:

I am unsure about how propensity scores give any advantage over a thoughtfully constructed regression model. . . . I’m not saying that better statistical models shouldn’t be used but I worry about overstating the benefits of propensity score analysis. It’s an extremely good technique, no question about it, and I’ve published on one of its variations. But I want to be very sure that we don’t miss issues of study design and bias in the process.

I agree completely. But I’d focus on that “thoughtfully constructed” part of the regression model. As we’ve discussed, even some of the most thoughtful researchers don’t talk much at all about construction of the model when they write regression textbooks.

So I think it might be too much to expect working statisticians (those who might be employed by a long-running public health study, for example) to necessarily be using “a thoughtfully constructed regression model.” Maybe all we can hope is that they use standard methods and document them well.

From this perspective, propensity scores have the advantage that in their standard implementation they allow a researcher to include dozens of background variables, which is not generally done in classical regression. As I noted in my original entry, there are other methods out there that also can handle large numbers of inputs; it doesn’t necessarily have to be propensity scores.

The real issue is whether a method can allow a competent user to include the relevant information. This was the point of the famous Dehejia and Wahba paper on adjustment for observational studies.

Delaney also writes:

Issues of self-selection seriously limit all observational epidemiology. The issue is serious enough that I often wonder if we should not use observational studies to estimate medication benefits (at all). It’s just too misleading.

Sure, but we do have to make decisions in life, and what do you do in those settings where no randomized trial exists, or where you don’t trust a generalization of the results to the general population? Almost always we need some assumptions or other.


  1. John says:

    I agree with your point "we do have to make decisions in life" even when we don't have results from randomized trials. You reminded me of the spoof from Gordon Smith and Jill Pell reporting that there have been no randomized trials to prove that parachute use improves your chances of survival when jumping from an airplane.

  2. Kaiser says:

    I like PS because it forces us to look at the covariate balance directly. It allows us to quantify any perceived bias.
    I agree with Andrew that there are many real-life situations where an experiment is either not feasible or not available, and still we need some answers.
    I'm not getting what Joseph means by well-constructed regression models. How can creative regressions help overcome bias in observational studies? Are there examples of this work?
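
A minimal sketch of the balance check Kaiser describes, in Python with only the standard library. Everything in it is an illustrative assumption rather than anything from the studies under discussion: the simulated data, the coefficients 0.8 and -0.5, the hand-rolled `fit_logistic` helper, and the choice of five strata. It fits a logistic propensity model, then compares the standardized mean difference of one covariate before and after subclassifying on the estimated score.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: two standardized covariates drive treatment uptake.
n = 2000
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
T = [1 if random.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0 for x1, x2 in X]

def fit_logistic(X, T, steps=300, lr=0.5):
    """Plain gradient ascent on the mean log-likelihood; returns [b0, b1, b2]."""
    b = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0, 0.0]
        for (x1, x2), t in zip(X, T):
            r = t - sigmoid(b[0] + b[1] * x1 + b[2] * x2)
            g[0] += r
            g[1] += r * x1
            g[2] += r * x2
        b = [bj + lr * gj / len(T) for bj, gj in zip(b, g)]
    return b

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def smd(values, treat, sd):
    """Treated-minus-control mean difference, scaled by a fixed SD."""
    t = [v for v, z in zip(values, treat) if z == 1]
    c = [v for v, z in zip(values, treat) if z == 0]
    return (mean(t) - mean(c)) / sd

# Estimated propensity score for every unit.
b = fit_logistic(X, T)
ps = [sigmoid(b[0] + b[1] * x1 + b[2] * x2) for x1, x2 in X]

# Balance on the first covariate in the raw sample.
x1_all = [x[0] for x in X]
var_t = var([v for v, z in zip(x1_all, T) if z == 1])
var_c = var([v for v, z in zip(x1_all, T) if z == 0])
sd_pooled = math.sqrt((var_t + var_c) / 2)
smd_raw = smd(x1_all, T, sd_pooled)

# Balance after subclassifying on propensity-score quintiles.
order = sorted(range(n), key=lambda i: ps[i])
k = 5
within = []
for s in range(k):
    idx = order[s * n // k:(s + 1) * n // k]
    tr = [T[i] for i in idx]
    if 0 < sum(tr) < len(tr):  # skip a stratum lacking either group
        within.append(smd([x1_all[i] for i in idx], tr, sd_pooled))
smd_strat = mean(within)

print(f"SMD before: {smd_raw:.3f}, average SMD within strata: {smd_strat:.3f}")
```

In this toy setup every confounder is measured, so stratifying removes most of the imbalance. It says nothing about covariates that were never recorded.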

  3. Joseph Delaney says:

    In terms of "well constructed regression models", what I was getting at was that there is no theoretical advantage to propensity scores in terms of less biased estimates. You can do some really interesting diagnostics when using propensity scores by looking at covariate balance but that doesn't yield a really fundamental advantage.

    The case where I am the most sympathetic is the one that Andrew Gelman brings up — the ability to include a large number of background variables in an analysis with a common exposure and a rare outcome.

    In terms of regression and bias, it depends on the type of bias. Regression handles measured confounders well but it does nothing for selection bias (as an example). Fancier approaches can handle tricky cases like time varying confounding or non-linear relationships between the exposure and outcome (I like to use the J-shape seen with alcohol as an example).

    In terms of decision-making, the issue with a lot of prescription drugs is that there is a "healthy user effect". People who take statin drugs (as a classic example) may also engage in other health seeking behaviors. It can be unclear if an association between statin use and a reduced risk of an outcome (say cancer) is due to the medication or if the medication is a marker.

    It was noted that HRT use was associated with a reduced risk of violent death as well as cardiovascular death — this was an early sign that something could be wrong with this inference.

    This is not to say that observational drug research can't be extremely valuable (I hope it is or I will be retraining for a new career). These issues are much less severe for the detection of adverse drug reactions. And studies can generate hypotheses to be tested in randomized controlled trials.

    Compare the results of:

    Dale KM, Coleman CI, Henyan NN, Kluger J, White CM. Statins and cancer risk: a meta-analysis. JAMA 2006;295(1):74-80.


    Shannon J, Tewoderos S, Garzotto M, Beer TM, Derenick R, Palma A, Farris PE. Statins and prostate cancer risk: a case-control study. Am J Epidemiol 2005;162(4):318-25.

    The meta-analysis suggests an association between statin use and prostate cancer of OR 0.98 (95% CI: 0.83 to 1.15) while the case-control study suggests an OR of 0.38 (95% CI: 0.21 to 0.69). These are clearly not compatible estimates.

    Is the issue statistical modeling?

    I am not sure. Propensity scores have some strange properties when applied to case-control studies so I am not sure that this would necessarily help.

    So I worry about causal inference for beneficial drug effects. I've tried to look at improving methods for this (so I haven't given up) but it is an issue that worries me quite a bit.
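
To put a rough number on "not compatible", one can back out standard errors from the two intervals quoted in the comment above and compare the estimates on the log odds ratio scale. This is a back-of-the-envelope sketch. It assumes the intervals are Wald-type 95% intervals and that the two estimates are independent, neither of which is confirmed here, and the helper name `log_or_and_se` is made up for illustration.

```python
import math

def log_or_and_se(or_hat, lo, hi, z=1.96):
    """Recover log(OR) and its standard error from a reported 95% interval,
    assuming the interval is symmetric on the log scale."""
    return math.log(or_hat), (math.log(hi) - math.log(lo)) / (2 * z)

# Figures as quoted in the comment.
meta_b, meta_se = log_or_and_se(0.98, 0.83, 1.15)  # meta-analysis
cc_b, cc_se = log_or_and_se(0.38, 0.21, 0.69)      # case-control study

# Difference of the two log odds ratios, treating them as independent.
diff = meta_b - cc_b
se_diff = math.sqrt(meta_se ** 2 + cc_se ** 2)
z_stat = diff / se_diff

# Two-sided p-value from the normal approximation.
p_two_sided = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"ratio of ORs = {math.exp(diff):.2f}, z = {z_stat:.2f}, p = {p_two_sided:.4f}")
```

Under those assumptions the two log odds ratios differ by about three standard errors, which matches the informal reading of the intervals.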

  4. Kaiser says:

    Thanks for the clarification. I agree that while propensity scores are an elegant solution, they are not the end-all solution. The problem is not with measured confounders unless certain levels are not represented in some test group; it's with unknown confounders, as you pointed out. Of course, PS does not help with unknown confounders either, although, as you also indicated, one could theoretically throw a lot of covariates at the problem, a bit of data mining.

  5. rw says:

    I agree with Mr. Delaney. I approach observational studies with a great deal of skepticism, mostly as a result of the devastating effects of selection bias (for which propensity score matching can help us very little).

    I think this is a large part of the reason that you see so many economists dipping into epidemiology. There are many opportunities to vastly improve on observational analysis with just a little bit of creativity and data (i.e., Mostly Harmless Econometrics).

  6. Aaron says:

    It surprises me to learn of the extent to which epidemiologists and others rely on observational studies when what they really want to know is something to do with causality. And while you (Andrew) have a point when you say that this is all we've got, it seems like in practice a lot of observational stuff is regarded as saying something about causality, at least by the people writing the articles. I wonder if it wouldn't be more efficient for all of the resources that presently go into the large number of observational studies to go instead into a relative few randomized trials. Easier said than done I guess.

  7. a says:

    Might be worth re-emphasizing that RCTs as conducted in practice are not qualitatively different from observational studies with respect to confounding and bias (e.g., the "don't trust a generalization" issue).

    The main difference might be best put this way: there is always a (good) hope that a less wrong RCT could be done by someone in the future that reduces the confounding and bias (i.e., a light at the end of an infinite tunnel).

    > too much to expect that working statisticians … Maybe all we can hope is that they use
    > standard methods and document them well.

    And that may well be the advantage of propensity scores (once they become standard), and in particular of propensity score matching, where the exclusion of "odd" patients seems "automatic" or at least conventional (rather than propensity score stratification, where excluding "odd" patients seems arbitrary and at risk of being overruled by the statistical reviewer). These awkwardnesses need to be managed, and "reviewers" are very much part of the process.

    Also, as a historical aside, the reason Paul Rosenbaum once gave for his original motivation for propensity scores was to get a method that would be transparent to everyone. And the "well constructed regression models" response from the regression models industry very likely slowed the adoption of propensity score methods in the '80s and '90s.

  8. Stephen Senn says:

    In an article published in Statistics in Medicine last year, Erika Graf, Angelika Caputo and I argue that the Propensity Score may be less useful than commonly supposed. We argue that it makes sense to adjust for factors that are predictive of outcome (the regression philosophy) not for factors that are predictive of exposure (the propensity score philosophy). This explains in general why statisticians think it is wrong to analyse a matched pairs design the same way as a completely randomised design. In both cases the propensity score is 0.5 for all individuals but if you think that there is a bigger difference between than within pairs, the regression philosophy suggests that you should fit 'pair' as part of the model.
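
The matched-pairs point in the comment above can be illustrated with a small simulation. This is a sketch under made-up assumptions: 200 pairs, a between-pair standard deviation of 3, a within-pair standard deviation of 1, and a true effect of 1. Within each pair a coin flip decides who is treated, so every unit has assignment probability 0.5, yet the analysis that keeps 'pair' in the model is far more precise than the one that ignores it.

```python
import math
import random
import statistics

random.seed(1)

n_pairs = 200
effect = 1.0  # assumed true treatment effect

treated, control = [], []
for _ in range(n_pairs):
    pair_level = random.gauss(0, 3)  # large between-pair variation
    y_a = pair_level + random.gauss(0, 1)
    y_b = pair_level + random.gauss(0, 1)
    # Coin flip within the pair: each unit is treated with probability 0.5.
    if random.random() < 0.5:
        treated.append(y_a + effect)
        control.append(y_b)
    else:
        treated.append(y_b + effect)
        control.append(y_a)

# Paired analysis: work with within-pair differences.
diffs = [t - c for t, c in zip(treated, control)]
estimate = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n_pairs)

# Unpaired analysis: treat the data as two independent samples.
se_unpaired = math.sqrt(
    statistics.variance(treated) / n_pairs + statistics.variance(control) / n_pairs
)

print(f"estimate = {estimate:.3f}")
print(f"SE, paired analysis   = {se_paired:.3f}")
print(f"SE, unpaired analysis = {se_unpaired:.3f}")
```

The point estimate is the same under both analyses; only the standard error changes, because the pairing absorbs the between-pair variation in outcome.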

  9. Joseph Delaney says:

    With regard to Stephen Senn's comment, Alan Brookhart did a paper in the American Journal of Epidemiology showing that building a propensity score that included all predictors of outcome (including confounders) had lower mean squared error than one that included all predictors of exposure (also including confounders).

    It was a very counter-intuitive result. But it matches his argument well.

  10. Keith O'Rourke says:


    Interesting and timely paper, but I believe your matched-pair design example here is a bit unfair.

    Propensity score "anything" needs to adequately model the exposure (assignment), and in the matched-pair design, with an X to represent the conceptual matching variable, the assignment is P(T|X) = .5, not P(T) = .5, and the X's need to be matched well enough to recover the pairing.

    This reference (and the references contained within) may be of interest to some (one of the big problems is with vocabulary and what people mean by propensity score "whatever"):

    Donald B. Rubin. Teaching Statistical Inference for Causal Effects in Experiments and Observational Studies. Journal of Educational and Behavioral Statistics, Fall 2004, Vol. 29, No. 3, pp. 343–367.

  11. Stephen Senn says:

    I don't agree. The propensity score is the probability of assignment and a great deal is made in R&R that it is the coarsest balancing score that is possible, although, as Erika, Angelika and I show, no efficiency gains result from this coarseness.

    The whole point of my argument, however, is that focussing on things that predict outcome always leads immediately to the 'correct' solution in terms of intuition, whereas looking at what predicts assignment always requires bolstering by auxiliary arguments: 'That's not what's really meant', etc.

    Since, in a perfect clinical trial, the randomisation list is 100% predictive of treatment received, shouldn't we stratify by it? This would be pretty disastrous. But my belief that randomisation of itself does not cure patients allows me to ignore it, despite the fact that it is 100% predictive of assignment.

    Note that the reply to this argument that is sometimes given that the probability has to be calculated before the act of randomisation, at which point the propensity score is 50% for everybody rather than 100% and 0%, does not wash because you could run the randomisation process in a way (for example by giving people either a beer or a cup of tea prior to treatment to mark assignment) that statisticians would not accept and the only difference between acceptable and non-acceptable schemes comes down again to what is predictive of outcome.

  12. Anonymous says:

    Alan Gerber, Donald Green, and Edward Kaplan make this same argument, but formally, in one of their essays.