“How conditioning on post-treatment variables can ruin your experiment and what to do about it”

Brendan Nyhan writes:

Thought this might be of interest – new paper with Jacob Montgomery and Michelle Torres, How conditioning on post-treatment variables can ruin your experiment and what to do about it.

The post-treatment bias from dropout on Turk you just posted about is actually in my opinion a less severe problem than inadvertent experimenter-induced bias due to conditioning on post-treatment variables in determining the sample (attention/manipulation checks, etc.) and controlling for them/using them as moderators. We show how common these practices are in top journal articles, demonstrate the problem analytically, and reanalyze some published studies. Here’s the table on the extent of the problem:

Post-treatment bias is not new but it’s an important area where practice hasn’t improved as rapidly as in other areas.
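The mechanism can be illustrated with a small simulation. This is a minimal sketch under invented assumptions, not the analysis from the paper: the data-generating process, the unobserved trait `u`, the pass thresholds, and the function name `simulate` are all hypothetical. Treatment is randomized and has a true effect of 1.0, but passing a post-treatment attention check depends on both the treatment and the unobserved trait, so dropping the failures leaves arms that no longer match on that trait.

```python
import random
from statistics import mean

def simulate(n=100_000, true_effect=1.0, seed=0):
    """Compare the difference in means on the full sample with the
    difference after dropping respondents who fail a post-treatment check.
    All modeling choices here are illustrative assumptions."""
    rng = random.Random(seed)
    full = {0: [], 1: []}
    kept = {0: [], 1: []}
    for _ in range(n):
        t = rng.randint(0, 1)          # randomized treatment assignment
        u = rng.gauss(0, 1)            # unobserved trait that also drives the outcome
        y = true_effect * t + u + rng.gauss(0, 1)
        # Assumed post-treatment check: the treatment raises the bar for passing,
        # so treated respondents who pass have higher u on average.
        threshold = 1.0 if t == 1 else 0.0
        passed = u + rng.gauss(0, 1) > threshold
        full[t].append(y)
        if passed:
            kept[t].append(y)
    diff_full = mean(full[1]) - mean(full[0])
    diff_kept = mean(kept[1]) - mean(kept[0])
    return diff_full, diff_kept

diff_full, diff_kept = simulate()
print(f"full sample: {diff_full:.2f}, after dropping check failures: {diff_kept:.2f}")
```

In this setup the full-sample estimate stays close to the true effect, while the estimate on the retained respondents is inflated, even though assignment was random. How large the distortion is in any real study depends on how strongly the check is related to treatment and to the outcome.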

I wish they’d round their numbers to the nearest percentage point.


  1. Carlos Ungil says:

    I wish they had used a larger (n=100) sample.

    • Marcus says:

      I think that they only reviewed these studies to show that post-treatment conditioning is an issue in their field – not to give a precise estimate of the base rate at which this occurs. I would also not be surprised if this was done in response to a reviewer comment that this issue is well understood in the field. I’ve had to do similar reviews of articles to highlight questionable practices that methodologically savvy readers will incorrectly assume to be widely understood.

      • Carlos Ungil says:

        Actually I find the sample they use and the way they report their results perfectly acceptable. It’s just that if their sample had included exactly 100 articles then all the percentages in the table would be round and wouldn’t have triggered Andrew’s distaste for (what he perceives as) unwarranted precision.

  2. Kyle C says:

    I would argue for even less precision in the text. “Almost half,” “one in five,” “one in seven” are really the only kinds of meaningful ratios here. It’s not as if anyone cares whether this sample generalizes to a larger population; it is what it is, and it easily could have been a little different, by the contingencies of publishing, but probably not by much.

  3. Markus says:

    I was hoping for a post slamming the authors for a highly biased analysis which mostly serves to justify a publication on textbook knowledge. For instance, ‘Omits cases based on post-treatment criteria’ appears to include papers that drop e.g. 4 participants in group A and 5 in group B out of 200 per group because those participants failed to answer all questions (or always gave the same answer or whatever) post treatment. IMHO in that situation one can have a nuanced discussion on whether this might bias results, but my response to a general, unprompted lecture on ‘conditioning on post-treatment variables’ would be ‘go f*ck yourself’. The authors themselves (p. 13) note that this _can_ imbalance the sample and _may_ affect the composition of the samples, yet proceed as if this was guaranteed 100%, no matter what.

    Similarly, apparently any use of post-treatment variables gets flagged as ‘Controls for/interacts with a posttreat variable’, regardless of circumstances. This would appear to include many mediation type analyses. While I agree with the authors that many of those are bad and shouldn’t be done, I think it’s ludicrous to declare all cases a priori ‘bad practice’ no matter the circumstances. Again, p.16 we get ‘variables that seem likely to remain fixed … _can_ be affected’ and ‘creates _risk_ of spillover effects’, but apparently again no matter how unlikely that may be in individual cases or how carefully the authors confront the problem, the paper gets added to the 20%.

    Or to put it even more simply: By the authors’ stated criteria, the paper itself uses post-treatment variables in analysis in section 4.1 and should be added to the 20%. After all, it apparently doesn’t matter how the variables are used and how the analysis is presented.

    Quite apart from Table 1, I also think the authors are plainly wrong on p.1 when they go from ‘conditioning on post-treatment variables can ruin experiments’ to ‘we should not do it’. For one, even a misguided/wrong additional analysis does not ‘ruin’ the primary analysis or the experiment. Second, the proper response to risk is not total avoidance, but very careful management. If the authors believe anything that can be described as ‘conditioning on post-treatment variables’ always, inevitably ruins the entire experiment, they should say so. If they don’t, they need to show for each case they flag as bad practice that the authors in question actually failed to exercise due diligence.

    • Jacob says:

      I did find myself shrugging my shoulders in a couple cases. If a small number of people in my large N survey experiment don’t answer the questions that measure the DV, what do I do? Omission is conditioning on post-treatment variables, but the reason I shouldn’t condition on post-treatment variables is the same reason the bulk of missing value imputation procedures are treated as suspect. Maybe I should throw away my data or hedge my conclusions so substantially that I dare not present it as very informative on my RQ. Sure, 4 out of 200 or something might introduce bias, but surely it is far more probable than not that this bias is so small that it’s unlikely to affect conclusions.

      On the other hand, if there is really such a substantial number of non-compliers or failed manipulations that we can only do a reasonable analysis by conditioning on those post-treatment variables, we probably should consider whether the experiment was implemented well enough to get anything from it.

      I’m also unsatisfied with the section on mediation analysis, which they conclude (as seemed inevitable given the overall argument) is essentially impossible to conduct in a statistically unbiased way since mediators are always post-treatment. While there’s no doubt that some mediation analysis is bunk and implausible, I don’t accept the implication that sequential ignorability is implausible and unlikely. While they cast aside the Imai et al. (2010) work on mediation analysis as being founded on the assumption, empirically testing the plausibility/robustness of the assumption is a major focus of that work. This is not to mention, of course, that sometimes our scientific theory can provide substantial evidence in favor of a particular causal ordering. I’m sure the authors agree with this, but they should mention it.

      • The “right” thing to do is have a model for why the data is missing, and based on this model impute the missing values. The reason this is “right” is because it answers the question “surely it is far more probable than not that this bias is so small that it’s unlikely to affect conclusions.” quantitatively and shows what assumptions are needed to get that answer.

        The “right” way to consider any other option is to consider how it relates to the above method. If it’s a decent approximation, then … fine.

        It seems to me that much of the concern here lies with getting “meaningful” p values. Since p values are often not meaningful even without the post-treatment issues, it seems to me the solution is to actually have a model for your data, and to just use Bayesian analysis of that model and don’t worry about invalidating p values. The Bayesian analysis of that model is independent of the missingness precisely when there is no plausible way in which the missingness provides information about the outcome. If the missingness *does* provide information about the outcome, then by having a model for that, you can still get the right answer. If you have no model for it… then you can at least create a model in which different levels of informativeness are possible and see how the range of possible mechanisms you’re willing to accept affects the uncertainty in your outcomes.
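The last idea, trying a range of missingness mechanisms and seeing how much the answer moves, can be sketched without a full Bayesian model. What follows is a simpler, non-Bayesian stand-in (a delta-adjustment sensitivity check), not the approach the comment describes in full. It borrows the thread's example of 4 and 5 dropped respondents out of 200 per arm; the observed difference of 0.5 scale points, the range of `delta` values, and the function name `adjusted_effect` are hypothetical.

```python
def adjusted_effect(obs_diff, n_per_arm, miss_t, miss_c, delta_t, delta_c):
    """Difference in means after assuming the missing outcomes in each arm
    average `delta` points above (or below) that arm's observed mean.

    Filling in m missing values at (observed mean + delta) moves an arm's
    full-sample mean by (m / n_per_arm) * delta.
    """
    shift_t = miss_t / n_per_arm * delta_t
    shift_c = miss_c / n_per_arm * delta_c
    return obs_diff + shift_t - shift_c

# Hypothetical inputs: observed treatment-minus-control difference of 0.5,
# with 4 treated and 5 control respondents missing out of 200 per arm.
OBS_DIFF = 0.5
deltas = [-1.0, 0.0, 1.0]   # assumed range of how informative missingness might be

effects = [
    adjusted_effect(OBS_DIFF, 200, 4, 5, dt, dc)
    for dt in deltas
    for dc in deltas
]
low, high = min(effects), max(effects)
print(f"effect ranges from {low:.3f} to {high:.3f}")
```

Under these assumed numbers the estimate moves only between about 0.455 and 0.545, which puts a figure on the intuition that a handful of dropped cases is unlikely to change the conclusion. The same calculation also shows when that intuition fails: the range widens in proportion to the share of missing cases and to how extreme a mechanism one is willing to entertain.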

      • John Bullock says:

        I don’t accept the implication that sequential ignorability is implausible and unlikely.

        Can you suggest a social-science study in which sequential ignorability is likely to hold?
