
Resolving disputes between J. Pearl and D. Rubin on causal inference

This is a pretty long one. It’s an attempt to explore some of the differences between Judea Pearl’s and Don Rubin’s approaches to causal inference, and is motivated by a recent article by Pearl.

Pearl sent me a link to this piece of his, writing:

I [Pearl] would like to encourage a blog-discussion on the main points raised there. For example:

Whether graphical methods are in some way “less principled” than other methods of analysis.

Whether confounding bias can only decrease by conditioning on a new covariate.

Whether the M-bias, when it occurs, is merely a mathematical curiosity, unworthy of researchers attention.

Whether Bayesianism instructs us to condition on all available measurements.

I’ve never been able to understand Pearl’s notation: notions such as a “collider of an M-structure” remain completely opaque to me. I’m not saying this out of pride–I expect I’d be a better statistician if I understood these concepts–but rather to give a sense of where I’m coming from. I was a student of Rubin and have used his causal ideas for a while, starting with this article from 1990 on estimating the incumbency advantage in politics. I’m pleased to see these ideas gaining wider acceptance. In many areas (including studying incumbency, in fact), I think the most helpful feature of Rubin’s potential-outcome framework is to get you, as a researcher, to think hard about what you are in fact trying to estimate. In much of the current discussion of identification strategies, regression discontinuities, differences in differences, and the like, I think there’s too much focus on technique and not enough thought put into what the estimates are really telling you. That said, it makes sense that other theoretical perspectives such as Pearl’s could be useful too.

To return to the article at hand: Pearl is clearly frustrated by what he views as Rubin’s bobbing and weaving to avoid a direct settlement of their technical dispute. From the other direction, I think Rubin is puzzled by Pearl’s approach and is not clear what the point of it all is.

I can’t resolve the disagreements here, but maybe I can clarify some technical issues.

Controlling for pre-treatment and post-treatment variables

Much of Pearl’s discussion turns upon notions of “bias,” which in a Bayesian context is tricky to define. We certainly aren’t talking about the classical-statistical “unbiasedness,” in which E(theta.hat | theta) = theta for all theta, an idea that breaks down horribly in all sorts of situations (see page 248 of Bayesian Data Analysis). Statisticians are always trying to tell people, Don’t do this, Don’t do that, but the rules for saying this can be elusive. This is not just a problem for Pearl: my own work with Rubin suffers from similar problems. In chapter 7 of Bayesian Data Analysis (a chapter that is pretty much my translation of Rubin’s ideas), we talk about how you can’t do this and you can’t do that. We avoid the term “bias,” but then it can be a bit unclear what our principles are. For example, we recommend that your model should, if possible, include all variables that affect the treatment assignment. This is good advice, but really we could go further and just recommend that an appropriate analysis should include all variables that are potentially relevant, to avoid omitted-variable bias (or the Bayesian equivalent). Once you’ve considered a variable, it’s hard to go back to the state of innocence in which that information was never present.

If I’m reading his article correctly, Pearl is making two statistical points, both in opposition to Rubin’s principle that a Bayesian analysis (and, by implication, any statistical analysis) should condition on all available information:

1. When it comes to causal inference, Rubin says not to control for post-treatment variables (that is, intermediate outcomes), which seems to contradict Rubin’s more general advice as a Bayesian to condition on everything.

2. Rubin (and his collaborators such as Paul Rosenbaum) state unequivocally that a model should control for all pre-treatment variables, even though including such variables, in Pearl’s words, “may create spurious associations between treatment and outcome and this, in turns, may increase or decrease confounding bias.”

Let me discuss each of these criticisms, as best as I can understand them. Regarding the first point, a Bayesian analysis can control for intermediate outcomes–that’s ok–but then the causal effect of interest won’t be summarized by a single parameter–a “beta”–from the model. In our book, Jennifer and I recommend not controlling for intermediate outcomes, and a few years ago I heard Don Rubin make a similar point in a public lecture (giving an example where the great R. A. Fisher made this mistake). Strictly speaking, though, you can control for anything; you just then should suitably postprocess your inferences to get back to your causal inferences of interest.
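To make the intermediate-outcome point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and does not come from Pearl's or Rubin's papers: a randomized binary treatment x, a binary intermediate outcome m that x shifts, and an outcome y that depends on both. Under these assumed coefficients the total effect of x on y is 0.5 + 0.6 * 1.0 = 1.1, while the comparison stratified on m targets only the 0.5 that does not pass through m.

```python
import random

def simulate(n=200_000, seed=0):
    """Draw (x, m, y) rows from an assumed toy model: x -> m -> y and x -> y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0              # randomized treatment
        m = 1 if rng.random() < 0.2 + 0.6 * x else 0    # intermediate outcome
        y = 0.5 * x + 1.0 * m + rng.gauss(0.0, 1.0)     # final outcome
        rows.append((x, m, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def unadjusted_effect(rows):
    """Plain difference in mean outcome between treated and control."""
    treated = mean(y for x, m, y in rows if x == 1)
    control = mean(y for x, m, y in rows if x == 0)
    return treated - control

def mediator_adjusted_effect(rows):
    """Treated-minus-control difference within each level of m, averaged over P(m)."""
    n = len(rows)
    total = 0.0
    for level in (0, 1):
        stratum = [(x, y) for x, m, y in rows if m == level]
        treated = mean(y for x, y in stratum if x == 1)
        control = mean(y for x, y in stratum if x == 0)
        total += (treated - control) * len(stratum) / n
    return total

rows = simulate()
total_effect = unadjusted_effect(rows)            # targets 1.1 under these assumptions
adjusted_effect = mediator_adjusted_effect(rows)  # targets 0.5 under these assumptions
```

In this toy setup the stratified number is a perfectly good estimate of something, just not of the total effect of the treatment, which is one way to read the remark about postprocessing.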

I don’t fully understand Pearl’s second critique, in which he says that it’s not always a good idea to control for pre-treatment variables. My best reconstruction is that Pearl’s thinking about a setting where you could estimate a causal effect in a messy observational setting in which there are some important unobserved confounders, and it could well happen that controlling for a particular pre-treatment variable happens to make the confounding worse. The idea, I think, is that if you have an analysis where various problems cancel each other out, then fixing one of these problems (by controlling for one potential confounder) could result in a net loss. I can believe this could happen in practice, but I’m wary of setting this up as a principle. I’d rather control for all the pre-treatment predictors that I can, and then make adjustments if necessary to attempt to account for remaining problems in the model. Perhaps Pearl’s position and mine are not so far apart, however, if his approach of not controlling for a covariate could be seen as an approximation to a fuller model that controls for it while also adjusting for other, unobserved, confounders.
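For readers who, like me, find the "M-structure" hard to picture, the following sketch simulates one version of it. The specific model is an assumption made up for illustration: two independent unobserved binary variables u1 and u2, a pre-treatment covariate z that depends on both, a treatment x that depends only on u1, and an outcome y that depends on x (true effect 1.0) and on u2. Because u1 and u2 are independent, the raw comparison is not confounded here; stratifying on z links them and shifts the estimate.

```python
import random

def simulate(n=200_000, seed=2):
    """Draw (x, z, y) rows from an assumed M-structure; u1 and u2 are never recorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u1 = 1 if rng.random() < 0.5 else 0   # unobserved cause of x and z
        u2 = 1 if rng.random() < 0.5 else 0   # unobserved cause of z and y
        z = 1 if rng.random() < 0.1 + 0.4 * u1 + 0.4 * u2 else 0  # pre-treatment covariate
        x = 1 if rng.random() < 0.2 + 0.6 * u1 else 0             # treatment
        y = 1.0 * x + 2.0 * u2 + rng.gauss(0.0, 1.0)              # outcome, true effect 1.0
        rows.append((x, z, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def difference_in_means(pairs):
    """Mean of y among x == 1 minus mean of y among x == 0, for (x, y) pairs."""
    pairs = list(pairs)
    return mean(y for x, y in pairs if x == 1) - mean(y for x, y in pairs if x == 0)

def z_adjusted_effect(rows):
    """Difference in means within each level of z, averaged over P(z)."""
    n = len(rows)
    total = 0.0
    for level in (0, 1):
        stratum = [(x, y) for x, z, y in rows if z == level]
        total += difference_in_means(stratum) * len(stratum) / n
    return total

rows = simulate()
unadjusted = difference_in_means((x, y) for x, z, y in rows)  # close to the true 1.0
adjusted = z_adjusted_effect(rows)                            # pulled below 1.0
```

With these assumed numbers the stratified estimate lands near 0.8 rather than 1.0. Whether structures like this are common in real studies is exactly what is in dispute; the sketch only shows that the phenomenon is mathematically possible.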

The sum of unidentifiable components can be identifiable

At other points, Pearl seems to be displaying a misunderstanding of Bayesian inference (at least, as I see it). For example, he writes:

For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a “seat-belt user” or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.

I think this discussion misses the point in two ways.

First, at the technical level, yes you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable–in Bayesian terms, that they will individually have large posterior uncertainties. But, if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel out when they’re being combined to get the average treatment effect. If the uncertainties don’t cancel, it sounds to me like there must be some additional (“prior”) information that you need to add.
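The cancellation claim can be illustrated with a toy conjugate-normal calculation. This is a sketch under assumptions of my own choosing, not a model of the smoking example: two parameters a and b with independent N(0, tau^2) priors, and data that inform only their sum, y_i ~ N(a + b, sigma^2). The sum s = a + b and the difference d = a - b are independent a priori, and the likelihood involves s alone, so the posterior for d stays at its prior while the posterior for s tightens with n.

```python
import math

def posterior_summary(n, ybar, tau=10.0, sigma=1.0):
    """Posterior mean and sd of a + b, and posterior sd of a alone.

    Assumed model: a, b ~ N(0, tau^2) independently; y_i ~ N(a + b, sigma^2).
    """
    prior_var_sum = 2.0 * tau ** 2
    post_precision_sum = 1.0 / prior_var_sum + n / sigma ** 2
    post_var_sum = 1.0 / post_precision_sum
    post_mean_sum = post_var_sum * (n * ybar / sigma ** 2)
    # a = (s + d) / 2, where d = a - b keeps its prior variance 2 * tau^2
    post_var_a = (post_var_sum + 2.0 * tau ** 2) / 4.0
    return post_mean_sum, math.sqrt(post_var_sum), math.sqrt(post_var_a)

mean_sum, sd_sum, sd_a = posterior_summary(n=100, ybar=3.0)
```

With n = 100 the posterior sd of the sum is about 0.1 while the posterior sd of each component remains about 7, almost entirely prior. Whether Pearl's class-specific case has this structure is a separate question that the sketch does not answer.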

The second way that I disagree with Pearl’s example is that I don’t think it makes sense to estimate the smoking behavior separately for seat-belt users and non-users. This just seems like a weird thing to be doing. I guess I’d have to see more about the example to understand why someone would do this. I have a lot of confidence in Rubin, so if he actually did this, I expect he had a good reason. But I’d have to see the example first.

Final thoughts

Hal Stern once told me the real division in statistics was not between the Bayesians and non-Bayesians, but between the modelers and the non-modelers. The distinction isn’t completely clear–for example, where does the “Bell Labs school” of Cleveland, Hastie, Tibshirani, etc. fall?–but I like the idea of sharing a category with all the modelers over the years–even those who have not felt the need to use Bayesian methods.

Reading Pearl’s article, however, reminded me of another distinction, this time between discrete models and continuous models. I have a taste for continuity and always like setting up my model with smooth parameters. I’m just about never interested in testing whether a parameter equals zero; instead, I’d rather infer about the parameter in a continuous space. To me, this makes particular sense in the sorts of social and environmental statistics problems where I work. For example, is there an interaction between income, religion, and state of residence in predicting one’s attitude toward school vouchers? Yes. I knew this ahead of time. Nothing is zero, everything matters to some extent. As discussed in chapter 6 of Bayesian Data Analysis, I prefer continuous model expansion to discrete model averaging.

In contrast, Pearl, like many other Bayesians I’ve encountered, seems to prefer discrete models and procedures for finding conditional independence. In some settings, this can’t matter much: if a source of variation is small, then maybe not much is lost by setting it to zero. But it changes one’s focus, pointing Pearl toward goals such as “eliminating bias” and “covariate selection” rather than toward the goals of modeling the relations between variables. I think graphical models are a great idea, but given my own preferences toward continuity, I’m not a fan of the sorts of analyses that attempt to discover whether variables X and Y really have a link between them in the graph. My feeling is, if X and Y might have a link, then they do have a link. The link might be weak, and I’d be happy to use Bayesian multilevel modeling to estimate the strength of the link, partially pool it toward zero, and all the rest–but I don’t get much out of statistical procedures that seek to estimate whether the link is there or not.

Finally, I’d like to steal something I wrote a couple years ago regarding disputes over statistical methodology:

Different statistical methods can be used successfully in applications–there are many roads to Rome–and so it is natural for anyone (myself included) to believe that our methods are particularly good for applications. For example, Adrian Raftery does excellent applied work using discrete model averaging, whereas I don’t feel comfortable with that approach. Brad Efron has used bootstrapping to help astronomers solve their statistical problems. Etc etc. I don’t think that Adrian’s methods are particularly appropriate to sociology, or Brad’s to astronomy–these are just powerful methods that can work in a variety of fields. Given that we each have successes, it’s unsurprising that we can each feel strongly in the superiority of our own approaches. And I certainly don’t feel that the approaches in Bayesian Data Analysis are the end of the story. In particular, nonparametric methods such as those of David Dunson, Ed George, and others seem to have a lot of advantages.

Similarly, Pearl has achieved a lot of success and so it would be silly for me to argue, or even to think, that he’s doing everything all wrong. I think this expresses some of Pearl’s frustration as well: Rubin’s ideas have clearly been successful in applied work, so it would be awkward to argue that Rubin is actually doing the wrong thing in the problems he’s worked on. It’s more that any theoretical system has holes, and the expert practitioners in any system know how to work around these holes.

P.S. More here (and follow the links for still more).


  1. Alex F says:

    >>That said, I see no reason why other theoretical perspectives such as Pearl's might be useful too.

    I assume this was a typo…

  2. Andrew Gelman says:

    I rewrote to be clearer.

  3. Andrew Gelman says:

    Larry Wasserman writes:

    This has nothing to do with conditioning or Bayesian inference. Let theta(x) be the causal effect of X on Y when X=x. Let Z and W be two other variables. In some cases we have

    theta(x) = integral E(Y|X=x,Z=z,W=w) f(z,w) dz dw

    and in other cases

    theta(x) = integral E(Y|X=x,Z=z) f(z) dz

    It depends on the graph.

    (This can also be expressed with counterfactuals. In this case, it depends on the joint distribution of the counterfactuals.)

    The point is simply that putting every possible variable into the equation for the adjusted effect is not always right. It's a math question about how to express the causal effect as a functional of the joint distribution. It is not about conditioning versus not conditioning on data.
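A discrete version of the second formula above, theta(x) = sum over z of E(Y|X=x,Z=z) P(z), can be sketched as a plug-in estimate. The data-generating model is an assumption for illustration: a binary Z that affects both X and Y, with a true effect of X on Y equal to 1.0. In this particular graph, adjusting for Z is the right thing to do, and the raw comparison is the one that goes wrong.

```python
import random

def simulate(n=200_000, seed=1):
    """Draw (x, z, y) rows from an assumed model where z confounds x and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < 0.2 + 0.6 * z else 0
        y = 1.0 * x + 2.0 * z + rng.gauss(0.0, 1.0)
        rows.append((x, z, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def theta(rows, x_value):
    """Plug-in estimate of sum_z E(Y | X = x_value, Z = z) * P(Z = z)."""
    n = len(rows)
    total = 0.0
    for z_value in (0, 1):
        p_z = sum(1 for _, z, _ in rows if z == z_value) / n
        cell_mean = mean(y for x, z, y in rows if x == x_value and z == z_value)
        total += cell_mean * p_z
    return total

rows = simulate()
naive_difference = (mean(y for x, _, y in rows if x == 1)
                    - mean(y for x, _, y in rows if x == 0))
adjusted_difference = theta(rows, 1) - theta(rows, 0)
```

Under these assumptions the naive difference targets 2.2 and the adjusted difference targets 1.0. The function theta returns a single value per x, in the integral form written above; forming theta(1) - theta(0) afterwards gives a comparison.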

    I appreciate Larry's note, but I think that, more than anything else, he's demonstrating that Pearl is operating in an entirely different conceptual framework than I am. First, the concept of "causal effect of X on Y when X=x" doesn't make sense to me, because I am used to thinking of a causal effect as a comparison, i.e., you need at least two different values of x. Second, I wouldn't usually think of defining a causal effect as an integral at all.

    I mentioned this to Larry, and he wrote that these were just notational issues, but I think there's something more there. We have more discussion in chapter 7 of BDA and chapters 9 and 10 of ARM.

  4. Corey says:

    "By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race."

    — Alfred North Whitehead, “An Introduction to Mathematics”

    For any statement expressed in Rubin's framework, an equivalent statement can be expressed in Pearl's graphical language. When it comes to ease of use, there's just no comparison — Pearl's causal graphs are analogous to graphs (that is, images structured specifically to expose and convey information), and Rubin's notation is analogous to tables (that is, lists of symbols).

  5. Andrew Gelman says:


    I'm a big fan of notation, but I don't see the relevance of your quote here. Pearl's method does _not_ seem equivalent to the statistical methods I'm familiar with. For example, as Pearl notes, Paul Rosenbaum recommends controlling for pre-treatment variables (as do Jennifer and I, following standard statistical practice), but Pearl does not. For another example, it seems to me that Pearl favors discrete models in which certain variables might not have links between them. As discussed above, I prefer fully continuous models in my social and environmental science applications. And Larry writes causal inferences as integrals, whereas I think of them as differences (see, for example, my 1990 paper with King, linked to above).

    These are real differences in modeling strategy and interpretation, not merely in notation.

    Again, I'm not saying Pearl (or Larry, or you) are wrong, merely that he's doing something _different_ from the methods I'm used to. And Pearl might do better to explore and understand these differences rather than maintain frustration that people don't see the equivalence. Once you recognize that we have different goals, it's easier to accept that we are using different methods.

  6. Cyrus says:

    Part of the problem as I see it is in the imprecision with which words like "condition" and "control" are used. Here's an example from a paper that I presented at MPSA a few months back, drawing on Hernan et al (2004; full ref below).

    Suppose a data generating process such that

    L~Bernoulli(logit^(-1)(-2 + 2D + 2E))
    C~Bernoulli(logit^(-1)(-4 + 3L)),

    and (i) Y values are missing for C = 1, but X, L, and C are always observed. The expected marginal relationship between X and Y is zero–they are just independent draws. L and C are "downstream" from X and Y; L is thus a "collider" for X and Y, C is a descendant of L, and thus in a reduced form, also a collider for X and Y. They are also "posttreatment", if we take X to be a treatment indicator. If we just use the observed Y's to estimate the relationship between X and Y in, say, a logistic regression, we'll get a coefficient on X that is too low. In doing this, we are implicitly "conditioning on C", because we are only working with data for which C=0. If we continue to "condition on C" in this way and we also "condition on L" by including it as a covariate in the regression, the estimate will be even more downwardly biased. But if we "condition on C and L" by first estimating the probability that C=0 conditional on L in our data, and then use the inverse of those probabilities to weight our data in a logistic regression of Y on X, we will get unbiased estimates.

    This example suggests the following. First, we need to be clearer about what we mean when we say "condition on…" Second, heuristics about how to work with "colliders"/"posttreatment" variables may need to be "conditioned" on the type of selection processes that may be at work.


    Hernan MA, Hernandez-Diaz S, Robins JM. 2004. A structural approach to selection bias. Epidemiology. 15(5).
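The three ways of "conditioning" in this example can be compared with an exact population-level calculation. The sketch below is not the analysis from the MPSA paper and makes several assumptions: it uses the corrected equation for L posted later in this thread (with X and Y in place of D and E), it takes X and Y to be independent fair coin flips (the comment does not say how they are distributed), and it reports log odds ratios from exact 2x2 tables in place of fitted logistic-regression coefficients.

```python
import math
from itertools import product

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def joint():
    """Exact P(X=x, Y=y, L=l, C=c) under the assumed data-generating process."""
    table = {}
    for x, y, l, c in product((0, 1), repeat=4):
        p_l1 = expit(-2 + 2 * x + 2 * y)   # L ~ Bernoulli(logit^(-1)(-2 + 2X + 2Y))
        p_c1 = expit(-4 + 3 * l)           # C ~ Bernoulli(logit^(-1)(-4 + 3L))
        p = 0.25                           # assumption: X, Y independent fair coins
        p *= p_l1 if l else 1.0 - p_l1
        p *= p_c1 if c else 1.0 - p_c1
        table[(x, y, l, c)] = p
    return table

def log_odds_ratio(mass):
    """Log odds ratio of a 2x2 table given as {(x, y): mass}."""
    return math.log(mass[(1, 1)] * mass[(0, 0)] / (mass[(1, 0)] * mass[(0, 1)]))

table = joint()
cells = list(product((0, 1), repeat=2))

# (a) Complete cases only, which implicitly conditions on C = 0.
complete_case = log_odds_ratio(
    {(x, y): sum(table[(x, y, l, 0)] for l in (0, 1)) for x, y in cells})

# (b) Complete cases, additionally stratified on L.
within_l = {l: log_odds_ratio({(x, y): table[(x, y, l, 0)] for x, y in cells})
            for l in (0, 1)}

# (c) Complete cases reweighted by 1 / P(C = 0 | L).
p_observed_given_l = {
    l: sum(table[(x, y, l, 0)] for x, y in cells)
       / sum(table[(x, y, l, c)] for x, y in cells for c in (0, 1))
    for l in (0, 1)}
ipw = log_odds_ratio(
    {(x, y): sum(table[(x, y, l, 0)] / p_observed_given_l[l] for l in (0, 1))
     for x, y in cells})
```

Under these assumptions the complete-case log odds ratio is slightly negative, the L-stratified one is strongly negative in both strata, and the reweighted one is zero up to floating-point error, matching the ordering described in the comment.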

  7. David Kane says:

    Readers may be interested in the Wikipedia article on the Rubin Causal Model.

    Perhaps you could provide a link above?

  8. judea pearl says:

    —————-comment to A Gelman July 5, 2009 —
    Dear Andrew,
    Thank you for your mesg of July 5.
    I appreciate your genuine and respectful quest to
    explore the differences between the approaches
    that I and Don Rubin are taking to causal inference.

    In general, I would be the first to rally behind your
    call for theoretical pluralism (e.g., "It makes sense that
    other theoretical perspectives such as Pearl's could be useful
    too.") We know that one can prove a theorem in geometry by either
    geometrical or algebraic methods, depending on the
    problem and the perspective one prefers to take —
    only the very dogmatic would label one of the
    methods "unprincipled".

    My article "Myth, confusion and Science in Causal Analysis"
    is written with this dual perspective in mind,
    fully accommodating the graphical and
    potential-outcome conceptualizations as interchangeable,
    "A theorem in one approach is a theorem in another"
    I wrote.

    However, when adherents of the one-perspective approach
    make claims that mathematically contradict
    those derived from the dual-perspective approach,
    one begins to wonder whether there is something
    more fundamental at play here.

    In our case, the claim we hear from two
    adherents of the graph-less one-perspective school
    is: "there is no reason to avoid adjustment for a variable
    describing subjects before treatment."
    And from three adherents of the graph-assisted
    dual-perspective school we hear:
    "Adjustment for a variable describing subjects before
    treatment may be harmful."

    This is a blatant contradiction that affects every
    observational study and deserves therefore to be discussed
    even if we believe in "let a thousand roses bloom".

    One may be tempted to resolve the contradiction
    by appealing to practical expediencies. For example,
    1. Nothing is black and white.
    2. Perhaps adjustment may be harmful
    in theory, but is very rare in practice,
    3. Perhaps the harm is really very small, or
    4. we do not really know in practice if it is harmful
    or not, so why worry?

    This line of defense would be agreeable, were it
    not accompanied with profound philosophical claims that
    the dual-perspective approach is in some way
    "unprincipled" and standing (God forbid) "contrary to Bayesianism".

    The point is that we DO KNOW in practice when harm is
    likely to occur through improper adjustments. The
    same subjective knowledge that tells us that seat-belt usage
    does not cause smoking or lung disease also tells us that
    adjustment for seat-belt usage is likely to introduce bias.
    Moreover, one can derive this warning in the graph-less
    notation of potential outcome.
    So, the question remains: why haven't potential outcome
    scholars been issuing that warning to their

    The conjecture I made
    should concern every Bayesian and every educator,
    for it points beyond M-bias and covariate selection.
    The conjecture is that the language of "potential outcome"
    and "ignorability" discourages investigators from
    articulating and using valuable knowledge which they
    possess, for example, that seat-belt usage does not cause
    smoking. Do you know of any study where such a piece
    of knowledge was used in determining whether treatment
    assignment is "ignorable" or not?
    My conjecture is confirmed
    by potential-outcome practitioners who admit
    to be using "ignorability" invariably to justify their favorite
    method of analysis, never as an object to be
    justified by appeal to causal knowledge.

    As to indiscriminate conditioning in Bayesian philosophy,
    the example of controlling for an
    intermediate variable (between treatment and outcome)
    should illuminate our discussion.

    (I do not buy your statement that
    bias is "tricky to define". It is extremely
    easy to define, even in Rubin's notation:
    "Bias" is what you get if you adjust for Z and
    treatment assignment is not ignorable conditioned
    on Z. This would suffice for our purposes)

    You say:
    1. a Bayesian analysis can control for intermediate
    outcomes -that's ok – but then ……
    2. Jennifer and I recommend not controlling for intermediate outcomes
    3. You can control for anything, you just then should
    suitably postprocess….
    4. I heard Don Rubin make a similar point… Fisher made
    this mistake.

    Andrew, I know you did not mean it to sound so
    indecisive, but it does. Surely, one can always add
    17.5 to any number, as long as one remembers to
    "post-process" and correct the mistake later on.
    But we are not dealing here with children's arithmetic.
    Why not say it upfront: "You can't arbitrarily add 17.5
    to a number and hope you did not do any harm."
    Even the Mullahs of arithmetic addition would forgive us
    for saying it that way.

    If you incorporate an intermediate variable M as a
    predictor in your propensity score and continue to do
    matching as if it is just another evidence-carrying predictor,
    no post processing will ever help you, except
    of course, redoing the estimation afresh, with M removed.
    It will not fix itself by taking more samples.
    Is Bayesianism so dogmatic as to forbid us
    from speaking plainly and just saying: "Don't condition"?
    (No wonder I once wrote: "why I am only a half-Bayesian"….)

    True, the great R A Fisher made a similar mistake.
    But it happened in the context of estimating
    "direct effects", where one wants to control for
    the intermediary variable, not in the context of "causal effects,"
    where one wants the intermediaries to vary freely.
    Incidentally, the repair that Don Rubin offered in the
    Fisher lecture made things even worse.
    For example, the direct effect according to Rubin's
    definition (using principal stratification) is definable
    only in units absent of indirect effects.
    This means that a grandfather would be deemed to have no direct
    effect on his grandson's behavior in families where
    he has some effect on the father.
    In linear systems, to take a sharper example,
    the direct effect would be undefined whenever indirect paths
    exist from the cause to its effect.

    Such paradoxical conclusions emanating from a
    one-perspective culture underscore the wisdom, if not necessity
    of a dual-perspective analysis, in which the counterfactual
    notation Y_x(u) is governed by the formal semantics of
    graphs, structural equations and open-mindedness.

    I just saw Larry Wasserman's comment.
    Larry is right, I do not operate in an "entirely different conceptual
    framework".
    I call the [X x Y] –> [0,1] function P(Y_x = y) "Causal effect"
    leaving it up to the investigator to form differences
    P(Y_1 = y) – P(Y_0 = y), P(Y_8 = 3) – P(Y_5 = 3),
    or ratios: P(Y_1 = y) / P(Y_0 = y)
    or any other comparison that fits fashion and dogma.

    This does not make for a different
    conceptual framework, it is the common engineering practice
    of not wasting precious symbols on trivialities.
    What does call for a possible realignment
    of conceptual frameworks is what you tell your students about
    adjustment for intermediaries, and whether big-brother
    Bayes approves.

    Try it.

    Best =======Judea

  9. Cyrus says:

    There was a typo in my post. It should say

    L~Bernoulli(logit^(-1)(-2 + 2X + 2Y))

  10. Corey says:

    It seems Pearl's contention is the approaches are equivalent but that the difficulty of working with Rubin's notation has obscured that fact (and also important facts about how to do causal inference) — hence my quote. If so, it ought to be possible to re-express M-collider language and so forth and demonstrate Pearl's assertion about covariate adjustment in potential-outcome terms. Maybe I'll take a stab at that myself — it would be a good learning experience.

    When Pearl talks about bias, I think he actually means asymptotic consistency.

    On hierarchical models vs. causal graphs, it seems to me that someone who had really grokked Pearl's book on causality and also knew hierarchical models the way you do could probably write a raft of important papers combining the two approaches. (Alas, I feel that I personally lack the intuitive grasp of these two subjects necessary to write such papers myself.)

  11. Joseph Delaney says:

    "Whether the M-bias, when it occurs, is merely a mathematical curiosity, unworthy of researchers attention."

    I find the M-bias to be a confusing example.

    I think, in general, if strong and unmeasured confounders are introduced into an example then you will have confounded estimates. But what is typically lacking in these situations are serious candidates for the 2 "unknown" confounders required for M-bias.

    In general, even having the crude and the adjusted estimate be extremely close doesn't protect against missing information. What I guess I am missing is how this helps.

    That being said, I have used DAGs in some of my papers and I did think that Pearl did an amazing service to the field by getting researchers to show their assumptions in an easy to understand manner.

  12. Andrew Gelman says:

    Cyrus: So in your example, there are two things going on:

    (1) The usual advice of Rosenbaum, Rubin, and everyone else in the literature to not control for intermediate variables is correct, but

    (2) There is a missing-data problem, and that needs to be modeled correctly.

    I guess the point is that the notation of colliders, graphical models, etc., can help with problem 2 in this example?

  13. Andrew Gelman says:

    Corey: If you could combine Pearl's and Rubin's framework with multilevel modeling, that would be great. As I noted above in my response to Larry, I don't think Pearl's and Rubin's frameworks are equivalent, and I think it would be helpful to all concerned to explore the differences.

    David: Thanks for the link. This particular Wikipedia article looks like it was written by a follower of Pearl; I think a better (albeit imperfect) presentation of Rubin's ideas is in chapter 7 of Bayesian Data Analysis and chapters 9 and 10 of Data Analysis Using Regression and Multilevel/Hierarchical Models.

    Judea: Thanks for the long response. I'll reply in a future blog entry.

  14. Cyrus says:

    Andy: Yes, I think a graphical model/DAG is a very handy way to sort out how the available information can be used for unbiased inference of the X-Y relationship in the example.

  15. Phil says:

    Boy, these academic disputes are fun! Such vitriol! Such personal animosity! It's better than reality TV. Did Rubin slap Pearl's mom, or perhaps vice versa?

    As for the seat belts, smoking, and lung cancer example, where do I go to actually see the analysis that is being discussed? Without a specific case to think about, I can't say whether I think it is important to include seat belt usage as a variable, important to exclude it, or something in between.

    I will say that if you are trying to determine the causal relationship between smoking and lung cancer, and it turns out there is a strong _statistical_ relationship between non-seatbelt-wearing and lung cancer, then you have a problem. If you include seatbelt-wearing as a variable, you mess up your ability to interpret the smoking-cancer relationship—since you know there is no causal relationship between seat-belt-wearing and lung cancer, including this variable can only screw you up. But suppose you decide, on this basis, NOT to include seat-belt-wearing as a predictive variable (this seems to be Pearl's view of what you MUST do). How can you interpret the smoking-cancer relationship as a PURELY causal one? People who smoke are more likely to get lung cancer for the obvious reason — smoking causes lung cancer — but perhaps _also_ because they indulge in other risky behaviors (like smoking pot, huffing paint, working at jobs in dusty environments, whatever). This is just the usual correlation-causation problem. Perhaps you make it even worse if you include seatbelt-wearing than if you don't, but it's not like the problem goes away if you exclude seatbelt-wearing.

    The thing to do (please excuse me if this is obvious) is to look at the smoking/cancer relationship separately for seatbelt-wearers and for non-seatbelt-wearers. If it's about the same — if seatbelt-wearing smokers get lung cancer at about the same rate as non-seatbelt-wearing smokers — that would be evidence that the "smoking" coefficient is causal, or at least isn't confounded by other risky behaviors (though it still could be confounded by something else that isn't related to seatbelt-wearing).

    I guess the points I'm making are: (1) the Pearl-Rubin debate seems very personal, which is really fun in a way, but doesn't seem like the best way to advance the field of statistics. And (2) my general sense, uninformed by any formal knowledge whatsoever about M-bias and colliders, is that if you have information you should use it if you can. If you know that seat belts do not CAUSE lung cancer, you should use that fact in your analysis, and if you know that seat-belt-wearing is statistically associated with lung cancer you should use that fact too (if you can). The combination of these means that almost certainly you should not put "seat-belt-wearing" and "smoking" into an analysis to predict lung cancer IN THE SAME WAY (one is at least largely causal, the other is not), but also means that they should both be in your analysis _somehow_. If I have that wrong, I would love to know why, because it will definitely change the way I work (and think)!

  16. Andrew Gelman says:

    Phil: What I want to know is, how do you decide when to _emphasize_ and when to SHOUT?

    More seriously, I don't think the dispute between Pearl and Rubin is personal in any sort of bad way. I think it's just disagreement about some scientific ideas, and, as we all know, such disagreement can be painful. It's incredibly frustrating when others seem to miss the point.

    For example, it's frustrating to some Bayesians that people do posterior predictive checks because it's obviously wrong and uses the data twice. At the same time, it's frustrating to me that some Bayesians tell people not to use posterior predictive checks: as I've shown repeatedly, such checks have direct Bayesian interpretations and use the data exactly once! For some reason, we're not all at each others' throats over this one, but it still gets me worked up, that's for sure.

    Regarding the smoking example, I think that some of the difficulties in the smoking analysis had to do with legal constraints: Rubin was doing the analysis as part of a court case, and in such settings, the variables that you can use are sometimes determined externally, for example by a judge. So in some ways this might not be the best example for us to be focusing on.

  17. judea pearl says:

    Andrew et al,
    This discussion has been a great education for me.
    It gives me a first hand look at how people
    from a different culture react to ideas that,
    for me, have been second nature. It is a great
    preparation for the JSM tutorial that I will be
    giving August 4, in DC.

    A few specific comments to the various commentators.
    On M-bias.
    Let us focus on the easier example of an
    intermediate variable (Z) between treatment (X) and outcome (Y).
    Has anyone seen a proof that adjusting for Z would
    introduce bias? (And I mean asymptotic bias, defined
    unambiguously as the difference between what you want,
    i.e., E(Y_x), and what you estimate, i.e., E[Y|x,z]
    or SUM_z E[Y|x,z]P(z))
    (And please spare me the labor of writing the difference
    E(Y_1)-E(Y_0); all the information is in
    E(Y_x), x = 0,1,2,3….)

    Has anyone seen a proof?
    I found a verbal warning in Cox (1958, p. 48)
    saying: "the concomitant observations should be quite
    unaffected by the treatment".
    But why? And where is the proof?
    One would think that 80 years after Neyman wrote
    down the symbol Y_x, every statistics textbook would have
    a proof that adjusting for Z is a bad thing.
    (Similarly, adjusting for a proxy of Z, which is
    not on the pathway, is also a bad thing.)
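One way to see the bias for an intermediate variable is a small linear example. This is only a sketch under assumed structural equations; the coefficients a and b and the noise terms are illustrative and do not come from any of the texts cited in this thread.

$$
Z = aX + \varepsilon_Z, \qquad Y = bZ + \varepsilon_Y, \qquad \varepsilon_Z,\ \varepsilon_Y \ \text{mean zero, independent of each other and of } X .
$$

Under these assumptions,

$$
E(Y_x) = abx, \qquad E[Y \mid x, z] = bz, \qquad \sum_z E[Y \mid x, z]\,P(z) = b\,E[Z] = ab\,E[X] .
$$

The adjusted quantity does not depend on x at all, so the asymptotic bias is

$$
E(Y_x) - \sum_z E[Y \mid x, z]\,P(z) = ab\,\bigl(x - E[X]\bigr),
$$

which is nonzero whenever the treatment has an effect (ab ≠ 0) and x differs from E[X].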

    What I find strange is that even people versed in causal
    analysis are still referring to this warning as good
    "advice" passed to us by the sages, Cox, Rosenbaum, Rubin,
    but not as a fact in the real world out there, a fact
    that everyone can see, verify and prove mathematically,
    if needed.

    I dwell on this issue because it is indicative
    of the confusion about M-bias and about the larger
    issue of whether one should adjust for all available
    measurements or not. The general attitude in this discussion
    has been to treat the issue as if it was a personal dispute about
    a wine tasting contest; Rosenbaum and Rubin say
    "adjust", Pearl and his camp say "do not adjust",
    both sides quote good reasons, so it must be a matter
    of taste, style, focus, perspective, interest, method
    etc. It isn't.

    Isn't it possible that either Rosenbaum or Pearl just
    made a careless statement, which he now regrets?
    I am surprised that none of the discussants came up
    with a bold statement such as: "I can prove Pearl wrong"
    or "I have a counterexample to Rosenbaum".
    We are in the 21st century, the age of mathematics; why not
    go out there and see what the world says about
    seatbelts or what the mathematics says?

    Let us go back to the intermediary variable issue.
    Let X and Y be the outcomes of two fair coins,
    and let Z be a bell that rings if at least one
    of X and Y comes up heads. We wish to estimate
    the causal effect of X on Y after collecting
    a huge number of samples, each in the form of
    a triplet (X, Y, Z). Should we include Z in the analysis?
    If so, how? Would our favorite estimate of E(Y_x) be biased?
    Will it give us what we expect, namely, that
    X has no causal effect on Y, i.e., E(Y_x) = E(Y)?
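The coins-and-bell question can be answered by enumerating the four equally likely coin outcomes. The sketch below is illustrative only: the names `joint` and `cond_mean_y` are made up for this example, and the bell is assumed to be noiseless. It compares E[Y|x], which ignores the bell, with E[Y|x,z=1], which conditions on the bell having rung.

```python
from fractions import Fraction
from itertools import product

NAMES = ("x", "y", "z")


def joint():
    """Exact joint distribution of (x, y, z): two fair coins and a noiseless bell."""
    dist = {}
    for x, y in product((0, 1), repeat=2):
        z = int(x == 1 or y == 1)  # bell rings if at least one coin shows heads
        dist[(x, y, z)] = Fraction(1, 4)
    return dist


def cond_mean_y(dist, **given):
    """E[Y | given]; returns None if the conditioning event has probability zero."""
    num = den = Fraction(0)
    for outcome, p in dist.items():
        vals = dict(zip(NAMES, outcome))
        if all(vals[k] == v for k, v in given.items()):
            den += p
            num += p * vals["y"]
    return num / den if den else None


dist = joint()
# Ignoring the bell: E[Y | x]
unadjusted = {x: cond_mean_y(dist, x=x) for x in (0, 1)}
# Conditioning on the bell having rung: E[Y | x, z=1]
within_bell = {x: cond_mean_y(dist, x=x, z=1) for x in (0, 1)}
print(unadjusted)
print(within_bell)
```

Ignoring Z gives E[Y|x] = 1/2 for both values of x, which matches E(Y_x) = E(Y). Within the stratum where the bell rang, Y looks as if it depends on X (1 versus 1/2), although neither coin affects the other. The stratum X=1, Z=0 is empty, so SUM_z E[Y|x,z]P(z) is not even defined for x = 1 in this noiseless version.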

    Now, these are questions that, in the 21st century,
    should be answered immediately without referring to
    the sages or to philosophical disputes about
    Bayesianism or graphs, or hierarchical modelling
    or missing data. This is Stat-101!!!

    The same goes for M-bias.
    Let treatment X' be determined by the coin X,
    let outcome Y' be determined by the coin Y
    and, lo and behold, we have an M-bias on our hands.
    We will get the right answer (i.e., ignore the measurements
    of Z) even when the coins are biased, even when the bell is
    corrupted by noise, and even if the treatment X' does have
    some effect on outcome Y'.
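The M-bias version can be checked the same way. The sketch below makes several assumptions that are not in the comment: treatment X' is taken to equal coin X exactly, the expected outcome is a made-up linear function of coin Y and the treatment, and the bell reading is flipped with probability `noise` (which must lie strictly between 0 and 1 so that every stratum is non-empty). The function name `m_bias_demo` and all default numbers are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def m_bias_demo(p_x=F(1, 2), p_y=F(1, 2), noise=F(1, 10), effect=F(1, 5)):
    """Return (crude, Z-adjusted) treatment contrasts under an M-structure.

    Coin X drives the treatment, coin Y drives the outcome, and the noisy
    bell Z is a common effect of both coins. The true effect is `effect`.
    """
    # cells[(t, z)] = [probability mass, mass-weighted expected outcome]
    cells = {(t, z): [F(0), F(0)] for t in (0, 1) for z in (0, 1)}
    for u1, u2, flip in product((0, 1), repeat=3):
        p = ((p_x if u1 else 1 - p_x)
             * (p_y if u2 else 1 - p_y)
             * (noise if flip else 1 - noise))
        t = u1                                       # treatment X' set by coin X
        z = int(u1 or u2) ^ flip                     # noisy bell
        mean_y = F(1, 5) + F(1, 2) * u2 + effect * t  # assumed outcome model
        cells[(t, z)][0] += p
        cells[(t, z)][1] += p * mean_y

    def mean(t, z=None):
        keys = [(t, zz) for zz in (0, 1) if z is None or zz == z]
        mass = sum(cells[k][0] for k in keys)
        return sum(cells[k][1] for k in keys) / mass

    crude = mean(1) - mean(0)
    p_z = {z: cells[(0, z)][0] + cells[(1, z)][0] for z in (0, 1)}
    adjusted = sum(p_z[z] * (mean(1, z) - mean(0, z)) for z in (0, 1))
    return crude, adjusted


crude, adjusted = m_bias_demo()
print(crude, adjusted)
```

With the default numbers the crude contrast recovers the assumed effect of 1/5 exactly, while standardizing over Z gives 3/25. Changing `p_x` and `p_y` to biased coins leaves the crude contrast at 1/5, in line with the claim that ignoring Z gives the right answer.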

    My point is that the answer lies in the world out there,
    and in our ability to capture that world in our model;
    it is not a matter of differences between me and Don Rubin
    or between me and Andrew Gelman.

    Several times you alluded to "differences"
    between what I am doing and the way you are used
    to do things — why not look at the bell? No differences
    there; it is the objective world, even for a Bayesian.

    Speaking of hierarchical models vs. causal graphs,
    I doubt whether they have anything in common.
    My litmus test is simple: Can we express mathematically
    the following piece of information: "the bell does
    not cause the outcome of coin X" ? We can say it in graphs,
    we can say it in potential-outcome notation, but I
    doubt you can express it with the language of hierarchical
    models. Why? Because the latter is based strictly on
    properties of distributions, and we cannot say the
    word "cause" in the language of distributions.

    Note that the bell ringing is highly correlated with both
    the treatment and the outcome, like the seat-belt
    usage. And yet, stat-101 tells us that
    "if you have information you should use it"
    — true — but sometimes you should just ignore it (the bell).

    You have actually come close to solving our dispute.
    You said:
    "If you know that seat belts do not CAUSE lung cancer, you
    should use that fact in your analysis"
    This is the crux of the matter.
    To use that fact, you need to write it mathematically,
    and derive things from it,
    and this is where the weakness of the Rubin model lies.
    Try to write this fact in the language of potential
    outcomes, and see if you can derive from it the conclusion
    that adjusting for seat-belt may be harmful.
    My students can, because they learned how to map
    any causal utterance to the language of potential-outcome.
    Recall, we are a dual-perspective university.
    But I have not met many of Rubin's students who can
    express this fact even in the potential outcome language.
    Try it.

    So how do they manage without expressing such basic facts?

    They don't. And because it has been so cumbersome for them
    to express it,
    they just avoid thinking about such facts and, instead,
    they rely on what the sages said or did not say. See wikipedia.

    David Kane, Andrew, I did not write the Wikipedia entry for
    Rubin Causal Model. It was probably written by one of Rubin's
    disciples (you can tell by the word "assignment", which is taken
    from experimental design, and which I rarely use).
    My assistant wrote the section on "Relation to Other
    Approaches" and it is still valid today, though it
    could benefit from two examples:
    1. The issue of indiscriminate conditioning.
    2. The definition of direct effects.

    Andrew, Corey,
    Corey is right in stating that "any statement expressed
    in Rubin's framework, an equivalent statement can be
    expressed in Pearl's structural language". The differences
    you see between my methods and the "statistical methods
    you are familiar with" are superficial. If Rosenbaum
    recommends controlling for pre-treatment variables and I do
    not, it is only because Rosenbaum was probably careless.
    (I don't blame him, given the opaque language he has been using.)
    Are you sure Jennifer sides with Rosenbaum here?
    Impossible. The last paper I read of Jennifer was
    sound and open-minded. Please check with her again.

    The differences that you noted, between integrals and
    "differences" are trivial. You can get all the differences
    from Larry's integral.

    The difference between "discrete" and "fully continuous"
    is tangential. No matter how continuous you are, to
    express a qualitative fact like "seat-belt does not
    cause cancer" you become discrete, or discrete in disguise.
    And this is precisely the meaning of the missing arrows in the
    causal graphs: qualitative statement about lack of

    You might be right in speculating that "we have
    different goals". I can tell you my goals:
    I am interested in solving
    causal problems, starting with stat-101, so that
    I can convince my students NOT to adjust for
    bells ringing, seat-belt using, and others.
    If you tell me what your goals are, perhaps we can
    find commonalities. But surely, the coins and bell
    should be there before we go to fancy stuff, like
    hierarchical modeling, missing data, etc.
    You are right about the confusion between "conditioning"
    and "control". The M-bias (and the bell) is just
    a special case of "selection bias".

    Well it has been fun. And, if I did not succeed
    in convincing anyone to convert to the dual-perspective
    camp, I hope I at least managed to convince you that
    causality is about the world – chimes, seat-belts, coins and bells,
    not about the method you use in your analysis and not
    about what this or that guru said or did not say.
    Causality has been mathematized, so there is
    no more room for difference of opinion.

    =======Judea Pearl

  18. Phil says:

    It still looks pretty personal to me.

    As for when I SHOUT and when I _emphasize_…well, first of all I probably overdo both (in blog entries and emails) because it's easier than rewriting to provide the emphasis through sentence structure. In publications, I don't use emphasis more than the next guy, I think. But to get to the point: in my previous blog post I was emphasizing _so much_ that I was afraid it would be _distracting_ and maybe even _confusing_. I thought it was better to _switch_ between _underlining_ and CAPITALIZING. But perhaps that came out even more distracting. I suppose I ought to simply use HTML tags to do it. But I forget how to italicize. Oh, wait, there it is, it's exactly what you'd expect. OK, from now on this is how I'll do it. Happy now?

  19. Larry Wasserman says:

    I want to second what Judea says.
    This is a math question with a math answer.
    It is independent of which formalization
    (graphs or counterfactuals) you use.
    The question is simply how to express the causal
    effect as a functional of the joint distribution.

    And it is a population level question,
    not a data question.

    Best wishes


  20. I always mean to read your blog–glad I tuned in this time.
    * You indicated that "In much of the current discussion of identification strategies, regression discontinuities, differences in differences, and the like, I think there's too much focus on technique and not enough thought put into what the estimates are really telling you." I think there is a bit of a backlash against IVE/LATE for just that reason. Even in lovely applications of RD, you get a weird effect (e.g., Miller and Ludwig's analysis of Head Start).

    Deaton, A. (2009). Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development. NBER Working Paper.

    Heckman, J. J., & Urzua, S. Comparing IV with Structural Models: What Simple IV Can and Cannot Identify.

    * As a recovering economist (with Bayesian sympathies), I think I see the root of your problem.

    The problem with the Rubin approach, I think, is that you don't have an explicit model of treatment assignment. I know that's a strength of the approach, but it's also a weakness. With an explicit model of treatment assignment, it's easy to see the problems with controlling for some covariates. That's why I like DAGs–they better incorporate some of the ideas I learned in econometrics (e.g., analyzing a weird sample harms internal validity). I see them as a bridge from the Rubin causal model to econometrics.

    * Here's an example, and you can see that sorting into treatment runs through the example.

    Suppose we want to know what the "effect" of race on children's achievements is. (Let's not get into whether causal inference is possible with a treatment that can't be manipulated. Let's move forward assuming it can be.)

    Now, in the status attainment literature, one of the first things we would want to include as a covariate would be parent's education. Here's the problem–parent's education is affected by omitted ability bias, and parent's education is also a collider. When we condition on it, we establish a relationship between parent's ability and race **where none existed before**. As a result, we tend to under-estimate the effect of race. Indeed, in some status attainment models, black kids do better "all else equal". One possibility is that this is a wonderful story of resilience among disadvantaged kids. Another is a problem with colliders.

    (Here's the intuition. Suppose we took only families with PhD's. In that stratum, because of the other barriers they face, black parents are more capable in unobserved ways than the white parents, holding education constant. Their kids do better, creating the appearance of resilience. Think Huxtables versus, say, my own children.)
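The collider intuition can be illustrated with a toy calculation. Everything in the sketch below is an assumption made for illustration: ability is binary and independent of group, group 1 faces a barrier that lowers the chance of reaching high education at every ability level, and the child's expected outcome simply equals parental ability, so group has no effect on the child at all. The function name `stratum_means` and the probabilities in `P_EDU` are hypothetical.

```python
from fractions import Fraction as F


def stratum_means(p_edu):
    """Mean child outcome by group, overall and among highly educated parents.

    p_edu[(ability, group)] is P(parent reaches high education | ability, group).
    The child's expected outcome is assumed to equal parental ability.
    """
    overall, educated = {}, {}
    for group in (0, 1):
        mass_all = total_all = F(0)
        mass_edu = total_edu = F(0)
        for ability in (0, 1):
            p = F(1, 2)  # P(ability | group): ability is independent of group
            mass_all += p
            total_all += p * ability
            mass_edu += p * p_edu[(ability, group)]
            total_edu += p * p_edu[(ability, group)] * ability
        overall[group] = total_all / mass_all
        educated[group] = total_edu / mass_edu
    return overall, educated


# Hypothetical numbers: the barrier lowers P(high education) for group 1.
P_EDU = {(0, 0): F(1, 5), (1, 0): F(4, 5), (0, 1): F(1, 10), (1, 1): F(7, 10)}
overall, educated = stratum_means(P_EDU)
print(overall)
print(educated)
```

Overall, both groups have the same mean outcome (1/2). Among highly educated parents, the group facing the barrier shows a higher mean (7/8 versus 4/5), purely because conditioning on education selects higher-ability parents in that group.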

    It's impossible to show this in the basic Rubin causal model. You can write structural equations (as in econometrics) but they're hardly transparent. The DAG is a nice middle ground.

    I've not waded through the posts so forgive me if I've stated what others have written.

    I do think Rubin has gone off the deep end in terms of the strength of his objections to diagrams. They surely help make some points to my developmental psychologist colleagues, especially in the area of analyses of weird, self-selected samples.

  21. Joseph Delaney says:

    RE: Pearl:

    "Has anyone seen a proof that adjusting for Z would
    introduce bias?"

    Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology. 2009;20(4):488-95.

    This seems to address this issue rather directly; I don't know what level of proof one wants but they derive estimates of the bias which seemed convincing to me that this bias exists.

    My issue with the more complex M-bias is that it is hard to know what to do with this knowledge other than to acknowledge that there is "at least one case where, despite careful work, bias could still occur from adjustment". But if it has the shape that Shrier gave it in his Statistics in Medicine letter (Shrier I. Re: The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med. 2008;27(14):2740-1.), it is unclear to me how one could posit its existence without discovering both U's.

    That is not to say that one should not think carefully about the problem. Alan Brookhart and Peter Austin both have papers on variable selection for propensity scores that make it clear that variable selection needs to be done with care (as it can be counter-intuitive).

    So I guess what I am concerned about with the M-bias is that the statement above is too strong (it is quite worthy of researchers' attention). But it's also unclear how often it actually occurs and thus what the impact is on practical problems. The idea that the crude is a better estimate under some conditions is fine but I would prefer to also point out that adjusting for known confounders should typically reduce bias.

  22. Cyrus says:

    An addition to the list of articles that provide a proof of the claim that "one should not control for posttreatment variables":

    Wooldridge, J. (2005), “Violating Ignorability of Treatment by Controlling for Too Many Factors,” Econometric Theory, 21, 1026-1029.

  23. judea pearl says:

    Glad you brought Wooldridge's article to my attention.
    I read it carefully and, as I suspected, the fact that we are
    talking about an "intermediate
    variable", never makes it into the equations. I do not blame
    Wooldridge, it is a hard fact
    to capture in the language of

    Wooldridge's paper proves a sufficient condition for
    violating ignorability.
    The condition is for Yj to be dependent on the controlled covariate X. It remains for us
    to envision what kind of X's would satisfy this counterfactual

    Two questions remain:
    Is it the case that an intermediate variable X
    on the path from treatment W to outcome Y is typically dependent on Yj ?
    Is it possible for a variable X to be an outcome of W
    and still be independent of Yj?

    Well, it turns out that the answer to both questions is in the affirmative,
    (for graphical examples and general condition, see
    http;// }

    Schisterman et al got question 2 correctly (using graphs of course) and question-1 almost

    Conclusion: The century-old advice to refrain from stratifying on covariates affected by the treatment is
    finally getting some mathematical
    treatment. Not bad for science.