More on Pearl/Rubin, this time focusing on a couple of points

To continue with our discussion (earlier entries 1, 2, and 3):

1. Pearl has mathematically proved the equivalence of Pearl’s and Rubin’s frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are much different. Accepting Pearl’s mathematics (which I have no reason to doubt), this implies to me that Pearl’s axioms do not quite apply to many of the settings that I’m interested in.

I think we’ve reached a stable point in this part of the discussion: we can all agree that Pearl’s theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I’d claim some authority on this latter point, given my extensive experience in this area–and of course, Rubin, Rosenbaum, etc., have further experience–but of course I have no problem with Pearl’s methods being used on political science problems, and we can evaluate such applications one at a time.

2. Pearl and I have many interests in common, and we’ve each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl’s books in detail and I doubt he’s had the time to read my books in detail also. It takes a lot of work to understand someone else’s framework, work that we don’t necessarily want to do if we’re already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he’s proved this to be the case, but Pearl’s synthesis doesn’t yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl’s framework someday.

That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl’s mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl’s assumptions are appropriate for the problem at hand.

3. I’ve discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).

This has nothing at all to do with causal inference: it’s straight Bayes.
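
To make the arithmetic concrete, here is a minimal Python sketch of the two-cell formula above. The function name `poststratify` and all the numbers are made up for illustration: the cell estimates stand in for summaries of p(theta|x), and the counts stand in for the population totals N_1 and N_2.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of per-cell estimates.

    cell_estimates: one estimate of theta per cell (hypothetical inputs).
    cell_counts:    population size N_x of each cell, e.g. from a census.
    """
    if len(cell_estimates) != len(cell_counts):
        raise ValueError("need exactly one population count per cell")
    total = sum(cell_counts)
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    # (N_1 * theta_1 + N_2 * theta_2 + ...) / (N_1 + N_2 + ...)
    return sum(n * est for est, n in zip(cell_estimates, cell_counts)) / total


# Hypothetical example: two cells with N_1 = 3000 and N_2 = 1000.
estimate = poststratify([0.40, 0.70], [3000, 1000])
print(estimate)
```

The weights come from the population rather than from the sample, so a cell that is underrepresented in the data still counts in proportion to its true size.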

Pearl thinks that if the separate components p(theta|x) are nonidentifiable, you can’t do this and should not include x in the analysis. He writes:

I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: “the uncertainties in the two separate groups should cancel out when they’re being combined to get the average treatment effect.” If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.

I’m glad that Pearl is willing to learn new methods–so am I–but, no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No “fine-tuned tweaking” required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.
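
A toy calculation may help show why no tweaking is needed. The sketch below is only an illustration under assumptions that are not in the discussion above: two subgroup effects theta_1 and theta_2 with independent Normal(0, tau^2) priors, and a single data summary y_bar that is informative only about their weighted average. The function name `posterior_for_groups` and all numbers are hypothetical.

```python
import numpy as np


def posterior_for_groups(y_bar, s, w, tau):
    """Posterior mean and covariance of (theta_1, theta_2).

    Assumed toy model (not from the post):
      prior:      theta_j ~ Normal(0, tau^2), independent
      likelihood: y_bar ~ Normal(w_1*theta_1 + w_2*theta_2, s^2)
    """
    w = np.asarray(w, dtype=float)
    prior_precision = np.eye(len(w)) / tau**2
    data_precision = np.outer(w, w) / s**2  # rank one: only w.theta is informed
    cov = np.linalg.inv(prior_precision + data_precision)
    mean = cov @ (w * y_bar / s**2)
    return mean, cov


w = np.array([0.5, 0.5])  # relative sizes of the two subgroups
mean, cov = posterior_for_groups(y_bar=1.0, s=0.1, w=w, tau=10.0)

sd_groups = np.sqrt(np.diag(cov))        # uncertainty about each subgroup
sd_average = float(np.sqrt(w @ cov @ w))  # uncertainty about the average
corr = float(cov[0, 1] / (sd_groups[0] * sd_groups[1]))

print("posterior sd of each subgroup effect:", sd_groups)
print("posterior sd of the weighted average:", sd_average)
print("posterior correlation between subgroups:", corr)
```

In this setup each theta_j stays very uncertain, while the weighted average has a posterior standard deviation of roughly s. The strong negative posterior correlation between the two subgroups is what makes the uncertainties cancel in the linear combination.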

In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.

Please, please don’t say “the correct thing to do is to ignore the subgroup identity.” If you want to ignore some information, that’s fine–in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we’re doing is not incorrect at all. It’s Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that’s fine. If you want to criticize the entire Bayesian edifice, then you’ll have to go up against mountains of applied successes.

As I wrote earlier, you don’t have to be a Bayesian (or, I could say, you don’t have to be a Bayesian)–I have great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework–but I think you’re on thin ice if you want to try to claim that Bayesian analysis is “incorrect.”

4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables–it’s just not a good idea to include them as regression predictors. If the only thing you’re allowed to do is regression (as in chapter 9 of ARM), then I think it’s a good idea to exclude post-treatment predictors. If you’re allowed more general models, then one can and should include them. I’m happy to have been corrected by both Pearl and Rubin on this one.

5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.

Summary . . . so far

I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:

– I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin’s. This, along with Pearl’s other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl’s axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don’t have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn’t have the extensive experience that Rosenbaum and Rubin have. So I think it’s very reasonable to suppose that his axioms aren’t quite appropriate here.

– Poststratification works just fine. It’s straightforward Bayesian inference, nothing to do with causality at all.

– I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.

– Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I’m still struggling with various problems in descriptive/predictive modeling, and causation is even harder!

– Along with this, taste–that is, working with methods we’re familiar with–matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we’re more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we’ve been working under.)

13 Comments

  1. David Afshartous says:

    RE inclusion of post-treatment variables, Stephen Senn and Steven Julious have a nice example on this in the context of measurement in the current early online issue of Statistics in Medicine (Measurement in clinical trials: A neglected issue for statisticians?).

  2. Keith O'Rourke says:

    A speculative imputation of Rubin’s concern about PEARLing and Pearl’s concern about RUBINing

    Rubin on PEARLing – “OK, it’s fine to help scientists make their muddled and confused sense of underlying causality in a given application explicitly clear and even mathematically precise using graphs – BUT then not only will they likely take that overly precise (and surely wrong) picture much too seriously, they likely will also be taken advantage of by analysts recklessly suggesting they make the validity of their subsequent analyses highly or even completely dependent on that picture being true (and by the way, as that picture is a model we always know it’s always wrong).”

    Pearl on RUBINing – no need for imputation; I can quote page 352
    http://bayes.cs.ucla.edu/BOOK-09/ch11-3-5-final.p
    “It is not that Rosenbaum and Rubin were careless in stating the conditions for success.
    Formally, they were very clear in warning practitioners that propensity scores work
    only under “strong ignorability” conditions. However, what they failed to realize is that it is not enough to warn people against dangers they cannot recognize; to protect them from perilous adventures, we must also give them eyeglasses to spot the threats, and a meaningful language to reason about them. By failing to equip readers with tools (e.g. graphs) for recognizing how “strong ignorability” can be violated or achieved, they have encouraged a generation of researchers (including federal agencies) to assume that ignorability either holds in most cases, or can be made to hold by clever designs.”

    Now what I am hoping for is both groups will strive to lessen the amount of work [needed] to understand their frameworks – given that “taste” does matter and mathematical equivalence does not mean the same in practice.

    Keith

  3. judea pearl says:

    Andrew,
    Brief comments today

    1. You reach over-sweeping conclusions about
    non-existing differences between Rubin's and Pearl's theories.
    It is misleading to say that "In practice, the two
    approaches are much different".
    "Rubin conditions on all information whereas
    Pearl does not do so".
    While Rubin made this careless statement in a certain
    context, there is nothing in his theory of potential-outcome
    that forces one to "condition on all information".
    His practice stands contrary to his own theory,
    and provenly so.
    Indiscriminate conditioning is a culturally-induced ritual
    that has survived, like the monarchy, only because it was
    erroneously supposed to do no harm.
    (Paraphrasing Russell), now the supposition
    begins to give way to principles.

    Please glance at the epidemiology literature of
    the past 10 years. Researchers there are using
    potential-outcome theory routinely but none,
    (I repeat: none) would dare make such a statement today.
    Why? Because they use graphs to guide their thinking,
    and graphs protect one from making wrong statements.
    (Not all wrong statements, but the important ones.)

    Thus, in as much as I admire the works of Rosenbaum and Rubin,
    I would not conclude from a couple of unfortunate statements that Rubin's theory is different than Pearl's,
    and, therefore, Pearl's theorem of equivalence
    must be inapplicable to the kind of problems Gelman is interested in.

    Mathematical theorems do not bend to problems, they solve them.

    2.
    Forgetting the gurus for a second, I honestly don't see where you and I differ.
    There is one area where I am waiting to learn
    from you. This is the example where Bayesian
    analysis can recover an identifiable average
    quantity from two subpopulations in which
    it is not identifiable.
    Our discussion has rekindled my hope in Bayesianism,
    and I am waiting to see it demonstrated in
    the bell-coins example (please let's not go
    to political science where we would waste a
    week just communicating the assumptions.)
    It should not take more than half a page
    to deal with THREE binary variables: coin-1, coin-2 and one innocent bell.
    How about it?

    3.
    In all other areas we seem to agree — please
    correct me if I am wrong.
    3.1
    We agree that if you include an intermediate
    variable Z as a predictor in your propensity score matching routine
    and if your task is to estimate the average causal effect over
    the population, you will get a biased estimate
    (asymptotically).

    3.2
    We agree that, in certain cases, if you include a pre-treatment
    variable Z as a predictor in your propensity score routine
    and if your task is to estimate the average causal effect over
    the population, you will get a biased estimate
    (asymptotically) when no bias existed.
    This is the M-bias case. I would not blame you if
    you say you cannot envision such cases, fine.
    But you agree that you have no theory that precludes such a case, aside from the respect you have for Rubin and Rosenbaum (who also do
    not have such a theory).

    3.3
    Jeff Wooldridge just sent me a (2006) article of his, where
    he proves (mathematically) that if you add a
    pre-treatment instrumental variable to your propensity score then bias always increases.
    I again assume you do not have a theory that refutes Wooldridge's result (aside from remarks by Rubin and Rosenbaum).
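
A small simulation can illustrate the kind of result described in 3.3. This is only a sketch under assumptions that are not in the comment: a linear model, ordinary regression adjustment standing in for a propensity score routine, and made-up coefficients. Z is an instrument (it affects the treatment X but not the outcome Y directly), U is an unmeasured confounder, and the true effect of X on Y is 1. The helper `coef_on_first_regressor` is a hypothetical name.

```python
import numpy as np


def coef_on_first_regressor(y, *regressors):
    """OLS with an intercept; returns the coefficient on the first regressor."""
    design = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]


rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0

z = rng.normal(size=n)                    # instrument: affects X only
u = rng.normal(size=n)                    # unmeasured confounder
x = z + u + rng.normal(size=n)            # treatment
y = true_effect * x + u + rng.normal(size=n)

bias_without_z = coef_on_first_regressor(y, x) - true_effect
bias_with_z = coef_on_first_regressor(y, x, z) - true_effect

print("bias, not adjusting for Z:", bias_without_z)
print("bias, adjusting for Z:    ", bias_with_z)
```

Adjusting for Z removes the variation in X that was free of confounding, so the remaining variation is more heavily contaminated by U. In this particular setup the bias is larger after adjustment. It says nothing about how often such a variable would be chosen as a covariate in applied work.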

    Conclusion.
    We agree on everything, with the exception
    of a couple of careless remarks made by Rubin and Rosenbaum
    about indiscriminate stratification
    (which I assume they have retracted by now).
    And I am waiting to learn from you about bells and coins.
    This is progress.

  4. Andrew Gelman says:

    Keith: That sounds about right. Pearl and Rubin have both worked very hard on their methods and so they get frustrated when others don't understand them. On the other hand, I don't think either has read the other's books carefully, and so this limits the possibilities for discussion.

    Judea: Thanks for the comments. I agree that we're making progress but I disagree with several of your claims. The short answer, I think, is that I don't really understand the Pearl approach and I don't think you fully understand the Rubin approach. Each of us is wrestling with a shadow. In my next post, I'll focus more on the Rubin approach, as I think I can contribute more to the discussion by describing something that I do understand rather than guessing about something I don't really know anything about.

  5. Larry Wasserman says:

    Andrew:

    With all due respect,
    I think you are wrong that Judea
    does not understand the Rubin approach.
    I think he has studied it and understood it very
    deeply.

    It is my impression that the "graph people"
    have studied the Rubin approach carefully
    while the reverse is not true.

    I have always been surprised by the lack
    of willingness of the "Rubin group" to
    study the work of Pearl, Spirtes,
    etc.

    I think Judea has tried very hard to reach out
    to the other group but has only met skepticism
    and resistance.

    Best wishes
    Larry

  6. Andrew Gelman says:

    Larry:

    Based on what Pearl has written so far, it's clear to me that there are some aspects of Rubin's statistics that he does not understand.

    As I wrote above, "Pearl and I have many interests in common, and we've each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl's books in detail and I doubt he's had the time to read my books in detail also."

    That's ok, that's why we're having this discussion. I expect that Pearl understands Rubin's framework "deeply" from your perspective but not from mine.

  7. Steve Morgan says:

    I too received the invitation to debate from Pearl, following up on the latest round of the published controversy. Fortunately, I was on vacation and am only now reading everything!

    As you know, but I suspect most statisticians who read your blog do not know, I wrote a book with Chris Winship, Counterfactuals and Causal Inference (Cambridge UP 2007). It extols the virtues of adopting what you argued earlier is both "minimal Pearl" and "minimal Rubin" (though I think "minimal Heckman" has his place too).

    People who debate the deeper issues, as you are in this thread, ought to remember that moving rank-and-file researchers into the world of causal graphs and potential outcomes is a much more important and immediate goal for science. I have never made up my mind on whether the emotional nature of these debates helps or hinders pursuit of this goal. On the one hand, the intensity of it all is entertaining. On the other hand, the pressure to pick sides for those who are making the transition is less helpful (and can lead to "true believers" with no perspective at all).

    That being said, I think Pearl and Rubin are both correct but are arguing different points (though I confess that I have not had time to read everything they have written on this controversy):

    Pearl has shown that conditioning on a collider can open an already blocked back-door path, thereby generating bias that was not already present. Rubin is correct that such situations are probably quite rare, for those of us working in the weak-theory world of much social science research, since it is rare to possess the knowledge that such colliders are independent of all of the other stuff everyone agrees should be adjusted for. In our book, we always had to invent situations, such as in our Figure 6.2(b).

    So, in my view, these situations are important to understand, but rarely will they strongly guide a model specification decision. Most reasonable people ought to recognize that colliders do exist in theory and that they should be on the lookout for them when thinking deeply about model specification. But, since asserting that a dangerous collider is present is a matter of theory (often with a weak justification), researchers ought to estimate models with slightly different conditioning sets and then give interpretations that can be reconciled with each other. Since the possible bias generated by mistakenly adjusting for a collider is probably quite modest in almost all cases, such reconciliation probably requires very little ink.
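
The collider situation discussed above can be reproduced in a small simulation. This is a sketch with invented numbers, not an example from the book or from Pearl: U1 and U2 are unobserved, Z is a pre-treatment collider (U1 -> Z <- U2), U1 also drives the treatment X, U2 also drives the outcome Y, and X has no effect on Y at all. The path strengths are deliberately strong so the effect is easy to see, and the helper `coef_on_first_regressor` is a hypothetical name.

```python
import numpy as np


def coef_on_first_regressor(y, *regressors):
    """OLS with an intercept; returns the coefficient on the first regressor."""
    design = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]


rng = np.random.default_rng(1)
n = 200_000

u1 = rng.normal(size=n)             # unobserved cause of X and Z
u2 = rng.normal(size=n)             # unobserved cause of Z and Y
x = u1 + rng.normal(size=n)         # treatment, no effect on Y
z = u1 + u2 + rng.normal(size=n)    # pre-treatment collider
y = u2 + rng.normal(size=n)         # outcome

unadjusted = coef_on_first_regressor(y, x)
adjusted_for_z = coef_on_first_regressor(y, x, z)

print("coefficient on X, not adjusting for Z:", unadjusted)
print("coefficient on X, adjusting for Z:    ", adjusted_for_z)
```

Without Z the estimate sits near the true value of zero; conditioning on Z opens the path X <- U1 -> Z <- U2 -> Y and produces a spurious negative coefficient. With weaker paths into Z the distortion shrinks, which fits the suggestion to compare models with slightly different conditioning sets.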

    That's my two cents. (And, thanks for providing a forum for this discussion.)

  8. judea pearl says:

    From: "Judea Pearl"
    To:
    Subject: c5-gelman
    Date: Monday, July 13, 2009 2:37 AM

    Theories vs. approaches.

    Dear Andrew,
    I think our discussion could benefit
    from the distinction between "theories" and
    "approaches." A theory T is a set of mathematical
    constraints on what can and cannot be deduced
    from a set of premises.
    An approach is what you do with those constraints,
    how you apply them, at what sequence, and in what language.

    In the context of this distinction I say that
    Rubin's theory T is equivalent to Pearl's,
    while the approach is different.
    Equivalence of theories means that there cannot be a
    clash of claims, and this is a proven fact.
    In other words if there is ever a clash about a given problem,
    it means one of
    two things, either the theory was not applied properly
    or additional information about the problem was assumed
    by one investigator that was not assumed by the other.

    Now to the "approach".
    Below is my analysis of the two approaches,
    please check if it coincides with your understanding of
    Rubin's approach.

    Pearl says, let us start with the science behind
    each problem, e.g., coins, bells, seat-belts,
    smoking, etc. Our theory tells us that
    no causal claim can ever be issued
    if we know nothing about the science,
    even if we take infinite samples.
    Therefore, let us articulate what we do know about
    the science, however meager, and see what we can get
    out of the theory. This calls for encoding the relationships
    among the relevant entities, coins, bells and seat-belts,
    in some language, call it L, thus creating
    a "problem description" L(P). L(P) contains variables,
    observed and unobserved factors, equations,
    graphs, physical constraints, processes, influences,
    lack of influences, dependencies, etc, whatever is needed to
    encode our understanding of the science behind the problem P.

    Now we are ready to solve the problem.
    We take L(P) and appeal to our theory T:
    Theory, theory on the wall,
    how should we solve L(P)? The theory says:
    Sorry, I don't speak L, I speak T.

    What do we do? Pearl's approach says: take the constraints from
    T, and translate them into new constraints, formulated
    in language L, thus creating a set of constraints L(T)
    that echo T and tell us what can and what cannot be deduced
    from certain premises encoded in L(P).
    Next, we deduce a claim C in L(P) (if possible)
    or we proclaim C to be "non-deducible". Done.

    Rubin's approach is a bit different.
    We again look at a problem P but, instead of encoding it
    in L, we skip that part and translate P directly
    into a language that the theory
    can recognize; call it T(P).
    (It looks like P(W|X, Y_1, Y_2) according to Rubin's
    SIM article (2007))
    Now we ask: Theory, theory on the wall,
    how should we solve T(P)? The theory answers:
    Easy, man! I speak T. So, the theory produces
    a claim C in T, and everyone is happy.

    To summarize, Pearl brings the theory to the
    problem, Rubin takes the problem to the theory.

    To an observer from the outside the two
    approaches would look identical, because
    the claims produced are identical and the
    estimation procedures they dictate are identical.
    So, one should naturally ask,
    how can there ever be a clash in claims
    like the one concerning covariate selection?

    Differences will show up when
    researchers begin to deviate from the philosophies
    that govern either one of the two approaches.
    For example, researchers might find it too hard to go
    from P to T(P). So hard in fact that
    they give up on thinking about P,
    and appeal directly to the theory:
    Theory, theory on the wall,
    we don't know anything about the problem,
    actually, we do know, but we don't feel like thinking
    about it. Can you deduce claim C for us?

    If asked, the theory would answer: "No, sorry,
    nothing can be deduced
    without some problem description."
    But some researchers may not wish to talk directly to
    the theory; it is too taxing
    to write a story about coins and bells in the language
    of P(W|X, Y_1, Y_2).
    So what do they do? They fall into a lazy mode,
    like: "Use whatever routines worked for
    you in the past. If propensity scores worked for you,
    use them; take all available measurements as predictors,
    the more the better."
    Lazy thinking forms subcultures, and subcultures
    tend to isolate themselves from the rest of
    the scientific community because nothing could be
    more enticing than methods and habits, especially
    when they are reinforced by respected leaders,
    and especially when habits are supported by
    convincing metaphors. For example, how can you go
    wrong by "balancing" treated and untreated units
    on more and more covariates. Balancing, we all know,
    is a good thing to have; it is even present in
    randomized trials. So, how can we go wrong?

    An open-minded student of such a subculture should ask:
    "The more the better? Really? How come? Pearl says some
    covariates might increase bias? And there should be no clash
    in claims between the two approaches."
    An open-minded student would also be so bold
    as to take a pencil and paper and
    consult the theory T directly, asking:
    "Do I have to worry about increased bias in my specific
    problem?" And the theory would answer:
    You might have to worry, yes, but I can only tell you where
    the threats are if you tell me something about the problem,
    which you refuse to do.

    Or the theory might answer:
    If you feel so shy about describing your problem,
    why don't you use the Bayesian method; this way, even if you
    end up with an unidentified situation, the method
    would not punish you for not thinking about the problem,
    it would just produce a very wide posterior.
    The more you think, the narrower the posterior.
    Isn't this a fair play?

    To summarize:

    One theory has spawned two approaches.
    The two approaches have spawned two subcultures.
    Culture-1 solves problems in L(P) by the theoretical
    rules of L(T) that were translated from T into L.
    Culture-2 avoids describing P, or thinking about P,
    and relies primarily on metaphors, convenience of
    methods and gurus' advice.

    Once in a while, when problems are simple enough,
    (like the binary Instrumental Variable problem),
    someone from culture 2 would formulate a problem
    in T and derive useful results. But, normally,
    problem-description avoidance is the rule of the day.
    So much so, that even 2-coins-one-bell problems
    are not analyzed mathematically by rank and file
    researchers; they are sent to the gurus for opinion.

    I admit that I was not aware of the capability of
    Bayesian methods to combine two subpopulations in which
    a quantity is unidentified and extract a point
    estimate of the average, when such average is identified.
    I am still waiting for the bell-coins example worked
    out by this method; it would enrich my arsenal of
    techniques.
    But this would still not alter my approach, namely,
    to formulate problems in a language close to
    their source: human experience.

    In other words, even if the Bayesian method will be shown capable
    of untangling the two subpopulations, thus giving
    researchers the assurance that they
    have not ignored any data, I would still prefer
    to encode a problem in L(P), then ask L(T):
    Theory, theory on the wall, look at my problem
    and tell me if perhaps there are measurements that
    are redundant.

    If the answer is Yes, I would save the effort
    of measuring them, and the increased dimensionality
    of regressing on them, and just get the answer
    that I need from the essential measurements.
    Recall that, even if one insists on going the Bayesian route,
    the task of translating a problem into T remains
    the same. All we gain is the luxury of not
    thinking in advance about which measurements can
    be avoided, we let the theory do the filtering
    automatically.

    I am now eager to see how this is done;
    two coins and one bell. Everyone knows the answer:
    coin-1 has no causal effect on coin-2 no matter
    if we listen to the bell or not.
    Let's see Rev. Bayes advise us correctly: ignore the bell.

  9. Andrew Gelman says:

    Judea: Thanks for sharing your thoughts. I'll think about whether there's enough here for another blog entry, but briefly:

    1. I agree with your distinction between frameworks and theories. A theory is only as good as its application.

    2. As I wrote a few days ago, I think different theories, and frameworks, can be better suited to different problems. In particular, I'm a fan of both Minimal Pearl and Minimal Rubin.
    I recommend Morgan and Winship's book for those who would like to get a sense of both models.

    3. You seem to be conflating the Neyman/Rubin causal model with Bayesian inference. These are logically distinct concepts. I understand that you feel that the Neyman/Rubin formulation encourages sloppy thinking about causality–and you might very well be right, as I know that I get confused when trying to think about structural equation models. But I don't think you make your case stronger by taking swipes at Bayes.

    4. I agree with you that the data alone, in the absence of substantive knowledge, will never be enough to answer causal questions. More generally, a sample will never tell us much about the population (unless it is, say, an 80% sample) unless we rely on a model for the sampling. I also agree with you that Rubin's and Pearl's frameworks are two different ways of allowing a user to encode such information. Ultimately it comes down to what approach, or mixture of approaches, is most effective in a particular class of applications.

    5. I think it's just silly for you to say that the Rubin approach "relies primarily on metaphors, convenience of methods and gurus' advice." It's just a different approach from what you do. It's a common belief, I think, that we are principled while others are ad-hoc. Rubin used to refer to bootstrapping as lacking in principles because it was never clear where the estimator came from. Many bootstrappers consider Bayes to be unfounded because it is never clear where the prior comes from. Some diehard nonparametricians consider probability modeling to be sloppy because it's never clear where the likelihood comes from. And so on.

    We all use assumptions, and the methods we favor tend to seem more principled to us.

    6. You write, "I admit that I was not aware of the capability of Bayesian methods to combine two subpopulations in which a quantity is unidentified and extract a point estimate of the average, when such average is identified."

    As I noted earlier, this is just straight Bayes, nothing to do with causal inference. In the simplest setting, you have c=a+b, and a,b can have a highly correlated joint distribution. This is not something that has to be set up artificially: if the data provide good information about c but not about a or b individually, this will automatically induce the appropriate correlation in the likelihood.

    7. I remarked on your coins and bell example in my reply here to David's comment. I think I need more information to understand this example.

  10. Corey says:

    "But I don't think you make your case stronger by taking swipes at Bayes."

    I agree strongly with this. It seems obvious that if you're using Bayes to infer the (parameters of the) joint probability distribution over all nodes in the graph, then you need to condition on all data available to you. That doesn't imply that your causal estimate is the conditional distribution of one node given all other nodes. The causal graph theorems help you understand which conditional distributions of that joint probability distribution are equal to the causal estimand (and the causal assumptions that imply that equality).

  11. Rod Little says:

    The question of conditioning in causal inference reminds me of a disagreement between some economists and statisticians on whether to condition on "endogenous" variables in imputation for missing data. The econometric view (as I understand it) has resisted conditioning on such variables, because of causal arguments. Rubin's multiple imputation perspective says imputations should condition on "everything" including endogenous variables (though in practice there may be good reasons to limit this to the main variables that are predictive of the missing values, at least if samples are small). For predicting missing values, all useful variables should be conditioned on, since this is not a causal problem. Once missing data have been filled in (multiply of course), the causal estimand should definitely not condition on endogenous and post-treatment variables. A key conceptual feature of multiple imputation is that the analysis model can differ from the imputation model.

    As Andy says, Bayes does this right. Whether to condition on variables in the imputation step that are not conditioned in the appropriate causal estimand depends on how much information they convey. If I have a complete sample on X, Y and Z and am interested in the correlation between X and Y, conditioning on Z is a waste of effort since it carries no information. On the other hand if Z is completely observed and X and Y have missing values, conditioning on Z in the imputation model might be useful (e.g. poststratification). Rod Little

  12. Keith O'Rourke says:

    2006 ?? Jeff Wooldridge article mentioned by Pearl

    Nothing listed in 2006 in Wooldridge's online CV

    Was the year mis-quoted??

    Keith

  13. judea pearl says:

    Reply to Keith,
    The paper I received from S. Wooldridge was dated
    2006, but it was not published. I suggest you
    write to him directly for a copy. =Judea