To follow up on yesterday’s discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.
– A practical issue: poststratification
– 3 kinds of graphs
– Minimal Pearl and Minimal Rubin
– Getting the most out of Minimal Pearl and Minimal Rubin
– Conceptual differences between Pearl’s and Rubin’s models
– Controlling for intermediate outcomes
– Statistical models are based on assumptions
– In defense of taste
– Argument from authority?
– How could these issues be resolved?
– Holes everywhere
– What I can contribute
A practical issue: poststratification
I’ll start with an issue where Pearl disagrees with Rubin (and also with me, and I expect with Paul Rosenbaum, Rod Little, and many others). I’ll repeat a bit from my earlier entry. Pearl writes:
For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a “seat-belt user” or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.
Pearl seems to take this as an example of where Bayesian inference–the rule to condition on all observed data–gives the wrong answer. But I think he’s missing the point. At the technical level, yes, you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable–in Bayesian terms, that they will individually have large posterior uncertainties. But if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel when they are combined to get the average treatment effect. If the uncertainties don’t cancel, it sounds to me like there must be some additional (“prior”) information that you need to add.
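In symbols, the claim is just the standard poststratification identity (a sketch; the stratum notation is mine, not Pearl’s):

```latex
% Poststratified average treatment effect over strata j = 1, ..., J
% (e.g., seat-belt users and non-users), with stratum sizes N_j:
\[
\widehat{\mathrm{ATE}} \;=\; \sum_{j=1}^{J} \frac{N_j}{N}\,\widehat{\mathrm{ATE}}_j,
\qquad N = \sum_{j=1}^{J} N_j.
\]
```

In a Bayesian analysis, the posterior uncertainty on the left side comes from the joint posterior of the stratum-level effects, which is how large but negatively correlated stratum-level uncertainties can cancel in the average.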
I’m pretty sure about this. Of all the stuff I’m talking about in this blog, Bayesian regression and poststratification is the area in which I’m most truly an expert. To get a sense of some of the gains from this approach, check out some of the recent work by Jeff Lax and Justin Phillips (here and here).
3 kinds of graphs
I can think of three different ways that directed graphs have been applied to statistical modeling.
1. Graphing the structure of a probability model. For example, consider a simple hierarchical model (the 8-schools example in chapter 5 of Bayesian Data Analysis), with a “likelihood” of y ~ N(theta, sigma^2), a “prior” of theta ~ N(mu, tau^2), and a “hyperprior” on (mu, tau). (For simplicity I’m assuming sigma is known and unmodeled; we can discuss this point later, if you’d like, but for now I’m trying to keep things clean so as to be able to use ASCII graphics.) The graph for this model is
(mu, tau) --> theta --> y
The arrows don’t represent causation or anything like that–it doesn’t make sense to me to talk about (mu, tau) “causing” theta, or theta “causing” y. The parameter mu, for example, simply represents the mean of the population of theta values; it has no meaning as a causal factor. (I’ll give a quick simulation sketch of this graph below, after the three examples.)
2. Graphing a hypothesized causal pattern. Cyrus gave an example in his blog comment yesterday:
(X, Y) --> L --> C
These are causal relations, as Cyrus has defined them: X and Y cause L, and so forth.
3. Graphing relations between real-world variables. Here I’m thinking of models with variables such as inflation, unemployment, and interest rates; or schooling, socioeconomic status, test scores, and delinquency.
These three sorts of graphs can look similar, but they have different interpretations for causality. In particular, graphs of type 1 can be helpful for complex hierarchical models even if they are purely descriptive. For example, I recently estimated public support for school vouchers among voters, characterized by religion, ethnicity, income, and state of residence. I’m not trying to understand whether being a rich person in Texas causes you to have a certain opinion–it’s an interesting question, but I’m answering some more basic descriptive questions. Nonetheless, my model has a graphical structure.
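To make the first, purely descriptive use concrete, here is a minimal generative sketch of the (mu, tau) --> theta --> y graph in Python. The hyperprior and the standard errors are illustrative choices in the spirit of the 8-schools example, not the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# The graph (mu, tau) --> theta --> y, read as a generative story:
J = 8
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # known s.e.'s

mu = rng.normal(0, 5)         # illustrative hyperprior draw for mu
tau = abs(rng.normal(0, 5))   # illustrative hyperprior draw for tau

theta = rng.normal(mu, tau, size=J)   # "prior":      theta_j ~ N(mu, tau^2)
y = rng.normal(theta, sigma)          # "likelihood": y_j ~ N(theta_j, sigma_j^2)
```

Nothing here is causal: the arrows just record which distributions are specified conditionally on what.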
“Minimal Pearl” and “Minimal Rubin”
I’d like to separate each of Rubin’s and Pearl’s theories into a key conceptual part and a more elaborate analytical part. I’ll argue that, whatever you think of the analytical parts, the conceptual core of each theory represents a useful insight.
Minimal Rubin: Defining causal inference as potential outcomes (not necessarily “counterfactuals” because the notation can be used before an experiment is actually done, in which case any of the outcomes might be possible). I have found this to be an extremely useful way of focusing my understanding. To take just one example (which I mentioned in my earlier blog entry), when Gary and I started working on incumbency advantage twenty years ago, there was already a bit of literature on the topic: different articles with different definitions of incumbency advantage, and a near-complete confusion between estimands and estimates–that is, between the formulas used to compute “incumbency advantage” numbers from data, and the underlying quantities being estimated. The potential-outcome framework allowed us to formulate the estimand–our definition of incumbency advantage–clearly, and then we were able to move to the estimation phase.
Full Rubin: The research programme under which all causal inference problems can be framed in terms of potential outcomes. The Full Rubin has had some successes (for example, the paper with Angrist and Imbens on instrumental variables) but it also creates some new difficulties, notably when dealing with intermediate outcomes.
Minimal Pearl: Displaying causal relations as a directed graph, and using graph-theoretical ideas to understand patterns such as back-door paths and colliders. I have certainly found it useful to use graphs to explore causality, and lots and lots of people have found Pearl’s ideas helpful in understanding the roles of different variables in a graph (see, for example, Cyrus’s comments in the earlier blog entry). As with Minimal Rubin (but in a different way), a key contribution of Minimal Pearl is to separate the causal structure from the specifics of a model. There had been lots of literature on graphs for path analysis, structural equation models, and so forth–but Pearl detached the graphical ideas from the specific correlation-based models that were out there.
Full Pearl: The research programme under which all causal inference problems can be framed in terms of graphs, colliders, the do operator, and the like. It doesn’t quite work for me, but many people feel the Full Pearl is the way to go. A good argument in favor of Full Pearl is that, by handling dependence structures in a compact way, this framework frees up the researcher to think about more complicated structures of variables, rather than being limited to the very simple structures that we can hold in our heads. One reason that I’m sympathetic to Full Pearl–and, at the very least, why I’d like to better understand Minimal Pearl–is to see if I can improve my own modeling in this way.
Getting the most out of Minimal Pearl and Minimal Rubin
I’d argue that all of us–Pearl, myself, and Rubin included–would benefit from consistently using the insights of Minimal Pearl and Minimal Rubin, in particular,
from Pearl: Write models as directed graphs and, where necessary, explain exactly what the links mean and how their strength can be measured.
from Rubin: Be explicit about data collection. For example, if you’re interested in the effect of inflation on unemployment, don’t just talk about using inflation as a treatment; instead, specify specific treatments you might consider (adding these to the graphs, in keeping with Pearl’s principles). This also goes for missing data. For example, Cyrus presented an example in which the variable Y is missing when a different variable, C, is observed. I recommend adding a new variable, I_Y, an indicator for whether Y is observed. The graphical model can then show that I_Y depends only on C.
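Here is a toy version of that bookkeeping in Python. The functional forms and the cutoff are made up; the only point is the structure, with I_Y depending on C alone:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# A made-up generative version of Cyrus's graph (X, Y) --> L --> C:
X = rng.normal(size=n)
Y = rng.normal(size=n)
L = X + Y + rng.normal(size=n)
C = L + rng.normal(size=n) > 1.0      # True when C is "observed"

# Rubin-style explicit missingness: I_Y is True exactly when Y is
# observed, and by construction it depends only on C.
I_Y = ~C
Y_obs = np.where(I_Y, Y, np.nan)
```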
Conceptual differences between Pearl’s and Rubin’s models
I’ll just list a few differences that I’ve seen in this discussion:
- Following Rubin’s perspective, I define the causal effect of a treatment at a unit level. For simplicity I’ll stick with two levels of the treatment, 0 and 1 (for example, incumbency or an open-seat election). I define the treatment effect for a single unit as y^1 - y^0, or, if you prefer subscripts, y^1_i - y^0_i. In contrast, Pearl (and Wasserman, in his comment) define treatment effects as expectations: E(y|x) or, written out more fully, E(y|x=1) - E(y|x=0). Pearl et al. can feel free to use this definition, but it’s different from mine.
- Here’s another example. Pearl describes the following problem:
Let X and Y be the outcomes of two fair coins, and let Z be a bell that rings if at least one of X and Y comes up head. We wish to estimate the causal effect of X on Y after collecting a huge number of samples, each in the form of a triplet (X, Y, Z). Should we include Z in the analysis? If so how? Would our favorite estimate of E(Y_x) be biased? Will it give us what we expect, namely, that X has no causal effect on Y, i.e., E(Y_x) = E(Y).
I think I may be missing something in this example, but if I understand it correctly, it doesn’t fit into the Rubin framework at all: in the Rubin framework, decisions, not outcomes, have causal effects. I’m not saying that Pearl shouldn’t be working on this problem–there’s clearly a lot of interest out there in methods for estimating causal effects of things that are not decisions/interventions/treatments. What I am doing is illustrating that Pearl’s and Rubin’s methods are different.
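That said, the graph-side content of Pearl’s example is easy to check by simulation. Here is a quick sketch in Python showing that the bell Z is a collider: X and Y are marginally independent, but conditioning on Z makes them dependent:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

X = rng.integers(0, 2, n)      # fair coin
Y = rng.integers(0, 2, n)      # fair coin, independent of X
Z = (X == 1) | (Y == 1)        # bell rings if at least one comes up heads

# Marginally, X carries no information about Y:
print(Y[X == 1].mean(), Y[X == 0].mean())   # both ~0.5

# Conditional on hearing the bell, it does:
print(Y[(X == 1) & Z].mean())  # ~0.5
print(Y[(X == 0) & Z].mean())  # exactly 1.0: if the bell rang and X was
                               # tails, Y must have been heads
```

So any procedure that “adjusts for” Z will manufacture a spurious association between the two coins.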
Controlling for intermediate outcomes
Pearl writes, “Let us focus on the easier example of an intermediate variable (Z) between treatment (X) and outcome (Y). Has anyone seen a proof that adjusting for Z would introduce bias?” There is no proof that an adjustment will always introduce bias (I think Corey’s right that, by “unbiased,” Pearl means “asymptotic consistency,” in statistical jargon). The theorem is that the adjustment can introduce bias, and to prove this theorem, we only need a single example, such as the one given on page 191 of my book with Jennifer.
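Here is a self-contained simulation in that spirit. The structure and numbers are my own stand-in, not the example from the book: X is randomized, Z is an intermediate outcome, and U is an unmeasured common cause of Z and Y.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

X = rng.integers(0, 2, n).astype(float)   # randomized treatment
U = rng.normal(size=n)                    # unmeasured common cause of Z and Y
Z = X + U + rng.normal(size=n)            # intermediate outcome
Y = Z + U + rng.normal(size=n)            # true total effect of X on Y is 1

# The unadjusted comparison recovers the total effect:
print(Y[X == 1].mean() - Y[X == 0].mean())   # ~1.0

# Adjusting for the intermediate Z introduces bias:
design = np.column_stack([np.ones(n), X, Z])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
print(beta[1])   # ~ -0.5: neither the total effect (1) nor the direct effect (0)
```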
In any given example, there can be all sorts of other problems going on: measurement error, missing data, key unmeasured predictors, and so forth. And it’s always possible that “doing something wrong” (for example, controlling for an intermediate outcome) can actually make the estimate better, just as, to borrow Pearl’s example, we might do better by adding 17.5 to any of our estimates. Except in trivial examples, we can’t prove that this is a bad idea.
Even while living in a world of uncertainty, we make assumptions and, from there, try to do things that are optimal (or nearly so) within our assumptions. (A big part of my work is thinking about how to check and improve these assumptions, but that’s another story.)
I realize I stated Rubin’s view incorrectly. He doesn’t actually say, “do not control for post-randomization variables”. What he does say is: do not try to control for them ignoring the fact that they are post-randomization–that is, do not treat them as fully observed covariates. This is how his instrumental variables stuff works. So there is no contradiction with his view that you should ideally condition on all observed values, even on post-randomization variables (which can be thought of as partially observed, in the potential outcome sense).
From my perspective, the point is that Rubin’s fully Bayesian approach runs into difficulties in complex settings. Rubin would argue that this difficulty is inherent and should not be avoided. And Pearl correctly pointed out the sloppiness of my statement that “Jennifer and I recommend not controlling for intermediate outcomes.” A better way to put it is that it is appropriate to control for intermediate outcomes, but not if your only tool is unadjusted regression on available data.
I agree with Pearl that, “If you incorporate an intermediate variable M as a predictor in your propensity score and continue to do matching as if it is just another evidence-carrying predictor, no post processing will ever help you, except of course, redoing the estimation afresh.” More sophisticated adjustment–whether using Rubin’s framework, Pearl’s, or some other approach–is needed. My allusion to “post-processing” was too vague.
One place where my glib advice (“don’t control for intermediate outcomes”) can break down is in longitudinal studies of the sort where Robins and others have developed weighting methods to estimate causal effects. Again, the real point is, yes, it’s best to condition on all data; we just have to go beyond simple regression.
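To give a flavor of those weighting methods, here is a toy sketch in Python. The structure and numbers are entirely my own invention (not one of Robins’s examples): A0 is an initial randomized treatment, L is an intermediate outcome that influences a later treatment A1, and U is an unmeasured common cause of L and Y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

U = rng.normal(size=n)                     # unmeasured common cause of L and Y
A0 = rng.integers(0, 2, n).astype(float)   # initial randomized treatment
L = A0 + U + rng.normal(size=n)            # intermediate outcome
p = 1 / (1 + np.exp(-L))                   # P(A1 = 1 | L), known in this toy world
A1 = rng.binomial(1, p).astype(float)      # later treatment, driven by L
Y = A0 + A1 + U + rng.normal(size=n)       # true effect of each treatment is 1

def ols(M, y, w=None):
    """(Weighted) least squares via a sqrt-weighted design matrix."""
    if w is not None:
        M, y = M * np.sqrt(w)[:, None], y * np.sqrt(w)
    return np.linalg.lstsq(M, y, rcond=None)[0]

design = np.column_stack([np.ones(n), A0, A1])

# Naive regression that adjusts for the intermediate L is biased:
print(ols(np.column_stack([design, L]), Y))   # coefficient on A0 is ~0.5, not 1

# Inverse-probability weighting by 1 / P(A1 = a1 | L) recovers both effects:
w = 1 / np.where(A1 == 1, p, 1 - p)
print(ols(design, Y, w))                      # coefficients on A0 and A1 both ~1
```

The weighting conditions on the intermediate outcome L, but through the treatment-assignment model rather than as just another regression predictor.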
Statistical models are based on assumptions
Pearl refers to himself as a half-Bayesian and refers to “big-brother Bayes” as an impediment to clear thinking. I consider myself a Bayesian, pretty much. I don’t always use Bayesian methods, but when I don’t, I think of what I’m doing as an approximation to a more laborious full Bayesian approach.
Bayesian inference has only two steps: (1) set up a joint probability distribution for everything involved in your problem, (2) condition on observed data to get a joint posterior distribution for everything unobserved. Everything else in Bayes fits into these steps: model checking can be formulated in terms of step 2 (as posterior inference on replications), and model expansion goes into step 1.
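In symbols (just the standard identities, nothing new):

```latex
% Step 1: a joint distribution for everything,
%   p(y, \theta) = p(\theta)\, p(y \mid \theta).
% Step 2: condition on the observed data y:
\[
p(\theta \mid y) \;\propto\; p(\theta)\, p(y \mid \theta).
\]
% Model checking lives inside step 2, as posterior inference on
% replicated data:
\[
p(y^{\mathrm{rep}} \mid y) \;=\; \int p(y^{\mathrm{rep}} \mid \theta)\,
p(\theta \mid y)\, d\theta.
\]
```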
The most important objection to Bayesian statistics, in my opinion, is that, in realistic examples, the joint probability distribution is going to be arbitrary–often based on whatever data you happen to have at hand–and wrong. Is this a mortal flaw in Bayes? This is an empirical question; it depends on the example. Evaluation is complicated by the fact that statistical modeling, like all scientific activity, has some wiggle room. I tried a lot of models on the way to getting estimates of opinion on school vouchers.
Anyway, inference within the Bayesian framework is straightforward mathematics. We don’t have to go to the grave of Thomas Bayes and ask what he would do in a situation; we just set up the model and go from there. One advantage of Bayesian methods, to me, is that it puts the focus on the model rather than on the estimation procedure.
In defense of taste
In discussing different sorts of models, I wrote that, while some statisticians like to use discrete models, “I have a taste for continuity and always like setting up my model with smooth parameters.” Personally, I think discrete models make very little sense in most social science settings (I will make exceptions in some latent-variable settings such as conceptual models, personality traits, and party identification, along with social measurements that are highly correlated with discrete biological variables such as sex), and I think most of the discrete modeling in social science is a vestige of classical significance testing ideas. But I recognize that others have tastes different from mine on this matter.
Pearl very rightly queries this statement of mine, writing: “The general attitude in this discussion has been to treat the issue as if it was a personal dispute about a wine tasting contest . . . both sides quote good reasons, so it must be a matter of taste, style, focus, perspective, interest, method etc. It isn’t.” In this case, Pearl is referring to a question of adjusting for pre-treatment variables rather than a question of model choice, but the same issue arises: why should applied mathematicians and research scientists (which is, ultimately, what we are) care about “taste”?
Perhaps it would help if I used the word “experience” instead. For example, I have experience using hierarchical models, interactions, graphical model checking, and all the other fun stuff that’s in my book, whereas Rob Tibshirani has experience with generalized additive models, lasso estimation, bootstrapping, and all the great stuff in his books. You could say that I have a taste for probability models and Rob has a taste for direct data-manipulation procedures, or you could say that I have different experiences than Rob does. However you put it, I think Rob is going to do better statistical analyses using nonparametric methods than using hierarchical Bayes, just as I’ll do better the other way.
I don’t know if it really takes 10,000 hours to learn a new method, but it’s not always a bad idea to work with what works for you.
Argument from authority?
As Pearl notes, if a theorem is true, it’s true, and if it’s false, it’s false. It does sound silly to say that someone should use a certain method just because Gelman and Hill do, or because Rosenbaum does it in his book.
But what if you frame it slightly differently, and say that Gelman and King, for example, solved some existing problems in quantitative political science (incumbency advantage, the seats-votes curve, the effects of redistricting) using certain methods. So, just maybe, there’s something good about these methods, right? This sort of inductive reasoning is the basis of much of my work, which bounces back and forth between applications and methodology. Pearl is right that we should be able to focus on specific questions and not just rely on authorities (from Neyman and Fisher to Pearl and Rubin), but I don’t think it’s so unreasonable for me to hold up my applied successes as some validation of the methods I use.
Again, this is not to criticize Pearl’s approach on his problems, just to explain why I bristle at the suggestion that I’m doing the wrong thing, for example, by poststratifying.
How could these issues be resolved?
As I’ve already stated a few times, I think there’s room in the world for Minimal Pearl, Full Pearl, Minimal Rubin, and Full Rubin–even if I don’t think all four can be used at the same time on the same analysis of the same problem! Pearl and his colleagues (such as Wasserman and the anonymous author of the last section of the Wikipedia page on the Rubin Causal Model) believe Full Rubin to be a special case of Full Pearl, but as I’ve argued above, I don’t think so.
How could the methods be compared? One could imagine some sort of data competition–in fact, I think something like this was done recently (perhaps by Pearl himself, I don’t remember). The examples I have in mind are Dehejia and Wahba’s reanalysis of LaLonde’s analysis of a subset of data from a job-training experiment, or the notorious Harvard Nurses Study, which led to an observational analysis purportedly discovering a positive effect of hormone replacement therapy, a finding that was later shown to be in error by a randomized experiment. The question in these examples is whether methods such as Rubin’s propensity scores or Greenland and Robins’s g-estimation could reliably get good answers from observational data.
I don’t think such a competition would be conclusive, though: different statisticians can do well using different methods, and much depends on details of implementation.
Holes everywhere
As I discussed earlier, scientific disagreement can be frustrating, and I think I very well understand the frustration that both Pearl and Rubin feel on this, in their own ways. I continue to believe that all the analytical tools we have–Rubin’s framework, Pearl’s framework, the normal distribution, the logistic distribution, and, yes, even Bayesian inference–are incomplete. They all have holes. Just to quickly list them:
- When you consider multiple treatment factors, Rubin’s framework leads to a proliferation of potential outcomes that has always left me confused. In addition, Rubin (in chapter 7 of Bayesian Data Analysis) recommends controlling for all variables that can affect treatment assignment. This is part of the more general recommendation to include all variables in a model, something that we realistically cannot generally do.
- Pearl’s framework seems to me to assume that each node in the network corresponds, via the do operator, to a particular treatment or manipulation. Realistically there can be many ways of altering a variable; including all of them in the model can lead to a proliferation of nodes, with no clear sense of how to proceed in estimating a causal model.
- The normal, logistic, etc., distributions can be useful but basically never fit real data. We always have the question of when to make our models more complicated and realistic.
- Bayesian inference–realistically, almost all useful statistical inference–depends on models. We are getting better at checking these models, but the theory can never be even close to airtight except in very simple examples.
What I can contribute
I’ve done some work on causal inference (notably, my 1990 article with King on incumbency advantage, and also my book chapter from 2004 on treatment effects in before-after data), but considering the other participants in this discussion, causality is clearly not my area of expertise. Most of my work is on statistical modeling, graphics, and model checking.
But . . . statistical modeling can contribute to causal inference. In an observational study with lots of background variables to control for, there is a lot of freedom in putting together a statistical model–different possible interactions, link functions, and all the rest. Further complexities arise in modeling missing data and latent factors. Better modeling, and model checking, can lead to better causal inference. This is true in Rubin’s framework as well as in Pearl’s: the structure may be there, but the model still needs to be built and tested. And, as we become more confident (without being overconfident) in our models, we can make them more complex and realistic.