Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

Elias Bareinboim asked what I thought about his comment on selection bias in which he referred to a paper by himself and Judea Pearl, “Controlling Selection Bias in Causal Inference.”

I replied that I have no problem with what he wrote, but that from my perspective I find it easier to conceptualize such problems in terms of multilevel models. I elaborated on that point in a recent post, “Hierarchical modeling as a framework for extrapolation,” which I think was read by only a few people (I say this because it received only two comments).

I don’t think Bareinboim objected to anything I wrote, but like me he is comfortable working within his own framework. He wrote the following to me:

In some sense, “not ad hoc” could mean logically consistent. In other words, if one agrees with the assumptions encoded in the model, one must also agree with the conclusions entailed by those assumptions. I am not aware of any other way of doing mathematics. As it turns out, to get causal conclusions we need causal assumptions (“no causes in, no causes out”; see Cartwright), because causality is not some entity outside the realm of mathematics. This is not my observation; it emerged over the last century from research in many fields, including philosophy, computer science, econometrics, and epidemiology. I believe Greenland would agree with this point; if I am not mistaken, he puts some emphasis on it.

It is not clear what Fernando’s description of algorithms means, or implies, but I did not mention any algorithm in the post, only mathematics. If we have a language with a sound reasoning system, we can think about automating some task related to this system, i.e., designing an algorithm. The absence of a sound inference system precludes any attempt at automation.

It is true that the language of (causal) DAGs provides a nice way to encode causal assumptions, but that does not mean they are incompatible with mathematics, or that mathematics cannot be in tune with intuition and the way we think about causality. To me, the whole beauty of this theory is that it is mathematically precise and consistent while at the same time preserving our very intuition about causality. (*)

In regard to the backdoor criterion, and other graphical methods to remove *confounding* bias, we usually assume *local qualitative* knowledge about the causal mechanisms, and then we ask whether a causal query Q can be estimated from the assumptions A together with data D. If Q is included in A, the problem is in the realm of statistical inference (e.g., Q is a causal effect obtained from a perfect randomized trial). Otherwise, which is the case in observational studies, the theory helps in reasoning with A in order to entail Q (using D). (**)
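[ Side note for readers unfamiliar with the criterion: if a set of observed covariates Z satisfies the backdoor criterion relative to (X, Y), the causal effect is identified by the standard adjustment formula from Pearl’s Causality,

P(y \mid \mathrm{do}(x)) = \sum_{z} P(y \mid x, z)\, P(z). ]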

Interestingly, I have already heard Judea say in his talks that “no one can do better,” but his reasoning is not pretentious; it is purely based on logic. There are theoretical results showing the completeness of some methods for removing confounding bias, i.e., given a set of assumptions A and a causal query Q, there exists a procedure that is capable of removing this bias if (and only if) it is possible to remove this bias under the assumptions A.

We might wonder what happens if one does not have a set of assumptions A about the phenomenon being studied. The answer is that nothing can be derived in any mathematical way, or we obtain a logical inconsistency. In other words, we might have two competing models M1 and M2 that are both compatible with the data D, but M1 entails Q1 while M2 entails Q2, for Q1 not equal to Q2. Another way to see it: even with infinite and perfect data, nothing can be claimed that cannot be refuted (without further qualitative assumptions, even asymptotically). (***)

[ Side note: These are interesting results that expose the very nature and limits of what can be computed from data in terms of causality. In computer science, we are not bothered by accepting certain limitations (some reality entailed by our own assumptions). I am not sure whether you are familiar with Turing machines and complexity theory, but we have exactly a couple of results of this type. First, we have a language to express the notion of ‘computation’, and then we have what can indeed be computed (not everything can be). Further, we might wonder what can be computed *efficiently*. In order to know that something is not computable (or not efficiently computable), it is a prerequisite to know what the assumptions involved in the computation are; nothing can be derived from scratch. Mainstream theoretical computer science is built on some of these “impossibility” results, and this does not imply that we are not able to process some information using these same models with their limitations ;-) ]

Furthermore, what I was trying to convey in the first post was that there are biases other than confounding, such as sampling bias (which I prefer to call selection bias). There is even another problem, outside the realm of internal validity, called external validity. Interestingly, even though one could express the causal assumptions in the language of causal DAGs, until now we did not have a sound theory on how to use this language to produce coherent results for the problem of external validity.

There are many other technical details that I decided to omit since this note is already somewhat conceptually ‘loaded’. Looking forward to hearing more from you.

(*) I would say the language of *causal* DAGs, not just DAGs. My understanding is that the more formal development of the language of DAGs can be traced back to the early 80’s, when it was developed in the context of probabilistic reasoning, i.e., not causality at all. For instance, one is interested in computing how the likelihood of Y would change when we observe X at level x. This is not related to causality; it is an exercise in pure Kolmogorov theory. Since the early 90’s, part of this community has turned its attention to the problem of how to encode (provide a language for) and reason (devise algorithms) about causality using DAGs.

It turns out that a different semantics over “DAGs” was needed, and perhaps we should emphasize *causal* DAGs to avoid misleading interpretations. This differentiation is not surprising, since it’s clear that just observing X at some level x is not the same as intervening on X and setting it to level x. I am not sure how familiar you are with this claim, but it’s something very paradigmatic here. The difference, and the expressive power, of these different languages is at the heart of the confusion between pure probabilistic and causal reasoning.

(**) The backdoor criterion is an interesting result, and it is the graphical analogue of the concept of ignorability (in Rubin’s framework), but there is much more that can be done.

To be honest, I am at the same time astonished and disappointed with some literature that acts as if there were no other way to derive causality. A quick example is the front-door criterion (Pearl, Chapter 3; I am in a coffee shop without the book, probably around page 90, but not sure), in which there is NOT an ignorable adjustment but we DO have a way to get an unbiased estimate of the causal effects. Still, backdoor and front-door are just the tip of the iceberg; much more can be entailed with a formal and sound mathematical theory. This exemplifies the power of deriving things that are not obvious from A.
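[ Side note: the front-door formula being referred to, in the standard form given in Chapter 3 of Pearl’s Causality, identifies the effect of X on Y through a mediator Z that intercepts all directed paths from X to Y, even when X and Y share an unobserved confounder:

P(y \mid \mathrm{do}(x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x'). ]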

(***) There is a whole research program, going on for at least 20 years, concerned with the problem of inferring A from D, but this is outside my scope for now. I haven’t seen people outside computer science trying to do this kind of exercise. In some cases, with a mild set of causal assumptions, we can learn part of A from D, but of course these approaches have their limitations. You can see the first attempt in this line of research due to Pearl (in Chapter 2 of his book) or the algorithm provided by the CMU program (Tetrad), results that trace back to 1991-93.

I feebly replied by linking to our earlier blog discussions of Pearl’s and Rubin’s causal frameworks (scroll to the bottom of this page), and Bareinboim wrote:

I am intrigued about how you can choose variables in order to create an unbiased estimate of the causal effects using passive data alone, a choice that in the graphical framework is usually made by qualitative judgment.
(As I pointed out in the previous message, it could be made by automated algorithms, but let’s skip this for now. Also, this is just one part of the internal-validity problem; in the other post, I was trying to discuss external validity.)

The following are a couple of representative examples that are somewhat troublesome:

1. M-graph case: how do you know from probabilistic information alone that adjusting for the node in the middle of the M graph will bias your estimand? (A simulation sketch follows these three examples.)
G1 = { X -> Y, U1 -> X, U1 -> M, U2 -> M, U2 -> Y }, where {U1, U2} are unobservables.

2. Front-door case: how could you make use of the intermediate variable tar (Z) to estimate the effect of smoking (X) on cancer (Y) without Judea’s theory?
G2 = { X -> Z , Z -> Y, U -> X, U -> Y }, where {U} is unobservable.

3. IV case: Without assuming any linearity, how do you know that adjusting for the instrument Z could amplify the bias of the effect of X on Y? Alternatively, how could you even distinguish between Z being an instrument and Z being a typical confounder necessary for adjustment (G3 versus G4)?
G3 = { Z -> X , X -> Y, U -> X, U -> Y }, where {U} is unobservable.
G4 = { Z -> X , Z -> Y, X -> Y}
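A minimal simulation sketch of the first example (an illustration, not part of the correspondence; linear Gaussian mechanisms are assumed purely for convenience, and variable names follow G1). In the M-graph, X and Y share no common cause, so the plain regression of Y on X recovers the true effect, while adjusting for the collider M opens the path X <- U1 -> M <- U2 -> Y and biases the estimate:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# M-graph: U1 -> X, U1 -> M, U2 -> M, U2 -> Y, and X -> Y (true effect = 1)
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
x = u1 + rng.normal(size=n)
m = u1 + u2 + rng.normal(size=n)
y = x + u2 + rng.normal(size=n)

# Unadjusted regression of Y on X is consistent for the true effect here
unadj = np.polyfit(x, y, 1)[0]

# Adjusting for M: coefficient on X in a regression of Y on (X, M, intercept)
design = np.column_stack([x, m, np.ones(n)])
adj = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"unadjusted slope:     {unadj:.3f}  (true effect 1.0)")
print(f"adjusted-for-M slope: {adj:.3f}  (collider bias)")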

I responded by linking to my two articles on causality for the Yale volume and the AJS and wrote that I do not have any magic bullets, but in any particular example I will try my best to set up a model of observed and unobserved outcomes that makes sense.

Bareinboim then wrote:

I read the two papers that you sent me with great interest, and I have some questions.

In your book review, you mention three problems of causation, the second one being “Studying questions of forward causation in observational studies or experimental settings with missing data” (the traditional focus of causal inference in the statistics and biostatistics literature); recall that missingness is inherent in the counterfactual definition of causal effects.

Referring to this standard “missing data” (or “potential outcome”) approach, I have three questions:

1. We know that any causal inference in observational studies requires some untested causal assumptions. How does one express causal assumptions mathematically, say, that “seatbelt usage” is correlated with, but does not affect, choice of treatment?
How do those assumptions mix with the Bayesian hierarchical modeling framework?

2. Given a collection of such assumptions, can one tell (using the “missing data” formalism) if they have testable implications?

3. Given a collection of such assumptions, can one tell (in the “missing data” framework) if they are sufficient for estimating the causal effect of treatment on outcome without bias?

Furthermore, in reference to your first paper on generalization in causal inference, I agree with your observation that these books have focused pretty much on problems 2 and 3 and not on problem 1 (page 957 of the AJS paper), which I believe mirrors the whole literature. Do you think this statement is faithful to the current status of the literature?

I would like to call your attention to our papers that address this problem.

J. Pearl and E. Bareinboim “Transportability across studies: A formal approach”
http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf

E. Bareinboim and J. Pearl “Transportability of Causal Effects: Completeness Results”
http://ftp.cs.ucla.edu/pub/stat_ser/r390.pdf

Put simply, the idea is that you can formally decide whether a given causal effect is “generalizable” among settings in a principled way; and when those effects are indeed generalizable, we are able to pinpoint the mapping between the source and the target settings.

Wow—it’s great to have someone read my papers! I responded as follows:

To answer your questions briefly:

1. In a Bayesian context, the assumptions go into the model of the joint distribution of the potential outcomes.

2. Again in a Bayesian context, the model is what it is. The testability of the assumptions depends on the data. One can do simulations, for example, to see how different aspects of the prior distribution change upon the application of data. We discuss some of this in chapter 4 of Bayesian Data Analysis.

3. The concept of “bias” doesn’t really come into Bayesian inference. It’s more that you want to condition on all available information. In practice, of course, lots of shortcuts are made, so the general idea of bias is indeed relevant. See chapter 7 of Bayesian Data Analysis for further discussion of this point. I agree that it’s hard to get a handle on, though.

Finally, thanks for the links to your papers. The idea of transportability does indeed sound related to the hierarchical modeling ideas that I have been discussing. I expect there is some connection and that we are looking at it from different perspectives. Transportability sounds like exchangeability, and one thing I’ve been emphasizing for many years is that in practice the real question is not “Are these cases exchangeable?” but rather “What information is available to distinguish these different cases?”

Given that I don’t have the energy to work all this through myself, I will blog on it so as to spread the word of these connections to others.

Bareinboim then elaborated:

Thanks for your attention. Your answers unveil some profound limitations of the framework you are using; perhaps they can be overcome. I will try to be as transparent and direct as possible in my questions; please don’t take me wrong.

1. On assumptions

You say that “in the Bayesian framework the assumptions go into the model of the joint distribution of the potential outcomes”. This sounds like a nightmare – perhaps I am reading you wrong.

If our model has n binary variables, we need to specify a joint distribution over 2^n potential-outcome variables, which amounts to 2^(2^n) probabilities. Can anyone specify such a “beast” explicitly?

The same can be said about the assumptions: are they specified explicitly, or embedded implicitly in the inference procedure? The former is preferred, of course, because it permits the investigator to scrutinize the assumptions, or let them be scrutinized by his/her peers.

My cursory examination of the potential-outcome literature finds that the assumptions all fall into the “ignorability” or “conditional ignorability” types, and rarely are they brought up for discussion; they are just assumed by default to justify the author’s favored estimation routine. Is there any methodological procedure in your framework to decide ignorability?

2. On testability of assumptions

You write that “the testability of the assumptions depends on the data.” This sounds like a serious limitation. You mean one cannot tell in advance whether a model says anything about the data until one actually collects the data and notices a change in some aspects of the prior?

Under such conditions a local misspecification (i.e., a wrong assumption) would get lost in the sampling noise of the entire model, and even if one finds a clash between model and data, how can one determine the culprit, namely, which assumption should be repaired?

3. On Bias

You wrote that “the concept of ‘bias’ doesn’t really come into Bayesian inference.” This sounds even harsher than the above. Do you mean that you do not care whether the estimate you get is in any way close to what you want estimated (i.e., the causal effect)?

This is somewhat hard to believe, because critics would just call such a method “ad hoc,” especially when we know that, under certain circumstances, “conditioning on all available information” increases or introduces bias. This occurs, for example, in the IV setting, or when we condition on a variable that we take to be a confounder but that acts as an IV. I wonder if your methodology is able to distinguish these two cases?

One might even wonder what the role of theory is, if the only ruling paradigm is “to condition on all available information,” while seeking no guarantee that conditioning will produce better results than the “crude” (unconditioned) estimator. It seems that, at the end of the study, all that one can say is: “We got some estimate, and we can guarantee that if the estimate is correct, then it is correct.” Don’t other students and practitioners crave a stronger guarantee?

4. More on Bias reduction

Assume that we are able to measure two sets of variables, S1 and S2, but measurements are extremely costly. Can your Bayesian framework advise us on which set of variables we should measure?

It appears to me that, in the absence of informed concerns about bias, the only advice one can expect from the theory is “measure both,” which is not really very informative — theory should do better.

5. On Transportability

I agree with you that the dichotomy exchangeable versus not exchangeable is insufficient to produce any meaningful analysis.

The point of our theory is precisely to systematize how to proceed in the case of non-exchangeable populations. We show that two populations can be quite different and still one is able to transport relations between domains with guarantees of unbiasedness. I think this finding has broad applications in demography, meta-analysis, and any procedure that calls for generalization among settings; I will be happy if other students or bloggers recognize the potential of our findings and benefit from them.

To which I reply:

1. On assumptions: our models are imperfect but I have found that we can make progress by starting with simple models and then complicating them as needed. It takes research, both to compute the models and to understand them, but I am happy to state my model assumptions explicitly. You write, “rarely are they brought up for discussion; they are just assumed by default to justify the author’s favored estimation routine.” I invite you to read my many applied statistics papers. Convenience is certainly one of our guides to picking models, but we do try to build our models on substantive grounds. To the extent that we use default models, this is often because such models have worked on similar problems in the past.

2. When I say, “the testability of the assumptions depends on the data,” I mean that any given dataset or data structure will allow some assumptions to be tested but not others. For example, if you have two-level hierarchical data you can directly test various assumptions at the two levels but you won’t be able to say much about the third level. This is a well-known (although not always clearly stated) principle in statistics, that as we get more data we can test our assumptions better. (For example, you may have heard the expression, “If you have enough data, your chi-squared test will always reject.”)

3. On bias: You can look up “bias” or “unbiased” in the index to Bayesian Data Analysis to see why Bayesians have problems with the concept of bias. In short, bias is conditional on the true parameter value and it does not always make sense to perform that conditioning.

4. You ask, “Assume that we are able to measure two set of variables, S1 and S2, but measurement are extremely costly. Can your Bayesian framework advise us on which set of variables we should measure?”
My answer: Yes, this is a classical (Bayesian) decision problem. You write down your utility function and work through the tree. See the chapter on decision analysis in BDA for some examples. Computing the value of information and deciding whether to take a measurement—these are standard problems in Bayesian decision theory.

5. Transportability: I repeat what I wrote earlier that I think the problem of generalizing across different scenarios is extremely important, and I think hierarchical modeling is the way to go. The key parameter here will be the group-level variance, which determines how much partial pooling there will be when combining historical and current data. I am not particularly concerned about guarantees of unbiasedness, but I do want to do the best possible job of partial pooling.
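To make that concrete, here is a minimal sketch of normal-normal partial pooling (the group estimates and standard errors are made up for illustration, and the group-level standard deviation tau is treated as known, whereas in a full hierarchical analysis it would be estimated from the data):

import numpy as np

# Hypothetical: the same effect estimated separately in five groups
est = np.array([0.20, 0.35, -0.05, 0.50, 0.10])
se = np.array([0.10, 0.15, 0.12, 0.20, 0.08])

def partial_pool(est, se, tau):
    """Normal-normal shrinkage toward the precision-weighted mean."""
    prec = 1.0 / (se**2 + tau**2)
    mu = np.sum(prec * est) / np.sum(prec)  # pooled mean
    w = tau**2 / (tau**2 + se**2)           # weight on each group's own data
    return w * est + (1.0 - w) * mu

# tau = 0 is complete pooling; large tau is no pooling; in between, partial
for tau in [0.0, 0.1, 1.0]:
    print(f"tau = {tau}: {np.round(partial_pool(est, se, tau), 3)}")

The group-level variance is what adjudicates between “treat the settings as one” and “treat them as unrelated,” which is the question transportability asks in a different language.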

Bareinboim replies:

On tolerating bias in the Bayesian framework:

Pearl (Causality, 2009, pages 279-280) provides a simple illustration of how Bayesian posteriors behave when the causal effect is not identified. The posterior remains flat (i.e., bounded away from zero) over a finite interval, regardless of sample size, and its shape remains at the mercy of the assumed prior.
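[ Side note: this behavior is easy to reproduce in a minimal sketch (a construction in the spirit of Pearl’s example, not the book’s exact model). With observational binary data in which confounding cannot be ruled out, P(Y=1 | X=1) and P(X=1) are identified, but P(Y=1 | do(X=1)) also depends on q = P(Y_1=1 | X=0), about which the data are silent; the posterior interval therefore stops shrinking as n grows, and its location is set by the prior on q:

import numpy as np

rng = np.random.default_rng(1)

for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, 0.5, n)
    y = rng.binomial(1, np.where(x == 1, 0.7, 0.3))
    # Conjugate Beta(1,1) posteriors for the identified pieces
    p1 = rng.beta(1 + (y * x).sum(), 1 + ((1 - y) * x).sum(), 20_000)
    px = rng.beta(1 + x.sum(), 1 + (1 - x).sum(), 20_000)
    q = rng.uniform(0, 1, 20_000)  # flat prior on the unidentified piece
    py1 = p1 * px + q * (1 - px)   # draws of P(Y=1 | do(X=1))
    lo, hi = np.percentile(py1, [2.5, 97.5])
    print(f"n = {n:>9}: 95% interval for P(Y=1|do(X=1)) = [{lo:.2f}, {hi:.2f}]")
]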

On transportability:

The only way investigators can decide whether “hierarchical modeling is the way to go” is for someone to demonstrate the method on a toy example. In (Pearl and Bareinboim 2011) we analyze three toy examples, and vividly demonstrate how mathematical routines can tell us whether and how experimental results from one population can be used to estimate causal effects in another population, potentially different from the first. The results are crisp, transparent and come with theoretical guarantees on the estimator produced. It remains for experts in hierarchical modeling to demonstrate, on the same toy examples, how the distinction between “transportable” and “non-transportable” cases is determined, if at all, and what theoretical guarantees accompany the results produced by the hierarchical modeling framework, especially in “non-transportable” cases. I hope some readers on this blog would be enticed to tackle the “Z = reading ability” example in (Pearl and Bareinboim 2011). http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf
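[ Side note: for orientation, the transport formula for that example, as given in Pearl and Bareinboim (2011), where Z stands for reading ability, the effect modifier whose distribution differs across the two populations, and starred quantities refer to the target population:

P^{*}(y \mid \mathrm{do}(x)) = \sum_{z} P(y \mid \mathrm{do}(x), z)\, P^{*}(z)

That is, the z-specific experimental effects measured in the source population are reweighted by the target population’s distribution of Z. ]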

Bareinboim writes above that “mathematical routines can tell us whether and how experimental results from one population can be used to estimate causal effects in another population, potentially different from the first.”

From (my) Bayesian perspective, experimental results from one population can always be used to estimate a causal effect in another population (assuming there is some connection; obviously we would not be doing this for unrelated topics). The sorts of examples I’m thinking of from political science would be generalizations from one country to another or from one era to another. In practice we often do not explicitly share the information from one group to learn about another, as it is work to construct a statistical model. Instead we use what we call the secret weapon, and plot several estimates on a single graph so the partial pooling can be done by eye.
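As a minimal sketch of the secret weapon (hypothetical numbers, purely for illustration): fit the same model separately in each group, here by year, and put all the estimates with uncertainty bars on a single graph.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical: the same regression coefficient estimated separately by year
rng = np.random.default_rng(2)
years = np.arange(1990, 2001)
est = 0.30 + 0.02 * (years - 1990) + rng.normal(0, 0.05, len(years))
se = np.full(len(years), 0.06)

# One estimate (+/- 2 se) per year: the eye does the partial pooling
plt.errorbar(years, est, yerr=2 * se, fmt="o", capsize=3)
plt.axhline(0, color="gray", linewidth=0.5)
plt.xlabel("year")
plt.ylabel("estimated coefficient")
plt.title("The secret weapon: separate fits, one graph")
plt.show()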

For another sort of example, we used hierarchical prior distributions to make causal inference in toxicology, combining data from different sources; see here.

It is an interesting cultural difference that Bareinboim believes that “The only way investigators can decide whether ‘hierarchical modeling is the way to go’ is for someone to demonstrate the method on a toy example,” whereas I am more convinced by live applications. I suppose it’s good that different researchers have different criteria for what is convincing for them.

Finally, I appreciate our correspondent taking the time to send me his thoughts! I’m posting this on the blog so others can learn more about his perspective.

Comments

  1. In re: “My cursory examination of the potential-outcome literature finds that the assumptions all fall into the ‘ignorability’ or ‘conditional ignorability’ types, and rarely are they brought up for discussion; they are just assumed by default to justify the author’s favored estimation routine. Is there any methodological procedure in your framework to decide ignorability?”

    It drives me crazy when people criticize the PO framework because practitioners make ignorability assumptions that are convenient rather than plausible. I don’t disagree (at all) that it happens and that it’s bad practice, but the critics like to make it sound as if it were impossible to assume a causal diagram because it’s convenient rather than plausible.

    My impression from the above comment is that Bareinboim is asking whether there is a simple way given a model to read off which variables would be sufficient for adjustment (measuring these would identify the causal effect of interest). Maybe, but I’d probably agree that it’s easier to do that with a graph. There’s no reason not to write down the graph implicit in any model we’re using, and I think it can be incredibly helpful. But the hard part is coming up with that graph and neither framework can speak to that.

    • Dear Jared,

      People criticize the PO framework not merely due to bad practices in the field but because they wish to highlight basic weaknesses in this notation that should be acknowledged and understood by all causal analysts.

      From my experience, the potential outcome framework excels in:
      1. defining research questions unambiguously, and
      2. proving identification conditions algebraically when the number of variables is small (<5) and the assumptions are in ignorability format.

      On the other side, the graphical framework is good for:
      1. expressing and scrutinizing assumptions,
      2. identifying the testable implications of those assumptions, and
      3. proving identification conditions graphically, close to where the assumptions came from (for potentially hundreds of variables).

      If this rating differs from yours, I would love to hear your thoughts. Even better, a toy example can do miracles to enhance communication.

      • Elias,

        I agree with you, but DAGs can become pretty unwieldy in the presence of many variables.

        A complicated causal structure can be equivalently represented as a DAG (made up of nodes and directed and bi-directed edges) or as a structural equation system. Though both are as complex as the underlying theory, the latter can be usefully summarized using matrix notation. It is not clear there is an equivalent shorthand notation for DAGs.
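        A minimal sketch of the matrix shorthand being described (an illustration, not from the comment thread): a DAG is just an adjacency matrix, acyclicity is nilpotence of that matrix, and ancestral relations come from its powers.

import numpy as np

# Nodes Z, X, Y; A[i, j] = 1 encodes a directed edge i -> j
nodes = ["Z", "X", "Y"]
A = np.array([[0, 1, 0],   # Z -> X
              [0, 0, 1],   # X -> Y
              [0, 0, 0]])

# A DAG's adjacency matrix is nilpotent: A^n = 0 means no directed cycles
n = len(nodes)
assert not np.linalg.matrix_power(A, n).any(), "graph has a cycle"

# Entry (i, j) of A + A^2 + ... + A^(n-1) counts directed paths from i to j,
# so its support gives the ancestor/descendant relation
reach = sum(np.linalg.matrix_power(A, k) for k in range(1, n))
print((reach > 0).astype(int))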

        For an example of a real world application see Appendix C and Section 3 of

      • Elias,

        Thanks for the reply. You’ll note that I didn’t say poor practice was the *only* criticism of the PO framework, just that I think it’s an unfair one. (Some of) the abuse boils down to making conditional independence assumptions that aren’t justified, which you can do equally well with a graph or with a nod to ignorability. It’s like criticizing a hammer because someone who didn’t know how to use it properly broke their thumb.

        Now you might argue that it’s harder to actually get away with making unreasonable assumptions when you’re claiming (and displaying) a particular graph versus claiming an ignorability condition, and I’m not sure I would disagree with you – particularly in smaller examples.

        In regards to having hundreds of variables, sure that’s hard to do in a PO framework. But where is that graph coming from on these hundreds of variables? Do you (the hypothetical researcher, not you personally :) ) really have the substantive knowledge a priori to specify a graph this large? And can you quantify the sensitivity of these identifying conditions to small changes in the graph? For example, if I added this or that edge would a backdoor path open up? This was another point I meant to make previously – graphical criteria make it easy to find a set of variables satisfying ignorability conditions or other identifying criteria ***given a graph***. I find these points to be underemphasized in that literature, although the main players are good about mentioning that inference is always under the assumptions encoded in the graph (and I’m no expert so any references contrary to my impressions would be welcome!).

        Finally, I definitely don’t mean to suggest that the PO folks have a better alternative in this setting. But I think that causal graphs are often oversold (and toy examples, while illuminating, are in my opinion part of the problem in that context!).

        • To clarify, by “the abuse” in my 1st paragraph I was referring to the misuse of the PO framework in practice, specifically claiming implausible ignorability conditions because they’re convenient, and not any part of your earlier discussion.

        • Thank you, Jared. At the same time, I find it unfair and misdirected when people criticize DAG methods by saying: “and where does the graph come from?” In a world where we must rely so critically on every shred of substantive knowledge at our disposal, the right question to ask is “how can we make the little knowledge that we have more transparent and scrutinizable?”

        • Elias,

          For whatever reason I can’t reply to your comment directly. Again, my point was that the criticism levied at the PO framework (that it’s misapplied) isn’t fair. I’m sure you could find questionable analyses using a causal DAG that survived peer review too, or do the same for any other useful method.

          Fair criticisms you might make include: 1) the PO framework makes questionable assumptions opaque, or 2) casting everything in terms of ignorability conditions makes it harder to see other, maybe less restrictive, conditions that also identify causal effects. Either of these is, to my mind, a valid point of debate.

          I disagree emphatically that it’s unfair to ask whether it’s reasonable to build a causal graph on hundreds of variables. And it *isn’t* a criticism of causal graphs – I think that building *any* causal model on hundreds of variables is almost destined to make unreasonable assumptions, especially in applied problems in epidemiology or social science or when we’re relying on observational data. And I think that assuming a huge graph (and not incorporating, or even addressing, the uncertainty we have about its structure) would usually be irresponsible and go well beyond relying on the substantive knowledge in hand. Sometimes (and from your earlier comments I think you would agree) the right answer is “I don’t know”, or perhaps “I couldn’t say without making further untestable and/or unjustifiable assumptions”.

          So the fact that you can read off identifying conditions in a huge graph is fine as far as it goes, but the practical importance of that seems limited to me since I don’t see how you can put forward a graph that large with a straight face in most applications. Perhaps there are instances when you can (do you have an example?) but I just can’t see that happening in epi, econometrics, other social sciences and definitely not in the statistics literature. My criticism, although it was maybe unclear, was of the practical relevance of the parenthetical in your point (3) and not the causal graph methodology. Incidentally, I think that the ability of the causal graph methodology to derive different and occasionally unexpected identifying criteria with relative ease is a good selling point since these aren’t always obvious even in small problems.

          Also, you didn’t answer my question about sensitivity to the assumed graph and how you might assess it. Is there literature there I’m unaware of?

        • Dear Jared,
          I also cannot reply to your comment directly.

          Your point about the opacity of the PO notation is precisely what I was aiming at. A framework that is designed to help researchers organize knowledge and be explicit about assumptions should make this task possible. Or, at the very least, accept help from other languages (e.g., DAGs) that make these subtasks more transparent.

          I am not unaware of the difficulty of constructing a theory (causal graph) with a large number of variables. Still, it is unfair, in comparing DAGs with other methods, to say: “The trouble with DAGs is that you need to know the structure in advance.” This is sheer nonsense. If I don’t know the structure, I just connect every variable to every other, and now I am in the same state of ignorance as those who refrain from admitting their ignorance, and there is nothing they can do that I cannot. This includes the use of controlled experiments, IVs, propensity scores, etc. I am just a little more aware of where threats are lurking. For example, I will not include an IV in my propensity score, and I will not expect a massive amount of data to make up for my doubts about causal assumptions. In addition, there are areas where large causal graphs can be constructed with some confidence. Pedigree analysis is a good example, where the structure is defined by family lineage.

          Finally, science is about putting together many fragments of knowledge and deducing their consequences in a coherent way. Even though we do not today have all the fragments necessary for building a large causal graph, one day we will, and it is reassuring to know that we have the mathematics for putting them together. Moreover, trying to build it tells us which fragments are badly needed; not trying tells us nothing.
          I hope more people engage in trying.

        • Elias,
          “Your point about opacity of the PO notation is precisely what I was aiming at.”

          And my original complaint was that this isn’t usually what I read from the causal DAG crowd, including the excerpt from your original post. The critique often given is that people assume things like ignorability because they’re convenient. And that isn’t a fair criticism of the methodology, because you can just as well assume a graph because it’s convenient. My point was and is that if you want to say that the PO notation is opaque, say that. Don’t imply that the PO methodology is deficient because people use it incorrectly or don’t verify the assumptions. Because if practitioners pick up the causal graph methodology, you can bet it will be misapplied in practice as well.

          In re: ‘Still, it is unfair in comparing DAGs with other methods to say: “The trouble with DAG is that you need to know the structure in advance.” This is sheer nonsense.’

          That just isn’t what I said! I agree that it’s nonsense and you would understand this if you read my comments with care – I’ve been as explicit as I can be in saying that it is NOT a criticism of causal graphs in particular. It IS a criticism of trying to do causal inference with hundreds of variables. You think it’s important that a method can handle that. I don’t, at least not at present, because there are serious difficulties in coming up with a defensible graph (or PO model or *whatever* method you want to use) in a problem of that size. Maybe someday (although in many applications I have my doubts) but as it stands it’s a weak addition to the pros column at best.

          You can disagree with that but it isn’t “sheer nonsense”.

        • Dear Jared,

          What I hear from the “causal DAG crowd” is the converse of what you complain about. It is not that the PO methodology is deficient because people use it incorrectly, but the other way around: people use it incorrectly BECAUSE it is opaque and they can’t do better.
          (Pearl wrote so explicitly – I don’t have the exact quote here, but I can find it if needed.)

          So, let me clarify this position: the fact that PO practitioners and researchers use “ignorability” as a catch-all term, without justification, is evidence that, even for them, the concept is opaque.

          Regarding “sheer nonsense”: if your position is that the current state of knowledge does not permit ANY method to be useful, then DAGs are not useful either. What is “sheer nonsense,” however, is to state (perhaps you did not mean it, but I read it all the time) that because we cannot defend a sparse graph, we can do better with no graph, say by doing things in our heads, or assuming ignorability, or using hierarchical Bayes, etc.

          Glad we are in agreement on this point. (and BTW, there is no difficulty in coming up with a defensible graph for thousands of variables; a complete graph is always defensible, since it makes no claims about the world and it represents precisely one’s state of total ignorance. It is unfortunately not very useful for inference).

        • Dear Jared,

          What I hear from the “causal DAG crowd” is the converse of what you complain about. It is not that the PO methodology is deficient because people use it incorrectly, but the other way around: people use it incorrectly BECAUSE it is opaque and they can’t do better.

          So, let me clarify this position: the fact that PO practitioners and researchers use “ignorability” as a catch-all term, without justification, is evidence that, even for them, the concept is opaque.

          >> No need, I believe you. But the argument I’ve read has more than once been “PO practitioners use the framework incorrectly, so it is flawed.” Obviously the implication doesn’t run that way, and I think we would be better off if people just said that the PO framework is limited in applicability, or hard to understand, etc., and as a result is misused frequently.

          Regarding “sheer nonsense”, if your position is that the current state of knowledge does not permit ANY method to be useful, then DAGs are not useful either.

          >> We agree. In fact that’s what I wrote, and multiple times, with extra emphasis.

          What is “sheer nonsense,” however, is to state (perhaps you did not mean it, but I read it all the time) that because we cannot defend a sparse graph, we can do better with no graph, say by doing things in our heads, or assuming ignorability, or using hierarchical Bayes, etc.

          >> Not only didn’t I mean it, I didn’t write it at all. I explicitly wrote multiple times that in such problems we probably don’t have any tools that work (that is, let us make inferences) without unrealistic assumptions, and the “sheer nonsense” comment was quite off-point and a little rude, frankly.

          Glad we are in agreement on this point. (and BTW, there is no difficulty in coming up with a defensible graph for thousands of variables; a complete graph is always defensible, since it makes no claims about the world and it represents precisely one’s state of total ignorance. It is unfortunately not very useful for inference).

          >> Yes, I meant a defensible graph that is also useful for making inferences. I suppose that could have been clearer.

  2. As someone who happily switches between the PO and causal DAG formalisms as is convenient, I agree with Jared that I find it frustrating when the causal DAG people criticize the PO approach in bombastic ways.

  3. External validity is fundamentally an identification problem. Such problems cannot be solved by modeling approaches. Indeed, short of gathering data on the new population, there is no solution.

    There are, however, principled approaches to get the most bang from the research buck. Theory, DAGs, hierarchical models, machine learning, etc. are all useful but, strictly speaking, they are no solution. They can only be more or less useful.

    • Dear Fernando,

      Every problem is essentially an identification problem, because it is identification that gives us a license to estimate. That said, problems of external validity cannot be resolved by techniques of standard identification. If you find a way of doing it, I would be very interested in incorporating it into my work.

      • Elias,

        We agree. If you read my comment carefully you’ll see I stated there is no solution. This is not a binary right/wrong problem.

        For example, a Heckman selection model is not a “solution” to selection on unobservables. Rather it is the correct model to use once we have decided to make some assumptions about the problem at hand. Quite literally, we “solve” the problem by assuming it away. But ultimately the proof of the pudding is in the eating. Did the model predict well, etc.

        DAGs are very useful for encoding these assumptions and then applying algorithms like the backdoor criterion to identify causal relations conditional on the theory/knowledge embodied in the DAG. They do not solve the problem, but they can help make a research program more effective.

        PS. I disagree that everything is an identification problem. Neither prediction nor descriptive inference (understood as summary measures, not estimates of anything) requires assuming any underlying parameters.

        • Dear Fernando,

          1. DAGs can decide identifiability and produce a causal estimand in problems involving hundreds of variables, problems that PO techniques (if such exist) would take years to solve.

          2. Matrix notation is isomorphic to DAGs (with +1 representing arrow directionality), so every DAG procedure (e.g., the back-door) has an algebraically isomorphic procedure in matrix algebra.

          3. As to the distinction between “solving” a problem and “assuming it away,” the difference lies only in the strength of the guarantees that the “assuming away” methodology can provide. If you are unhappy with the guarantees provided by a given methodology, say because the assumptions are not sufficiently convincing, ask whether stronger guarantees are mathematically feasible. If not, we must resign ourselves to what we have. But resigning without asking this last question is premature, and I am still waiting (and open) for an answer to my question about the type of “guarantees” that this methodology can provide.

        • 1. Agree. The issue is one of practicality. See Figure 6 in the annex of the paper I linked to above.

          2. I agree about the isomorphism (I spoke about equivalence) but am not sure about the practical equivalence. Take Figure 6 in the annex of the paper I linked to above. It is easy to represent it as a non-parametric SEM. The latter can be manipulated using matrix algebra, which might be simpler than applying a graphical algorithm for backdoor identification directly to Figure 6 (e.g., deleting a subset of arrows, etc.).

          3. Not sure what you mean. When you label a node with an S pointing to it you are making an assumption about the location of the modification mechanism. The validity of this assumption is an empirical question, not a mathematical one, unless the experiment was stratified on Z. Such stratification is not common in practice. As you mention, external validity is a causal problem not a statistical or mathematical one.

  4. I’ll have to read the material more carefully but

    Transportability looks like conditional exchangeability and also one of the first questions in any thoughtful meta-analysis – what is/should be common?

    Also (to me) toy models display the mathematics more transparently/concretely, whereas live applications “run into brute force realities” (e.g., what was hoped to be common turning out not to be common at all, leading to new hopes and further disappointments, ad infinitum).

    There is a very transparent and concrete model for informative Bayesian inference, built by Francis Galton, that Elias might find interesting; see fig. 5 of http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.2010.00643.x/pdf.

    (I plan to automate such machines to demonstrate many issues of inference (Bayesian and non-Bayesian) on toy problems.)

    • Dear O’Rourke,

      We have been combing the meta-analysis literature with a toothbrush for any hint of what you call “thoughtful meta-analysis – what is/should be common.” If you can point us to a meta-analytic paper where commonalities and differences among populations are taken into account (and the identifiability question formulated), we would be very appreciative.

      As to the paper on Darwin and Galton, I appreciate the link but I can tell in advance that it is not related to the problem of transportability, because transportability deals with transferring “causal information” between populations — no discussion of regression or joint distributions can deal with this problem, however enlightened, because causal and distributional concepts do not mix.

      • I have now read far enough to notice the referenced preprint on meta-analysis by Pearl.

        A blog comment for now: I think you have not read widely and deeply enough, but it is interesting that the “what is/should be common” question is not emphasized as well as it should be, since there needs to be some replication (i.e., something common between the studies) for meta-analysis to make sense at all.

        I did make it clear in my thesis (which of course was not meant to clearly communicate important ideas), which is here: http://statmodeling.stat.columbia.edu/movabletype/mlm/ThesisReprint.pdf (perhaps search through for “common,” and maybe look at Tibshirani and Efron’s “pre-validation” example for a case where what is common is mis-anticipated).

        And some of it maybe made it into Greenland S, O’Rourke K: Meta-Analysis. In Modern Epidemiology, 3rd ed. Edited by Rothman KJ, Greenland S, Lash T. Lippincott Williams and Wilkins; 2008.

        Now, your example of “age-specific effects being invariant across cities” is beyond what can be dealt with in most meta-analyses, given simply how studies are conducted and selectively reported (e.g., you won’t get to see any age-specific information, and anything you do get might have been informatively “selected”). But for RCTs with binary outcomes it’s pretty standard to “hope” for relative (rather than absolute) treatment effects to be invariant across cities while the control rate varies at least to some degree. The L’Abbe plot was meant to give a visual presentation of both – what was common and what varied – though many authors focus solely on what they hoped was common (homogeneous).

        So in meta-analysis one is usually left with just two hopes: one, to rule out that the null hypothesis was common to all studies (perhaps by combining p-values); the other, that the non-transportability is itself replicating (so that one can make some sense of the variation in treatment effects that were hoped to be common).

        But we can always hope for improvements in how studies are conducted and reported, so I would encourage work on methods here. http://statmodeling.stat.columbia.edu/2012/02/meta-analysis-game-theory-and-incentives-to-do-replicable-research/

        As for the paper on Darwin and Galton, it was just a toy example to show why/how Bayes theorem or conditioning “works”.

        • Dear O’Rourke,

          Your conclusion that we did not comb the meta-analytic literature deeply enough has given me renewed hope that, with your help, we will be able to find a number of tools or ideas to enhance our transportability research.

          But please recall that we are not looking for statements about the many threats lurking around, or about the importance of having something common between the studies for meta-analysis to make sense at all, or prudent observations that we will never know for sure how populations differ from each other, etc, etc.

          This is not what we are looking for.

          We are trying to find out how meta-analysts represent commonalities and differences among populations IF (repeat: IF) they are known or suspected, and how they propose to take advantage of such commonalities once they are represented mathematically. Can you guide us in this search?

          I am particularly encouraged by the fact that you are not shy of toy examples (e.g., Darwin and Galton), which should make our communication more effective than usual. In particular, do you find our three toy examples trivial, or challenging? Would a seasoned meta-analyst be able to write down the transport formulae just by listening to the three stories?

        • > Would a seasoned meta-analyst be able to write down the transport formulae …?

          I believe some would, but I am already convinced that causal stories (as a model/representation) are better written in “Pearl,” so I would think commonality (replication) stories likely would be too.

          Now, whether they do more good than harm in applications (today!) is harder to discern. But I also believe it is always a good idea to work out (think through) what one would do if there were no practical limitations (e.g., design an RCT to test benefit versus harm of low-level radiation exposure in young children).

          I’ll send you an email.

          Also revisited this and found it still relevant http://statmodeling.stat.columbia.edu/2009/07/more_on_pearlru/

        • (I cannot reply to your comment directly.)

          I am trying to figure out what you are referring to as potentially harmful: is it the focus on simplifying models (stories) or the attempt to encode the stories in mathematics?

          I believe both are important.

    • Dear Fernando,

      Your poster is refreshing in its scope and examples. You have expressed the need for a theory of causality to enable a solution of the problem of external validity. We have proposed such a theory, and it has produced satisfactory solutions to a number of problems in causal analysis. We are anxious to know if you think it would fit your strategy towards a solution of the external validity problem.

      • Elias,

        I really liked your paper. My complaint is not about whether your selection diagrams and associated theorems work as intended (I trust your derivations) but rather about whether such diagrams are the best representation possible. I am also not sure whether you mean to provide a new “theory of causality to enable a solution to the problem of external validity,” as opposed to a new notation better suited to dealing with the problem.

        In terms of the notation, the problem arises from the ambiguity in representing moderation effects in DAGs. (See, for example, VanderWeele 2009, “On the Distinction Between Interaction and Effect Modification.”) In this view, grossly simplifying, and assuming I read your paper correctly, your proposal amounts to using S to label effect modifiers or interaction variables (as defined by VanderWeele, op. cit.) in a given DAG.

        My quibble with this labeling scheme is not with the internal logic of your notation and theorems but rather with the choice of notation itself. One can quite effectively describe planetary motion using circles circling on circles, as Ptolemy did, or ellipses, which are the better representation. Similarly, my gut reaction to your proposal is to avoid adding new notation (S variables, square nodes) and associated theorems when simpler notation and extant theorems will do. Obviously, the latter claim remains to be proven, but that requires a paper, not a blog comment. I’ll be happy to discuss further via e-mail.

        • PS To avoid confusion, my previous comment referred to

          J. Pearl and E. Bareinboim “Transportability across studies: A formal approach”

          not the one about selection bias at the start of the post.

  5. Pingback: Causal Analysis in Theory and Practice » Discussion about causal inference and hierarchical modeling

  6. Whenever reading stuff from Pearl and his group, I always feel like Jared above. Explicitly considering your causal graph is useful in cases where assuming a graph a priori is at all reasonable, and graphs certainly make reasoning about things much clearer in some situations. But it always seems more like fancy logic puzzles than useful theory. You never know the graph. If you have more than a handful or two of variables, if you’re honest you’ll admit that even domain experts can at best only rule out a relatively small subset (given the enormous total number) of possible graphs. So then what do you do if you’re theologically wed to DAGs only? Choose one and declare it an assumption?

    I think this comes down to different specialties. Pearl is a logician and theorist. Like Andrew I’m a statistician. Dealing with real data, I think the much more useful approach is to worry more about prediction, using domain knowledge and statistical principles and experience to build models that encompass the most plausible models and will be interpretable at least as associations. To me that seems the only way to find factors that could be causes or proxies. Then yes, graph theory can come into play, but to me it has little to say before the end game in most real world problems with many variables, many unobservables with unknown edges, and little knowledge of the true graph even of the observables.

    • Dear Matt,

      One reason I like DAG analysts is that, out of all the theories, paradigms, and approaches that have been presented here, they are the only ones willing to demonstrate explicitly how their method works, and to do it on simple problems whose solution we can anticipate in advance. All the others shun such problems, which gives me an uneasy feeling; I don’t know of any principled method in science that works ONLY when we do NOT know whether the solution is right or wrong. Would anyone trust such a method?

      As to not knowing the structure of the DAG, I discussed this in my comment to Jared — ignorance is the easiest thing to model in DAGs, where it is not hidden under traditional routines but is dealt with explicitly and methodically.

      How about demonstrating an alternative method on a simple problem with four variables?

      • Not sure what this is supposed to mean – “the only ones willing to demonstrate explicitly how their method works, and do it on simple problems” – there’s never been a demonstration of how other approaches to causal inference work on a simple problem?

        I think part of the talking-past-each-other comes from different values, in the same way that older empirical scientists will say, “if you need statistics to get a result, you should have a better experiment or do nothing at all.” It seems like a difference in where different fields delineate the necessary conditions for knowledge generation. Analogously, you have the applied people who believe that causal inferences where the relationships are complex enough to warrant a graph representation are too complicated to say anything about. On the other side, you have people asking how you can even begin to answer a question without a principled basis for deciding which variables to control for, etc.

        This would be helped by more books that unify the different approaches, not just theoretically, but also in terms of language / nomenclature. The Hernan/Robins text looks promising for this from what I’ve seen of the drafts.

        • Dear revo11,

          Correct. In this blog forum, causal graphs researchers were indeed “the only ones willing to demonstrate explicitly how their method works, and do it on simple problems” – the critics invariably referred to the literature at large, where problem size and messy data obscure the method and what it guarantees.

          As to the need for books and unification, I agree. Robins & Hernan promise to do a good job of unification (still to be read), and so does Pearl’s book, in which you can find (already read):
          a. Potential outcomes treated side by side with graphs and structural equations.
          b. “Ignorability” demystified and derived from friendly graphs (the title of a subsection devoted exclusively to this topic).
          c. Propensity scores given a graphical representation and understood.
          d. Robins’ g-computation compared to the back-door condition.
          e. Nomenclatures from fields as diverse as economics, epidemiology, statistics, and philosophy defined and compared.
          f. Controversies explained on simple examples.

          No, I do not agree with you that what we need are more books. The books are available; what is missing is the willingness to read what they say with an open mind, free of the attitude that practitioners, because of the complexity of their problems, can afford to fail on toy problems.

          This is like saying that, for a software engineer, the larger the software, the less logic and programming skill is required, and the sloppier one can be.

        • 2nd revo11’s first paragraph.

          And revo11’s middle paragraph is exactly right. As a statistician I can’t overemphasize my lack of interest in 4-variable toy logic puzzles. For any problem I’m interested in (i.e., complex ones with unknown graphs and only a few edges known by domain experts almost surely to exist, which describes most unsolved real-world problems outside the physical sciences), causal DAG theory has nothing to say. Again to quote revo11 above, a causal DAG acolyte might ask “how you can even begin to answer a question without a principled basis for deciding which variables to control for, etc.” How does causal DAG theory help provide a principled basis for things like variable choice when the graph is complex and unknown (and in many cases unknowable), with many unmeasured and unknown nodes? And just using the DAG formalism as a model-building tool doesn’t count, though that is very useful.

          Elias’s response to Jared on the unknown-graph-with-many-variables question is nonsense (to use his word!). A DAG is “acyclic,” no? That implies it is in fact impossible to base a DAG analysis on a graph in which every variable can both affect and be affected by every other. The acyclicity constraint means that if you require an a priori DAG, you must make massive assumptions about the set of possible DAGs, assumptions that are in fact much more restrictive than the ignorability assumption (i.e., if you assume a variable is a root node, you are assuming away all the DAGs where this is not true, a truly massive set when the number of variables is large, and one that may include many DAGs where the ignorability assumption would hold). As Jared said, in real-world cases of observational data, where there is a large number of variables and an even larger number of unknown, unobserved, and/or unmeasurable ones, and the true graph is not known, the right answer to the question of causality is almost always “the measured association is a causal effect if we make massive assumptions,” or “I don’t know.” The right answer is not, in my opinion, “we (mostly) arbitrarily guessed as inclusive an a priori DAG as possible, assuming the large majority of possible structures to be impossible; under that (mostly) arbitrary assumption, …”.
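
          In case anyone doubts how big that assumption is, here is a quick brute-force count, my own sketch assuming nothing beyond the definition of a DAG, of how much of DAG-space the single statement “variable 0 is a root node” throws away on just 4 labeled nodes:

            from itertools import product

            n = 4
            pairs = [(i, j) for i in range(n) for j in range(n) if i != j]

            def is_acyclic(edges):
                # Kahn-style check: repeatedly peel off nodes with no remaining parents.
                es, remaining = set(edges), set(range(n))
                while remaining:
                    sources = {v for v in remaining
                               if not any((u, v) in es for u in remaining)}
                    if not sources:
                        return False          # only cycles are left
                    remaining -= sources
                return True

            total = root0 = 0
            for mask in product((False, True), repeat=len(pairs)):
                edges = [p for p, keep in zip(pairs, mask) if keep]
                if is_acyclic(edges):
                    total += 1
                    if all(j != 0 for (_, j) in edges):   # node 0 has no parents
                        root0 += 1

            print(total, root0, root0 / total)   # 543 DAGs, 200 with node 0 as root (~37%)

          So even at 4 nodes, one root-node assumption discards roughly two-thirds of all possible DAGs, and the discarded share only grows with the number of variables.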

          Some things being more easily provable in a causal DAG setting with 5 variables and a known DAG is neither here nor there. Some things being easier to reason about or compute on particular known DAGs with many variables is also irrelevant. Building a graph out from data is hard, and the same structural pitfalls that Pearl acolytes harp on when talking to non-fundamentalists (the ways certain true graph structures can trip up the interrogation of causal relations) also hound graph building. So you have identical basic limitations and different analytic methods in real-world problems, and we are right back to which assumptions you like. It’s true that if you make the much more restrictive assumption that the true graph is at least an edge-subset of one built (mostly arbitrarily, if you’re honest) a priori, then you have a more explicit assumption and some nice tools to use. That doesn’t seem like a good trade to me in real-world problems. Back to admitting no causal inference in most real-world questions where only observational data can be gathered, and focusing on prediction and description.

          Not to say the theory doesn’t have interesting and useful things to say when you have a simple problem with a known or almost-known DAG. The logic of causal DAGs is also a great tool for understanding causality intuitively, and for understanding the enormity of your assumptions if you assume, say, a simple IV DAG with a single, general “U” node that won’t taint your causal inference. Many results (e.g., possible M-bias, the effects of including variables on various possible paths, etc.) are things researchers should understand and keep in mind. I just find the level of righteousness and condescension from the Pearl camp off-putting and unjustified, and I have never heard a satisfactory defense of why causal DAG theory must be the be-all and end-all of statistical analysis when the vaunted tools rely on an extent of a priori knowledge that is wildly unrealistic in most interesting settings.
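
          For readers who have not seen M-bias, here is a small simulation of it, my own construction with made-up linear coefficients. The graph is U1 -> X, U1 -> M <- U2, U2 -> Y, with no edge from X to Y, so the true effect is exactly zero; leaving M alone gives the right answer, while “playing it safe” by adjusting for M manufactures an effect:

            import numpy as np

            rng = np.random.default_rng(1)
            n = 200_000
            u1 = rng.normal(size=n)                 # unobserved
            u2 = rng.normal(size=n)                 # unobserved
            x = u1 + rng.normal(size=n)
            m = u1 + u2 + rng.normal(size=n)        # collider between U1 and U2
            y = u2 + rng.normal(size=n)             # true effect of X on Y is 0

            def coef_on_x(*covariates):
                # OLS of y on an intercept, x, and any extra covariates.
                X = np.column_stack((np.ones(n), x) + covariates)
                return np.linalg.lstsq(X, y, rcond=None)[0][1]

            print(f"unadjusted:     {coef_on_x():+.3f}")   # ~ +0.000, correct
            print(f"adjusted for M: {coef_on_x(m):+.3f}")  # ~ -0.200, bias created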

          Sell it as a useful member of the toolbox? Absolutely. The word of God where heathens who dare ever use any other method deserve condescending Platonic questions? Stop kidding yourself.

          (Aside, I’m interested in a new Hernan/Robins text! Any idea when it’ll publish?)

        • Dear Matt,

          my perception of this discussion does not coincide with yours. You see it as a debate between obstinate, single-minded DAG advocates and broad-minded practitioners who would accept any tool applicable to their tough problems.

          What I see here is a group of broad-minded causal analysts who, having unified most available approaches under one umbrella (e.g., potential outcomes, graphical models, structural equations, possible worlds), are now inviting a community of practitioners to examine their wares and to say whether they know of any tool that can accomplish a specific task (in our case it was transportability). Strangely, instead of receiving a list of alternative methods, all they hear is a fierce defense of the virtues of having no method, because the problems are complex and “I can’t overemphasize my lack of interest in 4 variable toy logic puzzles.”

          I am leaving town soon, so I will let you judge for yourself whether your depiction of us as religious fanatics was justified.

          As Larry Wasserman commented on this blog:
          http://statmodeling.stat.columbia.edu/2009/07/more_on_pearlru/#comment-49544

          “Andrew:

          With all due respect, I think you are wrong that Judea does not understand the Rubin approach. I think he has studied it and understood it very deeply.

          It is my impression that the “graph people” have studied the Rubin approach carefully, while the reverse is not true.

          I have always been surprised by the lack of willingness of the “Rubin group” to study the work of Pearl, Spirtes, etc.

          I think Judea has tried very hard to reach out to the other group but has only met skepticism and resistance.”

          Best wishes,
          Elias

        • Elias:

          Hey, if we’re going to quote old blog comments, here’s what I wrote three years ago in response to Pearl:

          I think different theories, and frameworks, can be better suited to different problems. . . . I agree with you [Pearl] that the data alone, in the absence of substantive knowledge, will never be enough to answer causal questions. More generally, a sample will never tell us much about the population (unless it is, say, an 80% sample) unless we rely on a model for the sampling. I also agree with you that Rubin’s and Pearl’s frameworks are two different ways of allowing a user to encode such information. Ultimately it comes down to what approach, or mixture of approaches, is most effective in a particular class of applications.

          I think it’s just silly for you [Pearl] to say that the Rubin approach “relies primarily on metaphors, convenience of methods and guru’s advise.” It’s just a different approach from what you do. It’s a common belief, I think, that we are principled while others are ad-hoc. Rubin used to refer to bootstrapping as lacking in principles because it was never clear where the estimator came from. Many bootstrappers consider Bayes to be unfounded because it is never clear where the prior comes from. Some diehard nonparametricians consider probability modeling to be sloppy because it’s never clear where the likelihood comes from. And so on.

          We all use assumptions, and the methods we favor tend to seem more principled to us.

          I still believe this.

          As this discussion has illustrated, communication in this area can be difficult, hence my posting of it. I hope my points 1 through 5 near the end of the above post are helpful. And thanks again for sharing your ideas with us.

        • Dear Andrew,

          I am a PhD student really trying to understand multiple approaches in order to start my academic career. I don’t have huge stakes on the table; I just want to learn the art and science behind what has been called causal reasoning.

          It seems that I opened a Pandora’s box by asking, “hey, someone, come with me, let’s dig deeper and see where it takes us, it seems cool!”, but I can hear my own echo in the room.

          I do not agree that “they [different paradigms] are all the same,” as you stated in the original post. No matter how you slice it, you cannot brush aside the glaring asymmetry that Larry Wasserman and everyone else sees: graph-minded researchers know, use, and explore potential outcomes; potential-outcomes researchers do not know, do not use, and systematically refrain from exploring graphs (even at the cost of pages and pages of unnecessary derivations). I know this asymmetry first hand because I am part of the first group, and I have been reading quite a few articles from the potential-outcomes camp with great interest.

          Keeping this asymmetry in mind, and seeing how this conversation is going, my only hope for bridging the gap is to gather a few fresh and open-minded students who are curious about new scientific questions and new mathematical tools, and to ask them to dispassionately examine Judea’s material. Then, teach them PO, Bayesian, and other approaches, and put them in touch with other groups familiar with graphs that are doing the same. Again, this is perhaps a utopia, but I cannot envision another path; it seems that we need to hand this problem off to the next generation.

          I appreciate your effort in organizing our emails and your courage in publishing our exchange. By exposing the asymmetry, you have contributed to its eventual shrinkage.

          Best regards,
          Elias

        • Elias:

          You write, “I do not agree that ‘they [different paradigms] are all the same’ as you stated in the original post.”

          I did not recognize that quote so I searched my post for the phrase “all the same” and couldn’t find it. Could you please tell me where this quote came from?

          Just to clarify: I don’t think that different paradigms “are all the same.” Of course they’re not the same! That’s the whole point of having different paradigms, that they do different things.

        • Dear Andrew,

          Apologies, I should not have put the words “all the same” in quotation marks. What you wrote was:
          “It’s a common belief, I think, that we are principled while others are ad-hoc.”
          implying that all research paradigms have the same dismissive attitude toward the “others”.

          This symmetry of attitude does not hold in the case of graphs vs. PO, as I explained in my previous post.

          Best regards,
          Elias

        • I realize what I said sounded harsh, especially at the end. I’m glad you include yourself in the “broad-minded practitioners” camp. I guess you consider the following statements to be selling your wares; to me they come off as dismissive of practitioners who are not as singularly devoted to causal DAGs as you and Pearl.

          “out of all the theories, paradigms, and approaches that were presented here, they are the only ones willing to demonstrate explicitly how their method works, and do it on simple problems, whose solution we can anticipate in advance.”

          Researchers in other methods may not present them in your favored style, but do you really believe that no other approach to causal inference has ever been demonstrated on the kind of toy problems you’re talking about?

          “What is “sheer nonsense” however is to state (perhaps you did not mean it, but I read it all the time) that because we cannot defend a sparse graph, we can do better by no-graph, say by doing things in our heads, or assume ignorability, or use hierarchical Bayes, etc.”

          This is the main attitude that bothers me, and basically what I was writing about above. The statement is totally dismissive of pretty much all methods other than causal DAGs. What bothers me so much is the implicit claim that no one should bother studying problems where you can’t defensibly assume you know the true DAG, which describes most interesting real-world problems. Or, if you do, that you should make a set of arbitrary a priori assumptions (the a priori graph) that rules out the vast majority of possible relationships between variables.

          I’m honestly not trying to have a fight here; I’m interested in an actual answer. Take a research topic like one of Andrew’s: if you’re looking at election results, you could construct a dataset with hundreds (maybe thousands?) of variables at multiple levels, and there are thousands (hundreds of thousands?) of unobserved (and many of them unobservable) variables. There is no principled way to claim you can construct an a priori DAG that doesn’t rule out a huge majority of equally plausible DAGs. What method do you propose for analyzing those data?

      • “one reason I like DAG analysts is that, out of all the theories, paradigms, and approaches that were presented here, they are the only ones willing to demonstrate explicitly how their method works, and do it on simple problems, whose solution we can anticipate in advance.”

        Rosenbaum’s latest book has some simple examples, if that’s what you’re after. And the literature is replete with simulation studies testing different causal models/estimands. One good example is Jennifer Hill’s BART paper in JCGS, where she took (if memory serves) an RCT and corrupted it to make it “look like” an observational study.

        For most statisticians it doesn’t make sense to restrict ourselves to scenarios we can work out analytically, although such scenarios are very useful. Quite often what we learn in small examples doesn’t transfer.

    • The last go-round with Judea Pearl on this blog ended with Andrew claiming causal inference is hard and Judea claiming causal inference is simple. I think this is largely a function of Andrew having real data from real problems. The causal graph literature has contributed a lot, but it is still hard to actually do causal inference in real problems.

      • I agree. Solving one equation with two unknowns is hard for those who need to decide on the values of X and Y, and fairly easy for the mathematician, who can safely declare: we do not have enough information. Still, for decision makers to ignore the mathematics would be a disaster: a year of hard labor with no results.
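
        Literally so, in code; a trivial sketch just to make the analogy concrete:

          import sympy as sp

          x, y = sp.symbols('x y')
          # One equation, two unknowns: the solver can only report the whole
          # one-parameter family of solutions, never a single (x, y) pair.
          print(sp.linsolve([sp.Eq(x + y, 10)], x, y))   # {(10 - y, y)}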

  7. Dear Andrew,
    I thought it would be fitting to wrap up this long discussion by informing your readers that a tutorial on causal inference will be given at the Joint Statistical Meetings (JSM 2012) in San Diego, Sunday, July 17, 4-6pm:
    http://www.amstat.org/meetings/jsm/2012/onlineprogram/AbstractDetails.cfm?abstractid=304318

    In the spirit of this long discussion, the tutorial will demonstrate how counterfactuals and graphical models can work together to solve problems that neither could have solved separately. There will also be half an hour of questions, at which time readers could bring up the issues of “toy problems vs. pyramids,” “what if you do not have the graph,” “why not Bayes?” and others that came up in this forum.

    And for those who cannot make it to San Diego, the slides are available here:
    http://bayes.cs.ucla.edu/jp_home.html
    (7 lines from the top)

    Beware of the last slide, which reads:
    I TOLD YOU CAUSALITY IS SIMPLE

    Enjoy,
    Judea

    • Interesting slides. This is the kind of incredibly useful material that practitioners should understand. “Causality” should be required reading for most researchers, as it forces you to confront the details of different causal relationships and of analyzing data generated by those processes. My issue with the “causality is simple” conclusion remains the same: this material is directly applicable to real-world problems only if you have 5-10 variables and an outcome, a solid idea of what the graph looks like, a graph that permits the desired analysis, and grounds to defensibly exclude arcs or specific relationships with unobserved variables that would make the desired analysis impossible. I maintain that this fails to describe a large proportion of real research questions.

      When you have few variables, even if you don’t have tons of confidence in a known graph, it remains very useful to consider a number of plausible graphs and to reason with the tools you’ve developed to see what kinds of pitfalls might be encountered (a concrete version of that check is sketched below). Most interesting problems don’t have only a few observed variables of interest and an obvious set of possible relationships between nuisance variables (observed or unobserved) and the observed variables of interest. Unfortunately I won’t be at JSM, but if you have any reference you can point to that outlines your views on what to do when examining a question with tons of variables and no known graph, I would be very interested.
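
      Here is one concrete version of that check, a sketch of my own in which the two candidate graphs and all coefficients are invented: simulate data from each plausible structure with a known effect and see whether a proposed adjustment set recovers it under all of them.

        import numpy as np

        rng = np.random.default_rng(2)
        n, TRUE_EFFECT = 200_000, 1.0

        def adjusted_estimate(x, y, z):
            # OLS coefficient on x when y is regressed on (1, x, z).
            X = np.column_stack([np.ones(n), x, z])
            return np.linalg.lstsq(X, y, rcond=None)[0][1]

        candidates = {}

        # Candidate A: Z confounds X and Y, so adjusting for Z is required.
        z = rng.normal(size=n)
        x = z + rng.normal(size=n)
        y = TRUE_EFFECT * x + z + rng.normal(size=n)
        candidates["A: Z -> X, Z -> Y"] = (x, y, z)

        # Candidate B: Z is a common effect of X and Y, so adjusting for Z is harmful.
        x = rng.normal(size=n)
        y = TRUE_EFFECT * x + rng.normal(size=n)
        z = x + y + rng.normal(size=n)
        candidates["B: X -> Z <- Y"] = (x, y, z)

        for name, (x, y, z) in candidates.items():
            est = adjusted_estimate(x, y, z)
            print(f"{name}:  adjust-for-Z estimate {est:+.2f}  (truth {TRUE_EFFECT})")
        # A reports ~ +1.00 and B reports ~ +0.00: the same regression is right
        # under one plausible graph and badly wrong under the other.

      If the plausible graphs disagree like this, no single adjustment is defensible across them, which is exactly the pitfall-hunting use I have in mind.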

  8. Pingback: Examples of the use of hierarchical modeling to generalize to new settings « Statistical Modeling, Causal Inference, and Social Science

  9. I think there is an interesting intersection here between this discussion thread and the issue of cognitive biases. As an ecologist studying the impact of symbiotic relationships on ecosystem functioning, I know the difficulties of establishing causal relationships when working with complex, messy data (rightly emphasized by previous commenters like Matt). Nonetheless, researchers working with such messy real-world data routinely produce strong causal statements from it. The logical rigour of Pearl’s mathematically driven approach to causal analysis, employed in tandem with methods like hierarchical modelling, might be able to reduce the biases (e.g., confirmation bias) affecting ecology and the social sciences, where attributing strong causal relationships without sufficient evidence seems to be the norm.

  10. Pingback: Causal Analysis in Theory and Practice » Follow-up note posted by Elias Bareinboim
