## Understanding regression models and regression coefficients

David Hoaglin writes:

After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example.

I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other things being equal.” That sounds like the widespread interpretation of a regression coefficient as telling how the dependent variable responds to change in that predictor when the other predictors are held constant. Unfortunately, as a general interpretation, that language is oversimplified; it doesn’t reflect how regression actually works. The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand. Thus, in the county-level regression gamma-sub-2 summarizes the relation of alpha to x-bar after allowing for the contribution of u (the log of the uranium level in the county). What was the relation between the basement proportion and the uranium level? A look at that scatterplot may make it easier to interpret gamma-sub-2.

My reply: This reminds me of the old literature in statistics and psychometrics on partial correlation. Sometimes I think that with all our technical capabilities now, we have lost some of the closeness-to-the-data that existed in earlier methods. Ideally we should be able to have the best of both worlds—complex adaptive models along with graphical and analytical tools for understanding what these models do—but we’re certainly not there yet.

David followed up with:

I strongly agree that close contact with the data is often missing, though current computing and graphics should make it easier than it was years ago. Part of the gap must lie in what students are taught to do. It should be possible to overcome that.

In connection with partial correlation and partial regression, Terry Speed’s column in the August IMS Bulletin (attached) is relevant.

I continue to be surprised at the number of textbooks that shortchange students by teaching the “held constant” interpretation of coefficients in multiple regression. Indeed, Section 3.2 of Gelman and Hill (2007) could go farther than it does, by not trying to hold any predictors constant. “Unless the data support it, one usually can’t change one predictor while holding all others constant.” In Data Analysis and Regression (1977) Fred Mosteller and John Tukey devote a chapter to “Woes of Regression Coefficients.”

My reply: As Jennifer and I discuss in our book, regression coefficients can be interpreted in more than one way. Hoaglin writes, “The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand,” and Speed says something similar. But I don’t actually find that description very helpful because I don’t really know how to interpret the phrase, “allowing for simultaneous change in the other predictors.” If I’m in purely descriptive mode, I prefer to say that, if you’re regressing y on u and v, the coefficient of u is the average difference in y per difference in u, comparing pairs of items that differ in u but are identical in v. (See my paper with Pardoe on average predictive comparisons for more on this idea, including how to define this averaging so that, in a simple linear model, you end up with the usual regression coefficient.) Note three things about my purely descriptive interpretation:

1. It’s all about comparisons, nothing about how a variable “responds to change.” Why? Because, in its most basic form, regression tells you nothing at all about change. It’s a structured way of computing average comparisons in data.

2. We are comparing items that differ in u but are identical in v. Nothing about v being held constant or “clamped” (to use Terry’s term).

3. For sparse or continuous data, you can’t really find these comparisons where v is identical, so it’s clear that regression coefficients are model-based. In that sense, I don’t mind vague statements such as “allowing for simultaneous change in the other predictors.” I’d prefer the term “comparison” rather than “change,” but the real point is that regression coefficients represent averages in a sort of smoothed comparison, a particular smoothing based on a linear model.
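The “average comparison” reading can be checked numerically. The sketch below is an illustration with simulated, made-up data (not part of the original post): with a binary u and a discrete v, the Frisch–Waugh–Lovell identity implies that the least-squares coefficient on u is exactly a weighted average of the within-v comparisons of mean y, with weights n_v · p_v · (1 − p_v), where p_v is the share of u = 1 at each level of v.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data (made-up numbers): binary v, binary u that is more
# common when v = 1, and an outcome y that depends on both.
v = rng.integers(0, 2, n)
u = (rng.random(n) < np.where(v == 1, 0.7, 0.3)).astype(float)
y = 2.0 * u + 3.0 * v + rng.normal(size=n)

# Least-squares coefficient on u in the regression y ~ 1 + u + v.
X = np.column_stack([np.ones(n), u, v])
coef_u = np.linalg.lstsq(X, y, rcond=None)[0][1]

# "Average comparison": within each level of v, compare mean y between
# items that differ in u; then average with weights n_v * p_v * (1 - p_v),
# where p_v is the share of u = 1 within that level of v.
diffs, weights = [], []
for level in (0, 1):
    m = v == level
    p = u[m].mean()
    diffs.append(y[m & (u == 1)].mean() - y[m & (u == 0)].mean())
    weights.append(m.sum() * p * (1 - p))
avg_comparison = np.average(diffs, weights=weights)

print(coef_u, avg_comparison)  # the two numbers agree (both near 2.0)
```

The agreement here is exact (up to floating point), which is the sense in which the regression coefficient is a particular model-based smoothing of the raw comparisons.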

I followed up by reading a second article by Terry on linear regression. This article too was interesting, while offering points with which I disagree, or at least on which I can elaborate. Terry writes:

Why do we run [multiple regression]? . . . To summarize. To predict. To estimate a parameter. To attempt a causal analysis. To find a model. I hope it is clear that these are different reasons.

I actually don’t think these are so different. More in a bit, but first another quote from Terry:

Think of the world of difference between using a regression model for prediction and using one for estimating a parameter with a causal interpretation, for example, the effect of class size on school children’s test scores. With prediction, we don’t need our relationship to be causal, but we do need to be concerned with the relation between our training and our test set. If we have reason to think that our future test set may differ from our past training set in unknown ways, nothing, including cross-validation, will save us. When estimating the causal parameter, we do need to ask whether the children were randomly assigned to classes of different sizes, and if not, we need to find a way to deal with possible selection bias. If we have not measured suitable covariates on our children, we may not be able to adjust for any bias.

Terry seems unaware of the potential-outcome framing of causal inference, in which causal estimands are defined in terms of various hypothetical scenarios. In that approach, causal estimation is in fact a special case of prediction. To put it another way, Speed’s “relation between our training and our test set” and his “possible selection bias” are just two special cases of the requirement that a model generalize to predictions of interest.

Terry continues:

I would like to see multiple regression taught as a series of case studies, each study addressing a sharp question, and focussing on those aspects of the topic that are relevant to that question.

I doubt Terry’s seen my book with Jennifer Hill, but actually we pretty much do what he recommends. So I recommend he take a look at our book! I’m sure we don’t do everything just how he’d like, but it could be a useful start for the next time he teaches the subject.

1. Anonymous says:

Perhaps related:

• Andrew says:

Anonymous:

I read the link and agree with the general sentiment but I disagree with the author’s statement, “as Bayesian I can afford to speak about direct support for hypotheses (unlike frequentists who can only reject them).” That’s just silly. Bayes is great, it allows us to use prior information and automatically gives probabilistic predictions, but it’s just a way of doing inference, it doesn’t fundamentally change the nature of statistics.

2. judea pearl says:

Andrew,

I have naively assumed that the century-old confusion regarding the meaning of regression coefficients in terms of “response to changes” and “holding constant” was all but settled twenty years ago. The posted discussion between you and David Hoaglin reminded me that it hasn’t. So allow me to offer another perspective on the discussion, one that should appeal to those who prefer to see things defined in terms of what they mean, rather than how they are estimated.
Regression vs. structural coefficients:
By writing down a regression equation

Y = alpha X + a1 Z1 + a2 Z2 + … + ak Zk + eps

one intends the meaning of alpha to be

alpha = E[Y | X = x+1, Z = z] - E[Y | X = x, Z = z]    (1)

where Z stands for the vector (Z1, Z2, …, Zk). In words, alpha is the shift in the conditional expectation E(Y|X), per unit increase in X, limited to observations in which Z happened to be the same.

In contrast, the meaning of the effect coefficient beta is

beta = E[Y | do(X = x+1), Z = z] - E[Y | do(X = x), Z = z]    (2)

where the operator do(X = x) represents the action of “fixing” or “clamping” or “holding X constant at X = x,” which is well defined within any structural equation model. For example, if our model reads Y = f(X, Z, eps'), then

E[Y | do(X = x), Z = z] = E[f(x, Z, eps') | Z = z]

(where f may be any arbitrary function, and (X, Z, eps') a set of arbitrarily distributed random variables).

(end of definitions)
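Dropping Z for simplicity, the distinction between (1) and (2) can be checked in a small simulation (an illustration with invented numbers, not part of the comment): here the structural coefficient is 1, but an unobserved common cause of X and Y pushes the regression coefficient to about 2, and imitating do(X = x) by drawing X exogenously recovers the structural value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
b = 1.0  # the structural (effect) coefficient in Y = b*X + 2*U + noise

# Observational regime: X shares an unobserved cause U with Y.
u_latent = rng.normal(size=n)
x = u_latent + rng.normal(size=n)
y = b * x + 2.0 * u_latent + rng.normal(size=n)
alpha = np.cov(x, y)[0, 1] / np.var(x)  # regression coefficient, as in (1)

# Interventional regime: do(X = x) cuts the influence of U on X,
# so X is drawn exogenously while everything else stays the same.
x_do = rng.normal(size=n)
y_do = b * x_do + 2.0 * u_latent + rng.normal(size=n)
beta_hat = np.cov(x_do, y_do)[0, 1] / np.var(x_do)  # effect coefficient, as in (2)

print(alpha, beta_hat)  # alpha is near 2.0; beta_hat recovers b = 1.0
```

Both numbers are estimated by the same least-squares formula; what differs is the regime under which X was generated, which is exactly the regression-vs-structural distinction.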

Now, before I am accused of defining one difficult concept (beta, the effect coefficient) with a harder one (the do-operator), note that the latter is well defined in (hence, readily computed from) any structural equation model and, surely, one cannot expect to get causal concepts (beta) from anything but a causal model, namely an SEM.

Further, before I am accused of being “unaware of the potential-outcome framing of causal inference, in which causal estimands are defined in terms of various hypothetical scenarios” (from Gelman’s comment on Terry Speed), note that the potential-outcome framing is unnecessary, if not diversionary. The definition

E[Y | do(X = x), Z = z] = E[f(x, Z, eps') | Z = z]

invokes no hypothetical scenarios, no unobserved potential outcomes, and no cryptic “ignorability” assumptions; it comes directly from our model (the SEM) of the phenomenon under study.

(As you may know, the entire “potential outcome” enterprise, with its rules, theorems, scenarios, and assumptions, follows directly from SEM, as has been shown in a number of publications (e.g., Appendix 1 of http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf). So why do I prefer SEM? The advantages are enormous but can be appreciated only by those who actually try to solve a problem in both frameworks, from beginning to end. So far, only my students have done it; the rest of the world speculates and speculates, yet will not dare to try.)

I believe that keeping in mind the simple distinction between Eq. (1) and Eq. (2) should help illuminate the “confusion of the century” and render it a relic of a bygone age.

Reference:
Lindley (2002), “Seeing and doing: The concept of causation,” International Statistical Review 70:191-214.
http://bayes.cs.ucla.edu/BOOK-2K/lindley-rev.pdf

• Andrew says:

Judea:

Thanks for your comment, which I don’t see as contradicting anything I wrote in my post. You are talking about a particular area of application of regression (albeit a very important area), I am talking more generally about the statistical procedure. I agree with you that it is important to be able to go back and forth between different interpretations. In some applied settings, regression is used for estimating causal relations, in other settings not.

• alex says:

I’ve got a question. Are AG and JP saying the same thing? AG puts his definition in terms of an average difference (which is defined in respect of actual data points), while JP puts his in terms of a conditional expectation (which is defined in respect of the first moment of a random variable). Aren’t those slightly different concepts?

3. judea pearl says:

Andrew,

No, I have not contradicted anything you wrote; I was merely explaining why people keep coming up with interpretations that involve the notions of “response to change,” “held constant,” “clamped,” etc. These interpretations emanate from confusing regression equations with structural equations, where the coefficients indeed are defined in terms of “response to change” and “held constant.”

I think you misunderstood me when you said: “You are talking about a particular area of application of regression (albeit a very important area), I am talking more generally about the statistical procedure.”
This is not the case — I am talking about the broadest possible meaning of regression equations and regression coefficients.

The meaning is always defined in terms of properties of a model, in our case properties of the joint distribution or, more particularly, the conditional expectation. This interpretation does not vary with the application, nor does it vary with the procedure one uses to estimate the regression coefficient. I hope you agree with me on this point.

The fact that the value of a regression coefficient SOMETIMES coincides with a causal relation does not in any way change the MEANING, or INTERPRETATION, of the former. The two interpretations are distinct in much the same way that the meanings of the “mean” and the “median” remain distinct, despite the fact that, under certain conditions, one may choose to estimate the latter using the former. Now, because I attribute only one interpretation to regression coefficients, I cannot see “different interpretations” there, and I cannot join your recommendation: “it is important to be able to go back and forth between different interpretations.” I would put it differently, and say: “it is important to be able to know when a given regression coefficient can represent a causal relation.” I hope this is what you had in mind, in which case we think alike.

I would only add here that the “confusion of the century” evolved when researchers started to talk about “different interpretations” of regression coefficients, sometimes predictive and sometimes causal. (I do not blame you for this; it started in the time of Pearson and Yule.) I merely wish to solicit your assistance in preventing the confusion from re-emerging in the 21st century.

BTW, can you guess why the confusion arose between alpha and beta (defined in my previous post) and not between the “mean” and the “median”?

• Andrew says:

Judea:

All I’m saying is that you care particularly about regressions for causal inference, and I’m talking about regression as data description. Causal inference is one underlying model of regression but it’s not the only model. Several people I respect very much (including Don Rubin and Jennifer Hill, and maybe you too would have that view) have argued that, even in what I consider purely “descriptive” settings, I really have causal ideas underlying my questions. And maybe that’s right. I think underlying structural models and causal models are great. But I also think it’s helpful to understand regression models in their purely descriptive sense, hence my post above.

• judea pearl says:

Andrew,

We seem to agree on everything, with one exception. You keep on saying that “You (Pearl) care particularly about regressions for causal inference and I’m (Gelman) talking about regression as data description.” And I keep on saying the opposite: my definitions apply to all regression equations and regression coefficients, including regression as data description.

Where did you get the impression that I exclude “regression as data description”? Are any of my definitions inapplicable to “regression as data description”?

But let’s forget false impressions; I am more curious to know how you propose we should prevent the regressional-structural confusion from re-emerging in the 21st century. From your reaction to David Hoaglin’s post I gather that you do think some confusion still exists in this area, and that this was the reason you proposed an interpretation of regression coefficients that does not use “clamped,” “response to change,” or “held constant.” Hats off. But, given that people continue to invoke those causal notions in their thinking even when it comes to regression as data description, what do you propose we should tell our students about “regression” vs. “response to change”?

1. Can a mixup ever lead to errors? How? Or perhaps we should let it go?
2. Can we prevent the mixup by using clear mathematical definitions of “alpha = regression coefficient” vs. “beta = response to change”?

Can you share with us what you tell your class on points 1 and 2?

Note, point 2 is mathematical, so it can be done in just a couple of formulas. I have offered you my formulas, to which you did not object. But I am eager to see your formulas, because I know you do not use mine in your class and I am curious to know what alternatives exist, if any.

4. Fernando says:

If I read the above discussion correctly Pearl is suggesting students should learn the two definitions he provides above, and in applications set out which one they are using (e.g. regression or structural coefficient).

But I’m not sure whether Pearl is instead suggesting more than this, namely that we only use the “do”-calculus definition. I would find that too restrictive and not as general as he implies.

On the other hand, the two definitions together cover every possible case, so if that was the intention I am with Pearl: for prediction carry on as usual; for causal inference use the do() operator.

PS if you treat counterfactuals as missing values then you can use prediction to impute them (e.g. without an explicit causal theory or structural model, just a brute imputation), in which case one would not use the do() operator. That may have been Andrew’s point. Of course if we really did not have a causal theory we would have no way of judging whether the estimated effect is biased or identified. But that is a feature, not a bug, of a predictive approach.

• judea pearl says:

Fernando,

I am the last person to suggest we only use the “do”-calculus definition, not only because I am a pluralist, but primarily because it is silly to use the do-operator in predictive tasks, where classical statistics can do the entire job well.

I am intrigued, however, by your statement that

“PS if you treat counterfactuals as missing values then you can use prediction to impute them (e.g. without an explicit causal theory or structural model, just a brute imputation), in which case one would not use the do() operator.”

Especially intriguing is how one can do causal analysis without an “explicit causal theory,” which I wish you could elaborate on. We all know that a causal theory of some sort is inescapable. Now, if one leaves the theory IMPLICIT, it means that one leaves some assumptions concealed from scrutiny or, to be polite, shoved under the rug. Is that what the “counterfactuals as missing values” framework advocates?

But rather than speculating on what one framework or another advocates, I have been begging for one example worked out from beginning to end in both frameworks (1. structural, 2. potential-outcome); so far none has been offered on this blog (with the exception of the one worked out in my book, on smoking, tar and cancer). From this non-response it is very tempting to conclude that potential-outcome experts are not too proud of the way they solve problems. Perhaps you can defend their reputation by showing us how to solve one toy problem by “brute imputation,” with or without a causal theory. One big request! Please do not send us to look it up in a famous book — they all hide what they are doing. It should not take more than 4-5 lines to do it here, on this friendly blog. And, please believe me that I am not asking for this example as a trap — I honestly do not know how it is done. The experts are all evasive.

• Fernando says:

Judea:

I was using irony. I agree with you that the predictive approach is not suited for causal inference. This is inherent to the approach. Hence the next sentence, about features and bugs, that you left out of the quote.

PS I’m no advocate of counterfactuals. I find the notion of what would have happened etc. intriguing, but I find the notation a pain. And yes, in that framework identification often comes ex nihilo via some appeal to ignorability or exogeneity that I find unsatisfying.

• judea pearl says:

Fernando,

We totally agree: “[having] no way of judging whether the estimated effect is biased or identified…is a feature, not a bug, of the predictive approach.”

In other words, the “predictive approach” has basic flaws that cannot be brushed aside as “bugs.” I sometimes wonder, therefore, what qualifies it for the title “approach.” Borrowing Bertrand Russell’s phrase, it seems to be more of “a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm…”

Seriously, what makes the “missing data” or “predictive” approach so addictive that its inherent flaws are ignored and its surface features revered?

One reason, of course, is the authoritative way in which it is trumpeted in the royal courts of the statistical establishment. Examples:

1. “…viewing all causal inference as the problem of missing potential outcomes has been the most effective method for conceptualizing and clarifying critical issues.” [D. Rubin, abstract, JSM-2012]

2. “[Causal] estimation is basically a missing data problem, we only get to see the outcome from one treatment… No theorems, so what’s new?” [R. Little, slides, Fisher Lecture, 2012]

3. Even on this blog, in a comment on Terry Speed, Andrew Gelman held the “predictive approach” as a canon of causal thinking: “Terry seems unaware of the potential-outcome framing of causal inference…In that approach, causal estimation is in fact a special case of prediction.”

But I think the core reason lies elsewhere — the missing-data paradigm legitimizes laziness by creating the illusion of familiarity; there is nothing out of the ordinary here, causal inference is just a special case of prediction. In other words, you can continue to use your favorite model-fitting software and things will turn out ok; no one would be able to tell if what you got is what you wanted.

This also explains why we have not seen anyone from the missing-data camp work out a toy example in both the structural and potential-outcome styles. You simply cannot do the former using standard model-fitting software. If you try, you will be forced to reveal the assumptions you made about how the world operates, and this might be embarrassing; it is safer to hide such assumptions in the software.

• Andrew says:

Judea:

That’s fine that you prefer your own method. We all tend to prefer methods with which we are familiar. I have found the “do” operator to be mysterious and hard to map to the problems that I study, and I prefer the missing-data approach to causal inference (in which there is a probability model for all potential outcomes). But I respect that you prefer a different formulation, and I respect that many people find your formulation to be helpful. I think it can be a mistake to attempt too much unification before a field is ready. For example, Newton and his contemporaries had somewhat incompatible theories of mechanics and optics. But there would be no way to unify these for hundreds of years. Until then it was useful to have multiple perspectives.

• Fernando says:

Pearl:

We agree. When wearing my economist hat I like to think of DAGs as providing the “micro-foundations” for causal inference.

And, to continue the analogy, we can think of the missingness approach as doing macroeconomics without micro-foundations: Many people do it, we have learned something from it, but inferences thus derived are highly insecure.

I am surprised Bayesians take that approach: it does not fully account for the prior knowledge about the causal process generating the observed outcomes. (A causal model implies a probability model, but not all probability models are compatible with the causal knowledge.)

• Andrew says:

Fernando:

Please see chapter 7 of Bayesian Data Analysis (second edition). We indeed discuss models for the processes generating the data and the observations. Also the paper by Angrist, Imbens, and Rubin is relevant here.

5. Fernando says:

Andrew:

I will take a look. Obviously many people using a missingness approach (or doing macro without explicit micro-foundations) are very careful in ensuring their probability model is compatible with the causal knowledge. Economists do this routinely when they write down a SEM. So in some ways I am criticizing a straw man for the sake of argument. I would not say Rubin does sloppy work.

Rather the argument is not so much who is making better use of causal knowledge — all careful researchers try to do their best — so much as what is the best language to discuss these identification questions.

And here I would argue that it is useful to encode our causal knowledge in a DAG (or SEM) and then derive the probability model from it, rather than discuss some intuition about the causal model and then jump straight to the probability model. I would also argue for distinguishing P(Y|z) from P(Y|do(z)), as it removes a layer of ambiguity.

In a nutshell, and by analogy, very good programmers can get by using a programming language that includes GOTO statements, but perhaps because I’m not such a good programmer I’d rather use a language that excludes them. Similarly, I prefer Arabic numerals over Roman ones for doing algebra. Our disagreements are largely semantic, but semantics do matter.

• judea pearl says:

Andrew,

I think your perception of the causal inference field is harsh and overly pessimistic. You say: “…it can be a mistake to attempt too much unification before a field is ready”… “Newton and his contemporaries had somewhat incompatible theories of mechanics and optics.” You see here two incompatible theories refusing to be unified, and all we, scientists, can do is recognize our feeble-mindedness, accept the incompatibility, and say: “you do what you want, and I do what I want, and let no one compare tools and results.”

This is terribly depressing. Anyone who has attempted to solve a causal problem in both perspectives will tell you that you are very, very wrong. Not only is the field ready for unification, but unification was completed fifteen years ago. Today we understand precisely the “particle-wave” duality between the “missing-data” and structural perspectives, and we find no incompatibility between the two. While each has its merits and weaknesses, we understand what those merits and weaknesses are, and we can adopt a symbiotic framework that benefits from the merits of both.

My exchange with Fernando dealt with one weakness of the “missing-data” approach that is easily correctable. It can be seen in your statement: “I prefer the missing-data approach to causal inference (in which there is a probability model for all potential outcomes).” Have you ever seen such a probability written down? Or described in a table or an equation? In general, such a probability model would need billions of parameters to be fully specified, even for a small-size problem. If you give us one example (one line, one equation) of how you use a “probability model for all potential outcomes,” I will be able to use it to shed light on the issue of unification. But it has to come from you.

BTW, have you ever tried to solve a problem in both frameworks simultaneously? (An example is given in my book, pages 232-234.) If not, I cannot overemphasize its importance; you will be thanking me forever.

• Andrew says:

Judea:

I recognize that you think you understand things precisely. Perhaps Savage thought he’d nailed Bayesian inference back in 1952, but there were many important things he was missing. I applaud the quest for grand frameworks but they often leave things out.

• judea pearl says:

Andrew,

I think you misread my post. I did not say that “I think that I understand things precisely…” I said that “anyone who attempted to solve a causal problem in both perspectives will tell you that…unification has been completed, and…there is no incompatibility between the two.”

Do you know ANYONE who attempted a solution and came up with a different conclusion? I politely asked you if you do; your answer was Savage (1952) and how he got important things missing.

So here we are, in the second decade of the 21st century, bearing witness to two kinds of reports. The first comes from people who actually tried to solve a problem two ways; the second comes from people who refuse to engage in the exercise.

When Galileo first presented his observations on the sunspots, so we are told, Church officials refused to look through his telescope; they reasoned that the Devil was capable of making anything appear in the telescope, so it was best not to look through it.

And here we are again, except it is the second decade of the 21st century… The rest will be told in the history of causal inference.

• Andrew says:

Judea:

Given that I am using a historical analogy from 1952, I suppose it’s only fair for you to use an analogy from the 1600’s! Let me explain the relevance of my analogy. I don’t know about Savage himself, but many of his followers seemed to believe that his framework was enough for all Bayesian statistics. I don’t think it is, though (as described starting on the very first page of Bayesian Data Analysis).

I don’t disagree with your statement that the methods used by Rubin can be placed in your framework. My feeling is that all these frameworks are still missing some important pieces, in particular how to actually specify the models (even down to specific questions such as when normal distributions and logistic regression models are reasonable). I wasn’t trying to say that your methods and Rubin’s are incompatible; I was saying that I think all these methods have some missing pieces. (I’ve argued with Rubin about this too!)

One of the advantages of your framework (as described, for example, by Fernando in this discussion) is that, for many people (including you, Fernando, and many others), your framework allows clear statement of assumptions. It’s not so clear to me (I remain confused by the concept of the “do” operator) or various others (e.g., Imbens), but I respect that many people find it useful. I think it’s great that different people can use different approaches to get similar results. In this position, I am completely different from the people who censored Galileo.

• Andrew says:

Judea:

For an example of where potential-outcome thinking helped me understand a problem better, see this recent blog post (which I actually wrote a month or two ago but just happened to appear today). I say this not to disparage your framework in any way, just to give an example of where the ideas of interventions and potential outcomes have been helpful to me.

• judea pearl says:

Andrew,

I don’t mind being criticized, or being ignored, but, for the life of me, I can’t understand how smart scientists can show zero curiosity when new tools are introduced that can make their work so much easier.

By definition, “new tools” entail an investment in getting comfortable with them. For example, you say that you are “confused by the concept of the ‘do’ operator.” This is natural. But if you just ask, the answer will be given to you on a silver platter, no philosophy and no evasive references to learned books. Here it is, in one equation:

P(Y = y | do(x)) is none other than P(Y_x = y)    (1)

Now the problem of interpretation is resolved. If you feel comfortable with the potential-outcome probability P(Y_x = y), you should also feel comfortable with the do-expression P(Y = y | do(x)), and you can use the latter to express knowledge or interpret claims. Duck soup.

What remains? How to express assumptions using do-expressions? Using graphs? Just name your question and life will be brighter than you ever thought.

You feel that “all these frameworks are still missing some important pieces.” Great, I am eager to find new challenges. How about telling us what concrete facilities are missing in these frameworks? You will be surprised: either we will discover that they actually exist in these frameworks, or a new research project will be launched to discover them, or a mathematical proof will be presented that what you are asking for is undoable in any framework. I am inclined to vote for the last option, but need further clarification of what it is that you would like to see accomplished.

• Andrew says:

Judea:

You write, “P(Y=y|do(x)) is none other than P(Y_x =y).” Sure, but the problem is that, in general, I don’t see the relevance of do(x) or of Y_x in many situations. I understand the idea of an intervention but I don’t understand the idea of “clamping” x or whatever. I get at some of this in the example here. In general, there can be many possible interventions that can change a variable x, and these different treatments can have different effects on y. To me it does not make sense to speak of setting x to a value without considering how this is done, which in many cases will require a new variable to be added to the model. Your framework can incorporate that, I agree, but then I don’t see the benefit (to me) of the graph.

My confusion on this is not just about your models. I have similar problems with some of Rubin’s formulations as well. I think he understands what he’s doing but it all leaves me confused.

I agree with both you and Rubin (in your different ways) that mathematical formalism can be helpful. As Rubin pointed out once in a talk, even the great R. A. Fisher made mistakes in causal inference: once he made the notorious mistake of controlling for an intermediate variable in estimating a causal effect, in a context where it was clear in retrospect (and I think you’d agree too; sorry, I don’t have the reference right at hand) that it was a mistake. Fisher was a famous avoider of mathematical formalism, and in this case he was hurt by that avoidance.

So I’m sympathetic with your general goals and in that sense I am glad that your approach is so popular, even though it leaves me confused.

• Andrew says:

Judea:

I think I understand your frustration here. Let me give an analogy that I think might help. It’s a hypothetical conversation between me and a non-Bayesian statistician:

Andrew Gelman: Bayes is great, it’s solved so many problems for me, blah blah blah . . .

Non-Bayesian: Sure, but I don’t believe the prior.

AG: You don’t need to believe the prior, just use it as part of your model. 8 schools example, toxicology example, blah blah . . .

NB: OK, but I don’t feel comfortable with it. In many settings, the idea of a probability distribution on a parameter makes no sense.

AG (sputtering): But what about the likelihood—the data model. You’re willing to make huge assumptions there but then you balk at the prior!

NB: Sure, I recognize that the data model is an approximation—but it’s an approximation of something I don’t understand. The prior is an approximation to something that makes no sense to me. And, one more thing . . . you keep saying that the likelihood is the big assumption, that the prior is no big deal, right?

AG (suspiciously): Yeah?

NB: In that case, why is the prior needed? Or, to put it another way, if the prior is such a small part of the model, how can you go around saying it makes such a difference? And, if it does make such a difference, aren’t you worried about getting it wrong and having no real way to check it?

AG: Predictive checks blah blah blah . . .

NB: Sure, but that doesn’t do you much good for the highest level of the hierarchy—your hyper-hyper parameters or whatever.

AG (weakly): Robustness . . . often results are robust to the reasonable variations in the prior . . .

NB: That’s great. I’m glad your methods are robust and that you’ve convinced others to use them. But I still feel uncomfortable putting priors on parameters.

AG: Often the prior distribution encodes real prior information that I want to include in the model.

NB: That’s fine, I have no problem with priors in those settings, it’s in the other cases that I don’t feel comfortable, in those cases where I have no prior information.

AG: But you always have some prior information.

NB: True, but in many cases of interest, the amount of prior information I have is so small that there would be no point in putting it into the analysis.

AG: But then you’re operating in a discontinuous fashion: when you have very weak prior information you use non-Bayesian methods, then you switch to Bayes in some settings where your prior information is strong.

NB: Sure. But I have no obligation to work within a single consistent frame of reasoning. That’s your hangup. I just want to get good answers; I don’t sit around worrying about coherence and Dutch books and all that.

AG: But all I care about is getting good answers too. Look at all the applications I’ve worked on!

NB: That’s fine. I’m glad your methods are useful for you. I respect that Bayesian ideas have been useful for mainstream statisticians (for example, they inspired the very useful “lasso” idea), and I have no problem with people using such methods. But, as for me, I don’t in general feel comfortable assigning a probability distribution to a parameter in a model.

AG: But non-Bayesian methods such as maximum likelihood can be understood as special cases and approximations of fully Bayesian methods. My framework includes yours as a special case!

NB: Good for you. And if I knew what to think about your priors, I’d probably find that helpful. As it is, I recognize that these mathematical connections are often useful, both as tools for understanding and as theoretical devices which might help develop new models. Just don’t ask me to take your interpretations at face value. If I’m doing regularized maximum likelihood, I’m doing regularized maximum likelihood and I’ll understand it as a statistical procedure in its own right. You can think of it as a Bayes procedure, and I don’t dispute your mathematics, I just have no particular use for the interpretation.

AG: Fine. Take the example of regularized maximum likelihood. If you use such methods, you need to pick a tuning parameter. Bayesian methods work well here: you treat the tuning parameter as a hyper parameter, give it a hyperprior, etc.

NB: I’ll buy that. A Bayesian framework can be a useful technical tool to set tuning parameters in a regularizer. But the concept of a prior distribution as a distribution for a parameter, that’s the part I don’t believe. If Bayesian methods are used to create an estimator, I’ll evaluate the estimator using statistical principles.

Etc etc etc.

The conversation can go on forever. It’s not a useless conversation: Bayesian ideas can indeed be useful in constructing statistical procedures, and extra-Bayesian ideas remain relevant in evaluating the usefulness of Bayesian methods in settings where the model is only approximate (that is, in all statistical settings). But that doesn’t require me to convince anyone that prior distributions make sense in general, any more than you’ve convinced me that the “do” operator has any sensible meaning in general. I see your (Judea’s) causal methods as the non-Bayesian sees my Bayesian methods: (1) your methods are tools for understanding and making sense of some existing statistical ideas (just as Bayes is a mathematical framework that unifies and generalizes ideas of regularization); (2) your methods are popular, and I have to respect that they appeal to a lot of people; (3) your methods have been useful in various applications. I can, and do, hold to (1), (2), and (3) without understanding or accepting the “do” operator. You might be right that it would be better if I were to learn your methods, but one reason for the difficulty of communication is that we think about different sorts of problems. And that’s fine with me—different methods work better on different problems.

• Paul says:

Andrew & Judea — thank you for this discussion!

• judea pearl says:

Fernando,
You are very generous to people using missingness — I am not too sure that they
can be “very careful in ensuring their probability model is compatible with the causal knowledge.”
I say it not because I doubt their integrity, but because I doubt the capability of the human mind to check compatibility using inhuman notation. Here is an example: suppose I give you three ignorability statements: Y_x _||_ X | Z, Z_y _||_ Y | X, Z_yx _||_ X | X_y.
Can you tell if they are compatible? Redundant? Compatible with your causal knowledge about X, Y, Z?
Can you tell if they have testable implications? I can’t, and I don’t think any of them can.
Yet they insist on using ignorability to encode knowledge, and they claim to be careful because no one challenges them with a concrete example.
I just did; let’s see if anyone responds.

6. Fernando says:

Judea:

As you know, we agree. But I think your criticism above is slightly misplaced. Your argument is Wittgensteinian: if you cannot name it (E), you cannot know it (K), or K iff E. I think that is wrong. I would argue K implies E but not vice versa.

Therefore, to say that a language cannot fully express causal knowledge does not imply its speakers cannot identify causal effects, in the same way that lacking a word for asparagus would not prevent me from making an asparagus omelette.

But this is a technical point. Obviously given the choice I would rather use a language that fully expresses my knowledge. So whereas I am willing to grant Rubin is careful about his work, I would also add he is running unnecessary risks by using an ambiguous language. And I can see why you would want to change that.

• judea pearl says:

Fernando,
I did not get the analogy with Wittgenstein, but
I would like to find out where you think my criticism is slightly misplaced.

Recall, I did not criticize Rubin personally. I criticize the culture he created,
and especially his disciples, young and old, for fearing to look through a telescope
decades after it was invented and put into practice.
Would you say that these disciples are being “careful about getting their astronomy right”?

BTW, as an ex-economist you might be interested in the latest version of my paper
on Haavelmo, especially the new section on “What held the Cowles Commission at Bay”
http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf

• Fernando says:

Judea:

What I was trying to say is that one need not know Newtonian mechanics in order to play billiards. Similarly, you don’t need DAGs or counterfactuals in order to do useful causal inference (cf. John Snow), though these can help enormously.

But I think this touches on an interesting question about the role of notation in knowledge generation.

In principle you write down what you already know: it is the knowledge base (K) already in your head that guides its expression in some formal language statement (E) (as when you draw a DAG). But if the language you use has less expressive power than what you know, in principle that is OK: it should not diminish your knowledge base even if you cannot write it down fully.

So, if both Pearl and Rubin know K, then given the same data they should arrive at the same conclusions Q — even if they have different languages E_r and E_p. The fact that a language has less expressive power does not render the speaker of that language more stupid.

That’s in principle. But then what is the point of a language? Communication is one obvious aspect but, just as important I think, is its capacity to leverage knowledge and discipline our thinking, even when all we do is talk to ourselves.

Like the usefulness of an abacus, this has a lot to do with humans’ computational abilities. Thus I find that by writing down my knowledge I often end up learning something new. And I’m persuaded the amount that I learn depends on the language that I use. That is why, for the sorts of problems I deal with, I prefer E_p to E_r.

• K? O'Rourke says:

Fernando: Very clear and thoughtful comments here.

Judea: From the preface of Wittgenstein’s Tractatus: “what can be said at all can be said clearly, and what we cannot talk [clearly?] about we must pass over in silence” (but here we know this is a model and so it is false, and my addition _clearly_ is in the eye of the beholder).

Think there may be some risk/cost issues at play here, in terms of the risk of a really wrong model being taken too seriously, and the cost of learning how to do this and then doing it for complicated examples.

_Everyone_ wants to get less wrong as quickly as possible.

• judea pearl says:

Fernando,
You have expressed it much better than I ever could.

There is more to language than just convenience of notation.
Our ability to hear ourselves thinking depends on having
a language that talks to our brain in order to stimulate it and make
sure we stay faithful to our experience.

“So what?”, says the purist, “eventually you get the same result,
so who cares?” Wrong! The same results would be obtained
if the same knowledge is assumed (same K). But if one language is opaque,
so you cannot tell whether what you put down on paper reflects
what you believe is true in the world, and the other language is transparent,
so you can scrutinize immediately whether you believe what you
wrote down formally, then the choice between the two is not
merely a matter of “convenience,” but a matter of being right or wrong.

One correction, the expressive powers of the potential-outcomes and
structural models are the same.

• Fernando says:

Thanks. And yes, I was ambiguous about expressive power. So the languages fully translate (a bijection, if you will), just like Roman and Arabic numerals; but the positional system of the latter makes them so much more powerful for arithmetic. That is what I had in mind by greater expressive power. I should look up a formal definition, though.

• Andrew says:

Fernando:

I see what you mean with your analogy to Roman vs. Arabic numerals, but I don’t think it’s quite right. My impression is that just about everybody can do just about any mathematical operation faster and more accurately with Arabic than with Roman numerals. But this is not so regarding Pearl’s do-calculus, which is helpful in some settings but to me doesn’t do much in an example such as this.

A better analogy might be bootstrapping vs. Bayesian inference as two ways of getting standard errors. In some settings (for example, estimating a population average from a complex survey), bootstrapping does well and Bayes is a pain in the ass. In other settings (for example, analysis of space-time data) you have to do lots of contortions to get a good bootstrap, while Bayes is relatively straightforward.

I recognize that my analogy is not perfect either. Regarding various frameworks of causal inference, my point is that, although they may have complete generality in some mathematical sense, they have some gaps when I want to apply them to the external world. Just as I respect Pearl’s ideas in part because of their popularity among computer scientists, I think it makes sense to respect Neyman’s and Rubin’s ideas in part because of the clear mapping between interventions and potential outcomes (as illustrated in my recent post here).

7. Fernando says:

Pearl:

1. I think the reason Cowles did not do so well has more to do with bad predictions than with lack of principles 1 and 2. My sense is that during the 80s-90s big structural equation models were being replaced with time-series forecasting models like ARIMA and VARs that had much better predictive accuracy. It seems people were using a wrench when what they wanted to do was hammer a nail.

2. Whether you have solved or not the external validity problem depends on how you define the problem in the first place. I do not see the problem as one of lacking the language to encode assumptions that license extrapolation. The problem of external validity arises from the need to make extrapolations in the first place, and so, in some sense, it is unsolvable. Rather than talk about “solutions” I would prefer to talk about “disciplined approaches to” dealing with a problem.

By analogy, I would not say that assuming the graphical notion of faithfulness resolves the real possibility of God being an evil deceiver in His choice of Nature’s parametrization. The problem is always there, we can only quarantine it.

• judea pearl says:

Fernando,
(1) Lousy predictions, the Lucas critique, and retreating to forecasting (away
from policy guidance) are the factors usually cited for the decline
of the Cowles Commission program. However, the program actually continued as
a research paradigm from 1950 till today, as can be seen in the works of
Heckman, Leamer, Matzkin, etc., and in all econometrics textbooks.
(They all teach structural equations, some with ARIMA and VAR, and none claims
that econometric models cannot offer guidance for policy making.)
So my question remains: why haven’t they developed the tools to solve even
the most elementary questions in nonparametric models?

(2) A “disciplined, formal treatment” of problem X becomes a “solution of X”
when no one proposes an alternative formal definition of X (and the problem
has been kicked around, informally, for half a century). We use both titles
interchangeably but, I agree with you, claiming “a solution of X” gets readers
irritated and should be minimized. Thanks.

• Fernando says:

On (2) I think that if you feel justified in claiming to have solved the external validity problem, then you should also feel justified in claiming to have solved the fundamental problem of causal inference.

I doubt you would want to claim the latter, so by implication you might want to avoid the former. (This is in no way to diminish the importance of your findings).

8. Fernando says:

Andrew:

I have to disagree with your examples on genetic diversity. OK, one first needs to conceptualize what the intervention is. But once you have defined all your variables, put them as nodes on a piece of paper and start drawing connections. An assumption is a connection not drawn. When you are satisfied, check whether the effect is identified using d-separation, the front-door criterion, etc. At least we would have something more concrete to talk about.

That said, I sympathize with you that this forces a binary distinction between variables: they are connected or not, whereas you might want to say that with probability x they are connected and with probability 1 - x they are not. You could do that across all edges and then sample graphs and make statements like: with 80% probability, given my priors, the ATE is identified and equals k; otherwise it is not identified and k is a measure of non-causal association.
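A minimal sketch of this graph-sampling idea (everything here is illustrative: a single uncertain unobserved common cause U, made-up edge probabilities, and a toy identification rule for the X -> Y effect):

```python
import random

random.seed(0)

# Uncertain edges and prior probabilities that they exist (illustrative numbers).
edge_prob = {("U", "X"): 0.6, ("U", "Y"): 0.5}   # U is unobserved

def identified(edges):
    # With X -> Y certain and U unobserved, the effect of X on Y is identified
    # from observational data iff U is not a common cause of both X and Y.
    return not (("U", "X") in edges and ("U", "Y") in edges)

# Sample graphs from the edge priors and tally how often the effect is identified.
n = 100_000
hits = 0
for _ in range(n):
    edges = {e for e, p in edge_prob.items() if random.random() < p}
    hits += identified(edges)

print(f"P(effect identified) = {hits / n:.2f}")   # analytically 1 - 0.6*0.5 = 0.70
```

With real priors over every candidate edge, the same loop would yield statements of exactly the form described above: “with probability p the ATE is identified (and equals k).”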

• Andrew says:

Fernando:

I disagree completely. I think it’s meaningless to try to understand a causal effect without having an idea of what is the potential intervention. This has nothing to do with pieces of paper or connections or the front-door criterion or d-separation or anything else, nor does it have to do with normal distributions or logistic regressions or whatever. I believe all variables in an observational study are connected (in your terminology, x = 1), but this does not help resolve the causal question. In the example cited, I think it’s meaningless to talk about “increasing the diversity of Bolivia” without considering how you would do this.

• CK says:

Andrew:
I think that depends on the context. For example we know that the rotation of the earth causes the tides. Is that meaningless?

• Andrew says:

CK,

To bring your tides example closer to the one under discussion, suppose someone claimed, based on an analysis of physical laws and data, that “increasing the rate of Earth’s rotation” would increase the height of the tides. Then, yes, I think this wouldn’t mean so much without specifying how the rate of rotation would be increased. In this case there might be a completely reasonable way for this to happen, or it might be that there are several different ways to do this, and they all result in the same change in tide level. If there is that sort of robustness then I can see how it would make sense to say that one variable causes another, without worrying too much about the way that the first variable is changed. But I don’t think the example of Bolivia’s genetic diversity and its economy is such an example!

I expect that it is possible to frame the above discussion in terms of colliders etc. but it’s not clear to me that such a re-expression would be an improvement.

• judea pearl says:

Andrew,
Your concern is perfectly justified, as you expressed it (quoting):
“In general, there can be many possible interventions that
can change a variable x, and these different treatments can
have different effects on y. To me it does not make sense
to speak of setting x to a value without considering how
this is done, which in many cases will require a new
variable to be added to the model. Your framework can
incorporate that, I agree, but then I don’t see the benefit
(to me) of the graph.”

You were very close to the answer, but retreated at
the last minute, with “I dont see the benefit (to me)
of the graph”. I will try to show you how close you were.

1. You recognized the futility of asking questions such
as “and what if my intervention has a side-effect that
I did not know about?” You recognized the necessity of modeling
the stipulated intervention with as much diligence as we
model the world, even if it takes adding extra variables
to the model. Indeed, no one could predict things that are not
in the model (e.g., pressing an unknown button in the dark).

2. But now, instead of giving up, let us continue boldly
toward our aim: find the effect of the intervention I
from observational studies, using two pieces of information:
(a) data from M, our pre-interventional model of the world, and
(b) our model M(I) of how M will change with the intervention I.

3. I believe you will be happy to know that our boldness
has paid off; the question posed in step 2 can be answered
formally using the do-calculus. The graph
tells us whether the information available is sufficient to
find a bias-free estimate of the effect of I and, if so, how.

4. To summarize, we start with the somber realization that we do not
want to evaluate the effect of some hypothetical atomic intervention
like do(x), but, rather, the effect of a compound intervention I.
Miraculously, through the courage of using mathematics we
are able to decompose our compound question into its
atomic constituents and find the answer using the tools
of causal inference. All it takes is courage to use the
available tools, be they graphs, or do-calculus, or
potential outcomes. This is much more than notational
convenience; acquiring new tools gives us the courage
and curiosity to do things we would otherwise dismiss
as un-doable.

I will end by saying that a more elaborate discussion
on this topic (with examples) is available in my 2010 review
of Cartwright’s (2008) “Hunting Causes…”; see
http://ftp.cs.ucla.edu/pub/stat_ser/r342.pdf

I hope I rekindled your curiosity.

9. Fernando says:

Andrew:

Allow me to try to clarify your criticism. (1) Are you worried that we cannot manipulate x; or (2) that we cannot do so directly; or (3) that we really do not know what x is? My answers:

(1) Lack of a manipulable treatment is not a problem in principle. You cannot run an experiment but you can do observational studies. The caveat is Nature may not want you to know the answer: no conditioning strategy is able to identify the effect. Then you are stuck.

(2) If we have no way of implementing a controlled manipulation of X, add a manipulable instrument z, or a set of instruments Z, to the DAG. The simplest diagram would look like z -> x -> y. Then you can worry about whether Z is really an instrument, side effects, etc., by checking for d-separation and so on. It may well turn out that after all is said and done you cannot isolate the effect of x (e.g., there are no suitable instruments, conditioning strategies, etc.), but you’ll only know that if you write things down.

(3) If your claim is more loosely interpreted as “we cannot even talk about causality if we do not know what our treatment is”, then I agree with you. But this is a conceptual and measurement question.
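A quick simulation sketch of the z -> x -> y setup in (2), with illustrative coefficients: an unobserved confounder u biases the naive regression slope, while the instrumental-variable (Wald) ratio recovers the true effect of x on y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# z -> x -> y, with an unobserved confounder u of both x and y.
z = rng.normal(size=n)                         # manipulable instrument
u = rng.normal(size=n)                         # unobserved confounder
x = 0.8 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)     # true effect of x on y is 2.0

naive = np.cov(x, y)[0, 1] / np.var(x)         # OLS slope, biased by u
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # IV (Wald) estimator

print(round(naive, 2), round(iv, 2))           # naive is biased upward; iv is near 2.0
```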

This is not to deny there are difficult interventions. In terms of genetics, clearly we are never going to get a God-like intervention where we just go to Bolivia and in the middle of the night change their genes. Moreover, I am not sure I’d be interested in that kind of intervention (hence I did not read the paper). But I’d be interested in the impact of Spain expelling all Jews and Muslims in 1492, or of US migration quotas, of refugee crises, or of drawing African maps that maximize or minimize ethnic fractionalization, all of which affect cultural and genetic diversity.

A related example is diets. What is the effect of a low-carb diet on weight? Well, it turns out it is hard to tell, because when you remove the carbs overall caloric intake may go down, and, if you want to hold calories constant, then you need to substitute in protein or fats. And maybe their proportion matters. So a diet is like a cocktail drug, and there are hundreds of small variations on the same theme (paleo, Atkins, South Beach, etc.). I think Wu and Hamada’s book talks about designing experiments to optimize a mix. It makes life hard but not impossible. But it is precisely in these hard situations where a clear notation can help.

Finally, if in your world everything is connected, then that is fine. That is the more general case, but generality has a price. Presumably you can only make statements about prediction, not manipulation of cause and effect, because in that world the probability of confounding, or of controlling for a collider, is exactly 1. Then the question becomes how big the bias is. You don’t care about bias, but tell that to a patient given the wrong medicine. You care about getting the signs of the estimated effects right; I do too, but then you are coarsening. Why then and not before?

• Andrew says:

Fernando:

You write of “the impact of Spain expelling all Jews and Muslims in 1492, or of US migration quotas.” That’s fine; those are specifically defined. What I don’t get is “increasing the diversity of Bolivia.” I find it highly doubtful that the effects of expelling Jews or admitting Mexicans have anything much to do with the genetic diversity measure used in that paper.

• Fernando says:

Andrew:

I have not read the paper so I cannot say. And I agree with you that policies are often more interesting. But consider the following: Suppose that the expulsion of the Jews and Muslims reduced growth in Spain. We might want to know why. Maybe it had to do with human capital, or work ethic, or network structure, or maybe genes.

The latter is not so far-fetched. For example, if some people are genetically resistant to disease and there are non-linearities in disease transmission, then a few immune people intermingled in a general population can stop pandemics in their tracks, generate herd immunity, and so on.

In that context you might want to know whether genetic diversity is a mediator, and that requires identifying the effect of diversity on the outcome even if your interest is only in the policy (e.g. if true you might adopt an optimal genetic profile for immigration to confer herd immunity).

All this is far-fetched, and although I’ve not read the paper, I find it far-fetched. But I think the problem there is not DAGs or whatever so much as weak conceptualization and theorizing.

PS Thank you for hosting this conversation!

• Andrew says:

Fernando:

I agree 100% that the problem with the linked paper is not DAGs or colliders or anything else. To the extent that the authors were thinking in any formal mode of causal inference, I expect it was the potential-outcome framework, which indeed did not protect them from their errors. (Nor do I think they would agree with me that they were making errors, but that’s another story!)

What I was saying was that the potential-outcomes framework allowed me to work through some of the problems in that paper, in a way that DAGs etc might not have done. If potential outcomes notation and DAG notation are two ways of solving similar problems, I think it’s fair to say that they focus on different aspects of these problems. DAGs are good if you’re worried about colliders and such things; potential outcomes are a useful framework if you want to think carefully about what is the assumed treatment and how it might vary.

• Fernando says:

I think potential outcomes are useful to introduce the notion of a causal effect as what would have happened had the same unit been exposed to control or whatever. Also, perhaps because of the way I’ve been trained (warped), Y_i1 vs. Y_i0 seems more explicit than Y|do(x) vs. Y|do(x’).

But then I don’t like to talk about (Y_i1, Y_i0) \perp X | z1, z2, z3. The causal model is not made explicit. For example, I don’t assume Missingness at Random (MAR); I assume an explicit causal structure (a DAG), and then derive MAR from it. The final outcome is equivalent, but DAGs call for making explicit the causal knowledge from which the conditional statement is derived. It’s like showing your work when doing high-school arithmetic homework. Clever people want to jump steps; often they err.
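Deriving a MAR-type conditional-independence statement from an explicit DAG can be mechanized. A minimal d-separation checker, via the standard moralized-ancestral-graph criterion (the DAG encodings at the bottom are illustrative: R is the missingness indicator for Y):

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Check d-separation of xs from ys given zs in a DAG given as {node: parent list}."""
    # 1. Ancestral subgraph of xs | ys | zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: undirected parent-child edges, plus "marry" co-parents.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = [p for p in parents.get(n, ()) if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    # 3. Delete zs; xs and ys are d-separated iff now disconnected.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        n = stack.pop()
        if n in seen or n in zs:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return not (seen & ys)

# Missingness DAG 1: R depends only on X  ->  Y _||_ R | X, i.e. MAR given X.
print(d_separated({"Y": ["X"], "R": ["X"]}, {"Y"}, {"R"}, {"X"}))        # True
# Missingness DAG 2: R also depends on Y itself  ->  not MAR given X.
print(d_separated({"Y": ["X"], "R": ["X", "Y"]}, {"Y"}, {"R"}, {"X"}))   # False
```

The point of the exercise is exactly the one above: the conditional statement is not assumed, it is read off a causal structure you wrote down first.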

And I disagree about concepts. If you read Gary Goertz’s book on social science concepts you’ll see he argues most concepts are causal and ontological, and comes very close to using DAGs for concept definition. Just flick through the book and you’ll see the graphs. I am doing some work on that (correction: I have been planning to ….)

• judea pearl says:

Andrew,
You say: “DAGs are good if you’re worried about colliders and such things”
Again, you are dismissing without trying.
Here is a correction from someone who solved a few problems in both frameworks:
(1) Counterfactuals are necessary for specifying your research question, namely, what you want estimated.
(2) DAGs are necessary if you worry whether the assumptions you made agree with what you know,
(3) or if you worry whether what you know is sufficient for estimating what you want, and
(4) if you worry whether what you know has testable implications.

Note, I said “counterfactuals”, not “potential outcomes”. Why? Because the latter is
becoming more and more associated with a culture of hasty dismissals, as if (2)-(4)
do not deserve one’s attention, as if (2)-(4) have ever been accomplished without DAGs,
and as if DAGs are just things you can do without. I suggest the use of “counterfactuals” for those who wish to be associated with a culture of symbiotic tool building.

10. Fernando says:

Judea:

I liked how you put it. Again it gets us to DAGs as a language to sort out our private knowledge.

I find that interesting because it suggests we have more knowledge than we can process unaided.

This has testable implications. Assign some students randomly to a DAG course and others to a potential-outcomes course, then give both groups a series of identification problems as a test. It would be interesting to compare the error rates.

11. judea pearl says:

Fernando,
The idea of testing the merit of DAGs in student performance is a good one, in theory.
Unfortunately it will not convince those who do not want to be convinced.

First, you will never get anyone from the arrow-phobic camp to pose a problem that students can
understand. How would you describe such a problem to students? If you start with a story,
say smoking -> tar -> cancer,
they will argue that they do not understand the assumptions, that the assumptions are unrealistic,
that it is a toy problem, that they
are dealing only with “practical problems”, that practical problems do not lend themselves to
graphical representation and, in general, that it is not the kind of problem that is of interest to THEM.
Second, they will argue that identification is irrelevant in the practical and important work that
they are doing. So how would you measure performance?

I have tried all the tricks; you will never get any of them to say “here is a causal story you
and I understand, let’s try to formalize it”. Their picture of the world does not consist of stories,
mechanisms, processes, influences, effects, equations, etc.; it consists instead of statistical tables with missing data.
It is a dark world, I agree, but they are willing to pay this price for the comfort of working within a familiar framework.
That is why I emphasize “testable implications” in my posts; this is something that no self-respecting statistician can afford to dismiss. To no avail. Missing-data researchers are not motivated by what
needs to be done, but by what they know how to do, and if one does not know how to detect a testable implication, then it is not something that deserves one’s attention.

It is a strange but, luckily, temporary phenomenon.

• Andrew says:

Judea:

If you include me, Rubin, or Imbens in the category of “missing-data researchers” when you write, “It is a dark world . . . Missing-data researchers are not motivated by what needs to be done, but by what they know how to do . . .”, then I think you’re just being silly. People have all sorts of motivations. I find your methods confusing. That doesn’t mean that I’m not motivated by what needs to be done (whatever that means), it just means that I’m using different tools than you’re using. And it’s silly to say that Rubin is motivated by what he knows how to do: he’s spent many years developing new methods so he could do things that he could not do before!

Part of this might simply be temperament: I’m a fox and you’re a hedgehog. Many of the great statisticians of the past century have been foxes. Arguably, though, the most important contributions to science have come from hedgehogs. Albert Einstein was a hedgehog; Richard Feynman was a fox. Feynman was great, but Einstein’s the one who really changed how we think about the world. Even those hedgehogs, such as L. J. Savage, who prematurely tried to unify a scientific field can end up making important contributions. I think we need both hedgehogs and foxes.

Then again, I would think that. I’m a fox.

• Fernando says:

Andrew:

This conversation cannot end with the conclusion that you are a fox and Judea is a hedgehog. As scientists we should do better.

But maybe that makes me a platypus.

• Andrew says:

Fernando:

My contribution as a scientist is to write articles, books, and blog posts about statistics and its applications. For example, my entry here illustrated how the potential-outcome framework can be used to clear up some confusion in a recently published article in an economics journal.

The hedgehog/fox thing is just my way of trying to explain to Judea how it is that I can be satisfied that he and I use different methods, while he is clearly unsatisfied. I see different scientists using different methods as natural in a field such as ours that is not fully developed, while Judea attributes all sorts of unscientific attitudes to people who prefer not to use his methods. I bring up the hedgehog/fox thing to put this into some sort of perspective. I really do think that Judea’s methods are more appropriate for some sorts of problems than for others, but I think one aspect of his disagreement on this is his hedgehog nature, just as my fox nature makes me more agreeable to the notion that different methods work better in different settings.

• Fernando says:

Your contribution as a scientist is also to teach the future generation of scientists.

An empirical claim has been made that there exists a language that can improve how scientists leverage and communicate their causal knowledge, reducing identification errors (e.g. controlling for a mediator when estimating the total effect).

I think as teachers we have a fiduciary duty to teach the best methods we know of (you have written about this in this blog). Although in this case I would argue it is up to Judea to conduct the test. His claim is an empirical one, as logically both notations have identical expressive power, so he needs to provide the empirical evidence.

I think that if you train people in different methods and then conduct tests, student satisfaction surveys, etc., you can come up with very robust evidence. Obviously the transition may happen anyhow, if the claims are true. But change need not be generational: I find that depressing.

• judea pearl says:

Andrew,
I assume the fox/hedgehog analogy characterizes the
contrast between pluralism/open-mindedness/tolerance and tunnel-vision/narrow-mindedness,
and that I am classified as a hedgehog in this discussion, and the missing-data champions
as foxes.
For the record, my students use counterfactuals (Y_x), graphs, and structural equation notation with equal comfort, sometimes deploying all three on different subtasks of
the same problem.
Please compare this to the notational tools used by Rubin’s dynasty, ordained as foxes.

You have an explanation for this asymmetry. You say that the missing-data camp is not
narrow-minded, or inflexible, or compromising on what needs to be done, no! Your explanation
is “different problems” (quoting): “I truly believe that Judea’s methods are more appropriate for some sorts of problems than for others.”

Let’s see if this explanation holds water.
First, how can you believe that a given method is less appropriate for a problem if you have
not used it in ANY problem? Judgment from a distance is error-prone; shouldn’t it
wait for some evidence?
Second, the merits of the structural approach are universal; they are not limited to
“some sorts of problems.”
I have listed these merits three times before, and you have not mentioned
a single problem where those merits are NOT critical and appropriate.

I will list them here again, lest they be covered up with foxes and hedgehogs.
(1) Counterfactuals are necessary for specifying your research question, namely, what you want estimated.
(2) DAGs are necessary if you worry whether the assumptions you made agree with what you know,
(3) or if you worry whether what you know is sufficient for estimating what you want, and
(4) if you worry whether what you know has testable implications.

You keep on referring to “different settings.” Can you name a “setting” where finding out
whether the assumptions are plausible is NOT APPROPRIATE? Can you name a “setting” where finding out
whether the assumptions have testable implications is NOT APPROPRIATE?

This “different settings” excuse reminds me of the guy who prefers to count on his fingers rather than learn arithmetic; his explanation: “different methods work better in different settings.”

In conclusion: the asymmetry between the two camps lies not in “different settings”
but in different attitudes. One camp has learned to decide which method works better for
any given subproblem; the other camp refuses a priori to examine certain methods,
even at the cost of poor performance on universally critical issues (i.e., plausibility and testability).

• Andrew says:

Judea:

No! “Hedgehog” is not a put-down, it’s a characterization. As I wrote above, “Arguably, though, the most important contributions to science have come from hedgehogs. Albert Einstein was a hedgehog.” Einstein didn’t have tunnel vision. He was focused, he had a singular vision. That’s what being a hedgehog is all about. It’s great that you’re a hedgehog! I just happen to be a fox, that’s me.

• Fernando says:

Andrew:

I like this in the sense that it points towards heterogeneous effects.

The claim is that there is a new language that has the same expressive power as the old but is easier to use and more effective. Think of this new language as a treatment.

Not all people react well to treatment. Judea and I find the treatment is effective and makes our work easier. I would also bet money the average scientist would benefit from treatment, and for this reason would advocate for making it the default in most graduate programs.

But, just as aspirin helps most people but kills a few, so some people may not benefit from the new language. Indeed, they may suffer harm (e.g., become less productive). The latter may choose to be never-takers.

So maybe you are not only a Fox but also a Never Taker, and Judea is not only a Hedgehog but an Always Taker, and I am a platypus and a Complier…

• Paul says:

In The Signal and the Noise, Nate Silver really favors the foxes over the hedgehogs, at least when it comes to forecasting. So far, anyway: I’m only halfway through his book.

12. Brian says:

Andrew (and others – great discussion):

When you say you find Pearl’s “methods confusing,” do you mean that at some conceptual level they don’t make sense to you (lack of coherence, an unclear ontology, etc.), or more that they’re just (overly) difficult in some technical sense?

The reason I ask is that I’m unsure how to think about the fundamental divisions within statistics / inference (Bayesian vs. frequentist vs. Dempster–Shafer theory vs. Pearl’s approach, etc.). Are these primarily pragmatic divisions (disputes about what are the best methods and techniques), or is it more of an ontological thing in the sense that different camps have different conceptions of uncertainty, causation, etc.?

• Andrew says:

Brian:

What I mean is, I don’t get the point of the “do” operator. I don’t in general think it makes sense to “fix” a variable at a value, without defining the intervention that this corresponds to. As I see it, Pearl’s framework is based on the idea that different variables are connected together like a set of probabilistic components (generalizations of deterministic “gates”) in an electronic circuit. I think his model makes complete sense in that sort of setting, where the inference is to determine which gates are wired to which other gates. In social science, maybe not. I think that, to really get it to work, you have to define new variables for each potential intervention and you have to give up on the idea of finding conditional independence. But at that point the theory becomes less useful.

• StatsStudent says:

As others have stated before me, thanks for the great discussion. I’m new to causal inference. I can’t claim to have perfectly understood both sides, but, a part of me feels like framing the issue as a disagreement about which notation to use (DAGs: good or bad?) is misguided. I think that Andrew and Judea have substantive differences in their philosophical approach to causal inference (or at least they have very different priorities). These substantive differences naturally lead to disagreement about whether or not to use DAGs or whether the “do” operator is confusing. I’m still trying to understand where these real differences lie.

• judea pearl says:

Andrew,
You say: “I don’t in general think it makes sense to ‘fix’ a variable at a value, without defining the intervention that this corresponds to.”
What you did not tell Brian is how to “define the intervention.”

Once we agree that an intervention needs to be “defined,” not just given
an identity index (e.g., “intervention number 17”), we need to know how to
specify it mathematically, so we can predict its consequences.

I challenge you to describe an intervention of your choice, in a problem of your choice,
in a model of your choice, such that we can predict its consequences from observational
studies.

I have posted a way of doing this which also addresses your difficulty with
“fixing a variable at a value, without defining the intervention that this corresponds to.”
I will repeat it, just in case it skipped your attention and the attention of
other readers who want to make sense of “fixing a variable to a constant” in
everyday expressions such as “make me laugh,” “raise taxes,” “lower interest rates,”
“lower class sizes,” “sell your house,” etc.
It does not make sense without “defining the intervention,” agreed, and yet we understand each
other, we communicate, and we make each other laugh.

So here is my proposal again:

Your concern is perfectly justified, as you expressed it (quoting):
“In general, there can be many possible interventions that
can change a variable x, and these different treatments can
have different effects on y. To me it does not make sense
to speak of setting x to a value without considering how
this is done, which in many cases will require a new […]
incorporate that, I agree, but then I don’t see the benefit
(to me) of the graph.”

You were very close to the answer but retreated at
the last minute, with “I don’t see the benefit (to me)
of the graph.” I will try to show you how close you were.

1. You recognized the futility of asking questions such
as “what if my intervention has a side effect that
I did not know about?” You recognized the necessity of modeling
the stipulated intervention with as much diligence as we
model the world, even if it takes adding extra variables
to the model. Indeed, no one can predict things that are not
in the model (e.g., pressing an unknown button in the dark).

2. But now, instead of giving up, let us continue boldly
toward our aim: find the effect of the intervention I
from observational studies, using two pieces of information:
(a) data from M, our pre-interventional model of the world, and
(b) our model M(I) of how M will change with the intervention I.

3. I believe you will be happy to know that our boldness
has paid off; the question posed in (2) can be answered
formally using the do-calculus. The graph
tells us whether the information available is sufficient to
find a bias-free estimate of the effect and, if so, how.

4. To summarize, we start with the somber realization that we do not
want to evaluate the effect of some hypothetical atomic intervention
like do(x), but, rather, the effect of a compound intervention I.
Miraculously, through the courage of using mathematics we
are able to decompose our compound question into its
atomic constituents and find the answer using the tools
of causal inference. All it takes is courage to use the
available tools, be they graphs, or do-calculus, or
potential outcomes. This is much more than notational
convenience; acquiring new tools gives us the courage
and curiosity to do things we would otherwise dismiss
as un-doable.
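(To make the step concrete for readers: in the simplest case, the “bias-free estimate” the graph licenses is the backdoor adjustment, P(y | do(x)) = Σ_z P(y | x, z) P(z). The following toy simulation, with an invented model and coefficients chosen purely for illustration, shows it removing the confounding that the naive contrast suffers from.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative model M: observed confounder Z -> X, Z -> Y, plus X -> Y.
Z = rng.random(n) < 0.5
X = rng.random(n) < 0.2 + 0.5 * Z
Y = rng.random(n) < 0.1 + 0.3 * X + 0.4 * Z   # true effect of X is 0.3

def p(mask):
    """Empirical probability of a boolean event."""
    return mask.mean()

def adjusted(x):
    # Backdoor adjustment, licensed by the graph for this model:
    # P(y | do(X=x)) = sum_z P(y | x, z) P(z)
    return sum(p(Y[(X == x) & (Z == z)]) * p(Z == z) for z in (0, 1))

effect = adjusted(1) - adjusted(0)    # recovers the structural coefficient ~0.30
naive = p(Y[X == 1]) - p(Y[X == 0])   # biased upward, since Z raises both X and Y
```

In this setting a potential-outcomes derivation (ignorability given Z) reaches the same formula; the graph’s role is to tell you, mechanically, that conditioning on Z suffices.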

I will end by saying that a more elaborate discussion
of this topic (with examples) is available in my 2010 review
of Cartwright’s (2008) “Hunting Causes and Using Them”; see
http://ftp.cs.ucla.edu/pub/stat_ser/r342.pdf

I hope I rekindled your curiosity.

• CK says:

Andrew:
I was wondering if you can come up with a specific causal problem in social science where the SEM/graph framework is not useful but the potential-outcome framework is. Judea thinks there is none. Can you prove him wrong?

• Andrew says:

CK:

I think it depends on the user. Above in the thread I gave a recent example where the potential-outcome framework was very helpful to me, and where I didn’t see the relevance of graphical models. As I see it, the potential-outcome framework focuses on clearly defining an intervention in a specific context, whereas the structural equation approach seeks to untangle conditional independence relations among a set of variables. In the particular example under discussion (the one where the authors of the paper claimed that “increasing the diversity” of Bolivia would increase its income per capita by some specified amount), I got a lot of insight out of asking myself the question: What possible interventions would increase Bolivia’s diversity (as defined in the article)? The potential-outcome approach forced me to think about that. In the structural equation approach, one can attack the problem by adding a variable to the graph. But that’s not what I see people doing; I see them, over and over again, talking about estimating the effect of a variable without considering how it can be altered.

So here’s what I think. In some settings, it makes sense to talk about estimating the effect of a variable without considering how it can be altered. Examples of such settings include networks of components in an electronic circuit, and also whatever other problems there are where the particular intervention used to change X doesn’t really matter because they all do the same thing. In other settings, you can’t really talk about applying the “do” operator to X or “clamping” X or whatever; you have to add the intervention variable as a new node in the graph and consider X as a non-clampable variable.

It’s really a matter of focus. I’m sure that any expression in one framework can be translated into another, but the different frameworks lend themselves more directly to different sorts of problems. And in the example given, I found my potential-outcomes training to be effective in cutting through the confusion. (Not that the potential-outcomes formulation is any sort of magic; as I noted above, I have a feeling that the authors of the paper in question were using the potential-outcomes framework, but that didn’t stop them from making the same mistake that a zillion other social science researchers have made before them: treating a regression coefficient as a causal effect without thinking about what it really means.)

• judea pearl says:

Andrew,
I perfectly agree with the first part of your sentence (quoting):

“As I see it, the potential outcome framework focuses on clearly defining an intervention in a specific context, whereas the structural equation approach seeks to untangle conditional independence relations among a set of variables.”

Indeed, if you examine item (1) in my symbiotic agenda:
(1) Counterfactuals are necessary for specifying your research question, namely, what you want estimated.
(2) DAGs are necessary if you worry whether the assumptions you made agree with what you know,
(3) or if you worry whether what you know is sufficient for estimating what you want, and
(4) if you worry whether what you know has testable implications.

you will find the virtue of potential outcomes spelled out clearly as the ability “to specify
the research question,” which, in your case, meant thinking about the intervention
in the specific context.

But I do not agree with the second part of your statement (quoting):
“whereas the structural equation approach seeks to untangle conditional independence relations among a set of variables.”
Again, you are dismissing methods that you have not tried yourself,
by attributing to them virtues
that no one cares about, e.g., “untangling conditional independence.” How about attributing
to them items (2), (3), and (4) above, and admitting that:
a. these items are critical and needed in every problem and every “setting”;
b. these items have not been achieved by any study based strictly on “potential outcomes”; and
c. some people claim that items (2), (3), and (4) are achievable through structural
equation models or DAGs, but you have not tried it yet, and you do not know how they
do it, or whether they do it right.

• Andrew says:

Judea:

You write that no one cares about untangling conditional independence. What about Steven Sloman (see here, pages 961-962)? He is confused, and I wouldn’t cite him as an exemplary user of structural equation models, but he’s a naive believer in such models who thinks they’re for untangling conditional independence.

Just to make clear that I’m not trying to pick on structural equation models: one could similarly find naive believers in Bayesian inference who believe that, by being Bayesian, they are being logically coherent, or naive believers in classical inference who believe that 95% confidence intervals actually cover the true value 95% of the time. The point of all these examples is not that the methods in question are wrong, but rather that at their extremes they can be used in naive ways. For better or worse, Sloman represents a class of users of structural equation models, so I don’t think it’s right to say that no one cares about such things as untangling conditional independence.

• judea pearl says:

Andrew,
by “virtues that no one cares about” I meant “no one among the readers of this
blog cares about”. Steven Sloman is a cognitive psychologist, he talks about
conditional independence in the context of discovering causal structures from
data alone. In this endeavor, one relies on conditional independencies
as a means for structuring models. Again, it is a means to an end.

Our discussion did not revolve around discovering structure but around
problems that one can solve GIVEN a causal structure, or given
any assumptions one cares to make which match what one believes about
the world. In this context the role of structural equations is not to
untangle conditional independencies but to decide which covariates to
control for, if any, which assumptions are plausible, how to test the
model, whether a variable is a good instrument, etc. These are critical
questions that readers of this blog can appreciate, because they surface
in almost every problem.

• judea pearl says:

CK,
Your challenge to Andrew is a great one.
But you need to define what you mean by “a specific causal problem,”
else Andrew can cite all the problems he has worked on in his career and say:
here is the “specific causal problem” that you asked for.

I think you need to insist that the example cited will include the following components:

1. A definition of the research question, and what criteria we pose
on the answer before it is deemed acceptable.
2. Did we rely on any untested assumptions in our solution and, if so,
are they articulated in a language that allows us to judge their
plausibility?
3. Did we rely on any untested assumptions in our solution and, if so,
do they have any testable implications?
4. What guarantees, if any, do we have that the answer we got
is acceptable (according to 1), or follows logically from
our assumptions?

13. Fernando says:

1. Nothing in the language prevents you from redefining the intervention variable, as both Judea and I explained above.

2. If you give up conditional independence then you give up MCAR and MAR, which is a little problematic if you believe causality is a missing data problem.
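(For readers unfamiliar with the acronyms: MCAR and MAR, “missing completely at random” and “missing at random,” are conditional-independence assumptions about the missingness mechanism. A toy simulation, with invented numbers and selection probabilities taken as known for simplicity, illustrates how MAR, i.e., response R independent of Y given the observed X, licenses inverse-probability weighting where the complete-case average fails.)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Outcome depends on a fully observed covariate X.
X = rng.random(n) < 0.5
Y = rng.normal(1.0 + 2.0 * X, 1.0)      # E[Y] = 2.0

# MAR: whether Y is observed depends only on X (R _||_ Y | X).
p_obs = np.where(X, 0.9, 0.3)           # known response probabilities (a simplification)
R = rng.random(n) < p_obs               # R = True means Y is observed

cc_mean = Y[R].mean()                   # complete-case mean: overweights the X=1 group
ipw_mean = (R * Y / p_obs).mean()       # inverse-probability weighting recovers E[Y]
true_mean = Y.mean()
```

The conditional independence here is exactly the assumption that makes the weighted estimator unbiased; drop it (missingness depending on Y itself) and neither estimator above is justified.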

• judea pearl says:

Fernando,
I do not think Andrew meant to give up conditional independence altogether.
He mentioned conditional independence as a goal of structural models, failing
to notice that conditional independence is only a means to an end, the end being
solving causal problems from beginning to end, including plausibility checking and
model testing.

The role of conditional independence assumptions in MCAR and MAR is interesting.
Missing-data researchers make these assumptions routinely, when the need arises to justify
their routines, not realizing that the assumptions can be verified and checked for plausibility
by graphical methods. I stipulate that the missing-data literature will change