To follow up on yesterday’s discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.

Contents:

– A practical issue: poststratification

– 3 kinds of graphs

– Minimal Pearl and Minimal Rubin

– Getting the most out of Minimal Pearl and Minimal Rubin

– Conceptual differences between Pearl’s and Rubin’s models

– Controlling for intermediate outcomes

– Statistical models are based on assumptions

– In defense of taste

– Argument from authority?

– How could these issues be resolved?

– Holes everywhere

– What I can contribute

**A practical issue: poststratification**

I’ll start with an issue where Pearl disagrees with Rubin (and also with me, and I expect with Paul Rosenbaum, Rod Little, and many others). I’ll repeat a bit from my earlier entry. Pearl writes:

For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a “seat-belt user” or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.

Pearl seems to take this as an example of where Bayesian inference–the rule to condition on all observed data–gives the wrong answer. But I think he’s missing the point. At the technical level, yes you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable–in Bayesian terms, that they will individually have large posterior uncertainties. But, if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel out when they’re being combined to get the average treatment effect. If the uncertainties don’t cancel, it sounds to me like there must be some additional (“prior”) information that you need to add.

I’m pretty sure about this. Of all the stuff I’m talking about in this blog, Bayesian regression and poststratification is the area in which I’m most truly an expert. To get a sense of some of the gains from this approach, check out some of the recent work by Jeff Lax and Justin Phillips (here and here).
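As a minimal sketch of the averaging step only (not the multilevel regression that produces the cell estimates), poststratification weights each cell's estimate by that cell's share of the population. The function name `poststratify`, the effect sizes, and the population counts below are all hypothetical, chosen just to make the arithmetic visible.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of per-cell estimates."""
    if len(cell_estimates) != len(cell_counts):
        raise ValueError("need one population count per cell")
    total = sum(cell_counts)
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    weighted = sum(est * n for est, n in zip(cell_estimates, cell_counts))
    return weighted / total


# Hypothetical treatment-effect estimates for seat-belt users and non-users,
# with hypothetical population counts for the two groups.
effects = [0.10, 0.30]
counts = [750, 250]
average_effect = poststratify(effects, counts)
print(average_effect)
```

In a fully Bayesian analysis the same weighted average would be applied to each posterior draw of the cell estimates, so that any correlation between the cells carries through to the uncertainty of the average.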

**3 kinds of graphs**

I can think of three different ways that directed graphs have been applied to statistical modeling.

1. *Graphing the structure of a probability model.* For example, consider a simple hierarchical model (the 8-schools example in chapter 5 of Bayesian Data Analysis), with a “likelihood” of y ~ N (theta, sigma^2), a “prior” of theta ~ N (mu, tau^2), and a “hyperprior” on (mu,tau). (For simplicity I’m assuming sigma is known and unmodeled; we can discuss this point later, if you’d like, but for now I’m trying to keep things clean so as to be able to use Ascii graphics.) The graph for this model is

(mu, tau) –> theta –> y

The arrows don’t represent causation or anything like that–it doesn’t make sense to me to talk about (mu, tau) “causing” theta, or theta “causing” y. The parameter mu, for example, simply represents the mean of the population of theta values; it has no meaning as a causal factor.
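One way to read a graph of this first type is as an order of simulation rather than a statement about causes: draw each quantity given whatever points into it. The sketch below follows the arrows for the model above. The numerical values of mu, tau, and the known sigmas are illustrative assumptions, and mu and tau are fixed here instead of being drawn from a hyperprior.

```python
import random


def simulate_hierarchical(mu, tau, sigmas, rng):
    """Draw theta_j ~ N(mu, tau^2), then y_j ~ N(theta_j, sigma_j^2)."""
    thetas = [rng.gauss(mu, tau) for _ in sigmas]
    ys = [rng.gauss(theta, sigma) for theta, sigma in zip(thetas, sigmas)]
    return thetas, ys


rng = random.Random(0)
# Illustrative values only: (mu, tau) fixed, sigma known for each group.
mu, tau = 5.0, 3.0
sigmas = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
thetas, ys = simulate_hierarchical(mu, tau, sigmas, rng)
for theta, y in zip(thetas, ys):
    print(round(theta, 2), round(y, 2))
```

Nothing in this code intervenes on anything; the arrows only record which distributions are conditioned on which quantities.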

2. *Graphing a hypothesized causal pattern.* Cyrus gave an example in his blog comment yesterday:

(X, Y) –> L –> C

These are causal relations, as Cyrus has defined them: X and Y cause L, and so forth.

3. *Graphing relations between real-world variables.* Here I’m thinking of models with variables such as inflation, unemployment, and interest rates; or schooling, socioeconomic status, test scores, and delinquency.

These three sorts of graphs can look similar, but they have different interpretations for causality. In particular, graphs of type 1 can be helpful for complex hierarchical models even if they are purely descriptive. For example, I recently estimated public support for school vouchers among voters, characterized by religion, ethnicity, income, and state of residence. I’m not trying to understand whether being a rich person in Texas causes you to have a certain opinion–it’s an interesting question, but I’m answering some more basic descriptive questions. Nonetheless, my model has a graphical structure.

**“Minimal Pearl” and “Minimal Rubin”**

I’d like to separate each of Rubin’s and Pearl’s theories into a key conceptual part and a more elaborate analytical part. I’ll argue that, whatever you think of the analytical parts, the conceptual core of each theory represents a useful insight.

*Minimal Rubin*: Defining causal inference as potential outcomes (not necessarily “counterfactuals” because the notation can be used before an experiment is actually done, in which case any of the outcomes might be possible). I have found this to be an extremely useful way of focusing my understanding. To take just one example (which I mentioned in my earlier blog entry), when Gary and I started working on incumbency advantage twenty years ago, there was already a bit of literature on the topic: different articles with different definitions of incumbency advantage, and a near-complete confusion between *estimands* and *estimates*–that is, between the formulas used to compute “incumbency advantage” numbers from data, and the underlying quantities being estimated. The potential-outcome framework allowed us to formulate the estimand–our definition of incumbency advantage–clearly, and then we were able to move to the estimation phase.

*Full Rubin*: The research programme under which all causal inference problems can be framed in terms of potential outcomes. The Full Rubin has had some successes (for example, the paper with Angrist and Imbens on instrumental variables) but it also creates some new difficulties, notably when dealing with intermediate outcomes.

*Minimal Pearl*: Displaying causal relations as a directed graph, and using graph-theoretical ideas to understand patterns such as backdoor causality and colliders. I have certainly found it useful to use graphs to explore causality, and lots and lots of people have found Pearl’s ideas helpful in understanding the roles of different variables in a graph (see, for example, Cyrus’s comments in the earlier blog entry). As with Minimal Rubin (but in a different way), a key contribution of Minimal Pearl is to separate the causal structure from the specifics of a model. There had been lots of literature on graphs for path analysis, structural equation models, and so forth–but Pearl detached the graphical ideas from the specific correlation-based models that were out there.

*Full Pearl*: The research programme under which all causal inference problems can be framed in terms of graphs, colliders, the do operator, and the like. It doesn’t quite work for me, but many people feel the Full Pearl is the way to go. A good argument in favor of Full Pearl is that, by handling dependence structures in a compact way, this framework frees up the researcher to think about more complicated structures of variables, to not be limited to the very simple structures that we can hold in our heads. One reason that I’m sympathetic to Full Pearl–and, at the very least, why I’d like to better understand Minimal Pearl–is to see if I can improve my own modeling in this way.

**Getting the most out of Minimal Pearl and Minimal Rubin**

I’d argue that all of us–Pearl, myself, and Rubin included–would benefit from consistently using the insights of Minimal Pearl and Minimal Rubin, in particular,

*from Pearl:* Write models as directed graphs and, where necessary, explain exactly what the links mean and how their strength can be measured.

*from Rubin:* Be explicit about data collection. For example, if you’re interested in the effect of inflation on unemployment, don’t just talk about using inflation as a treatment; instead, specify specific treatments you might consider (adding these to the graphs, in keeping with Pearl’s principles). This also goes for missing data. For example, Cyrus presented an example in which the variable Y is missing when a different variable, C, is observed. I recommend adding a new variable, I_Y, an indicator for whether Y is observed. The graphical model can then show that I_Y depends only on C.
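The suggestion to add an observation indicator can be sketched by writing the graph as a table of parents. This assumes, as a guess, that the C in the missing-data example is the same C as in the graph (X, Y) –> L –> C quoted earlier; the dictionary layout and the `ancestors` helper are hypothetical conveniences, not anyone's published notation.

```python
# Each node maps to the list of nodes with arrows pointing into it.
parents = {
    "X": [],
    "Y": [],
    "L": ["X", "Y"],
    "C": ["L"],
    "I_Y": ["C"],  # indicator for whether Y is observed; depends only on C
}


def ancestors(node, parents):
    """Return every node from which `node` can be reached along arrows."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(parents["I_Y"])
print(sorted(ancestors("I_Y", parents)))
```

Writing I_Y as its own node makes the missingness assumption visible: its only direct parent is C, even though X, Y, and L sit further upstream.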

**Conceptual differences between Pearl’s and Rubin’s models**

I’ll just list a few differences that I’ve seen in this discussion:

– Following Rubin’s perspective, I define the causal effect of a treatment at a unit level. For simplicity I’ll stick with two levels of the treatment, 0 and 1 (for example, incumbency or an open-seat election). I define the treatment effect for a single unit as y^1 – y^0, or, if you prefer subscripts, y^1_i – y^0_i. In contrast, Pearl (and Wasserman, in his comment) define treatment effects as expectations: E(y|x) or, with more labor, E(y|x=1) – E(y|x=0). Pearl et al. can feel free to use this definition, but it’s different from mine.

– Here’s another example. Pearl describes the following problem:

Let X and Y be the outcomes of two fair coins, and let Z be a bell that rings if at least one of X and Y comes up head. We wish to estimate the causal effect of X on Y after collecting a huge number of samples, each in the form of a triplet (X, Y, Z). Should we include Z in the analysis? If so how? Would our favorite estimate of E(Y_x) be biased? Will it give us what we expect, namely, that X has no causal effect on Y, i.e., E(Y_x) = E(Y).

I think I may be missing something in this example, but if I understand it correctly, it doesn’t fit into the Rubin framework at all: in the Rubin framework, *decisions*, not *outcomes*, have causal effects. I’m not saying that Pearl shouldn’t be working on this problem–there’s clearly a lot of interest out there in methods for estimating causal effects of things that are not decisions/interventions/treatments. What I am doing is illustrating that Pearl’s and Rubin’s methods are different.
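The coin-and-bell setup is small enough to enumerate exactly, which at least shows why Z is a delicate variable to include. The sketch below compares associations only (differences in conditional probabilities); it is not a causal estimate in either Rubin's or Pearl's sense, and the function name `prob_y_given` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# The four equally likely (X, Y) outcomes, with Z = 1 if either coin is heads.
outcomes = [(x, y, int(x or y)) for x, y in product((0, 1), repeat=2)]


def prob_y_given(x, z=None):
    """P(Y=1 | X=x), or P(Y=1 | X=x, Z=z) when z is given, by enumeration."""
    matching = [o for o in outcomes if o[0] == x and (z is None or o[2] == z)]
    return Fraction(sum(o[1] for o in matching), len(matching))


marginal_diff = prob_y_given(1) - prob_y_given(0)
conditional_diff = prob_y_given(1, z=1) - prob_y_given(0, z=1)
print(marginal_diff, conditional_diff)  # 0 -1/2
```

Without Z, the two coins are unrelated; among the samples where the bell rang, knowing X changes the probability of Y. That is the collider pattern mentioned under Minimal Pearl.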

**Controlling for intermediate outcomes**

Pearl writes, “Let us focus on the easier example of an intermediate variable (Z) between treatment (X) and outcome (Y). Has anyone seen a proof that adjusting for Z would introduce bias?” There is no proof that an adjustment will always introduce bias (I think Corey’s right that, by “unbiased,” Pearl means “asymptotic consistency,” in statistical jargon). The theorem is that the adjustment *can* introduce bias, and to prove this theorem, we only need a single example, such as is given on page 191 of my book with Jennifer.

In any given example, there can be all sorts of other problems going on, measurement error, missing data, key unmeasured predictors, and so forth. And it’s always possible that “doing something wrong” (for example, controlling for an intermediate outcome) can actually make the estimate better. Just as, to borrow Pearl’s example, we might do better by adding 17.5 to any of our estimates. Except in trivial examples, we can’t prove that this is a bad idea.

Even while living in a world of uncertainty, we make assumptions and, from there, try to do things that are optimal (or nearly so) within our assumptions. (A big part of my work is thinking about how to check and improve these assumptions, but that’s another story.)

I realize I stated Rubin’s view incorrectly. He doesn’t actually say, “do not control for post-randomization variables”. What he does say is: do not try to control for them ignoring the fact that they are post-randomization–that is, do not treat them as fully observed covariates. This is how his instrumental variables stuff works. So there is no contradiction with his view that you should ideally condition on all observed values, even on post-randomization variables (which can be thought of as partially observed, in the potential outcome sense).

From my perspective, the point is that Rubin’s fully-Bayesian approach gets difficult in complex settings. Rubin would argue that this difficulty is inherent and should not be avoided. And Pearl correctly pointed out the sloppiness of my statement that “Jennifer and I recommend not controlling for intermediate outcomes.” A better way to put it is that it is appropriate to control for intermediate outcomes, but not if your only tool is unadjusted regression on available data.

I agree with Pearl that, “If you incorporate an intermediate variable M as a predictor in your propensity score and continue to do matching as if it is just another evidence-carrying predictor, no post processing will ever help you, except of course, redoing the estimation afresh.” More sophisticated adjustment–whether using Rubin’s framework, Pearl’s, or some other approach–is needed. My allusion to “post-processing” was too vague.

One place where my glib advice (“don’t control for intermediate outcomes”) can break down is in longitudinal studies of the sort where Robins and others have developed weighting methods to estimate causal effects. Again, the real point is, yes, it’s best to condition on all data; we just have to go beyond simple regression.

**Statistical models are based on assumptions**

Pearl refers to himself as a half-Bayesian and refers to “big-brother Bayes” as an impediment to clear thinking. I consider myself a Bayesian, pretty much. I don’t always use Bayesian methods, but when I don’t, I think of what I’m doing as an approximation to a more laborious full Bayesian approach.

Bayesian inference has only two steps: (1) set up a joint probability distribution for everything involved in your problem, (2) condition on observed data to get a joint posterior distribution for everything unobserved. Everything else in Bayes fits into these steps: model checking can be formulated in terms of step 2 (as posterior inference on replications), and model expansion goes into step 1.
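The two steps can be sketched on a toy problem where everything is discrete: a success probability theta restricted to a grid, and binomial data. The grid, the uniform prior, and the data (7 successes in 10 trials) are assumptions made up for this illustration, and `posterior_on_grid` is a hypothetical helper.

```python
from math import comb


def posterior_on_grid(grid, prior, successes, trials):
    """Posterior over a grid of theta values given binomial data."""
    # Step 1: joint probability of each theta value and the observed data.
    joint = [
        p * comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
        for t, p in zip(grid, prior)
    ]
    # Step 2: condition on the observed data by normalizing the joint.
    evidence = sum(joint)
    return [j / evidence for j in joint]


grid = [i / 10 for i in range(1, 10)]   # theta in 0.1, 0.2, ..., 0.9
prior = [1 / len(grid)] * len(grid)     # uniform prior over the grid
posterior = posterior_on_grid(grid, prior, successes=7, trials=10)
mode = grid[posterior.index(max(posterior))]
print(mode)
```

Realistic models replace the grid with continuous parameters and the normalization with simulation, but the logic is the same two steps.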

The most important objection to Bayesian statistics, in my opinion, is that, in realistic examples, the joint probability distribution is going to be arbitrary–often based on whatever data you happen to have at hand–and wrong. Is this a mortal flaw in Bayes? This is an empirical question; it depends on the example. Evaluation is complicated by the fact that statistical modeling, like all scientific activity, has some wiggle room. I tried a lot of models on the way to getting estimates of opinion on school vouchers.

Anyway, inference within the Bayesian framework is straightforward mathematics. We don’t have to go to the grave of Thomas Bayes and ask what he would do in a situation; we just set up the model and go from there. One advantage of Bayesian methods, to me, is that it puts the focus on the model rather than on the estimation procedure.

**In defense of taste**

In discussing different sorts of models, I wrote that, while some statisticians like to use discrete models, “I have a taste for continuity and always like setting up my model with smooth parameters.” Personally, I think discrete models make very little sense in most social science settings (I will make exceptions in some latent-variable settings such as conceptual models, personality traits, and party identification, along with social measurements that are highly correlated with discrete biological variables such as sex), and I think most of the discrete modeling in social science is a vestige of classical significance testing ideas. But I recognize that others have different tastes than I on this matter.

Pearl very rightly queries this statement of mine, writing: “The general attitude in this discussion has been to treat the issue as if it was a personal dispute about a wine tasting contest . . . both sides quote good reasons, so it must be a matter of taste, style, focus, perspective, interest, method etc. It isn’t.” In this case, Pearl is referring to a question of adjusting for pre-treatment variables rather than a question of model choice, but the same issue arises: why should applied mathematicians and research scientists (which is, ultimately, what we are) care about “taste”?

Perhaps it would help if I use the word “experience” instead. For example, I have experience using hierarchical models, interactions, graphical model checking, and all the other fun stuff that’s in my book, whereas Rob Tibshirani has experience with generalized additive models, lasso estimation, bootstrapping, and all the great stuff in *his* books. You could say that I have a taste for probability models and Rob has a taste for direct data-manipulation procedures, or that I have different experiences than Rob does. However you put it, I think Rob is going to do better statistical analyses using nonparametric methods than using hierarchical Bayes, just as I’ll do better the other way.

I don’t know if it really takes 10,000 hours to learn a new method, but it’s not always a bad idea to work with what works for you.

**Argument from authority?**

As Pearl notes, if a theorem is true, it’s true, and if it’s false, it’s false. It does sound silly to say that someone should use a certain method just because Gelman and Hill do, or because Rosenbaum does it in his book.

But what if you frame it slightly differently, and say that Gelman and King, for example, solved some existing problems in quantitative political science (incumbency advantage, the seats-votes curve, the effects of redistricting) using certain methods. So, just maybe, there’s something good about these methods, right? This sort of inductive reasoning is the basis of much of my work, which bounces back and forth between applications and methodology. Pearl is right that we should be able to focus on specific questions and not just rely on authorities (from Neyman and Fisher to Pearl and Rubin), but I don’t think it’s so unreasonable for me to hold up my applied successes as some validation of the methods I use.

Again, this is not to criticize Pearl’s approach on his problems, just to explain why I bristle at the suggestion that I’m doing the wrong thing, for example, by poststratifying.

**How could these issues be resolved?**

As I’ve already stated a few times, I think there’s room in the world for Minimal Pearl, Full Pearl, Minimal Rubin, and Full Rubin–even if I don’t think all four can be used at the same time on the same analysis of the same problem! Pearl and his colleagues (such as Wasserman and the anonymous author of the last section of the Wikipedia page on the Rubin Causal Model) believe Full Rubin to be a special case of Full Pearl, but as I’ve argued above, I don’t think so.

How could the methods be compared? One could imagine some sort of data competition–in fact, I think something like this was done recently (perhaps by Pearl himself, I don’t remember). The examples I have in mind are Dehejia and Wahba’s reanalysis of LaLonde’s analysis of a subset of data from a job-training experiment, or the notorious Harvard Nurses Study, which led to an observational analysis purportedly discovering a positive effect of hormone replacement therapy, a finding that was revealed to be in error from a later randomized experiment. The question in these examples is whether methods such as Rubin’s propensity score or Greenland and Robins’s g-estimation could reliably get good answers from observational data.

I don’t think such a competition would be conclusive though, in that different statisticians can do well using different methods. Much depends on details of implementation.

**Holes everywhere**

As I discussed earlier, scientific disagreement can be frustrating, and I think I very well understand the frustration that both Pearl and Rubin feel on this, in their own ways. I continue to believe that all the analytical tools we have–Rubin’s framework, Pearl’s framework, the normal distribution, the logistic distribution, and, yes, even Bayesian inference–are incomplete. They all have holes. Just to quickly list them:

– When you consider multiple treatment factors, Rubin’s framework leads to a proliferation of potential outcomes that has always left me confused. In addition, Rubin (in chapter 7 of Bayesian Data Analysis) recommends controlling for all variables that can affect treatment assignment. This is part of the more general recommendation to include all variables in a model, something that we realistically cannot generally do.

– Pearl’s framework seems to me to assume that each node in the network corresponds, via the do operator, to a particular treatment or manipulation. Realistically there can be many ways of altering a variable; including this in the model can lead to proliferation of nodes and no sense of how to proceed to estimate a causal model.

– The normal, logistic, etc., distributions can be useful but basically never fit real data. We always have the question of when to make our models more complicated and realistic.

– Bayesian inference–realistically, almost all useful statistical inference–depends on models. We are getting better at checking these models, but the theory can never be even close to airtight except in very simple examples.

**What I can contribute**

I’ve done some work on causal inference (notably, my 1990 article with King on incumbency advantage, and also my book chapter from 2004 on treatment effects in before-after data), but considering the other participants in this discussion, causality is clearly not my area of expertise. Most of my work is on statistical modeling, graphics, and model checking.

But . . . statistical modeling can contribute to causal inference. In an observational study with lots of background variables to control for, there is a lot of freedom in putting together a statistical model–different possible interactions, link functions, and all the rest. Further complexities arise in modeling missing data and latent factors. Better modeling, and model checking, can lead to better causal inference. This is true in Rubin’s framework as well as in Pearl’s: the structure may be there, but the model still needs to be built and tested. And, as we become more confident (without being overconfident) in our models, we can make them more complex and realistic.

The "I['m a] Bayesian, pretty much" link is broken. Thanks to the magic of Google's broken link page, I can even tell you how: you've pointed the link to your published folder, but the doc's actually in your unpublished folder.

I don't think this has been made explicit, so I'll state it plainly: Pearl isn't merely asserting that the two approaches are equivalent — he (claims to have) proved by construction the existence of the isomorphism between them. There are two relevant citations in his letter; they are both to his reference #2, which is his book "Causality".

Pearl's framework does not require that each node in the network corresponds, via the do operator, to a particular treatment or manipulation. Rather, each node is a random variable, and the do operator fixes a node X to X=x without affecting the joint distribution of the ancestors of X (in sharp contrast to standard conditioning).

It just occurred to me that this is exactly what BUGS's cut operator does, so maybe BUGS could be used to do Pearl-style causal inference…

Dear Andrew,

Thanks for a comprehensive summary of the discussion which, I am sure, would help many readers understand the fundamentals of causal analysis.

There are five brief comments I would like to add.

1. I would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty.

The reason I declared myself a "half Bayesian" is that, from my perspective, non-identifiability is more than just large posterior uncertainty. I see non-identifiability as "irredeemable" posterior uncertainty. Which means that on a certain surface in probability space the shape of the prior uncertainty remains the same no matter how many samples you take.

Simple example: Consider two competing causal models

X—->Y and Xhttp://ftp.cs.ucla.edu/pub/stat_ser/Test_pea-final.pdf) It means that

Full Rubin is a special case of Full Pearl, suffering only from two syntactic deficiencies: Don't use graphs, don't use structural equations; i.e., express all knowledge in the language of "ignorability" sentences.

In computer science, we can look back and imagine (counterfactually) what the world would be like had we disallowed compilers and forced everyone to program in machine language. The analogy is clear.

The two blunders I mentioned earlier (1. inappropriate conditioning and 2. paradoxical direct effects) are the first concrete manifestations of the harm caused by such prohibition. Let us hope they are the last ones.

Corey: 1. I checked but don't see any broken links.

2. You write:

My point is that this assertion, or proof, or whatever you call it, can't be correct in a practical statistical sense.

It's as if, for a homely analogy, someone offers me a proof that dollars and yen are equivalent. My response would be that you can buy a $5 coffee anywhere, but nobody will sell you a coffee for 5 yen. I don't know exactly where Pearl's theorem fails to apply–I'm sure the problem could be traced to the inapplicability of one or more of its axioms to certain real-world settings that I'm interested in–but I know there's some problem somewhere.

Regarding the do operator: you're basically restating my point, I think. Pearl is implicitly claiming that each node in his model can be manipulated in a single uniquely-defined way (as defined by the do operator), whereas Rubin's theory requires that the user state, as additional information, what it means for a variable to be manipulated.

Judea: Thanks for the long comment. It's great to have this open and free-flowing discussion.

For the reasons given above (including my response to Corey), no, I don't think Rubin's method is a special case of yours. I have no doubt that your theorem is mathematically true, so I think there must be some way that your axioms are inappropriate for the sorts of problems I'm working on.

I don't see why Full Rubin requires "Don't use graphs." I agree that Rubin doesn't use graphs–that's partly his taste, I think, he doesn't spend a lot of time graphing data either, he's more a symbolic thinker, nothing like me in that respect, actually. But I see no reason why, in general, users of the Full Rubin can't use Pearl-style graphs.

Similarly, I think that users of the Full Pearl can still use Minimal Rubin, in particular the idea of more explicitly considering the definition of potential outcomes (again, I refer to my 1990 paper with King for a simple example).

Finally, I completely, completely disagree with you in that example that the "correct thing to do is to ignore the subgroup identity." As I noted earlier, this is probably the only point of our discussion where I really am an expert. Although this is a minor issue in the world of causal inference, it's a huge, huge point in my research. I'll see if I can clarify it in a later post.

Perhaps some discussion of _arguably reasonable/defensible_ specifications for the joint distributions (priors and likelihoods) may be helpful (for me anyways).

(the do operator operates on the specification??? and should be separate from usual Bayes conditioning of the chosen joint probability model on "all" the data ??? as well as the choice of Larry's functional on the posterior? )

But in his response to the letters Rubin does seem concerned about the specification of indefensible models …

Fisher in his later writings seemed to be arguing that the most important task in the design of studies was to lessen dependence on assumptions that were difficult to make true/check. And Rubin seems to be concerned that the Pearl approach will lead to more likely use of less defensible specifications. Pearl's admirable encouragement for investigators to explicate their beliefs may need to be better managed.

This is not math, but _pragmatics_ and of large concern in applications.

Keith

Odd, I don't see a broken link anymore… (perverse computer mutter mutter).

From the discrepancy between the practices recommended by Pearl and Rubin, you

I also had an analogy floating around in my head, only mine was the equivalence (proved by Dyson) of Feynman's graphical path-integral approach and Schwinger and Tomonaga's operator approach to quantum electrodynamics.

Judea, var(b1 + b2) = var(b1) + var(b2) + 2*corr(b1,b2)*sd(b1)*sd(b2). If var(b1)=var(b2) and corr(b1,b2) = -1 then var(b1 + b2) = 0 no matter how large var(b1) and var(b2) are. I think that's what Andrew's talking about when he says that posterior uncertainties can cancel.
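The formula in the comment above is easy to check numerically. This is a minimal sketch with a hypothetical function name, `var_of_sum`, and made-up variances; it only restates the identity and says nothing about when a correlation of -1 would arise in a real posterior.

```python
def var_of_sum(var1, var2, corr):
    """var(b1 + b2) from the two variances and their correlation."""
    if not -1.0 <= corr <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    sd1, sd2 = var1 ** 0.5, var2 ** 0.5
    return var1 + var2 + 2 * corr * sd1 * sd2


# Two very uncertain pieces (variance 100 each) under different correlations.
print(var_of_sum(100.0, 100.0, 0.0))   # independent: 200.0
print(var_of_sum(100.0, 100.0, -1.0))  # perfectly anti-correlated: 0.0
```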

I'm reluctant to post anything because I am comparatively ignorant (and, perhaps, objectively ignorant) about the differences being discussed here. But at least I can play the role of Ordinary Guy, to try to encourage the discussion to stay at a level I can follow. So that's what I'll do.

First, if anyone claims that it is "wrong" to draw graphs to summarize, explain, or create a model, that's ridiculous. I doubt anyone, including Rubin, actually claims that — I suspect that if he said anything of the sort, he said that he doesn't personally find graphs helpful, or that he finds them confusing, or something. Whatever. Let's just agree that it is fine to draw graphs, and move on.

Second, I'd like to come back to the seat belts, smoking, and lung cancer example, because I am one of the people who learns better by generalizing from specific instances rather than the other way around.

Below, I've tried to create an ASCII-art graph of what I might propose as a model. Sorry, my ASCII-art skills aren't what they should be. The vertical bars are just used to draw boxes, they're not absolute value symbols or "or" signs or anything.

| risk………… | –> | no seat belt |

| taking………| –> | smoking | ——————–> | lung…. |

| propensity…| –> | other risky behaviors | –> | cancer |

The model is: Each person has some "risk-taking propensity", call it r. The higher their r, the more likely they are to fail to use a seat belt. Also, the higher their r, the more likely to smoke, which promotes lung cancer. And the higher their r, the more likely to do other risky things (like live in a house with a high radon concentration), some of which might promote lung cancer.

I suppose that before continuing, I should ask if this is what Pearl and Rubin and Gelman and everyone means by a "graph" of the model? It's what I picture.

Let's assume seat-belt-nonwearing does not cause lung cancer (note no arrow between seat belts and cancer on the graph). Smoking does cause lung cancer (arrow). Some other risky behaviors (ORB) also cause lung cancer (arrow).

Suppose seat-belt-wearing and smoking are observed; risk propensity, and risky behaviors other than smoking and seat-belt-wearing, are not.

If I wanted to model this, I'd write a model that pretty much matches the graph. Of course I would have to make some modeling choices about the links, e.g.:

p(no seat belt | r) ~ invlogit(beta1 * r)

p(smoking | r) ~ invlogit(beta2 * r)

p(ORB | r) ~ invlogit(beta3*r)

p(lung cancer | smoking, ORB) ~ invlogit(beta5*smoking + beta6*ORB)

and so on. (I haven't thought about what the link function should be for that last one, so maybe that's a bad choice; to give just one of many examples, smoking could increase the susceptibility to other insults to the lung, so there could be some sort of interaction between smoking and other risky behaviors). Perhaps I would need informative priors for some of the parameters (such as beta5, or beta5/beta6). Indeed, by not including non-seatbelt-wearing in that last one, I have already imposed an extremely strong prior on the causal relationship between seat-belt-wearing and lung cancer.
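To make the structure concrete, here is a minimal simulation sketch that follows the four equations as written. Several things are assumptions added for illustration: the risk propensity r is drawn from a standard normal, the beta values passed in are arbitrary, and `simulate_person` is a hypothetical name. Because the equations have no intercepts, someone with no smoking and no other risky behaviors gets a lung cancer probability of invlogit(0) = 0.5, which is clearly unrealistic; a usable model would need intercepts or different links.

```python
import math
import random


def invlogit(x):
    """Inverse logit, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_person(rng, beta1, beta2, beta3, beta5, beta6):
    """Simulate one person by following the arrows of the graph."""
    r = rng.gauss(0.0, 1.0)  # assumption: standard normal risk propensity
    no_seat_belt = rng.random() < invlogit(beta1 * r)
    smoking = rng.random() < invlogit(beta2 * r)
    orb = rng.random() < invlogit(beta3 * r)  # other risky behaviors
    lung_cancer = rng.random() < invlogit(beta5 * smoking + beta6 * orb)
    # r and orb are treated as unobserved in the discussion above.
    return {
        "r": r,
        "no_seat_belt": no_seat_belt,
        "smoking": smoking,
        "orb": orb,
        "lung_cancer": lung_cancer,
    }


rng = random.Random(1)
people = [simulate_person(rng, 1.0, 1.0, 1.0, 1.0, 1.0) for _ in range(1000)]
print(sum(p["smoking"] for p in people), "smokers out of", len(people))
```

Note that seat-belt use never enters the lung cancer line, so any association between the two in the simulated data comes only through r.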

I could certainly imagine objections to analyzing smoking/lung-cancer relationships this way — how are you going to estimate the causal effect of "other risky behaviors" that are completely unobserved, you'll have to put in such strong priors that they will drive the model, etc. — but I wouldn't think that either Rubin or Pearl (or anyone else) would say that this approach is "wrong."

I would have said that this is a Rubin-like (Rubinesque?) model. Perhaps it is a Pearl-like model. More likely, it is simply a Phil Price-like model (that's my name, don't wear it out). If it is both Rubin-approved and Pearl-approved, then what is the big debate about the difference between the approaches of these two titans? If it's neither Pearl-approved nor Rubin-approved, then I'm embarrassed but I would really like to hear what is wrong with it. If Pearl would approve but Rubin wouldn't, or vice versa, then I'm learning something…but I would like to have the Rubin or Pearl objection explained…hopefully in terms of the model parameters and input variables rather than the lingo of "colliders" and "M-bias."

Actually, I'd rather not focus on WHO would or wouldn't recommend this type of model, I'd rather focus on whether, when I create models like this, I am doing something I shouldn't (e.g. my estimates will be biased in the plain-English sense of the word), and whether there is something else I could do that is better.