Causal mediation

Judea Pearl points me to this discussion with Kosuke Imai at a conference on causal mediation. I continue to think that the most useful way to think about mediation is in terms of a joint or multivariate outcome, and I continue to think that if we want to understand mediation, we need to think about potential interventions or “instruments” in different places in a system. I think this is consistent with Pearl’s view although in different language. Recently I was talking with some colleagues about estimating effects of the city’s recent Vision Zero plan on reducing traffic deaths, and some of this thinking came up, that it makes sense to think about effects on crashes, injuries, serious injuries, and deaths. I also agree with Pearl (I think) that it’s generally important to have a substantive model of the process being studied. When I was a statistics student I was somehow given the impression that causal inference could, and even should, be done from a sort of black-box perspective. You have the treatment assignment, the outcomes, and you estimate the causal effect. But more and more it seems that this approach doesn’t work so well, that it really helps to understand at some level the causal mechanism.

Another way of putting it is that most effects are not large: they can’t be, there’s just not room in the world for zillions of large and consistent effects, it just wouldn’t be mathematically possible. So prior information is necessarily relevant in the design of a study. And, correspondingly, prior information will be useful, even crucial, in the analysis.

How does this relate to Pearl’s framework of causal inference? I’m not exactly sure, but I think when he’s using these graphs and estimating whether certain pathways are large and others are zero, that corresponds to a model of the world in which there are some outstanding large effects, and such a model can be appropriate in certain problem-situations where the user has prior knowledge, or is willing to make the prior assumption, that this is the case.

Anyway, perhaps the discussion of Imai and Pearl on these topics will interest you. Pearl writes, “Overall, the panel was illuminating, primarily due to the active participation of curious students. It gave me good reasons to believe that Political Science is destined to become a bastion of modern causal analysis.” That sounds good to me! My colleagues and I have been thinking about causal inference in political science for a long time, as in this 1990 paper. Political scientists didn’t talk much about causal inference at that time. Then a bunch of years later, political scientists started following economists in the over-use, or perhaps I should say, over-interpretation, of various trendy methods such as instrumental variables and regression discontinuity analysis. Don’t get me wrong—IV and RD are great, indeed Jennifer and I discuss both of them in our book—but there got to be a point where researchers would let the instrument or the discontinuity drive their work, rather than stepping back and thinking about their larger research aims. (We discuss one such example here.) A more encouraging trend in political science, with the work of Gerber and Green and others, is a seriousness about causal reasoning. One advantage of tying causal inference to field experiments, beyond all issues of identification, is that these experiments are expensive, which typically means that the people who conduct such an experiment have a sense that it might really work. Skin in the game. Prior information. Now I’m hoping that the field of political science is moving to a new maturity in thinking about causal inference, recognizing that we have various useful tools of design and analyses but not being blinded by them. I don’t agree with everything that Judea Pearl has written about causal inference, but one place I do agree with him is that causal reasoning is fundamental, and causal inference is too important to be restricted to clean settings with instruments, or discontinuities, or randomization. We need to go out and collect data and model the world.

140 thoughts on “Causal mediation”

  1. Or maybe people waste too much time studying really small effects? Because they wrongly believe that increasingly sophisticated mathematical methods can somehow magically clarify a tiny effect.

    When they could more fruitfully use their time to do the hard thinking and work to come up with a novel intervention that could really lead to a large effect size.

    Too little effort expended on coming up with large-effect-interventions and too much effort spent on torturing out a subliminal signal from noisy data?

    • Maybe this idea about finding “effects” is misguided to begin with? Instead the effort should be spent on finding consistent quantifiable relationships between different observations (i.e., “old school” science), or going the machine learning route and simply trying to predict what will happen from a web of observed correlations.

      Honestly, I don’t see much future for this “searching for effects” approach. Maybe it was useful in some cases to narrow down what we want to focus on, but we can see that it has now devolved into institutionalized noise mining.

        • >”If you find a consistent quantifiable relationship between symptoms and disease is that sufficient knowledge to tackle the disease?”

          This sounds tautological to me. Isn’t the disease defined by the symptoms?

        • Not really. Symptoms are simply markers of a disease, e.g. frequent urination is a symptom of diabetes, but diabetes is the lack of, or insufficiency of, insulin, resulting in elevated levels of blood glucose.

          So if you don’t see the point of searching for effects, should we abandon the practice of randomization in clinical trials? We can simply compute correlations in the observed data, and if they are consistent enough the FDA can approve the drug.

        • >”Symptoms are simply markers of a disease, e.g. frequent urination is a symptom of diabetes, but diabetes is the lack of, or insufficiency of, insulin, resulting in elevated levels of blood glucose.”

          And how is “insufficient” defined? When some set of symptoms is apparent, right? You seemed to be asking in the first post whether changing the definition of a disease to be more precise is sufficient knowledge to “tackle” it (cure it?). I am not certain though, and this did not clarify it for me.

          >”So you don’t see the point of searching for effects, should we abandon the practice of randomization in clinical trials?”

          Tools like random allocation, random selection, and blinding are great, but they are more “basic things to do whenever you can” than the panaceas or gold standards the medical community treats them as.

          >”We can simply compute correlations in the observed data and if they are consistent enough FDA can approve the drug.”

          Astronomy is one of our most successful sciences and that is basically what they do (besides the FDA part, which isn’t an organization I hold in any particular esteem). There was some discussion on here about them a while back.[1]

          I would say much more could be learned about cancer from fitting data from a database like SEER with theoretical curves than from any RCT. If it were up to me, cancer research funds would be diverted to improving the quality of the SEER data (which would include the oft-ignored denominator: census data) and to setting up other free databases filled with distributions of physiological parameters like the number of cells in each tissue by age, etc.

          [1] http://www.fda.gov/ohrms/dockets/ac/07/briefing/2007-4329b_02_01_FDA%20Report%20on%20Science%20and%20Technology.pdf

        • “I would say much more could be learned about cancer from fitting data from a database like SEER with theoretical curves than any RCT”

          I’m not sure if I understand your point. You don’t believe in experiments?

        • >”I’m not sure if I understand your point. You don’t believe in experiments?”

          Of course I do. For example, experimental data is way better than observational data for testing hypotheses. The whole point of an experiment is to reduce the various sources of error. However, for areas of research like medicine and psychology, at this point our understanding is very rudimentary. Rudimentary to the point that there are few actual hypotheses worth testing (however, conventional wisdom says you *need* to test a hypothesis to be science, so instead they test some default “null hypothesis” that no one actually believes).

          Now, that isn’t to say there aren’t other reasons for experimentation, like trying to get something to reproducibly happen. But if you look, you will find that the same fields that suffer from failure to test real hypotheses also fail to check very hard (or sometimes at all) for reproducibility. They mostly just waste time and money by spreading misinformation that confuses everyone.

      • Hmm…maybe I don’t understand what you mean. If I search for a new drug that can reduce malarial deaths isn’t that searching for an “effect”?

        Of course, if I find a *really good* drug then the “effect” will be evident with very little effort. No sophisticated statistical sleight of hand needed.

        My point was, in most cases people are trying to pass off ineffective interventions as somewhat-effective cures. That’s where the searching-for-subliminal-effects dance starts.

        • >”If I search for a new drug that can reduce malarial deaths isn’t that searching for an “effect”?”

          These “effects” exist all over the place for all sorts of reasons. Merely seeing an “effect” doesn’t tell you much. For example, there is the issue that these “effects” are often non-stationary (now you see it now you don’t). However, besides p-hacking, the most common and important issue is probably whether the “effect” being measured is really what you should care about. Interestingly, this is said to be a big problem when it comes to assessing the effectiveness of malaria treatments:

          “It is widely recognized that immunity makes a potentially substantial contribution to iRBC clearance rates and that fitting a “dead-awaiting clearance” class of iRBCs improves the model fit to clinical data (29, 30). It therefore seems extraordinary that there has been no objective investigation of the impact of host immunity on the use of iRBC clearance rates as surveillance tools for drug resistance and as efficacy tools for evaluating drug regimen changes. This was the impetus for the work presented here. Our model output suggests that host clearance processes such as immunity completely dominate the iRBC clearance phenotype unless artemisinin effectiveness is extremely low. This makes iRBC clearance rates highly insensitive to changes in underlying parasite drug sensitivity and to drug effectiveness caused by regimen changes.”
          http://www.ncbi.nlm.nih.gov/pubmed/26239987

        • In political science, I believe, things are way worse than that. All this effort to make little “effects” sound real and feasible is often useless from the beginning, since the causal relations in question are not ones that could help us ACT and change the world, fighting the “diseases” as in medical science. The complexity and abstractness of the social sciences combine with this misuse/over-use of statistical methods to produce thousands of works on effects that are not only questionable, biased, or small, but will not provide us with good instructions for making real-world politics better.

  2. “I think this is consistent with Pearl’s view although in different language. Recently I was talking with some colleagues about estimating effects of the city’s recent Vision Zero plan on reducing traffic deaths, and some of this thinking came up, that it makes sense to think about effects on crashes, injuries, serious injuries, and deaths.”

    It seems like the mediator in this example is crashes? Were you hoping to see whether vision zero impacts number of deaths by reducing number of crashes vs by reducing seriousness of crashes that occur? It actually seems nontrivial to formulate this properly as a causal question.

    You could try saying that every person in NYC on a given day has a counterfactual continuous crash variable C(0) representing the severity of any crash they would experience that day if vision zero were temporarily abolished. C(0) equals 0 for anyone who would not experience any crash. Y(0,C(0)) represents the counterfactual binary variable indicating whether a person would have died in a traffic accident in the absence of vision zero. I think it’s fair to assume that Y(0,C(0)) = Y(C(0)), i.e. that vision zero has no direct effect on death not through occurrence or severity of a crash. The overall average effect of vision zero is then E[Y(C(0))] – E[Y(C)]. Now, how much of this difference is due purely to P(C(0)=0)>P(C=0) (i.e. reducing number of accidents) and how much is due to E[C(0)|C(0)>0]>E[C|C>0] (i.e. reducing severity of accidents)?

    Suppose C|C>0 ~ P_1. Then perhaps we are interested in the quantity E[Y(C_1(0))] where C_1(0) ~ P_1 if C(0)>0 and is equal to 0 if C(0)=0. E[Y(C_1(0))] – E[Y(C)] could then be interpreted as the effect of vision zero on traffic deaths through reducing number of crashes alone. As for whether and when this is identified, I’ll consult Tyler VanderWeele’s book.
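
    To make the decomposition concrete, here is a toy simulation. Everything in it is hypothetical: the crash rates, the gamma severity distributions, and the severity-to-death curve are assumptions for illustration, not estimates of anything about Vision Zero.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Assumed generative model: the policy lowers both the crash rate and the
    # severity of crashes that still occur.  (All numbers are made up.)
    crash0 = rng.random(n) < 0.010                         # crash occurs without the policy
    crash1 = rng.random(n) < 0.008                         # crash occurs with the policy
    sev0 = np.where(crash0, rng.gamma(2.0, 2.0, n), 0.0)   # C(0): severity without the policy
    sev1 = np.where(crash1, rng.gamma(2.0, 1.5, n), 0.0)   # C: severity with the policy

    def p_death(sev):
        """Assumed monotone severity -> probability-of-death curve."""
        return 1.0 - np.exp(-0.05 * sev)

    # Hybrid counterfactual C_1(0): crash occurrence as without the policy,
    # severity for those who crash drawn from the with-policy distribution P_1.
    sev_hybrid = np.where(crash0, rng.gamma(2.0, 1.5, n), 0.0)

    y0 = p_death(sev0).mean()         # E[Y(C(0))]
    y1 = p_death(sev1).mean()         # E[Y(C)]
    yh = p_death(sev_hybrid).mean()   # E[Y(C_1(0))]

    print("total effect of the policy:  ", y0 - y1)
    print("through fewer crashes alone: ", yh - y1)
    print("through reduced severity:    ", y0 - yh)
    ```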

    • I think crashes means two vehicles are involved, but traffic deaths include pedestrians who are killed. In New York a lot of people walk. Also, I’m not sure whether a bicyclist being hit would count as a crash. On my 30-minute NYC drive to work I pass 2 white bicycles, which are memorials put up in places where a bicyclist was killed.

      • I think two of the most visible ideas in Vision Zero (which is very visible to drivers) are that reducing speed will reduce harm and that wearing seatbelts will reduce harm. Also built into this is a deterrence concept, which says that some locations should have very visible things like speed-monitoring police, red-light cameras (lights are where a lot of pedestrian injuries happen), and warnings like “click it or ticket.” However, in addition to the deterrence model (increase the costs of a behavior -> reduction of the behavior), there is a norm-resetting model, which the name “Vision Zero” implies, meaning that drivers are asked to be more conscious of these things. Finally, just the literal reduction of the speed limit on almost all streets may in and of itself have an impact. On top of all that, there have been a lot of site-specific interventions; for example, a road I drive on got new traffic signals at some pedestrian crossings. So I think overall it is very hard to disentangle what any actual causal mechanism might be.

      • I assumed Andrew just meant ‘incidents’ when he said ‘crashes’ because I think the vast majority of traffic deaths in the city come from car on pedestrian or car on bike accidents. Car-car crashes would mainly lead to deaths on highways, and I don’t think vision zero does much on highways.

  3. “Then a bunch of years later, political scientists started following economists in the over-use, or perhaps I should say, over-interpretation, of various trendy methods such as instrumental variables and regression discontinuity analysis. Don’t get me wrong—IV and RD are great, indeed Jennifer and I discuss both of them in our book—but there got to be a point where researchers would let the instrument or the discontinuity drive their work, rather than stepping back and thinking about their larger research aims.”

    Am I right in thinking that the latter point relates to external validity? In the sense that (many) researchers are driven by a search for (crude) proxies of natural experiments, without much concern for both 1) whether those sites / decision problems are representative, and 2) how the results might be interpreted within existing theory.

    On top of that, I’d be interested to hear your thoughts on Pearl’s recent work on transportability (I confess to not being able to understand much beyond his motivation).

    • Brian,
      Our recent work on transportability is now explained in non-technical terms on wikipedia (section 2):
      https://en.wikipedia.org/wiki/External_validity
      If you want to go beyond the motivation to the actual techniques, we have a new paper on that in PNAS:
      http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf
      If you can identify the obstacles to understanding these techniques, I will be happy to try and remove them.

      As you can see, the techniques involve graphs and do-calculus. This is unavoidable, the same way that if one wants to solve 3 equations with 3 unknowns one needs to invoke arithmetic operations — there is no other way. All attempts that I have seen to avoid modern techniques end up with Campbell and Stanley’s warning: Beware of threats, but nothing beyond that. It is a worthwhile investment, like arithmetic.
      Judea

    • Brian:

      We had a discussion about Judea’s “transportability” idea awhile ago on the blog. My short answer was that, if there is a question about whether inference from domain A could be applied to make predictions in domain B, I’d prefer a hierarchical model that does partial pooling.

    • Brian: This might make the “transportability” idea more familiar and possibly clarify Andrew’s preference.

      A quote from External Validity: From Do-Calculus to Transportability Across Populations, Judea Pearl and Elias Bareinboim
      http://arxiv.org/pdf/1503.01603.pdf
      “By pooling together commonalities and discarding areas of disparity, we gain maximum use of the available samples”

      Case.1
      Think of two independent sets of measurements on the same object, with unbiased measuring instruments of different unknown precision.
      Reasonable to conjecture that the mean parameter is common and the variance parameters different – pool for the mean but not the variance.

      Case.2
      Think of birth records on male/female births in different cities in France in a given year.
      Not reasonable to conjecture that the proportion parameter is common, as these vary more than by random sampling, but no pattern is discernible or suspected.
      Reasonable to conjecture the proportion parameter is _like_ a draw from a common distribution (e.g. Beta distribution) – pool that common distribution but not a common parameter per se (do partial pooling).

      Case.3
      Two randomized clinical trials of the same treatment in different hospitals.
      Reasonable to conjecture that the control proportion parameter is different but the ratio of the treatment proportion to the control proportion is common – pool (or partially pool) the ratio but not the control proportion.
      If you are interested in, say, the difference in proportions if the treatment were applied to a given population, use estimates of the untreated proportion in that population: ratio*population.proportion.i – population.proportion.i (see the numerical sketch after Case.n below).

      Case.n
      These can get very complicated, and formalism should be helpful – but not if it blocks partial pooling.
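
      Here is the numerical sketch promised under Case.3. All numbers are hypothetical, and a plain average of the ratios stands in for a real partial-pooling fit:

      ```python
      # Case.3 in numbers: two trials with different baseline risks but (we
      # conjecture) a common treatment/control risk ratio.
      hospital_a = {"control": 0.10, "treated": 0.15}   # ratio 1.50
      hospital_b = {"control": 0.20, "treated": 0.31}   # ratio 1.55

      # Pool the ratio (a real analysis would partially pool; a plain
      # average stands in for that here).
      ratios = [h["treated"] / h["control"] for h in (hospital_a, hospital_b)]
      pooled_ratio = sum(ratios) / len(ratios)

      # Transport to a target population with its own untreated proportion:
      # ratio*population.proportion.i - population.proportion.i
      population_proportion = 0.05
      risk_difference = pooled_ratio * population_proportion - population_proportion
      print(f"predicted risk difference: {risk_difference:.4f}")   # ~0.0263
      ```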

      • Keith:

        Yes, in my discussion with Judea, I opined that questions of statistical inference (including partial pooling, hierarchical models, Bayesian data analysis, etc.) are orthogonal to questions of causal inference. I didn’t want Judea to think of partial pooling as a competitor to his causal inference framework. Rather, to the extent he is “transporting” information according to his causal goals, I was suggesting he do so using hierarchical modeling, rather than completely pooling or completely discarding information.

  4. Andrew,
    I am not sure I can agree with you on the following:
    “I continue to think that the most useful way to think about mediation is in terms of a joint or multivariate outcome, and I continue to think that if we want to understand mediation, we need to think about potential interventions or ‘instruments’ in different places in a system.”

    If by a “joint or multivariate outcome” you mean the joint distribution of all variables, then I must disagree. A joint distribution tells us nothing about mediation. It is really important for everyone to internalize this fact. If by a “joint or multivariate outcome” you mean something else, I hope structural equations are part of that something, and distributions are excluded.

    As to potential interventions, it is useful to quantify mediation before we have specific interventions in mind. For example, we may wish to quantify the extent to which cholesterol mediates between diet and heart disease prior to having any instrument in mind; we simply wish to decide whether it is worth developing an instrument to control cholesterol levels.
    Judea

    • Judea:

      What I’m saying is that if there is a vector of pre-treatment variables X, a treatment variable T, and a vector of post-treatment variables Y, that I’d model the joint distribution of Y given T and X. To the extent I’m interested in mediation, I’d also consider instruments or interventions on individual components of Y. I think this is consistent with your approach of considering interventions at different nodes of a graph.

      • “To the extent I’m interested in mediation, I’d also consider instruments or interventions on individual components of Y.”

        I don’t think this is what most people mean when they talk about ‘mediation analysis’. Let’s split Y into a variable M for mediator and a variable Z. Mediation analyses typically target quantities like E[Z(T=1,M=0)] – E[Z(T=0,M=0)] (a ‘controlled direct effect’ setting the mediator to 0) or E[Z(T=1,M=M(T=0))] – E[Z(T=0,M=M(T=0))] (a ‘natural direct effect’ setting M equal to the value it would have had had T been set to 0). You’re proposing to separately estimate E[Z(T=1)] – E[Z(T=0)], E[M(T=1)]-E[M(T=0)], and E[Z(M=1)]-E[Z(M=0)]. The quantities you would estimate don’t give you the controlled or natural direct effects that would be obtained from a traditional mediation analysis.

          Z,
          You are right in concluding that Andrew’s statement about “instruments and interventions” stands contrary to “traditional mediation analysis.” I have reached the same conclusion without detailed analysis of the formulas involved. How? By simply examining the vocabulary that Andrew has used: e.g., any statement about “distribution,” “given that,” etc. tells us nothing about causation; any statement about “instruments” or “interventions” tells us nothing about counterfactuals (except for some bounds).

          This is one important ramification of the 3-level causal hierarchy, which I am linking below, as promised.
          http://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

          I have found vocabulary analysis (based on the hierarchy) to be extremely helpful in sorting out valid from invalid conjectures, especially when it comes to non-technical communication.

          Judea

        • Z:

          Here’s what I said, “is that if there is a vector of pre-treatment variables X, a treatment variable T, and a vector of post-treatment variables Y, that I’d model the joint distribution of Y given T and X. To the extent I’m interested in mediation, I’d also consider instruments or interventions on individual components of Y.”

          I recognize that this is different from traditional mediation analyses such as Lisrel etc. That’s because I don’t think those analyses usually make much sense! As I said, if I’m interested in mediation, I’d need to have a model of what would happen if these intermediate variables (“M,” in your notation) are changed. This could be done by actual experimentation on M, it could be done via observation of a “natural experiment” on M (some sort of “instrument”), or by some substantive modeling assumptions. Or by some combination of the three.

          Judea is correct that I did not use the word “counterfactual” (or even the more general term “potential outcome”). But I was thinking about counterfactuals or potential outcomes, and I was referring to them implicitly when I wrote about “instruments or interventions on individual components of Y.” Next time I write about this, maybe I’ll use the term “counterfactual” or “potential outcomes” to emphasize this point!

        • I’ve actually never heard of Lisrel. I learned this stuff from Jamie Robins’ and Tyler VanderWeele’s work, so maybe what I call ‘traditional’ is actually relatively new. My point though is that “a model of what would happen if these intermediate variables…are changed” is not sufficient. You can know how interventions on T change M and Z in expectation and also how interventions on M change Z in expectation without knowing what the direct effect of T on Z not through M is. Neither the controlled direct effect (i.e. the effect of T on Z if M is set to the same value for everybody) nor the natural direct effect (i.e. the effect of T on Z if M is set to the value it would naturally take in the absence of intervention on T) is generally a function of the average effects you would learn through separately experimenting on T and M. And the controlled and natural direct effects can be of real substantive policy interest, and methods for mediation analysis can identify them (under assumptions of course).

          (As a side note, experimenting on T and M at the same time in the same experiment can get you the controlled direct effect. The natural direct effect actually can’t be obtained by any experiment because it’s a ‘cross-world’ quantity.)
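
          A small simulation sketch of this point (the generative model below is invented for illustration): separate randomizations of T and of M give average effects, while the joint experiment recovers the controlled direct effect E[Z(T=1,M=0)] – E[Z(T=0,M=0)], which is not in general a function of the two marginal numbers.

          ```python
          import numpy as np
          from scipy.special import expit

          rng = np.random.default_rng(1)
          n = 1_000_000
          u = rng.normal(size=n)                      # unobserved heterogeneity

          def m_of(t):
              """Mediator under do(T=t)."""
              return (rng.random(n) < expit(-1.0 + 1.5 * t + u)).astype(float)

          def z_of(t, m):
              """Outcome under the joint intervention do(T=t, M=m)."""
              return (rng.random(n) < expit(-2.0 + 0.5 * t + 1.0 * m
                                            + 0.8 * t * m + u)).astype(float)

          # Separate experiments: total effect of T, and effect of M with T
          # at its natural (here 50/50) distribution.
          t_nat = (rng.random(n) < 0.5).astype(float)
          total_effect_t = z_of(1.0, m_of(1.0)).mean() - z_of(0.0, m_of(0.0)).mean()
          effect_m = z_of(t_nat, 1.0).mean() - z_of(t_nat, 0.0).mean()

          # Joint experiment: randomize T and M together -> controlled direct effect.
          cde_m0 = z_of(1.0, 0.0).mean() - z_of(0.0, 0.0).mean()

          print(f"total effect of T:          {total_effect_t:.3f}")
          print(f"effect of M (natural T):    {effect_m:.3f}")
          print(f"CDE of T with M fixed at 0: {cde_m0:.3f}")
          # No function of the first two numbers recovers the third in general,
          # because of the t*m interaction and the shared u.
          ```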

          Andrew,
          My difficulty in agreeing with what you are saying stems from my inability to interpret what you mean by “I’d model,” or “I’d also consider,” or “the most useful way to think about mediation is,” or “we need to think about potential interventions.”

          Mediation analysis has reached the point where researchers are not only “thinking” about this or that term, but are actually “formulating” their problems in those terms (plus more) as part of their models and inference procedures. Thus, I am not sure if by saying “I’d also consider” you mean to add new considerations that others forgot to consider, or to encourage readers to acquire the tools now being used in causal mediation.

          And I hope you find the Causal Hierarchy to be helpful:
          http://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

          Judea

        • Judea, it seems to me that the issue is whether, when one writes down an equation Y = f(A,M(B,C)) + Err, one is modeling *observations* of Y given observations of A,B,C, vs. modeling a process in which the values of the variables A,B,C actually cause the value of Y to necessarily be near f within the error Err, possibly by B,C causing M to take on certain values, etc.

          Andrew says “I’d need to have a model of what would happen if these intermediate variables (“M,” in your notation) are changed. This could be done by actual experimentation on M, it could be done via observation of a “natural experiment” on M (some sort of “instrument”), or by some substantive modeling assumptions. Or by some combination of the three.”

          What I take “some substantive modeling assumptions” to mean is to build an equation which describes an assumed causal process, perhaps the terminology you’d use is “structural equation” but I’m not sure.

          Sometimes we may not have much of a structural equation, but we can still find out about causality by doing experiments on A,B,C and observing M and Y, perhaps we can find combinations of B,C which cause M to stay constant, and can test whether Y stays constant regardless of the B,C value so long as M is constant, etc. Those are what I take Andrew’s comments about experimentation or natural experimentation to mean.

          Because I often work on problems of physical sciences I often write down equations which I take to be causal without first going through any graphs or do calculus etc. For example Newton’s equation a = F/m, if I apply the F I expect that “a” will be affected directly.

          I can appreciate that there are areas of study where coming up with equations that describe causal relations could be much more difficult, and where having a formal system to work out models might be helpful. But my impression is that Andrew hasn’t yet found a formal system to be helpful for the problems he works on. Yet I still think he agrees with you that without some assumptions and/or some experiments involving “do”-ing something, we can’t fit causal models. My impression is that while he fully understands the distinction in concept, he just doesn’t see the need for a formal language describing the difference between “when I write down y = f(A,B,C) I mean it will hold causally with respect to forcing changes to occur on A,B,C” vs. “I observed that y = f(A,B,C) when I had no control over A,B,C.”

          It has always seemed to me to be the case that you strongly prefer a formal language for making this distinction, and Andrew prefers to simply write down models and tell people which case he means using informal words. It’s never struck me as being the case that you two have a disagreement on the need for a distinction in the concepts though.

          In many physical sciences we do the more informal version routinely. We might use a “structural” equation, say a = F/m, and then have another structural equation F = f(A,B,C), both of which imply causal assumptions. But when it comes to figuring out how, say, C works, we might rely on purely observational information p(C | Q,R,S), where we’re well aware that we don’t think we can change Q,R,S and cause C to do what we want. And yet this can be sufficient for us: perhaps Q,R,S are observable, not really under our control, but if we have a prediction for C then we can control A,B in such a way as to get the right F, because the f relationship is a structural or causal one. For example, perhaps we are working with a speed control device, where A,B are the throttle and brake, and C is a kind of drag that depends observationally on things Q,R,S we can measure with sensors. We don’t imagine we can adjust the output of the sensors and control the drag, but we don’t have a special formal language to make the distinction between the causal connection f and the non-causal connection C ~ c(Q,R,S) + err, in other words p(C | Q,R,S).
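
          To sketch that informally in code (all functions and numbers below are made up for illustration): the same kind of equation can encode a structural relation or a merely observational one, and the difference shows up only in how we allow ourselves to use it.

          ```python
          # Structural relations: setting these inputs really does set the outputs.
          def force(throttle, brake, drag):
              return 10.0 * throttle - 8.0 * brake - drag

          def accel(f, m):          # a = F/m
              return f / m

          # Observational relation: a regression of drag on sensor readings Q, R, S.
          # Changing q, r, s changes our *prediction* of drag, not drag itself.
          def predict_drag(q, r, s):
              return 0.5 * q + 1.2 * r - 0.3 * s

          # Use the observational relation to predict, the structural ones to control:
          drag_hat = predict_drag(q=1.0, r=2.0, s=0.5)       # predicted drag
          target_a, mass = 0.8, 100.0
          throttle = (target_a * mass + drag_hat) / 10.0     # brake = 0
          print(f"predicted drag {drag_hat:.2f} -> throttle {throttle:.2f}, "
                f"a = {accel(force(throttle, 0.0, drag_hat), mass):.2f}")
          ```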

          You might say that we should adopt your formal language. I’m not sure it is going to help everyone to add that formal language to the mix. But regardless of whether it will or won’t help to have the formal language, it is absolutely impossible to make progress on causal models without understanding the fundamental conceptual difference! And there, I think you and Andrew seem to agree, which makes these back-and-forths you have frustrating, because it sometimes appears that there are vast gulfs in concepts where perhaps there are simply gulfs in language.

        • Daniel,
          I am glad you found fundamental agreement between Andrew’s position and mine. I have also welcomed the prospect of such agreement after Andrew wrote that he does not insist on thinking in terms of conditional distributions, or in terms of interventions, but would embrace any thinking, including thinking with “some substantive modeling assumptions.” This brings us to a total conceptual agreement.

          My only concern in my last post was to understand what “thinking” means, as in “the most useful way to think about mediation is …” You interpret it to mean thinking informally, and you assume that I would insist on adopting some formal language. I don’t. I just want to understand what you do after you “think.” Do you proceed to get an answer to the research question that you asked (in our case, the extent of mediation)? Or are you satisfied with doing the thinking, then running some well-recommended software, hoping that it reflects your thinking?

          Let us try it on a simple example. Assume we have observational data on just three binary variables, X, Y and M, a very large sample, and we need to estimate the extent to which M mediates between X and Y. What do we think about, and what do we do after we think?

          There is honestly no trap intended here, just a genuine attempt to understand your understanding of how “informal thinking” works.

          Truly appreciating your taking the time to strive for an understanding.

          Judea

      • Andrew
        I have to disagree with what you are saying:
        “What I’m saying is that if there is a vector of pre-treatment variables X, a treatment variable T, and a vector of post-treatment variables Y, that I’d model the joint distribution of Y given T and X. To the extent I’m interested in mediation, I’d also consider instruments or interventions on individual components of Y. I think this is consistent with your approach of considering interventions at different nodes of a graph.”

        I wish I could agree and say “Yes, it is consistent with my approach,” because this would have given readers the comfort that causal analysis has reached a state of consensus, which is good for science and good for peace and prosperity on earth. Unfortunately, what you have written is diametrically opposed to the current conceptualization of mediation. I think this is the essence of what Z has written.
        1. The “joint distribution of Y given T and X” has nothing to do with mediation, nor with causal effects.
        2. “Instruments or interventions on individual components of Y” can only provide information for the Controlled Direct Effect, not for mediation, which is inherently a counterfactual concept and hence requires information from the 3rd layer of the causal hierarchy.

        I have the feeling that not many readers are familiar with the causal hierarchy, so I will compose a note and post a link in a subsequent comment.

        Judea

  5. Judea:

    You ask: “Assume we have observational data
    on just three binary variables, X, Y and M, very large sample,
    and we need to estimate the extent to which M mediates between
    X and Y.
    What do we think about, and what do we do after we think?”

    Typically in problems I have attacked, I posit a scientific hypothesis based on what I know about the system, so in this example I am handicapped by the fact that there is nothing really to think about in terms of scientific content since all the content we have is that there are 3 variables and they have names. So, perhaps I make up some background to make this problem more realistic.

    The binary outcome Y is whether or not a patient is classified by some pre-existing rules as being “in remission” for a particular cancer, the variable X is whether or not a particular fixed dosing regimen of a drug was given for a month, and the variable M is whether or not antibodies to the cancer are detected by some chemical lab test at 1 month.

    Now, I think about what the cancer fighting drug is intended to do, perhaps it is a kind of modified coat protein of the cancer cells, something that looks like the cancer cells, only causes a strong immune reaction where the cancer cells themselves don’t and this immune reaction attacks both the injected proteins, as well as the cancer cells.

    So, I hypothesize that the presence of the dosing often causes an immune reaction, and this reaction is what causes remission, and I consider this against an alternative where the X directly causes remission without an immune reaction, perhaps because the drug is toxic to the cancer cells directly.

    In the first model the frequency of observed remission will be a function of the detection of the immune reaction, and the frequency of the detection of the immune reaction will be a function of the dosing vs not dosing.

    In the second model the frequency of observed remission is a function of dosing which is the same regardless of the presence of the immune reaction.

    Now, to fit this model, I use Bayesian reasoning, in which I write down the probability of observing Y under the two alternatives, and I give some prior information about how much I think one or the other alternative is probably right. Perhaps I really don’t believe that the drug itself is toxic, so I’d put 80% probability on the immune reaction mediating the effect: P(Model1) = 0.8. Then

    P(Y,X,M) \propto P(Y|M,Model1) P(M|X,Model1) P(Model1) + P(Y|X,Model2) P(M|Model2) P(X|Model2) P(Model2)

    where in Model2, M and X are independent, and Y is independent of M but not of X.

    Then I fit my distribution to find parameters that are not mentioned here but are actually built into the P expressions and are involved with things like logistic regression formulas etc.

    The causal analysis is in determining that there are two different possible ways in which X could alter the outcome Y, one through altering the outcome M and one where Y would be expected to occur at a single constant rate regardless of whether we know M. In the end I’ll have both parameter distributions, and posterior values for my model probabilities (a model-choice problem).

    The causal analysis helps me decide which models to consider, the Bayesian analysis just extracts the information that the data gives me about which of the two possible models fits the data and my expectations better.

    In the end, if P(Model1) is large enough, I might come to ignore Model2 and then focus on trying to understand how strongly and how effectively the dosing of X induces the immune reaction M.
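
    For concreteness, here is a minimal numerical version of this model-averaging computation. The counts and priors are hypothetical, I use closed-form beta-binomial marginal likelihoods in place of the logistic-regression machinery, and I assume the P(M|X) part is shared by the two models so that it cancels in the Bayes factor:

    ```python
    import numpy as np
    from scipy.special import betaln

    def log_ml(successes, total):
        """Log marginal likelihood of a Bernoulli sequence under a Beta(1,1) prior."""
        return betaln(successes + 1, total - successes + 1) - betaln(1, 1)

    # Hypothetical counts n[x, m, y] from a study with randomized X.
    n = np.array([[[400, 50], [30, 20]],    # X=0: rows M=0, M=1; cols Y=0, Y=1
                  [[150, 30], [80, 240]]])  # X=1

    # Model 1 (mediation): Y depends on M alone.  Model 2: Y depends on X alone.
    # The shared P(M|X) factor cancels in the Bayes factor and is omitted.
    ll1 = sum(log_ml(n[:, m, 1].sum(), n[:, m, :].sum()) for m in (0, 1))
    ll2 = sum(log_ml(n[x, :, 1].sum(), n[x, :, :].sum()) for x in (0, 1))

    prior1, prior2 = 0.8, 0.2
    post1 = 1.0 / (1.0 + (prior2 / prior1) * np.exp(ll2 - ll1))
    print(f"posterior probability of the mediation model: {post1:.3f}")
    ```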

    • Also, I’ll posit that we’ve assigned the treatment X based on a random number generator so given the large sample size, there is nothing different about the two populations, and maybe we’ll even go so far as to say that IF the treatment looks promising we’ll take the untreated group and treat them with the drug X at a later time, so that we could potentially have a two phase experiment and a two phase analysis, but so far we’re just in the first phase.

    • Also, instead of a discrete choice between “there is mediation” and “there is no mediation” we might consider embedding the whole problem in a larger model where “no mediation” corresponds to some parameter having a value of zero, and then perhaps we’ll place a prior over this parameter so that say 20% of the probability is within \epsilon of zero for this parameter. This kind of choice depends on the specifics of our chosen equations.

    • > So, I hypothesize that the presence of the dosing often causes an immune reaction, and this reaction is what causes remission, and I consider this against an alternative where the X directly causes remission without an immune reaction, perhaps because the drug is toxic to the cancer cells directly.

      Daniel, are you considering the alternative where X does indeed cause an immune reaction, but X causes remission through a different mechanism (perhaps toxicity) and the immune reaction has no direct effect on remission?

        • Being more precise (I’m distracted by small children all weekend, please bear with me):

          P(Y,X,M,Model2) = P(Y|X,M,Model2) P(M|X, Model2) = P(Y|X,Model2) P(M|X,Model2)

          in other words, if M is acausal for Y, then M provides no information about Y, once you know X, so P(Y|X,M) = P(Y|X)

        • Daniel,
          Your proposed solution introduces me to a totally unfamiliar world. Is it your own innovation, or part of a known approach to inference that other readers of this blog would recognize?

          I would like very much to understand its rationale in this context, because it is totally new to me, and I see that it encapsulates wisdom that could be useful. (One nice thing about working out a toy example is that people understand what you are doing; a rare event these days.)

          Let me summarize what I have read of your method. You are postulating two models:
          M1: X —> M —> Y
          M2: X —> M and X —> Y (no arrow from M to Y, hence no mediation)
          M1 says 100% mediation and M2 says 0% mediation. You are assigning them priors, P(M1) = 80% and P(M2) = 20%, and you fit them to data to get their posteriors, say p1 and p2. Finally, once you get these posteriors, you would say that p1 percent of the effect of X on Y is mediated through M and p2 percent goes directly, bypassing M.

          Have I read you correctly so far? I am not sure, because you say: “Then I fit my distribution to find parameters that are not mentioned…” Why fit your distribution if it is given to you with the data (recall: we assumed a very large sample)? Moreover, we know that the data, being generated from a mixture of the two models, will reject each one of them separately as the number of samples increases. So, I am afraid we will end up with p1 = P(M1).

          These are some of my worries, but the main thing is to confirm the philosophy of your approach. Did I read it correctly?

          Judea

        • Judea, I think you’ve understood the basic concepts, I explain with further details some of your concerns in a reply to Carlos below. I do believe that many people in this blog will understand my description of the modeling process as fleshed out below. I appreciate this opportunity to try to come to some reconciliation!

        • But Daniel, your Model2 represents the effect of X on Y not mediated through M. The question requires that you isolate only the part of the effect of X on Y passing through M.

        • Ok. But if I understand correctly, Model2 also makes it plausible that X can affect Y directly, so the final results seem to me like a mixture of both direct and mediated effects (because you are averaging the two models). How do you separate the mediated effect from the mixture model?

        • I’m not sure I follow your reasoning. Say you have two different scenarios:

          A) X=treatment causes (and is perfectly correlated with) M=immune response which causes (and is perfectly correlated with) Y=remission

          B) X=treatment causes (and is perfectly correlated with) M=hair loss and X=treatment also causes (and is perfectly correlated with) Y=remission

          In both cases, the experimental data will be a set of data points of the form (X=0,M=0,Y=0) and (X=1,M=1,Y=1).

          Wouldn’t your computation give exactly the same result for scenario A ( X => M => Y ) and scenario B ( X => M, X => Y )?

        • Judea, Carlos,

          So one problem with working with a toy example and symbolic explanations is that they are ambiguous; they hide the details. As Judea said:

          “‘Then I fit my distribution to find parameters that are not
          mentioned…’ Why fit your distribution if it is given to you with
          the data”

          What are these parameters not mentioned? When I write P(….) in the above formulas, these are not numbers, they are formulas. So, for example, we have supposed that I have no further information about the individual patients than the Y,X,M as mentioned. In this situation, our information about individual patients with the same values of Y,X,M is symmetric. In such a situation I would use a binomial distribution to represent the probability associated with any vector of outcomes Y because it is symmetric with respect to any sequence Y_i that has the same sum(Y). That is, it’s insensitive to certain irrelevant labels, such as the names of the patients.

          In the binomial distribution we have a parameter often called p (for probability of success), but which I want to call f here. I am going to make an explicit difference in concept between the frequency in a large sample and the concept of probability (which I reserve for numerically assigned values of plausibility under Cox-type axioms). In this conception, plausibility includes frequency in large samples as a special case (still valid, but not the only case).

          so I have for each Y_i

          Y_i ~ binomial(1,f(X,M));

          where f is a function that combines my information about X and M into a frequency of occurrence. Simply because we have mediation of an effect by M does not mean that the effect is totally 100% deterministically effective. Perhaps we can get remission Y without M, or without X. Perhaps mediation by immune response varies in effectiveness across the different patients, the different cancers, and with regards to other variables which I have no information about (such as immunocompromised patients etc).

          For example, we might be using something like f = logistic_curve(f0*(a + 1*X + b*M + c*X*M)).

          In this example, f0 is the base rate of remission when we treat with X and see no M, a*f0 is the rate of remission when untreated, etc. This is not the only possible formula, but let’s explain the motivational ideas behind choosing such a formula. First off, the drug X could be effective by itself through toxicity. Thus, the frequency could increase from a*f0 to f0*(a+1) even without M. Secondly, M could mediate an even further improvement in effectiveness, so f0*(a+1+b) when M is observed and no interactions between X and M are assumed. Finally, perhaps when M occurs, the actual presence of X at each dosing causes the immune system to kick into high gear for a while… so there could be a “bonus” effect, and in a more complex dosing regime or a trial with a variable duration of dosing etc., this might be important. It also matters when M occurs even though we gave no X. So I might “think” harder about this to produce an even more specific f function that takes into account background information, like how many times per month the drug is given and how much I think this bonus interaction effect occurs through time (perhaps it is dynamic; the immune system gets “used” to X or whatever).

          I now provide some prior information, i.e., distributions over f0, a, b, c, which depends on scientific information that I presumably have because I’m working with people who have been studying this treatment for years. So I want to fit posterior distributions over a, b, c, etc., and because Judea says we have very large samples, they will be tight distributions, *unless there is some non-identifiable aspect of the model*.

          In the alternative model, I believe that the frequency of occurrence is independent of M, so I use a binomial distribution whose f depends only on X.

          Y_i ~ binomial(1,f2(X))

          Now, I may be able to realize that when b and c are both zero, model M1 *might* be equivalent to this model (it depends on the details of f2), so I could *perhaps* provide a prior p(b,c) that puts my 20% probability on b = c ≈ 0 to within epsilon and stick with one model. Or I could continue as in the previous posts, with a mixture model in which Model2 simply has 20% probability.

          As to Carlos’ question about would my model provide the same computation when M = hair loss?? The stage at which I do a causal analysis of the *meaning* of M and do my “thinking” about the science involved, is where I choose my models. I would not choose to analyze the hair-loss example with the same models.

        • Note: the logistic_curve(x) = 1/(1+exp(-x)) is there just to constrain the range of f to [0,1]. I could have used any function there, and then I might want to do nonlinear transforms on the arguments to get a proper fit… so there is modeling ambiguity here that is inessential to the issue at hand, but I mention it because later I talk about f0, a, and b as adding to f, whereas in fact they add to the argument of the logistic_curve function; please ignore this imprecision.
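
        For concreteness, here is a runnable sketch of the model described above. All parameter values are hypothetical; logistic_curve is the standard inverse logit, available as scipy.special.expit:

        ```python
        import numpy as np
        from scipy.special import expit   # logistic_curve(x) = 1/(1 + exp(-x))

        def remission_freq(X, M, f0=0.5, a=-2.0, b=1.5, c=0.5):
            """f(X, M) = logistic_curve(f0*(a + 1*X + b*M + c*X*M))."""
            return expit(f0 * (a + X + b * M + c * X * M))

        rng = np.random.default_rng(2)
        n = 100_000
        X = (rng.random(n) < 0.5).astype(float)                           # randomized dosing
        M = (rng.random(n) < np.where(X == 1.0, 0.6, 0.1)).astype(float)  # immune response
        Y = (rng.random(n) < remission_freq(X, M)).astype(float)

        print("observed remission rate by (X, M):")
        for x in (0.0, 1.0):
            for m in (0.0, 1.0):
                sel = (X == x) & (M == m)
                print(f"  X={x:.0f}, M={m:.0f}: {Y[sel].mean():.3f}")
        ```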

        • Daniel,
          I am glad you found my understanding of your proposal to be faithful to what you are proposing.
          But I have some difficulties with your description above.

          1. Since we have only three binary variables, X, M, Y, the probability P(x,m,y) can be totally specified using seven independent parameters, p(0,0,0), p(0,0,1), …, p(1,1,1). No additional information is given to us by the data.

          All the bells and whistles we can use with binomials, Bernoullis, logistic curves, etc., will not change the fact that all we get from the data is just P(x,m,y), seven parameters. So, at the end of the day, no matter how sophisticated one is in playing with priors on priors on priors, one must end up with an answer of the form: the degree of mediation is a function of P(x,m,y).

          2. This, of course, is un-doable, because mediation is a level-3 concept, so it cannot be computed from strictly level-1 information like P(x,m,y). We therefore need assumptions from level 3, i.e., the level of counterfactuals. The conventional way of introducing such assumptions is to make ignorability assumptions of various kinds. These are assumptions about independence among potential outcomes (which can also be made using graphs, but this is beside the point). The Causal Hierarchy tells us that without such assumptions we can say nothing about mediation. Nada!!!
          http://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

          3. Now I look at your proposal and I do not see any such assumptions. I ask myself: perhaps they are hidden in the binomials, Bernoullis, logistic curves, etc. But these are all probabilities and frequencies, not counterfactuals. So my lazy eyes tell me: do not bother with the details of those binomials, Bernoullis, and logistic curves; the basic ingredients from level 3 are missing, so nothing resembling mediation can come out of this exercise.

          Am I wrong in this laziness? Perhaps you can just point me to those ingredients in your proposal.

          The nice thing about a toy problem is that we can convey our ideas in a transparent way, without pages upon pages of parametric models. Can we do it on this example?

          ps. You mentioned that your approach will be familiar to most people on this blog. Does it have a name? Hierarchical Bayes, perhaps? Another? My main objective in getting into this sort of discussion is to understand the thinking paradigm of readers of this blog, because it is so totally foreign to the way people in the causal-inference literature are thinking that I keep telling myself: surely, one toy example should unveil the gulf. I am still hoping. This is a great opportunity.

          Judea

        • Judea, this seems interesting because I think it brings up points of disagreement that are enlightening, so I will proceed.

          1) “Since we have only three binary variables, X,M,Y, the probability P(x,m,y) can be totally specified using seven independent parameters”

          I disagree. Probability, in the Cox/Jaynes/Bayes conception, is a description relative to a state of information about the world. *The data* does not, by itself, give you probabilities. Furthermore, probabilities over the parameters are what is of interest to me, not probabilities over outcomes, because it is the parameters which tell me about the *physics/chemistry/biology*. This is why it is not possible to describe how to solve this problem without the details: it’s the physics (and by extension chemistry, biochemistry, laboratory instrument construction, and so forth) that gives us our state of information.

          So, if someone came to me with this problem, I would assure them, for example, that dichotomizing the X and M variables is a crappy way to do the analysis. Surely the blood lab can give us antibody concentrations in ng/ml, and the dosage of X could be divided by the patient mass to give a dimensionless ratio, or whatever. So, yes, at the moment you have only the binary information that M exceeded some threshold, say 25 ng/ml, but I also have the *causal, physics and biochemistry based* information that there does not exist a threshold in the response of people to the concentration of M. If 25 ng/ml is the reporting threshold and someone has 24.99774 ng/ml, they do not cease to have any effect from the immune response. So, my goal is to specify what I think is going on in the physics/chemistry/biochemistry as best as possible, and then if you give me very limited information, I will still use my causal physical model to extract what information about the underlying physics I can from your data.

          2) The causal physical information comes in with the choice of my f function. I gave you some examples of why I might choose to make f have a certain form, but the truth is, since we’re making this example up, we can’t be very specific about how the f(X,M) function should look, because we don’t have the specifics of the physics/chemistry/biochemistry to inform us about f; we’re left just giving some examples of the considerations that would change the ways in which f could be constructed. One thing, however, is that the f function can represent what we believe about the continuous processes that underlie the dichotomized variables. In my example, we could talk about the counterfactual where we give X and also give an immunosuppressant drug to eliminate what we think the M effect is, and then even if we see M, the immune response, we should still expect, in the presence of IS = immunosuppressant, f(X=1, M=1, IS=1) = f(X=1, M=0), for example. So we can talk about the counterfactuals, but we cannot talk about them in the abstract; we need to know how they are implemented, because in the real world that implementation detail alters the physics/chemistry/biochemistry, and we need a model for it.

          3) As to the laziness, yes, I think you are wrong, because the essence of the causal and counterfactual portions of my model is in the construction of an f function that represents our causal information about how the chemistry changes the overall frequency of occurrences based on dose-response (and yes, it's a dose-response model even though we're only measuring the dosage very poorly, in a dichotomized way). Unfortunately, in a toy example there is no specific physical/biochemical information to bring to bear. I could make some up, if it helps to elucidate. But perhaps you simply agree with me that "thinking" about the relationships between fundamental physical variables and the measurements is a causal kind of thinking, and that it results in at least a valid attempt to test some assumptions about the world against data.
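
          To make points (1) and (2) concrete, here is a minimal Python sketch (all functional forms and numbers invented for illustration): the lab reports only the binary M = "antibody concentration above 25 ng/ml", but the model is written in terms of the continuous concentration, and the immunosuppressor intervention gets its own argument in f rather than being treated as literally setting M = 0.

          import numpy as np

          def p_remission(x_dose, m_conc, is_drug=0, a=-1.0, b=0.08, c=0.03):
              """Made-up smooth dose-response: no threshold anywhere in the
              physics. When is_drug=1 the immune pathway's effect is switched
              off, even though the lab might still report M=1."""
              m_eff = m_conc * (1 - is_drug)  # the drug acts on M's effect, not the report
              return 1 / (1 + np.exp(-(a + x_dose + b * m_eff + c * x_dose * m_eff)))

          # Patients just below and just above the reporting threshold respond
          # almost identically, though one is reported M=0 and the other M=1:
          print(p_remission(1.0, 24.99774), p_remission(1.0, 25.0))

          # Counterfactual from point (2): with the immunosuppressor, an observed
          # immune response contributes nothing, i.e. f(X=1,M=1,IS=1) = f(X=1,M=0):
          print(p_remission(1.0, 40.0, is_drug=1) == p_remission(1.0, 0.0))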

          So, does that help us get closer together?

        • Daniel,
          Yes, this clarification helps bring us closer together.
          A few comments:
          1.
          I now see where your counterfactuals are coming from: they come from deterministic functions like f(X,M) that you
          assume. Fine. I thought your f's were probability functions, since they were labeled Bernoulli, Binomial, etc.
          (To prevent such mishaps, let's agree that when we write "f" it stands for a deterministic function,
          and P will stand for distributions.)

          2.
          I do not agree that, in a simple binary 3-variable example, we cannot communicate without lengthy explanations about
          the chemistry of the process. I am not asking for chemical justification, only for what is assumed and
          how we compute mediation. So, can we just say that we assume two deterministic functions, f and g,
          such that Y=f(X,M,u) and M=g(X,u),
          where u is the identity of the unit (e.g., the subject)?

          We are done; we do not need the details of the chemistry when we are not asking for justification.
          Now we can communicate with generic functions f and g,
          without committing to a specific functional form, which will save us lots of writing.

          3.
          So, let's communicate with f and g.
          Assume we already have these beautiful functions
          Y=f(X,M,u) and M=g(X,u).
          How do we compute the extent to which M mediates between X and Y for individual u?

          4.
          Moreover, suppose we do not have the form of the functions f and g,
          but we have data in the form of P(x,m,y).
          Can we use the availability of P to supplement for the absence of f and g and calculate the average mediation
          over all units u?
          What assumptions do we need to do that?

          5.
          I happen to know the answers to (3) and (4), because I have done some reading in the causal-inference literature.
          But I am wondering how these two questions would be handled in the paradigm with which you feel comfortable.

          BTW, you never gave me the name of your paradigm.
          Is it all in Cox/Jaynes/Bayes?

          Judea

        • @Daniel

          Just a thought: it'd be interesting to take a real-world problem and have you and Judea juxtapose your analyses of it.

          What I’d love to know is whether this is just an academic disagreement Judea has or whether it changes the conclusions substantively on any real problem.

          The problem is that most of these discussions happen in the abstract or over toy problems.

          What I’d love to see is to take the same real world problem & have Andrew approach it using hierarchical models & Judea using DAGs etc.

        • Rahul
          You say:
          What I’d love to know is whether this is just an academic disagreement Judea has or whether it changes the conclusions
          substantively on any real problem.

          Where did you get the idea of a disagreement?
          Is it not possible for someone to attempt to learn from another without being depicted as disagreeing?

          We are trying to understand how a simple problem of mediation is being handled in two scientific communities,
          one with which I am familiar and one with which everyone here (so I am told) is familiar.

          We are close to reaching understanding, and we may end up with a conclusion that, given the same assumptions,
          the two paradigms will always end up with the same solution.

          Judea

        • Judea:

          I’ve followed multiple exchanges between you and Andrew on the blog & it gave me the impression that you essentially disagree with his techniques being a valid way to do “causal inference”?

          Well, apologies if I misinterpreted you and, in fact, you do agree that both Andrew’s and your approaches are valid ways to do causal inference.

          >>>I am not sure I can agree with you on the following: I continue to think that the most useful way to think about mediation is in terms of a joint or multivariate outcome, and I continue to think that if we want to understand mediation, we need to think about potential interventions or "instruments" in different places in a system.<<<

        • Rahul,
          If I gave the impression that I disagree with anyone, I need to clarify.
          You write:
          :I’ve followed multiple exchanges between you and Andrew on the blog & it gave me the impression that you essentially disagree with his techniques being a valid way to do “causal inference”?

          I never got to understand Andrew's techniques of doing "causal inference", because Andrew refrains from discussing toy problems,
          and I cannot think in terms of "real life" problems, where one never knows if the result obtained is correct or just happened.
          If I expressed disagreement, it was with the way Andrew characterized his techniques, not with the techniques themselves, which
          I never understood.

          You wrote:
          >Well, apologies if I misinterpreted you and, in fact, you do agree that both Andrew’s and your approaches are valid ways to do causal inference.

          Not yet. I do not know Andrew's approach, but I see hope for mortal me to agree with Daniel's approach, since Daniel is willing to
          teach it to me in the context of a toy problem.
          Whether Daniel's approach equals Andrew's, I know not. Only a toy problem will be able to tell.
          Judea

        • Hello Professor Pearl,

          I have read your slides and one of your papers on mediation. One way of interpreting some of your work on mediation is as introducing an operator like “hold M constant” alongside “control for M”. Am I right that the difference between these two in your Gender, Qualifications, Hiring network is as follows:

          Controlling for M means considering men and women who happen to have the same qualifications and looking at differences in hiring.

          Holding M constant means considering men and women who have the same qualifications (because the experimenter intervened by e.g. using similar resumes) and looking at differences in hiring.

          Your three level hierarchy suggests that you cannot make counterfactual claims without a model that supports the latter operation.

          Is this correct?

        • @judea

          Toy problems & abstract arguments are great, but they have their limitations. In particular, I sense we are going around in circles with these toy problems in this specific case of trying to evaluate your approach against Andrew's.

          In my opinion, it’d be productive to take a specific “real” problem, with numbers and attendant “messiness” & have someone apply both approaches (Judea’s & Andrew’s) to it.

          Maybe Daniel will volunteer to do the hard work!

        • Rahul,
          The limitation that you see in toy problems, namely abstract arguments instead of real numbers, can
          easily be removed. I can easily decorate the 3-variable example with a plausible scenario and real data.
          (In fact, I will do it if readers request.)

          I cannot do much about messiness, and I have never seen an example where insight was gained by deliberate messiness.

          As to your sense that we are going in circles, I have two comments:
          1. My sense is that we are converging on an understanding, not going in circles, because I am learning how people in Daniel's camp are
          thinking.
          2. We are not trying to evaluate two approaches because, before such an evaluation can be conducted, we first need
          to agree that we are trying to estimate the same quantity. So far I have defined what quantity I am trying to estimate.
          Soon Daniel will define his target quantity. Once we do that, there will be no two approaches, but one logic to decide
          if we estimate things correctly or not. The days of "approaches" are over; we are in the scientific age.
          Judea

        • Neil Girdhar
          The difference between "controlling for M" and "holding M constant" plays a role in mediation, but
          it is not sufficient. We need another operator, "freeze".
          I will explain using the example you mentioned: sex discrimination in hiring.

          Controlling for M means considering men and women who happen to have the same qualifications and looking at differences in hiring.
          Yes

          Holding M constant means considering men and women who have the same qualifications (because the experimenter intervened by e.g. using similar resumes) and looking at differences in hiring.
          Yes, and beautifully expressed.

          Your three level hierarchy suggests that you cannot make counterfactual claims without a model that supports the latter operation.
          Not exactly. The second operator is interventional, do(M=m). For estimating the indirect effect we also need an operator that
          allows gender to affect qualification but freezes what the employer knows about gender.
          We can operationalize this operator by imagining an experimenter who fakes the “Gender” entry in the resume.
          This keeps the perception of gender fixed but lets Gender affect qualification and hiring.

          Judea

        • Thank you for your informative reply Prof Pearl,

          You said:

          > Not exactly. The second operator is interventional, do(M=m). .For estimating indirect effect we also need an operator that
          allow gender to affect qualification but freeze what the employer knows about gender.
          We can operationalize this operator by imagining an experimenter who fakes the “Gender” entry in the resume.
          This keeps the perception of gender fixed but lets Gender affect qualification and hiring.

          However, I don’t see why this is not another (slightly more clever) case of “holding” in the following way. To estimate the indirect effect of gender on hiring via qualifications, do the following:

          – Set Gender to Man
          – Measure Hiring (call it A)

          – Set Gender to Woman
          – Infer Qualifications and Hold its value
          – Set Gender to Man
          – Measure Hiring (call it B)

          E(B-A) is the indirect effect.

          Compare this with the direct effect:

          – Set Gender to Man
          – Infer Qualifications and Hold its value
          – Set Gender to Woman
          – Measure Hiring (call it C)

          E(C-A) is the direct effect.

          Essentially, the indirect effect is the expected hiring of a man with “a woman’s qualifications”. The direct effect is the expected hiring of a woman with “a man’s qualifications”.

          Did I make a mistake? Can this not all be done with a single “Hold” operator used judiciously?

        • Neil Girdhar,
          Good question.
          The answer is that if we can separate
          Gender from "Set Gender" we are OK:
          we can regard "set gender" as a new variable,
          and what I called "freezing" may be interpreted
          as "holding"; no problem.

          The reason I said that we need a new operator, "freezing",
          was to cover situations where we do not have this
          facility to change gender perception without changing
          gender itself. If we only have three variables
          X, M and Y, and we are allowed do-operators only on
          these three, we cannot identify the NIE
          (unless we have non-confounding assumptions).

          Good question.
          Judea

        • > As to Carlos’ question about would my model provide the same computation when M = hair loss?? The stage at which I do a causal analysis of the *meaning* of M and do my “thinking” about the science involved, is where I choose my models. I would not choose to analyze the hair-loss example with the same models.

          The same problem arises if you compare M = immune response in the case (X => M => Y) and M = immune response in the case (X => M, X => Y). In your example you were doing model selection; my point is that it may work for the two particular models you chose to compare, but I don't think it works in general. And if you already know which model is the correct one, there is no need to do model selection.

          I think the original question can be interpreted in the following form: assuming that X affects the outcome via M and also directly (i.e., the model contains both X=>Y and X=>M=>Y), estimate the relative importance of those paths. As CK also pointed out, you considered only the two extreme cases.

        • I think I gave that misimpression of considering just the two cases because, as a toy problem, I glossed over the specifics of the f(X,M) function, but if you look back I did give some sense of the specifics. In that example (which is purely made up) I do have separate terms for both a direct effect of X and an indirect effect of X through M, as well as an effect where, in the presence of M, X can be even more effective. So I am actually considering a model where X can have a direct effect, M can have a direct effect, and the two have a synergy as well, all of this altering the frequency of remission. Obviously the choice of the model requires the specifics of a real problem, but I acknowledge that we need to consider multiple interactions.

          In the second model I am explicit in saying that M has no effect.

          After large data with sufficient experimental conditions to exercise all the major features of both models, I will have one model dominating. If I do not have one model dominating (i.e., the probabilities of the other models do not go to zero), I must either design additional experiments to exercise the features that distinguish between the models, or look at the mathematical forms and try to figure out whether perhaps they do not have distinguishing features, or there is no way to distinguish between them… in which case perhaps they are actually the same model in different language, etc.
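
          As a minimal sketch of what "one model dominating" can look like (simulated data, made-up effect sizes, and uniform Beta priors chosen only because they give a closed-form marginal likelihood), one could compare "Y depends on X and M" against "Y depends on X only":

          import numpy as np
          from scipy.special import betaln

          rng = np.random.default_rng(1)
          n = 100_000
          X = rng.integers(0, 2, n)
          M = rng.integers(0, 2, n)
          p_true = 1 / (1 + np.exp(-(-1.0 + 1.0 * X + 1.5 * M)))  # invented world: M matters
          Y = rng.random(n) < p_true

          def log_marginal(groups):
              """Log marginal likelihood: an independent Bernoulli rate for each
              group, Beta(1,1) prior, Beta-Binomial closed form."""
              return sum(betaln(1 + Y[g].sum(), 1 + g.sum() - Y[g].sum()) for g in groups)

          m1 = log_marginal([(X == i) & (M == j) for i in (0, 1) for j in (0, 1)])  # Y | X, M
          m2 = log_marginal([X == 0, X == 1])                                       # Y | X only
          print(m1 - m2)  # large and positive: the model with an M effect dominates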

        • Daniel, looking at your latest comments I see you acknowledge that your procedure may fail to identify parameter values. That was my point: if your model includes both the direct and mediated effects, you may not be able to distinguish them. In the example I gave, X could cause M which causes Y, or X could be causing M and Y at the same time. To establish causality you would need to change M while keeping X unchanged, and see if Y does or doesn't change.

          By the way, it's true that the seven-dimensional joint probability distribution P(X,M,Y) is a sufficient statistic. And the original question was about a "very large sample"; in that case the likelihood will dominate the prior and your result will converge to well-defined probabilities P(Y|X,M)=f(X,M), etc.

          The interesting part of the question is of course how causality is defined and what can be inferred from observations, interventions, etc. The parameter estimation is an orthogonal issue. I think toy problems with large samples and a handful of binary variables are complex enough to have interesting discussions about causality.

        • Carlos: the P(X,M,Y) "sufficient statistic" describes only what might happen in repeated application of the exact same type of experiment. What if I wish to give a different dosage of drug X? What if I measure M in terms of nanograms/ml of antibody instead of "greater than or less than threshold X" and wish to handle this case? What if, instead of some kind of body scan that tells me whether the number of cancer cells detected is or is not above threshold, I have a new blood-based quantitative measure of how much of some cancer metabolite there is? You can perhaps think of the P(X,M,Y) for 3 binary variables as estimating the value of a function at 8 points of a 3-dimensional space. Those 8 points are certainly not sufficient to learn the whole function in general.

          I think it's a mistake to model *the data* directly in many cases; I think it's more appropriate to model an underlying process together with a measurement process. In the limit of large samples, even if we get the f function at the 8 points exactly, if the measurement process sucks, we may still not recover sharply peaked information about the underlying process. For example, if we see that after 1 month 30% of patients are at below-threshold levels of cancer metabolites, it tells us only that in those 30% of patients the cancer-killing process was at least as fast as some rate r; it cannot tell us whether it was even faster, or precisely how fast, etc.
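
          A throwaway numerical sketch of that last point (the initial load, detection limit, and time scale are all invented): a below-detection reading only bounds the clearance rate from below, so the likelihood is a step, not a peak.

          import numpy as np

          Q0, d, t = 1e9, 1e3, 1.0        # made-up initial count, detection limit, months
          r_star = np.log(Q0 / d) / t     # slowest rate consistent with "below detection"

          def likelihood(r):
              """P(below detection | rate r) under deterministic decay Q0*exp(-r*t):
              0 below r_star, flat at 1 above it."""
              return (Q0 * np.exp(-r * t) < d).astype(float)

          r = np.linspace(0, 30, 301)
          lik = likelihood(r)
          # The data cannot distinguish r = r_star + epsilon from r = 2 * r_star:
          print(r_star, lik[r > r_star].min(), lik[r > r_star].max())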

          Also, YES YES YES, I absolutely acknowledge that models and data together may fail to identify everything. This is actually a great thing about Bayes. A point-estimation procedure like maximum likelihood can mislead us into thinking we know the answer. A full Bayesian posterior can show us what it is we still don’t know, and also can help us design additional experiments and additional measurement methods etc

          I also agree that in this discussion Judea, you, I, and a few others are coming to an explicit understanding of the separation between *how the causal model arises* and *what method we use to fit/infer it*. And in discussions of how the causal model arises, I think there are also some philosophical issues, such as whether to model underlying processes, and what is or is not an intervention that needs its own model. So I hope it is helpful for a community larger than just Judea and myself.

        • Daniel, the sufficient statistic describes the data. Relative to the model, there is no information in the data beyond what is included in the sufficient statistic: the likelihood is a function of the sufficient statistic only. And when you have enough data, only the likelihood matters.

          Of course things will be different if you have a different model (or you have little data, or you have time-varying parameters, or you do a completely different experiment, etc., etc.). And be reassured, nobody expects you to find a cure for cancer using three binary variables. However, I agree with Judea Pearl that toy examples facilitate understanding. There is no need to make them complex just because you can (but of course when you understand well enough the toy problem you might upgrade to a more complex toy).

        • Carlos: I think one of the most important objections is that if you insist on a case in which there is no physics/chemistry/science, then the whole procedure is pointless.

          Suppose we are going to collect 100000 data points of X,M,Y. Suppose there are several causal models:

          1) You will flip 3 coins and write down a vector, such as 0,1,0… You will then copy this vector 100000 times and paste it into a file.

          2) You will flip 3 coins 10000 times, copy each one down 10 times after each flip.

          3) You will flip 2 coins and depending on the X,M values you will deterministically always choose the same Y = f(X,M) for some f…

          4) You will flip 3 real mechanical coins 100000 times, but each will be flipped in a special coin-flipping machine designed by Persi Diaconis so as to always output the same known series of bits, equal to the output of the Mersenne Twister algorithm with seed = 1.

          It is distinctly the case that given these different options, we do NOT have the same state of information about the world after seeing the data. This is the essential ingredient to Cox/Jaynes Bayesianism… the probabilities are NOT *in the data*.

        • Daniel,
          As Carlos pointed out, averaging X->M->Y with X->Y will not give you the mediated effect. What you need is to figure out how to "freeze" (using Judea's term) the X->Y path in the mixture of X->M->Y and X->Y, in order to isolate X->M->Y, the quantity of interest.
          CK.

        • Another way to say this is: whether or not some statistic is sufficient is a property of the model (a small sketch after this list makes the point). With my example models:

          1) the first vector is sufficient to completely specify the file
          2) You can tell me only every 10th vector and I can completely reconstruct the file
          3) after seeing one vector each starting with 00, 01, 10, 11 I can completely reconstruct the file by knowing only the X,M values.
          4) You don’t need to tell me anything about the data file, I can make R reconstruct it perfectly by set.seed(1) and a few other commands.
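
          A scaled-down sketch of that point (10 base flips instead of 10000; an iid fair-coin model is added for contrast, since it is not on the list above): the same file, scored under two models, shows that whether the ordering matters, and hence what is "sufficient", depends on the model.

          import numpy as np

          rng = np.random.default_rng(0)
          flips = rng.integers(0, 2, size=(10, 3))    # model 2, scaled down: 10 base flips...
          data = np.repeat(flips, 10, axis=0)         # ...each copied 10 times (100 rows)
          shuffled = rng.permutation(data)            # same counts, different order

          def loglik_iid(rows):
              """iid fair-coin model: every file of this length is equally likely,
              so the counts alone carry all the information."""
              return rows.size * np.log(0.5)

          def loglik_blocks(rows):
              """Copy-in-blocks-of-10 model: only files made of constant blocks of
              10 are possible at all, so the arrangement matters."""
              blocks = rows.reshape(-1, 10, 3)
              ok = (blocks == blocks[:, :1, :]).all()
              return blocks.shape[0] * 3 * np.log(0.5) if ok else -np.inf

          print(loglik_iid(data) == loglik_iid(shuffled))      # True: order irrelevant
          print(loglik_blocks(data), loglik_blocks(shuffled))  # finite vs. -inf (almost surely)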

        • CK, model 1 includes both a direct and a mediated effect; model 2 posits that M has exactly no effect. Under certain circumstances the data will not identify the mediated effect: for example, if every time I give X I also get M, and every time I don't give X I never get M.

          If those things are the case, then I can, for example, do something like take cancer cells in a dish and apply X randomly to some dishes; now there is no immune system to make things happen, and I can see whether there is a killing effect from X alone, and I can model or hypothesize whether the in-vitro effect of X and the in-vivo effect of X would be the same. And then I can identify the X effect separately from the M effect. If there is also an XM effect, then perhaps I can do some experiments that help to identify it.

          The point is:

          1) I acknowledge the need to be able to identify the X vs the M effect, and that model 1 does not necessarily do so under every circumstance.

          2) I acknowledge the possibility to perturb the system by actions and to use those results to infer information about different effects.

          3) I disagree with the idea that all interventions that could inform me about the separation of effects can be thought of as some abstract “do(X=1,M=0)” sort of thing. Instead, each conceivable intervention needs its own physical causal model. Some of those physical interventions MAY in some instances be modeled by simply setting the value of an observable to a given value M=0 for example, but this is a physical/chemical/biological/social-science/whatever question.

        • Daniel, I’m glad to see you agree with me on sufficient statistics being a complete description of the data relative to the model (and being different when the model changes).

    • Daniel:
      How did you account for the direct effect of X on Y (i.e., the effect of X on Y mediated through mechanisms other than M)? Not accounting for the direct effect will contaminate your results.

  6. The only reason Andrew’s opinions on causal inference are taken more seriously than those of some arbitrary commenter is that they appear in yellow. He is not an expert on this topic (despite the title of his blog), and the discussion need not center around his (consistently) inaccurate statements.

    • Anonymous:

      You’re right. I don’t know anything. You’re the expert here, not me. I’ll work on modifying the blog software so that your comments are in yellow. Then I’m sure everyone will take your opinions much more seriously! Really I can’t figure out why anyone would read anything that I write.

    • “… they appear in yellow …”

      Huh? Check your browser settings because Andrew’s comments never appear in yellow on my iPhone, Safari, or Chrome. Or alternatively reboot your computer and replace the user.

  7. Judea: starting a new thread again because replies can only nest a few levels deep on this blog.

    You say: “I now see where your counterfactuals are coming from, they come from deterministic functions like f(X,M) that you
    assume. Fine. I thought your f’s are probability functions, since they were labeled Bernoulli, Binomials, etc.”

    Now that you realize there is a counterfactual, and a scientific justification, and that some assumptions are being made, etc., I ask you to please bear with me in delving deeper into this community and the ideas from Cox and Jaynes. Yes, that would at least be the name for the "school" of thought I am adhering to; it may not be universally understood on this blog, but it is probably reasonably well understood by many.

    First off, we will agree, I think, that there are two conceptually separate issues. The first is to understand some science or some mechanism by which causality occurs, and then to make some assumptions in the form of equations, or differential equations, or the like, which express this process precisely and quantitatively (i.e., think up the f functions). However, we cannot write down every component of these equations by pure thought!! Therefore, the second issue is to learn from data and assumptions put together some information about the precise form of those equations, including the numerical values of various symbolically named quantities that are needed. You mention "deterministic" functions, and a confusion because I discuss Binomial distributions at the same time as what you now interpret as deterministic equations. There is, indeed, a synthesis needed here. Let me try to explain it.

    1) In the first place, we ARE going to use the mathematics of probability. But how? In general, we are NOT going to explain physical processes “as if” they were the repeated independent output of a random number generator with a Binomial distribution or any other distribution (at least not in general, we may do that in some specific cases). However, we also are NOT going to say that there is only one possible outcome associated with observations of X,M. What are we going to do?

    The general flavor is this: we are going to write down equations, such as a function f(X,M), which describe some knowledge about a chemical/biological process, and we are going to have some additional symbolic quantities that we need to use, perhaps f(X,M,a,b,c,d,e,…). If we knew the numerical values of all of those a,b,c,d,e,… and we saw some data X,M, then our scientific understanding in writing down the f would allow us to make a prediction about some vector of outcomes Q_i for i in 1…N (one Q for each patient, for example). And the operation of some measurement instrument would transform those Q_i values into Y values, either 0 or 1 (no or yes for "in remission"). Perhaps, for example, the Q_i are some prediction of the number of cancer cells left in the body, based on some unknown initial number of cells and some rate equations related to the toxicity of the X drug and the effectiveness of the immune process that might mediate M, etc. Then the "measurement" that gives us Y is some kind of, say, blood draw or body scan that tries to count those cancer cells, and if it finds none then the patient is classified as "in remission". But the measurement is crude: it doesn't give us Q, just "yes or no, was Q large enough for detection?" (This is because of our assumption that Y,X,M are binary; I want to explain how to reconcile a possibly continuous scientific process, which we model in forming the f, with a dichotomous measurement in collecting data.)

    In general we cannot predict that vector of outcomes Q_i exactly. So, in fact, we will be forced to assign more or less plausibility to different possible values of Q_i. That is, we need to ASSIGN a distribution over possible Q values given the X, M, a, b, c, etc. This assignment is *in our head* or *for all we know*; that is, values with bigger P(Q) we think are more likely than values with smaller P(Q).

    And then there will be the real Q_i values that occur. And then from the real Q_i values through some measurement there will be the Y_i values that occur. You see, through these different possibilities, there is not necessarily “determinism”. For example, we don’t KNOW the initial count, we don’t KNOW the final count Q and we don’t KNOW the detection threshold exactly, and we don’t KNOW the correct rate constants a,b,c, etc

    The one thing we DO know, however, is that when the a,b,c values are set to some particular values a1,b1,c1, the probabilities assigned to the observed outcomes X,Y,M (taken as vectors) may be larger than when the a,b,c values take on other values a2,b2,c2, etc. And so, together with our assigned prior probabilities for what we think the a,b,c values are, the axioms of Cox, as elucidated further by Jaynes, say in essence that we should assign higher probability to the a,b,c values being a1,b1,c1 than to them being a2,b2,c2, and in general we get a distribution over these unknown a,b,c values: a posterior distribution, after our calculation of probabilities.

    So we wind up with a posterior distribution over the unknown quantities within our causal scientific explanation for how things work, the a,b,c values, and the "extent of mediation" is now something we need to interpret by looking at those values of a,b,c which have high probability, looking at the specific functional form of f(X,M,a,b,c…), and interpreting what those a,b,c values mean scientifically for our process. For example, if c is large, perhaps it means that our equation for f predicts that the case where M occurs will have a much higher frequency of remission than the counterfactual case where M did not occur. So, in that specific case, c, together with the physical or biological assumptions, determines some estimate of the causal mediation of the outcome.
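
    A scaled-down sketch of this whole chain, with invented numbers throughout (and a and b treated as known only to keep the grid one-dimensional): a latent dose-response governed by an unknown c, a crude binary measurement Y, and a grid-approximated posterior over c.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 5000
    X = rng.integers(0, 2, n)
    M = (rng.random(n) < 0.2 + 0.5 * X).astype(int)      # made-up g: X promotes M

    def p_remission(X, M, c, a=-2.0, b=1.0):
        """Made-up f: probability that the latent cell count falls below
        detection by month 1, as a function of treatment X and response M."""
        return 1 / (1 + np.exp(-(a + b * X + c * M)))

    c_true = 2.0
    Y = (rng.random(n) < p_remission(X, M, c_true)).astype(int)

    # Grid posterior over c with a flat prior on [0, 4]:
    c_grid = np.linspace(0, 4, 401)
    loglik = np.array([np.sum(Y * np.log(p_remission(X, M, c)) +
                              (1 - Y) * np.log(1 - p_remission(X, M, c)))
                       for c in c_grid])
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    print(c_grid[post.argmax()])   # peaks near c_true when c is identified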

    • Daniel,
      I think we are converging.
      What you call a,b,c… I called u (the unit), which can be represented as a vector of characteristics a, b, c…

      I also understand your calculation of the posterior distribution over the u’s
      (I have used it once, in chapter 8 of my book, pp 277-289)
      No problem.

      What I do not understand is why we should wait for this posterior before
      defining the "extent of mediation".
      Surely, if we can define the "extent of mediation" when we have a posterior over a,b,c…,
      we should also be able to define it for a given vector a, b, c.

      This is what I asked in question (3), and I believe once we do it for a given pair
      of functions (g, f) it will be automatically defined for any posterior.

      So, let's discuss this definition tomorrow, because I have a paper to finish tonight.

      Thanks for taking the time.
      Judea

      • In general, there is an assumption in this line of thinking that probability arises in large part because of lack of knowledge, not because of some inherent random numbers that come out of "God's dice", as Einstein is supposed to have said (please let us leave quantum mechanics out of this, though; it would be a distraction). The basic idea is that there are some initial conditions and some laws that govern the evolution of the state of the world, and if we knew all these things, we'd be able to know exactly how many cancer cells there are and how much immune reaction, etc. But we don't know those things, so we look for approximate relationships which hold to within some approximate bounds, and we describe those bounds using probability theory, and we describe those relations using algebra/differential-equations/integro-differential-equations/whatever-seems-appropriate.

        So, in my slightly fleshed-out toy problem, there is potentially some real actual value "c", which is often called the "true value" or, if not the "true" value, at least in some sense the "best" value that it is possible to have. Only after large amounts of data can we discover it, and then, once we do, it is *that value* which is the actual "extent of mediation". So, yes, the c value in the abstract represents the effect of mediation, and the numerical c value near the peak of a sharply peaked Bayesian posterior, after observing large quantities of data, represents the *actual* extent of mediation out there in the world.

        This perhaps seems pedantic, but I think it is important to distinguish between the representation of the thing (the letter c in some equation) and the actual thing (the number which we are trying to discover which should be plugged into that equation in order to have that equation be something that describes approximately the real world and not some other world).

        Now, how can we discover the "c" value? We must do experiments. In our example, assigning X randomly to many patients is our first attempt. Is it sufficient? It depends on the details of our scientific knowledge as incorporated into the f function. Under some versions of the f assumptions, and with some outcomes in the world, the Bayesian posterior will be "identified". Roughly, with enough data, the posterior distribution over the vector a,b,c… (or the "u" vector in your notation) will be sharply peaked, and if we then take pretty much any value in the high-probability region, such as the mean value, it will be within epsilon of the right value.

        With some other assumptions going into f, or with some other outcomes from our experiments, we may fail to have identification. For example, we're not sure whether b=2 and c=1, or maybe b=1 and c=2, or maybe b=1.5 and c=1.5, etc. (And this lack of identification can be multi-dimensional, leaving us in general not with a 2-dimensional ambiguity between b and c, but potentially some higher-dimensional ambiguity.)

        In this situation, we either need to alter our assumptions (if we've made too many simplifying ones, we may need to flesh out our more detailed model), or we may need to run alternative experiments that can distinguish between the different cases, or make additional measurements in the same type of experiment. For example, we might need to give the immunosuppressor drug I hypothesized above. If we do this, we also need to include in our f function the information about the immunosuppressor drug, so that we can make a prediction about what happens when it's given. In *some* cases, the intervention required could be "harmless"; that is, it doesn't perturb the system enough to bother with much of a model for it. Our model could be "giving the immunosuppressor drug is equivalent to setting M=0 even if M=1", though in the actual case of immunosuppressors that seems unlikely to be a good enough model.

        Elsewhere on this page you discuss a "do" operator and a "freeze" operator, and you talk about do(Gender=Male), for example. So suppose this means we start with a resume from a woman, and we put it in our word processor, and we change the gender, and print it out and give it to the hiring manager. Perhaps this is a "harmless" intervention (it doesn't meaningfully alter the physics). But if instead we take a piece of paper, put a line through "Female" and write "Male" next to it, I think it's obvious that this alters the experiment in a meaningful way. The *physics* is different: instead of light bouncing off the paper and revealing the word "Male" to a hiring manager, the light bounces off the paper and reveals "Someone changed Female to Male".

        And if instead we actually give this woman a gender changing operation, this alters the physics even more!!

        So, for some “harmless” interventions, we can in our f function pretend that they simply alter the value of a given variable. In other cases, we actually need a model for the physical changes to the world that occur because of our intervention.

        I believe that it is this insight which is what Andrew is expressing in the first paragraph of the post where he says: ” I continue to think that the most useful way to think about mediation is in terms of a joint or multivariate outcome, and I continue to think that if we want to understand mediation, we need to think about potential interventions or “instruments” in different places in a system.”

        I hope this all is helping! Good luck with completing your paper, and I look forward to the next round of discussion!

        • Daniel,
          I am back to our problem of estimating the extent of mediation (let's call it XM) in the toy example
          of 3 binary variables, X, M, Y.

          In order to take full advantage of the insight that our example bestows upon us, I prefer not to
          get into philosophical discussions on whether XM exists or not, whether probability is physical or
          subjective, etc. These discussions are fun, but insight comes from listening
          to our thoughts while we solve a real problem like ours.

          There are two points that I found missing in our discussion thus far.
          1.
          We badly need a definition of XM. Assume that an oracle gives us the functions
          f(X,M,a,b,c…) and g(X,a,b,c…) for every vector a,b,c that we wish to imagine; we
          still need to define the target quantity XM(a,b,c…), and not wait till a posterior sharpens
          around some vector a*, b*, c*… and then ask ourselves: Now what?
          We know a*, b*, c*… Fine, but what about XM?
          This is question (3) in one of my earlier posts.

          Have you seen such a definition in your readings of the literature?
          If not, no problem: I will tell you what definition has become standard in the literature
          that I have read, and we will see if it seems plausible to you.
          If yes, just write it down, and I will examine whether it captures
          what we mean by "the extent to which M mediates between X and Y".

          2.
          There is a passage in your post that is quite common in the Bayesian literature, which is true in statistics
          but plainly false in causal analysis. It reads:

          “Roughly, with enough data, the posterior distribution over
          the vector of a b c or the “u” vector in your notation will be
          sharply peaked,…”

          The peaking of posteriors does not happen in causal analysis unless a, b, c is "identifiable",
          which is a very, very, very rare case. We can actually take the peaking of posteriors to be the
          Bayesian definition of identifiability, in case someone is not familiar with the notion of identifiability
          as treated in the causal-inference literature.

          So, I am eager to hear from you what people from the Cox/Jaynes culture thought/think about these two
          issues: 1. the definition of XM, 2. the eternal flatness of posteriors.

          Judea

        • Judea, I am enjoying this, so onward. I am a little confused about your notation XM, and I think it clashes with some earlier notation I used. By XM do you mean the observed variable X multiplied by the observed variable M? I don't think that is what you mean.

          So, let me ask you this. Suppose f(X,M,a,b,c,f0) is as I mentioned earlier:

          f = logistic_function(f0*(a + X + b*M + c*X*M))

          Then, when X=1, M=0, f has value logistic_function(f0*(1+a)); when X=1, M=1, f has value logistic_function(f0*(1+a+b+c)); and when X=0, M=1, f has value logistic_function(f0*(a+b)); etc.

          Now suppose that my biochemical/causal/scientific analysis tells me that not only does this functional form work for the values X=0 or 1 and M=0 or 1, but also, if I gave say half the drug, I could put X=1/2, and if I had twice as much M as the threshold for detection, I could put M=2, and so forth. In other words, it's the magic function that predicts f perfectly for both binary and continuous versions of X and M, measured in the appropriate dimensionless units.

          Now, there is also a part I've been leaving out, and that is the frequency with which giving X results in an over-threshold value of M. This is probably a useful thing to consider as well; I think it is your g function. It tells me how often, on average, a given dose of X produces an above-threshold M response. But it might go further and actually predict some probability of a given level of M, and together with some information about the lab measuring instrument I could then predict a frequency with which M will be detected…

          Now, since we speak different languages, I ask you: what is the concept you are after? It is evidently not just how strongly M affects f, that is, f(X,M=1) – f(X,M=0)? So tell me what concept you are after, because "the extent to which M mediates the effect of X" is not something I am normally trying to discover. Instead, I am usually trying to discover something like "How much more effective will this treatment be if it can induce an immune response above the threshold for defining M? or, usually even better, if it can induce an immune response equal to M ng/ml of antibody?" I would also ask questions like "How effective is X at inducing an immune response of at least strength M? (i.e., how often does that occur)" That is, given that I've figured out the functional form of f and g by the long hard work of creating a mathematical model, now I want to know the a,b,c values.

          Second, identifiability, the ability to get a sharply peaked posterior, is something I fully acknowledge we cannot always achieve. But I wouldn't say "very very very" rare; at least, not in the kinds of problems I work on. For example, I want to find out something about how the presence of silt affects the permeability of a given sand. So I run some experiments where I add different quantities of silt to this sand, and then I posit some unknown function, and I use my experimental data and I get a reasonably sharply peaked posterior over some parameter that defines the permeability of this sand as a function of the silt mass fraction. Or I want to find out how the presence of different concentrations of several different chemicals affects the speed of diffusion of a protein through a gel matrix, so I run some experiments, and if I do enough of them, and of the right sort, I get a sharply peaked posterior distribution. Or I want to figure out how the presence of several genes affects the speed with which a C. elegans worm detects a hypoxic environment. So I put various different mutants in the hypoxic environment and I get a distribution of times until the worms detect the environment and change their behavior… and yes, in this case there is a hypothesized mediating chemical, and the various mutations affect the chemical kinetics of this mediating signal, and so there is a mediation issue, but it's one that is informed by my modeling and by biochemistry discussions with my colleagues.

          So, why do we have such different perceptions of what the important questions and concerns are? I want to spend time thinking about physics, chemistry, biology, genetics, to hypothesize some possible explanations for how things work, and then use applied probability theory and some experiments to find out the unknown quantities I need.

          What is the thing you are after now that we have the f, g functions?

        • > I am a little confused about your notation XM

          “XM” is just a name:
          “the extent of mediation (let’s call it XM)”

          I think the question about “the extent to which M mediates the effect of X” is essentially whether you can quantify the effect of M on Y (the residual effect would be direct).

          You seem to be assuming that the causal chain is X=>M=>Y. The point is that M may be a side effect of X. Can you predict what the effect of changing M (using a different mechanism than X) will be?

        • After reading Judea's explanation I've realized my interpretation was an over-simplification, because the NIE is a counterfactual. And things are also a bit more complex in the non-linear case.

        • Daniel,
          Once we define the unit-based NIE(u) = f(0, g(1,u), u) - f(0, g(0,u), u),
          we can play around with its expectation, its probability, its median, or, as some people even
          like, the expectation of the ratio f(0, g(1,u), u) / f(0, g(0,u), u), why not?
          Whatever one is concerned about in practice.
          This does not deserve an ideological war.
          Judea

        • Daniel,
          Let's start with your last question:
          "What is the thing you are after now that we have the f, g
          functions?"
          And let us define that "thing" (XM) in the context of
          our concrete example: X = sex, M = qualification, Y = hiring;
          all are binary.

          We want to define a quantity XM which tells us
          what portion of the excess hiring of men over women
          is explained by disparity in qualification,
          as opposed to outright sex discrimination in hiring.

          Why are we interested in XM? Because if XM is low then there
          is no point correcting disparity in education.

          Just a pointer: if f and g were linear, the answer is trivial. For example, for
          f = cX + bM + u1, g = aX + u2,
          we obtain XM = ab.

          It tells us what would happen if we disabled the direct
          effect c. But when f and g are nonlinear (or unknown), we need
          a general definition that will hold universally
          (including, for example, your logistic functions).

          It seems that this generalization has not penetrated
          the Cox/Bayes literature yet, so I will lay it
          down for you, just in case you bump into it in future
          reading.

          The general definition is:
          XM = NIE = E[f(0, g(1,u), u) - f(0, g(0,u), u)]
          where the expectation is over u.
          It goes by the name "Natural Indirect Effect" (NIE) because it tells us the increase
          in E(Y) due to an increase in M from what it was under X=0 to what it would have been had X been 1.
          It sounds a bit convoluted but, if you try it out, you will find that it generalizes to any function, and
          that there is no alternative way of capturing the idea of an indirect effect.
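
          A small Monte Carlo sketch of this definition (functional forms and parameter values invented for illustration): in the linear case above it reproduces XM = ab, and the same two lines apply unchanged to a nonlinear f.

          import numpy as np

          rng = np.random.default_rng(0)
          N = 1_000_000
          u1, u2 = rng.normal(0, 1, N), rng.normal(0, 1, N)   # the unit u = (u1, u2)

          # Linear case: f = c*X + b*M + u1, g = a*X + u2.
          a, b, c = 0.5, 2.0, 1.0
          g = lambda x: a * x + u2
          f = lambda x, m: c * x + b * m + u1

          print(np.mean(f(0, g(1)) - f(0, g(0))))   # exactly a*b: the noise cancels

          # The same definition for a nonlinear (logistic) f with a thresholded M,
          # where no closed form is obvious:
          f_logit = lambda x, m: 1 / (1 + np.exp(-(-1.0 + x + 1.5 * m + 0.5 * x * m)))
          g_bin = lambda x: (a * x + u2 > 0).astype(float)
          print(np.mean(f_logit(0, g_bin(1)) - f_logit(0, g_bin(0))))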

          Now, to identifiability.
          All the examples you brought up deal with quantities
          that are identifiable from the experiment you are
          conducting, so we are guaranteed peaking of the posterior.
          Great. In general, however, if we do not have experiments,
          then only statistical quantities will be identified, and,
          to estimate causal quantities (like effects), we need causal
          assumptions. This, I believe, is why people in
          the causal-inference culture are so concerned with
          causal assumptions (e.g., DAGs) and identifiability:
          they want to guarantee peaking of the posterior.

          The quantity XM, however, is not identified even from
          experimental data; it requires level-3 assumptions.

          Has this issue of identification been of concern to the leaders of the Cox/Jaynes movement?
          I keep asking you these questions because you have been my only window into this opaque world.

          Judea

        • Good, I think we are still making progress. However, I'd like a clarification, and it goes back to our discussion about the a,b,c values post-data vs. pre-data.

          In my conception of the problem so far, f is a “deterministic” function, there are some real actual values out in the world of the a,b,c (or in your notation u as a vector). And, if I plug those in, it will tell me what f value to put into the binomial distribution in order to put accurate (according to my model) probabilities over all possible sequences of Y under the assumption that X was not 0 but was counterfactually 1 instead.

          So, when you take an expectation, you are asking about *what is in our head*, about *our model-based plausibilities* over which sequences of Y values we might get, given the uncertain nature of our knowledge about the a,b,c (or u) values in the f function. Is that what you mean? Because then the NIE is not an actual fact about the world but rather a statement about what we know, or assume to know, about the world, and it varies between you and me, because we have, at the least, different priors over the a,b,c.

          In the case where we’re looking post-data with large data sets and where we have identifiability, then your NIE and my NIE will agree to within epsilon.

          Also, when the data values are continuous (or modeled as continuous), what do you use for your NIE? Is it a derivative with respect to changes in that variable? Of course, this derivative will be different at different points in the high-dimensional space you have when there are several dimensions to the data, so it's a more complicated object.

          So, if you remember, you have asked for my informal “thinking”, and I conclude that conceptually at some level, your concern formally expressed by your NIE and my concern about “how much more effective would this treatment have been if it could have induced an immune response…” are similar in that they both ask for what “might happen” if something had changed.

          I think this concern about “what might have happened if something had changed” is very central to many of the discussions that go on here on the blog. As evidence I point you to this discussion about a recent statistical analysis of gun-control laws: http://statmodeling.stat.columbia.edu/2016/03/17/kalesan-fagan-and-galea-respond-to-criticism-of-their-paper-on-gun-laws-and/

          If you read through the comments there were lots of questions about the mechanism by which passing these laws was supposed to actually *cause* reductions in crime, and how ridiculous those assertions by the original authors were, because, quite frankly, they did not do any kind of causal thinking or modeling in this study. I believe you would call it “level 3 thinking” that was totally absent.

          As some background, both Cox and Jaynes were physicists. Their concern was not really to figure out how to think causally about the world, because, as physicists, they were already discussing physical mechanisms by which one kind of thing causes another, and physicists have been doing so since Newton or maybe before. Their concern was *what kind of mathematics can we use to find out the bits and pieces in our model that we don't know*. That is, inference once a model has been developed; specifically, inference for models that are mechanistic and deterministic but unknown or incomplete. So I suspect most people who come to this blog with a Cox/Jaynes attitude are already thinking in terms of causation when they build their models, and when they do algebra they think of this algebra causally, or they are at least AWARE of when they might be creating causal models and when they are NOT. And this is true whether those models are about hypoxic worms, or cancer treatments, or lake ecology, or the dispersion of chemicals in the ocean, or gun-control laws, or policies regarding the distribution of coal for heating in China, or switching from well to well based on arsenic levels in Bangladesh, or whatever.

          When it comes to Andrew and the observational social-science literature, I assure you some people are very interested in causation; see the gun-control debate linked above. People are not going to put up with someone fitting an observational model and saying "we could cut deaths by 90% by passing these laws" without plausible, real, causal counterfactual thinking behind it.

          As to identifiability, certainly it is of concern to practitioners on this blog. In general I would characterize the modeling that we discuss here as having three phases:

          1) Think about mechanisms of causation, and try to decide on a formula that expresses what you know about mechanism. Or, possibly a family of formulas to choose between, or the like. Inherently this is “level 3 thinking” because the assumption of some mechanism means that when inputs change we expect outputs to be “forced” to change. If this is not what we’re doing, at least we are aware of when we are, and when we are not, or we strive to be aware or strive to communicate the difference to the readers who come here to learn.

          2) Think about how accurate your mechanistic prediction is. It can only rarely be super-precise, so we must define some "bubble" of plausible values near our predicted ones that would be unsurprising to us; doing so gives us our conditional probability for the data, p(Data | Model, Parameters). Also think about what we know about the parameters in our model: they are unknown, but often we have some information about them, and writing that down gives us our prior P(Parameters).

          3) Now, we have a Bayesian model, consider what information we need to identify the parameters. In general thanks to the speed of computers we probably fit the model to our existing data first, and then see how well identified things are… and then try to find out how to improve that if needed. Look for sources of additional data, and consider how to incorporate that data into the model, consider experiments to run or additional measurements to take while running the same kind of experiments etc. We usually try to get identification of the parameters, BUT ALSO if we can’t identify the parameters we want to discover this fact and be realistic about it. Sometimes it is already informative (at a meta level) to say “the data is not informative”

          The Cox/Jaynes perspective on the logic of Bayesian probability is the one people seem to settle on if they start out with a mechanistic causal thinking and then need to do inference. And the reason is that this perspective was built up by people like Laplace, Jeffreys, Cox, and Jaynes who started out with causal thinking and asked “now what? how do I find out what values to plug in to my equations?” and then Cox basically came up with a set of requirements which allowed him to prove Cox’s theorem which says essentially “Use Bayesian probability and it will give you consistent logical results”.

        • Daniel
          1.
          First, to the clarification you need.
          f and g are deterministic functions, and a, b, c
          are stochastic, whose probabilities may be
          objective (e.g., the fraction of subjects possessing
          qualities a, b, c) or subjective (e.g., my belief
          as to how likely it is for an individual to have
          those qualities).

          This makes the natural indirect effect NIE
          objective or subjective, depending on the nature
          of the probability. However, the defining expression
          NIE = E[f(0, g(1,u), u) - f(0, g(0,u), u)]
          is universal; namely, if you possess a probability
          function over a, b, c…, be it objective or subjective,
          it behooves you to compute NIE as defined above whenever
          you talk about mediation: no hand waving here, no
          excuses, no double talk allowed.

          Moreover, if you and I possess the same prior on
          a,b,c…, then we share the same NIE even if we do not have
          identifiability. Causation exists whether or not we can estimate it from data.

          When the values are continuous, the same definition
          holds for NIE, except that, instead of focusing on
          the transition from X=0 to X=1, we consider two
          arbitrary values of X, say x and x', and we write:
          NIE(x,x') = E[f(x, g(x',u), u) - f(x, g(x,u), u)]
          If you wish x' to be an infinitesimal above x,
          we can do partial-derivative tricks.

          2.
          Level-3 thinking.
          I would not be surprised if you see parallels
          between your conception of mediation and the
          formula for NIE; that is why it is called the
          "natural" indirect effect: it is a natural concept.
          If this is the case, then you should also rejoice,
          knowing that mediation has been elevated from
          the level of conceptualization to the level of science, in the sense
          that we can now reason about it mathematically
          and answer some fairly hard questions.
          For example: what must one assume in
          the prior on (a, b, c) to guarantee that the
          posterior of NIE will peak sharply as the
          number of samples increases? Conversely, if you wait
          and wait and your posterior does not peak,
          what does that tell you about your prior? Most importantly,
          practical mediation problems can now be solved, of which
          hiring discrimination is but a toy example.
          Rejoice.

          3. Cox and Jaynes
          For background, I am also a physicist in training, and quite familiar with the writings of Cox and Jaynes on probability. I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak. Now that you tell me that “when they do algebra they think of this algebra causally” I am hopeful that they would be interested one day in embedding their thinking in a systematic science.

          4. Identifiability and your 3 phases of modeling.
          I am not sure I can subscribe to your three phases, primarily because I have so many questions on the details that it would take me all night to discuss. Instead, I will do something that I hate to do in blog posts — send you to a paper of mine and say: see “so and so”. So, here it is. My modeling phases are described in Figure 1 of this paper.
          http://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf

          Note that a key component in this scheme is Q: the query of interest (for example, the NIE formula). There is not much one can do without defining formally what one wants to estimate. And this is something I have found terribly lacking in most discussions on this blog. I hate to blame Cox and Jaynes for this trend, and I hate to blame anyone else. But why blame; let’s just hope it is temporary.

          Judea

        • Judea, thanks for your paper. Having read the initial 5 pages (I am planning to continue through to the end), I see even further agreement between what you discuss in terms of SEM and my own personal opinion on modeling. I of course can’t speak for Andrew, nor for all others here. But discussions here indicate to me that many of the frequent commenters have similar views and also that many have alternative views, so this is a blog where we hash out some ideas. Furthermore, in my discussions with people who are not professional mathematical modelers, but who are intellectually rigorous professional scientists (mostly biologists), I see an ability to design experiments that inherently implies an understanding of the important issues we’ve been discussing. I believe your SEM, my method of building mathematical models, and my scientific but non-mathematical colleagues (biologists, for example) all subscribe to similar views.

          1) There is a difference between y = a + b*x + err (I see this in the data) and y = a + b*x + err (and if I can intervene and set the values of x I will observe the relationship continues to hold)

          2) Some things confound each other in that they operate at the same time and have related effects on the outcomes, and it is then useful to design experiments where the different effects can be teased apart by changing the different variables independently, or setting the various variables simultaneously to various random values, or designing experiments where the effect of some particular “pathway” or “cause” has been neutralized, amplified, or otherwise altered through intervention. For example, my biologist colleagues will propose to add a drug that binds to a receptor to test whether, without that receptor active, a different biological pathway will continue to induce an observed effect…

          Based on what you say, here is where I think we have important differences in understanding that may hinder your communication with this blog’s readership.

          In your comment you say:

          “I did not realize though that their philosophy has bloomed into a methodological movement
          that is more than just a free license to attach a prior to whatever one feels uncertain about
          and wait for the posteriors to peak.”

          And I think this is an important and HUGE misunderstanding on your part, because insights into what probability means to Cox or Jaynes (or me or several others on this blog such as Corey, or “Laplace”) help to distinguish between different kinds of “statistical” thinking. One kind (Cox/Jaynes Bayes) is strongly associated with people who start with what you call “Structural Equations” with causal interpretations, and who then need to find out the numerical value of certain quantities in that equation. The other kind of statistical thinking is associated with people who take the world to be “as if” a random number generator generates outcomes, and this thinking is associated with “testing” whether a distribution fits the data. There are other groups of people, many of whom simply sort of continue along doing what they were taught without having strong methodological beliefs. For the moment, let’s just contrast the two strong methodologies.

          Now, what I’d like to suggest is that Cox/Jaynes do not simply give “a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak” but rather, a free license to assign numbers that describe a degree of plausibility to *any numerical quantity*. In particular, that would include OUTCOMES.

          Now, this differs from the alternative conception where distributions can *only* be assigned to the outcomes, and where inherently there is a replacement of causality with “as if from a random number generator”.

          Why is this important? You in the past have said you prefer to avoid this discussion. I think you should not if you want to understand this blog, and I think you should welcome this point because I think it is strongly parallel to the things you say in your paper linked above about SEM.

          If I tell you that when I measure something, say people’s pulse, it will be a random variable with a certain gamma distribution and this is the actual long term frequency of those different measurements… you and I should immediately ask “what is the cause of this thing having such a precisely defined long term frequency? Certainly that could not occur unless some physical laws were in place to enforce it!”

          Now, if I tell you that “if you go and measure 500 people’s pulses, the most information I have about them is that each will individually be somewhere in the high probability region of a given gamma distribution, and where the gamma has higher probability density is where I think things are more plausibly going to be found” then you and I need not ask for some physical causes that make that gamma histogram happen in the real world. In fact, if we measure 500 pulses and they are all say 55-65 beats per minute, and the gamma distribution I specified has high probability region between 40 and 150, clearly our data do not have a gamma histogram (there were no measurements between 40 and 55 or between 65 and 150!), yet the assertion “they will all be within the high probability region of the gamma distribution” DID hold. By this criterion, the Cox/Jaynes Bayesian says “see, my predictions held!” and the Frequentist says “see, your predictions were utterly false p < 0.0000000000033 !”
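
          To put numbers on this pulse example (the gamma parameters and the 55-65 band are invented to roughly match the story): every one of the 500 measurements lands inside the assigned high-plausibility region, yet a goodness-of-fit test against that same gamma rejects it overwhelmingly.

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(1)
            pulses = rng.uniform(55, 65, 500)      # every pulse is 55-65 bpm

            gamma = stats.gamma(a=13, scale=7)     # bulk roughly 40-170 bpm
            lo, hi = gamma.ppf([0.005, 0.995])     # high-probability region
            print(np.all((pulses > lo) & (pulses < hi)))   # True: "predictions held"

            # The frequency reading: is the data histogram gamma-shaped? Not at all.
            print(stats.kstest(pulses, gamma.cdf).pvalue)  # astronomically small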

          It’s that disconnect between “the histogram of the actual data” and “the information I have about what the measurements are likely to be” which MAKES IT POSSIBLE to do causal modeling with probability to represent uncertainty. For, if I insist on the matching of my causal model to the histogram of the data… then I must search for causal models that produce that histogram! But, if instead I *start* with a causal model, and I acknowledge its lack of precision, and I measure the imprecision allowable according to relatively more or less plausible alternatives in the region of the causal prediction… then I can be very happy with a causal model that fails to meet the histogram of outcomes, provided that it DOES put the predictions in the high probability region of the assigned distribution.

          It is *this*, rather than the “license to put a prior”, which distinguishes the two approaches: the Cox/Jaynes version involving “plausibility” is perfectly compatible with a causal interpretation of the equations regardless of what the equations predict for long term frequencies, whereas the “Frequentist” interpretation of outcomes “as if from a random number generator, filling up the predicted histogram” is ONLY compatible with the subset of Structural Equations which have causal mechanisms that ENFORCE that frequency histogram.

          And so, you will find a group of people who work with algebraic equations that describe CAUSES such as “the water gets warmer and then the plant life changes, and also the metabolism of fish changes, and these things cause some fish to die and other fish to grow more quickly” and they will inevitably NEED to wind up using probability in the manner that Cox’s axioms describe, and that will include *assigning* distributions over outcomes which will NOT be the observed histogram of anything, but instead whose shape itself describes a kind of plausibility for a causal outcome in a certain region given the imprecision of the model.

          So, on page 4 of your paper when you say “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x; rather, it conveys causal information that is orthogonal to the statistical properties of x and y”

          I think the root of confusion about why people at this blog can talk about “the conditional distribution of y given x” and “causality” at the same time is that *the distribution they are talking about is NOT the observed histogram of Y when X is held constant* but rather *the precision of the information that the causal model for Y can give us after we’ve seen the actual value of X*

          I hope you can tell me that you see this distinction, and that you agree that the distinction is useful and that it helps to separate *the observed statistical properties of many measurements* from *the plausible values for an outcome that our causal model would predict given that it can’t predict everything with perfect precision*.

          And then, I hope you and I can come to an agreement that, with this conception, it is possible to talk about an SEM providing us with “a conditional distribution (of plausibilities) for y given x” and also “x causes y”, with the two no longer orthogonal, and that there is less confusion between you and those of us on this blog who have this Cox/Jaynes view!

        • Also, Judea, I question whether it really behooves me to compute NIE in the pre-data case. For example, a researcher works many years to develop a causal structural model for some process… Then, the researcher hires a computer scientist who knows how to fit models using software, perhaps a naive young graduate student RA who does not study the process in question.

          Then, the RA says “gee, I need to put some priors on these parameters my boss gave me… but I know virtually nothing about this process… for now I’ll put these enormously wide priors… maybe priors with no mean value, such as Cauchy priors…”

          Now the boss comes and says “we’re going to write a grant to get funds to do some surveys to find out about how to make things better in the world using our causal model, for the grant writing process we’ll need this quantity NIE… here’s how it works, please compute it for me” and the RA goes out and using the enormously wide priors, computes the NIE. Perhaps even though some of the underlying parameters HAVE NO MEAN VALUE the NIE does actually have a mean value, and the RA reports it… it’s say 75, and everyone rejoices because this is a big number in this field, and so we can make small changes to X and get big improvements in the world according to this model!

          Now, the grant is approved, and we get our surveys and our data, and the posterior values concentrate, and we re-compute the NIE and we discover NIE is close to 0.

          How can this be?

          If instead of NIE where we take an expectation, we instead had looked at the distribution of the quantity over which you take your expectation, we might have seen that, given the uselessly broad priors, the effect could have been anything from 0 to 1000 with a mean of 75, but that this was due to our lack of knowledge; as soon as some data became available, some parameters concentrated sharply and the post-data NIE was near 0.

          In general, taking the expectation of a highly uncertain quantity could leave you with a false sense of what might happen. Only after we have some level of concentration does the width of the distribution around the expected value not matter so much. The quantity f(0,g(1),u)-f(0,g(0),u) like any other quantity in a Bayesian model, has a distribution, and it seems more useful to consider the entire distribution, with the expectation being something of interest only when the distribution isn’t too wide.

          So, I certainly think the quantity seems like a useful idea, but I am less convinced of it having a well defined useful single value through the expectation operator.
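
          Here is a sketch of that worry with everything invented: the “NIE” is the product of two mechanism coefficients, the wide priors are normals rather than Cauchys (so the means exist), and the concentrated posteriors are simply stipulated rather than fit to data:

            import numpy as np

            rng = np.random.default_rng(2)

            # Pre-data: enormously wide priors on the two coefficients.
            a = rng.normal(0, 100, 50_000)   # X -> M strength
            b = rng.normal(0, 100, 50_000)   # M -> Y strength
            nie_prior = a * b
            print(np.mean(nie_prior), np.percentile(nie_prior, [2.5, 97.5]))
            # the mean is a single number, but the spread is tens of thousands wide

            # Post-data: suppose the data pin a near 2 and b near 0.
            a_post = rng.normal(2.0, 0.1, 50_000)
            b_post = rng.normal(0.0, 0.05, 50_000)
            nie_post = a_post * b_post
            print(np.mean(nie_post), np.percentile(nie_post, [2.5, 97.5]))
            # the whole distribution now sits near 0, so the mean means something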

  8. Somewhat tangential, but I don’t think that this statement is as universally true (or has implications as broad) as you seem to think:

    > Another way of putting it is that most effects are not large: they can’t be, there’s just not room in the world for zillions of large and consistent effects, it just wouldn’t be mathematically possible.

    The effects of, say, many medical treatments are very large. But there’s not a lot of naturally-observed variation in the treatment, so they have very little observable effect in most circumstances. So there can absolutely be zillions of large and consistent effects, as long as the treatment assignment has almost no variation.

    In social science settings, where there’s usually some degree of self-interested planning and decision making, we should even expect to see low variation in the treatment assignment. There are absolutely “zillions” of actions that either of our current presidential candidates could take that would have a large and immediate *negative* effect on their chances of being elected. There aren’t the same *positive* actions, but that’s almost entirely because the candidates have already taken those actions. Not because the effect of the action is necessarily small.

  9. Some general points summarized from the above discussion with Carlos, Judea, CK, etc.

    Judea specified as a toy problem one in which we have three binary variables, X, Y, M. I then mentioned that in my conception of applied Bayesian modeling, the model depends on what Jaynes called “background or prior information” *about the types of models*, not just prior information about parameter values. So, I invent some background about cancer, drugs, and immune responses, and then I show how I would create a Bayesian model. In particular, I would seek out some function f which would help me put probabilities over Y outcomes, and this function f may depend on X, M, and some other unknowns.

    Next, I describe how in general I don’t want to model just the frequency of the 8 possible outcomes. I prefer to model an underlying most-likely continuous process, and model a measurement process that induces the binary outcomes from the continuous ones, and use the 8 possible binary outcomes and their frequencies to help me discover something about the underlying continuous process. The Bayesian probabilities become distributions over the unknown internal parameters of the underlying model. I mention that the 8 possible outcomes may not allow me to identify all the parameters in my underlying process, and then I may need additional measurements, or additional experiments.

    Finally, I mention that if we bring other interventions into the mix in order to get more information (additional experiments), sometimes these interventions may be modeled as simply altering the value of a given variable, leaving all else “frozen” but other times they actually themselves have an effect on the underlying process that needs to be modeled directly.

    Also, I want to mention how it is possible that we can have an enormous dataset, and still not identify the underlying parameters. I will give a simple physical example here:

    A cylindrical bucket 1m tall is filled with water. At time t=0 I open a hole in the bottom of the bucket. One month later I come back and reveal whether the bucket is more or less than half full.

    In my model of the underlying process, I have a rate of flow r(h,d) which is a function of the diameter of the hole and the height of the water. Now, suppose after repeating this experiment hundreds of times, after 1 month I ALWAYS say that the bucket is LESS THAN HALF FULL. If all I care about is the probability of saying “LESS THAN HALF FULL” then I can estimate it easily as very close to 1. However, if I want to know the diameter of the hole, all I can say is that it is large enough that after solving the differential equation out to t = 1 month the value of the height of the water will be less than h=1/2. This will occur for ANY value of d greater than some size d0, and so I learn nothing about d other than that it is greater than d0.

    A person who cares only about the probability of “bucket less than half full” has his answer p(h < 1/2) ~ 1, whereas a person interested in the diameter of the hole has p(d > d0) = constant over a potentially large range. In fact, perhaps even a 0.25mm hole will suffice, so already just in saying “there is a hole” I am typically thinking of it being larger than this value, so I have learned practically nothing. One person’s completely specified exact problem is another person’s completely uninformative useless experiment.

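
    Numerically, the bucket story can be told under Torricelli outflow; the bucket diameter, the ideal-outflow assumption, and the numbers below are all invented, chosen only to exhibit the flat likelihood:

      import numpy as np

      T = 30 * 24 * 3600.0          # one month, in seconds
      h0, D, g = 1.0, 0.30, 9.81    # 1 m of water, 30 cm bucket (assumed)

      def height_at_T(d):
          # Torricelli: dh/dt = -(d/D)**2 * sqrt(2*g*h), so sqrt(h) decays linearly
          s = np.sqrt(h0) - (d / D) ** 2 * np.sqrt(g / 2) * T
          return max(s, 0.0) ** 2

      for d_mm in [0.05, 0.1, 0.25, 1.0, 5.0]:
          print(d_mm, "less than half full:", height_at_T(d_mm / 1000) < 0.5)

      # Every diameter above roughly d0 ~ 0.07 mm prints True: the observation
      # has probability ~1 for ALL such d, so hundreds of repetitions leave the
      # posterior for d flat above d0.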

  10. This comments thread is giving stan-users a run for its money in being the best journal on applied statistics. Fantastic reading – thanks guys.

  11. Daniel, I don’t know if you are aware of Pearl’s paper “Bayesianism and Causality, or Why I am only a half-Bayesian.” He has been a Bayesian for 45 years, quite a bit longer than most of us.

    I don’t think there is a fundamental disagreement, given that you say that the causal reasoning happens outside of the Bayesian framework. This is of course consistent with Jaynes, who insists on the fact that inference is concerned with logical connections, which may or may not correspond to causal physical influences.

    • No, in general I’m not aware of Pearl’s papers, I’m sure I probably should be. But statements like “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x; rather, it conveys causal information that is orthogonal to the statistical properties of x and y” indicate at least imprecision in the conception of a conditional distribution, as does “I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak.”

      And people who call themselves Bayesian are, as is seen on this blog, not all necessarily in agreement about these topics either. So it seems possible that the explicit discussion will provoke some understanding, and if not between me and Pearl then perhaps for others such as James Savage or yourself or whoever else.

      It seems very obvious to me that if a structural equation says Y = a + b*x + Err(c) and it is interpreted causally then one way to turn this into words is:

      “if I knew the right values of a, b, c, and you set the value of X to X1, then when you measure Y, you will find that Y – a – b*X1 will be, in the world, a value within the high probability region of the distribution I assumed for Err(c)”… that is, the distribution for Err is itself a part of the causal assumptions, a part of the structural equations, a description of how precise those equations are. It is fundamentally the case that the SEM and the observed X alter our state of knowledge about where the Y will be, because we believe that the structural equation is causal even if not super-precise.

      whereas another way to turn this into words is:

      “if you observe an enormous number of samples, then Y – a – b*x will in this observed set of samples have the frequency distribution defined by Err(c)” which is a fundamentally different statement about the world, does not apply at the individual level, does not require any causality in a particular instance, and is the “orthogonality” that I believe Pearl refers to in the quote above.

      Both of these have been called “probability”, but the first one is ONLY admissible in a Cox/Jaynes sense.

        • Daniel,
          I wrote:
          1. “We will see that the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x; rather, it conveys causal information that is orthogonal to the statistical properties of x and y.”
          I stand behind it. The structural equation Y = a + bX + err does not constrain the conditional distribution of y given x. Any values of a and b are consistent with any conditional distribution P(y|x) [at least in the linear case]. I have also defined very carefully what I mean by statistical properties: “properties definable in terms of the joint distribution of OBSERVED VARIABLES.” If you want to broaden the notion of probabilities and statistical properties, you have my blessing, but we will then need to distinguish between P in the statistical sense and P* in the broadened sense. I am not convinced of the wisdom of this distinction. And this brings me to your second complaint:

          2. I also wrote:
          “I did not realize though that their philosophy has bloomed into a methodological movement that is more than just a free license to attach a prior to whatever one feels uncertain about and wait for the posteriors to peak.”
          I stand behind it too. Note the word MORE. I confess to having believed that the Cox/Jaynes philosophy remained in the state I saw it in the 1980s, namely a free license to assign priors to everything, fit to data, and wait for the posteriors to peak. But our conversations gave me hope that it bloomed into something different, e.g., a more disciplined science that is more productive than this general license. I was hoping to learn from you what else it permits and forbids besides that license. Why from you? Because you are the first person from that camp who showed interest in discussing a concrete 3-variable example, as opposed to metaphysical discussions that hide assumptions and hide research questions etc. under the cover of “practical messy problems”.

          I am still hopeful to learn what the Cox/Jaynes philosophy has developed into, but from your interpretation of the structural equation example I infer (perhaps hastily) that EVERYTHING is probability: causality, counterfactuals, logic, metaphysics… just everything. [For example, take the statement “were it not for the aspirin I would still have a headache”, which I classified as counterfactual, while Cox/Jaynes (according to my understanding) would argue: No way! it is probabilistic, because all I am saying is that I am attributing high probability to it. Same goes for “this drug reduces your chances of cancer” — the moment you believe in it, Boom-Boom! it turns probabilistic in the Cox/Jaynes sense.] I believe in the virtue of distinctions, not in the blurring of distinctions. So I am inclined to give up on the Cox/Jaynes ideology/methodology/license, unless anyone can show me (using a concrete 3-variable example) that it is not just a license to “assign and fit.”
          Judea

        • Success!!! Seriously, now I think we are about to uncover a gulf between you and some of us on this blog which consists in a solvable misunderstanding, one this Cox/Jaynes camp is aware of (we are aware that there are misunderstandings, that is). Unfortunately, while we can come up with some conventions here that will make it plain how to annotate our thoughts unambiguously, the term “probability” and the use of the letter “p” have historical baggage, and there is a fight between camps related to the use of this word; I can’t snap my fingers and make it go away any more than I can snap my fingers and make hatred for people with different religious beliefs go away. Still, I am willing, in this context, to do the following.

          Suppose I take a large sample of x and y out in the world and I see that whenever x has value 1, y has various values. Let’s call the frequency (or frequency density) with which we see a value y in this large sample by the annotation

          Fr(y | x=1)

          Then, I agree with you wholeheartedly that this observed fact could be consistent with any sort of causal connection between y and x. It is orthogonal to causality! Do we agree here? I hope so. We could imagine this as “amount of rainfall” and “amount by which barometer dial changes”, and we can both agree that if I go in and push on the barometer dial, I can not make the rain fall. Or, it could be “amount of rainfall” and “change in temperature relative to the dew point”, and then if you can somehow force the temperature in a whole region to drop below the dew point, then rain would fall depending on how much water was in the air etc. Both are consistent with the same Fr(y | x=1); the first is non-causal, and the second is fully causal, so causality must be orthogonal to Fr(y|x=1).

          Now, let’s you and I sit down with our “thinking caps” and discuss the physics of what goes on in the world related to y and x. Somehow we will both realize that x does cause y in some way, and we come up with some explanation about how that happens, and I will write down my thinking using algebra, and you will teach me how to express it in terms of graphs and do calculus. It will be a good time had by all.

          Unfortunately, both you and I will agree in this case that even if we can set x=1 we cannot predict y to 300 decimal places. The world is not so exact; it isn’t like C = 2 * pi * r exactly, where the value of pi = 3.1415926535… and every single time we get the exactly correct answer. Still, we do agree that setting x=1 will produce values of y near to y1, and we also agree that it doesn’t seem plausible to us, if our model is correct, that y could be farther away from y1 than about s units, and that values nearer to y1 seem more reasonable to us than other values. Now we ask ourselves “how do we assign some number that helps us say that our equation is imprecise, but that it predicts values near y1, and more and more denies values as they depart from y1, until by the time they get out to y1 +- 2 to 3 s, they seem totally implausible based on our causal analysis of the physical connection between x and y?”

          Let’s call this function that assigns numbers for how plausible things are according to our model of causality “Pl” for “plausibility”. Now we have some logical requirements for any system of plausibility… and when we look at them it turns out they are basically equivalent to the requirements that Jaynes lays out carefully in “Probability Theory the Logic of Science” and also elsewhere, for degrees of plausibility, and then we read Jaynes’ version of Cox’s theorem, and we decide that the mathematics of plausibility is the same mathematics as probability, there is no generalization of probability, the Cox axioms say Plausibility values behave *exactly* as probability…. but the *meaning* is different from our previous function “Fr”

          So now we have Pl(y | x=1) is *a part of our structural equation* that specifies how “tightly” our equation should be expected to predict y due to these causal considerations.

          If you admit the idea that we can do such a thing, that is, include in our structural equation / causal analysis a measure of imprecision of our outcome, a measure of “for all we know this region is as good as our physical analysis gets us” and you admit to the need for some logical structure to calculations regarding this “Pl” function, then I believe you will wind up as a Cox/Jaynes Bayesian as well because Cox’s theorem says you have to be if you agree with the axioms, and the axioms are pretty mild.

          So, now, here I am in this position: I’ve read Cox’s theorem, I’ve drunk the Kool Aid, and I know some physics and biology and so forth, and then I run out and start modeling y and x. And the first thing I do is I sit down with my physics and I decide… f(x,a,b,c) is a great model for y based on physics, if only I knew the values of a,b,c, but… it doesn’t predict y exactly, it only predicts that y will be “very close” on the scale of some size “s”.

          So I will wind up writing down the following *structural equation*.

          y = f(x,a,b,c) + err

          But I have more information than this, so by the way as part of my causal analysis, the err will be a number in the high-plausibility region of a function which I will assign to describe my degree of plausibility.

          Pl(err) = normal(0,s), which says that the error, which is exactly equal to y – f(x,a,b,c), is a number close to zero on the scale of s. NOTE it DOES NOT say that if I repeatedly “do” x=1 over and over I will get y values distributed according to Fr(y | x=1,a,b,c) = Pl(y | x=1,a,b,c)… that is, Frequency and Plausibility are orthogonal, just like frequency and causality… and they must be, because the Pl function is really a part of my causal equation modeling…

          So, now a Bayesian on this blog says

          p(y | x=1) = normal(f(x,a,b,c), s)

          Because they use p to mean *ambiguously* “plausibility” (a non-observable thing, unlike Fr the frequency). And then they read your statement “the structural interpretation of this equation has in fact nothing to do with the conditional distribution of y given x” and they think THIS IS ABSURD, because when they read “conditional distribution” they interpret “Pl” and when you write “conditional distribution” you intend “Fr”

          So, your objection is then “everything is a probability” and I want to modify this “everything that is not known with absolute certainty, the way the value of pi is known, can have some varying degree of Plausibility associated to it as a tool of a kind of generalized logic”

          So, is that so bad? Note, we’re not talking about “this drug reduces the frequency with which people get cancer” we’re talking about “taking this drug makes it more plausible that YOU will not have cancer any more”

          I believe in the virtue of distinctions too, so I like the distinction between Fr and Pl, but as I say, I can’t wave a wand and force everyone to use unambiguous notations, and since Cox’s axioms say that Pl behaves exactly as “probability”, it seems unlikely that we will avoid notational ambiguity. However, can we now agree that you and I do not have *conceptual* ambiguities between Pl and Fr? And if you would like some further conceptual discussion of how Pl and Fr differ, or where one or the other comes into play in practice for people around this camp, please ask away!

        • Daniel,
          I can’t accept your distinction between Fr and Pl as described, for several reasons.

          1. First, I do not restrict my P to frequencies, because I believe in the usefulness of judgmental knowledge and I would like to give it a symbol. So I call it P(E), where E is any event that can be defined in terms of observable variables, regardless of whether those variables are actually measured or not. So, P(E) is my personal belief that event E is true. If we are lucky, we sometimes get frequency to support our P(E), but I don’t want to switch to Pl just because my equipment is not good enough to measure all the variables supporting E. Everything is subjective under P, so we do not need Pl to remind us of that. Another way of saying it is: go ahead, put your favorite Pl in front of everything I write, and we achieve peace and prosperity. For example, if I write P(y|do(x)) = p1, you can add Pl{P(y|do(x)) = p1} = 0.9999. Peace and prosperity: you are happy with the Pl, and I just ignore it and focus on what’s behind it, namely P(y|do(x)) = p1.

          2. Now, where does P(y|do(x)) = p1 come from? Two ways. Either I am conducting an RCT and observe that in individuals subjected to X=x the frequency of Y=y happened to be p1. (You might prefer to call it Fr(y|do(x)) = p1, but I jump to P, because my P is enslaved to Fr when the sample is large.) Or, I am conducting observational studies on x, y to get P(x,y) (or Fr(x,y)), and I have a theoretical model, written as a structural equation Y = f(x,a,b), and I use some features of f(x,a,b) to extract P(y|do(x)) from Fr(x,y). (Examples of “features” are “y does not affect x” or “b affects only y, not x”, etc.) I don’t need Pl in this exercise.

          3. I may need Pl when I am not sure about f, and I have two competing models f1 and f2, and I look around with bewilderment: what shall I do? I open the dictionary of advanced Bayesian analysis, and it tells me: Don’t panic! When you have two competing models, assign priors, fit, wait for the posterior to peak, and call the posterior Pl(y|do(x)).

          Here I know I am in trouble. As long as I decorate some formulas with Pl to pacify my Bayesian colleagues it does no harm. But when Pl conflicts with P, pause, man, check the sources of the conflict, and proceed with caution (perhaps using transportability theory). But don’t even look at the posteriors because, even if they peak, they may peak into nonsense.

          Conclusion: If you want me to decorate all my formulas with Pl — fine with me. But I would still like to analyze what is in the square brackets behind the Pl{****}; this is what counts. And behind those brackets I find the Causal Hierarchy distinction that is so useful to anyone who tries to work out a concrete example under the microscope.
          Judea

        • Judea, as I said, Cox/Jaynes Bayesianism is not an *extension* of probability, since Cox’s theorem says that Pl is isomorphic to mathematical probability theory… so this is one reason the Bayesians want to continue using P just as you want to… And in a sense, this is absolutely correct.

          But, of course information about what you’re doing is important in assigning the probabilities. So, typically, when trying to be very unambiguous, those of us in this camp want to add a further symbol, a symbol that stands for the state of knowledge used to assign the P.

          P( y | x, a,b,c, K) = normal(f(x,a,b,c),s); where the K is some stand-in symbol for all the stuff we assumed to allow us to get f and to choose the normal distribution and to put priors on s and soforth.

          and when you say “I jump to P, because my P is enslaved to Fr when the sample is large”

          then the K must be KLargeSamp = “I have a large sample and I accept that it is fully representative of the future and so I enslave P to Fr”

          But, of course, in doing this, you have “used up” the information in the sample. So, you have NO data with which to infer anything else! Put another way, your probability is conditional on your data already, so if you want to do better and peak your posterior distributions of parameters etc, you must collect MORE data. And then when you do, what if Fr2 is not the same as Fr1? Whoops, go back and re-calibrate P to Fr2… but then again, you have no data to peak your posteriors of the parameters… so collect more data… but whoops, around in a circle. Eventually your Fr may stabilize, and now you can use it in your next data set… Why are my posteriors always so FLAT you ask? It’s because you’ve used up your data in enslaving P to Fr! And that should not be surprising, because if you want to learn Fr in all its details, it’s an infinite dimensional thing! It would take a lot of data!

          Alternatively you might wish to go back to right before you collected your data… when you had only the causal model… When this is the case, you CAN NOT use the state of knowledge KLargeSamp, you can only use what the causal model gives you. Perhaps it tells you “things can not be too far from the prediction” and then you are in the case I typically am in, where you assign say a normal distribution because you believe it represents your knowledge K about how good is the causal model.

          Enslaving the P to the Fr is altering your causal model and using up your data; do so at your own risk. I believe your “transportability” analysis is a method by which you undo this. The method by which I would avoid this problem is to try to be faithful to the full set of background knowledge that I have at the time I *build* the model and then put the data in and see what I get. I don’t insist on Pl = Fr before I can get started. Note, I have a specific blog post on this topic that shows in an example how I do not need to do this enslavement, and for your refreshment, it’s about Orange Juice http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

          For an example of how this works, my full set of background knowledge might be “I am taking a sample of college freshmen”, in which case if I’m being fully honest, I must admit that this will not “transport” to the whole population, so if I want to extrapolate to the whole population I must put into my model some information that I may have about how extreme the failures to transport might be. So my Pl must NOT be my Fr_college_freshmen(), no matter how large my college freshman sample is; it must be my knowledge about how well my causal model works across all populations, which will typically be much less specific, much broader, than Fr_college_freshman!

          Put another way, Pl is not a fact about the world, Pl is always a fact *about my model*, which is why it is decorated with the K to remind me what information went into building the model. And so, by constantly asserting the need for some K, we can go back to using P instead of Pl and Fr and then we must remember that if we just put p(x) we really mean P(x | K) but are being lazy.

          Now, as to the “do” operation. This distinction between p(y | do(x=1), a, b, c, K) = normal(f(1,a,b,c),s) and p(y | x=1,a,b,c,K) = normal(f(1,a,b,c),s) is already built-in to K if we’re being honest, but we often are not explicit enough, so I don’t think the “do” will cause any harm, and it can be helpful, but the bigger picture in my mind is to be more explicit about the knowledge we assume, all the baggage inside the K

          and typically some of the baggage in the K would look like this:

          “I have carried out a careful first-principles analysis of some physics/chemistry/biology and choose a function f based on this scientific analysis, and I believe there are causes in the world that enforce f to be true whether I observe x or whether I go in and set x to some value… and my analysis informs me that the accuracy with which f predicts y is such and such, and that there is a process of selection in doing my survey that makes it more likely to get x values near x1 than I would if my process of selection were more uniform across all the possibilities in the world, and the doctors doing the blood draws were not blinded to who they were drawing blood from, and my instruments were the ones available at hospital H1 but there is a whole different manufacturer whose instruments may be used at other hospitals………..”

          What a lot of stuff to cram into such a small symbol K!!!

          So, when the Bayesians on this blog go about being lazy with their notation, you must remember that those who know what’s going on admit this K under the hood, and so they have in their mind whether they are analyzing a causal connection or a non-causal connection but they are lazy in their notation. Still, they will be very confused by “P is orthogonal to causality” because they’re assuming:

          p( stuff | K = “there is a massive causal model that says what my P should be”)

          so since the causal model is telling them what the P should be, it seems so obvious that you are wrong in saying “P is orthogonal to Causality” but I believe this is a misunderstanding caused by 2 ideas:

          1) Many or even MOST people enslave the P to the Fr *mistakenly*; there is no logical requirement to do so, and it is important to understand the implications of doing it!

          2) Bayesians who don’t enslave the P to the Fr are not explicit enough in their notation to write out even the tiny symbol “K” much less all the stuff that the K stands for.

          And so Judea Pearl comes to this blog and sees people writing p(y | x) and believes he knows what they mean, and he says “these people are fully confused!”

          and these modelers at this blog see “p(y | x) is orthogonal to causality” and so they say “Pearl is eating too many bananas!”

          And so, if we are honest, we will say:

          p(y | x, a, b, c, K(first_principles_causal_analysis_only_without_any_data_to_enslave_P_to_Fr_see_appendix_A) ) = normal(f(x,a,b,c),s)

          and our formulas will take up several lines, but now I think both you and I can see that with the K in place, the question of whether p is orthogonal to causality is a question of whether K is in fact a causal analysis with a first-principles analysis of the precision involved, or is a pure associational analysis into which we plan to stick a big dataset.

        • Daniel:

          Nice point about the K.

          If you write the prior and data generating models out chronologically, for instance in ABC format, you write out very different generation schemes when, for instance, there is random assignment to groups versus selection into groups.

          This is all lost (but should not be forgotten) in the standard Bayesian formulation
          P(theta | data) = ( P(theta) * P(data|theta) ) / P(data)
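
          To illustrate (the outcome, the selection mechanism, and the tolerance below are all invented), here is a crude ABC rejection sampler in which the generation scheme is spelled out chronologically; changing the single line that decides who gets treated changes the inference, even though the final Bayes formula looks identical:

            import numpy as np

            rng = np.random.default_rng(5)
            y_obs = 1.4                    # observed group-mean difference (made up)

            kept = []
            for _ in range(50_000):
                theta = rng.normal(0, 2)   # prior on the treatment effect
                # chronological generation: selection into treatment depends on a
                # covariate z that also nudges the outcome (no random assignment)
                z = rng.normal(0, 1, 100)
                treated = rng.random(100) < 1 / (1 + np.exp(-z))
                if 0 < treated.sum() < 100:
                    y = theta * treated + 0.5 * z + rng.normal(0, 1, 100)
                    diff = y[treated].mean() - y[~treated].mean()
                    if abs(diff - y_obs) < 0.1:   # crude ABC acceptance
                        kept.append(theta)

            print(np.mean(kept), np.std(kept))
            # Under random assignment the "treated" line becomes
            # rng.random(100) < 0.5 and the accepted thetas shift: the generation
            # scheme, not the final formula, is where the difference lives.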

        • Keith, I think this is one reason people always ask for a “real world” example, because in a real world example, background information must be used, and then the P will change with this information, and in a worked example, you’ll see the formulas or Stan code or whatever are different in different cases. Where does this difference come from? If you believe p(Y | x) is a fact out there in the world equal to the Fr(Y|X) then there should be no changing of the P from one model to another… but if you acknowledge that the P is conditional only on the background knowledge/causal model you have BEFORE the data, then of course it can change from one place to another, and it can do so based on the causal analysis, and so the P of a Cox/Jaynes Bayesian, conditioned on knowledge K, can NOT be orthogonal to the knowledge K which includes the causality model.

        • Also note, if your goal is to learn the Fr out there in the big data set, then you can put plausibility Pl over a variety of possible Fr functions, plug in your data to the machinery, and turn out posterior distributions over the Fr. This is sometimes done using, say, Gaussian mixture models, or fancy models like Dirichlet processes, where you’re sort of putting a probability over possible histograms, and in the end a “peaked” posterior means all the histograms you plot from the posterior distribution look very close to the same.

          In this case, there is no way to even discuss the idea of the “real frequency” of histograms. Suppose there are 10 million people out there who are of interest. There is the real frequency we’d get if we did a sample of all 10 million subjects, there is the data we have from a large randomly selected data set of only 1 million subjects, and then there is what we know about the histogram in the 10-million-subject case given the background info (including the choice of, say, Dirichlet processes with certain priors) and the 1-million-subject sample. This last thing is a Cox/Jaynes plausibility over what the frequency histogram in the 10-million-subject case is. There is only one real frequency histogram for the 10-million-subject case, so there is no frequency over it, but there is still plausibility over it.
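
          A minimal version of “plausibility over possible histograms” (the bins and counts are invented, and a Dirichlet over a handful of bins stands in for the fancier Dirichlet-process machinery):

            import numpy as np

            rng = np.random.default_rng(3)
            counts = np.array([3, 12, 40, 31, 9, 5])   # observed counts in 6 bins
            prior = np.ones(6)                         # flat Dirichlet prior

            # Posterior over the unknown frequency histogram: Dirichlet(prior + counts).
            draws = rng.dirichlet(prior + counts, size=2000)

            # Each draw is one plausible histogram for the full population; a "peaked"
            # posterior means the drawn histograms all look nearly the same.
            print(draws.mean(axis=0))   # plausibility-weighted center
            print(draws.std(axis=0))    # how much the histograms still vary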

          So, if you are simultaneously trying to learn the causal parameters a,b,c in your f(x,a,b,c) function and at the same time trying to learn the real frequency of the errors Fr(y-f(x,a,b,c)), as opposed to just sticking with the theoretical information in your causal model about the errors, you can do this by including the above kinds of tricks in your model, and then you’ll have a different model.

          The vague symbolic versions of these things are sometimes poor substitutes for several examples of different assumptions leading to running Stan code with different code in it. Then you can point to different lines in the code and say “see, my causal assumption FOO, which is incorporated into the code for file 2, makes me change line 12 from what you see in file 1 to what you see in file 2”, and the use of the “knowledge K” becomes a concrete thing that is easier to point at.

        • > P is conditional only on the background knowledge/causal model you have BEFORE the data, then of course it can change from one place to another, and it can do so based on the causal analysis, and so the P of a Cox/Jaynes Bayesian, conditioned on knowledge K, can NOT be orthogonal to the knowledge K which includes the causality model.

          The knowledge K includes everything that is known; therefore P can NOT be orthogonal to the colour of your shoes either. Maybe you will tell me now that the colour of your shoes is irrelevant, but given the model, what relevance does it have whether or not you used a causality argument to formulate it?

        • K includes the color of my shoes, fine, but regardless of the color of my shoes I choose the same P, so at least P is insensitive to that part of K. I.e., the color of my shoes is irrelevant precisely because it doesn’t change P.

          Now, what relevance does it have whether P is formulated by a causal argument contained in K or by a non-causal argument contained in K? It has precisely the effect of usually altering my probabilities P, including over counterfactual predictions.

          if you say: “in case 214 if we had given the treatment what would the outcome have been?” I’d say:

          K1 contains causal information, so let me plug in X=1 to the equations, and then I’d give you p(Y | X=1,a,b,c,K1) and this would be some mathematical expression… and I could for example average over my uncertainty in a,b,c and give you a number.

          or maybe I’d say:

          K2 is not causal here, so I don’t know what would happen if you actually changed X=1 in that case, but to the extent that the observed information is sufficient to predict things, I’d calculate:

          p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula, and the posteriors over the a,b,c would be different, and their use within the function would be different, and if I averaged over the a,b,c I’d get a different number. And, if I’m honest, and aware that I’m not doing a “structural equation model” with causal analysis, I should give the strong caveat here, we don’t even expect this to necessarily work.

          Now, it’s POSSIBLE to accidentally stumble upon the right causal formulas even without a causal analysis in K2, so you’d ask me, what if the formulas were the same? And I’d basically have to say “What a happy accident!” but in the general case, they wouldn’t be the same. For example under K2 the non-causal analysis, my a,b,c might be just some coefficients in a general polynomial, whereas in K1 perhaps they’re coefficients in an ODE that I have to integrate to get the answer.

          It’s also possible to do the causal analysis and be utterly wrong, so that K1 predicts just wildly inaccurate stuff for counterfactuals. Caveats about how good your model is apply everywhere, not just in “associational” models and not just in “causal” models.

          So, I think it’s very useful to keep in mind what you’re doing, what does the K mean, and what are you requiring of your model. It’s particularly useful if you want to design experiments to test your model, where you will literally make X=1 and measure Y, and see if the model predicts the right outcomes.

        • >> given the model what relevance does it have whether you used or not a causality argument to formulate it?

          > p(Y | X=1,a,b,c,K1) and this would be some mathematical expression…
          > p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula,

          I’m afraid I don’t follow you. If the model is the same and the data is the same, why would the formula be different?

          There are many pieces of “knowledge” in K that might have influenced the design of the experiment and the model chosen. For example, if you are developing a drug, your clinical trial might depend on how close to market your competitors are. Would you say that P is not orthogonal to the existence of other drugs under development?

        • Carlos, another way to think about this very concretely. Suppose I want to be so honest it hurts with my modeling. So, in the case where K1 contains causal thinking, I can attribute this causality in the observational data. When X=0 and when X=1, my causal model f(X,a,b,c) + err operates. So I can learn about both the case X=0 and X=1 even without experiments where I change the X=0 into an X=1, because I think the same mechanism is going to be at work when I do that change as when nature does it for me in the observations.

          Now, suppose I’m fully associational, with no causal thinking, and just fitting curves and things. Now I can collect my data Yactual, and add to my data set Ycounterfactual, containing entirely missing data values “NA”. I can then say perhaps “at least I don’t think the Ycounterfactual will be outside the range of Y across all cases”. So I put a prior on Ycounterfactual that is very broad but not infinitely so. Now, with no data on Ycounterfactual, the posterior over Ycounterfactual will be the same as the prior over Ycounterfactual… So if you ask me to predict a counterfactual value, I’ll wind up just giving you the prior distribution.

          Perhaps then I go in and do some experiments, so I actually eventually observe some Ycounterfactuals (under the experimental treatment). Now maybe I can learn the counterfactual relationship even without a causal model.

        • “”
          > p(Y | X=1,a,b,c,K1) and this would be some mathematical expression…
          > p(Y | X=1,a,b,c,K2) and because in this case K2 has different information, this would be a different formula,

          I’m afraid I don’t follow you. If the model is the same and the data is the same, why would the formula be different?

          “”

          But the model ISN’T the same. As an example, in the previous comment I say in the first case maybe I have an ODE for which a, b, c are coefficients, and in the second case, because I don’t have the causal model, I don’t have an ODE, so I just do some polynomial fitting or whatever and the a, b, c are coefficients in the polynomial.

          This is what I mean above by “in a worked example, you’ll see the formulas or Stan code or whatever are different in different cases”

          p(foo | K1) and p(foo | K2) can be UTTERLY DIFFERENT. if K1 is causal knowledge it might be that the first p refers to a complex computational fluid dynamics problem with strange boundary conditions to predict say the acoustical noise produced by a jet turbine engine, and under K2 we’ve just got a 3 term polynomial regression against a dataset of tests of prototype engines with different length blades.

        • If the model is not the same, then you didn’t reply to my question. And as far as I can see, the question of causality being orthogonal or not to the conditional probability only makes sense in the context of a model. If you want to include in your term K the causal relationships and the existence of competitor products, so you can make use of these properties later to calculate causal effects from observational data or to make a decision about filing your drug for approval, please go ahead. But I would say that, in the context of the model, they are irrelevant for the probability calculations. Of course they will be relevant for the creation of the model and for the use you will make of the results, but this happens outside of the Bayesian framework.

          Even accepting that you can put the causality information in K, it won’t be used at all when you combine your model, priors, and data to get your posteriors. The Cox/Jaynes probability formalism doesn’t handle causal links, only logical links. Judea Pearl has proposed one way to formalise causal inference so you can actually use these causal assumptions to extract additional information from the experiment. There are other formulations, probably equivalent (even though some are more widely applicable than others). Maybe you’re doing it right using ‘ad hoc’ reasoning, though.

          One comment to finish (at least for today): while I didn’t really understand your example at 5:02 pm, I noticed you suggested doing experiments to observe counterfactuals. I don’t think counterfactuals can be observed (by definition).

        • Carlos:

          Regarding your statement, “I don’t think counterfactuals can be observed (by definition)”:

          This is one reason I prefer Rubin’s term “potential outcomes.” The potential outcomes are defined before any treatments have been assigned; you can then do an experiment to observe some of them. “Counterfactual” is, to me, an awkward term in that what is counterfactual and what is not, is only determined after the experiment has been performed.

        • “But I would say that, in the context of the model, they are irrelevant for the probability calculations”

          The point is that if you know K1 you write down p( | K1) and if you know K2 you write down p( | K2), and if K1 contains the information about how to run a computational fluid dynamics simulation and get an acoustical noise estimate whereas K2 contains some very basic guess that a 3 term polynomial will fit your database of engine test-runs, then you’re going to wind up with very different results.

          Now, if you ask the following: Suppose K1 contains causal reasoning, and arrives at a particular model p( | K1) and K2 contains only some rough guesses and arrives at p( | K2) and p( | K1) = p( | K2) by some miracle, will there be different numbers that come out of the whole shenanigans?

          Of course, no, because once you’ve reduced things to specific formulas, algebra takes over. But the set of cases where the exact causal formula comes out of a wild-ass guess is pretty small, so small as to be a pure distraction in this conversation. The point is, different K going in usually leads to different formulas for p coming out. And so my everyday experience is *when I think causally about a problem, it affects what p I put down* and so “causal thinking is orthogonal to p(y | x)” is a statement that I experience violations of every day on a practical level provided that you understand what *I* mean by p(y|x) which is a predictive part of the structural equation model and NOT an observed frequency. If you want to say “it happens before you start to run Stan, so it’s pre Bayesian-analysis” then I could also point to the idea that I might have 4 or 5 plausible causal models, put a prior over them, and then do some model selection via Bayesian analysis, so it actually CAN be part of the Bayesian analysis to choose from among several models. You can call this “fitting a single uber-model” if you like… but I don’t think that these fine grained distinctions get us closer to the goal.
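
          Since putting a prior over a handful of plausible models comes up here, a tiny sketch of that move (the two competing mechanisms, the data, and the crude grid integration are all invented; the common grid constant cancels in the ratio):

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(4)
            x = np.linspace(0, 3, 20)
            y = 1.0 + 0.5 * x**2 + rng.normal(0, 0.3, x.size)   # invented data

            theta = np.linspace(-3, 3, 601)     # grid for the one free coefficient
            prior = stats.norm(0, 1).pdf(theta)
            prior /= prior.sum()

            def marginal_lik(predict):
                # average the likelihood over the prior on the coefficient
                liks = [stats.norm(predict(th), 0.3).pdf(y).prod() for th in theta]
                return np.dot(prior, liks)

            m_lin = marginal_lik(lambda th: 1.0 + th * x)       # model 1: linear
            m_quad = marginal_lik(lambda th: 1.0 + th * x**2)   # model 2: quadratic

            # Equal prior plausibility on the two models; the data pick one:
            print(m_quad / (m_lin + m_quad))    # close to 1: quadratic wins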

          My biggest point is, I’m trying to help bridge a gap in communication between Pearl and people like me, or maybe Andrew or Corey or possibly you, or some others at the blog. What we experience every day is that we think about how some stuff should work, and then what we dig up out of our knowledge affects what p( | K) we write down, and the reason that is possible is that it’s a plausibility given a model, not an observed frequency in a data set. If it were an observed frequency in a dataset there’d be no flexibility of choice, but since it’s a plausibility given a knowledge-base, there is flexibility of choice!

          As for “I don’t think counterfactuals can be observed (by definition)”: ok, fine, then maybe Andrew is right that the better terminology is “potential outcomes.” If I can do an experiment where I take something where x=0 and y=y0, then go and change x to 1 and observe y=y1 afterwards, you might say “technically y might have equalled y2 if you had changed it earlier or in a different way or whatever, and y2 is the true counterfactual.”

          So, fine, this gets to Andrew’s point about actually modeling the physics of do(x=1), which is what I take as his meaning when he says “we need to think about potential interventions or ‘instruments’ in different places in a system”: have someone describe what they mean physically by “changing x to 1” and then figure out what the mechanistic results are. If you describe an alternative method of “changing x to 1” you might wind up with an alternative mechanistic result.

          The point I was trying to make was that if we want to talk about how doing a real-live experiment on a particular case produces a measurable outcome which is different from the previously measured outcome, then we could do the experiment and observe the measured outcome. Then we could learn from the data an associational formula, without any causal thinking, that nevertheless would (sometimes) successfully predict the outcome of specific experimental interventions.

        • Andrew,
          You wrote:
          “Counterfactual” is, to me, an awkward term in that what is counterfactual and what is not, is only determined after the experiment has been performed.

          It is precisely for this reason that I prefer “counterfactual” over “potential outcome” — the latter presumes a specific treatment and some experiment.
          Counterfactuals do not need treatments nor experiments.
          If you look again at the Causal Hierarchy, you will find that counterfactuals characterize the QUERY one is asking, not the result of any experiment.
          The answer to some counterfactual queries may be obtained from an experiment, but the character of the query is determined by the syntax of the query,
          whether or not an experiment is involved. For example, there is no experiment involved in “If it were not for the aspirin my headache would still be bothering me”
          or “Had Julius Caesar not crossed the Rubicon he would have become an emperor.”
          The common element here is retrospection, not experiment.

        • I acknowledge that it is useful to have a distinction between what would have happened if you had done something different in the past, and what will happen when you do something in the future. Since the past is unchangeable, the retrospection is inherently probabilistic (i.e., uncertain), whereas, if you explain clearly enough what it is you will do in the future, we can then do it and see if it does happen.

          Either way, at this point in time, a model can only provide plausibilities over what the outcome would have been in the past, or what the outcome of the experiment will be in the future. At a later time, after the experiment, we can at least know how good the prediction of the future was. We can never know how good the prediction of the alternative past was, except by setting things up in the future similarly to how they were in the past, doing what we would have done then, and seeing whether the outcome matches the prediction, together with an assumption that whatever differs in the initial conditions is irrelevant to the outcome.

    • Carlos Ungil,
      Thanks for reminding me of the paper I wrote in 2001
      “Bayesianism and Causality, or, Why I am only a
      half-Bayesian”
      http://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf

      I just read it again and, thank God for making me
      modest, otherwise I would have confessed in public
      that this is one of the best papers I have read on
      Bayes inference since Savage (1962).
      Strangely, it is cited by only 34 papers when,
      in contrast, my book on Bayesian networks (1988)
      has 22,447 citations (according to google scholar).
      How on earth did you discover it?

      Let me cite a few paragraphs to tell readers
      what it is all about, and how it is connected to the discussion with Daniel:

      Introduction
      I turned Bayesian in 1971, as soon as I began reading Savage’s
      monograph The Foundations of Statistical Inference (Savage, 1962).
      The arguments were unassailable: (i) It is plain
      silly to ignore what we know, (ii) It is
      natural and useful to cast what we know in
      the language of probabilities, and
      (iii) If our subjective probabilities are erroneous,
      their impact will get washed out in due time, as the
      number of observations increases.

      Thirty years later, I am still a devout Bayesian in the sense of (i),
      but I now doubt the wisdom of (ii) and I know that,
      in general, (iii) is false. Like most Bayesians, I believe that the
      knowledge we carry in our skulls,
      be its origin experience, schooling or hearsay,
      is an invaluable resource in all human activity,
      and that combining this knowledge with empirical data is the key
      to scientific enquiry and intelligent behavior.
      Thus, in this broad sense, I am still a Bayesian.
      However, in order to be combined with data,
      our knowledge must first be cast in some formal language,
      and what I have come to realize in the past ten years is that
      the language of probability is not suitable
      for the task; the bulk of
      human knowledge is organized around causal, not
      probabilistic relationships, and
      the grammar of probability calculus
      is insufficient for capturing those relationships.
      Specifically, the building blocks of our
      scientific and everyday knowledge are elementary facts such as
      “mud does not cause rain” and “symptoms do not cause
      disease” and those facts, strangely enough, cannot be expressed
      in the vocabulary of probability calculus.
      It is for this reason that I consider myself only
      a half-Bayesian.

      In the rest of the paper, I plan to
      review the dichotomy between causal and statistical
      knowledge, to show the limitation of probability calculus
      in handling the former, to explain the impact that this limitation
      has had on various scientific disciplines and, finally,
      I will express my vision for future development
      in Bayesian philosophy: the enrichment of
      personal probabilities with causal vocabulary and causal
      calculus, so as to bring mathematical analysis closer
      to where knowledge resides.

      The Demarcation Line

      The demarcation line between causal and statistical
      concepts is thus clear and crisp. A statistical concept
      is any concept that can be defined in terms of
      a distribution (be it personal or frequency-based)
      of observed variables, and a causal concept is any concept
      concerning changes in variables that cannot be defined
      from the distribution alone.

      Summary
      This paper calls attention to a basic conflict between
      mission and practice in Bayesian methodology.
      The mission is to express prior knowledge mathematically
      and reliably so as to assist the interpretation of data,
      hence the acquisition of new knowledge.
      The practice has been to express prior knowledge as prior
      probabilities — too crude a vocabulary, given the grand mission.
      Considerations of reliability (of judgment) call for enriching
      the language of probabilities with causal vocabulary
      and for admitting causal judgments into the Bayesian
      repertoire. The mathematics for interpreting causal judgments
      has matured, and tools for using such judgments in the
      acquisition of new knowledge have been developed.
      The grounds are now ready for mission-oriented Bayesianism.

      —————— end of quotes ——————
      As I said earlier: I am nominating it for a Best Paper
      Award. Unfortunately, my colleagues at the Society for
      Bayes inference think that, to be a Bayesian, you need
      a “theta” — No theta, no Bayes.

      Anecdotally, of all my Bayesian colleagues, only
      the late Dennis Lindley (1923 – 2013) admitted
      that Bayes analysis should adopt the do(x) notation.

      Thanks again for refreshing my memory,
      Judea

      • Judea:

        “causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions”

        Perhaps many discussions and concepts in statistics are static but that does not make all of statistics static!

        There are diachronic concepts/models.

        To me, being Bayesian means purposefully representing empirical phenomena (data in hand or in the future) jointly as arising from a data generating (probability) model, with that data generating model first being randomly set with some choice of a parameter (or more generally a distribution).

        That is
        1. A data generating model is randomly selected.
        2. Given the data generating model selected – generate data.

        With data already in hand, the joint distribution defined in 1 and 2 is to be conditioned on the data values (Bayes’ theorem).
        (This provides a simple example/picture https://en.wikipedia.org/wiki/Approximate_Bayesian_computation#The_ABC_rejection_algorithm )
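        To make steps 1 and 2 and the conditioning concrete, here is a minimal sketch in the spirit of that ABC rejection algorithm; the data, the prior, and the tolerance are all invented for illustration:

        import numpy as np

        rng = np.random.default_rng(0)
        observed = np.array([4.1, 5.2, 4.8, 5.5, 4.9])  # invented data in hand

        accepted = []
        for _ in range(100_000):
            # Step 1: randomly set the data generating model (draw a parameter).
            theta = rng.normal(0.0, 10.0)
            # Step 2: given that model, generate data.
            fake = rng.normal(theta, 1.0, size=observed.size)
            # Conditioning on the data, approximately: keep theta only if the
            # generated data land close to the observed data.
            if abs(fake.mean() - observed.mean()) < 0.05:
                accepted.append(theta)

        print(f"approx posterior mean of theta: {np.mean(accepted):.2f}")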

        But because it needs to be a purposeful representation of the world, it needs to represent what one thinks is/was happening/going on, and there may be many steps within 1 and 2 to do this (e.g., in 2, selective reporting of the data that was generated). For the joint model so defined (and what it represents), Bayes’ theorem gives what the analysis needs to be.

        David Freedman has written out simple DAGs this way (using his infamous box models) and I did a set of standard simple ones for a webinar. You have to build the causal structure into the joint model – the random choice/setting of the “joint distribution and data generating models” model – but except perhaps for the third level of your hierarchy, I don’t see why there would be a problem.

        • Keith O’Rourke,
          I wrote:
          “causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions”
          This does not preclude statistics from modeling dynamic processes. However, it precludes the statistical
          model from answering questions about changes in the hypothesized process.

          A typical example is Granger causality in economics. It is used for prediction of time series, which is data generated
          by a dynamic process. Fine. But because it uses only the joint distribution of the temporal variables involved,
          it falls under “statistics”, not “causal” and, as Granger himself confessed, it has nothing to do with causality; e.g., it cannot tell
          us whether the price at time t1 caused the price at time t2 or a third variable caused both.
          judea

        • I don’t like to argue over the definition of words, but I agree with you that Granger Causality is not what I think of when I want to use the word causality.

          Yet, I don’t want to relegate “statistics” to just “finding associations between variables A and B in a big dataset,” because I think there is something within what is commonly called “statistics”, namely Cox/Jaynes Bayesian logic, that also allows us to infer the unknowns embedded within our imperfect mechanistic models from data. I hope you will eventually agree that searching for associations in datasets, as Granger causality does, and making some scientific assumptions and then inferring the unknowns under those assumptions, as I often do when I build models, are two different things. So even if they both fall under “statistics,” we should separate them out.

          So, for the sake of clarity perhaps we can make the following distinctions:

          1) Associational Statistics: finding out how observing X can help us predict Y for whatever reason. Typical application: Granger Causality.

          2) Probability as Generalized Logic: one application is finding out what some data can tell us about an assumed mechanism f. Typical causal application: finding the rate constants needed in a 4-compartment ODE pharmacokinetic model involving the stomach and intestines, the bloodstream, the liver, and the kidneys. The goal is to figure out the ODE coefficients that describe how the drug transports between the compartments and is metabolized, so that once the posterior for the coefficients has peaked we can run simulations to help us predict how to adjust the manufacturing method of the pill to give the best possible time-release mechanism, keeping the drug concentration in the liver constant throughout the day.

          I think you will agree that the example in (2) contains a bunch of pretty strong causal assumptions, and because of that it will in general produce different predictions for drug concentration DL under the full generality of conditions; hence the “formulas” for p(DL | PillDiameter, Model) will be different than if the model were “fit this polynomial to a dataset of DL and PillDiameter”. And so, within the context of (2), “causality is *not* orthogonal to *the choice of plausibility values* p(DL | Model, PillDiameter)” is a true statement.
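          To be concrete about what the mechanistic half of (2) looks like, here is a stripped-down sketch: an invented linear compartment model and the likelihood that links it to data. Everything here (the compartments, rate constants, dosing, and noise scale) is made up for illustration, and in practice the posterior sampling would be done in Stan or similar; only the forward simulation and likelihood are shown:

          import numpy as np
          from scipy.integrate import solve_ivp
          from scipy import stats

          def pk_rhs(t, c, k_abs, k_liver, k_kidney):
              # Toy linear compartment model: gut -> blood -> (liver, kidney).
              gut, blood, liver = c
              d_gut = -k_abs * gut
              d_blood = k_abs * gut - (k_liver + k_kidney) * blood
              d_liver = k_liver * blood  # kidney route is pure elimination here
              return [d_gut, d_blood, d_liver]

          def simulate_liver(k_abs, k_liver, k_kidney, times, dose=1.0):
              sol = solve_ivp(pk_rhs, (0, times[-1]), [dose, 0.0, 0.0],
                              args=(k_abs, k_liver, k_kidney), t_eval=times)
              return sol.y[2]  # liver concentration over time

          # The likelihood p(data | rate constants, K): measured liver
          # concentrations equal the mechanistic prediction plus noise.
          def log_likelihood(params, times, measured, sigma=0.05):
              pred = simulate_liver(*params, times)
              return stats.norm.logpdf(measured, loc=pred, scale=sigma).sum()

          times = np.linspace(0.5, 12, 8)                   # hours after dose
          fake_data = simulate_liver(1.2, 0.6, 0.3, times)  # pretend measurements
          print(log_likelihood((1.2, 0.6, 0.3), times, fake_data))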

        • Daniel,
          Your proposal is very clear when it comes to (1) Associational Statistics, but when it comes
          to (2) Probability as Generalized Logic, you lost me.
          The reason you lost me is that even “Logic” cannot handle causes and counterfactuals (unless we go to
          modal logic or counterfactual logic etc.), so what do you propose to generalize? The entirety of human
          reasoning, formal and informal, past and future, and call it Probability? Why not call it “everything else
          which is useful”?

          Your proposal will be much clearer if you can continue in the style of (1), and use X, Y symbols, rather
          than shifting to verbal description. What more can (2) tell us about X and Y and Z?
          For example, should it tell us how plausible it is for Y to achieve level y had X been x’, given that X is in fact x?
          You lose me when you quit the relevant variables X, Y and Z and shift to speaking about mechanisms like f which are postulated
          to get answers to questions about X, Y and Z.
          Can we stay symbolic?
          If you do, I bet you will end up with the Causal Hierarchy and, then, I do not mind your calling it “Probability” or “Plausibility”
          or “Generalized Logic”, as long as we know what questions it is capable of answering that level (1) cannot.
          Judea

        • Pearl, elsewhere I’ve given you symbolic versions plus words, but here is the very simple concept stripped of most of my talking (hopefully the talking was useful for some others at least).

          if I know some stuff called K1, then I write down p(y | x, K1), which is a formula for, as you say, “how plausible it is for Y to achieve y given that X is made to be x by any means, and the mechanisms by which x causes y are truly the mechanisms known to background knowledge K1”

          and the formula for p( y | x, K1) is “told to me by my K1” and when my K1 admits counterfactual knowledge, then it will provide me with plausibilities for counterfactuals when I plug in counterfactual values for x.

          However, please note, I can use K2, some other set of knowledge, which might not express counterfactuals and mechanisms and science; it might express only things like “functions on this interval can be approximated by polynomials”, and then we have a pure association. I will get p(y | x, K2) and **the formulas will usually be vastly different: p(y | x, K1) != p(y | x, K2)**

          So, the generalized logic of probability theory is *the method* by which different kinds of knowledge can extract information from data and it can do this extraction regardless of whether I use K1 with counterfactual thinking, or K2 with associational thinking.

          If this is what you mean by “orthogonal” (i.e., that the probability calculus applies whether or not something is causal) then I agree with you!!! Peace and prosperity, as you say. Just as 2-valued logic always applies, so the real-valued logic of probability theory always applies.

          If, on the other hand, you meant that p(y | x) is a fact about the world that doesn’t change whatever my K1 or K2 or other K is, and so it is orthogonal to K1, K2, etc., then I absolutely, vehemently require that this kind of p have the notation “Fr” before I will agree with you, and better yet a subscript on the Fr denoting the dataset you are using to get the Fr. And then I will say my p(y | x, K1) is a plausibility “Pl” and *it does change with the K* and it is not equal to the Fr(y|x) in your dataset; in that sense my p(y|x,K) is not orthogonal to the knowledge K, it is instead completely determined by the knowledge K.

          I will also mention that Fr and Pl have the same mathematical properties so it is not surprising that they both wind up being called “p( )”
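          To make the Fr/Pl distinction concrete, here is a tiny sketch with a binary y and an invented dataset; the two K’s are encoded as two different Beta priors, and the same Fr coexists with different Pl’s:

          import numpy as np

          # Invented dataset: x is always 1 here; y is binary.
          y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

          # Fr(y=1 | x=1): an observed frequency in THIS dataset.
          # One number, no knowledge K involved.
          Fr = y.mean()

          # Pl(y=1 | x=1, K): a plausibility given knowledge K. Two different
          # K, encoded as two Beta priors, give two different answers from the
          # same data (posterior predictive of a Beta-Binomial model).
          def Pl(y, a, b):
              return (a + y.sum()) / (a + b + y.size)

          print(f"Fr(y=1|x=1)     = {Fr:.3f}")
          print(f"Pl(y=1|x=1, K1) = {Pl(y, 1, 1):.3f}   # K1: uniform prior")
          print(f"Pl(y=1|x=1, K2) = {Pl(y, 20, 20):.3f}   # K2: strong prior near 1/2")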

          Do we have our peace and prosperity yet?

          1) I am happy if you acknowledge that your causal hierarchy is in essence a kind of “classification system for the different kinds of K that can be used in p(y | x, K)”, describing to what uses it can be put, and that by “orthogonal” you meant the version that I agreed with above, i.e., “that the math of probability can be used for either causal models or noncausal models”

          and I am not happy if

          2) you insist that p(y|x) can only be interpreted as a fact about the world, or about your dataset, and that the content of a variable K1 or K2 cannot alter p(y|x), and this is what you mean by orthogonal

        • Also, Judea, I should say that I hope you find all of this helpful. The goal initially was something like “for Judea to find out how people over at Gelman’s blog think about causality and statistics,” so I have tried to explain using, as much as possible, some unambiguous symbols and some hopefully helpful words, and also to discuss some ambiguity in symbols that might make you and the blog denizens talk past each other (Fr, Pl, p, p(|K), etc.)

          So my attempt is to help you understand what we are up to over here, out of both respect for you and hope that you might also enlighten us in important ways, either by asking very cogent questions (which you have) or by discovering where exactly the connection is between what you talk about in your Causal Hierarchy and DAGs and things, so that you can point at pieces of what we are doing here and say “here, if this thing is of type FOO then it can never do BAR, but if this thing is of type BAZ then we accomplish QUUX” etc.

          I am getting near the limits of my ability to know how to explain it better, as I’ve tried so many ways… but here is a more formula based version below…

          So, let me go back to the very beginning, we have outcome Y, observed quantity X, and observed quantity M. Then, two different people think about the problem, and one says “X causes Y by partially affecting M and partially causing Y directly according to a function y = f(X,g(X)+Merr, a) + Yerr with Merr, Yerr, and “a” all unknown according to some plausibility values in appendix A” and we call this state of knowledge K1

          and person 2 says “I know some family of functions Q parameterized by a,b,c,d,e,f is sufficient to fit any smooth function in 2 variables, so I will say y = Q(X,M,a,b,c,d,e,f) + yerr and get my large dataset of Y,X,M and find a,b,c,d,e,f by using some plausibility values in appendix A2” and this state of knowledge we will call K2

          so person one writes down in Bayesian logic

          p(y | X, M, a, Merr, ysigma, K1) = normal(f(X, g(X)+Merr, a), ysigma)
          p(Merr, ysigma, msigma, a | K1) = see appendix A

          and then:

          p(a, Merr, ysigma, msigma | Y, X, M, K1) = p(Y | X, M, a, Merr, ysigma, K1) p(Merr, ysigma, msigma, a | K1) / Z1

          where Z1 is a normalizing factor.

          and person 2 writes down in Bayesian logic

          p(y | X, M, a, b, c, d, e, f, yerrsigma, K2) = normal(Q(X, M, a, b, c, d, e, f), yerrsigma)
          p(yerrsigma, a, b, c, d, e, f | K2) = see appendix A2

          and then

          p(a, b, c, d, e, f, yerrsigma | Y, X, M, K2) = p(Y | X, M, a, b, c, d, e, f, yerrsigma, K2) p(yerrsigma, a, b, c, d, e, f | K2) / Z2

          where Z2 is a normalizing factor…

          and clearly K2 and K1 resulted in different p functions, and because some “causal thinking” goes into K1, the first person will answer your query “what would have happened if X=X* instead of X?” by plugging in X*, confident that the structural equations f and g represent thoughts about counterfactuals, not just factuals.

          and person 2, if they’re being honest, will say “I just fit this to what really happened, I don’t have a function for what ‘would have happened’ because my K2 didn’t tell me enough”, though if they’re being lazy they might not realize this problem and might just stick in X* and see what happens.

          In the sense that probability as logic was used in both the first case and the second case, it is “orthogonal” to the causal thinking that goes into K1 or the non-causal thinking that goes into K2. But in the sense that what you think about in K1 or K2 changes p, the results are dependent on the thinking. A numerical sketch of this is below.
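          Here is that sketch; the particular f, g, and Q are invented, and plain least squares stands in for the full Bayesian fit just to keep it short:

          import numpy as np

          rng = np.random.default_rng(42)
          n = 5000

          # The (unknown to both people) true process: X -> M -> Y, plus a
          # direct X -> Y path.
          X = rng.normal(0, 1, n)
          M = 2.0 * X + rng.normal(0, 1, n)            # M = g(X) + Merr
          Y = 1.0 * X + 1.5 * M + rng.normal(0, 1, n)  # Y = f(X, M) + Yerr

          # Person 1 (K1): knows the structure, fits g and f separately.
          g_coef = np.linalg.lstsq(X[:, None], M, rcond=None)[0]              # M ~ X
          f_coef = np.linalg.lstsq(np.column_stack([X, M]), Y, rcond=None)[0] # Y ~ X, M

          # Person 2 (K2): fits Y on (X, M) as a black-box surface; with this
          # linear "polynomial" the fitted coefficients are the same numbers.
          q_coef = np.linalg.lstsq(np.column_stack([X, M]), Y, rcond=None)[0]

          x_star = 2.0

          # K1 answers do(X = x_star) by letting M respond through g:
          m_star = g_coef[0] * x_star
          y_do_k1 = f_coef[0] * x_star + f_coef[1] * m_star

          # K2, used lazily, plugs in x_star while leaving M at its observed mean:
          y_do_k2 = q_coef[0] * x_star + q_coef[1] * M.mean()

          print(f"K1 prediction under do(X=2): {y_do_k1:.2f}")  # ~ 1*2 + 1.5*(2*2) = 8
          print(f"K2 (lazy) prediction:        {y_do_k2:.2f}")  # ~ 1*2 + 1.5*0   = 2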

          I believe you would say the K1 model is a “level 3” model, one that incorporates counterfactual ideas, whereas the K2 model is a “level 1” model, one that incorporates only associations observed in data… and so I think your hierarchy is a way to classify different statistical (probability as logic) models into those 3 categories.

          and… importantly, for the original goal of understanding what people here are doing… whether a person at this blog is doing level 1, level 2, or level 3 thinking according to your hierarchy, they will probably use the same p( | ) symbols, and you will only be able to classify these things by asking them something to clarify what knowledge went into the construction of the model.

        • Daniel,
          I know we have been here before but I’m still not clear on how you define the extent of mediation in functions.

          For example, when you say “X causes Y by partially affecting M and partially causing Y directly according to a function y = f(X,g(X)+Merr, a) + Yerr…”, it is not clear to me how this function extracts “the extent to which X affects Y through M”. I know you talk about Ks but it is not clear to me how these Ks define the extent of mediation. The formula that I’m familiar with is f(0,g(1),u)-f(0,g(0),u), the one presented by Judea, and it clearly shows where the direct effect is frozen.

          So here are the questions to you:

          Is there a problem with the way f(0,g(1),u)-f(0,g(0),u) presents the extent of mediation?
          If not, is f(X,g(X)+Merr, a) + Yerr equivalent to f(0,g(1),u)-f(0,g(0),u)? How? (They should be equivalent if they measure the same thing.)

          If there is a problem, where is it, and how does f(X,g(X)+Merr, a) + Yerr fix it?

        • CK, I don’t have any problem at the moment with Pearl’s formula for the extent of mediation. Note, however, that it’s not something that I’ve ever felt the need to define or calculate. Once I have my parameters I can answer boatloads of questions about the process at hand by plugging numbers into the equations. What will be the time course of drug concentration in the liver if the pill is made 6mm in diameter? How much of the drug is excreted in the kidney instead of delivered to the liver? How much passes unabsorbed through the stool? All of these are related to mediation but are generally more salient in a specific example such as pharmacokinetics.
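          That said, once the structural functions and the error distributions are in hand, Judea’s f(0,g(1),u)-f(0,g(0),u) is easy to evaluate by simulation. A minimal sketch with invented f and g:

          import numpy as np

          rng = np.random.default_rng(7)

          # Invented structural equations, playing the role of f and g once
          # their parameters have been inferred:
          def g(x, merr):     # mediator equation M = g(X) + Merr
              return 2.0 * x + merr

          def f(x, m, yerr):  # outcome equation Y = f(X, M) + Yerr
              return 1.0 * x + 1.5 * m + yerr

          # Natural indirect effect: hold X at 0 in f, but let M take the value
          # it would have had under X=1 versus X=0, averaging over the errors u.
          n = 100_000
          merr = rng.normal(0, 1, n)
          yerr = rng.normal(0, 1, n)
          nie = np.mean(f(0, g(1, merr), yerr) - f(0, g(0, merr), yerr))
          print(f"NIE ~ {nie:.3f}")  # analytically 1.5 * (g(1) - g(0)) = 3.0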

        • Daniel
          Thank you for your patience and effort to teach me about the
          philosophy and practice of the Cox/Jaynes camp.
          You asked me above if I can see the connection to what
          people in my causal inference camp are doing, and I
          must confess that I am still oscillating between two
          plausible yet diametrically opposed hypotheses:

          Hypothesis 1:
          The Cox/Jaynes camp is far more advanced than us
          in the causal inference community. What we take as
          major accomplishments of the past two decades
          they accomplished a long time ago and are now
          solving mediation and other causal problems on a routine basis.
          We need to learn how they are doing it.

          Hypothesis 2:
          Researchers in the Cox/Jaynes camp do not grasp what
          they are missing by thinking that they are solving
          mediation and other causal problems. Sad, but
          understandable, given their devotion to the conditioning bar “|”.

          Why can’t I decide between H1 and H2?
          Because we started this track with a concrete and simple
          example on mediation among 3 binary variables. The problem
          happened to have a simple, closed form solution in terms of
          the observed probability of X, Y and M, regardless of the functional
          form of f and g, and regardless of the distribution of
          the error terms (assuming they are independent).
          Not having seen the solution leaves me undecided about whether your comrades
          know how to solve this and other problems like it.
          It would have taken just two lines to write down the solution
          and achieve global peace and understanding.

          So far I have seen illuminating discussions on K1 and K2,
          plausibility and frequency, Bayesian logic, and more,
          but no solution.
          This makes me uncertain whether your comrades consider
          the original problem to be too trivial, or too difficult to solve.

          Whatever the case, you have been very patient with me
          and I will not bother you more. Still, just in case some
          of this blog’s readers are interested, here are a few
          pointers for the curious.

          1. Mediation problems like the 3-variable example
          are now solved for any number of variables, binary and continuous, in
          the sense that we now know what conditions are needed
          to make the natural indirect effect (NIE) identifiable and, once it is, what
          the estimand for the NIE is.
          2. My new (2016) book “Causal Inference – a primer”
          (co-authored with Glymour and Jewell)
          describes the theory and applications of counterfactual
          reasoning at the undergraduate level.
          See http://bayes.cs.ucla.edu/PRIMER/
          You will find there numerically solved homework
          problems of the type:
          (a) What would be the expected salary of workers who are now
          at skill level s had they had another year of education?
          (b) What portion of the effect of education on salary
          is explainable by skill attainment?
          (c) What is the probability that the plaintiff would
          be alive had he not taken this drug?

          3. The book by VanderWeele (2014) and the writings
          of Kosuke Imai and his students, as well as Jamie Robins
          and Jay Kaufman in epidemiology, can give you further
          insights into current research in causal mediation.

          Enjoy,

          Judea

        • Judea, I think the confusion you have can be answered by the following. You assume that it is possible, with the 3-variable binary problem and a large dataset, to answer questions about probability.

          I on the other hand INSIST that *probability* is relative to the knowledge built into the model K.

          and since we have no substantive model in the example problem, there is really no knowledge K by which I could calculate real probabilities. The best I can do is give bucketloads of examples with more details where I could then have some knowledge by which to calculate probabilities.

          The probabilities you are thinking of are frequencies, and in this Bayesian camp, each of those frequencies has a plausibility over it (!!!) and that all depends on the knowledge K.

          So, why can’t I just give the numerical answer? Because there is no knowledge K!

        • Judea, in your most recent comment you say “The problem
          happened to have a simple, closed form solution in terms of
          the observed probability of X, Y and M, regardless of the functional
          form of f and g, and regardless of the distribution of
          the error terms (assuming they are independent).”

          and this hammers home for me our difficulty in communicating. In Cox/Jaynes theory, there is *no such thing as an observed probability*. There is only an observed *frequency*.

          The probabilities come from the knowledge K. The distinction is very, very real for us here in this land. Typically, when I do consulting, someone will come to me with their dataset and ask me to analyze it, and the first thing I will do is have them talk to me for a long time and answer questions about what they think is physically going on, how all the measurement instruments work, and so on. They sometimes get frustrated and ask me to quickly calculate some p-values or whatever, because they are used to some other form of “statistics” where stuff gets put into canned software, a button is pressed, a stamp of approval of p < 0.05 is generated, and they have magical permission to publish their paper or whatever. But I don’t do that.

          Until I gather enough background knowledge on what is going on, how the model will be used, and whether there are unmentioned variables that are known, etc., I cannot put down any probabilities and I cannot make any progress, and it does not matter how big the dataset is.

          (I am perhaps exaggerating; sometimes it’s clear from a very quick inspection what background knowledge is needed, but in the general case… no.)

      • I don’t really remember if this is how I originally found the paper, but Andrew Gelman made a comment about it a few years ago: http://statmodeling.stat.columbia.edu/2012/01/21/judea-pearl-on-why-he-is-only-a-half-bayesian/

        You have reminded me of Lindley’s review of your book. http://onlinelibrary.wiley.com/doi/10.1111/j.1751-5823.2002.tb00355.x/abstract
        Unfortunately I can’t find a copy online now. There is also a section “Seeing and Doing” in his book “Understanding Uncertainty” but he doesn’t go into much detail there.

        • Carlos
          I am glad you mentioned the writings of Dennis Lindley who
          was a true gentleman and became my role model.
          At the age of 85, he was as curious as a 3-year-old,
          and, instead of trying to teach me how to do Bayesian analysis
          (about which he knew much more than me), he kept on asking:
          “And how would you solve a problem like this?”
          “And what if we do not know this or that?”. Then he had the
          guts, at the end of a 145-message exchange, to go back
          to his compatriots and tell them: Hey guys, I think I learned
          something from this strange alien.

          We need more thinkers like him in the sciences.
          I am glad you gave me an excuse to mention him on this
          blog, and I wish I could live up to his legacy.
          Judea

  12. > p(y | x=1) = normal(f(x,a,b,c), s)

    Did you mean p(y | x=1,a,b,c) = normal(f(x,a,b,c), s)? No problem if you were keeping notation light, but I’m not sure if you’re assuming a,b,c are “unique” parameters (that you describe with posteriors that will eventually peak at some value). In fact a,b,c are stochastic and depend on the individual. To give a general solution for p(y | x=1) you would have to integrate over them, and the distribution can be anything.

    • Yes, p(y | x=1,a,b,c) = normal(f(x,a,b,c),s), where a,b,c may be facts about the world that are constant across all cases (such as an unknown but fixed speed of light, or the diffusivity of a protein through a fixed medium, or whatever), or facts about the individual case (such as, for people: age, personality traits, race, adiposity, fitness levels, concentration of magnesium in the blood, whatever)…
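      And yes, to get p(y | x=1) itself you integrate over them. A minimal sketch of that marginalization by Monte Carlo, with an invented f and invented distributions for a, b, c:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)

      def f(x, a, b, c):
          # Invented mechanism, standing in for whatever f the model specifies.
          return a + b * x + c * x**2

      s = 0.5  # known measurement noise scale, for simplicity
      x = 1.0

      # a, b, c vary across individuals; draw from their (invented) distributions.
      n = 100_000
      a = rng.normal(0.0, 1.0, n)
      b = rng.normal(2.0, 0.5, n)
      c = rng.exponential(0.3, n)  # deliberately non-normal: "can be anything"

      # p(y | x=1) = E_{a,b,c}[ normal(y; f(x,a,b,c), s) ], a mixture that need
      # not be normal even though each conditional p(y | x, a, b, c) is.
      y_grid = np.linspace(-4, 10, 15)
      p_y = [stats.norm.pdf(y, loc=f(x, a, b, c), scale=s).mean() for y in y_grid]
      for y, p in zip(y_grid, p_y):
          print(f"p(y={y:5.1f} | x=1) ~ {p:.4f}")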

  13. I am on Chapter 3 of your book right now (I still have to go back and do the R exercises in chapter 2). I was intrigued by your discussion of counterfactual and predictive interpretations. It brought to mind Eric Hanushek’s argument that, since effective teachers boost student achievement, and achievement correlates with future income, we could boost students’ lifetime income considerably by replacing the worst teachers with just average teachers.

    http://hanushek.stanford.edu/publications/valuing-teachers-how-much-good-teacher-worth

    This seems so faulty–specifically in terms of causal reasoning–that I looked on your blog for commentary. I saw a brief comment on Kahneman (who may have been responding to arguments like this) but nothing specifically about Hanushek’s argument.

    It seems that “seriousness about causal reasoning” would help a lot. First, to what extent, and in what contexts, does high school achievement (measured via performance/growth on standardized tests) actually correlate with income? I imagine it would have a lot to do with the chosen field. There may be a *general* association between test-score achievement and later income, but I bet this breaks down when you look more closely.

    But the idea of replacing the lowest-performing teachers with “average” teachers–as a way to boost students’ eventual income, and thus the national economy–seems even more far-fetched. Hanushek defines teacher effectiveness in terms of student test score growth. A teacher could end up with an “average” rating for all sorts of reasons. Maybe she focuses on test prep (which boosts the test scores just enough to put her in the “average” category). Maybe she teaches well but is stuck with a bad curriculum. Maybe she teaches poorly within a strong curriculum. Maybe the kids are already doing well on the tests, so they show little “growth.” There’s no reason to assume that a teacher’s particular “averageness” will help a given class of students achieve more on tests. In fact, it’s a bizarre concept: “Your averageness will turn these students’ lives around.”

    So a subtler and more precise causal analysis is needed.
