## Sorry, but no, you can’t learn causality by looking at the third moment of regression residuals

Under the subject line “Legit?”, Kevin Lewis pointed me to this press release, “New statistical approach will help researchers better determine cause-effect.” I responded, “No link to any of the research papers, so cannot evaluate.”

In writing this post I thought I’d go further. The press release mentions 6 published articles so I googled the first one, from the British Journal of Mathematical and Statistical Psychology (hey, I’ve published there!) and found this paper, “Significance tests to determine the direction of effects in linear regression models.”

Uh oh, significance tests. It’s almost like they’re trying to piss me off!

I’m traveling so I can’t get access to the full article. From the abstract:

Previous studies have discussed asymmetric interpretations of the Pearson correlation coefficient and have shown that higher moments can be used to decide on the direction of dependence in the bivariate linear regression setting. The current study extends this approach by illustrating that the third moment of regression residuals may also be used to derive conclusions concerning the direction of effects. Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., Y is regressed on X) is more symmetric than the distribution of residuals of the competing model (i.e., X is regressed on Y). Based on this result, 4 one-sample tests are discussed which can be used to decide which variable is more likely to be the response and which one is more likely to be the explanatory variable. A fifth significance test is proposed based on the differences of skewness estimates, which leads to a more direct test of a hypothesis that is compatible with direction of dependence. . . .

The third moment of regression residuals??? This is nuts!

OK, I can see the basic idea. You have a model in which x causes y; the model looks like y = x + error. The central limit theorem tells you, roughly, that y should be more normal-looking than x, hence all those statistical tests.
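
A minimal simulation sketch of this idea, under assumptions that are not taken from the paper: x is exponential (skewed), the error is standard normal and independent of x, and `skew` is a hypothetical helper for the sample skewness. It compares the skewness of x and y, and then the skewness of the residuals from regressing in each direction.

```r
set.seed(1)
n <- 10000

# Sample skewness: third central moment divided by the 3/2 power of the variance.
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

x <- rexp(n)   # skewed "cause" (an assumption made for illustration)
e <- rnorm(n)  # symmetric error, independent of x
y <- x + e

res_yx <- resid(lm(y ~ x))  # residuals in the direction used to generate the data
res_xy <- resid(lm(x ~ y))  # residuals from the reversed regression

print(c(x = skew(x), y = skew(y)))
print(c(y_on_x = skew(res_yx), x_on_y = skew(res_xy)))
```

In this setup y comes out less skewed than x, and the residuals of y on x are close to symmetric while those of x on y are not. Swapping the assumptions (a symmetric x with a skewed error) should reverse the residual comparison, which is one way to see how much the approach leans on the distributional assumptions.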

Really, though, this is going to depend so much on how things are measured. I can’t imagine it will be much help in understanding causation. Actually, I think it will hurt in that if anyone takes it seriously, they’ll just muddy the waters with various poorly-supported claims. Nothing wrong with doing some research in this area, but all that hype . . . jeez!

1. Luis says:

What? Perhaps I’m missing something but if y = x + error then x = y – error. They’re perfectly symmetric! One shouldn’t be more normal-looking than the other.

• Andrew says:

Luis:

Yes, you’re missing something. Start with y = x + error, with error independent of x. Then, yes, you can write x = y – error, but error is not, in general, independent of y. So the scenarios are not symmetric.

• numeric says:

If (x,e) is bivariate normal with correlation rho, then any linear combination (including x + e) is normal. In particular, (y = x + e, x, e) is trivariate normal (though degenerate), and y – e is also normal. So in the normal case independence of errors is irrelevant to the form of the distribution. It is necessary to associate a method of estimation with this scenario in order to make comments about the similarity of y = x + e and x = y – e, but even then the estimated regression coefficient for y = xb + e would be b̂ = (x’y)/(x’x) and that for x = yd – e would be d̂ = (x’y)/(y’y). Note b̂ != 1/d̂ in general. Formulas are worked out in

I’m being somewhat pedantic here to note that this is not as simple an issue as it seems, and Luis probably needs more context than a simple “missing something,” context that I have not even begun to provide at a satisfactory level of detail. I’ll just say that a lot of people have thought a lot about how to approach this type of problem (different simple model specifications allowing some type of deeper insight into the underlying data generating model, or DGM), with no generally accepted results. These third moments appear to be another attempt in this direction, but note that it appears to depend on the method of moments, which is yet another estimation method.

I will say Andrew (or someone) might run a posterior predictive check on the two models (y = x + e and x = y – e) and see if there is any discernible difference. My gut feeling is no.
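
The two coefficient formulas above can be checked numerically. This is only a sketch under assumed settings (standard normal x and e, no intercept, matching the x’y/x’x form); the names `b_hat` and `d_hat` are illustrative.

```r
set.seed(2)
n <- 1000
x <- rnorm(n)
e <- rnorm(n)  # independent of x
y <- x + e

b_hat <- sum(x * y) / sum(x * x)  # slope of y on x, no intercept: x'y / x'x
d_hat <- sum(x * y) / sum(y * y)  # slope of x on y, no intercept: x'y / y'y

print(c(b_hat = b_hat, d_hat = d_hat, inverse_of_d_hat = 1 / d_hat))

# The product is (x'y)^2 / (x'x * y'y), which is at most 1 by Cauchy-Schwarz,
# so b_hat equals 1 / d_hat only when x and y are exactly proportional.
print(b_hat * d_hat)
```

The same slopes should come out of `lm(y ~ x - 1)` and `lm(x ~ y - 1)`.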

• Andrew, I think your comment is confusing. For each measured pair (x_i,y_i) it’s true that there exists some number e_i so that y_i = x_i + e_i, the number e_i = y_i – x_i obviously. The distribution of e_i values is whatever it is.

Are you using algebraic notation to refer to a regression model, i.e.,

y ~ x + e …. meaning y_i = a*x_i + b + e_i

?

Because otherwise… y_i – x_i is a well defined number in each case, and the distribution of those numbers is just whatever it is.

• Andrew says:

Daniel:

What I’m saying is that, if y = x + e, and e is statistically independent of x, then e will not necessarily be statistically independent of y. Here’s a simple simulation in R:

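```r
n <- 1000
x <- seq(0, 1, length = n)  # fixed, evenly spaced x
e <- rnorm(n)               # error drawn independently of x
y <- x + e
print(cor(e, x))            # near 0: e is unrelated to x
print(cor(e, y))            # far from 0: e is strongly correlated with y
```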

I was just responding to Luis's statement, "If y = x + error then x = y – error. They’re perfectly symmetric!" Yes, Luis's statement is correct as written, but typically in statistics the statement "y = x + error" is taken to imply that the error is statistically independent of x.

• Luis says:

Ok, my apologies. I didn’t read the part about assuming non-normality – I was thinking about two normal RVs.

But still, one has to assume the error term is normal, which is not even that natural an assumption depending on the distribution of x and y. I imagine this would be incredibly sensitive to distributional assumptions, which defeats the purpose of the technique…

• Yes, you’re missing something: the residuals from regressing y on x (y = P1y + e1, where P1 projects onto span(x) and e1 is the component of y orthogonal to span(x)) are not the same thing as the residuals from regressing x on y (x = P2x + e2, where P2 projects onto span(y) and e2 is the component of x orthogonal to span(y)).

2. Igor Carron says:

Hey Andrew,

There is a segment of the Machine Learning population that uses this trick. It does somewhat OK but as you say it’s really a question of how the sampling of the dataset is done.

See for instance: “Non-linear Causal Inference using Gaussianity Measures” http://www.jmlr.org/papers/volume17/14-375/14-375.pdf

Also of interest is the Chalearn challenge: http://www.causality.inf.ethz.ch/cause-effect.php

Happy holidays !
Igor.

• Martha (Smith) says:

Never heard “Gaussianity” before. Is it different from just “Gaussian”?

• jrc says:

Gaussianity is a neurological disorder most commonly associated with the spiritual belief that mathematical tractability guarantees reference to the true world. Clinical symptoms closely resemble narcissistic personality disorder. Neurophysiological effects closely resemble those associated with long-term cult membership.

• Ben Bolker says:

“Gaussian” is an adjective. “Gaussianity” is a noun. According to my brief dive down the rabbit hole it is an “abstract singular term”: “Abstract singular terms are expressions like ‘triangularity’, ‘mankind’, ‘redness’, and ‘friendship’ which … are formed from predicate-expressions … by the addition of suffixes like ‘-ity’, ‘-hood’, ‘-ness’, ‘-kind’, and ‘-ship’.” http://bit.ly/2hucXig . Gaussianity: 253K Ghits. Gaussianness: 2.38K Ghits.

• Ricardo Silva says:

A Gaussian measure is a measure based on the Gaussian distribution. For instance, a prior distribution on parameters that happens to be Gaussian.

A Gaussianity measure is a quantification of how different a distribution is from a Gaussian (so “measure of Gaussianity” works here, but not “measure of Gaussian”). For instance, the magnitude of its kurtosis or the KL divergence between the distribution and a Gaussian with the same first two moments. Anyone who uses Gaussians in a model should at least check some Gaussianity measures of the empirical distribution to get a sense of how acceptable or poor a Gaussian approximation might be.
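
As one concrete instance of such a check, here is a small sketch using excess kurtosis, which is zero for a Gaussian. The helper name `excess_kurtosis` and the three example distributions are assumptions chosen for illustration.

```r
set.seed(3)
n <- 10000

# Excess kurtosis: fourth central moment over squared variance, minus 3.
# It is 0 for a Gaussian, negative for flatter tails, positive for heavier tails.
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

samples <- list(
  normal      = rnorm(n),
  uniform     = runif(n),
  exponential = rexp(n)
)

print(sapply(samples, excess_kurtosis))
```

A value far from zero in either direction suggests the Gaussian approximation is poor, though a value near zero does not by itself establish Gaussianity, since other features such as skewness can still differ.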

• Erikson says:

But wait, from the abstracts of both papers (from the post and the one mentioned by Igor):
“Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., Y is regressed on X) is more symmetric than the distribution of residuals of the competing model (i.e., X is regressed on Y).”
or
“Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction.”
I’m pretty sure that the reversal of the conclusion is due to different assumptions, but it’s quite surprising.

3. jebyrnes says:

Causality from correlation is one of those great old problems. I’m wondering what you think of convergent cross-mapping (CCM), a different approach laid out in Sugihara et al.’s “Detecting Causality in Complex Ecosystems,” and its extension to spatially replicated data (multispatial CCM) from “Spatial ‘convergent cross mapping’ to detect causal relationships from short time-series.” I don’t see either method being used very often yet.

4. jrc says:

Look Andrew, if we don’t have a button to push to get causal estimates, how are we supposed to know the difference between correlation and causation? Nuanced logical and statistical argumentation? Pshhhh…..

5. Dzhaughn says:

On the bright side, one might be able to use such a test to dismiss half of the noise-mining claims. What if we found that the data show that it is the strength of the hurricane that causes it to be given a female name?

6. Christian Hennig says:

“it is shown that the distribution of residuals of the correctly specified regression model (e.g., Y is regressed on X) is more symmetric than the distribution of residuals of the competing model (i.e., X is regressed on Y)”

What if the true model is X is regressed on Y but residuals are skew?
OK, this just backs up Andrew’s “This is nuts!” – however, I wonder whether there is *any* possibility to make statements about the direction of causality from data that isn’t based on comparing competing models at the same level of complexity (understood in a pretty broad and somewhat informal sense, for example “symmetry is less complex than skewness”) and will always leave open the possibility that a model with causality working the other way round is as good or even better, only requiring a slightly higher complexity level.
If I remember it correctly, all methods for deciding the direction of causality I have seen were of this kind in some sense. Then this is not an area in which I can claim to have the best expertise, so I may be wrong.

• george says:

Should be discussed in Pearl’s probabilistic reasoning book – chapter 8?

• I think it’s a rule that if it’s discussed in only the last few chapters of a book then there isn’t likely to be a good answer (in said book)?

• Ricardo Silva says:

Hi, Christian (Happy new year from the other side of our corridor)

I’m not sure how anyone expects to say anything about causality without untestable assumptions (and saying “do an experiment” is a cop-out when we can’t, whether it is right now or ever). For my money, it is a matter of starting from different assumptions, seeing whether they agree or disagree in the conclusions, and assessing what might explain the difference. We can have assumptions at a more domain-specific level (“in this problem, this is the directionality, and here are the confounders you are looking for”) or work at a higher-level hierarchy of assumptions (“this problem belongs to this class, where in this class e.g. the world behaves as linear relationships with additive non-Gaussian contributions from the latent variables”). To put it another way, if someone comes to you with a model based on “substantive” knowledge and it disagrees with a higher-level set of assumptions, maybe we should have some doubts about the substance of the expert claims or at least try to explain where the difference came from. By the way, I found that this paper gives interesting food for thought, even if the ideas here are hard to put in practice: http://www.sciencedirect.com/science/article/pii/S0004370212000045

• Christian Hennig says:

Thanks for the thoughts and references, george as well. Well put, Ricardo, and happy new year from what is at the moment a somewhat bigger distance (about 1500 km if you are where I think you are ;-)!

• Very interesting paper, Ricardo, thanks for sharing that!

7. Jonathan says:

My reaction was of course: if you start with abnormal distributions then of course you can measure skew in a dimension by checking asymmetry. As I think they kind of said, you convert what you’re looking at to normal. One issue: any result is the summary of a lot of dimensions so it can shift with volatility and that you’d have a hard time knowing that because that would invoke a model fitting at that level and that extra step away invokes a host of complicating dimensions. So let’s say you simplify by assuming increasingly ideal distributions as you step away, so it makes a cone for example moving in ideal steps toward a point of disappearance, then how do you know where that point is in the distance versus any other point, especially when you simply rotate to make a sphere or even an idealized circle? It was really hard for me to understand what this abstract actually says because they’re eager to talk about whatever tests they’ve come up with, but one guess is they’ve not correctly understood that they’re assuming a remote causal point and thus the correlations implied to the skew line or point, that they’re focused on the idea of an abnormal distribution and the skew that would occur when you evaluate from both perspectives and didn’t think that far. I don’t think the idea of working with residuals is bad at all, but if you’re examining that comparison gap or potential then you’re pretty much automatically also increasing various forms of noise to the point where the effect may be too volatile to classify.

8. Jack says:

Seriously you provide no proof of your claim and bash the paper by just reading the abstract…

• Andrew says:

Jack:

Yes, I am completely serious. If someone wants to claim that they can determine causality by looking at the third moment of regression residuals, they’re the ones who need to provide the evidence. In my post I described how this sort of method might operate, and I think the whole thing is just too sensitive to distributional assumptions to make sense. I mean, sure, no reason people shouldn’t research these ideas, but I don’t buy that press release.

• John says:

How do you know they didn’t provide good evidence if you didn’t read the papers? This actually seems a promising line of research.

• John says:

I agree the press release is ridiculous though.

• Andrew says:

John:

The abstract’s there for a reason! I can’t read every paper. I read the abstract and it has lots of problems, beginning with the idea that one should even want to “decide which variable is more likely to be the response and which one is more likely to be the explanatory variable.” It’s full of shaky assumps and significance tests and it looks like a mess to me. I’m not trying to stop you or anyone else from reading this paper; I just can’t picture it working out except by accident.

9. Cameron says:

Interjecting rather late into this conversation….. I can see how this whole notion could be instantly repellent, but it is my understanding that the methods described in the article Andrew mentioned (and many related works) are provably correct given certain (admittedly heroic) assumptions. As usual, take it or leave it. Anyway, I have about a bazillion references on this topic if anyone is interested in studying it further.

Cheers,

The Collector