Kevin Lewis points us to this interesting paper by Jacob Westfall and Tal Yarkoni entitled, “Statistically Controlling for Confounding Constructs Is Harder than You Think.” Westfall and Yarkoni write:

A common goal of statistical analysis in the social sciences is to draw inferences about the relative contributions of different variables to some outcome variable. When regressing academic performance, political affiliation, or vocabulary growth on other variables, researchers often wish to determine which variables matter to the prediction and which do not—typically by considering whether each variable’s contribution remains statistically significant after statistically controlling for other predictors. When a predictor variable in a multiple regression has a coefficient that differs significantly from zero, researchers typically conclude that the variable makes a “unique” contribution to the outcome. . . .

Incremental validity claims pervade the social and biomedical sciences. In some fields, these claims are often explicit. To take the present authors’ own field of psychology as an example, a Google Scholar search for the terms “incremental validity” AND psychology returns (in January 2016) over 18,000 hits—nearly 500 of which contained the phrase “incremental validity” in the title alone. More commonly, however, incremental validity claims are implicit—as when researchers claim that they have statistically “controlled” or “adjusted” for putative confounds—a practice that is exceedingly common in fields ranging from epidemiology to econometrics to behavioral neuroscience (a Google Scholar search for “after controlling for” and “after adjusting for” produces over 300,000 hits in each case). The sheer ubiquity of such appeals might well give one the impression that such claims are unobjectionable, and if anything, represent a foundational tool for drawing meaningful scientific inferences.

Wow—what an excellent start! They’re right. We see this reasoning so often. Yes, it is generally *not* appropriate to interpret regression coefficients this way—see, for example, “Do not control for post-treatment variables,” section 9.7 of my book with Jennifer—and things get even worse when you throw statistical significance into the mix. But researchers use this fallacious reasoning because it fulfills a need, or a perceived need, which is to disentangle their causal stories.

Westfall and Yarkoni continue:

Unfortunately, incremental validity claims can be deeply problematic. As we demonstrate below, even small amounts of error in measured predictor variables can result in extremely poorly calibrated Type 1 error probabilities.

Ummmm, I don’t like that whole Type 1 error thing. It’s the usual story: I don’t think there are zero effects, so I think it’s just a mistake overall to be saying that some predictors matter and some don’t.

That said, for people who are working in that framework, I think Westfall and Yarkoni have an important message. They say in mathematics, and with several examples, what Jennifer and I alluded to, which is that even if you control for pre-treatment variables, you have to worry about latent variables you haven’t controlled for. As they put it, there can (and will) be “residual confounding.”

So I’ll quote them one more time:

The traditional approach of using multiple regression to support incremental validity claims is associated with extremely high false positive rates under realistic parameter regimes.

Yup.
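Their claim is easy to reproduce in a quick simulation (my own sketch, not the authors' code): a predictor with no direct effect on the outcome routinely comes out "significant" once the confound is controlled for only through an unreliable measurement of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def false_positive_rate(n=100, reliability=0.7, n_sims=2000):
    """Westfall & Yarkoni-style scenario: x has NO direct effect on y;
    both are driven by a latent confound c, but we 'control' only for a
    noisy measurement of c. Returns the rate of spuriously significant
    x coefficients at the nominal 5% level."""
    rejections = 0
    # Measurement-error variance chosen so Var(c)/Var(c_obs) = reliability
    err_var = (1 - reliability) / reliability
    for _ in range(n_sims):
        c = rng.normal(size=n)                     # latent confound
        x = c + rng.normal(size=n)                 # predictor, correlated with c only
        y = c + rng.normal(size=n)                 # outcome, driven by c only
        c_obs = c + rng.normal(scale=np.sqrt(err_var), size=n)  # unreliable proxy
        X = np.column_stack([np.ones(n), x, c_obs])
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
        t = beta[1] / se[1]                        # t-statistic for x
        if abs(t) > 1.984:                         # approx. two-sided 5% cutoff, df = 97
            rejections += 1
    return rejections / n_sims

print(false_positive_rate(reliability=1.0))  # near the nominal 0.05
print(false_positive_rate(reliability=0.7))  # far above 0.05
```

With a perfectly reliable confound measure the rejection rate sits near 5%; with reliability around 0.7 (common for short psychological scales) it is several times that.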

They also say, “the problem has a principled solution: inferences about the validity of latent constructs should be supported by latent-variable statistical approaches that can explicitly model measurement unreliability,” which seems reasonable enough. That said, I can’t go along with their recommendation that researchers “adopt statistical approaches like SEM”—that seems to often just make things worse! I say Yes to latent variable models but No to approaches which are designed to tease out things that just can’t be teased (as in the “affective priming” example discussed here).
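To make the latent-variable idea concrete, here is a toy method-of-moments sketch (mine, not the paper's): with two parallel noisy indicators of the confound, the covariance between the indicators estimates the variance of the latent variable, and the corrected regression removes the residual confounding that survives the naive adjustment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
c = rng.normal(size=n)                     # latent confound
x = c + rng.normal(size=n)                 # no direct effect on y
y = c + rng.normal(size=n)
c1 = c + rng.normal(scale=0.8, size=n)     # two parallel noisy indicators of c
c2 = c + rng.normal(scale=0.8, size=n)

def cov(a, b):
    return np.cov(a, b)[0, 1]

# Naive: "control for" one noisy indicator -> x picks up residual confounding
A = np.column_stack([np.ones(n), x, c1])
b_naive = np.linalg.lstsq(A, y, rcond=None)[0][1]

# Latent-variable correction: cov(c1, c2) estimates var(c), and covariances
# of the other variables with c1 are unbiased for their covariances with c.
# Solve the corrected normal equations for the regression of y on (x, c).
S = np.array([[np.var(x, ddof=1), cov(x, c1)],
              [cov(x, c1),        cov(c1, c2)]])
m = np.array([cov(x, y), cov(c1, y)])
b_latent = np.linalg.solve(S, m)[0]

print(b_naive, b_latent)   # naive estimate is biased away from 0; corrected one is near 0
```

This is the simplest possible version of what an SEM measurement model does; a full latent-variable fit would also propagate the extra uncertainty into the standard errors.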

I am sympathetic to Westfall and Yarkoni’s goal of providing solutions, not just criticism—but in this case I think the solutions are further away than they seem to believe, and that part of the solution will be to abandon some of researchers’ traditional goals.

See the closing section of Shear and Zumbo for a perspective on the problem.

Shear, B. R., & Zumbo, B. D. (2013). False positives in multiple regression: Unanticipated consequences of measurement error in the predictor variables. Educational and Psychological Measurement, 73(5), 733–756.

Thanks for another Gelman Bump. To make sure I read your closing comments correctly, you seem to be saying “this particular usage of SEM seems okay, but generally speaking I am skeptical of many applications of SEM (such as complicated path analyses), and I would not support the idea that everyone should run around applying SEM all the time.” Is that about right? (The latter idea of course is not our recommendation.)

Jake:

I think you need to get beyond working with just what is within these studies, by using external sources of information, including informative expert-based priors.

For instance https://experts.umn.edu/en/publications/multiple-bias-modelling-for-analysis-of-observational-data

Introduction to these methods – http://www.springer.com/gp/book/9780387879604

Yes, thanks for the bump! We explicitly address the Type I thing later in the paper, where we note that exactly the same issue arises if you’re doing estimation (both frequentist and Bayesian). I agree with you that Type I errors don’t really exist in psychology, but we thought it would be prudent to write the paper in a way that maximizes the likelihood other researchers would pay attention to the core message, so we stuck with the standard NHST framework for rhetorical purposes. Attacking more than one statistical dogma in a paper seems like a good way to get ignored.

Regarding SEM, I’m also not a fan, but Jake succeeded in convincing me that my dislike of SEM is grounded in a dislike of what people commonly *do* with SEM, and is not really a complaint about SEM itself. As we previously discussed in the comments on one of your earlier posts, you can certainly use a Bayesian approach to achieve the same ends, and we agree that the critical thing is to model the thing you care about (the latent construct) rather than just the thing you don’t care about (the observable measurement). But if you’re going to run multiple regressions and draw pseudocausal conclusions based on estimated regression coefficients, fitting SEMs of the kind we propose is a big step in the right direction—and, in practice, in many cases the resulting conclusion will be just what you noted—that compelling, simple explanations are further away than one might otherwise think.

One way to look at it is to imagine each variable on its own axis. We like to think that each axis is independent, so that its data don’t project onto any other axis. However, it’s a fact of life that most things aren’t fully orthogonal (i.e., they have some degree of collinearity), and any attempt to establish orthogonality will have some error, simply because of sampling. Even if we use the data to try to account for non-orthogonality of the axes, we’re going to have errors in the amount of projection.

Unknown projection of a variable onto another axis is equivalent to having spurious data on that axis. Because of the squaring done by least squares methods, this spurious data can have a very exaggerated effect on the estimated parameters. This is especially so if the spurious projected values happen to come near the ends of the data range, where squared values affect the estimated parameters the most.

The result of these unknown projections can be really serious errors in the estimated parameters. Or maybe not: after all, it’s statistics… but you won’t be able to know…
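The commenter's picture can be made concrete in a few lines (illustrative only): "controlling" for x1 orthogonalizes x2 against x1 in-sample, but the estimated projection coefficient differs from the true one, so some of x1 still leaks through the supposedly orthogonalized variable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "axes" that are not orthogonal: x2 carries a projection onto x1.
n = 50
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # true projection coefficient: 0.5

# Estimate the projection from data (what "controlling for x1" does implicitly)
b_hat = (x1 @ x2) / (x1 @ x1)
resid = x2 - b_hat * x1              # the "orthogonalized" version of x2

# In-sample, the residual is exactly orthogonal to x1 ...
print(x1 @ resid)                    # ~0 (orthogonal by construction)
# ... but b_hat != 0.5, so relative to the TRUE axes the residual still
# contains a leftover component of x1 of size (0.5 - b_hat):
print(b_hat, 0.5 - b_hat)
```

The leftover shrinks only at the usual 1/sqrt(n) rate, and that is before adding any measurement error in x1 itself, which biases b_hat systematically rather than just noisily.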

I thought structural equation modeling (SEM) was just an opaque (to me) graphical language for regression-like models. That’s what it seemed like reading the SEM package doc in Stata that Sophia Rabe-Hesketh was involved with and also reading her book on hierarchical models. The Stata manual’s particularly helpful for unfolding the language into statistical models.

The Wikipedia article on SEM says it’s associated with certain modeling and estimation techniques. I think that’s what Tal Yarkoni is saying above—the notation isn’t so bad, but people abuse it in certain systematic ways.

Andrew—keep in mind that all press is good press. This caused me to go look up SEM again, keeping it in the discussion.

Perhaps it was unique to my exposure to SEM but the course I took began with a discussion of causal models via Morgan & Winship. As a grad student it was eye opening to move beyond thinking about a flat causal structure in my introductory regression course to one where I could express my beliefs about the causal ordering of variables in a model. Add to that a fuller discussion of how to incorporate latent variable measurement models from psychometrics and I think you have a package of techniques that can be incredibly useful to substantive researchers. Also ripe for abuse but I’m not sure how that is a critique unique to SEM.

Also great work Jake and Tal. Excellent read.

Even though I am often dubious about the causal claims, I think that the whole idea of acknowledging that you do not have direct measurements is extremely important. It is also very useful in practice, pushing researchers to spell out the relationships between predictor variables rather than assuming that they are all fixed.

Yes, of course, but you might want to consider https://arxiv.org/abs/1610.02080

If everyone who wanted to draw conclusions about incremental importance were required to learn about Shapley values, the world might not be a better place, but it would slow things down a little.
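For readers unfamiliar with the idea: the Shapley value of a predictor is its marginal contribution to some fit statistic (here R²), averaged over every order in which the predictors could enter the model. A brute-force sketch (my illustration; fine for a handful of predictors, but the cost grows factorially):

```python
import itertools
import numpy as np

def r2(X, y, cols):
    """R-squared of an OLS fit of y on the given subset of columns (plus intercept)."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def shapley_r2(X, y):
    """Shapley decomposition of R^2: each predictor gets its marginal R^2
    gain, averaged over all orders in which it could be added."""
    p = X.shape[1]
    values = np.zeros(p)
    perms = list(itertools.permutations(range(p)))
    for perm in perms:
        used = []
        for j in perm:
            values[j] += r2(X, y, used + [j]) - r2(X, y, used)
            used.append(j)
    return values / len(perms)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # collinear with x1, no effect of its own
x3 = rng.normal(size=n)
y = x1 + x3 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

phi = shapley_r2(X, y)
print(phi, phi.sum())   # the shares sum exactly to the full-model R^2
```

By construction the shares add up to the full-model R², which is the appeal; but note that a Shapley decomposition of noisy, collinear measurements inherits all the measurement-error problems discussed above.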

SEM is neither friend nor foe. It’s (1) a method for specifying direct and indirect effects of endogenous and exogenous variables on one another, clearly and explicitly and (2) a way to account for measurement unreliability by separating the measured indicators from the theoretical entity of interest, the latent “factor”, and thus making inferences on the underlying factor rather than the indicators.

Its benefit, in my opinion, is that it facilitates the right kind of scientific inferences. In Popper’s terms, we ought to be making risky predictions, not racking up asterisks indicating “significant p-values” (see Meehl’s “Theoretical Risks and Tabular Asterisks”).

In short, with any set of measured variables, there can be hundreds or thousands of possible permutations of their network of influence. A researcher specifies one model. If the postulated model fits the data better than other models, you’re doing something interesting.
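One way to operationalize "fits the data better than other models" is a penalized-fit comparison between the postulated structure and a rival. A deliberately simple sketch (plain linear models and BIC rather than a full SEM fit index; the variable names are made up for illustration):

```python
import numpy as np

def bic(y, X):
    """BIC of a Gaussian linear model (up to an additive constant)."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)     # x2 is downstream of x1
y = x2 + rng.normal(size=n)      # y depends on x2 only; x1 acts through x2

# Model A (the "risky" postulate): y <- x2
# Model B (a rival):               y <- x1
ones = np.ones(n)
bic_a = bic(y, np.column_stack([ones, x2]))
bic_b = bic(y, np.column_stack([ones, x1]))
print(bic_a < bic_b)   # the correctly specified model should win
```

The catch, as the thread goes on to note, is that winning this contest only means beating the rivals you chose to include.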

Yet, Gelman’s skepticism of SEM, and that of the authors, is also well founded. Many budding researchers fit SEMs, find the postulated model does not fit, and proceed to jigger the model until they find one that fits, and this is the one that gets published in a peer reviewed journal. Yes, researchers abuse SEM. But the tool can be used for good by someone who respects the inference of actual science: risky predictions are the key, not p-values.

” If the postulated model fits the data better than other models, you’re doing something interesting.”

As I understand SEM, this is the tricky part. You compare your preferred SEM to a couple of other SEMs that are plausible, but you may only be saying that your preferred model is the least terrible of those that you chose to put into your paper. I may misunderstand SEMs though. (I love the Little’s Longitudinal Structural Equation Modeling, and it really gave me a favorable impression of SEMs, but…)

You’re right. The predicted model has to be a risky one. By that I mean one that is not obvious, not explainable by the received or common-sense view, and one that — all things being equal — has a pretty slim chance of materializing *unless* your theory is true. In stronger terms, the prediction should have a high conditional probability under your theory, and a very small conditional probability under rival theories. This is what you are saying, and you are spot on. A prediction that has a high conditional probability of materializing under a thousand common-sense or junk-science theories is not worth testing. A lot of SEM is just such a contest between equally weak theories.

One could set out a scale from 1 to 10 of theoretical risk.

A score of 1 is for something as uninformative as “boys are taller than girls.”

A score of 3 might be for the psych prediction that a person paid to give a speech does not believe their argument as fervently as someone who gives the speech for free.

A score of 10 is for the deflection of light by the sun, which Arthur Eddington observed in the 1919 solar eclipse, increasing scientific and public support for Einstein’s general relativity.

The social sciences typically can’t get a 10 in prediction risk, but we don’t have to settle for a 1 every time.

Well put.

We touched on this issue in our 2012 paper in JPSP on the problems with the practice of using really short measures of personality (which often produce unreliable and construct-deficient scores) as control variables.

https://www.ncbi.nlm.nih.gov/pubmed/22352328

If you want to make inferences about a treatment effect you do randomization into multiple treatment groups. Anything else will lead you into a Coleman report type of fiasco.

Hi,

I’m just beginning a SEM course as part of my social psychology PhD studies (I need to figure out if the effect of the intervention was mediated in the way the interventionist thought). Shit scared about fooling myself, along with everyone else.

Could someone please clarify, what Andrew means by “part of the solution will be to abandon some of researchers’ traditional goals”?

Also, would love to hear about resources for responsible use of SEM. I don’t trust the old-school professors any more :/

Many thanks!

Matti