Jeff Gill writes:
For some reason the misinterpretations about interactions in regression models just won’t go away. I teach the point that mathematically and statistically one doesn’t have to include the main effects along with the multiplicative component, but if you leave them out it should be because you have a strong theory supporting this decision (e.g., GDP = Price * Quantity, in rough terms). Yet I got this email from a grad student yesterday:
As I was reading the book, “Introduction to Statistical Learning,” I came across the following passage. This book is used in some of our machine learning courses, so perhaps this is where the idea of leaving the main effects in the model originates. Maybe you can send these academics a heartfelt note of disagreement.
“The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between X1 and X2 seems important, then we should include both X1 and X2 in the model even if their coefficient estimates have large p-values. The rationale for this principle is that if X1 × X2 is related to the response, then whether or not the coefficients of X1 or X2 are exactly zero is of little interest. Also X1 × X2 is typically correlated with X1 and X2, and so leaving them out tends to alter the meaning of the interaction.”
(James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An Introduction to Statistical Learning: with Applications in R. New York: Springer.)
There are a couple of things wrong with the quoted passage, which I’ll get to below. Meanwhile, the topic clearly won’t die: the most cited article in the history of the journal Political Analysis was about interpreting interactions in regression models, and there are many other articles on the topic across various disciplines. I still routinely hear the “rule of thumb” in the quote above.
Suppose you start with the model with all the main effects and interactions, and then you consider the model including the interactions but excluding one or more main effects. You can think of this smaller model in two ways:
1. You could consider it as the full model with certain coefficients set to zero, which in a Bayesian sense could be considered as very strong priors on these main effects, or in a frequentist sense could be considered as a way to lower variance and get more stable inferences by not trying to estimate certain parameters.
2. You could consider it as a different model of the world. This relates to Jeff’s reference to having a strong theory. A familiar example is a model of the form y = a + b*t + error, with a randomly assigned treatment z that occurs right after time 0. A natural model is then y = a + b*t + c*z*t + error. You’d not want to fit the model y = a + b*t + c*z + d*z*t + error (except maybe as some sort of diagnostic test), because, by design, the treatment cannot affect y at time 0.
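To make this concrete, here is a minimal simulation sketch in Python, using plain numpy least squares; the variable names and the true coefficient values are invented for illustration, not taken from any real study:

```python
# Simulate the treatment-at-time-0 example: y depends on time t and on a
# randomized treatment z, but z acts only through its interaction with t,
# since the treatment occurs right after time 0.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.uniform(0, 10, n)        # time since time 0
z = rng.integers(0, 2, n)        # randomly assigned treatment indicator
a, b, c = 1.0, 0.5, 0.3          # invented true values
y = a + b * t + c * z * t + rng.normal(0, 1, n)

# Theory-motivated model: y = a + b*t + c*z*t, with no main effect of z.
X_small = np.column_stack([np.ones(n), t, z * t])
coef_small = np.linalg.lstsq(X_small, y, rcond=None)[0]

# Full model with the z main effect included, as a diagnostic.
X_full = np.column_stack([np.ones(n), t, z, z * t])
coef_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

print(coef_small)    # roughly [1.0, 0.5, 0.3]
print(coef_full[2])  # estimated main effect of z: near zero, by design
```

The point of the diagnostic fit is that the estimated main effect of z should come out near zero; if it doesn’t, that casts doubt on the theory that the treatment can only act after time 0.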
I have three problems with the above-quoted passage. The first is the “even if the p-values” bit. There’s no good reason, theoretically or practically, that p-values should determine what is in your model. So it seems weird to refer to them in this context. My second problem is where they say, “whether or not the coefficients of X1 or X2 are exactly zero is of little interest.” In all my decades of experience, whether or not certain coefficients are exactly zero is never of interest! I think the problem here is that they’re trying to turn an estimation problem (fitting a model with interactions) into a hypothesis testing problem, and I think this happened because they’re working within an old-fashioned-but-still-dominant framework in theoretical statistics in which null hypothesis significance testing is fundamental. Finally, calling it a “hierarchical principle” seems to be going too far. “Hierarchical heuristic,” perhaps?
That all said, usually I agree with the advice that, if you include an interaction in your model, you should include the corresponding main effects too. Hmmm . . . let’s see what we say in Regression and Other Stories . . . section 10.3 is called Interactions, and here’s what we’ve got . . .
We introduce the concept of interactions in the context of a linear model with a continuous predictor and a subgroup indicator:
Figure 10.3 suggests that the slopes differ substantially. A remedy for this is to include an interaction . . . that is, a new predictor defined as the product of these two variables. . . . Care must be taken in interpreting the coefficients in this model. We derive meaning from the fitted model by examining average or predicted test scores within and across specific subgroups. Some coefficients are interpretable only for certain subgroups. . . .
An equivalent way to understand the model is to look at the separate regression lines for [the two subgroups] . . .
Interactions can be important, and the first place we typically look for them is with predictors that have large coefficients when not interacted. For a familiar example, smoking is strongly associated with cancer. In epidemiological studies of other carcinogens, it is crucial to adjust for smoking both as an uninteracted predictor and as an interaction, because the strength of association between other risk factors and cancer can depend on whether the individual is a smoker. . . . Including interactions is a way to allow a model to be fit differently to different subsets of data. . . . Models with interactions can often be more easily interpreted if we preprocess the data by centering each input variable about its mean or some other convenient reference point.
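That “equivalent way to understand the model” can be checked directly. Here is a sketch in Python with numpy; the data and coefficient values are simulated for illustration, not the book’s actual example:

```python
# Interaction between a continuous predictor x and a subgroup indicator g:
# the pooled model y ~ 1 + x + g + g*x is equivalent to fitting a separate
# regression line within each subgroup.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)        # continuous predictor
g = rng.integers(0, 2, n)      # subgroup indicator
y = 2.0 + 1.0 * x + 0.8 * g + 1.5 * g * x + rng.normal(0, 1, n)

# Pooled model with both main effects and the interaction.
X = np.column_stack([np.ones(n), x, g, g * x])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# Separate least-squares lines for each subgroup.
def fit_line(xx, yy):
    A = np.column_stack([np.ones(len(xx)), xx])
    return np.linalg.lstsq(A, yy, rcond=None)[0]

a0, s0 = fit_line(x[g == 0], y[g == 0])
a1, s1 = fit_line(x[g == 1], y[g == 1])

# Group 0 line: intercept b0, slope b1.
# Group 1 line: intercept b0 + b2, slope b1 + b3.
print(np.allclose([a0, s0, a1, s1], [b0, b1, b0 + b2, b1 + b3]))  # → True
```

This is also a convenient place to see the centering advice in action: replacing x by x - x.mean() before forming the product leaves the fitted values unchanged, but the main effect of g then compares the groups at the average value of x rather than at x = 0.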
We never actually get around to giving the advice that, if you include the interaction, you should usually include the main effects too, unless you have a good theoretical reason not to. I guess we don’t say that because we present interactions as flowing from the main effects, so it’s kind of implied that the main effects are already there. And we don’t have much in Regression and Other Stories about theoretically motivated models. I guess that’s a weakness of our book!