Andy Cooper writes:
A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption).
I thought we had long since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.). Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion.
My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model. In decreasing order of importance, these assumptions are:
1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .
2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .
3. Independence of errors. . . .
4. Equal variance of errors. . . .
5. Normality of errors. . . .
Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .
Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.
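To make that last point concrete, here is a minimal simulation sketch, not anything from the book or the linked article. Everything in it is an assumption chosen for illustration: the sample size, the true coefficients, the exponential predictor, and the centered exponential errors. It fits least squares with a skewed predictor and skewed errors, then checks a normal-theory 95% interval for individual new data points (ignoring uncertainty in the coefficients, which is small at this sample size).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                  # illustrative sample size
true_intercept, true_slope = 1.0, 2.0     # illustrative "true" coefficients


def simulate(size):
    """Draw a skewed predictor and skewed, mean-zero errors (both assumptions)."""
    x = rng.exponential(scale=1.0, size=size)             # far from normal
    errors = rng.exponential(scale=1.0, size=size) - 1.0  # mean 0, sd 1, skewed
    y = true_intercept + true_slope * x + errors
    return x, y


# Fit ordinary least squares on one simulated data set.
x, y = simulate(n)
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept_hat, slope_hat = coef
residuals = y - X @ coef
sigma_hat = np.sqrt(residuals @ residuals / (n - 2))  # residual sd

# Check individual-point predictions on fresh data from the same process.
x_new, y_new = simulate(n)
fitted_new = intercept_hat + slope_hat * x_new
lower = fitted_new - 1.96 * sigma_hat  # normal-theory 95% interval
upper = fitted_new + 1.96 * sigma_hat
miss_below = float(np.mean(y_new < lower))
miss_above = float(np.mean(y_new > upper))

print(f"slope estimate: {slope_hat:.3f} (true value {true_slope})")
print(f"share of new points below the interval: {miss_below:.3f}")
print(f"share of new points above the interval: {miss_above:.3f}")
```

Under these assumptions the slope estimate should land close to the true value even though neither the predictor nor the errors are normal, while the interval's misses should fall almost entirely on the high side rather than splitting evenly between the two tails. That asymmetry is the sense in which normality of errors starts to matter once you care about individual data points rather than coefficients.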