Fabio Rojas writes:
In much of the social sciences outside economics, it’s very common for people to take a regression course or two in graduate school and then stop their statistical education. This creates a situation where you have a large pool of people who have some knowledge, but not a lot of knowledge. As a result, you have a pretty big gap between people like yourself, who are heavily invested in the cutting edge of applied statistics, and other folks.
So here is the question: What are the major lessons about good statistical practice that “rank and file” social scientists should know? Sure, most people can recite “Correlation is not causation” or “statistical significance is not substantive significance.” But what are the other big lessons?
This question comes from my own experience. I have a math degree and took regression analysis in graduate school, but I definitely do not have the level of knowledge of a statistician. I also do mixed-method research, and field work is very time-intensive. I often feel that I face a tough choice – I can delve into more advanced statistics, but that often requires a huge investment on my part. Is there a middle ground between the naive user of regression analysis and what you do?
My reply: You can take a look at my book with Jennifer Hill. Chapters 3-5 hit the basics, then you can jump to chapters 9-10 for causal inference.
More specifically, here are some tips:
– The difference between “significant” and “non-significant” is not itself statistically significant.
– Don’t just analyze your variables straight out of the box. You can break continuous variables into categories (for example, instead of age and age-squared, you can use indicators for 19-29, 30-44, 45-64, 65+), and, from the other direction, you can average several related variables to create a combined score.
– You can typically treat a discrete outcome (for example, responses on a 1-5 scale) as numeric. Don’t worry about ordered logit/probit/etc.; just run your regression already.
– Take the two most important input variables in your regression and throw in their interaction.
– The key assumptions of a regression model are validity and additivity. Except when you’re focused on predictions, don’t spend one minute worrying about distributional issues such as normality or equal variance of the errors.
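To make the first tip concrete: suppose one study reports an estimate of 0.25 with standard error 0.10 (“significant”) and another reports 0.10 with standard error 0.10 (“not significant”). The difference, 0.15, has standard error sqrt(0.10² + 0.10²) ≈ 0.14, so the difference between the two estimates is nowhere near significant. A minimal sketch (all numbers are made up for illustration):

```python
import math

# Two hypothetical, independent estimates (made-up numbers):
est1, se1 = 0.25, 0.10   # z = 2.5, "statistically significant"
est2, se2 = 0.10, 0.10   # z = 1.0, "not significant"

# Standard error of the difference between two independent estimates:
se_diff = math.sqrt(se1**2 + se2**2)   # ≈ 0.141
z_diff = (est1 - est2) / se_diff       # ≈ 1.06

# One estimate clears the conventional 1.96 threshold and the other
# does not, yet the difference between them is far from significant.
print(round(z_diff, 2))  # → 1.06
```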
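The recoding, combined-score, numeric-outcome, and interaction tips can be sketched together on simulated data. Everything below – the variable names, the coefficients, the fake 1-5 survey outcome – is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Fake data: age in years, a binary "treatment" input, and a latent score
# that we round into a discrete 1-5 survey response.
age = rng.integers(19, 80, size=n)
treat = rng.integers(0, 2, size=n)
latent = 1.5 + 0.03 * age + 0.4 * treat + rng.normal(0, 0.5, n)
y = np.clip(np.round(latent), 1, 5)     # discrete 1-5 outcome, treated as numeric

# Tip: indicators for age groups (19-29, 30-44, 45-64, 65+)
# instead of entering age and age-squared.
cuts = np.array([29, 44, 64])
group = np.searchsorted(cuts, age)      # 0: 19-29, 1: 30-44, 2: 45-64, 3: 65+
age_ind = np.eye(4)[group][:, 1:]       # indicator columns, baseline (19-29) dropped

# Tip: average several related items into a combined score
# (four fake 1-5 survey items here).
items = rng.integers(1, 6, size=(n, 4))
score = items.mean(axis=1)

# Tip: just run the linear regression, and include the interaction
# of the two most important inputs (here: age group x treatment).
X = np.column_stack([np.ones(n), age_ind, treat, age_ind * treat[:, None]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The point is not this particular specification but the habit: recode inputs so the model is interpretable, don’t agonize over the discreteness of the outcome, and check whether your most important effects vary with each other.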
Possibly the readers of this blog could offer some suggested tips of their own?