Freddy Garcia writes:
I read your post Vine regression?, and your phrase “I love descriptive data analysis!” makes me wonder: How to do a descriptive analysis using regression models? Maybe my question could be misleading to a statistician, but I am an economics student. So we are accustomed to thinking in causal terms when we set up a regression model.
My reply: This is a funny question because I think of regression as a predictive tool that will only give causal inferences under strong assumptions. From a descriptive standpoint, regression is an estimate of the conditional distribution of the outcome, y, given the input variables, x. This can be seen most clearly, perhaps, with nonparametric methods such as Bart which operate as a black box: Give Bart the data and some assumptions, run it, and it produces a fitted model, where if you give it new values of x, it will give you probabilistic predictions of y. You can use this for individual data, or you can characterize your entire population in terms of x’s and then do Mister P. (That is, you can poststratify; here Bart is playing the role of the multilevel model.) It’s all descriptive.
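Here's a minimal sketch of that fit-then-poststratify workflow, in Python with numpy only. Everything in it is made up for illustration: the simulated data, the variable names (age, female), the population counts, and the helper function design. Also, a plain least-squares fit stands in for Bart or a multilevel model, so this gives point predictions of E(y | x) in each cell rather than full probabilistic predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample with two categorical inputs (illustrative names only).
n = 500
age = rng.integers(0, 3, size=n)      # age group: 0, 1, or 2
female = rng.integers(0, 2, size=n)   # 0 or 1
y = 1.0 + 0.5 * age - 0.3 * female + rng.normal(0, 1, size=n)

def design(age, female):
    """Design matrix: intercept, two age-group indicators, female indicator."""
    age = np.asarray(age)
    female = np.asarray(female)
    return np.column_stack([
        np.ones(len(age)),
        (age == 1).astype(float),
        (age == 2).astype(float),
        female.astype(float),
    ])

# Fit by least squares (a simple stand-in for Bart or a multilevel model).
beta, *_ = np.linalg.lstsq(design(age, female), y, rcond=None)

# Poststratification cells: every combination of the x's, with
# made-up population counts for each cell.
cells_age = np.repeat([0, 1, 2], 2)      # 0, 0, 1, 1, 2, 2
cells_female = np.tile([0, 1], 3)        # 0, 1, 0, 1, 0, 1
pop_counts = np.array([120, 130, 200, 210, 160, 180], dtype=float)

# Predicted E(y | x) in each cell, then the population-weighted average.
cell_pred = design(cells_age, cells_female) @ beta
poststrat_estimate = np.sum(pop_counts * cell_pred) / np.sum(pop_counts)

print(cell_pred.round(2))
print(round(poststrat_estimate, 2))
```

Nothing causal is claimed anywhere in this: the fitted model summarizes how y varies with x in the sample, and the weighting carries that summary over to a population with a different mix of x's.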
I train my students to summarize regression fits using descriptive terminology. So, don’t say “if you increase x_1 by 1 with all the other x’s held constant, then E(y) will change by 0.3.” Instead say, “Comparing two people who differ by 1 in x_1 and who are identical in all the other x’s, you’d predict y to differ by 0.3, on average.” It’s a mouthful but it’s good practice, cos that’s what the regression actually says. I’ve pretty much trained myself to talk that way.
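The descriptive reading can be checked directly. Here's a small sketch, again in Python with numpy, using simulated data and a hypothetical predict helper: fit a linear regression, then compare the predictions for two hypothetical people who differ by 1 in x_1 and are identical in x_2. The data-generating coefficient of 0.3 is chosen only to echo the example above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (purely illustrative); the coefficient on x1 is set to 0.3.
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.3 * x1 - 1.0 * x2 + rng.normal(size=n)

# Least-squares fit of y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x1_value, x2_value):
    """Predicted average y for a person with the given x's."""
    return beta[0] + beta[1] * x1_value + beta[2] * x2_value

# Two hypothetical people: they differ by 1 in x1 and share the same x2.
person_a = predict(1.0, 0.5)
person_b = predict(2.0, 0.5)
predicted_difference = person_b - person_a

# The difference in predictions is exactly the fitted coefficient on x1.
print(round(predicted_difference, 3), round(beta[1], 3))
```

The comparison is between two different people, not one person before and after a change in x_1, and that between-person comparison is all the fitted coefficient describes.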
Of course, as Jennifer says, there’s a reason why people use causal language to describe regressions: causal inference is typically what people want. And that’s fine, but then you have to be more careful: you gotta worry about identification, you should control for as many pre-treatment variables as possible but not any post-treatment variables, you should consider treatment interactions, etc.