Bill Harris writes:

On pp. 250-251 of BDA second edition, you write about multiple comparisons, and you write about stepwise regression on p. 405. How would you look at stepwise regression analyses in light of the multiple comparisons problem? Is there an issue?

My reply:

In this case I think the right approach is to keep all the coefs but partially pool them toward 0 (after suitable transformation). But then the challenge is coming up with a general way to construct good prior distributions. I’m still thinking about that one! Yet another approach is to put something together purely nonparametrically, as with BART (Bayesian additive regression trees).
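To make “partially pool them toward 0” concrete: with a common normal(0, tau) prior on the coefficients, the posterior mode is just the ridge estimate with penalty sigma^2/tau^2. Here is a minimal numpy sketch (simulated data, known sigma and tau assumed for simplicity) showing how the pooled estimates are pulled toward zero relative to least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 10
X = rng.normal(size=(n, p))
beta_true = np.r_[2.0, -1.5, np.zeros(p - 2)]  # only 2 covariates matter
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# No pooling: ordinary least squares
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Partial pooling toward 0: posterior mode under beta_j ~ Normal(0, tau^2),
# which is ridge regression with lambda = sigma^2 / tau^2
sigma, tau = 2.0, 1.0
lam = (sigma / tau) ** 2
beta_pooled = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The pooled vector is strictly shorter: every coefficient is kept,
# but all are shrunk toward zero rather than being selected in or out
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_pooled))
```

The open question in the post is how to choose tau (and the transformation) in general; here it is fixed by hand.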

I wasn’t aware that anyone still used stepwise regression. Are there problems for which it isn’t strictly dominated by regularized regression?

In my experience, datasets with (1) many correlated covariates, (2) a small number of datapoints, and (3) substantial noise do much better with methods such as forward stagewise and least angle regression, which retain only a small number of covariates in the model. L2-regularized regression and Bayesian methods perform progressively worse as the number of correlated covariates increases (unless you are able to build a multilevel hierarchical model, which isn’t always possible).
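As an illustration of the contrast being drawn, here is a hedged sketch with scikit-learn on simulated data matching the three conditions above (many correlated covariates, few datapoints, noise). `LassoLarsCV` follows the least-angle-regression path and ends up with a sparse model; `RidgeCV` (L2) keeps every covariate:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV, RidgeCV

rng = np.random.default_rng(1)
n, p = 30, 50  # few datapoints, many covariates
# Correlated covariates: a shared latent factor plus independent noise
z = rng.normal(size=(n, 1))
X = 0.8 * z + 0.6 * rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=1.5, size=n)  # only one covariate matters

lasso = LassoLarsCV(cv=5).fit(X, y)  # least-angle-regression path + CV
ridge = RidgeCV().fit(X, y)          # L2 penalty, amount chosen by CV

# The LARS/lasso fit retains a small subset; ridge retains all p covariates
print(np.count_nonzero(lasso.coef_), np.count_nonzero(ridge.coef_))
```

Whether the sparse fit also *predicts* better depends on the data-generating process; the sketch only shows the qualitative difference in what each method retains.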

When isn’t it possible to construct a hierarchical model? The only situation I can think of is if the number of covariates is prohibitively huge (e.g. genomics-sized data).

I find Bayesian models are quite helpful for dealing with correlated covariates – the uncertainty in the estimates just propagates down to whatever marginalization of the posterior you’re trying to estimate to compare your populations/treatment groups. If the correlated covariates are associated with the outcome, the negative correlation between the coefficient estimates is modeled in the posterior distribution. The fact that they’re not individually identifiable doesn’t impede your ability to include them in the model and estimate marginalized posterior contrasts on the population comparison of interest.
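The point about correlated covariates can be shown directly with the conjugate normal linear model (known sigma, independent normal priors assumed, all numbers illustrative). Two nearly collinear covariates produce strongly negatively correlated posterior coefficients, yet their sum, which is what a prediction or population contrast typically depends on, is estimated precisely:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Conjugate posterior under beta ~ Normal(0, tau^2 I), known sigma
sigma, tau = 1.0, 10.0
V = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)  # posterior covariance
m = V @ (X.T @ y) / sigma**2                                # posterior mean

# Individual coefficients: posterior correlation near -1 (not separately identified)
rho = V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
# Their sum: small posterior sd, i.e. the combined effect is well identified
sd_sum = np.sqrt(V[0, 0] + V[1, 1] + 2 * V[0, 1])
print(rho, sd_sum, np.sqrt(V[0, 0]))
```

The negative off-diagonal term in `V` is exactly the “modeled in the posterior” behavior described above: uncertainty about how to split the effect between x1 and x2 cancels out of the marginal quantity of interest.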

Likewise, Bayesian models are beneficial for small, noisy datasets – shrinkage is probably the best (only?) way to deal with both of these issues.

I, too, find it a bit weird that anyone uses stepwise regression.

People are often worried (sometimes needlessly, sometimes justifiably) about over-fitting; sometimes there’s good reason for people to want the best parsimonious model (for instance, when it is costly to collect data for more explanatory variables); and sometimes people just want fewer explanatory variables because it’s easier to think about relatively simple models. I think those are the main reasons people look to things like stepwise regression or “lasso” methods.

I think that too often people treat these issues as purely statistical ones — by which I mean, they think of the data as just a bunch of numbers — and neglect additional information. Often you know from physical principles, or medical understanding, or in other ways, what parameters ought to be more important than others. It bugs me when someone throws 20 explanatory variables into an analysis and picks out two that have “statistically significant” effects. Or rather, it bugs me if that is the _only_ thing they do.