## Bootstrap averaging: Examples where it works and where it doesn’t work

Aki and I write:

The very generality of the bootstrap creates both opportunity and peril, allowing researchers to solve otherwise intractable problems but also sometimes leading to an answer with an inappropriately high level of certainty.

We demonstrate with two examples from our own research: one problem where bootstrap smoothing was effective and led us to an improved method, and another case where bootstrap smoothing would not solve the underlying problem. Our point in these examples is not to disparage bootstrapping but rather to gain insight into where it will be more or less effective as a smoothing tool.

### An example where bootstrap smoothing works well

Bayesian posterior distributions are commonly summarized using Monte Carlo simulations, and inferences for scalar parameters or quantities of interest can be summarized using 50% or 95% intervals. A $1-\alpha$ interval for a continuous quantity is typically constructed either as a central probability interval (with probability $\alpha/2$ in each direction) or a highest posterior density interval (which, if the marginal distribution is unimodal, is the shortest interval containing $1-\alpha$ probability). These intervals can in turn be computed using posterior simulations, either using order statistics (for example, the lower and upper bounds of a 95% central interval can be set to the 25th and 976th order statistics from 1000 simulations) or the empirical shortest interval (for example, the shortest interval containing 950 of the 1000 posterior draws).
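The two empirical constructions can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `central_interval` and `shortest_interval` are made up for this example, and the draws are treated as a plain list of numbers.

```python
def central_interval(draws, alpha=0.05):
    """Central interval from order statistics of the simulation draws."""
    s = sorted(draws)
    n = len(s)
    k = max(1, round(n * alpha / 2))  # rank of the lower order statistic
    # k-th and (n - k + 1)-th order statistics (1-based ranks),
    # i.e. the 25th and 976th when n = 1000 and alpha = 0.05
    return s[k - 1], s[n - k]


def shortest_interval(draws, alpha=0.05):
    """Shortest interval containing round((1 - alpha) * n) of the draws."""
    s = sorted(draws)
    n = len(s)
    m = round((1 - alpha) * n)  # number of draws the interval must contain
    if m < 2 or m > n:
        raise ValueError("need enough draws to form an interval")
    # Slide a window of m consecutive sorted draws and keep the narrowest.
    i = min(range(n - m + 1), key=lambda j: s[j + m - 1] - s[j])
    return s[i], s[i + m - 1]


draws = list(range(1, 1001))          # stand-in for 1000 posterior draws
print(central_interval(draws))        # (25, 976)
```

For a skewed marginal distribution the two intervals differ, with the shortest interval shifting toward the region of higher density.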

For large models or large datasets, posterior simulation can be costly, the number of effective simulation draws can be small, and the empirical central or shortest posterior intervals can have a high Monte Carlo error, especially for wide intervals such as 95% that go into the tails and thus sparse regions of the simulations. We have had success using the bootstrap, in combination with analytical methods, to smooth the procedure and produce posterior intervals that have much lower mean squared error compared with the direct empirical approaches (Liu, Gelman, and Zheng, 2013).
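To give a feel for what smoothing by bootstrap averaging means here, the sketch below resamples the simulation draws and averages the endpoints of the empirical shortest interval across resamples. It is not the method of Liu, Gelman, and Zheng (2013), which combines the bootstrap with analytical methods; it is plain bootstrap averaging, and the function names and the defaults `n_boot=200` and `seed=0` are assumptions made for illustration.

```python
import random


def shortest_interval(draws, alpha=0.05):
    """Empirical shortest interval containing round((1 - alpha) * n) draws."""
    s = sorted(draws)
    n = len(s)
    m = round((1 - alpha) * n)
    if m < 2 or m > n:
        raise ValueError("need enough draws to form an interval")
    i = min(range(n - m + 1), key=lambda j: s[j + m - 1] - s[j])
    return s[i], s[i + m - 1]


def bootstrap_smoothed_interval(draws, alpha=0.05, n_boot=200, seed=0):
    """Average the shortest-interval endpoints over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(draws)
    lo_sum = hi_sum = 0.0
    for _ in range(n_boot):
        resample = rng.choices(draws, k=n)  # sample n draws with replacement
        lo, hi = shortest_interval(resample, alpha)
        lo_sum += lo
        hi_sum += hi
    return lo_sum / n_boot, hi_sum / n_boot
```

Whether this kind of averaging actually lowers the error in a particular problem is something to check by simulation rather than assume.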

### An example where bootstrap smoothing is unhelpful

When there is separation in logistic regression, the maximum likelihood estimate of the coefficients diverges to infinity. Gelman et al. (2008) illustrate with an example of a poll from the 1964 U.S. presidential election campaign, in which none of the black respondents in the sample supported the Republican candidate, Barry Goldwater. As a result, when presidential preference was modeled using a logistic regression including several demographic predictors, the maximum likelihood estimate for the coefficient of “black” was $-\infty$. The posterior distribution for this coefficient, assuming the usual default uniform prior density, had all its mass at $-\infty$ as well. In our paper, we recommended a posterior mode (equivalently, penalized likelihood) solution based on a weakly informative Cauchy (0, 2.5) prior distribution that pulls the coefficient toward zero. Other, similar, approaches to regularization have appeared over the years. We justified our particular solution based on an argument about the reasonableness of the prior distribution and through a cross-validation experiment. In other settings, regularized estimates have been given frequentist justifications based on coverage of posterior intervals (see, for example, the arguments given by Agresti and Coull, 1998, in support of the binomial interval based on the estimate $\hat{p}=\frac{y+2}{n+4}$).
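The binomial case shows the effect of this kind of regularization in miniature. In the sketch below, the sample size `n = 50` is a hypothetical number chosen for illustration, not a figure from the 1964 poll, and the function names are made up for this example.

```python
import math


def adjusted_proportion(y, n):
    """The estimate (y + 2) / (n + 4) discussed by Agresti and Coull."""
    return (y + 2) / (n + 4)


def logit(p):
    return math.log(p / (1 - p))


n = 50   # hypothetical number of respondents in the group
y = 0    # none of them support the candidate

p_mle = y / n                      # 0.0, so the logit is -infinity
p_adj = adjusted_proportion(y, n)  # 2 / 54, strictly between 0 and 1

print(p_mle, p_adj, logit(p_adj))  # the adjusted logit is finite
```

The raw proportion sits on the boundary, so its logit is undefined, while the adjusted estimate is pulled into the interior and gives a finite value on the logit scale.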

Bootstrap smoothing does not solve problems of separation. If zero black respondents in the sample supported Barry Goldwater, then zero black respondents in any bootstrap sample will support Goldwater as well. Indeed, bootstrapping can exacerbate separation by turning near-separation into complete separation for some samples. For example, consider a survey in which only one or two of the black respondents support the Republican candidate. The resulting logistic regression estimate will be noisy but it will be finite. But, in bootstrapping, some of the resampled data will happen to contain zero black Republicans, hence complete separation, hence infinite parameter estimates. If the bootstrapped estimates are regularized, however, there is no problem.
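The size of the near-separation effect is easy to work out. If $k$ of the $n$ respondents are black Republicans, a simple bootstrap resample of all $n$ rows misses every one of them with probability $(1 - k/n)^n$, which is about $e^{-k}$ for large $n$: roughly 37% of resamples when $k=1$ and 14% when $k=2$. The sketch below checks this by simulation; it assumes simple resampling of rows with replacement, and the function names and the defaults `n_boot=2000` and `seed=0` are illustrative.

```python
import math
import random


def prob_complete_separation(n, k):
    """Chance that a bootstrap resample of n rows contains none of k given rows."""
    return (1 - k / n) ** n


def simulate_complete_separation(n, k, n_boot=2000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        # Rows 0..k-1 stand for the k black Republican respondents.
        # The resample is completely separated if none of them is drawn.
        if all(rng.randrange(n) >= k for _ in range(n)):
            hits += 1
    return hits / n_boot


print(prob_complete_separation(1000, 0))  # 1.0: separation in the data persists
print(prob_complete_separation(1000, 1), math.exp(-1))
print(prob_complete_separation(1000, 2), math.exp(-2))
```

With $k=0$ the probability is exactly 1, which is the first point made above: separation in the original sample carries over to every bootstrap sample.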

The message from this example is that, perhaps paradoxically, bootstrap smoothing can be more effective when applied to estimates that have already been smoothed or regularized.

The full paper (actually, the above excerpt is most of it) is here. It’s a discussion of a recent paper by Brad Efron, Estimation and accuracy after model selection.

P.S. Yes, the first quoted paragraph above applies to other statistical principles, including Bayesian inference.

1. Dean Eckles says:

I’m not sure what the implications of “If the bootstrapped estimates are regularized, however, there is no problem.” are supposed to be.

The bootstrap distribution of point estimates from a penalized regression need not be a good approximation to a posterior or to the true sampling distribution of the estimator. Coefficients in the lasso (L1 penalized regression) are a standard example.

• Dean Eckles says:

Now maybe it still makes bagging work better… Having looked at the full comment, I see that is your main topic.

2. Anonymous says:

Some packages detect separation, others don’t and nonetheless produce seemingly sensible coefficients (i.e., not plus or minus infinity) despite the fact that the MLE does not exist. So you can’t trust your software to warn you about this situation. See the entertaining article by Stokes, “On the Advantage of Using Two or more Software Systems to Solve the Same Problem” in Journal of Economic and Social Measurement 2004.

3. Chris G says:

I currently have a problem where I’m applying logistic regression – with limited success. The situation is a little sketchy. There are only a few dozen samples in each class and the data quality is questionable. I’ve been wondering if there’s merit in applying the bootstrap to deal with the small sample size. It seemed like a reasonable idea but I suspect the bigger issue is that I’m in the GIGO regime – or darn close to it. In principle there are lots of potentially useful tools available to deal with the classification problem – I’m on my way to memorizing Duda, Hart, and Stork – but in practice I suspect I’m data quality limited.

4. Anonymous says:

There’s a general problem that bootstrap and cross validation are so easy to implement and use that there are many, many more people using these methods than people who understand the underlying assumptions and their failure modes.