What can be done to move cross-validation from a research idea to a routine step in Bayesian data analysis?

Cross-validation is a method for evaluating a model using the following steps: (1) remove part of the data, (2) fit the model to the smaller dataset excluding the removed part, (3) use the fitted model to predict the removed part, (4) summarize the prediction error by comparing to the actual left-out data. The entire procedure can then be repeated with different pieces of data left out. Various versions of cross-validation correspond to different choices of leaving out data: for example, removing just one point, or removing a randomly selected 1/10 of the data, or removing half the data.
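The four steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the discussion above: the "model" is a least-squares straight line fit with NumPy, the data are simulated, and the function name `k_fold_cv_rmse` is made up for this example.

```python
import numpy as np

def k_fold_cv_rmse(x, y, n_folds=10, seed=0):
    """Estimate out-of-sample RMSE of a straight-line fit by K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Randomly assign each observation to one of n_folds roughly equal parts.
    folds = np.array_split(rng.permutation(n), n_folds)
    squared_errors = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)           # (1) remove part of the data
        slope, intercept = np.polyfit(x[train], y[train], 1)   # (2) fit to the rest
        pred = intercept + slope * x[held_out]                 # (3) predict the removed part
        squared_errors[held_out] = (y[held_out] - pred) ** 2   # (4) record prediction error
    return np.sqrt(squared_errors.mean())

# Simulated data (an assumption for illustration): true line plus noise with sd 1.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)

rmse_10fold = k_fold_cv_rmse(x, y, n_folds=10)
rmse_loo = k_fold_cv_rmse(x, y, n_folds=len(y))   # leave-one-out is the special case K = n
```

Both estimates should land near the simulated noise level of 1; the leave-one-out version simply costs 200 fits instead of 10.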

Several conceptual and computational challenges arise when attempting to apply cross-validation for Bayesian multilevel modeling.

**Background on cross-validation**

Unlike predictive checking (which is a method to discover ways in which a particular model does not fit the data), cross-validation is used to estimate the predictive error of a model and to compare models (choosing the model with lower estimated predictive error).

**Computational challenges**

With leave-one-out cross-validation, the model must be re-fit n times. That can take a long time, since fitting a Bayesian model even once can require iterative computation!

In classical regression, there are analytic formulas for estimates and predictions with one data point removed. But for full Bayesian computation, there are no such formulas.
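For least-squares regression, one such formula is that the leave-one-out residual equals the ordinary residual divided by one minus the leverage, e_i / (1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix. The sketch below checks this numerically against brute-force refitting; the simulated data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Fit once on all the data.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'.
leverage = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
loo_residuals_shortcut = residuals / (1.0 - leverage)

# Brute force for comparison: refit n times, each time dropping one point.
loo_residuals_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_residuals_refit[i] = y[i] - X[i] @ beta_i

# The two sets of residuals agree up to floating-point error.
max_gap = np.max(np.abs(loo_residuals_shortcut - loo_residuals_refit))
```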

Importance sampling has sometimes been suggested as a solution: if the posterior distribution is p(theta|y), and we remove data point y_i, then the leave-one-out posterior distribution is p(theta|y_{-i}), which is proportional to p(theta|y)/p(y_i|theta). One could then just use draws of theta from the posterior distribution and weight by 1/p(y_i|theta). However, this isn’t a great practical solution since the weights, 1/p(y_i|theta), are unbounded, so the importance-weighted estimate can be unstable.
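To make the weighting concrete, here is a minimal sketch for a deliberately easy case. The toy model (y_j ~ N(theta, 1) with a flat prior, so that the posterior and the exact leave-one-out predictive density are both available in closed form) and all function names are assumptions for illustration. In this model the weights happen to behave well; the effective sample size is included as a rough diagnostic because it shrinks for points the model finds surprising.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_draws = 30, 4000
y = rng.normal(0.0, 1.0, size=n)

# Assumed toy model: y_j ~ N(theta, 1) with a flat prior, so theta | y ~ N(mean(y), 1/n).
theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=n_draws)

def log_lik(y_i, theta):
    """log p(y_i | theta) for the unit-variance normal model."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y_i - theta) ** 2

def is_loo(y_i, theta):
    """Importance-sampling estimate of log p(y_i | y_{-i}), plus effective sample size."""
    log_w = -log_lik(y_i, theta)       # log of the raw weights 1 / p(y_i | theta)
    log_w -= log_w.max()               # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    # Self-normalised estimate of E[p(y_i | theta)] under the leave-one-out posterior.
    log_pred = np.log(np.sum(w * np.exp(log_lik(y_i, theta))))
    ess = 1.0 / np.sum(w ** 2)         # a small value flags unstable weights
    return log_pred, ess

def exact_loo(i):
    """Closed-form log p(y_i | y_{-i}) for this toy model, used only as a check."""
    rest = np.delete(y, i)
    var = 1.0 + 1.0 / len(rest)
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - rest.mean()) ** 2 / var

results = [is_loo(y[i], theta) for i in range(n)]
max_abs_error = max(abs(lp - exact_loo(i)) for i, (lp, _) in enumerate(results))
min_ess = min(ess for _, ess in results)
```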

I suspect a better approach would be to use importance resampling (that is, sampling without replacement from the posterior draws of theta using 1/p(y_i|theta) as sampling weights) to get a few draws from an approximate leave-one-out posterior distribution, and then use a few steps of Metropolis updating to get closer.
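A rough sketch of that two-stage idea follows, again under an assumed toy model (y_j ~ N(theta, 1), flat prior) chosen so the answer can be checked against the exact leave-one-out posterior. The number of resampled draws, the number of Metropolis steps, and the proposal scale are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_draws, n_keep, n_steps = 30, 4000, 200, 5
y = rng.normal(0.0, 1.0, size=n)
i = int(np.argmax(np.abs(y - y.mean())))      # hold out the most extreme point

# Assumed toy model: y_j ~ N(theta, 1), flat prior, so theta | y ~ N(mean(y), 1/n).
theta_full = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=n_draws)

def log_post_loo(theta):
    """Unnormalised log p(theta | y_{-i}) under the toy model; theta is a 1-D array."""
    rest = np.delete(y, i)
    return -0.5 * np.sum((rest[:, None] - theta[None, :]) ** 2, axis=0)

# Step 1: resample without replacement, with probabilities proportional to 1 / p(y_i | theta).
log_w = 0.5 * (y[i] - theta_full) ** 2
w = np.exp(log_w - log_w.max())
theta_loo = rng.choice(theta_full, size=n_keep, replace=False, p=w / w.sum())

# Step 2: a few random-walk Metropolis steps, each draw moving toward p(theta | y_{-i}).
step_sd = 2.4 / np.sqrt(n - 1)   # about 2.4 times the leave-one-out posterior sd
for _ in range(n_steps):
    proposal = theta_loo + rng.normal(0.0, step_sd, size=n_keep)
    log_ratio = log_post_loo(proposal) - log_post_loo(theta_loo)
    accept = np.log(rng.uniform(size=n_keep)) < log_ratio
    theta_loo = np.where(accept, proposal, theta_loo)

exact_mean = np.delete(y, i).mean()   # exact leave-one-out posterior mean, for comparison
```

In a real multilevel model the Metropolis step would be replaced by whatever updating scheme the original fit used; the point of the sketch is only that the resampled draws give the short chains a good starting point.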

For particular models (such as hierarchical linear and generalized linear models) it would also seem reasonable to try various approximations, for example estimating predictive errors conditional on the posterior distribution of the hyperparameters. If we avoid re-estimating hyperparameters, the computation becomes much quicker (basically, it’s classical regression), and this should presumably be reasonable when the number of groups is high (another example of the blessing of dimensionality!).

**Leaving out larger chunks; fewer replications**

The computational cost of performing each cross-validation suggests that it might be better to do fewer. For example, instead of leaving out one data point and repeating n times, we could leave out 1/10 of the data and repeat 10 times.

**Multilevel models: cross-validating clusters**

When data have a multilevel (hierarchical) structure, it would make sense to cross-validate by leaving out data individually or in clusters, for example, leaving out a student within a school or leaving out an entire school. The two cross-validations test different things. Thus, there would be a cross-validation at each level of the model (just as there is an R-squared at each level).
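The difference between the two schemes is easiest to see in how the held-out indices are built. The helper names and the tiny students-in-schools example below are hypothetical.

```python
import numpy as np

def leave_one_out_splits(n):
    """Yield (train, test) index arrays, holding out one observation at a time."""
    idx = np.arange(n)
    for i in idx:
        yield idx[idx != i], idx[idx == i]

def leave_cluster_out_splits(cluster_ids):
    """Yield (train, test) index arrays, holding out one whole cluster at a time."""
    cluster_ids = np.asarray(cluster_ids)
    idx = np.arange(len(cluster_ids))
    for c in np.unique(cluster_ids):
        held_out = cluster_ids == c
        yield idx[~held_out], idx[held_out]

# Hypothetical example: 7 students in 3 schools.
school = np.array(["A", "A", "A", "B", "B", "C", "C"])
student_splits = list(leave_one_out_splits(len(school)))   # 7 splits, one per student
school_splits = list(leave_cluster_out_splits(school))     # 3 splits, one per school
```

Leaving out a student tests prediction for a new student in a school the model has already seen; leaving out a school tests prediction for a school with no data at all, which leans on the group-level part of the model.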

**Comparing models in the presence of lots of noise, as in binary-data regression**

A final difficulty of cross-validation is that, in models where the data-level variation is high, most of the predictive error will be due to this data-level variation, and so vastly different models can actually have similar levels of cross-validation error.

Shouhao and I have noticed this problem in a logistic regression of vote preferences on demographic and geographic predictors. Given the information we have, most voters are predicted to have a probability between .3 and .7 of supporting either party. The predictive root mean squared error is necessarily then close to .5, no matter what we do with the model. However, when evaluating errors at the group level (leaving out data from an entire state), the cross-validation appears to be more informative.
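A little arithmetic shows why the individual-level error barely moves. If an outcome is Bernoulli with true probability p and the prediction is q, the expected squared error is p(1-q)^2 + (1-p)q^2. The sketch below evaluates this for an assumed even spread of true probabilities between .3 and .7, comparing a model that predicts every probability exactly with one that predicts .5 for everyone; the function name and the grid are illustrative choices.

```python
import numpy as np

def expected_rmse(p_true, p_pred):
    """Expected RMSE when outcomes are Bernoulli(p_true) and the prediction is p_pred."""
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    mse = p_true * (1 - p_pred) ** 2 + (1 - p_true) * p_pred ** 2
    return np.sqrt(mse.mean())

p_true = np.linspace(0.3, 0.7, 41)     # assumed spread of true support probabilities

rmse_perfect = expected_rmse(p_true, p_true)              # knows every probability exactly
rmse_constant = expected_rmse(p_true, np.full(41, 0.5))   # ignores all predictors
```

Under these assumptions the constant predictor has RMSE exactly .5 and the perfectly calibrated one has RMSE of about .486, so even the best possible model improves the individual-level error by only a few percent.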

**Summary**

Cross-validation is an important technique that should be standard, but there is no standard way of applying it in a Bayesian context. A good summary of some of the difficulties is in the paper, “Bayesian model assessment and comparison using cross-validation predictive densities,” by Aki Vehtari and Jouko Lampinen, Neural Computation 14 (10), 2439-2468. Yet another idea is DIC, which is a mixed analytical/computational approximation to an estimated predictive error.

I don’t really know what’s the best next step toward routinizing Bayesian cross-validation.

Aleks commented:

I have been using cross-validation methods for a very long time, faced problems, and begun looking at Bayesian methods!

There are a few nuisance parameters inherent to cross-validation. For example, the number of folds is K. One then assigns the instances into K subsets. The result is a utility for each individual data instance {u_1,u_2,…,u_N}. What is the utility of a particular model? It is P(u_i | i, K, S) where i is the instance, and S is the assignment. So you have two nuisance parameters, S and the number of folds, K. A Bayesian would want to integrate them out, so you have to assume a prior over K and S, and sample over this.

Leave-one-out is a nice way of integrating S out. Unfortunately, the prior over K is P(K=n)=1: not everyone would agree with this. In my experience, leave-one-out often leads to overfitting.

As for multilevel models, there is some theory on how to sample in bootstrap literature. We recently had a discussion on this topic on Usenet.

Aleks

Erwann commented:

FYI I found Ch7 of "Elements of statistical learning" by Hastie et al. quite useful on this topic. Further to my comments at the seminar: the cv-k error is unbiased for an (n-k)-datapoint statistic (but is overestimated for an n-datapoint statistic). Bias arises when cv is combined with the bootstrap but there are associated benefits (Efron, cv vs bootstrap, 1995).

Other thought (maybe it's been already suggested during the seminar?):

Would it make sense to build cv-k into the mcmc, that is, by alternating over the k subsets of data every p cycles of the chain? It's not obvious that the chain would converge, but if it did it would make the inference more robust without extra computational cost.

Please send some links to reading material on cross-validation in general, and on how it helps in clustering, starting from the basics.