I think cross-validation is a good way to estimate a model’s forecasting error, but I don’t think it’s always such a great tool for comparing models. Sure, if the differences are dramatic, fine. But you can easily have a few candidate models where one model makes a lot more sense than the others (even in a purely predictive sense; I’m not talking about causality here). The difference between the models doesn’t show up in an xval measure of total error but in the patterns of the predictions.

For a simple example, imagine using a linear model with positive slope to model a function that is constrained to be increasing. If the constraint isn’t in the model, the predicted/imputed series will sometimes be nonmonotonic. The effect on prediction error can be so tiny as to be undetectable (including the constraint might even increase average prediction error); nonetheless, the predictions will be clearly nonsensical.
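A toy numerical version of this (invented numbers, plain least-squares arithmetic):

```python
# A strictly increasing series with a gap at t = 3, imputed by an OLS line.
# The fitted slope is positive, yet the filled-in series is nonmonotonic.
t_obs = [0, 1, 2, 4, 5]
y_obs = [0.0, 0.2, 1.8, 2.0, 2.2]   # strictly increasing observations

n = len(t_obs)
t_bar = sum(t_obs) / n
y_bar = sum(y_obs) / n
slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(t_obs, y_obs)) \
        / sum((t - t_bar) ** 2 for t in t_obs)
intercept = y_bar - slope * t_bar

y_imputed = intercept + slope * 3   # fill the gap at t = 3

print(slope > 0)              # True: the fitted line is increasing
print(y_imputed < y_obs[2])   # True: the imputed point falls below y at t = 2
```

A monotone fit would never produce this dip at t = 3, even if its average prediction error were no better than the line’s.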

That’s an extreme example but I think the general point holds. Think of xval as a way of estimating predictive accuracy. If you want to compare models, I think you’ll want to look more carefully at the output.

The above was my response to a question that Jay Ulfelder sent off to a few quantitative political scientists:

I’m at a methodological fork in the road in a project I’m doing for the U.S. Holocaust Memorial Museum, and I’m hoping that some of you might have some words of wisdom to help me figure out how to proceed. I know you’re all very busy, but the stakes in this project are pretty high, so if there’s any way you can take a few minutes to reply, I’d really appreciate it.

As noted in the blog post I linked to at the start of this email, the aim of the project is to develop a statistical model (or models) that could be used each year to assess the risk of an onset of mass killing in countries worldwide. Based on some preliminary analysis, I’ve decided that logistic regression will work fine, so I’ve got a modeling approach in hand. My goal now is to estimate and compare the forecasting power of some simple models, individually and as ensembles.

Where I’m stuck now is choosing between assessing and comparing forecasting power across model specifications via cross-validation and dealing with non-trivial missing-data problems via multiple imputation. I’ve written a script that executes 10-fold CV as a way to compare the models’ forecasting power, but that script starts with a single version of the data set. I don’t see how I can simultaneously handle the missing-data problem other than by rerunning that CV script N times, where N is the number of imputations. That would be unwieldy but doable, but it also seems like the resulting averages of averages of estimates are going to be so noisy that comparisons across models and ensembles will only be informative in the case of extreme differences, which I don’t expect to see at this point. These are very rare events, with only a few score onsets in each of the training sets that K-fold CV produces and about a dozen onsets in each of the test sets.
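The “rerun the CV script once per imputation” plan is just a nested loop. A minimal sketch in plain Python, with a base-rate predictor and Brier score standing in for the logistic regressions (the helper names and toy data are made up):

```python
import random

def kfold_cv(rows, k, fit, score, seed=0):
    """Average held-out score of a model over k folds (rows: list of (x, y))."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        scores.append(sum(score(model, r) for r in test) / len(test))
    return sum(scores) / k

def cv_over_imputations(imputed_datasets, k, fit, score):
    """Run the whole k-fold CV once per imputed data set, then average."""
    per_imp = [kfold_cv(d, k, fit, score) for d in imputed_datasets]
    return sum(per_imp) / len(per_imp)

# Toy stand-ins: a base-rate "model" and Brier score; real use would plug
# in logistic regression fitting and prediction here.
fit = lambda rows: sum(y for _, y in rows) / len(rows)
score = lambda p, row: (p - row[1]) ** 2

rng = random.Random(1)
# Five fake "imputed" copies of a rare-event data set (about 10% positives).
imputations = [[(i, rng.random() < 0.1) for i in range(200)] for _ in range(5)]
print(round(cv_over_imputations(imputations, 10, fit, score), 3))
```

The noise Jay worries about shows up exactly here: the final number is an average over N imputations of averages over k small test folds, each containing only a handful of onsets.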

Given the inherent noisiness of these data, I’m inclined to forgo the CV at this point and focus instead on the missing-data problem. Based on that preliminary research, I’m expecting that the models I specify will squeeze most of the forecasting juice out of the (quite limited) data we’ve got, and that averaging forecasts across a few different models will mitigate the “all eggs in one basket” problem.

And Gary King offered this useful practical suggestion:

Since you’re only doing forecasting and not estimating causal effects, you could perhaps treat the ‘missingness’ as one type of observation. Then just put it all in and see how, by any means, you can improve the forecasts. It could be that missingness in some of the variables predicts mass killings. I bet it does; e.g., this paper shows that when military conflict starts, the first casualty is often the collection of vital registration records.
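One simple way to “put it all in” is to give the model an explicit missing-data indicator alongside each filled-in feature. A sketch, with `None` marking a missing value (the helper name is made up):

```python
def add_missingness_indicators(rows, fill=0.0):
    """Expand each feature x into (x_filled, was_missing) so the model can
    learn from the missingness pattern itself."""
    out = []
    for row in rows:
        new = []
        for x in row:
            new.extend([fill if x is None else x, 1.0 if x is None else 0.0])
        out.append(new)
    return out

rows = [[1.2, None], [None, 3.4]]
print(add_missingness_indicators(rows))
# [[1.2, 0.0, 0.0, 1.0], [0.0, 1.0, 3.4, 0.0]]
```

If missingness really does predict onsets, the indicator columns pick that signal up directly instead of averaging it away across imputations.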

Let’s say you need to select point estimates for a regression model to apply predictively in the real world. For concreteness, suppose there are lots of features and full Bayes is too expensive computationally.

How would you recommend selecting the model?

The usual advice in the machine learning literature is to use cross-validation.

I’m not sure how to correct for two problems with cross-validation. First, because you choose the parameters that optimize the cross-validation performance, often on a fixed division of the data into folds, that performance is biased to the high side as a predictor of real-world performance. Second, you get a small negative bias because the training and test sets are drawn without overlap or replacement, so each held-out case is predicted without the most informative data you could have used for it. I never see anyone try to account for either of these effects in the machine learning literature (but then I may not be looking in the right places).

Another bias is that often the data put together for machine learning evaluations is much cleaner than data in the wild.

Of course, in the real world, we get real feedback from predictions on genuinely held-out data.

If full Bayes is too expensive computationally, type II MAP (optimizing the hyperparameters rather than integrating over them) is often better than cross-validation and usually about as expensive computationally.

1) Two-level/double cross-validation (Stone, M., JRSS B 36:111-147, 1974, and many others later). This helps to get an unbiased performance estimate, but does not prevent overfitting. With two-level cross-validation and selection heuristics it’s also possible to avoid some of the overfitting, but then you need three-level CV to get an unbiased performance estimate.

2) The small negative bias can be corrected (if you’re not considering model misspecification and outliers, in which case you actually want to leave out the most informative data for each case); see, e.g., Burman, P., Biometrika 76:503-514, 1989, and Fushiki, T., Statistics and Computing 21:137-146, 2011.
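The two-level idea in point 1 can be sketched concretely: an inner CV picks a tuning parameter using only the training folds, and an outer CV scores the whole fit-and-select procedure. A toy version with a one-parameter shrunken-mean “model” (not Stone’s setup, just the bookkeeping):

```python
import random

def make_folds(rows, k, seed=0):
    """Shuffle and split into k roughly equal folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    return [rows[i::k] for i in range(k)]

def mse(pred, test):
    return sum((pred - y) ** 2 for y in test) / len(test)

def inner_select(train, lambdas, k=5):
    """Pick the shrinkage level by CV *within* the training data only."""
    def cv_err(lam):
        fs = make_folds(train, k, seed=1)
        errs = []
        for i in range(k):
            tr = [y for j, f in enumerate(fs) if j != i for y in f]
            errs.append(mse(lam * (sum(tr) / len(tr)), fs[i]))
        return sum(errs) / k
    return min(lambdas, key=cv_err)

def nested_cv(rows, lambdas, k=5):
    """Outer CV scores the whole fit-and-select procedure on held-out folds."""
    fs = make_folds(rows, k)
    errs = []
    for i in range(k):
        train = [y for j, f in enumerate(fs) if j != i for y in f]
        lam = inner_select(train, lambdas)   # selection never sees the test fold
        errs.append(mse(lam * (sum(train) / len(train)), fs[i]))
    return sum(errs) / k

rng = random.Random(2)
ys = [rng.gauss(1.0, 1.0) for _ in range(100)]
print(round(nested_cv(ys, [0.0, 0.5, 0.9, 1.0]), 3))
```

The key point is structural: the selected parameter can differ across outer folds, and the outer estimate is for the procedure, not for any single selected model.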

For prediction, note the difference between “leave random year out” cross-validation and “leave future out” validation. “Leave random year out” has leakage (e.g., training on 1999 and 2001 to cross-validate on 2000).
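A leakage-free alternative for yearly data is an expanding-window (“leave future out”) scheme, where each test year is predicted using only earlier years (hypothetical helper):

```python
def leave_future_out_splits(years):
    """Expanding-window splits: train only on years strictly before the
    test year, so no information leaks backward from the future."""
    years = sorted(set(years))
    return [(years[:i], [years[i]]) for i in range(1, len(years))]

splits = leave_future_out_splits([1999, 2000, 2001, 2002])
for train, test in splits:
    print(train, "->", test)
# [1999] -> [2000]
# [1999, 2000] -> [2001]
# [1999, 2000, 2001] -> [2002]
```

The price is fewer and smaller training sets for the early years, but the estimate matches the real forecasting task: predicting the future from the past.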

I’ve seen even worse than this in natural language processing.

Human language is highly non-stationary. What and who gets talked about changes over time on several scales, as does how it’s talked about. And there are periodic calendar effects, like we saw in the births example.

You also have document-level effects. If you sample at the level of sentences and do simple cross-validation, you wind up training and testing on data from the same document (which helps identify things like names, technical terms, and idioms).
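The usual fix is to make folds at the document level, so sentences from one document never land in both the training and test sets. A hand-rolled sketch (hypothetical helper):

```python
def group_folds(doc_ids, k):
    """Assign whole documents to folds so no document straddles train and test."""
    docs = sorted(set(doc_ids))
    fold_of_doc = {d: i % k for i, d in enumerate(docs)}
    return [fold_of_doc[d] for d in doc_ids]

# Sentences tagged with their source document:
doc_of_sentence = ["a", "a", "b", "b", "b", "c", "c"]
print(group_folds(doc_of_sentence, 2))
# [0, 0, 1, 1, 1, 0, 0]
```

With sentence-level folds, a name or idiom seen in training reappears in test sentences from the same document; document-level folds remove that shortcut.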

One part of this problem that seems to actually be going somewhere is the question of what counts as a notable difference in estimated generalization error. I’m still wrapping my head around the PAC-Bayes generalization error bounds, but there are frequent claims that they are much more useful than the old VC-dimension-based bounds. I also don’t know whether they can be improved by building in something like WAIC’s estimate of the bias of the generalization error.

Isn’t there a very close relationship between cross-validation and the Bayes integral? I feel like the former gives you an approximation to the latter under some (possibly very strong) assumptions. If so, shouldn’t it be usable (under some circumstances) for model selection? Or am I way off base?

David: This might be of some help

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics 12:1151-1172.

but in any case it’s likely well worth reading.