I think cross-validation is a good way to estimate a model’s forecasting error, but I don’t think it’s always such a great tool for comparing models. I mean, sure, if the differences are dramatic, ok. But you can easily have a few candidate models where one makes a lot more sense than the others (even in a purely predictive sense, I’m not talking about causality here). The differences between the models don’t show up in an xval measure of total error but in the patterns of the predictions.
For a simple example, imagine using a linear model with positive slope to model a function that is constrained to be increasing. If the constraint isn’t in the model, the predicted/imputed series will sometimes be nonmonotonic. The effect on the prediction error can be so tiny as to be undetectable (or including the constraint might even increase average prediction error); nonetheless, the unconstrained predictions will be clearly nonsensical.
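Here’s a minimal simulation of that idea (everything here is a made-up toy, not from any real analysis): the truth is increasing, the fitted line has positive slope, but draws around the fitted line break monotonicity while the average error barely budges.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)

# Truth is strictly increasing; the fitted line has positive slope.
y = x + rng.normal(0, 0.1, x.size)
slope, intercept = np.polyfit(x, y, 1)

# "Imputed" draws around the fitted line (the noise stands in for
# imputation/posterior uncertainty); pointwise they can be nonmonotone.
pred = intercept + slope * x + rng.normal(0, 0.1, x.size)

# Enforcing the constraint with a running maximum restores monotonicity...
pred_mono = np.maximum.accumulate(pred)

# ...but barely moves the average prediction error.
rmse = lambda p: np.sqrt(np.mean((p - x) ** 2))
print(np.all(np.diff(pred) >= 0), rmse(pred), rmse(pred_mono))
```

The first printed value is typically `False` (the draws are nonmonotone) even though the two error numbers are close, which is the point: total-error measures can’t see this kind of nonsense.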
That’s an extreme example but I think the general point holds. Think of xval as a way of estimating predictive accuracy. If you want to compare models, I think you’ll want to look more carefully at the output.
The above was my response to a question that Jay Ulfelder sent off to a few quantitative political scientists:
I’m at a methodological fork in the road in a project I’m doing for the U.S. Holocaust Memorial Museum, and I’m hoping that some of you might have some words of wisdom to help me figure out how to proceed. I know you’re all very busy, but the stakes in this project are pretty high, so if there’s any way you can take a few minutes to reply, I’d really appreciate it.
As noted in the blog post I linked to at the start of this email, the aim of the project is to develop a statistical model (or models) that could be used each year to assess the risk of an onset of mass killing in countries worldwide. Based on some preliminary analysis, I’ve decided that logistic regression will work fine, so I’ve got a modeling approach in hand. My goal now is to estimate and compare the forecasting power of some simple models, individually and as ensembles.
Where I’m stuck now is choosing between assessing and comparing forecasting power across model specifications via cross-validation and dealing with non-trivial missing-data problems via multiple imputation. I’ve written a script that executes 10-fold CV as a way to compare the models’ forecasting power, but that script starts with a single version of the data set. I don’t see how I can simultaneously handle the missing-data problem other than just rerunning that CV script N times, where N is the number of imputations. That would be unwieldy but doable, but it also seems like the resulting averages of averages of estimates are going to be so noisy (these are very rare events, with only a few score onsets in each of the training sets that K-fold CV produces and about a dozen onsets in each of the test sets) that comparisons across models and ensembles will only be informative in the case of extreme differences, which I don’t expect to see at this point.
Given the inherent noisiness of these data, I’m inclined to forgo the CV at this point and focus instead on the missing-data problem. Based on that preliminary research, I’m expecting that the models I specify will squeeze most of the forecasting juice out of the (quite limited) data we’ve got, and that averaging forecasts across a few different models will mitigate the “all eggs in one basket” problem.
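For what it’s worth, the unwieldy-but-doable scheme Ulfelder describes (rerun K-fold CV once per imputed dataset, then average) is only a few lines. The sketch below uses hypothetical toy data, and the imputation step is a crude random hot-deck stand-in for proper multiple imputation, just to show the shape of the loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)

# Toy data: rare binary outcome, ~10% of predictor cells missing at random.
n, p = 1000, 3
X = rng.normal(size=(n, p))
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 3)))).astype(int)
X[rng.random((n, p)) < 0.1] = np.nan

def impute_draw(X, rng):
    """One crude imputation: fill each NaN with a random draw from the
    observed values of the same column (a stand-in for real MI draws)."""
    Xm = X.copy()
    for j in range(Xm.shape[1]):
        miss = np.isnan(Xm[:, j])
        Xm[miss, j] = rng.choice(Xm[~miss, j], size=miss.sum())
    return Xm

M = 5  # number of imputations
scores = []
for m in range(M):
    Xm = impute_draw(X, rng)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=m)
    preds = np.empty(n)
    for train, test in cv.split(Xm, y):
        fit = LogisticRegression(max_iter=1000).fit(Xm[train], y[train])
        preds[test] = fit.predict_proba(Xm[test])[:, 1]
    scores.append(brier_score_loss(y, preds))  # 10-fold CV error, imputation m

print(np.mean(scores))  # average CV error over the M imputations
```

The noise Ulfelder worries about shows up as spread in `scores`: with only a handful of events per fold, the between-imputation and between-fold variation can easily swamp small differences between model specifications.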
And Gary King offered this useful practical suggestion:
Since you’re only doing forecasting and not estimating causal effects, you could perhaps treat the ‘missingness’ as one type of observation. Then just put it all in and see how, by any means, you can improve the forecasts. It could be that missingness in some of the variables predicts mass killings. I bet it does; e.g., this paper shows that when military conflict starts, the first casualty is often the collection of vital registration records.
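King’s suggestion can be sketched by skipping imputation entirely: constant-fill the holes and append missingness-indicator columns, so “this value was missing” becomes a feature the model can use. The data below are a hypothetical toy in which missingness is itself predictive of the outcome, as in his vital-registration example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Toy data where being missing on the first variable predicts the outcome.
X = rng.normal(size=(500, 2))
miss = rng.random(X.shape) < 0.2
y = ((X[:, 0] > 1) | miss[:, 0]).astype(int)

# No imputation: fill the holes with a constant and append one
# indicator column per variable marking which cells were missing.
X_aug = np.hstack([np.where(miss, 0.0, X), miss.astype(float)])

fit = LogisticRegression(max_iter=1000).fit(X_aug, y)
print(fit.coef_[0])  # the indicator for the first variable picks up signal
```

This is strictly a forecasting trick, consistent with King’s caveat: the indicator coefficients improve prediction but carry no causal interpretation.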