## Cross-validation and selection of priors

It is not unusual for statisticians to check their models with cross-validation. Last week we saw that several replications of cross-validation reduce the standard error of the mean error estimate, and that cross-validation can also be used to obtain pseudo-posteriors akin to those from Bayesian statistics and the bootstrap. This post will demonstrate that the choice of prior depends on the parameters of cross-validation: a “wrong” prior may result in overfitting or underfitting according to the CV criterion. In particular, the number of folds in cross-validation affects the choice of the prior.

It has long been recognized that logistic regression and the naive Bayes classifier give better results when the weights are shrunk towards zero. Naive Bayes can be seen as a simple form of logistic regression in which the weights of the multivariate model are estimated univariately: if all the variables were conditionally independent given the class, the two would give the same results. The shrinkage parameter m has a concrete interpretation: we inject m pseudo-cases into the data that have the same marginal distribution in each variable as the original data, but no association between any pair of variables; this corresponds to a variant of the usual conjugate Dirichlet prior. The usual recommendation in machine learning is to use cross-validation to determine the “right” value of m, whereas a Bayesian would place a prior distribution over m.
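To make the pseudo-case interpretation concrete, here is a minimal sketch (the function name and toy numbers are mine, not from the post) of the m-estimate: a raw class-conditional frequency is replaced by the value it would take if m extra cases following the marginal distribution had been added to the data.

```python
def m_estimate(count, class_count, marginal, m):
    """Shrink a class-conditional probability toward the marginal
    by injecting m pseudo-cases that follow the marginal distribution."""
    return (count + m * marginal) / (class_count + m)

# Toy example: a feature value seen 3 times among 10 cases of a class,
# with marginal probability 0.5 for that value.
print(m_estimate(3, 10, 0.5, 0))   # no shrinkage: 0.3
print(m_estimate(3, 10, 0.5, 10))  # m=10 pulls the estimate toward 0.5: 0.4
```

As m grows, the estimate moves from the observed frequency toward the marginal, which is exactly the "no association between variables" prior described above.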

The trouble is that cross-validation itself has a parameter: the number of folds. As the graph above illustrates, the optimal value of m is 83 with predictive bootstrap (the cases that were not selected in a particular resample are used to assess the out-of-sample error), 41 with 2-fold CV, 54 with 3-fold CV, 59 with 5-fold CV, 64 with 10-fold CV, and 69 with leave-one-out. Moreover, should we use a different error measure, such as KL-divergence or classification error, the ideal amount of shrinkage would again change. A different data set would likewise imply a different amount of shrinkage.
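The fold-dependence is easy to reproduce in miniature. The following is a hypothetical sketch (not the post's actual experiment, model, or data) that selects the shrinkage m by k-fold cross-validated log-loss on a toy binomial problem, for several values of k:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.7, size=60)  # hypothetical binary outcomes

def shrunken_p(y_train, m):
    # m pseudo-cases with no information: prior success probability 0.5
    return (y_train.sum() + 0.5 * m) / (len(y_train) + m)

def cv_loss(y, m, k):
    """Average held-out log-loss of the shrunken estimate over k folds."""
    folds = np.array_split(rng.permutation(len(y)), k)
    losses = []
    for f in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[f] = False
        p = np.clip(shrunken_p(y[mask], m), 1e-9, 1 - 1e-9)
        losses.append(-np.mean(np.log(np.where(y[f] == 1, p, 1 - p))))
    return float(np.mean(losses))

m_grid = range(0, 51, 5)
for k in (2, 5, 10):
    best = min(m_grid, key=lambda m: cv_loss(y, m, k))
    print(f"{k}-fold CV picks m = {best}")
```

The selected m will generally differ across k (and across random seeds), which is the phenomenon discussed here: the CV-optimal prior is a function of how the cross-validation is set up.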

One cannot escape assumptions. If this is so, what assumptions are reasonable? Data splitting is not an unreasonable assumption (it plays a role analogous to a prior), and I find the above parameterization of shrinkage quite reasonable; on the other hand, I personally feel uneasy expressing my prior as some sort of distribution on weights. Moreover, having a more sophisticated model flunk the cross-validation test might just mean that there is a mismatch between the prior it uses and the prior that cross-validation would prefer.

1. D Das says:

Dear Sir,

I am using Agenarisk software tool to do some statistical analysis. I have built a model based on subjective data from experts. I would like to know how I can validate my model using cross-validation techniques.
Any pointer will be very much helpful as I have been struggling with this for quite some time.

Many thanks

Deb

2. Aleks says:

Deb, you should talk with the customer support of the company that sold you the software.

Or, you could join the happy family of open source software, do something for others, and others will do something for you.

3. Andrew says:

Aleks

This is really interesting–I wish I had read it 10 months ago when you posted it originally! As n (sample size) approaches infinity, I assume that all cross-validations will give the same optimal value. I expect that the phenomenon you're seeing in your graph can be thought of as inferential variation–i.e., the different cross-validated estimates are all estimating the same thing but have slightly different values for any particular finite dataset. I don't know if I'd characterize the number of folds of the cross-validation as a "parameter" in the same way that I think of a prior scale as a parameter.

4. Aleks says:

My post demonstrates that the number of folds in cross-validation isn't independent of the Bayesian prior in practical situations. The consequence is that the choice of the prior depends on the type of cross-validation used.

Finite datasets are all we have in practice. Just as infinite data would swamp the subjective prior, infinite data would swamp the subjective choices in cross-validation / bootstrap.

5. Andrew says:

Aleks,

I'm not sure about this. For example, let's suppose that we're going to estimate regression coefs using the first third of the data, the second third, or the third third. In any given data set they will give different values of beta.hat. But that's just sampling variability–I wouldn't think of these as 3 different estimators. I wonder if something similar is going on in your #folds example.

6. Aleks says:

Andrew, I have tried to minimize the effect of this dependence on a particular partition of data into the folds by performing several replications of cross-validation. For example, the results shown involve 40 replications of 2-fold cross-validation. So, what should remain is only the dependence on the size of partitions, but not the choice of partitions.
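[For illustration, replicated cross-validation of the kind described might be sketched as follows; this is a hypothetical reconstruction, not the original code, and `loss_of_fold` is a placeholder for whatever model-fitting and scoring routine is used.]

```python
import random
import statistics

def replicated_cv(n, k, reps, loss_of_fold):
    """Average a held-out loss over `reps` independent random partitions
    of n cases into k folds, washing out the choice of any one partition."""
    losses = []
    for _ in range(reps):
        idx = list(range(n))
        random.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            losses.append(loss_of_fold(train, fold))
    return statistics.mean(losses)
```

Averaging over, say, 40 random 2-fold partitions leaves only the dependence on fold *size* (here n/2), not on which cases happen to land together.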