Ilya Lipkovich writes:
I read with great interest your 2008 paper [with Aleks Jakulin, Grazia Pittau, and Yu-Sung Su] on weakly informative priors for logistic regression and also followed an interesting discussion on your blog. That discussion was within the Bayesian community and concerned the validity of priors. I would like to approach it from a broader perspective on predictive modeling, bringing in ideas from the machine/statistical learning approach. Actually, you were the first to bring this up: your paper mentions “borrowing ideas from computer science,” namely cross-validation, when comparing the predictive ability of your proposed priors with other choices.
However, using cross-validation to compare methods’ performance is not the only or even the primary use of CV in machine learning. Most machine-learning methods have some “meta” or complexity parameters and use cross-validation to tune them. For example, one of your comparison methods is BBR, which actually resorts to CV to select the prior variance (whether one uses Laplace or Gaussian priors). This makes their method essentially equivalent to lasso or ridge regression with the tuning parameter selected by cross-validation, so there is really not much Bayesian flavor left there. Calling it BBR was, I believe, rather a marketing device. The real advance of their method was that, at the time it was created, there was no algorithm that could compute the entire cross-validation coefficient path quickly by taking advantage of the sparseness of the input vectors (most of the X’s being 0’s); but this has more to do with the algorithm than with Bayesian ideas. From my personal communication with David Madigan, I do not recall that he ever advocated using default priors; he seemed to like the CV approach to choosing them (and that was the whole point of making the algorithm fast), as would most people in the statistical learning community (e.g., Hastie, Tibshirani, and Friedman).
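[To make the comparison concrete, here is a minimal sketch, in Python with scikit-learn, of the generic procedure being described: a penalized logistic regression with its regularization strength chosen by cross-validation. The data are simulated and the grid of penalties is arbitrary; this is the lasso/ridge analogue of BBR’s CV-selected prior variance, not BBR’s own code.]

```python
# Sketch: CV-tuned penalized logistic regression (the lasso/ridge
# analogue of selecting BBR's prior variance by cross-validation).
# Simulated data; the grid of penalties is arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each C is an inverse regularization strength; under a Bayesian
# reading, larger C corresponds to a larger prior variance.
model = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 20),
    cv=5,
    penalty="l1",        # Laplace-prior analogue; use "l2" for Gaussian
    solver="liblinear",  # liblinear supports the l1 penalty
    scoring="neg_log_loss",
)
model.fit(X, y)
print("CV-selected C:", model.C_[0])
```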
Now, it is unclear from your paper whether, when comparing your automated Cauchy priors with BBR, you let BBR choose its optimal tuning parameter or used the default values. If you let BBR tune its parameter, then you should have performed a “double cross-validation,” allowing BBR to select a (possibly different) value of the tuning parameter (the prior variance) on each fold of your “outer cross-validation,” based on a separate “inner CV” within that fold. If you used automated priors, then you might not have done justice to BBR. But then you might say it would be unfair to let them choose the optimal prior variance via CV when your method uses automated priors. Also, using CV may not be, strictly speaking, appropriate from a Bayesian point of view. But this is exactly my question. If we leave Bayesian grounds and move to the statistical learning (or “computer science,” in your interpretation) turf, then what is the optimal way to fit a predictive model? From reading your paper, it seems that you believe in the existence of default priors, which translates into default complexity parameters when performing statistical learning. This seems to contrast with what the “authorities” in the statistical learning literature tell us: they reject the idea that one can preset complexity parameters in any large-scale predictive modeling as a popular myth. They view it as the ubiquitous bias-variance trade-off that cannot be resolved by some magical pre-specified values, at least when there is a reasonably large number of candidate predictors. Is the answer that your approach with automated priors is intended only for problems with just a few predictors? Or is there a deeper philosophical split between the Bayesian and the statistical learning communities?
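[Here is a minimal sketch of the “double cross-validation” described above, again in Python with scikit-learn on simulated data: the inner CV, inside GridSearchCV, selects the tuning parameter separately within each outer fold, and the outer CV evaluates the whole select-then-fit procedure.]

```python
# Sketch: nested ("double") cross-validation. The inner CV picks the
# tuning parameter within each outer fold; the outer CV measures the
# predictive performance of the entire tuning-plus-fitting procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

inner = GridSearchCV(
    LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 13)},
    cv=5,                  # inner CV: selects C within each outer fold
    scoring="neg_log_loss",
)
outer = cross_val_score(inner, X, y, cv=5, scoring="neg_log_loss")
print("outer-CV log loss:", -outer.mean())
```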
1. The quick answer is that we wanted a method that would apply even for models with only one or two predictors, in which case it would not make sense to use cross-validation (or any other procedure) to estimate the tuning parameter (in this case, the scale parameter for the prior distribution of the logistic regression coefficients). If you have a lot of predictors, then, sure, it makes sense to estimate the hyperparameter from data in some way or another.
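To give a sense of what such a default prior looks like in practice, here is a rough sketch: a posterior-mode fit of a logistic regression with independent Cauchy(0, 2.5) priors on the coefficients, as in our paper’s default. (Our actual implementation is the approximate EM algorithm in bayesglm in R’s arm package; this Python version is just a penalized-likelihood approximation, on simulated predictors that are assumed to be already rescaled.)

```python
# Sketch: posterior mode of a logistic regression with independent
# Cauchy(0, 2.5) priors on the coefficients (assumes predictors are
# already rescaled as the paper recommends; no intercept for brevity).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_posterior(beta, X, y, scale=2.5):
    p = expit(X @ beta)
    eps = 1e-12  # guard against log(0)
    log_lik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # log Cauchy(0, scale) density, up to an additive constant
    log_prior = -np.sum(np.log1p((beta / scale) ** 2))
    return -(log_lik + log_prior)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, expit(X @ np.array([1.0, -0.5, 0.0])))
fit = minimize(neg_log_posterior, np.zeros(3), args=(X, y), method="BFGS")
print("posterior mode:", fit.x)
```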
So, no, it’s not some matter of principle to me that hyperparameters should or should not be chosen a priori. It depends on the structure of the problem: the more replication, the more it is possible to estimate such tuning parameters internally. When you write that people “reject the idea that one can preset complexity parameters in any large-scale predictive modeling as a popular myth,” I think the key phrase there is “large-scale,” which in this context implies having a large number of predictors, so that the tuning parameters can be estimated from the data. In our paper we were particularly interested in cases where the number of predictors is small.
If there are general differences between statistics and machine learning here, it’s not about the philosophy of automated priors or whatever; it’s that in statistics we often talk about small problems with only a few predictors (see any statistics textbook, including mine!), whereas machine-learning methods tend to be applied to problems with large numbers of predictors.
2. I don’t see cross-validation vs. Bayes as being a real thing. Or, to put it another way, once you start estimating hyperparameters from data, I see it as hierarchical Bayes. For example, you write that the tuning parameter in BBR is “selected by cross-validation so there is really not much Bayesian flavor left there,” but I do think BBR is essentially Bayesian: to me, selecting a tuning parameter by cross-validation is just a particular implementation of hierarchical Bayes (with the recognition that, as the amount of information about the tuning parameter becomes small, it will be increasingly helpful to add prior information and more explicitly consider the uncertainty in your inference about this parameter).
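As a minimal sketch of that point (simulated data again, and only a point estimate; a fuller hierarchical treatment would average over the hyperparameter rather than fixing its single best value): treat the prior scale as a hyperparameter and pick the value that maximizes the cross-validated predictive density.

```python
# Sketch: choosing the prior scale by cross-validated predictive
# density, read as a point-estimate version of hierarchical Bayes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# In sklearn's parameterization, a Gaussian prior with scale tau
# corresponds (up to a constant) to the inverse penalty C = tau**2.
scales = np.logspace(-2, 2, 9)
cv_lpd = [
    cross_val_score(
        LogisticRegression(C=tau**2, solver="lbfgs", max_iter=1000),
        X, y, cv=5, scoring="neg_log_loss",
    ).mean()
    for tau in scales
]
print("CV point estimate of the prior scale:", scales[int(np.argmax(cv_lpd))])
```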