Cross-validation and Bayesian estimation of tuning parameters

Ilya Lipkovich writes:

I read with great interest your 2008 paper [with Aleks Jakulin, Grazia Pittau, and Yu-Sung Su] on weakly informative priors for logistic regression and also followed an interesting discussion on your blog. That discussion stayed within the Bayesian community and concerned the validity of priors; however, I would like to approach the question from a broader perspective on predictive modeling, bringing in ideas from the machine/statistical learning approach. Actually, you were the first to bring this up by mentioning in your paper “borrowing ideas from computer science” on cross-validation when comparing the predictive ability of your proposed priors with other choices.

However, using cross-validation to compare method performance is not the only or even the primary use of CV in machine learning. Most machine learning methods have some “meta” or complexity parameters and use cross-validation to tune them up. For example, one of your comparison methods is BBR, which actually resorts to CV for selecting the prior variance (whether you use Laplace or Gaussian priors). This makes their method essentially equivalent to ridge regression or the lasso with the tuning parameter selected by cross-validation, so there is really not much Bayesian flavor left there; calling it BBR was, I believe, rather a marketing device. The real advance in their method was that, at the time it was created, there was no other algorithm that could compute the entire cross-validation coefficient path quickly by taking advantage of the sparseness of the input vectors (most of the X’s being 0’s); but this has more to do with the algorithm than with Bayesian ideas. From my personal communication with David Madigan, I do not remember him ever advocating default priors; he seemed to like the CV approach to choosing them (and that was the whole point of making the algorithm fast), as most people in the statistical learning community would (e.g., Hastie, Tibshirani, and Friedman).

Now, it is unclear from your paper whether, when comparing your automated Cauchy priors with BBR, you let BBR choose the optimal tuning parameter or used the default values. If you let BBR tune parameters then you should have performed a “double cross-validation,” allowing BBR to select a (possibly different) value of the tuning parameter (prior variance) on each fold of your “outer cross-validation,” based on a separate “inner CV” within that fold. If you used automated priors then you might not have done justice to BBR. But then you may say that it would be unfair to let them choose the optimal prior variance via CV if your method uses automated priors. Also, using CV may not be, strictly speaking, appropriate from a Bayesian point of view. But this is exactly what my question is about. If we leave the Bayesian grounds and move to the statistical learning (or “computer science” in your interpretation) turf, then what is the optimal way to fit a predictive model? From reading your paper it seems that you believe in the existence of default priors, which translates into having default complexity parameters when performing statistical learning. This seems to be in contrast with what the “authorities” in the statistical learning literature tell us, where they reject the idea that one can preset complexity parameters in any large-scale predictive modeling as a popular myth. They view it as the ubiquitous bias-variance trade-off that cannot be resolved by some magical pre-specified values. At least when there is a reasonably large number of candidate predictors. Could the answer be that your approach with automated priors is intended only for problems with just a few predictors? Or is there a deeper philosophical split here between the Bayesian and the statistical learning communities?
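
(For concreteness, the “double cross-validation” recipe described above looks roughly like the following sketch. It uses Python with scikit-learn, placeholder simulated data, and an L2 penalty whose strength C stands in for a prior variance; none of this is from the paper or from BBR itself.)

    # Nested ("double") cross-validation: an inner CV chooses the penalty strength
    # separately within each outer fold; the outer CV only scores the resulting fits.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data

    inner = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # C plays the role of a prior variance
        cv=KFold(n_splits=5, shuffle=True, random_state=1),
        scoring="neg_log_loss",
    )
    outer_scores = cross_val_score(
        inner, X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=2),
        scoring="neg_log_loss",
    )
    print(outer_scores.mean())  # honest estimate of predictive performance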

My reply:

1. The quick answer is that we wanted a method that would apply even for models with only one or two predictors, in which case it would not make sense to use cross-validation (or any other procedure) to estimate the tuning parameter (in this case, the scale parameter for the prior distribution of the logistic regression coefficients). If you have a lot of predictors, then, sure, it makes sense to estimate the hyperparameter from data in some way or another.
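
(As a rough illustration of what a fixed prior scale means here, the sketch below computes the posterior mode of a logistic regression with independent Cauchy priors on the coefficients. It is only in the spirit of the 2008 paper's default choice, roughly a Cauchy with scale 10 on the intercept and 2.5 on rescaled coefficients; the optimizer-based fit and the simulated data are my own shortcuts, not the paper's approximate EM algorithm.)

    # Sketch: MAP estimate of a logistic regression with independent Cauchy priors.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    def neg_log_posterior(beta, X, y, scales):
        eta = X @ beta
        loglik = np.sum(y * eta - np.logaddexp(0.0, eta))        # Bernoulli log likelihood
        logprior = np.sum(-np.log(np.pi * scales)                # Cauchy(0, scale) log densities
                          - np.log1p((beta / scales) ** 2))
        return -(loglik + logprior)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # intercept + 2 predictors
    y = rng.binomial(1, expit(X @ np.array([-1.0, 2.0, 0.0])))
    scales = np.array([10.0, 2.5, 2.5])                              # wider prior for the intercept

    fit = minimize(neg_log_posterior, x0=np.zeros(3), args=(X, y, scales), method="BFGS")
    print(fit.x)   # posterior mode under the weakly informative prior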

So, no, it’s not some matter of principle to me that hyperparameters should or should not be chosen a priori. It depends on the structure of the problem: the more replication, the more it is possible to estimate such tuning parameters internally. When you write that people “reject the idea that one can preset complexity parameters in any large-scale predictive modeling as a popular myth,” I think the key phrase there is “large-scale,” which in this context implies having a large number of predictors, so that the tuning parameters can be estimated from the data. In our paper we were particularly interested in cases where the number of predictors is small.

If there are general differences between statistics and machine learning here, it’s not about the philosophy of automated priors or whatever; it’s that in statistics we often talk about small problems with only a few predictors (see any statistics textbook, including mine!), whereas machine learning methods tend to be applied to problems with large numbers of predictors.

2. I don’t see cross-validation vs. Bayes as being a real dichotomy. Or, to put it another way, once you start estimating hyperparameters from data, I see it as hierarchical Bayes. For example, you write that the tuning parameter in BBR is “selected by cross-validation, so there is really not much Bayesian flavor left there,” but I do think BBR is essentially Bayesian: to me, selecting a tuning parameter by cross-validation is just a particular implementation of hierarchical Bayes (with the recognition that, as the amount of information about the tuning parameter becomes small, it will be increasingly helpful to add prior information and more explicitly consider the uncertainty in your inference about this parameter).

33 thoughts on “Cross-validation and Bayesian estimation of tuning parameters”

  1. I’m not sure what you mean in the second part. I can’t see what the Bayesian interpretation of cross validation is, given that it explicitly depends on the data, so I don’t see how it can come out of a Bayesian model.

    That being said, I *love* cross validation in otherwise Bayesian models because, otherwise, there’s just no information at all about the parameter. (I’m thinking of estimating the variance of a component on the third level of a hierarchical model.)

    I think it’s the sort of thing people talk about as “pragmatic Bayes”: Think Bayesian, compute however you can!

    • Cross validation doesn’t come from a single model: it instead arises when you add additional assumptions to your system, in particular the desire to have predictive consistency. Note that this is philosophically and practically different from self-consistent Bayesian inference but is exactly the assumption one introduces when doing posterior predictive checks of a Bayesian model.

      I love the work Aki’s been doing [1] that shows how cross validation, WAIC (and consequently AIC and DIC [2]) are all different approaches to different posterior predictive checks. This also shows how the typical squared-error cross validation is suboptimal compared to the log likelihood cross validation that’s becoming more popular in machine learning.

      That said, I like these methods for validating models, not fitting parameters. If you’re just trying to learn a hyperparameter then make it hierarchical, which formalizes the space of possibilities you would be approximating in cross validation anyway.

      [1] http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ssu/1356628931
      [2] http://www.stat.columbia.edu/~gelman/research/published/waic_understand3.pdf

      • So the situation that we ended up in was one where putting a prior on this particular parameter did nothing (in the sense that the posterior was almost identical to the prior, highly sensitive to it, etc.), but the results were still sensitive to the parameter. Hence we did exactly what you say and decided that we cared about prediction (which we did) and basically did parameter estimation by model choice. It wasn’t the first thing we tried, but it definitely worked!

        And the work Aki and his group are doing is super-cool! We had one of his postdocs visiting here a month ago who’s doing some nice work on extending these results for Gaussian process models. (And he showed us how to stabilise the CPO calculations in INLA, which was definitely convenient!)

        • Dan and Mike,

          Thanks for your kind words, I’m blushing.

          As discussed in our survey [1], cross-validation (and some other predictive methods) is a decision-theoretically justified approach for choosing a parameter value or a model. I prefer “making it hierarchical” and integrating, but sometimes cross-validation is computationally much, much faster (yes, I’m pragmatic).

          I have also encountered a situation similar to the one Dan describes, where the predictions are sensitive to the value of a hyperparameter but the likelihood is not (so that the parameter is not identified in the posterior).

          It seems that in some cases the marginal likelihood and predictive criteria (like CV) are complementary, but this is not yet a well-understood topic. In some cases the marginal likelihood and predictive criteria (like CV) are very similar, but the predictive criteria have higher variance, which leads to sub-optimal performance.

        • I wonder if it has to do with the sort of statistical singularity Watanabe was talking about? Is it that locally all parameters map to the same point in the statistical manifold, so that classical techniques aren’t appropriate? (Tommy tried hard, but I still don’t understand all this geometry stuff….)

        • In the usual singular case, different points in the analytical set would produce the same predictions (as is the case with label switching in mixture models); it is rarer that the likelihood value is the same while the predictions are different. Watanabe’s proofs should hold in both cases.

        • Aki,

          Are we talking about posteriors with accidental degeneracies, where adding the validation data to the training data would solve the problem directly, or something more subtle? I guess I don’t have a strong grasp of how the model would fit the measured data well but then be so poorly predictive of new data, at least if the posterior uncertainty was properly taken into account in the prediction.

        • “I have also encountered a situation similar to the one Dan describes, where the predictions are sensitive to the value of a hyperparameter but the likelihood is not (so that the parameter is not identified in the posterior).”

          I’m having trouble grokking how this could be. Is there a toy model where this happens that I can play with?

  2. “Most machine learning methods have some “meta” or complexity parameters and use cross-validation to tune them up”

    Naive question: Isn’t CV primarily about validating a model? To get an estimate of how a model tuned on your pet dataset would likely perform when unleashed on a “wild” data set? Overfitting, n-fold CV and all that.

    So how is it that we are saying CV is used to tune parameters? Wouldn’t that defeat the whole purpose of a validation exercise?

    Apologies if this is too elementary a question!

    • Rahul,

      I agree that CV should be about validation, but it often gets stretched out into tuning, especially in machine learning. The idea is that one is trying to choose the model that is “best validated”, where the possible models can be everything from varying structure to the same structure with a hyperparameter taking on different values. You’re also right in that you have to sacrifice true validation by using your hold-out data to tune those parameters.

      • Thanks @Michael!

        I’m no expert, but to me it seems that this practice defeats the whole purpose of validation. Why not just call it an extended tuning exercise or even maybe split-dataset-hierarchical-tuning. Or something.

        I think you taint your validation dataset the moment you do even a wee bit of tuning using it. If you choose the “best validated” model you are implicitly exposed to overfitting at the model-choice scale, aren’t you? And then the purpose behind the validation exercise is defeated.

        Maybe I’m wrong and my tastes are too purist.

        • Rahul,

          Philosophically I’m right there with you, but remember that many people in machine learning are frequentist (or have not yet learned the Bayesian arts) and don’t really have any other means of tuning hyperparameters so they jump on whatever methods might be available.

          Perhaps the best way to put it is that validation is often considered a luxury when validation data are plentiful and models are simple, not an absolute requirement. Given the choice between tuning for better performance without validating, and arbitrarily setting the hyperparameters and validating, a practitioner is almost always going to take the former.

          The beauty of a full Bayesian approach is that the tuning is automatic given a self-consistent model and any additional data is always available for validation.

          Now one can argue (well, I certainly would) that, because it wrings out every drop of information from the data, the Bayesian approach makes tuning and validation much more rigorous and much more practical.

        • I’m not 100% sure I agree with Michael here. In principle I would use Bayes (I like Bayes) when the data are informative (in the sense that the posterior shrinks). When this isn’t true, you’re not really using every drop of information in the data (or you are, but none of it is going into this one parameter), and so it becomes incredibly important to select a good prior. Given that it is *very* difficult to elicit good priors on variance components deep in a hierarchy (which is the type of situation where there will be little information in your data about the parameter), the Bayes machinery isn’t appropriate. In this case, it makes sense to consider a family of models indexed by this one parameter and perform model choice on it. This isn’t inconsistent with the Bayesian philosophy (I think I’m understanding what Andrew was talking about now…).

          As you pointed out above, cross validation fits beautifully into a bunch of existing model choice frameworks: it has known frequentist properties, is (relatively) cheap [this is why machine learners use it], and is based on a loss function. All of this makes it a good choice. Certainly better than, say, Empirical Bayes, which lacks these interpretations, or full Bayes, which is enormously sensitive to prior choice in this context.

        • Dan:

          Estimating parameters using cross-validation (or, as Aki calls it, cross-tuning) can be fine, but I think you are overrating its theoretical and practical properties. The minimum-cross-validated estimate of a parameter is just an estimate, and as such will perform well in some cases and not others. Ultimately it is an estimate based on the data you have.

          Just for example, since we’re talking about my 2008 paper with Jakulin et al., suppose you are fitting a logistic regression to data that have complete separation. In this case the cross-validated estimate of the coefficient will be infinity. Similarly, cross-validation for hierarchical variance parameters can run into problems when the number of groups is small.

          That said, cross-validation can be improved in such settings by regularizing, i.e., adding a penalty function, i.e., using a prior. But this just continues the main thread, which is that the minimum-cross-validated-error estimate, considered as a point estimate, can be viewed as an approximation to Bayesian inference with a flat prior. In some settings, point estimation with flat priors can work OK. But in some settings with sparse data, it can help to move to a more formally Bayesian approach that accounts for uncertainty and regularizes away from bad areas of parameter space.
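
          A tiny numerical illustration of the separation point (my own toy construction, not from the paper): with perfectly separated data the log likelihood keeps increasing as the coefficient grows, so the unregularized estimate runs off to infinity, whereas adding a Cauchy prior gives a finite mode.

            import numpy as np

            x = np.array([-2.0, -1.0, 1.0, 2.0])
            y = np.array([0, 0, 1, 1])              # perfectly separated at x = 0

            def loglik(b):
                eta = b * x
                return np.sum(y * eta - np.logaddexp(0.0, eta))

            for b in [1, 5, 25, 125]:
                print(b, loglik(b))                 # keeps increasing: the MLE is at infinity

            def logpost(b, scale=2.5):              # add a Cauchy(0, 2.5) prior on the slope
                return loglik(b) - np.log1p((b / scale) ** 2)

            grid = np.linspace(0, 50, 5001)
            print(grid[np.argmax([logpost(b) for b in grid])])   # finite posterior mode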

        • I would actually argue that it is inconsistent with a pure Bayesian philosophy: it is only when you add the additional assumption of predictive accuracy that it becomes well-formed.

          Poor predictive performance usually indicates that the model isn’t sufficiently flexible to explain the data, and my understanding of the proper Bayesian strategy is to feed that back into your original model and try again until you achieve better performance. Conceptually I like to think of the ideal case as having the “true” model lie in the convex hull of your current models; in that case pure Bayes marginalization and model comparison should work fine.

          So from that perspective cross validation succeeds only when all of your models suck and you can’t improve them, which I can see appealing to those in machine learning (joking, joking) but should be less so in statistics.

        • M

          +1 on your first two paragraphs.

          Regarding your third paragraph, I’d say, first, that sometimes people might really be interested in predictive accuracy so it’s good to have a way of assessing it while adjusting for overfitting; and, second, I have found cross-validation to be a useful “convincer” of the effectiveness of a model. And it can help convince me, not just others.

        • Yeah. I completely agree with everything you say (well, not the last paragraph). But I still feel compelled to make the case :p

          The situation wasn’t a standard one. We had a model where the posterior for one parameter (actually 3) was “flat” (well, it was almost the prior and we’d used a “standard” prior), but different values of that parameter in the area near the mode gave seriously different predictive performance.

          And (and this is the important thing) *we cared about prediction*.

          The problem is that the model is very flexible and, if we weren’t careful about these parameters, the out of sample prediction would be horrible. And while I would’ve liked to have attacked this with priors, these parameters were variances of Gaussian random fields that worked non-linearly to give the second order properties of a second GRF that was being indirectly observed, so it was difficult to work out what sort of informative prior to use.

          We also didn’t care about those parameters (they really are tuning parameters). So basically we had an infinite set of models and an idea of what a “good” model looked like, which made CV a reasonable option. It was also (relatively) cheap.

          And, if you look at this as a model choice problem (or a tuning parameter problem, which I think is the same thing?), it’s not uncommon to be inconsistent with pure Bayes. Pure Bayes is computationally difficult, which is why the xICs exist.

          So was this the only way? No. Was it a good way in the context? It was sufficient.

        • “Poor predictive performance usually indicates that the model isn’t sufficiently flexible to explain the data…”

          It was my impression that — in ML at least — poor predictive performance is more often due to the model being too flexible and fitting noise.

        • Corey:

          Overfitting comes from a model being flexible and unregularized. Making a model inflexible is a very crude form of regularization. Often we can do better.

    • Rahul: I don’t see how it would defeat the whole purpose, as long as the training CV is nested within the validation CV. They’re not both run on the entire data set.

      • Not sure, but to me the term “training CV” itself sounds like an oxymoron. Can you train & validate at the same time / on the same data?

        OTOH maybe like @Aki Vehtari says, it’s only convention & semantics. In which case I’m fine, not trying to be pedantic here.

        • Although the practice seems to have been abandoned, not long ago datasets were split into three parts: training, validation, and testing, with the test set being the real “validation” data, under the assumption that the validation set would end up being used for tuning… both in the sense of tuning parameters and in the sense of iterating (in case the initial validation did not look good).
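
          A minimal sketch of that three-way split (placeholder data and arbitrary proportions, using scikit-learn):

            from sklearn.datasets import make_classification
            from sklearn.model_selection import train_test_split

            X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data
            X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
            X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
            # Fit on (X_train, y_train), tune and iterate against (X_valid, y_valid),
            # and touch (X_test, y_test) only once, at the very end, as the real validation.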

  3. Our paper is Bayesian: we don’t tune parameters for each dataset. The Bayesian prior comes _before_ the dataset.

    “Now, it is unclear from your paper whether, when comparing your automated Cauchy priors with BBR, you let BBR choose the optimal tuning parameter or used the default values.”

    If you look at Figure 6, you’ll see that the Cauchy prior with scale 1 dominates the fitted Gaussian or Laplace priors.

    “If you let BBR tune parameters then you should have performed a “double cross-validation,” allowing BBR to select a (possibly different) value of the tuning parameter (prior variance) on each fold of your “outer cross-validation,” based on a separate “inner CV” within that fold.”

    Inner cross-validation might be a good way to game outer cross-validation – but not always a good way to predict reality.

    “If we leave the Bayesian grounds and move to the statistical learning (or “computer science” in your interpretation) turf, then what is the optimal way to fit a predictive model?”

    Some reflection about priors beats cross-validation hacks any day. But, in the absence of reflection, cross-validation on a _corpus_ helps pick and evaluate priors, and that’s what our paper hopefully demonstrates.

    “From reading your paper it seems that you believe in the existence of default priors, which translates into having default complexity parameters when performing statistical learning.”

    It turns out that certain priors nicely match complexity penalization as measured via cross-validation.

    “This seems to be in contrast with what the “authorities” in the statistical learning literature tell us, where they reject the idea that one can preset complexity parameters in any large-scale predictive modeling as a popular myth. They view it as the ubiquitous bias-variance trade-off that cannot be resolved by some magical pre-specified values.”

    Yes. The complexity penalty is actually quite predictable according to our experiments.

  4. Mike: “Are we talking about posteriors with accidental degeneracies, where adding the validation data to the training data would solve the problem directly, or something more subtle? I guess I don’t have a strong grasp of how the model would fit the measured data well but then be so poorly predictive of new data, at least if the posterior uncertainty was properly taken into account in the prediction.”

    Dan: “The problem is that the model is very flexible and, if we weren’t careful about these parameters, the out of sample prediction would be horrible. And while I would’ve liked to have attacked this with priors, these parameters were variances of Gaussian random fields that worked non-linearly to give the second order properties of a second GRF that was being indirectly observed, so it was difficult to work out what sort of informative prior to use.”

    What I’m thinking of (and Dan’s example fits in there, too) are flexible, highly- or over-parameterised (p close to n or p > n), ill-posed models, like those in inverse problems. I think the key is that you can easily make accurate predictions at the observed x, but the behaviour of the predictive distribution at out-of-sample x is sensitive to some hyperparameters (like Dan’s variance parameters). Cross-validation now adds the requirement of being able to make predictions at out-of-sample locations.

    We can already see this, at least partially, in a very simple example: hierarchical Gaussian models like the 3-schools example in the BDA book (p. 132, fig. 5.10 in the 3rd ed.). The likelihood for large tau (the population prior sd) is flat, but the predictive performance goes down (not shown in the book, but think about what happens to cross-validation when tau -> infinity). In this example the likelihood is higher for small tau, but even there, using a uniform prior on tau makes the inference favor models with weaker predictive performance for new schools.
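
    (A toy numerical version of this point, with made-up numbers rather than the book’s data: on a grid of tau it computes the log likelihood of tau, with mu integrated out under a flat prior, alongside the leave-one-out log predictive density, so one can check how each behaves as tau grows.)

      # Hierarchical normal model: y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2),
      # flat prior on mu.  Made-up data for three "schools."
      import numpy as np
      from scipy.stats import norm

      y = np.array([1.0, 2.0, 3.0])       # school estimates (made up)
      sigma = np.array([2.0, 2.0, 2.0])   # their standard errors (made up)

      def loglik_tau(tau):                # log p(y | tau), mu integrated out, up to a constant
          V = sigma**2 + tau**2
          w = 1.0 / V
          mu_hat = np.sum(w * y) / np.sum(w)
          return (-0.5 * np.sum(np.log(V)) - 0.5 * np.log(np.sum(w))
                  - 0.5 * np.sum(w * (y - mu_hat)**2))

      def loo_lpd(tau):                   # sum over schools of log p(y_j | y_{-j}, tau)
          V = sigma**2 + tau**2
          total = 0.0
          for j in range(len(y)):
              keep = np.arange(len(y)) != j
              w = 1.0 / V[keep]
              mu_hat = np.sum(w * y[keep]) / np.sum(w)
              total += norm.logpdf(y[j], mu_hat, np.sqrt(V[j] + 1.0 / np.sum(w)))
          return total

      for tau in [0.1, 1.0, 10.0, 100.0]:
          print(tau, loglik_tau(tau), loo_lpd(tau))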

    Mike: “Poor predictive performance usually indicates that the model isn’t sufficiently flexible to explain the data”

    Or your model (and prior) is too flexible to say anything useful out-of-sample.

    Mike: “Conceptually I like to think of the ideal case as having the “true” model lie in the convex hull of your current models; in that case pure Bayes marginalization and model comparison should work fine.”

    Agreed, but it is quite common to have model misspecification. I’ve had some examples with outliers where, when using a non-robust likelihood (like the Gaussian), cross-tuning worked better than Bayes, but when the likelihood was changed to something more reasonable, Bayes worked just fine. Cross-tuning can help to notice gross model misspecification, but in the end I would prefer to find a model where cross-tuning is not needed.

    • Aki:

      The 3-schools example is an interesting one. As you say, if you have a weak prior for tau, then the posterior density is close to flat for high values of tau, meaning that such values are possible (in the model) but they look really bad under cross-validation.

      There is then a temptation (as illustrated by Dan’s earlier comment) to declare that the cross-tuning estimate is superior, that cross-tuning gives the right answer in a setting where Bayes fails. But I don’t think this reasoning is correct. The real question (in the context of this model) is, what is tau? If tau is small, then the cross-tuning estimate will work just fine. But what if tau happens to be large and we just happened to have seen 3 schools with very similar data? This sort of thing can occur. Indeed, we know it can occur with some nontrivial probability; that’s what it means to say that the likelihood function is flat for large values of tau. In that case the cross-tuning estimate will be really bad, in the same way that a bootstrap estimate will fail if it is applied to data that just happen to be close to each other.

      This is not to say that a Bayes estimate is necessarily better; of course the performance depends on the accuracy of the model, and statistical models (not just prior distributions!) typically have a lot of “convention” to them. But I do think this example is useful in understanding why, when cross-tuning and Bayes conflict, we shouldn’t necessarily go with the cross-tuning estimate. Cross-tuning is typically set up as a way of doing point estimation (of hyperparameters). When inferential uncertainty is large, a point estimate can be way off.

      • Andrew,

        I agree with everything you write and would add that when cross-tuning and Bayes conflict, instead of settling for cross-tuning it is better to spend more time thinking about your model and prior.

        • Interesting. So we could think of cross-tuning as a form of model check or sensitivity analysis: if the cross-tuning estimate differs from the Bayesian inference, that is a relevant piece of information.

    • Aki,

      I think we’re all on the same page here, but there’s a technical detail that confuses me. When you reference troubles with big hierarchical models (“Or your model (and prior) is too flexible to say anything useful out-of-sample”), how are you doing your predictions? Are they point predictions or posterior predictive distributions?

      My understanding is that if you’re incorporating the full posterior uncertainty and doing a posterior predictive distribution then you would get a wide range of predictions but they would be consistent with any data (high variance, low bias) which I would consider the correct behavior. And if you’re interested in decreasing the predictive uncertainty, then why not just incorporate any hold-out data into the inference in the first place?

      Are you even using a hold-out sample or reusing the data with k-fold cross validation? If k-fold cross validation improved the posterior more than incorporating all the data in the first place then we have some thinking to do!

      • Mike,

        I’m using the posterior predictive distributions.

        “If k-fold cross validation improved the posterior more than incorporating all the data in the first place then we have some thinking to do!”

        Exactly.

        I’ll try to make a simple example to illustrate this (the problem is that this issue is usually observed in complex models with some model misspecification).
