Bayesian inference: The advantages and the risks

This came up in an email exchange regarding a plan to come up with and evaluate Bayesian prediction algorithms for a medical application:

I would not refer to the existing prediction algorithm as frequentist. Frequentist refers to the evaluation of statistical procedures but it doesn’t really say where the estimate or prediction comes from. Rather, I’d say that the Bayesian prediction approach succeeds by adding model structure and prior information.

The advantages of Bayesian inference include:
1. Including good information should improve prediction,
2. Including structure can allow the method to incorporate more data (for example, hierarchical modeling allows partial pooling so that external data can be included in a model even if these external data share only some characteristics with the current data being modeled; a small numerical sketch of partial pooling follows the risks list below).

The risks of Bayesian inference include:
3. If the prior information is wrong, it can send inferences in the wrong direction.
4. Bayes inference combines different sources of information; thus it is no longer an encapsulation of a particular dataset (which is sometimes desired, for reasons that go beyond immediate predictive accuracy and instead touch on issues of statistical communication).
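
As a concrete illustration of the partial pooling mentioned in advantage 2, here is a minimal sketch in Python. It assumes a simple normal model with known per-group standard errors and a fixed between-group standard deviation tau; the numbers are made up, and in a real hierarchical analysis tau would itself be estimated from the data.

    import numpy as np

    # Hypothetical per-group estimates y with known standard errors sigma
    # (e.g., effect estimates from several hospitals). Values are made up.
    y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
    sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

    # Assumed between-group standard deviation (fixed here for simplicity).
    tau = 5.0

    # Precision-weighted estimate of the overall mean.
    w = 1.0 / (sigma**2 + tau**2)
    mu = np.sum(w * y) / np.sum(w)

    # Partial pooling: each group estimate is pulled toward mu,
    # and noisier groups (larger sigma) are pulled more strongly.
    shrinkage = (1.0 / sigma**2) / (1.0 / sigma**2 + 1.0 / tau**2)
    theta_partial = shrinkage * y + (1.0 - shrinkage) * mu

    print(np.round(theta_partial, 1))

Groups with large standard errors end up close to the overall mean, while precisely estimated groups keep most of their own estimate; an external group that shares only the group-level structure can still inform mu and hence every other estimate.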

OK, that’s all background. The point is that we can compare Bayesian inference with existing methods. The point is not that the philosophies of inference are different—it’s not Bayes vs frequentist, despite what you sometimes hear. Rather, the issue is that we’re adding structure and prior information and partial pooling, and we have every reason to think this will improve predictive performance, but we want to check.

To evaluate, I think we can pretty much do what you say: use ROC as the basic summary, do graphical exploration, cross-validation (and related methods such as WAIC), and external validation.
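
As a rough sketch of that kind of check (not the actual evaluation pipeline for this application), the following computes a cross-validated ROC AUC for a placeholder prediction rule on synthetic data; the dataset, model, and fold count are all stand-ins.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic binary-outcome data standing in for the medical dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Placeholder prediction rule; the Bayesian and existing algorithms
    # would each be wrapped and scored the same way.
    model = LogisticRegression(max_iter=1000)

    # 10-fold cross-validated ROC AUC as the basic summary.
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print("CV ROC AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))

WAIC or leave-one-out approximations would need the full posterior from the Bayesian fit rather than this shortcut, and external validation would swap the synthetic data for a genuinely held-out dataset.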

17 thoughts on “Bayesian inference: The advantages and the risks”

  1. That sounds like a good way of comparing the algorithms’ performances given a certain problem/dataset, but it would be interesting to have a more systematic way of analyzing the effects of likelihood/prior mismatch, in a general setup. Just curious, is anyone aware of theoretical studies on such matters?

    • That would be hard because you would need a distribution over all experiments/data. The results would depend heavily on what that model looks like. You could do such an analysis for a particular domain, though.

  2. There is nothing inherently Bayesian about hierarchical models or the use of so-called prior information. It’s simply a matter of building a model that is sufficiently elaborate to capture the relevant information. It matters not whether that model is devised by a Bayesian via a prior distribution, by a frequentist that simply regards the “prior distribution” as part of a hierarchical model, or by someone who adopts the kind of “nondenominational” approach that I outlined in my recent TAS article.

    In many if not most applications of prediction, performance in repeated application is an important consideration. I’m assuming that’s at least part of what you have in mind in “risk 3.”

    • David:

      I agree. Anything Bayesian can be interpreted as a statistical procedure and be evaluated in that way. And I agree that there is a duality between Bayesian inference and hierarchical modeling, in that a prior distribution can be viewed as a distribution over a space of possible datasets or possible worlds.

      In practice, Bayesian methods can be useful as a way of balancing information from different data sources, but other principles can be used to derive regularized inferences.

      Also, I agree about performance in repeated applications, which is, in many ways, the essence of statistics. This is related to my dictum that statistics is the science of defaults.

      • Andrew, I think David’s point is that a hierarchical model is just a way to specify a model that includes integrals of parameters/latent variables with respect to distributions with meta-parameters (to achieve better flexibility or to accommodate some sort of heterogeneity). But once you have written the model and its likelihood, you can apply any method, Bayes or MLE, to estimate the meta-parameters. This is true whether or not one needs numerical methods to compute these integrals. So the model is unrelated to the estimation procedure.

    • David A. Harville. The Need for More Emphasis on Prediction: A “Nondenominational” Model-Based Approach. The American Statistician, 68(2), 2014, pp. 71–83.

      Is this the article you are referring to? Many thanks in advance!

  3. Due to the Bernstein-von Mises theorem, under mild conditions the differences disappear asymptotically, provided the true parameter falls in the support of the prior. So if your sample is large enough and the prior is absolutely continuous, it does not really matter (in regular enough models) whether one uses Bayes or maximum likelihood: the posterior distribution shrinks and degenerates around the maximum likelihood estimator as the sample grows, so the two estimators become the same and together approximate the true parameter. Differences appear with small samples, but in small samples all statistics are noisy. As for medium-size samples, everything is relative; it depends on the prior, the model, luck… there is no systematic, universal rule of comparison. In the end, the choice between Bayesian and frequentist methods is a matter of personal taste and of the type of algorithm one prefers, optimization versus numerical integration. Things are more complex, however, in models with infinite-dimensional parameters; the caveats affect both methods, but differently. In any case, I got the impression that the most versatile nonparametric estimators are based on a frequentist point of view. (A small simulation of the asymptotic agreement appears after this thread.)

    • Jose:

      As I’ve written a few thousand times now, I don’t think it makes sense to talk about a distinction between Bayes and frequentist methods. Bayes and frequentist represent different perspectives, but Bayes can be considered as a set of methods for constructing and interpreting statistical procedures, while frequentist statistics can be considered as a set of methods for evaluating statistical procedures.

      • Actually, one could say the opposite too. In any case, I agree with you that there is not that much distinction between the two approaches, at least not in most models.

    • Predictive inference is different from parametric inference. Asymptotic arguments are less relevant. The amount of information at one’s disposal is limited by the very nature of a prediction problem. If one is predicting tomorrow’s weather, the outcome of the Belmont Stakes, today’s closing price of IBM’s stock, or next-quarter’s GDP, eventually the result becomes known and the problem ceases to exist.

    • Not really. In Bayesian methods the prior is irrelevant with large samples, but with small samples… what are you getting? Probably just noise. In between, good prior beliefs/information can help you estimate the parameters better, or misguide you if they are not so good. Not so clear an advantage…
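
To illustrate the Bernstein-von Mises point raised above, here is a small Beta-Binomial simulation (a sketch of my own, not from the discussion): as n grows, the posterior mean converges to the maximum likelihood estimate and the posterior standard deviation shrinks, even though the prior mean is off.

    import numpy as np

    rng = np.random.default_rng(1)
    true_p = 0.3

    # Beta(2, 8) prior: prior mean 0.2, deliberately not centered on the truth.
    a0, b0 = 2.0, 8.0

    for n in [10, 100, 1000, 10000]:
        successes = rng.binomial(1, true_p, size=n).sum()
        mle = successes / n

        # Conjugate update: the posterior is Beta(a, b).
        a, b = a0 + successes, b0 + n - successes
        post_mean = a / (a + b)
        post_sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

        print("n=%6d  MLE=%.3f  post mean=%.3f  post sd=%.4f"
              % (n, mle, post_mean, post_sd))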

  4. >If the prior information is wrong, it can send inferences in the wrong direction.

    I’m confused. Isn’t the idea that updating always brings you some amount closer to reality, regardless of where you started?
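
Updating does move the estimate from the prior toward the data, but with a small sample a confidently wrong prior can still leave the inference far from the truth, which is the sense of risk 3 above. A minimal sketch, with made-up numbers:

    import numpy as np

    rng = np.random.default_rng(0)
    true_p = 0.8   # true event probability
    n = 20         # small sample

    successes = rng.binomial(1, true_p, size=n).sum()
    mle = successes / n

    # A confidently wrong prior: Beta(30, 70) concentrates near 0.3.
    a0, b0 = 30.0, 70.0
    post_mean = (a0 + successes) / (a0 + b0 + n)

    print("truth = %.2f, MLE = %.2f, posterior mean = %.2f"
          % (true_p, mle, post_mean))
    # The posterior mean lands much closer to the prior's guess of 0.3
    # than to the truth of 0.8: updating moved the estimate in the right
    # direction, but not by enough to overcome the bad prior.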
