Difficulties with Bayesian model averaging

In response to this article by Cosma Shalizi and myself on the philosophy of Bayesian statistics, David Hogg writes:

I [Hogg] agree–even in physics and astronomy–that the models are not “True” in the God-like sense of being absolute reality (that is, I am not a realist); and I have argued (in a philosophically very naive paper, but hey, I was new to all this) that for pretty fundamental reasons we could never arrive at the True (with a capital “T”) model of the Universe. The goal of inference is to find the “best” model, where “best” might have something to do with prediction, or explanation, or message length, or (horror!) our utility. Needless to say, most of my physics friends *are* realists, even in the face of “effective theories”, as Newtonian mechanics is an effective theory of GR and GR is an effective theory of “quantum gravity” (this plays to your point, because if you think any theory is possibly an effective theory, how could you ever find Truth?). I also liked the ideas that the prior is really a testable regularization, and part of the model, and that model checking is our main work as scientists.

My only issue with the paper is around Section 4.3, where you say that you can’t even use Bayes to average or compare the probabilities of models. I agree that you don’t think any of your models are True, but if you decide that what the scientist is trying to do is explain or encode (as in the translation between inference and signal compression), then model averaging using Bayes *will* give the best possible result. That is, it seems to me like there *is* an interpretation of what a scientist *does* that makes Bayesian averaging a good idea. I guess you can say that you don’t think that is what a scientist does, but that gets into technical assumptions about epistemology that I don’t understand. I guess what I am asking is: Don’t you use–as you are required by the rules of measure theory–Bayesian averaging, and isn’t it useful? Same with updating. They are useful and correct. It is just that you are not *done* when you have done all that; you still have to do model checking and expanding and generalizing afterwards (but even this can still be understood in terms of finding the best possible effective theory or encoding for the data).

Yet another way of trying to explain my confusion is this: When you describe the convergence process in a model space that *doesn’t* contain the truth, you say that all it tries to do is match the distribution of the data. But isn’t that what science *is*? Matching the distribution of the data with a simpler model? So then Bayes is doing exactly what we want!

My reply:

Bayesian model averaging could work, and in some situations it does work, but it won’t necessarily work. The problem arises with the models being averaged: the posterior probabilities of the individual models depend crucially on untestable aspects of their prior distributions. In particular, flat priors give marginal likelihoods of zero, and approximations to flat priors cause extreme sensitivity. For example, if you use a N(0,A^2) prior with very large A, then the marginal posterior probability of your model will be proportional to 1/A, hence it matters a lot whether A is 100 or 1000 or 1 million, even though the choice won’t matter at all for inference within the model.
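To make the 1/A point concrete, here is a minimal sketch (a toy example of my own, assuming a single normal mean with known unit variance; the data and settings are made up for illustration):

```python
# Minimal sketch (toy example, assuming a single normal mean with known unit
# variance): the marginal likelihood under a theta ~ N(0, A^2) prior scales
# like 1/A once A is much larger than the scale of the data, even though
# inference for theta within the model is essentially unaffected by A.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 20
y = rng.normal(loc=1.5, scale=1.0, size=n)   # data from N(theta = 1.5, 1)

def log_marginal_likelihood(y, A):
    """log p(y | A) for y_i ~ N(theta, 1), theta ~ N(0, A^2).
    Marginally, y ~ N(0, I + A^2 * 11')."""
    n = len(y)
    cov = np.eye(n) + A**2 * np.ones((n, n))
    return multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

def posterior_mean_sd(y, A):
    """Conjugate posterior for theta is normal; return its mean and sd."""
    prec = len(y) + 1.0 / A**2
    return y.sum() / prec, np.sqrt(1.0 / prec)

for A in [1e2, 1e3, 1e6]:
    lml = log_marginal_likelihood(y, A)
    m, s = posterior_mean_sd(y, A)
    print(f"A = {A:9.0f}   log p(y|A) = {lml:8.2f}   theta = {m:.3f} +/- {s:.3f}")
# The log marginal likelihood drops by log(10) for each tenfold increase in A
# (the 1/A behavior), while the posterior for theta barely moves. In a model
# average, that arbitrary factor feeds straight into the posterior weights.
```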

This is not to say that posterior model averaging is necessarily useless, merely that if you want to do it, I think you need to think seriously about the different pieces of the super-model that you’re estimating. At this point I’d prefer continuous model expansion rather than discrete model averaging. We discuss this point in chapter 6 of BDA.
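As one illustration of what continuous model expansion can look like (again a toy sketch of my own, not an example from BDA): rather than putting discrete weights on a normal-errors model and a Cauchy-errors model, embed both in a Student-t family and infer the degrees of freedom directly.

```python
# Minimal sketch (toy example, not from BDA): continuous model expansion in
# place of discrete model averaging. Instead of averaging a "normal errors"
# model and a "Cauchy errors" model with posterior weights, embed both in a
# Student-t family with unknown degrees of freedom nu (nu = 1 is Cauchy,
# large nu is approximately normal) and infer nu from the data.
# Location and scale are fixed at 0 and 1 here to keep the example tiny.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.standard_t(df=4, size=200)          # heavy-tailed toy data

# Grid approximation to the posterior of nu, flat prior on log(nu).
log_nu_grid = np.linspace(np.log(1.0), np.log(100.0), 200)
nu_grid = np.exp(log_nu_grid)
log_lik = np.array([stats.t.logpdf(y, df=nu).sum() for nu in nu_grid])
post = np.exp(log_lik - log_lik.max())
post /= post.sum()                          # normalize on the uniform log grid

print("posterior mean of nu:", np.sum(nu_grid * post))
# The data tell us where we sit on the continuum between Cauchy and normal
# tails; there are no discrete model weights, and no 1/A-style sensitivity
# to an arbitrary prior scale attached to a model indicator.
```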

David’s follow-up:

I agree very much with this point (the 1/A point, which is a huge issue in Bayes factors), and in part for this reason I have been using leave-one-out cross-validation to do model comparison (no free parameters, justifiable, makes sense to scientists and engineers, even skeptical ones, etc). I would also be interested in your opinion about leave-one-out cross-validation; my engineer/CS friends love it.

My reply:

Cross-validation is great and I’ve used it on occasion. I don’t fully understand it, though. This is not a criticism; I just want to think harder about it at some point. To me, cross-validation is tied into predictive model checking in that ideas such as “leave one out” are fundamentally related to data collection. Cross-validation is like model checking in that the data come in through the sampling distribution, not just the likelihood.
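For readers who haven’t seen it spelled out, here is a minimal sketch of leave-one-out cross-validation for model comparison (a made-up polynomial-regression example, scored by squared prediction error rather than a log predictive score):

```python
# Minimal sketch (toy example): leave-one-out cross-validation for comparing
# models, scoring each held-out point by squared prediction error.
# No prior probabilities over the models are needed.
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = np.linspace(-2, 2, n)
y = 0.5 * x + 0.3 * x**2 + rng.normal(scale=0.5, size=n)   # quadratic truth

def loo_mse(x, y, degree):
    """Mean squared leave-one-out prediction error of a polynomial fit."""
    errors = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        coefs = np.polyfit(x[train], y[train], deg=degree)
        errors.append((y[i] - np.polyval(coefs, x[i])) ** 2)
    return float(np.mean(errors))

for degree in [1, 2, 5]:
    print(f"degree {degree}: LOO mean squared error = {loo_mse(x, y, degree):.3f}")
# The quadratic typically wins: the degree-5 model fits the training data
# more closely but tends to predict held-out points worse.
```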

8 thoughts on “Difficulties with Bayesian model averaging”

  1. This may be naive, but I don't see what is unique about model averaging. A model formed by averaging other models is still just a model. And isn't all Bayesian inference model averaging if you think of each possible value of an uncertain parameter as a model?

  2. Yes, it is just another model, but it also quantifies (I think in a quite natural way) the uncertainty that is present due to model selection. Several models could supply similar information, and if model validation techniques fail to discriminate between these models, then which one should we choose? My impression of BMA is that it is an elegant and useful procedure for making better probabilistic statements (regarding uncertainty), which is very important in reliability theory.

  3. John:

    Yes, you can think of Bayesian model averaging as inference under a large model that includes the individual models as special cases. As noted above, the difficulties come if this large model is not carefully defined. Certain parameters in the prior distributions of the individual models (such as the scale parameter "A" mentioned in the blog above) have essentially no effect on inference within each individual model, but they can be crucial to the posterior weights. My problem with some of the Bayesian model averaging that I've seen is that the separate models don't always fit together so well.

    Tomas:

    Yes, if that super-model makes sense. No, if the super-model doesn't fit together well. Consider the "A" parameter above. I agree that model averaging can be a useful tool, but I think you should be really wary of how you interpret those posterior probabilities!

  4. Prof. Gelman, have you heard of PAC-learning? Since you are discussing a lot of philosophy of Bayesianism these days, it would be nice to hear what you have to say about it.

    The only instance where I read about it was following a link provided on Cosma Shalizi's site/blog, where he posted the abstract of your paper with him, and then some mockery of Bayesianism.

    Here is the link [1] to the original Cosma post, and here is the link [2] to the page I arrived at from Cosma's site/blog.

    Here is an excerpt of the text about PAC-learning (a small sketch of the sample-complexity bound it alludes to follows after the links below):

    Q: But there should be some point where the two [Bayesian and PAC-learning] either are reconciled, or disagree.

    Scott: I can speak to that. The Bayesians start out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’ Rule. That’s one way to do it, but computational learning theory tells us that it's not the only way. You don’t need to start out with any assumption about a probability distribution over the hypotheses. You can make a worst-case assumption about the hypothesis (which we computer scientists love to do, being pessimists!), and then just say that you'd like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians' probability distribution over hypotheses for a probability distribution over sample data. In a lot of cases, this is actually preferable: you have no idea what the true hypothesis is, which is the whole problem, so why should you assume some particular prior distribution? We don’t have to know what the prior distribution over hypotheses is in order to apply computational learning theory. We just have to assume that there is a distribution.

    [1] http://cscs.umich.edu/~crshalizi/weblog/664.html

    [2] http://www.scottaaronson.com/democritus/lec15.htm
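    For concreteness, here is a small sketch of the standard finite-hypothesis-class PAC sample-complexity bound that the excerpt alludes to (a textbook result, not taken from the linked lecture; the numbers are made up for illustration):

```python
# Minimal sketch (standard textbook bound, not from the linked lecture):
# PAC sample complexity for a finite hypothesis class H in the realizable
# case. With at least m samples, any hypothesis consistent with the data has
# true error at most epsilon with probability at least 1 - delta, for any
# distribution over examples and with no prior over hypotheses.
import math

def pac_sample_complexity(h_size: int, epsilon: float, delta: float) -> int:
    """m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**20 hypotheses, 5% error, 99% confidence: about 370 samples.
print(pac_sample_complexity(2**20, epsilon=0.05, delta=0.01))
```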

  5. Manoel: I don't like putting a probability distribution over hypotheses, for reasons discussed above, in chapter 6 of BDA, and in my paper with Shalizi.

  6. I've struggled to understand your point (your main, main point) for a while, but this article with Shalizi somehow turned the page for me. Nicely written, just tight enough; I finally think I'm (closer to!) seeing it. Or so I hope. Certainly a whole bunch of things you say on this blog or in your texts have become "Ah, it's obvious why he says that…". Not to say I have it right, of course!

    But do you ever feel that you are adding more confusion than light by adopting the "Bayesian" adjective for your work? I see you building complex models and doing Bayes within them (but who could object, it's an elementary mathematical result!) yet profoundly denying the utility (or even the reality, in statistical practice) of "degrees of belief over several models". You addressed this head on and very well _later_ in your paper, but in the first half I was tearing my hair out, saying to myself: that's not the _model_ … that's a prior (one's _beliefs_) among a set of models!

    It's one thing to rebut the usefulness or intellectual coherence of the "degrees of belief over models" approach. But that's what I myself think of as Bayesian (in the interesting sense, not the degenerate "I've used Bayes' theorem in a complex setting" sense). Yet you assert that your approach is "Bayesian" (albeit statistically realistic), and to me at least this has been profoundly confusing. Am I alone?

    You are contemplating writing a handbook article on the philosophy of Bayesian statistics, presumably representing your views as such, so it's pretty much certain that my concern is either entirely silly or not something you would accept :-(

  7. But if the set of models contains several obviously wrong models (maybe due to mistakes in our goodness-of-fit assessment) and several (or maybe just one) models with good fit, then the probabilities should be very small for the wrong models and much higher for those with good fit. And in the end the inferences will mainly be influenced by the good-fitting models. So it seems that we don't have to worry about bad choices in our set of models as long as we know that the set contains at least one good model. Or am I missing something about BMA?

  8. Tomas:

    Yes, this can work–as long as the super-model makes sense. But if the super-model does not make sense, you can get ill-fitting models with high posterior probability. Probability-weighted averaging makes sense if the probabilities make sense. Recall the 1/A problem noted above.

    If you want to make model-averaged predictions, you might be better off optimizing over a predictive error measure directly.
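    In the same spirit as the leave-one-out sketch earlier in the post, here is a toy illustration (my own made-up example, not a prescription from this reply) of choosing the averaging weights by directly minimizing leave-one-out prediction error instead of interpreting them as posterior model probabilities:

```python
# Minimal sketch (toy example, not a prescription from this thread): choose
# model-averaging weights by directly minimizing leave-one-out prediction
# error ("stacking"-style), rather than using posterior model probabilities.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def loo_predictions(x, y, degree):
    """Leave-one-out predictions from a polynomial model of the given degree."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        coefs = np.polyfit(x[train], y[train], deg=degree)
        preds[i] = np.polyval(coefs, x[i])
    return preds

pred_a = loo_predictions(x, y, degree=1)   # model A: linear
pred_b = loo_predictions(x, y, degree=3)   # model B: cubic

# With two models the weight is a scalar; search w in [0, 1] for the blend
# with the smallest leave-one-out squared error.
w_grid = np.linspace(0, 1, 101)
errors = [np.mean((y - (w * pred_a + (1 - w) * pred_b)) ** 2) for w in w_grid]
w_best = w_grid[int(np.argmin(errors))]
print(f"weight on the linear model: {w_best:.2f}")
# These weights are judged by predictive performance, not interpreted as
# probabilities that either model is "true".
```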
