David MacKay and Occam’s Razor

In my comments on David MacKay’s 2003 book on Bayesian inference, I wrote that I hate all the Occam-factor stuff that MacKay talks about, and I linked to this quote from Radford Neal:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

MacKay replied as follows:

When you said you disagree with me on Occam factors I think what you meant was that you agree with me on them. I’ve read your post on the topic and completely agreed with you (and Radford) that we should be using models the size of a house, models that we believe in, and that anyone who thinks it is a good idea to bias the model towards simpler models for some arbitrary reason is deluded and dangerous.

Your attitude is “just do honest Bayesian modelling dammit”.

I agree with you.

But my message throughout my Occam pedagogy is that when you do this – when you simply turn the handle of honest Bayesian modelling, you will (whether you like it or not!) actually be embodying Occam’s razor. You are free to ignore this fact, but it will still be true. Take a model for a function y(x) for example, which is parameterized with an infinite number of parameters, and whose prior has a hyperparameter alpha, which controls wiggliness, and which we put a hyperprior on, then get data t ~ N( y , sigma**2 ), and do inference of y and/or alpha. It will be the case that the inference prefers some values of alpha over others – the posterior on alpha usually has a bump.

This is all presumably agreed and uncontroversial.

Now I think pedagogically it is interesting to note that smaller values of alpha (associated with wigglier, more “complex” functions) would have “fitted” the data better, but they are not preferred; the value of alpha at the peak of the bump is preferred, more probable, etc.

And I think calling this an automatic Occam’s razor is appropriate and helpful for people’s understanding.

Hope this helps clarify what I intended my book to be explaining. Maybe you can decide that you like the Occam bit after all.

When you think about it from an information theory / data compression point of view, the Occam story becomes very compelling. P <-> l = log 1/P … Replace “Simpler explanation” by “Briefer explanation”, and it becomes effectively a tautology. “Of course the briefest explanation is the most probable explanation! Because brevity is just log P.” You might enjoy reading about the “bits back” coding method of Geoff Hinton, which accurately embodies this data-compression / Bayesian phenomenon for quite complicated models including latent variables.

P.S. One reason why I think it is important for people to understand the automatic Occam effects that exist in Bayesian inference is this: when they implement a computational method that approximates perfect inference (eg a Monte Carlo method), it is important to have insight into how the inference works, using perhaps thermodynamic or description-length language; one needs to understand how going from the prior to the posterior usually involves an enormous collapse in accessible state-space volume, and have an understanding of what happens to log-likelihoods (energies) and log-volume-ratios (entropies) as one changes hyperparameters or as one changes between state spaces of different dimensions. I reckon that people who understand these issues may write good Monte Carlo methods for navigating around interesting models. I reckon that people who don’t understand these issues often write atrociously bad Monte Carlo methods that (eg) claim to perform Bayesian model comparison but don’t.

So I think it is great to speak the language of energy and entropy; the complementary language of bits and description lengths; and the language of log-likelihoods added to Occam factors. It may be relevant to point readers to my (1992) tutorial paper called “Bayesian Interpolation” in addition to my book.
http://www.inference.phy.cam.ac.uk/mackay/PhD.html
There I explain this terminology [marginal likelihood = best fit likelihood * occam factor] with reasonably simple examples.

And here’s my reply to MacKay:

I don’t know what it means for a model to be “the size of a house,” but I do know that I almost always wish I was fitting a model bigger than what I’m actually using. Typically what stops me is not computation but rather the crudeness of my models. Bigger models need bigger prior distributions, and until recently I hadn’t thought very seriously about how to do this.

I like your wiggly-function example, and I’m happy to associate that with Occam’s Razor. The Occam applications I don’t like are the discrete versions such as advocated by Adrian Raftery and others, in which some version of Bayesian calculation is used to get results saying that the posterior probability is 60%, say, that a certain coefficient in a model is exactly zero. I’d rather keep the term in the model and just shrink it continuously toward zero. We discuss this sort of example further in chapter 6 of BDA.

MacKay’s formulation in terms of “briefer explanations” is interesting but it doesn’t really work for me because, from a modeling point of view, once I’ve set up a model I’d like to keep all of it, maybe shrinking some parts toward zero but not getting rid of coefficients entirely.

But I expect that, ultimately, MacKay and Hinton are correct even there, because from an information-theoretic point of view, if a parameter is smaller, it takes fewer bits to encode it. Ultimately, coefficients have to be interpreted on some real scale, and it takes less space to code an estimate of 0.02 than to code 2.13, for example. This doesn’t fit into the usual “Bayesian model comparison” or “Bayesian model averaging” that I’ve seen, but I could well believe that the principle holds on some deep level.

In practice, though, I worry that “Occam’s razor” is more likely being used as an excuse to set parameters to zero and to favor simpler models rather than to encourage researchers to think more carefully about their complicated models (as in Radford’s quote above).

1. David J. Harris says:

This sounds a bit like a family of shrinkage methods I’ve been playing with (L1 regularization, aka “the lasso,” and its relatives). They have a decent Bayesian interpretation (below), and I was wondering if you’d like to weigh in on them relative to this Occam approach. From a Bayesian perspective, the basic idea is that you pick a “pointy” prior (usually the Laplace double-exponential distribution) instead of something with a differentiable peak like a Gaussian. Apart from that, I think this prior is fairly similar to the weakly informative priors you’ve been discussing (it even has long tails like t or Cauchy). Then you parameterize your model using the maximum of the posterior distribution (instead of the mean, like in most Bayesian methods). So in some sense, it’s more like penalized maximum likelihood than it is like typical Bayes.

Anyway, the upshot is that you continuously shrink your estimates toward zero, like you suggested above. And while it doesn’t pile 60% of the posterior at any particular value, it can still give you sparse parameter estimates because the mode, unlike the mean, can be shrunk all the way toward zero when the prior is sufficiently “pointy.” So in that sense it seems to get the best of both worlds.

There are lots of empirical and theoretical results saying that this works well for prediction, and it has some computational advantages over MCMC and its relatives when the number of parameters is very large. But it’s not “really” Bayesian in that it doesn’t give you the full distribution like an MCMC-based approach does.

Some of this might just be a cultural difference: the factors you study (e.g. race, gender, income, etc.) almost certainly have nonzero effects on voting behavior, but machine learning folks throw so many data sources into their models that most of them are bound to be worthless as predictors. So they arguably need some kind of automatic variable selection in a way that you don’t. Does that make sense? Or do you think that your approach would work better in those areas as well?

2. Sean Matthews says:

Very useful exchange. Thanks for that.

3. Ted Dunning says:

0.02 isn’t necessarily going to take more or fewer bits than 2.23. Whether it does depends on your prior since that is what you are going to encode it with. If the prior likes 2.23 better than 0.02, then fewer bits will be needed.

That is Mackay’s point … that MAP estimation is equivalent to finding the smallest model when the model is compressed using the prior.

4. Gustav says:

Well, in practice you might want a simpler model that performs rather well (say predicts 90% correct) than a complicated that performs slightly better (predicts 95% correct). In ecology for example, say that you want to predict suitable habitats for a species. You have 10 environmental variables that you know are important and they are all rather complicated to measure (time/cost-consuming). I would rather pick a model with only 2 variables that predict almost as well as a model with 10 variables – then I only have to measure these two variables when I investigate how suitable new patches are.

• George says:

Before I make a decision on model choice, based on resources, urgency, accuracy, purpose… It helps to have a good approximation / understanding of the least wrong model, which, almost always, is the most complex model.

5. K? O'Rourke says:

Seems related to work by Stephen Stigler where he provides arguments for “regression to the mean” as being one of the fundamental ideas of statistics.

At least the picture of reducing prior volume seems the same http://galton.uchicago.edu/index_description.shtml

6. Brian says:

I don’t get this: “But I expect that, ultimately, MacKay and Hinton are correct even there, because from an information-theoretic point of view, if a parameter is smaller, it takes fewer bits to encode it. Ultimately, coefficients have to be interpreted on some real scale, and it takes less space to code an estimate of 0.02 than to code 2.13, for example.

How does 2.13e0 require fewer bits than 2.13e-2?

Why are you assuming that smaller magnitudes have fewer significant figures?

7. If you interpret Bayesian output in an information-theoretic way (which I learned from MacKay’s book), then the shortest possible message you can send is the Bayesian mixture of all the models you are considering. So if Occam is saying “choose the shortest message you can send that contains all the phenomena” then Bayes does indeed encode Occam, but by saying you should not choose one particular model, but instead by mixing *all* your models. I realize i am not being clear here, but my point is that if Bayes encodes Occam, it is NOT in the sense that MacKay says and to which you object, no?