Cosma Shalizi and Larry Wasserman discuss some papers from a conference on Ockham’s Razor. I don’t have anything new to add on this so let me link to past blog entries on the topic and repost the following from 2004:
A lot has been written in statistics about “parsimony” (that is, the desire to explain phenomena using fewer parameters), but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)
Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.
In practice, I often use simple models—because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
To put it another way, I don’t see parsimony, or Occamism, or whatever, as a freestanding principle. Simpler models are easier to understand, and that counts for a lot. I start with simple models and then work from there. I’m interested in the so-called network of models, the idea that we can and should routinely fit multiple models, not for the purpose of model choice or even model averaging, but so as to better understand how we are fitting the data. But I don’t think simpler models are better.
Part of my attitude might come from my social-science experience: we often hear people saying, “Your model is fine, but it should also include variables X, Y, and Z.” I never hear people complaining and saying that my model would be better if it did not include some factor or another.
In many practical settings there can be a problem when a model contains too many variables or too much complexity. But there I think the problem is typically that the estimation procedure is too simple. If you are using least squares, you have to control how many predictors you include. With regularization it’s less of an issue. So I think that, in some settings, Occam’s Razor is an alternative (and, to me, not the most desirable alternative) to using a more sophisticated estimation procedure.
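To illustrate that last point with a minimal sketch (simulated data; the setup, penalty value, and library calls are my own choices, not anything from the original post): with nearly as many predictors as observations, plain least squares fits the noise, while a ridge penalty keeps every predictor and simply shrinks the coefficients.

```python
# Minimal sketch (simulated data): with many predictors, least squares
# overfits, while ridge regularization keeps every predictor and shrinks
# the coefficients instead of forcing us to drop variables.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 100, 80                             # almost as many predictors as observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]     # only a few substantial effects
y = X @ beta + rng.normal(size=n)

X_new = rng.normal(size=(n, p))            # fresh data for out-of-sample error
y_new = X_new @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # penalty strength chosen for illustration

print("OLS   out-of-sample MSE:", mean_squared_error(y_new, ols.predict(X_new)))
print("ridge out-of-sample MSE:", mean_squared_error(y_new, ridge.predict(X_new)))
```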
The Occam applications I don’t like are the discrete versions, such as those advocated by Adrian Raftery and others, in which some version of Bayesian calculation is used to get results saying that the posterior probability is 60%, say, that a certain coefficient in a model is exactly zero. I’d rather keep the term in the model and just shrink it continuously toward zero. We discuss this sort of example further in chapter 6 of BDA. I recognize that the setting-the-coefficient-to-zero approach can be useful, especially compared to various least-squares-ish alternatives, but I still don’t really see this sort of parsimony as desirable or as some great principle; I see it more as a quick-and-dirty approximation that I’d like to move away from.
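To make the contrast concrete, here is a toy sketch (the numbers, prior scale, and threshold are mine, just for illustration): a normal prior shrinks every estimate toward zero by a continuous factor and never lands exactly on zero, while a discrete selection rule snaps small estimates to exactly zero.

```python
# Toy sketch: continuous shrinkage from a normal prior vs. a discrete
# keep-or-zero selection rule applied to the same coefficient estimates.
import numpy as np

se, tau = 1.0, 2.0                   # standard error of each estimate; prior sd
beta_hat = np.array([0.5, 1.0, 1.9, 2.1, 4.0])

# Conjugate normal posterior mean: multiply by the factor tau^2 / (tau^2 + se^2)
shrunk = beta_hat * tau**2 / (tau**2 + se**2)

# Discrete alternative: keep the estimate only if |beta_hat / se| > 2
selected = np.where(np.abs(beta_hat / se) > 2, beta_hat, 0.0)

for b, s, z in zip(beta_hat, shrunk, selected):
    print(f"estimate {b:4.1f} -> shrunk {s:4.2f}, select-or-zero {z:4.1f}")
```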
P.S. Neal Beck writes:
I think you cannot deal with this issue without specifying what you are doing.
1. Theoretical models (e.g., microeconomics): I would quote Milton Friedman’s “The Methodology of Positive Economics” (1953): “Complete ‘realism’ is clearly unattainable, and the question whether a theory is realistic ‘enough’ can be settled only by seeing whether it yields predictions that are good enough for the purpose in hand or that are better than predictions from alternative theories.” The law of demand is pretty good for many purposes, but might fail in its prediction of the impact of lowering the price of Rolexes. Here simplicity has to do with understanding and prediction for the purpose at hand.
1a. I think this is related to issues where we are nonparametrically smoothing and the various bias-variance tradeoffs. I find that in practice we can often get a nice interpretable picture if we do not ask for perfect smoothness (lowest variance), but as we allow for less and less smoothness the picture becomes hard to understand. (But cross-validation, see below, may be better than esthetics here.)
2. Pure prediction. For some types of models, we typically find that simpler models yield better out-of-sample forecasts than more complex ones. I refer in particular to the choice of lag length in ARMA models (okay, not all that exciting) and to Lütkepohl’s work showing that criteria like the BIC, which penalize complexity more strenuously, lead to better out-of-sample forecasts.
3. A focus on cross-validation often leads to the choice of simpler models (though of course the data could suggest that a more complicated model is superior). The nice thing here is that we do not have to use an esthetic criterion to choose between models; the first sketch after this list shows the idea. I do not know why we do not see more cross-validation in our discipline.
4. As a Bayesian you could just put a heavier prior on the parameters being near zero as you add more parameters. This is what Radford Neal does for his Bayesian neural nets: lots of neurons, and as the number of neurons gets large, the prior pulling each one toward zero gets stronger (the second sketch after this list illustrates the scaling).
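On Beck’s point 3, here is a minimal cross-validation sketch (simulated data; the polynomial-degree setup is my own choice of example): the out-of-sample score, rather than esthetics, decides how much complexity the data support.

```python
# Minimal sketch: 10-fold cross-validation scores models of increasing
# complexity; the score, not esthetics, picks the degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(1.5 * x[:, 0]) + rng.normal(scale=0.3, size=200)  # mildly nonlinear truth

for degree in [1, 3, 5, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.3f}")
```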
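And on point 4, a small Monte Carlo sketch of the kind of scaling Radford Neal uses (the tanh units and the specific numbers are my own illustration): giving each hidden-to-output weight a prior standard deviation proportional to 1/sqrt(H) pulls individual weights ever closer to zero as the number of hidden units H grows, while the scale of the network’s output stays stable.

```python
# Sketch of prior scaling for a Bayesian neural net's output layer:
# the weight prior sd shrinks like 1/sqrt(H), yet the sd of the output
# stays roughly constant as the number of hidden units H grows.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                                                   # base prior scale
for H in [10, 100, 1000]:
    w = rng.normal(scale=sigma / np.sqrt(H), size=(5000, H))  # scaled prior draws
    hidden = np.tanh(rng.normal(size=(5000, H)))              # hidden-unit activations
    output = (w * hidden).sum(axis=1)                         # network output
    print(f"H = {H:4d}: weight prior sd = {sigma / np.sqrt(H):.4f}, "
          f"output sd = {output.std():.3f}")
```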
In reply, let me just say that some of Neal Beck’s examples are of the least-squares sort. As we discuss further in the comments below, if you’re doing least squares (for example, in fitting ARMA models), you need to penalize those big models, but this is not such a concern if you’re regularizing.