A lot has been written in statistics about “parsimony” (that is, the desire to explain phenomena using fewer parameters), but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)
Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.
In practice, I often use simple models—because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!
My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104:
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
Exactly!
P.S. regarding the title of this entry: there’s an interesting paper by Albert Hirschman with this title.
If anyone is interested in what philosophers of science think about the issue of parsimony, Malcolm Forster's home page is a good starting point, especially the three chapters of the manuscript "Occam’s Razor and the Relational Nature of Evidence": http://philosophy.wisc.edu/forster/.
But what exactly is "simplicity"? The number of parameters is not truly a measure of simplicity, as we can have many constrained parameters or few unconstrained ones. The degrees of freedom might be a semi-objective statistical measure of simplicity. And we need to keep the degrees of freedom aligned with the amount of data; otherwise the model doesn't crystallize, the posterior is hazy, and the null cannot be rejected, so we cannot be sure about the inferences made with the model.
Regarding the comment by Aleks: "we need to keep the degrees of freedom aligned with the amount of data, otherwise the model doesn't crystallize, the posterior is hazy, the null cannot be rejected, so we cannot be sure about the inferences made with the model."
This is quite contrary to Bayesian methodology. If the model and prior are correct expressions of prior beliefs, they are correct regardless of how much data we have. There is no justification in the Bayesian framework for changing the model or prior based on the number of data points gathered.
Of course, the posterior may well be hazy if you have only a few data points. But making the model simpler isn't the solution. More data is the solution.
I agree with Radford, but there are circumstances where getting more data is impossible. The dilemma is then between these two alternatives:
* (1) a vague posterior;
* (2) a different prior and model space.
If the prior and model space are true, as is frequently assumed in the Bayesian methodology, then (1) is indeed the only option. Unfortunately, vague posteriors are not particularly useful for drawing conclusions. Many applications of Bayesian methodology in social and political science do include posterior confidence intervals to characterize the posterior uncertainty. If there is too much posterior uncertainty, the research has been a failure.
(2) may sound unclean, but is commonly practiced – how else could one come up with the priors and the model spaces in the first place anyway? Is there anything wrong if one comes up with an agreeable prior and a crisp comprehensive posterior through these means? In empirical science we don't and shouldn't generally believe in The One Correct prior or model space, but instead seek to question existing ones and develop new ones.
Reading through Radford's nice recent tutorial at ftp://ftp.cs.utoronto.ca/pub/radford/bayes-tut.ps I see that I have not been clear about a very important point – that the posterior involves two kinds of parameters:
* nuisance parameters
* query(?) parameters
We generally integrate out the nuisance parameters. The query parameters (or some other term we might give them) are the ones we are interested in. It is not a problem for inference if the posterior distribution of nuisance parameters is vague.
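To make "integrating out" concrete, here is a minimal sketch under stated assumptions: normal data where the mean is the query parameter and the standard deviation is the nuisance parameter. The function name, the grid ranges, the made-up data, and the flat prior on (mu, log sigma) are all illustrative choices, not anything taken from the tutorial.

```python
import math

def marginal_posterior_mu(data, mu_grid, sigma_grid):
    """Grid approximation to p(mu | data), with sigma summed out.

    Assumes y_i ~ Normal(mu, sigma) and a flat prior on (mu, log sigma),
    i.e. a prior density proportional to 1/sigma. Both grids are assumed
    to be evenly spaced, so grid spacing cancels when normalizing.
    """
    n = len(data)
    log_joint = []
    for mu in mu_grid:
        ss = sum((y - mu) ** 2 for y in data)
        # log likelihood (-n log sigma - ss / 2 sigma^2) plus log prior (-log sigma)
        log_joint.append([
            -(n + 1) * math.log(sigma) - ss / (2 * sigma ** 2)
            for sigma in sigma_grid
        ])
    peak = max(max(row) for row in log_joint)  # subtract before exp for stability
    # Summing each row over sigma is the "integrate out the nuisance" step.
    marginal = [sum(math.exp(v - peak) for v in row) for row in log_joint]
    total = sum(marginal)
    return [m / total for m in marginal]

data = [1.2, 0.8, 1.9, 0.4, 1.1]                     # made-up observations
mu_grid = [i * 0.02 for i in range(-100, 201)]       # -2.0 to 4.0
sigma_grid = [0.05 + j * 0.05 for j in range(100)]   # 0.05 to 5.0
post_mu = marginal_posterior_mu(data, mu_grid, sigma_grid)
mode = mu_grid[post_mu.index(max(post_mu))]
print(f"marginal posterior mode of mu: {mode:.2f}")
```

The returned list is a distribution over the query parameter alone; how vague the posterior for sigma is affects its spread but never has to be reported separately.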
Parsimonious models are easier to comprehend and are therefore more persuasive. If you're in purely descriptive mode, then it makes sense to throw in every nuance that may be valid, but most every journal article out there is a persuasive essay where descriptive stats are merely a vehicle toward testing a hypothesis. Of course, in testing an "A causes B" hypothesis, everything but A and B is noise to be dismissed as quickly as possible.
Readers want to feel smarter, to feel that they've learned something about how the world works. A model which acknowledges that the world is far too difficult for us to comprehend doesn't achieve this, while a story about A directly causing B is easy to assimilate and pull out of one's back pocket whenever needed.
So parsimony is here to stay, because readers and authors both benefit from it, because both have goals which are only partially related to describing reality.
I think David Freedman said it well:
"In social-science regression analysis, usually the idea is to fit a curve to the data, rather than figuring out the process that generated the data. As a matter of fact, investigators often talk about 'modeling the data.' This is almost perverse: surely the object is to model the phenomenon, and the data are interesting only because they contain information about that phenomenon. Whatever it is that most social scientists are doing when they construct regression models, discovering natural laws does not seem to be uppermost in their minds."
"With four parameters I can fit an elephant and with five I can make him wiggle his trunk." — John von Neumann
Source: http://dx.doi.org/doi:10.1038/427297a
Regarding the comments by Aleks and Radford:
In principle, models (at least for social-science phenomena) should be ever-expanding flowers that have within them the capacity to handle small data sets (in which case, inferences will be pulled toward prior knowledge) or large data sets (in which case, the model will automatically unfold to allow the data to reveal more about the phenomenon under study). A single model will have zillions of parameters, most of which will barely be "activated" if sample size is not large.
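The "pulled toward prior knowledge" behavior can be seen in the simplest conjugate case. What follows is a rough sketch, assuming a single normal mean with known data standard deviation and a normal prior; it is far simpler than the many-parameter models described above, and the function name and the numbers are made up for illustration.

```python
def posterior_for_mean(ybar, n, sigma, prior_mean, prior_sd):
    """Posterior mean and sd for a normal mean with known data sd `sigma`
    and a Normal(prior_mean, prior_sd) prior (standard conjugate result)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    # Fraction of the posterior mean that comes from the data rather than the prior.
    weight_on_data = data_precision / (data_precision + prior_precision)
    post_mean = weight_on_data * ybar + (1.0 - weight_on_data) * prior_mean
    post_sd = (data_precision + prior_precision) ** -0.5
    return post_mean, post_sd, weight_on_data

# Same model and same prior throughout; only the sample size changes.
for n in (2, 20, 2000):
    mean, sd, w = posterior_for_mean(ybar=1.0, n=n, sigma=2.0,
                                     prior_mean=0.0, prior_sd=0.5)
    print(f"n={n:5d}  weight on data={w:.3f}  posterior mean={mean:.3f}  sd={sd:.3f}")
```

Nothing about the model is changed between runs: with few observations the estimate stays near the prior mean, and with many the data take over.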
In practice, those of us who rely on regression-type models and estimation procedures can easily lose control of large models when fit to small datasets. So, in practice, we start with simple models that we understand, and then we complexify them as needed. This has sometimes been formalized as a "sieve" of models and is also related to Cantor's "diagonal" argument from set theory. (In this context, I'm saying that for any finite class of models, there will be a dataset for which these models don't fit, thus requiring model expansion.)
But I agree with Radford that complex models are better. If I use a simpler model because I have difficulty understanding something more complex, I'm certainly not proud of myself!
Finally, it's my impression that statisticians and computer scientists working in dense-data settings (for example, speech recognition, vision, data mining) have been somewhat successful at developing highly complex models with many more parameters than we will use in social science. I'm thinking here of work by Radford Neal, Bob Carpenter, Yingnian Wu, and others.
Regarding the quote from Freedman:
I don't know what he means by saying that something is "usually" done in "social science data analysis." A statement such as "usually" implies a numerator and a denominator.
The numerator is the number of times that "the idea is to fit a curve to the data, rather than figuring out the process that generated the data."
The denominator is the total number of "social science data analyses."
I don't know how Freedman is defining his numerator or denominator and so I find it difficult to evaluate his statement.
Regarding the quote from von Neumann:
That's nice. All I can say in my defense is that in social science, we are fitting an entire population of animals, not just one. And humans are more complicated than elephants! From this perspective, 2 parameters per person (e.g., varying intercepts and slopes) actually seems parsimonious!
Isn't the data that is seen as a random sample already a complex enough pseudo-model? That's the foundation for the bootstrap. For example, in the nonparametric bootstrap one assumes a multinomial model where outcomes correspond to data instances. This multinomial distribution can then be seen as a universal complex model for the data. The particular data set is just the likeliest sample from the model.
The multinomial model is not considered to be the final result; it is just a source of uncertainty, like the Bayesian prior. The uncertainty will propagate to various queries about the data (such as, is the mean greater than zero?), and the answers will also be uncertain.
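A small sketch of that idea, assuming the query is "is the mean greater than zero": each resample is a draw from the multinomial model that puts equal weight on the observed data points, and the answer is the share of resamples in which the query holds. The function name, the number of resamples, the seed, and the data are illustrative assumptions.

```python
import random
from statistics import fmean

def bootstrap_prob_mean_positive(data, n_resamples=2000, seed=0):
    """Share of nonparametric bootstrap resamples whose mean exceeds zero."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    hits = 0
    for _ in range(n_resamples):
        # Draw n points with replacement, each observation equally likely.
        resample = rng.choices(data, k=n)
        if fmean(resample) > 0:
            hits += 1
    return hits / n_resamples

observations = [-1.0, 0.5, 2.0, 0.3, -0.2, 1.4]  # made-up data
share = bootstrap_prob_mean_positive(observations)
print(f"share of resamples with mean > 0: {share:.3f}")
```

The output is a proportion rather than a yes or no, which is the sense in which the uncertainty in the resampling model carries through to the answer.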
—
As for making use of complex models, there is an old story told in computer vision about a model that perfectly distinguished enemy tanks from friendly tanks. It worked perfectly on the data.
Later on, they looked more closely and found that it didn't really distinguish enemy from friendly tanks, but instead discriminated tanks in the open, photographed in broad daylight, from concealed tanks photographed in the dark. It was just an artifact of the data that all the friendly tanks were represented by good-quality pictures, and all the enemy tanks by bad-quality pictures.
—
So complex models, yes, as sources of uncertainty to be used when answering understandable queries about the data. Complex models, yes, as proxies for data when building understandable models. But complex models, no, when offered as a black-box explanation of a phenomenon, even if they get excellent objective performance measure scores.
But the vision of the complex-unfolding-model is very enticing…
A long comment from Aleks:
I replied:
Then Aleks:
Finally, my (Andrew's) response to all this:
First, I don't think that Radford would say that the prior is "true." The model is an assumption, which along with data gives you posterior inferences which you can look at, try to understand, compare to data, etc. You can use this for decision analysis if you'd like although I don't think that an explicit decision analysis is always, or even often, necessary. I don't really understand what H. Rubin is saying about the social sciences but perhaps with an example it would be clearer.
Information theory provides a natural link between Bayesian approaches and parsimony, by expressing theories and data in the same language. Minimum Message Length provides an explicitly Bayesian look at explanation vs. prediction: a quantitative version of Ockham's razor.
Perhaps you object to the assumption that in the absence of a strong prior, less complicated theories should get shorter codes (hence higher probability). Ultimately, there is no justification, as Hume demonstrated. But if you are willing to assume some stable structure to the universe, or even just take past performance as a guide, there is good support for the view.