
Against parsimony

A lot has been written in statistics about “parsimony” (that is, the desire to explain phenomena using fewer parameters), but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)

Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.

In practice, I often use simple models–because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

Exactly!

P.S. regarding the title of this entry: there’s an interesting paper by Albert Hirschman with this title.

13 Comments

  1. Aleks says:

    If anyone is interested what philosophers of science think about the issue of parsimony, Malcolm Forster's home page is a good starting point, especially the three chapters of the manuscript "Occam’s Razor and the Relational Nature of Evidence": http://philosophy.wisc.edu/forster/.

    But what exactly is "simplicity"? The number of parameters is not truly a measure of simplicity, as we can have many constrained parameters or few unconstrained ones. The degrees of freedom might be a semi-objective statistical measure of simplicity. And we need to keep the degrees of freedom aligned with the amount of data, otherwise the model doesn't crystalize, the posterior is hazy, the null cannot be rejected, so we cannot be sure about the inferences made with the model.

  2. Radford Neal says:

    Regarding the comment by Aleks: "we need to keep the degrees of freedom aligned with the amount of data, otherwise the model doesn't crystalize, the posterior is hazy, the null cannot be rejected, so we cannot be sure about the inferences made with the model."

    This is quite contrary to Bayesian methodology. If the model and prior are correct expressions of prior beliefs, they are correct regardless of how much data we have. There is no justification in the Bayesian framework for changing the model or prior based on the number of data points gathered.

    Of course, the posterior may well be hazy if you have only a few data points. But making the model simpler isn't the solution. More data is the solution.

  3. Aleks says:

    I agree with Radford, but there are circumstances where getting more data is impossible. The dilemma is then between these two alternatives:

    * (1) a vague posterior;

    * (2) a different prior and model space.

    If the prior and model space are true, as is frequently assumed in the Bayesian methodology, then (1) is indeed the only option. Unfortunately, vague posteriors are not particularly useful for drawing conclusions. Many applications of Bayesian methodology in social and political science do include posterior confidence intervals to characterize the posterior uncertainty. If there is too much posterior uncertainty, the research has been a failure.

    (2) may sound unclean, but is commonly practiced – how else could one come up with the priors and the model spaces in the first place anyway? Is there anything wrong if one comes up with an agreeable prior and a crisp comprehensive posterior through these means? In empirical science we don't and shouldn't generally believe in The One Correct prior or model space, but instead seek to question existing ones and develop new ones.

  4. Aleks says:

    Reading through Radford's nice recent tutorial at ftp://ftp.cs.utoronto.ca/pub/radford/bayes-tut.ps I see that I have not been clear about a very important point – that the posterior involves two kinds of parameters:

    * nuisance parameters

    * query(?) parameters

    We generally integrate out the nuisance parameters. The query parameters (or some other term we might give them) are the ones we are interested in. It is not a problem for inference if the posterior distribution of nuisance parameters is vague.

  5. Parsimonious models are easier to comprehend and are therefore more persuasive. If you're in purely descriptive mode, then it makes sense to throw in every nuance that may be valid, but most every journal article out there is a persuasive essay where descriptive stats are merely a vehicle toward testing a hypothesis. Of course, in testing an 'A causes B' hypothesis, everything but A and B is noise to be dismissed as quickly as possible.

    Readers want to feel smarter, to feel that they've learned something about how the world works. A model which acknowledges that the world is far too difficult for us to comprehend doesn't achieve this, while a story about A directly causing B is easy to assimilate and pull out of one's back pocket whenever needed.

    So parsimony is here to stay, because readers and authors both benefit from it, because both have goals which are only partially related to describing reality.

  6. MDM says:

    I think David Freedman said it well:

    "In social-science regression analysis, usually the idea is to fit a curve to the data, rather than figuring out the process that generated the data. As a matter of fact, investigators often talk about 'modeling the data.' This is almost perverse: surely the object is to model the phenomenon, and the data are interesting only because they contain information about that phenomenon. Whatever it is that most social scientists are doing when they construct regression models, discovering natural laws does not seem to be uppermost in their minds."

  7. "With four parameters I can fit an elephant and with five I can make him wiggle his trunk." — John von Neumann

    Source: http://dx.doi.org/doi:10.1038/427297a

  8. Andrew says:

    Regarding the comments by Aleks and Radford:

    In principle, models (at least for social-science phenomena) should be ever-expanding flowers that have within them the capacity to handle small data sets (in which case, inferences will be pulled toward prior knowledge) or large data sets (in which case, the model will automatically unfold to allow the data to reveal more about the phenomenon under study). A single model will have zillions of parameters, most of which will barely be "activated" if sample size is not large.

    In practice, those of us who rely on regression-type models and estimation procedures can easily lose control of large models when fit to small datasets. So, in practice, we start with simple models that we understand, and then we complexify them as needed. This has sometimes been formalized as a "sieve" of models and is also related to Cantor's "diagonal" argument from set theory. (In this context, I'm saying that for any finite class of models, there will be a dataset for which these models don't fit, thus requiring model expansion.)

    But I agree with Radford that complex models are better. If I use a simpler model because I have difficulty understanding something more complex, I'm certainly not proud of myself!

    Finally, it's my impression that statisticians and computer scientists working in dense-data settings (for example, speech recognition, vision, data mining) have been somewhat successful at developing highly complex models with many more parameters than we will use in social science. I'm thinking here of work by Radford Neal, Bob Carpenter, Yingnian Wu, and others.
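
A minimal sketch of the "pulled toward prior knowledge" behavior described in comment 8, assuming the simplest textbook case: a normal prior on a single mean and normally distributed data with known standard deviation. This is not a model from the post; the function name `posterior_mean_sd` and all numbers are made up for illustration.

```python
import math

def posterior_mean_sd(prior_mean, prior_sd, data_mean, data_sd, n):
    """Normal-normal conjugate update for a mean with known data sd."""
    prior_prec = 1.0 / prior_sd**2   # precision = 1 / variance
    data_prec = n / data_sd**2       # precision of the sample mean of n points
    post_prec = prior_prec + data_prec
    # Posterior mean is a precision-weighted average of prior and data.
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

# Same prior and same observed mean, increasing sample size (hypothetical values).
for n in (2, 20, 2000):
    m, s = posterior_mean_sd(prior_mean=0.0, prior_sd=1.0,
                             data_mean=3.0, data_sd=2.0, n=n)
    print(n, round(m, 3), round(s, 3))
```

With few data points the estimate sits near the prior mean and the posterior is wide; as n grows the same model, unchanged, lets the data dominate. The model itself is never simplified to accommodate the small sample.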

  9. Andrew says:

    Regarding the quote from Freedman:

    I don't know what he means by saying that something is "usually" done in "social science data analysis." A statement such as "usually" implies a numerator and a denominator.

    The numerator is the number of times that "the idea is to fit a curve to the data, rather than figuring out the process that generated the data."

    The denominator is the total number of "social science data analyses."

    I don't know how Freedman is defining his numerator or denominator and so I find it difficult to evaluate his statement.

  10. Andrew says:

    Regarding the quote from von Neumann:

    That's nice. All I can say in my defense is that in social science, we are fitting an entire population of animals, not just one. And humans are more complicated than elephants! From this perspective, 2 parameters per person (e.g., varying intercepts and slopes) actually seems parsimonious!

  11. Aleks says:

    Isn't the data that is seen as a random sample already a complex enough pseudo-model? That's the foundation for bootstrap. For example, in nonparametric bootstrap one assumes a multinomial model where outcomes correspond to data instances. This multinomial distribution can then be seen as a universal complex model for the data. The particular data set is just the likeliest sample from the model.

    The multinomial model is not considered to be the final result; it is just a source of uncertainty, like the Bayesian prior. The uncertainty will propagate to various queries about the data (such as, is the mean greater than zero), and the answers will also be uncertain.

    As for making use of complex models, there is an old story told in computer vision about a model that perfectly classified enemy tanks from friendly tanks. It worked perfectly on the data.

    Later on, they looked more closely and found that it doesn't really classify enemy from friendly tanks, but discriminates the tanks in the open photographed in broad daylight from concealed tanks photographed in the dark. It was just an artifact of the data that all the friendly tanks were represented by good-quality pictures, and all the enemy tanks as bad-quality pictures.

    So complex models, yes, as sources of uncertainty to be used when answering understandable queries about the data. Complex models, yes, as proxies for data when building understandable models. But complex models, no, when offered as a black-box explanation of a phenomenon, even if they get excellent objective performance measure scores.

    But the vision of the complex-unfolding-model is very enticing…
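
The nonparametric bootstrap described in comment 11 can be sketched as follows, using the comment's own query ("is the mean greater than zero"). This is a hedged illustration using only the Python standard library; the function name `bootstrap_prob_mean_positive` and the data values are hypothetical.

```python
import random

def bootstrap_prob_mean_positive(data, n_boot=2000, seed=0):
    """Share of bootstrap resamples whose mean exceeds zero."""
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(n_boot):
        # Drawing n points with replacement is equivalent to drawing
        # multinomial counts over the observed points, each with prob 1/n.
        resample = rng.choices(data, k=n)
        if sum(resample) / n > 0:
            hits += 1
    return hits / n_boot

data = [0.8, -0.3, 1.2, 0.5, -0.1, 0.9, 0.4, -0.6, 1.1, 0.2]  # made-up numbers
print(bootstrap_prob_mean_positive(data))
```

The answer to the query is itself a proportion rather than a yes or no, which is the sense in which the uncertainty in the multinomial "model" propagates to the query.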

  12. Andrew says:

    A long comment from Aleks:

    Andrew,

    this is a comment by H. Rubin regarding the recent blog discussion, posted on sci.stat.math. Herman is known for his "…non-separability of utility from prior" paper from 1987.

    > There is no basic disagreement between Bayesian inference
    > and Occam's razor, if the Bayesian inference is done
    > properly, not rashly. One will never get the true model,
    > nor could one use the true model if by some miracle it can
    > be found, so the real question is, what somewhat wrong
    > model minimizes the combined aspects of the loss? The
    > behavioral Bayes approach does not look at the probability
    > of the right action, nor does it even require prior
    > probability in the usual sense, but minimization of the
    > Bayes risk, and even this can only be approximated in
    > practice. Some of the aspects of risk are the complexity
    > of the model, the error in predicting from the model,
    > computational costs, whether the model can improve
    > understanding, etc.

    I replied:

    i don't really understand what h. rubin is saying here. i have no idea how he would quantify "whether the model can improve understanding, etc."

    Then Aleks:

    From what I understand, H. Rubin sees any modelling task as a task of maximizing the expected utility. Any modelling task is seen decision-theoretically. So a good model is what will maximize the expected utility. And as R. Neal would say that the prior is 'true', H. Rubin would say that the utility is explicit and 'true' (and the prior is inseparable from utility, so it must be true too).

    As an example, it's quite foreseeable that I would define utility as being inversely proportional to the posterior variance. Or, one could include model complexity or some other quantification of understandability as a component of utility. We're all doing this implicitly (i.e., you mentioned that you don't understand complex models, so sometimes stick to simpler ones), but Rubin would suggest making it formal.

    Of course, it's very dangerous to blindly maximize utility. One can invent a "pink shades" prior that makes everything more ideal than it really is. So the prior must still be realistic.

    In all, Rubin's view allows greater freedom, while restricting the formal methodology. My pet problem with his approach is that I often like to use the concept of value-at-risk (VaR, which is popular in econometrics and finance), which isn't as sensitive to gambler's ruin as expected utility. The proponents of expected utility would respond that nothing prevents you from defining value-at-risk as your utility function, and that risk-averse utility functions are analogous to VaR.

    The core question at this point is finding a healthy balance between assumptions and abstractions on one hand, and subjective human judgement on the other hand.

    Attaching a bit more from Rubin, as he responded to someone else.

    Aleks

    One must distinguish between the model which is accepted and the model which is true. Remember that the model must explain the observations, not just what is happening in nature. This MAY be simple enough that the difference can be ignored, or simplified, or it may not.

    There are many examples where a Bayesian approach can be taken to the real problem, even though a posterior distribution is beyond calculation. This may even result in accepting a model which is usually rejected from the grounds of statistical significance. It is not necessary to compute posterior probabilities to compute the decision to be taken, fortunately.

    This is the position faced in the social sciences, in the biological sciences, and even to some extent in the physical sciences. Do not oversimplify the problem.

    Finally, my (Andrew's) response to all this:

    First, I don't think that Radford would say that the prior is "true." The model is an assumption, which along with data gives you posterior inferences which you can look at, try to understand, compare to data, etc. You can use this for decision analysis if you'd like although I don't think that an explicit decision analysis is always, or even often, necessary. I don't really understand what H. Rubin is saying about the social sciences but perhaps with an example it would be clearer.
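
Aleks's remark in comment 12 that value-at-risk is less sensitive to ruinous outcomes than an expectation can be illustrated with a toy calculation. This is only a sketch under loud assumptions: utility is taken to be linear in payoff (so expected utility is just the mean), VaR is computed as a simple empirical quantile of losses (one of several conventions), and the payoffs and function names are invented.

```python
def expected_value(payoffs):
    """Mean payoff, i.e. expected utility under a linear utility."""
    return sum(payoffs) / len(payoffs)

def value_at_risk(payoffs, level=0.95):
    """Empirical VaR: a loss that roughly `level` of outcomes do not exceed."""
    losses = sorted(-p for p in payoffs)   # loss = negative payoff
    index = min(len(losses) - 1, int(level * len(losses)))
    return losses[index]

def scenario(ruin_payoff):
    # Made-up payoffs: 90 small gains, 9 small losses, 1 ruinous outcome.
    return [1.0] * 90 + [-2.0] * 9 + [ruin_payoff]

for ruin in (-100.0, -1_000_000.0):
    payoffs = scenario(ruin)
    print(ruin, expected_value(payoffs), value_at_risk(payoffs))
```

In this toy setup the 95% VaR is identical in both scenarios, while the mean payoff falls sharply as the single ruinous outcome gets worse. As the comment notes, a proponent of expected utility could reply that a suitably risk-averse utility function would change the picture.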

  13. Information theory provides a natural link between Bayesian approaches and parsimony, by expressing theories and data in the same language. Minimum Message Length provides an explicitly Bayesian look at explanation vs. prediction: a quantitative version of Ockham's razor.

    Perhaps you object to the assumption that in the absence of a strong prior, less complicated theories should get shorter codes (hence higher probability). Ultimately, there is no justification, as Hume demonstrated. But if you are willing to assume some stable structure to the universe, or even just take past performance as a guide, there is good support for the view.
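
The idea in comment 13 that simpler theories get shorter codes can be sketched with a crude two-part message for coin flips: bits to state the model, plus bits to encode the data given the model. This is an illustration in the spirit of Minimum Message Length, not the MML estimator itself; the one-bit model label, the fixed 4-bit precision for the bias parameter, and the function names are all assumptions made for the example.

```python
import math

def data_bits(heads, tails, theta):
    """Bits needed to encode the flips given heads probability theta."""
    return -(heads * math.log2(theta) + tails * math.log2(1.0 - theta))

def fair_message_bits(heads, tails):
    # 1 bit names the model; the fair coin has no parameter to state.
    return 1 + data_bits(heads, tails, 0.5)

def biased_message_bits(heads, tails, precision_bits=4):
    # theta is stated to `precision_bits` bits; grid midpoints avoid 0 and 1.
    cells = 2 ** precision_bits
    grid = [(j + 0.5) / cells for j in range(cells)]
    best = min(data_bits(heads, tails, t) for t in grid)
    return 1 + precision_bits + best

# Hypothetical data sets: nearly balanced flips, then lopsided flips.
for heads, tails in [(11, 9), (18, 2)]:
    print(heads, tails,
          round(fair_message_bits(heads, tails), 2),
          round(biased_message_bits(heads, tails), 2))
```

For the nearly balanced flips the fair-coin message is shorter, because the extra parameter costs more bits than it saves; for the lopsided flips the biased-coin message wins despite that cost. The trade-off only favors the simpler model when the data do not pay for the added complexity.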