## Bayes, Jeffreys, prior distributions, and the philosophy of statistics

Christian Robert, Nicolas Chopin, and Judith Rousseau wrote this article that will appear in Statistical Science with various discussions, including mine.

I hope those of you who are interested in the foundations of statistics will read this. Sometimes I feel like banging my head against a wall, in my frustration in trying to communicate with Bayesians who insist on framing problems in terms of the probability that theta=0 or other point hypotheses. I really feel that these people are trapped in a bad paradigm and, if they would just think things through based on first principles, they could make some progress. Anyway, here’s what I wrote:

I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng, and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin, and Rousseau as a platform for further discussion of foundational issues.

In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in the process of data analysis.

### The role of the prior distribution in Bayesian data analysis

At least in the field of statistics, Jeffreys is best known for his eponymous prior distribution and, more generally, for the principle of constructing noninformative, or minimally informative, or objective, or reference prior distributions from the likelihood (see, for example, Kass and Wasserman, 1996). But it can be notoriously difficult to choose among noninformative priors; and, even more importantly, seemingly noninformative distributions can sometimes have strong and undesirable implications, as I have found in my own experience (Gelman, 1996, 2006). As a result I have become a convert to the cause of weakly informative priors, which attempt to let the data speak while being strong enough to exclude various “unphysical” possibilities which, if not blocked, can take over a posterior distribution in settings with sparse data–a situation which is increasingly present as we continue to develop the techniques of working with complex hierarchical and nonparametric models.
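A minimal sketch of the sparse-data problem described above, under assumptions that are not in the original article: a no-intercept logistic regression on four hypothetical data points that are perfectly separated, a flat prior standing in for a "seemingly noninformative" choice (it is not Jeffreys's prior), and a Normal(0, 2.5) prior as one arbitrary example of a weakly informative prior. The grid search and all names are illustrative only.

```python
import math

# Hypothetical toy data: the sign of x perfectly separates the outcomes.
x = [-1.0, -0.5, 0.5, 1.0]
y = [0, 0, 1, 1]

def log_lik(beta):
    """Log-likelihood of a no-intercept logistic regression with slope beta."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta * xi
        # log Pr(y=1) = -log(1 + exp(-eta)); log Pr(y=0) = -log(1 + exp(eta))
        z = -eta if yi == 1 else eta
        total -= math.log1p(math.exp(z))
    return total

def log_weak_prior(beta, scale=2.5):
    """Unnormalized log density of a Normal(0, scale) prior (scale is an assumption)."""
    return -0.5 * (beta / scale) ** 2

# Grid of candidate slopes from -50 to 50 in steps of 0.1.
grid = [i * 0.1 for i in range(-500, 501)]

# Flat prior: the posterior mode is the maximum-likelihood value on the grid.
flat_mode = max(grid, key=log_lik)

# Weakly informative prior: the mode maximizes log-likelihood plus log-prior.
weak_mode = max(grid, key=lambda b: log_lik(b) + log_weak_prior(b))

print("posterior mode, flat prior:", flat_mode)
print("posterior mode, weak prior:", weak_mode)
```

With the flat prior the mode runs off to the edge of the grid, however wide the grid is made, because the likelihood keeps increasing in the slope. The weak prior rules out those implausibly large slopes and leaves a finite estimate, while still letting the data determine the sign and rough size.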

### How the social and computational sciences differ from physics

Robert, Chopin, and Rousseau trace the application of Ockham’s razor (the preference for simpler models) from Jeffreys’s discussion of the law of gravity through later work of a mathematical statistician (Jim Berger), an astronomer (Bill Jefferys), and a physicist (David MacKay). From their perspective, Ockham’s razor seems unquestionably reasonable, with the only point of debate being the extent to which Bayesian inference automatically encompasses it.
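The sense in which Bayesian inference can "automatically" encompass Ockham's razor can be sketched with a Bayes factor comparing a sharp prediction against a model with an adjustable parameter. This is only an illustration under stated assumptions: a single normal measurement, a point hypothesis `theta0`, and a diffuse normal prior for the adjustable parameter. All numbers are hypothetical and are not taken from any of the analyses cited above.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor_point_vs_diffuse(x, meas_sd, theta0, prior_mean, prior_sd):
    """Bayes factor B01 for M0: theta = theta0 against M1: theta ~ Normal(prior_mean, prior_sd).

    The measurement model is x ~ Normal(theta, meas_sd) under both models.
    """
    # Marginal likelihood under the sharp prediction.
    m0 = normal_pdf(x, theta0, meas_sd)
    # Under M1, integrating out theta widens the predictive distribution.
    m1 = normal_pdf(x, prior_mean, math.sqrt(meas_sd ** 2 + prior_sd ** 2))
    return m0 / m1

# Hypothetical measurement close to the sharp prediction.
b01_near = bayes_factor_point_vs_diffuse(
    x=42.0, meas_sd=2.0, theta0=43.0, prior_mean=0.0, prior_sd=50.0
)

# Hypothetical measurement far from the sharp prediction.
b01_far = bayes_factor_point_vs_diffuse(
    x=30.0, meas_sd=2.0, theta0=43.0, prior_mean=0.0, prior_sd=50.0
)

print("B01 when data agree with the sharp prediction:", b01_near)
print("B01 when data disagree with it:", b01_far)
```

The diffuse model spreads its predictive probability over a wide range, so it is penalized when the data land where the sharp model said they would. The result depends directly on `prior_sd`, which is why the prior for the adjustable parameter has to be defensible for the comparison to mean anything.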

My own perspective as a social scientist is completely different. I’ve just about never heard someone in social science object to the inclusion of a variable or an interaction in a model; rather, the most serious criticisms of a model involve worries that certain potentially important factors have not been included. In the social science problems I’ve seen, Ockham’s razor is at best an irrelevance and at worst can lead to acceptance of models that are missing key features that the data could actually provide information on. As such, I am no fan of methods such as BIC that attempt to justify the use of simple models that do not fit observed data. Don’t get me wrong–all the time I use simple models that don’t fit the data–but no amount of BIC will make me feel good about it! (See Gelman and Rubin (1995) for a fuller expression of this position, and Raftery (1995) for a defense of BIC in general and in the context of two applications in sociology.)

I much prefer Radford Neal’s line from his Ph.D. thesis:

> Sometimes a simple model will outperform a more complex model . . . Nevertheless, I [Neal] believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

This is not really a Bayesian or a non-Bayesian issue: complicated models with virtually unlimited nonlinearity and interactions are being developed using Bayesian principles. See, for example, Dunson (2006) and Chipman, George, and McCulloch (2008). To put it another way, you can be a practicing Bayesian and prefer simpler models, or be a practicing Bayesian and prefer complicated models. Or you can follow similar inclinations toward simplicity or complexity from various non-Bayesian perspectives.

My point here is only that the Ockhamite tendencies of Jeffreys and his followers up to and including MacKay may derive to some extent from the simplicity of the best models of physics, the sense that good science moves from the particular to the general–an attitude that does not fit in so well with modern social and computational science.

### Bayesian inference vs. Bayesian data analysis

One of my own epiphanies–actually stimulated by the writings of E. T. Jaynes, yet another Bayesian physicist–and incorporated into the title of my own book on Bayesian statistics, is that sometimes the most important thing to come out of an inference is the rejection of the model on which it is based. Data analysis includes model building and criticism, not merely inference. Only through careful model building is such definitive rejection possible. This idea–the comparison of predictive inferences to data–was forcefully put into Bayesian terms nearly thirty years ago by Box (1980) and Rubin (1984) but is even now still only gradually becoming standard in Bayesian practice.

A famous empiricist once said, “With great power comes great responsibility.” In Bayesian terms, the stronger we make our model–following the excellent precepts of Jeffreys and Jaynes–the more able we will be to find the model’s flaws and thus perform scientific learning.

To roughly translate into philosophy-of-science jargon: Bayesian inference within a model is “normal science,” and “scientific revolution” is the process of checking a model, seeing its mismatches with reality, and coming up with a replacement. The revolution is the glamour boy in this scenario, but, as Kuhn (1962) emphasized, it is only the careful work of normal science that makes the revolution possible: the better we can work out the implications of a theory, the more effectively we can find its flaws and thus learn about nature. In this chicken-and-egg process, both normal science (Bayesian inference) and revolution (Bayesian model revision) are useful, and they feed upon each other. It is in this sense that graphical methods and exploratory data analysis can be viewed as explicitly Bayesian, as tools for comparing posterior predictions to data (Gelman, 2003).
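The comparison of posterior predictions to data can be sketched as a posterior predictive check. The version below is a deliberately simple illustration, not the procedure from any of the cited papers: it assumes a Normal(mu, sigma) model with known `sigma`, a flat prior on `mu`, and a hypothetical data set containing one extreme value. The function name and its parameters are made up for this example.

```python
import random
import statistics

def posterior_predictive_pvalue(y, stat, n_sims=2000, sigma=1.0, seed=0):
    """Estimate Pr(stat(y_rep) >= stat(y) | y) under a normal model with known sigma.

    Assumes y_i ~ Normal(mu, sigma) with a flat prior on mu, so that
    mu | y ~ Normal(mean(y), sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = statistics.fmean(y)
    observed = stat(y)
    exceed = 0
    for _ in range(n_sims):
        # Draw mu from its posterior, then a replicated data set given mu.
        mu = rng.gauss(y_bar, sigma / n ** 0.5)
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if stat(y_rep) >= observed:
            exceed += 1
    return exceed / n_sims

# Hypothetical data: nine unremarkable values and one extreme one.
y_obs = [-0.5, 0.3, 0.1, -0.2, 0.4, 0.0, -0.1, 0.2, 8.0, -0.3]

# The sample mean is fit directly by the model, so it cannot reveal misfit.
p_mean = posterior_predictive_pvalue(y_obs, statistics.fmean)

# The maximum probes the tails, which the model was never asked to match.
p_max = posterior_predictive_pvalue(y_obs, max)

print("posterior predictive p-value, mean:", p_mean)
print("posterior predictive p-value, max: ", p_max)
```

The inference for `mu` is perfectly well defined whether or not the model is right; it is only the check on a feature the model does not fit by construction, here the maximum, that exposes the mismatch and points toward a replacement model.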

To get back to the Robert, Chopin, and Rousseau article: I am suggesting that their identification (and Jeffreys’s) of Bayesian data analysis with Bayesian inference is limiting and, in practice, puts an unrealistic burden on any model.

### Conclusion

If you wanted to do foundational research in statistics in the mid-twentieth century, you had to be a bit of a mathematician, whether you wanted to or not. As Robert, Chopin, and Rousseau’s own work reveals, if you want to do statistical research at the turn of the twenty-first century, you have to be a computer programmer.

The present discussion is fascinating in the way it reveals how many of our currently unresolved issues in Bayesian statistics were considered with sophistication by Jeffreys. It is certainly no criticism of his pioneering work that it has been a springboard for decades of development, most notably (in my opinion) involving the routine use of hierarchical models of potentially unlimited complexity, and with the recognition that much can be learned by both the successes and the failures of a statistical model’s attempt to capture reality. The Bayesian ideas of Jeffreys, de Finetti, Lindley, and others have been central to the shift in focus away from simply modeling data collection and toward the modeling of underlying processes of interest–“prior distributions,” one might say.

[References (and a few additional footnotes) appear in the posted article.]

1. Richard D. Morey says:

I mostly agree with what you've written, but I find it difficult to believe that parsimony is irrelevant in the social sciences. At least in Psychology, it has been long understood that "everything affects everything else" (to paraphrase Meehl) and so tossing every effect into a big pot so that you can model it can sometimes be counterproductive. For instance, if I am interested in behavioral genetics, if I had a lot of data I would probably see that pretty much every gene has some very small relationship to my research interest, and a few have bigger effects. I then want to build a causal model explaining why the genes have the effects they do. Why should I include the tiny effects? If they don't offer much in terms of prediction, I should drop them out. Sometimes, inferential methods and parsimony are useful in deciding what NOT to worry about, so we can make progress on explaining the big effects.

But maybe the problem here is the dual use of the word "model" to mean both statistical models and (formal or informal) process or causal models. Or maybe it is just a difference between social science and other sciences.

2. Keith O'Rourke says:

Thanks for the excerpt from Radford’s thesis – it is very clear and hard to argue with.

I had been worried about increased risk of model misspecification and less chance of detecting it with more complex models, thinking of more complex methods as depending on higher order moments or fitting higher order polynomial functions. But with more complex models in the sense of Fisher’s exhortation "to make your models complex for non-randomized studies", the chance of detecting misspecification will often go up. (Occam's razor for design, though, may be more robust, if interpreted as designing a study so that the fewest assumptions are required for its analysis to be credible.)

Also, your comments on needing to be a mathematician in the 20th century and a computer programmer in 21st century – there does also seem to be becoming a need to have some sense of philosophy of science.

Keith
p.s. CS Peirce – yet again ;-) wrote about the continuity of models (between any two models or signs there will always be an intermediary one) with these being all wrong, all science was essentially the communal process over time of infinitely progressing from less wrong models to lesser wrong models – makes the “revolutions” more common and everyday (Peircedestrian).

3. I think that Andrew is right to point out the unusual situation that hard sciences like physics and astronomy are in, as compared with social sciences. It's true that theories in physics often have great intrinsic simplicity. The remarkable thing about Einstein's general theory of relativity, about which Jim Berger and I wrote, is that it precisely predicts the value of the motion of Mercury's perihelion from very simple assumptions, a value that had been regarded as anomalous for the majority of the 19th century. Many explanations were offered, such as modifying Newton's law of gravity, postulating unseen planets or rings of material near the Sun. Indeed, sightings of the unseen planet were announced on several occasions (but never confirmed). It even had a name, "Vulcan."

All of these attempts amounted to adding an extra adjustable (fudge) parameter to the theory, essentially just the value of the anomalous motion, whereas Einstein's theory required no such extra baggage. We in effect have a nearly point null hypothesis versus a parameter with a fairly broad range of possibilities (and a prior that can be defended). So when the calculation was done, our analysis favored Einstein's simpler theory.

In other astronomical work that Jim and I have done (with others), we've used Bayesian model averaging to average over models that used Fourier polynomials of variable order to fit periodic phenomena for which the physics was too complicated. This is related to but different from the relativity example. Empirical models like this are closer to what I gather the kinds of models used in social sciences look like.

4. Andrew Gelman says:

Richard:

Good point. For example, I recently wrote a book analyzing voting in terms of state of residence, income, religious attendance, and, for some analyses, ethnicity. Why did I just pick these 4 variables? Not because of any Occam's Razor idea. Certainly there are a lot of other important factors out there, including age, sex, and party identification. I did look at other variables (including party id and issue attitudes) at different places in the book. The point is that I wasn't using formal statistical criteria or Bayes factors to pick what to look at. It's hard to say exactly how I was guided, or what the advantages of simple or complex models are here. I was doing many simple comparisons because these were easiest for me to understand.

5. Robert says:

http://www.e-publications.org/ims/submission/inde

6. Ben Bolker says:

Maybe ecologists are just backward, but there still seems to be a lot of (justifiable) concern among us about data-dredging. Even if our data sets were as big as those of many social scientists (and with telemetry and remote-sensing studies some of them are), there's still the concern that the iterative model-building procedure can lead to fitting interesting but spurious (non-generalizable) features in the data. Whatever happened to the bias-variance tradeoff … ?

7. Keith O'Rourke says:

Ben: Perhaps I could try to flesh some of the postings out without stepping into it too much – nor exceeding the 15-minute suggested maximum time limit for constructing blog postings.

The empirical models Bill mentions are unlikely, as Popper would put it, to have much chance of being falsified or rejected (higher order terms may appear not to be important but are unlikely to be "wrong").

On the other hand, highly structured models, models that are concerned with the data generating processes or complicated models as Fisher suggested for observational studies may have much better chances of being rejected – if importantly wrong – and therefore may represent less risky "model complication".

On the other other hand, if you can design the study, good design most importantly (I believe according to Fisher's later writings) reduces the complication of the model assumptions required for credible analysis – Fisher's best example being the randomized design and analysis where only assumptions of non-interference between units (SUTVA) and a single value for a single treatment parameter (strict Null) are required for a credible analysis – a dreadfully simple model (see Alex's recent post referring to Imbens's recent paper for a discussion of this). It's interesting to me that when Fisher loses input to design, with observational studies in hand, he suggests complicated models.

As for Andrew not using formal statistics but informal judgement, Peirce once argued that we had evolved to be good and wise in such informal modeling judgements — but the arguments are not terribly credible …

Hope this was more than a distraction
Keith
p.s. 18 minutes