In this discussion from last month, computer science student and Judea Pearl collaborator Elias Bareinboim expressed the view that hierarchical Bayesian methods might be fine in practice but that they lack theory, that Bayesians can’t succeed in toy problems. I posted a P.S. there which might not have been noticed, so I will repeat it here:
I now realize that there is some disagreement about what constitutes a “guarantee.” In one of his comments, Bareinboim writes, “the assurance we have that the result must hold as long as the assumptions in the model are correct should be regarded as a guarantee.” In that sense, yes, we have guarantees! It is fundamental to Bayesian inference that the result must hold if the assumptions in the model are correct. We have lots of that in Bayesian Data Analysis (particularly in the first four chapters but implicitly elsewhere as well), and this is also covered in the classic books by Lindley, Jaynes, and others. This sort of guarantee is indeed pleasant, and there is a long history of Bayesians studying it in theory and in toy problems. Arguably, many of the examples in Bayesian Data Analysis (for example, the 8 schools example in chapter 5) can be seen as toy problems. As I wrote earlier, I don’t think theoretical proofs or toy problems are useless; I just find applied examples to be more convincing. Theory and toys can be helpful in giving us a clearer understanding of our methods.
Ways of knowing
Why do I go on and on about this? I am interested in how we “know,” in this case how we decide to believe in the effectiveness of a statistical method. Here are a few potential sources of evidence in favor of a method:
- Mathematical theory (for example, coherence of inference or asymptotic convergence);
- Computer simulations (for example, demonstrating approximate coverage of interval estimates under some range of deviations from an assumed model; see the first sketch after this list);
- Solutions to toy problems (for example, comparing the partial pooling estimate for the 8 schools to the no pooling or complete pooling estimates; see the second sketch after this list);
- Improved performance on benchmark problems (for example, getting better predictions for the Boston Housing Data);
- Cross-validation and external validation of predictions;
- Success as recognized in a field of application (for example, our estimates of the incumbency advantage in congressional elections);
- Success in the marketplace (under the theory that if people are willing to pay for something, it is likely to have something to offer).
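The simulation idea in the second bullet can be made concrete in a few lines of code. Here is a minimal sketch in Python, where the sample size, number of replications, and the choice of a heavy-tailed t distribution as the deviation from the assumed normal model are all illustrative assumptions; the interval checked is a simple normal-theory one, but the same loop structure applies to posterior intervals:

```python
# Minimal coverage simulation: how often does a nominal 95% interval
# for the mean cover the truth when the data deviate from the assumed
# normal model? All constants below are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, n_reps, true_mean = 20, 10_000, 0.0

covered = 0
for _ in range(n_reps):
    # Deviation from the assumed model: heavy-tailed t_3 data, not normal.
    y = true_mean + rng.standard_t(df=3, size=n)
    half_width = 1.96 * y.std(ddof=1) / np.sqrt(n)  # normal-theory interval
    covered += abs(y.mean() - true_mean) <= half_width

print(f"empirical coverage: {covered / n_reps:.3f} (nominal 0.95)")
```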
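And for the toy-problems bullet, here is a sketch of the 8 schools comparison. The data (estimated effects and standard errors) are from chapter 5 of Bayesian Data Analysis; the fixed value of the between-school standard deviation tau is an illustrative assumption, whereas the full Bayesian analysis averages over the posterior distribution of tau:

```python
# Compare no pooling, complete pooling, and partial pooling for the
# 8 schools data (BDA chapter 5), conditional on an assumed tau.
import numpy as np

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])        # estimated effects
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # standard errors

no_pool = y                                # each school stands on its own
w = 1 / sigma**2
complete_pool = np.full(8, np.sum(w * y) / np.sum(w))  # one common estimate

tau = 5.0  # assumed between-school sd; values near 0 approach complete
           # pooling, large values approach no pooling
w_mu = 1 / (sigma**2 + tau**2)
mu_hat = np.sum(w_mu * y) / np.sum(w_mu)   # estimated common mean
shrink = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
partial_pool = shrink * y + (1 - shrink) * mu_hat

for j in range(8):
    print(f"school {j + 1}: no pooling {no_pool[j]:6.1f}, "
          f"partial {partial_pool[j]:6.1f}, complete {complete_pool[j]:6.1f}")
```

With tau near zero the partial pooling estimates collapse toward the complete pooling average; as tau grows they approach the no pooling estimates.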
None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.
The very imperfections of each of these sorts of evidence give a clue as to why it makes sense to care about all of them. We can’t know for sure, so it makes sense to have many ways of knowing.
Progress! Bayesian methods have moved from plaything to practical tool
Go back 50 years or so and read the discussions of Bayesian inference from that time. There were some applied successes (for example, I. J. Good repeatedly referred to his successes using Bayesian methods to break codes in the Second World War), but most of the arguments in favor of Bayes were theoretical. To start with, it was (and remains) trivially (but not unimportantly) true that, conditional on the model, Bayesian inference gives the right answer. The whole discussion then shifts to whether the model is true, or, better, how the methods perform under the (essentially certain) condition that the model’s assumptions are violated, which leads into the tangle of various theorems about robustness or lack thereof.
Fifty years ago, one of Bayesianism’s major assets was its theoretical coherence, with various theorems demonstrating that, under the right assumptions, Bayesian inference is optimal. Bayesians also spent a lot of time writing about toy problems (for example, Basu’s example of the weights of elephants). From the other direction, classical statisticians felt that Bayesians were idealistic and detached from reality.
How things have changed! To me, the key turning points occurred around 1970-1980, when statisticians such as Lindley, Novick, Smith, Dempster, and Rubin applied hierarchical Bayesian modeling to solve problems in education research that could not be easily attacked otherwise. Meanwhile Box did similar work in industrial experimentation and Efron and Morris connected these approaches to non-Bayesian theoretical ideas. The key in any case was to use partial pooling to learn about groups for which there was only a small amount of local data.
Lindley, Novick, and the others came at this problem in several ways. First, there was Bayesian theory. They realized that, rather than seeing certain aspects of Bayes (for example, the need to choose priors) as limitations, they could see them as opportunities (priors can be estimated from data!), with the next step being to fold this approach back into the Bayesian formalism via hierarchical modeling. We (the Bayesian community) are still doing research on these ideas; see, for example, this recent paper by Polson and Scott on prior distributions for hierarchical scale parameters.
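To make the “priors can be estimated from data” step concrete, here is a sketch, continuing the 8 schools setup from the code above, of the empirical-Bayes move that hierarchical modeling then folds back into the full formalism: choose the hyperparameters (mu, tau) that maximize the marginal likelihood, under which y_j ~ N(mu, sigma_j^2 + tau^2). The grid bounds below are arbitrary illustrative choices:

```python
# Estimate the "prior" parameters (mu, tau) from the data by maximizing
# the marginal likelihood of the 8 schools data over a grid of tau values.
import numpy as np

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def profile_loglik(tau):
    v = sigma**2 + tau**2
    mu_hat = np.sum(y / v) / np.sum(1 / v)  # weighted-mean estimate of mu
    return -0.5 * np.sum(np.log(v) + (y - mu_hat)**2 / v), mu_hat

taus = np.linspace(0, 30, 3001)
lls, mus = zip(*(profile_loglik(t) for t in taus))
best = int(np.argmax(lls))
print(f"empirical-Bayes estimates: tau = {taus[best]:.2f}, mu = {mus[best]:.2f}")
```

For these data the marginal likelihood peaks at or near tau = 0, exactly the sort of boundary behavior that motivates work on priors for hierarchical scale parameters such as the Polson and Scott paper mentioned above; the fully Bayesian treatment puts a prior on tau and averages over it rather than plugging in a point estimate.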
The second way that Lindley, Novick, etc. succeeded was by applying their methods to realistic problems. This is a pattern that has happened with just about every successful statistical method I can think of: an interplay between theory and practice. Theory suggests an approach which is modified in application, or practical decisions suggest a new method which is then studied mathematically, and this process goes back and forth.
To continue with the timeline: the modern success of Bayesian methods is often attributed to our ability, using methods such as the Gibbs sampler and the Metropolis algorithm, to fit an essentially unlimited variety of models: practitioners can use programs such as BUGS to fit their own models, and researchers can implement new models at the expense of some programming but without the need to continually develop new approximations and new theory for each model. I think that’s right—Markov chain simulation methods indeed allow us to get out of the pick-your-model-from-the-cookbook trap—but I think the hierarchical models of the 1970s (which were fit using various approximations, no MCMC) showed the way.
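Here is a minimal random-walk Metropolis sketch in Python to illustrate the point; the function names, step size, iteration count, and toy data are all illustrative assumptions. The sampler only needs a function that evaluates an unnormalized log posterior, so fitting a new model means writing a new log-density function rather than deriving new approximations:

```python
# Random-walk Metropolis: propose x' = x + step * Normal noise, accept
# with probability min(1, p(x') / p(x)). Works for any model whose
# unnormalized log posterior we can evaluate.
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, init, n_iter=5000, step=1.0):
    x = np.asarray(init, dtype=float)
    lp = log_post(x)
    draws = np.empty((n_iter, x.size))
    for i in range(n_iter):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept or reject
            x, lp = prop, lp_prop
        draws[i] = x
    return draws

# Toy example: normal mean with known sd 1 and flat prior, three
# made-up observations. Swapping in a hierarchical model would change
# only the log_post function passed in, not the sampler.
data = np.array([1.2, -0.3, 0.8])
draws = metropolis(lambda m: -0.5 * np.sum((data - m)**2), init=[0.0])
print(draws[1000:].mean(), draws[1000:].std())  # near ybar and 1/sqrt(3)
```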
To get back to the discussion from last month: of course Bayesian inference has “theoretical guarantees” of the sort that our correspondent Bareinboim was looking for. Fifty years ago, this theoretical guarantee was almost all that Bayesian statisticians had to offer. But now that we have decades of applied successes, those are naturally what we point to. From the perspective of Bayesians such as myself, theory is valuable (our Bayesian Data Analysis book is full of mathematical derivations, each of which can be viewed, if you’d like, as a theoretical guarantee that various procedures give correct inferences conditional on assumed models), but applications are particularly convincing.
Over the years I have become pluralistic in my attitudes toward statistical methods. Partly this comes from my understanding of the history. Bayesian inference seemed like a theoretical toy and was regarded by many leading statisticians as somewhere between a joke and a menace, but the hardcore Bayesians persisted and got some useful methods out of it. Bootstrapping is an idea that is in some ways obviously wrong (it assigns zero probability to data that did not occur, which would seem to violate the most basic ideas of statistical sampling), yet it has become useful to many and has since been supported in many cases by theory. Etc etc etc.