Nils Hjort, Chris Holmes, Peter Muller, and Stephen Walker have come out with a new book on Bayesian Nonparametrics. It’s great stuff, makes me realize how ignorant I am of this important area of statistics. Here are the chapters:

0. An invitation to Bayesian nonparametrics (Hjort, Holmes, Muller, and Walker)

1. Bayesian nonparametric methods: motivation and ideas (Walker)

2. The Dirichlet process, related priors and posterior asymptotics (Subhashis Ghosal)

3. Models beyond the Dirichlet process (Antonio Lijoi and Igor Prunster)

4. Further models and applications (Hjort)

5. Hierarchical Bayesian nonparametric models with applications (Yee Whye Teh and Michael I. Jordan)

6. Computational issues arising in Bayesian nonparametric hierarchical models (Jim Griffin and Chris Holmes)

7. Nonparametric Bayes applications to biostatistics (David Dunson)

8. More nonparametric Bayesian models for biostatistics (Muller and Fernando Quintana)

I have a bunch of comments, mostly addressed at some offhand remarks about Bayesian analysis made in chapters 0 and 1. But first I’ll talk a little bit about what’s in the book.

**What’s in the book**

All the applications in the book are biomedical. (In addition, Teh and Jordan mention image segmentation, but they don’t include an actual application in their chapter, instead focusing on the models in the abstract.) Does the all-biomedical-applications feature just arise from the particular list of authors, or is there some deeper reason? I’d think there’d also be potential for Bayesian nonparametrics in political science (for time series patterns that can’t simply be modeled with trends or autoregressions), environmental science (for example, nonlinear relations between weather and tree rings), and many other areas.

The book looks like it will be useful to a wide range of researchers. I like that there is a lot of discussion of the models themselves as well as the computation. The book, especially in the early chapters, is more theoretical than I would prefer–lots and lots of asymptotic results, which perhaps reach a peak in this equation from the chapter by Ghosal:

$$\sup_{\epsilon > \epsilon_n} \log N\bigl(\xi\epsilon,\ \{\theta : \epsilon < d_n(\theta, \theta_0) \le 2\epsilon\},\ d_n\bigr) \le n\epsilon_n^2$$

But, hey, that’s just my taste. After all, the book is called “Bayesian Nonparametrics,” not “Nonparametric Bayesian Data Analysis” or “Bayesian Nonparametrics for Applications” or whatever. I’m used to “Bayesian” being a signal for “applied,” but it doesn’t have to be that way.

In any case, this isn’t a real problem. The theoretical reader can focus on the first few chapters, while the applied people can flip to the second half of the book.

My main complaint–and maybe it’s not too late for the editors to fix this, since I’m commenting on an uncorrected manuscript draft–is that it would be good to have some additional guidance on the differences between all these models and how they work together. For example, skimming through the table of contents, I don’t see anything about splines. Did I miss something? Or is there some reason that splines are not considered to be nonparametric? This is not an area I know much about.

As noted above, I have little experience in this area and thus do not have much to say on the main content of the book. Chapters 0 and 1 are more general, however, and I have a bunch of reactions to them.

**Comments on the book’s view of Bayesian statistics**

Many of these comments represent areas in which I disagree with the authors, but on the whole I think the book is excellent. If I didn’t think the book was important, I wouldn’t spend my time pointing out my disagreements with it!

Chapter 0:

p.2: Methods based on ranks and related classical procedures are described as “distribution free.” I think this is misleading. Classical methods such as permutation tests are extremely model-dependent in that they typically assume a family of exactly additive models for their set of hypothesis tests that they invert to get confidence intervals. To me, the model y = a + bx + e, with some assumed probability distribution for independent errors e, is much more general than a model in which effects are constant across units.
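To make the additivity point concrete, here is a minimal sketch of how a permutation test gets inverted into a confidence interval. Everything in it is an assumption for illustration: the simulated two-group data, the difference-in-means statistic, the grid of candidate effects, and the helper names `perm_pvalue` and `perm_interval`. The key line is `treated - tau`: each candidate effect is subtracted from every treated unit, which only makes sense if the effect is the same constant for all units.

```python
import numpy as np


def perm_pvalue(treated, control, rng, n_perm=1000):
    """Two-sided permutation p-value for a difference in means."""
    observed = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control])
    n_t = treated.size
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:n_t].mean() - perm[n_t:].mean()
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1
    return (count + 1) / (n_perm + 1)


def perm_interval(treated, control, grid, rng, alpha=0.05):
    """Keep every candidate effect tau that the permutation test does not reject.

    Testing tau means subtracting it from all treated units, i.e. assuming
    an exactly additive, constant treatment effect.
    """
    kept = [tau for tau in grid
            if perm_pvalue(treated - tau, control, rng) > alpha]
    return min(kept), max(kept)


# Hypothetical simulated data with a true constant effect of 1.
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, size=30)
treated = rng.normal(1.0, 1.0, size=30)

estimate = treated.mean() - control.mean()
grid = np.linspace(estimate - 2.0, estimate + 2.0, 41)
lower, upper = perm_interval(treated, control, grid, rng)
print(f"estimate {estimate:.2f}, 95% interval ({lower:.2f}, {upper:.2f})")
```

If effects vary across units, the family of hypotheses being inverted here no longer contains the truth, which is the sense in which the procedure is model-dependent.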

p.2-3: The authors classify Bayesian parametrics as “models with a finite (and often low) number of parameters,” and Bayesian nonparametrics as having “big parameter spaces . . . and construction of probability measures over these spaces.” My question is, where do my models fit in??? For example, in this estimate of public opinion on school vouchers, we fit parameters for religion/ethnicity, income, state, and their interactions. The model has a lot of parameters–and as we get more data, we would fit still more parameters (interactions by age, education, etc etc)–so it’s clearly not “Bayesian parametrics” by their definition. But we’re doing it all with logistic regression and normal (or t) prior distributions–so there’s no construction of probability measures, thus it’s not “Bayesian nonparametrics” either.

p.3: The authors write that “the ill-defined transformation from so-called ‘prior knowledge’ to ‘prior distribution’ is hard enough for elicitation in lower dimensions and of course becomes more challenging in bigger spaces.” I’d just add that this is also hard in classical statistics, where prior knowledge is encoded as null and alternative hypotheses, or (implicitly) in the form of your estimator.

p.3: The authors are upset that “some nonparametric Bayes constructions actually lead to inconsistent estimation procedures, where the truth is not possibly uncovered when the data collection grows.” This doesn’t bother me at all! As data collection grows, our model will expand also, so I don’t see the need to converge to a single parameter estimate. This is sometimes called the method of sieves.

p.5: The authors write that “parameters are fixed unknowns for the frequentists.” And for the Bayesians too! The parameters that I’m estimating are fixed, I just don’t know them. I use a probability distribution to represent uncertainty about the parameters, but I think of them as having fixed true values.

p.5: They write, “one might be satisfied with a model that predicts climate parameters and the number of lynx in the forest, without always needing or aiming to understand the finer mechanisms involved in these phenomena.” Huh? There’s a lot of study of climate now, and it’s all about understanding the mechanisms. The discussion of global climate change, for example, centers on the potential impacts of increased burning of fossil fuels. Beyond this, it’s hard for me to see that prediction will be done well in a serious way without understanding. For example, with the lynx example (I assume the authors are alluding to the famous Canadian lynx time series), Cavan Reilly demonstrated (in an article in this book) how a simple Bayesian nonlinear predator-prey model with just a few parameters dramatically outperforms the standard context-free autoregressive models from the time series literature. There was no competition between the pragmatic goal of prediction and the more general goal of understanding (which is why people are studying these things in the first place); on the contrary, a model that captured more of the key features of the problem also did much much better at predictions.

p.5: The authors refer to “frequentist methods.” There is no such thing as a frequentist method; “frequentist” refers to an approach to evaluating methods. Or, you could say that any method that has been evaluated in a frequentist way is a “frequentist method,” in which case Bayesian methods are frequentist too.

p.5-6: They write, “one uses what one has, rather than spending three extra months on developing alternative methods.” Wait a minute! We’re researchers here. I certainly do sometimes spend three extra months developing new methods. That’s the best part of my job! (See here for an example.)

p.6: They write, “a person cannot be Bayesian or frequentist; rather, a particular analysis can be Bayesian or frequentist.” I completely disagree! Every statistician, from R. A. Fisher on down, will use Bayesian methods when they are appropriate–that is, when there is a physical sampling distribution for the parameters. A Bayesian is someone who uses Bayesian methods even when they’re inappropriate. I’m a Bayesian in that sense.

p.9: There is some reference to “Chinese restaurant franchises and Indian buffet processes.” I’ve always been annoyed by that sort of terminology. I accept that it’s standard in the field, and you have to use it, but it seems more cute than descriptive.

p.13: In the history of Bayesian nonparametrics, I’d think you’d want to refer to Wahba’s work from the 1970s providing a Bayesian foundation for splines.

p.15: In the discussion of “p>>n problems,” it says: “Ordinary methods do not work, and alternatives must be devised.” Here it would make sense to mention hierarchical Bayes regression, as well as latent parameter models in psychometrics and elsewhere that often have more parameters than data points.

p.15: In the discussion of regression trees, the editors miss a crucial aspect of Chipman et al.’s BART, which is that their model is additive: they are not merely growing, pruning, and averaging trees; they are putting them together in additive models, which makes a big difference.

p.17: I don’t like at all the discussion of so-called “empirical Bayes,” which as far as I can tell is just some sort of approximation to Bayes modeling. In particular, it’s extremely misleading to present so-called empirical Bayes as “opposed to the pure Bayes methodology that Lindley and others envisage.” Lindley and Novick did pathbreaking work in the 1960s on hierarchical (i.e., “empirical”) Bayes inference in psychometrics.

Chapter 1:

p.1: I know right away that this is a theoretical chapter because of its use of x rather than y. Theory’s great, I have no objection to it, but I would prefer theoreticians to be a bit less prescriptive on what applied people should do. For example, on page 2, “It is instructive to think of all Bayesians as constructing priors on spaces of density functions, and it is clear that this is the case.” Well, no. I spend a lot of time fitting regressions, estimating E(y|x). I put prior distributions on regression coefficients, not on “spaces of density functions.” It’s fine with me if this is what *you* do, but please don’t say that this is what “all Bayesians” do!

p.3: “It is not that difficult to argue that Bayesian model criticism is unsound, and the word that is often used is *incoherent*.” [italics in original] It’s not all that difficult to argue all sorts of things–for example, you’ll find lots of prominent people in the United States who will argue that the theory of evolution isn’t true, and then there’s the cold fusion community, and so on and so forth. But if you want to make the claim yourself that “Bayesian model criticism is unsound,” I suggest you start by taking a look at chapter 6 of Bayesian Data Analysis and thinking a bit about the posterior predictive distribution, p (y^rep | y), which is central in understanding this approach. You can also check out this article. Again, I say: do what you want, but please be more careful if you want to start calling things “unsound.”

p.3: “For the posterior to mean anything it must be representing genuine posterior beliefs.” OK, that’s one view. It’s called the subjectivist view. There is also the objectivist Bayesian view, which happens to be closer to my position. In the objectivist view, as I interpret it, the prior distribution and the likelihood are a model and are treated as an assumption: inference is conditional on the assumed model, and we can ditch the model if it makes predictions that don’t make sense or contradict the data. (Jaynes wrote about this, and we talk about objective prior distributions quite a bit in chapter 1 of Bayesian Data Analysis).

A bit past page 3, I started getting lost in the algebra. Getting to the end of the chapter, I see a bit that I agree with: “the Bayesian model is a learning model.” I think that’s exactly right. As Walker says earlier in his chapter, “the prior must encapsulate prior beliefs and be large enough to accommodate all uncertainties.” That’s about right, except that (a) I’d replace “the prior” with “the model” in that sentence (after all, the likelihood must be specified too, and for many problems it’s much more important than the prior), and (b) I’d be a little less optimistic about accommodating “all” uncertainties. At least, I’ve never been able to do that in my applied work!

In any case, I think we’re heading toward the same goal, and I respect that the complexities of nonparametric models may require additional mathematical work. I confess to not being able to easily read chapters 2 and 3 of the book, but this is probably just because these models are new. Back when the Gaussian distribution was being introduced, a lot of new mathematics came with it (the integral of exp(-x^2) and all the rest); it’s only with centuries of experience that it all seems simple to us now. So I’m glad there are people out there proving theorems while others of us fit models.

Later chapters: As I’ve already noted, I don’t have much to say here, but chapters 4-8 appear to make a good case that these methods are both doable and useful. I hope in the not-too-distant future to be able to use these ideas in my own applied research.

**Questions for the editors**

In summary, here are my biggest questions about the book:

– How do hierarchical Bayesian regression models fit into the framework of this book? These models, which I use all the time, seem to fall in between the “parametric” and “nonparametric” categories.

– Where do splines fit in? Are they in one of the chapters but I didn’t realize it?

A few random comments here…

First of all, I really appreciate your description of parameters being fixed unknowns for Bayesians too. That was always something I had a hard time with in school when I was first learning the Bayesian approach after a long line of hard-core classical classes. It seemed so strange to me that they thought of the true population parameter as random – if I have all the data in the population, how can a quantity I compute from it be random? Anyways, that's great to hear from you because I don't know how many Bayesians really think of it this way, or have thought of it at all.

My other comment/rambling is that just yesterday, while reading your multilevel models book, I was thinking about how you would go about handling cases in logistic regression where, for example, there are a bunch of zeros at the low values of the explanatory variable (call it x), mixing up to ones at the middle values of x, and then mixing back down to zeros at the higher values of x. The logistic fit would be terrible while a classical nonparametric approach such as local Bernoulli likelihood would yield a nice fit. So I was just wondering how something like this would fit into hierarchical models and maybe this book addresses situations like this?

I could use this! When's it coming out?

You could include an x^2 term as a covariate, but that wouldn't be very flexible.

I've seen Gaussian process models applied in the geosciences to paleoclimate reconstructions, geostatistical interpolation (kriging), and the emulation of complex computer codes.

Oh yeah – I guess for my particular example x^2 would work nicely. I guess I'm talking about even more complex situations.
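A minimal sketch of the x^2 suggestion, assuming simulated data of the kind described in this thread (mostly zeros at low and high x, mostly ones in the middle). The data-generating curve, the sample size, and the helper names `fit_logistic` and `log_lik` are all made up for illustration, and the fit is plain maximum likelihood by Newton-Raphson rather than anything hierarchical.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Logistic regression by Newton-Raphson; a tiny ridge keeps the solve stable."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        weights = p * (1.0 - p)
        grad = X.T @ (y - p) - ridge * beta
        hess = X.T @ (X * weights[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def log_lik(X, y, beta):
    """Bernoulli log-likelihood under a logit link."""
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


# Hypothetical data: Pr(y = 1) is high near x = 0 and low at both extremes.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=500)
p_true = 1.0 / (1.0 + np.exp(-(2.0 - 2.0 * x**2)))
y = (rng.uniform(size=x.size) < p_true).astype(float)

X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])

beta_lin = fit_logistic(X_lin, y)
beta_quad = fit_logistic(X_quad, y)
ll_lin = log_lik(X_lin, y, beta_lin)
ll_quad = log_lik(X_quad, y, beta_quad)

print("log-likelihood, linear in x:", round(ll_lin, 1))
print("log-likelihood, with x^2:   ", round(ll_quad, 1))
print("coefficient on x^2:", round(float(beta_quad[2]), 2))
```

The quadratic term handles this one shape; a curve with more than one bump would need a more flexible basis, which is where the more complex situations come in.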

I was at the NP Bayes workshop in Italy this past June. Over dinner with Nils Hjort and Steve MacEachern, Nils said it would have been good to invite you (Andrew) to the workshop in order to poke holes at the area, similar to your discussion paper "criticizing" Bayes, in Bayesian Analysis.

Since you weren't available, I offered to put together a short informal talk poking holes at NP Bayes:

http://www.stat.washington.edu/~hoff/public/npbay…

While giving this informal talk, I forgot to mention to the audience that I was channeling your spirit, and that the whole thing was Nils' idea (he wasn't in attendance). The audience lambasted me and now I am not allowed to draw from the Polya urn!

####

Regarding anon's comment on parameters being fixed but unknown:

"because I don't know how many Bayesians really think of it this way"

I would say "almost all Bayesians."

Without getting too much into what randomness means or what a Bayesian is, every "Bayesian" I know thinks of a prior as some sort of measure of (someone's potential or approximate) uncertainty about a fixed but unknown quantity.

I think the confusion might be from different uses of the ever-ambiguous adjective "random."

In non-Bayesian classes, "random" is generally used to describe quantities sampled from a population, so it seems like a contradiction to call a population parameter "random." However, in the Bayesian literature "random" generally means "uncertain," so it is not a contradiction to say a parameter is "random" and "fixed but unknown".

I find that the ambiguity over the word "random" has brought about much misinterpretation and mischaracterization of Bayesian methods. Maybe we'd be better off without this word?

Peter: It's just as well I wasn't invited to speak on the topic. I probably would've embarrassed myself by begging the assembled researchers to put together a hierarchical spline model that I could use for my applied research. I hate hate hate that I'm reduced to using simple models such as linear/quadratic trends and autoregressions, just because I don't have the technology to fit regularized splines.

Fitting regularized splines in a Bayesian manner (i.e., via priors on the smoothing parameter or variance ratios) is not that hard to do in R / BUGS, I've done it perhaps a dozen times myself over the last few years. However, if you have more than a couple of variables modeled by splines, you want to start it up in the evening and check it again in the morning – and I have a pretty powerful desktop computer. Maybe that's just my bad programming, though.

The default priors I've seen in the literature don't work well in my applications, though; they seem to put way too much weight on rough functions, with the poor predictive performance that results from substantial underfitting. Of course, more data solves all problems, except that of runtime in BUGS.

Maybe I ought to put together a little package to set everything up for a user who doesn't want to plow through the details…

Correction to previous comment: "way too much weight on rough functions" should be "way too much weight on very smooth functions".
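A rough sketch of the regularized-spline idea, under simplifying assumptions: a linear truncated-power basis, independent normal priors on the knot coefficients, flat priors on the intercept and slope, and a fixed variance ratio `lam` (noise variance divided by prior variance) instead of a prior on it. Conditional on `lam`, the posterior mean of the coefficients is a ridge-type solve. The simulated data, the knot placement, and the helper names `spline_basis` and `posterior_mean` are illustrative only; this is not the R / BUGS setup described above.

```python
import numpy as np


def spline_basis(x, knots):
    """Linear truncated-power basis: columns 1, x, and (x - k)_+ for each knot k."""
    hinge = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, hinge])


def posterior_mean(X, y, lam, n_unpenalized=2):
    """Posterior mean of the coefficients given the variance ratio lam.

    Knot coefficients get N(0, sigma^2 / lam) priors; the first
    n_unpenalized columns (intercept and slope) get flat priors.
    """
    penalty = np.eye(X.shape[1])
    penalty[:n_unpenalized, :n_unpenalized] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


# Hypothetical data: one period of a sine curve plus noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=200))
truth = np.sin(2.0 * np.pi * x)
y = truth + rng.normal(0.0, 0.3, size=x.size)

knots = np.linspace(0.0, 1.0, 22)[1:-1]  # 20 interior knots
X = spline_basis(x, knots)

# Fit at several fixed variance ratios and compare against the true curve.
fits = {lam: X @ posterior_mean(X, y, lam) for lam in (0.01, 1.0, 100.0)}
rmse = {lam: float(np.sqrt(np.mean((fit - truth) ** 2)))
        for lam, fit in fits.items()}
for lam, err in rmse.items():
    print(f"lam = {lam:g}: RMSE against the true curve = {err:.3f}")
```

Large values of `lam` correspond to a prior concentrated on very smooth (nearly linear) functions, which is the underfitting direction; a full Bayesian treatment would put a prior on `lam` or on the two variances and sample them along with the coefficients.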

@zbicyclist and anyone else interested:

A very readable introduction to Bayesian nonparametrics can be found in Koop (2003), Bayesian Econometrics, chapter 10. His book website also has Matlab programs.

can you (andrew) unpack the following comment a bit:

p.5: The authors refer to "frequentist methods." There is no such thing as a frequentist method; "frequentist" refers to an approach to evaluating methods. Or, you could say that any method that has been evaluated in a frequentist way is a "frequentist method," in which case Bayesian methods are frequentist too.

or point to a link that unpacks?