Deborah Mayo pointed me to this discussion by Christian Hennig of my recent article on Induction and Deduction in Bayesian Data Analysis.

A couple days ago I responded to comments by Mayo, Stephen Senn, and Larry Wasserman. I will respond to Hennig by pulling out paragraphs from his discussion and then replying.

Hennig:

for me the terms “frequentist” and “subjective Bayes” point to interpretations of probability, and not to specific methods of inference. The frequentist one refers to the idea that there is an underlying data generating process that repeatedly throws out data and would approximate the assumed distribution if one could only repeat it infinitely often.

Hennig makes the good point that, if this is the way you would define “frequentist” (it’s not how I’d define the term myself, but I’ll use Hennig’s definition here), then it makes sense to be a frequentist in some settings but not others. Dice really can be rolled over and over again; a sample survey of 1500 Americans really does have essentially infinitely many possible outcomes; but there will never be anything like infinitely many presidential elections or infinitely many worldwide flu epidemics.

Hennig:

The subjective Bayesian one is about quantifying belief in a rational way; following de Finetti, it would in fact be about belief in observable future outcomes of experiments, and not in the truth of models. Priors over model parameters, according to de Finetti, are only technical devices to deal with belief distributions for future outcomes, and should not be interpreted in their own right.

I understand the appeal of the pure predictive approach, but what I think is missing here is that what we call “parameters” are often conduits to generalizability of inference.

Consider my work with Frederic Bois in toxicology. When studying the concentrations of a toxin in blood and exhaled air, you can model the data directly with some convenient and flexible functional form—a “phenomenological” model—or you can use a more fundamental model based on latent parameters with direct physical and biological interpretations: the volume of the liver, the equilibrium concentration of the toxin in fatty tissues compared to the blood, and so forth. Modeling using latent parameters is more difficult—you have to throw in lots of prior information to get it to work, as we discuss in our article—but, on the plus side, there is biological reason to suspect that these parameters generalize from person to person. Which, in turn, gives our hierarchical prior distributions a chance to do the partial pooling that gives reasonably precise individual-level inferences.

There’s a saying, A chicken is nothing but an egg’s way of creating another egg. Similarly, the de Finetti philosophy (as described by Hennig) might say that parameters are nothing but data’s way of predicting new data. But this misses the point. Parameterization encodes knowledge, and parameters with external validity encode knowledge particularly effectively.

Hennig:

However, I think that any single analysis that uses and interprets probabilities can only make sense if it is clear what is meant by “probability” in that particular situation. So I think that it’s a quite serious omission that Gelman doesn’t tell us his interpretation (he may do that elsewhere, though).

Indeed, I do give my interpretation of probabilities elsewhere. I thought a lot about this when writing Bayesian Data Analysis, and my interpretation is stated at length in chapter 1 of that book. These were my ideas 20 years ago but I still pretty much hold on to them (except that, as I’ve discussed often on this blog and elsewhere, I’ve moved away from noninformative priors and now I think that weakly informative priors are the way to go).

Hennig concludes with a statement of concern about posterior predictive checking. I will respond in three ways:

1. Posterior predictive checks reduce to classical goodness-of-fit tests when the test statistic is pivotal; when this is not the case, there truly is uncertainty about the fit, and I prefer to go the Bayesian route and average over that uncertainty.

2. Whatever you may think about them theoretically, posterior predictive checks really can work. See chapter 6 of Bayesian Data Analysis and my published papers for many examples. It might well be that something better is out there, but the alternative I always see is people simply not checking their models. I’ll see exploratory graphs of raw data, pages and pages of density plots of posterior simulations, trace plots and correlation plots of iterative simulations—but no plots comparing model to data.

The basic idea of posterior predictive checking is, as they say, breathtakingly simple: (a) graph your data, (b) fit your model to data, (c) simulate replicated data (a Bayesian can always do this, because Bayesian models are always “generative”), (d) graph the replicated data, and (e) compare the graphs in (a) and (d). It makes me want to scream scream scream scream scream when statisticians’ philosophical scruples stop them from performing these five simple steps (or, to be precise, performing the simple steps (a), (c), (d), and (e), given that they’ve already done the hard part, which is step (b)).
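To make the recipe concrete, here is a minimal sketch in Python with a toy normal model and a flat prior (the data and all specifics are made up for illustration; in practice steps (a), (d), and (e) would be actual graphs rather than the single numerical summary used here):

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) the data -- a made-up dataset; in practice, graph it first
y = rng.normal(5.0, 2.0, size=100)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# (b) fit the model: normal likelihood with flat prior on (mu, log sigma);
#     the posterior is sigma^2 ~ (n-1) s^2 / chi^2_{n-1},
#     mu | sigma^2 ~ N(ybar, sigma^2 / n)
n_rep = 1000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_rep)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

# (c) simulate replicated datasets from the posterior predictive distribution
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, n))

# (d)/(e) compare data to replications; here via one summary, the maximum
T_obs = y.max()
T_rep = y_rep.max(axis=1)
p_value = (T_rep >= T_obs).mean()   # posterior predictive p-value
```

Since the toy data here actually come from the model, the p-value should be unremarkable; the interesting case in practice is when it turns out extreme.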

3. In some settings, a posterior predictive check will essentially never “reject”; that is, there are models that have a very high probability of replicating certain aspects of the data. For example, a normal distribution with flat prior distribution will reproduce the mean (but not necessarily the median) of any dataset. In some of these situations I think it’s a good thing that the posterior predictive check does not “reject”; other times I am unhappy with this property. See this long blog post from a couple years ago for a discussion.
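Both behaviors are easy to demonstrate in simulation. In this hypothetical sketch (all specifics invented), the same normal-with-flat-prior replications give a check on the mean that sits near 0.5 even for badly skewed data, while a check on the minimum flags the misfit:

```python
import numpy as np

rng = np.random.default_rng(2)

# heavily right-skewed data -- clearly not normal (a made-up example)
y = rng.lognormal(0.0, 1.0, size=100)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# posterior predictive replications under the normal model with flat prior
n_rep = 2000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_rep)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, n))

# the check on the mean essentially never rejects ...
p_mean = (y_rep.mean(axis=1) >= ybar).mean()   # close to 0.5

# ... but the check on the minimum does: the normal replications go
# negative, while lognormal data never do
p_min = (y_rep.min(axis=1) <= y.min()).mean()  # close to 1
```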


The basic idea of posterior predictive checking is, as they say, breathtakingly simple: (a) graph your data, (b) fit your model to data, (c) simulate replicated data (a Bayesian can always do this, because Bayesian models are always “generative”), (d) graph the replicated data, and (e) compare the graphs in (a) and (d).

Why can’t this be done with any model? In other words, why does one have to be a Bayesian to simulate data?

Numeric:

Posterior predictive checking can be done in a non-Bayesian context by simulating replicated data from the model associated with a point estimate of the parameters, that is, simulating from p(y.rep|theta.hat(y)) rather than the fully Bayesian p(y.rep|y), which averages over theta. This should work just fine in settings where a point estimate does the job. In fact, two of my favorite examples of simulation-based model checking come from Bush and Mosteller (1954) and Ripley (1989), and both were based on point estimates.
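A hypothetical sketch of this plug-in version (made-up data; the normal model is deliberately wrong here so the check has something to find):

```python
import numpy as np

rng = np.random.default_rng(1)

def skew(x, axis=-1):
    # sample skewness along an axis
    m = x.mean(axis=axis, keepdims=True)
    s = x.std(axis=axis, keepdims=True)
    return (((x - m) / s) ** 3).mean(axis=axis)

# made-up data: strongly right-skewed
y = rng.exponential(2.0, size=200)

# fit by maximum likelihood under a (wrong) normal model
mu_hat, sigma_hat = y.mean(), y.std()

# simulate from p(y.rep | theta.hat) -- a plug-in point estimate,
# with no averaging over theta
y_rep = rng.normal(mu_hat, sigma_hat, size=(1000, len(y)))

# the symmetric replications cannot reproduce the observed skewness
p_value = (skew(y_rep, axis=1) >= skew(y)).mean()   # extreme, near 0
```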

However, there is also a whole world of non-Bayesian statistics that does not use generative models. I’m thinking of methods that go by labels such as generalized estimating equations, robust inference, and machine learning. In many of these areas, statistical methods are devised without any generative model for the data, and then you can’t do this sort of model checking.

I love posterior predictive checks. I think they are the easiest way to check nearly any connection between your model and the data. Besides goodness-of-fit tests, they can be used to identify outliers, without the need for any threshold. If you have a model that explains most aspects of your data well, you simply count how many simulated data points are greater than the observed data point. I use this method in the context of MR images of the brain.
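A sketch of this counting idea (hypothetical data with one planted outlier; the normal model with flat prior is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up data: 99 well-behaved points plus one planted outlier
y = np.append(rng.normal(0.0, 1.0, size=99), 6.0)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# posterior predictive replications under a normal model with flat prior
n_rep = 4000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_rep)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, n))

# for each observation, count the fraction of simulated values above it;
# fractions near 0 or 1 flag points the fitted model does not expect
tail_prob = (y_rep > y[None, :]).mean(axis=0)
outliers = np.flatnonzero((tail_prob < 0.001) | (tail_prob > 0.999))
```

Reporting `tail_prob` directly is the threshold-free version; the 0.001 cutoff here is arbitrary and only used to pick out the planted point.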

I don’t think that there is something better out there with which you can assess the fit of your model. For me, the two most obvious reasons why people do not use posterior predictive checks are (a) they do not want to (why?) or cannot use Bayesian statistics (on this point, there is always the possibility of simulating fake data on the basis of the approximate normal distribution of the ML estimates), or (b) they do not think beyond classical t-tests and AN(C)OVAs. And even if more people did use posterior predictive checks or some other kind of goodness-of-fit test, many people are afraid of observing a serious defect of their model because they don’t know how to fix it.

Thanks Andrew, I somehow missed that long post from a couple years ago.

It provided a lot of clarification, including that posterior predictive checks are just one of the usually many “different sorts of model checks corresponding to comparisons of the posterior distribution to different sorts of knowledge: prior-posterior comparisons, cross-validation, external validation, and comparisons of predictions to observed data”.

Thanks for the reply, Andrew! I had a look at Chapter 1 of “Bayesian Data Analysis”. Much of your explanation there seems to me more like a collection of various possibilities for interpreting probabilities than a single consistent one. There is talk about frequencies and subjective betting rates, implicitly acknowledging that these are not the same. There is the sentence “Frequency interpretations can usually be constructed, however, and this is an extremely useful tool in statistics”, which suggests that pretty much everything can be interpreted in an at least idealistically frequentist way, but the whole Section 1.5 (which seems to be the core bit on interpretation) doesn’t say precisely how prior probability distributions for unobservable parameters are interpreted, for which neither bets are possible (because there is no way to determine the winner) nor is there a clear reference set as required for frequencies. After going through this, I still don’t know what you have in mind. The only option I see just from this text is to imagine an artificial unobservable frequentist mechanism that throws out parameter values, of which (as pointed out in my discussion of posterior predictive checking) we can observe only a sample of effective size smaller than one (there is only one existing value, which we can’t observe precisely), unless of course we are in “empirical Bayes” situations where a proper meaning can be given to the idea of repeating the parameter-generating process.

Now I should probably say that I don’t think that such an interpretation is really wrong and should never be adopted. Neither am I against posterior predictive checking in general. If you adopt such an “idealistic frequentist” interpretation and you do observe something that doesn’t pass posterior predictive checking, this is informative and you can be quite sure that your model is wrong (in a sense in which you didn’t want it to be wrong).

However, I still would like to point out that the status of the prior distribution is *very* idealistic in that it is very weakly connected to what is actually observable. It could be tested and perhaps rejected, but testing it and failing to reject it doesn’t really “confirm” it, because such tests can only have very limited power, cannot diagnose any feature of the prior apart from location, and can certainly not achieve “severity” in Mayo’s sense (unless we are in an “empirical Bayes” situation).

I support using such a tool in some situations (e.g., I have worked a lot on mixture distributions, and currently Bayesian regularisation seems superior to the frequentist approaches I have seen for dealing with degenerating likelihoods), but I’d be reluctant to attach too much meaning to the actual probabilities defined by the posterior probability distribution (apart from taking it as a potentially valuable exploratory tool), because this is again a distribution over parameter values and as such requires very idealistic thinking.

Christian:

1. We have two detailed examples in chapter 1 demonstrating how probabilities can be estimated (that is, measured).

2. I do not recommend betting as a principle for probability. In section 1.5, I write, “Why is probability a reasonable way of quantifying uncertainty? The following reasons are often advanced. . . .” I mention betting as #3 of a list of 3 and then write, “The betting rationale has some fundamental difficulties . . .” I put in the betting rationale because it is often discussed in the literature, but I don’t myself think that betting is very helpful for understanding probability.

3. Bayesian statistics is often presented in a paradigm in which there is a single likelihood (coming from a single dataset) and then many different priors can be considered. I find it more helpful to think of a single prior that can be applied to many possible datasets. We discuss this perspective in section 2.8 of Bayesian Data Analysis. I think there is a very close connection between Bayesian inference and hierarchical models. The prior represents group-level inference.

There is also a close connection between the replications in reference sets of frequentist statistics, and the replications in exchangeable models in Bayesian statistics. In each case, the statistician is deciding to average over some set of cases. In frequentist statistics, the averaging over the reference set determines the probability distributions (from which come the p-values, confidence statements, etc); in Bayesian statistics, the exchangeable model allows a single prior to apply to many sub-problems. In both frequentist and Bayesian settings, the number of replications is in practice finite but is conceptually infinite. For a frequentist: you can never really roll any given die forever, eventually it will crumble and fall apart (or, to put it another way, the assumption of the reference set degrades, and ultimately you need a model that allows the probabilities to change over time). For a Bayesian: you never have an infinite number of schools or counties or whatever your groups are (or, to put it another way, at some point you have information that renders the exchangeable model unpalatable).

4. Your use of the word “power” suggests that you would like your hypothesis tests to reject models when they are false. But I already know my models are false. I do posterior predictive checks not to reject a model and put its scalp on the wall, but to explore the aspects in which the fitted model fails to be coherent with the data.

You write that you’d “be reluctant to attach too much meaning to the actual probabilities defined by the posterior probability distribution.” I’m reluctant too! These probabilities are conditional on the model. Moving from model-based probabilities to unconditional total probabilities (of the sort that one would want to use in a practical decision analysis) is a challenge, and this is one reason I want to separate Bayesian data analysis from any formal justification based on coherence of decisions. In practice the model will not be perfect.

Andrew, I’m fine with your general use of models and the emphasis that you already know that the model is false. I also agree with you saying that subjective decisions and idealisations that may be problematic are required in frequentist analyses as well, and generally when dealing with mathematical models (actually I have a quite general paper on this: C. Hennig, “Mathematical models and reality – a constructivist perspective,” Foundations of Science 15: 29-49, 2010).

However I still think that the connection between the prior and anything that is observable is very weak, much weaker than frequentist/exchangeability models are in situations where there is some kind of repetition, in many situations (namely where it is assumed that the prior generates only a single unobservable parameter, which is often the case in Bayesian analysis). The examples in “Bayesian Data Analysis” are, as far as I can see (note that I currently look up the Google books version which has pages 18 and 19 skipped), about “empirical Bayes” situations in which the distributions in question can be linked to repeated (albeit imprecise) observations.

Of course in the “single parameter generated by prior” situations you may still argue, and I’d agree, that the prior can have some use, e.g., for regularisation, and if you’re not going to interpret posterior probabilities, that’s all fine by me.

However, this means that a Bayesian approach should strictly not be advertised as yielding probabilities for the unobservable quantities supposedly of interest (unless we are in an “empirical Bayes” situation), probabilities which can be interpreted in a way appealing to “common sense”, because these probabilities are not reliable and not of clear meaning. But like many other Bayesians, you use this marketing strategy as well; see pp. 3-4 of “Bayesian Data Analysis”. Shouldn’t you rather loudly and clearly tell people that they should *not* believe that these probabilities can be interpreted intuitively and are the most meaningful thing to get out of a Bayesian analysis?

Christian:

I think we make it clear in our book that Bayesian predictions are model-dependent. Non-Bayesian predictions are model-dependent too (or, for the non-modelers in the house, we can call them “procedure-dependent”). In the concluding section of my article under discussion here, I make it clear that in my opinion my Bayesian philosophy of inference is incomplete—as are all other philosophies of inference. Regarding “marketing,” some Bayesians advertise the Bayesian approach as being (a) the essence of logical reasoning and (b) the only form of logical inference. I do not claim that; in fact, in chapter 1 of BDA we are careful not to make such claims. Before our book came out in 1995, it was standard for books on Bayesian statistics to make much of the supposed logical advantages of Bayes; we do not do so. As I have written in my three recent philosophy articles, I think Bayesian data analysis is essentially deductive conditional on a model, but we do not yet have (and may never have) a coherent philosophy of model building and model checking.

Christian:

Nasty typo “these probabilities can be interpreted intuitively and are [NOT?] the most meaningful thing”.

From here http://andrewgelman.com/2011/04/so-called_bayes/ I commented that “the perhaps formally understandable ‘salesman’s puffing’ about the value of _the_ posterior (for everyone?) that one gets – well, maybe it’s time to start losing that.”

Much earlier, in response to my “Two Cheers for Bayes” (Controlled Clinical Trials, Volume 17, Issue 4, August 1996, Pages 350-352), J. Kadane commented

“I agree with the author that “when a posterior is presented, I believe it should be clearly and primarily stressed as being a ‘function’ of the prior probabilities and not the probability ‘of treatment effects’.” So, I think, do most Bayesians.” I believe ‘function’ was accepted as meaning the posterior’s value being highly dependent on the prior.

But in practice that seems to be rare, Andrew being an exception. One Bayesian colleague once told me that my comments deriding the value of posterior probabilities (given the usually poorly motivated prior) were something that made his “blood boil”.