This is a long one, but it’s good stuff (if you like this sort of thing). Dana Kelly writes,

We’ve corresponded on this issue in the past, and I mentioned that I had been taken to task by a referee who claimed that model checks that rely on the posterior predictive distribution are invalid because they use the data twice.

I’ve finally tracked down the paper (see attached) that I think is behind this criticism, and I’d like to get your thoughts on it. I’ve thought about this issue before and after the referee’s critique, and I still don’t see any basic flaw in using the observed data to calculate a statistic based on the posterior predictive distribution.

There are some points in the paper that I find confusing:

1. On p. 9, the authors state that the posterior predictive p-value is not really a p-value because “for large sample sizes, [it] will essentially equal the classical p-value, which is quite non-Bayesian in character.” It seems that, on these grounds, one could claim the posterior mean of a parameter is non-Bayesian, because, for large sample sizes, it will essentially equal the classical MLE, which is non-Bayesian.

2. On pp. 9-10, the authors mention the “apparent double use of the data…to convert the prior [into a posterior] and then for computing the tail area.” They then say they will show an example of the unnatural behavior this double use of the data can induce. The following example seems to me artificial, having no features of any practical problem that I can envision. In Example 4.4, letting the absolute value of the mean of the observed values go to infinity makes no sense to me, and therefore I don’t see the validity of the conclusion about the lower bound on the p-value, “which can be directly traced to the double use of the data.”

3. The conditional predictive p-value introduced by the authors in Sec. 4.2 seems computationally impractical. On p. 12 the authors mention the difficulty of such an approach for discrete data. Also, the effort on p. 16 to show a connection with a classical p-value seems odd, given the authors’ criticism of the posterior predictive p-value on just these grounds.

4. The partial posterior predictive p-value seems more computationally tractable, but is harder to compute than the posterior predictive p-value. Also, there appear to be some errors in the math on pp. 17-18, but I need to do some more checking to be sure.

I very much appreciate any thoughts you have on this paper and the larger issue of using the posterior predictive distribution for model-checking. As I’ve said in the past, I have become a very big fan of these kinds of checks, as described in Ch. 6 of your text.

If you think this topic is worthy of discussion on your blog, feel free to post it there. Thanks in advance.

My reply:

The posterior predictive statements are indeed Bayesian, as described in Chapter 6 of Bayesian Data Analysis and also in my 1996 paper with Meng and Stern: they are statements about p(y.rep|y), where y is the observed data and y.rep are hypothetical replications from the model. These are posterior probability statements. I agree with your statement that, just because a Bayesian calculation is close to a classical calculation, that doesn’t make it non-Bayesian.
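To make the p(y.rep|y) computation concrete, here is a minimal simulation sketch (my own illustration, not from the correspondence): draw a parameter value from the posterior, draw a replicated dataset given that value, and compare a test statistic T on the replicate with T on the observed data. I assume a normal model with unknown mean, known sigma = 1, a flat prior on the mean, and made-up data; the test statistic max(y) is my arbitrary choice.

```python
import random
import statistics

# Sketch of a posterior predictive check: normal model, unknown mean mu,
# known sigma = 1, flat prior on mu, so mu | y ~ N(ybar, sigma^2 / n).
# Test statistic T(y) = max(y), an arbitrary choice probing the upper tail.

def posterior_predictive_pvalue(y, sigma=1.0, n_sims=100_000):
    n = len(y)
    ybar = statistics.fmean(y)
    t_obs = max(y)
    exceed = 0
    for _ in range(n_sims):
        mu = random.gauss(ybar, sigma / n**0.5)              # posterior draw
        y_rep = [random.gauss(mu, sigma) for _ in range(n)]  # replicated data
        if max(y_rep) >= t_obs:                              # compare T values
            exceed += 1
    return exceed / n_sims

random.seed(1)
y = [0.3, -1.1, 0.8, 2.0, -0.4, 1.2]   # made-up data
p = posterior_predictive_pvalue(y)
```

A middling p here just says the model has no trouble reproducing the observed maximum; the point of the check is the comparison of T(y.rep) with T(y), conditioning on y exactly once.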

In addition, I disagree with the double-use claim: as noted in the probability expression above, the posterior predictive distribution conditions on the data y exactly once. It’s ok to consider these other methods also but, to me, the posterior predictive check is a more direct approach.

We further discuss some of these issues in this discussion of a forthcoming paper by Bayarri and Castellanos. (Their paper and other discussions can be found on the journal’s website.)

Hal Stern also adds, regarding the beginning of Dana’s question,

The referee’s comment that the p-value is “invalid” is not very precise. I suspect he/she is referring to the published results regarding the absence of a uniform distribution under the null. Posterior predictive checks are clearly valid in the sense that a small posterior predictive probability indicates that the observed data feature is extremely unlikely given the model.

Dana then pointed me to a recent paper by Val Johnson citing the lack of asymptotic uniformity of the posterior predictive p-value as a drawback to its use in model checking.

I replied that I am a big fan of Val Johnson’s work but that he seems to focus on analytical methods where I favor graphics. It’s a stylistic difference as much as anything; if you look at his paper carefully he does say that posterior predictive checks are ok as diagnostics. As my colleagues and I, among others, have discussed in various papers, I don’t see why the uniform distribution is necessary. It’s more important to me to have these p-values have direct interpretations as posterior probabilities.

Regarding point 2, having an asymptotic lower bound on p that depends only on n is a downside of posterior predictive p-values. If |x-bar| and p both reflect misfit, and we can make |x-bar| arbitrarily large, it seems like we should be able to make p arbitrarily small if they reflect the same information.

Richard,

I don't understand your comment. Can you explain?

If I understand Bayarri and Berger's point in example 4.4, it is that the posterior predictive p-value poorly characterizes the model misfit because asymptotically, the p-value doesn't go to 0. It seems like this could create some weird situations. For instance, in the example they use n=4 and show that as |x-bar| goes to infinity, p approaches .12. It seems to me that in some cases, this will lead to strange results.

For instance, assume the null model as they do in example 4.4. Let n=8, and suppose we get data with |x-bar|=1.125 and s^2=2. Then the posterior predictive p=.10. Assume that another researcher collects n=4 observations and gets |x-bar|=100000 and s^2=2. Then the posterior predictive p=.12. The lower bound on p means that there is no data we could have obtained in the n=4 case that would lead to a lower posterior predictive p. If p is a generic index of model fit, then this situation is strange: the data that are clearly most inconsistent with the null model (the n=4 data) have the higher posterior predictive p.
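A quick numerical check of these two numbers (my own reconstruction of the setup; I'm assuming the Example 4.4 null is x_i ~ N(0, sigma^2) with the noninformative prior proportional to 1/sigma^2 and test statistic |x-bar|): the posterior for sigma^2 is then inverse-gamma, and both p-values can be simulated directly.

```python
import random

def pp_pvalue(n, xbar_abs, s2, n_sims=200_000, seed=2):
    """Posterior predictive p-value for T = |xbar|, assuming the null
    x_i ~ N(0, sigma^2) with prior 1/sigma^2 (my reconstruction)."""
    random.seed(seed)
    ss = (n - 1) * s2 + n * xbar_abs**2   # sum of squares of the x_i
    exceed = 0
    for _ in range(n_sims):
        # sigma^2 | x ~ Inv-Gamma(n/2, ss/2), i.e. ss over a chi-square(n) draw
        chi2 = sum(random.gauss(0, 1)**2 for _ in range(n))
        sigma2 = ss / chi2
        xbar_rep = random.gauss(0, (sigma2 / n)**0.5)   # replicate of xbar
        if abs(xbar_rep) >= xbar_abs:
            exceed += 1
    return exceed / n_sims

p8 = pp_pvalue(8, 1.125, 2.0)     # roughly .10
p4 = pp_pvalue(4, 100000.0, 2.0)  # roughly .12, near the n=4 lower bound
```

Under these assumptions the n=4 dataset with its enormous |x-bar| indeed cannot push p below about .12, while the far less extreme n=8 dataset gets a lower p.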

I don't know about the claim regarding "double use" of the data (that always seems dubious to me, since it is the same criticism I get of Bayesian hierarchical modeling in general), but the asymptotic properties of posterior predictive p are worrying.

Richard,

I'll have to look at their example more closely, but my quick responses are:

1. There's no need for the p-value to attain arbitrarily low values. For example, suppose you have a simple normal model with unknown mean and a flat prior, and your test statistic is simply y.bar. (Not x.bar; recall the difference between a theoretical statistician and an applied statistician.) The p-value will then be 0.5 no matter what your data are. I don't see this as a problem; it's a fact that this normal model fits the mean of the data perfectly, no matter what.
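To see this numerically, here is a small sketch (my illustration, assuming known sigma = 1 for simplicity and arbitrary made-up data; with a flat prior, mu | y ~ N(ybar, sigma^2/n), so y.bar.rep is symmetric around the observed y.bar and the p-value is 0.5 by construction):

```python
import random
import statistics

# Normal model, unknown mean, known sigma = 1, flat prior on mu,
# test statistic T(y) = ybar. Posterior: mu | y ~ N(ybar, 1/n).
random.seed(3)
y = [5.0, -3.0, 12.0, 0.5]        # any data at all
n, ybar = len(y), statistics.fmean(y)

n_sims = 100_000
exceed = 0
for _ in range(n_sims):
    mu = random.gauss(ybar, (1 / n)**0.5)                         # posterior draw
    ybar_rep = statistics.fmean([random.gauss(mu, 1.0) for _ in range(n)])
    if ybar_rep >= ybar:
        exceed += 1
p = exceed / n_sims   # close to 0.5 regardless of y
```

Whatever data you plug in, p stays near 0.5: the model always fits the sample mean, so a check based on y.bar carries no evidence of misfit.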

A posterior predictive check (and, I would argue, model checks in general) is relevant to checking a particular aspect of misfit between data and model. If a model can always fit a certain aspect of the data, then I wouldn't want it to give low p-values. To put it another way: the posterior predictive p-value is what it is, and the real question is what to do with it. I don't see it as a tool for rejecting models: I know ahead of time that all the models I ever work with are false.

2. I don't compare different datasets by comparing p-values. This is in general an incoherent approach: you give one example, but in fact any p-value approach to comparing datasets of different sample sizes will be incoherent. This is a standard problem with classical p-values (of which posterior predictive p-values are simply a generalization), as Carl Morris and others have pointed out. I use predictive checks to evaluate the fit of a given dataset.

Andrew,

1. I see your point about the flat prior and the normal, but in your example the fit doesn't become arbitrarily bad; as you point out, the model always fits. In their example, in contrast, the model fit becomes worse and worse as |y-bar| (or |x-bar|, as they call it :) ) becomes arbitrarily large. With n=4 it seems like p=.12 really doesn't reflect the horrible misfit of the null model when |y-bar| is extremely large (and s^2 is small). In cases such as this, one would think p should approach 0.

2. I figured as much (I was thinking, after I constructed the example, that you could find similar examples, even with no lower bound problem); I was trying to unpack their example. Perhaps Bayarri and Berger wouldn't have advocated such an interpretation.

Of course, this all assumes that you find this kind of inference compelling. I agree with you and prefer graphical means of assessing misfit. We've found that helpful in our psychological models; sometimes the form of the misfit tells you about what is going on, and neither a p-value nor a Bayes factor will show you that.