## Bayesian Posteriors are Calibrated by Definition

Time to get positive. I was asking Andrew whether it’s true that I have the right coverage in Bayesian posterior intervals if I generate the parameters from the prior and the data from the parameters. He replied that yes indeed that is true, and directed me to:

• Cook, S.R., Gelman, A. and Rubin, D.B. 2006. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics 15(3):675–692.

The argument is really simple, but I don’t remember seeing it in BDA.

### How to simulate data from a model

Suppose we have a Bayesian model composed of a prior with probability function $p(\theta)$ and sampling distribution with probability function $p(y \mid \theta)$. We then simulate parameters and data as follows.

Step 1. Generate parameters $\theta^{(0)}$ according to the prior $p(\theta)$.

Step 2. Generate data $y^{(0)}$ according to the sampling distribution $p(y \mid \theta^{(0)})$.
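The two steps can be sketched in a few lines of code. The conjugate beta-binomial model below is just an assumed example (a Beta(2, 2) prior on a coin's bias and a Binomial(20, theta) sampling distribution), not anything specific from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example model: Beta(2, 2) prior, Binomial(n, theta) sampling distribution.
a, b, n = 2.0, 2.0, 20

# Step 1: generate parameters theta^(0) from the prior p(theta).
theta0 = rng.beta(a, b)

# Step 2: generate data y^(0) from the sampling distribution p(y | theta^(0)).
y0 = int(rng.binomial(n, theta0))

# Together, (theta0, y0) is one draw from the joint p(theta, y).
print(theta0, y0)
```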

There are three important things to note about the draws:

Consequence 1. The chain rule shows $(\theta^{(0)}, y^{(0)})$ is generated from the joint $p(\theta, y)$.

Consequence 2. Bayes’s rule shows $\theta^{(0)}$ is a random draw from the posterior $p(\theta \mid y^{(0)})$.

Consequence 3. The law of total probability shows that $y^{(0)}$ is a draw from the prior predictive $p(y)$.

### Why does this matter?

Because it means the posterior is properly calibrated in the sense that if we repeat this process, the true parameter value $\theta^{(0)}_k$ for any component $k$ of the parameter vector will fall in a 50% posterior interval for that component, $p(\theta_k \mid y^{(0)})$, exactly 50% of the time under repeated simulations. Same thing for any other interval. And of course it holds up for pairs or more of variables and produces the correct dependencies among variables.
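As a sanity check of that claim, here is a small simulation under an assumed conjugate beta-binomial model (my choice for illustration, not from the post), where the central 50% posterior interval is available in closed form; the coverage comes out near 50%.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(1)
a, b, n = 2.0, 2.0, 20   # assumed Beta(2, 2) prior, Binomial(20, theta) likelihood
sims = 4000

covered = 0
for _ in range(sims):
    theta0 = rng.beta(a, b)                  # Step 1: theta^(0) ~ p(theta)
    y0 = rng.binomial(n, theta0)             # Step 2: y^(0) ~ p(y | theta^(0))
    post = beta_dist(a + y0, b + n - y0)     # conjugate posterior p(theta | y^(0))
    lo, hi = post.ppf(0.25), post.ppf(0.75)  # central 50% posterior interval
    covered += (lo <= theta0 <= hi)

print(covered / sims)  # close to 0.50
```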

### What’s the catch?

The calibration guarantee only holds if the data $y^{(0)}$ actually was generated from the model, i.e., from the data marginal $p(y)$.

Consequence 3 assures us that this is so. The marginal distribution of the data in the model is also known as the prior predictive distribution; it is defined in terms of the prior and sampling distribution as

$p(y) = \int_{\Theta} p(y \mid \theta) \, p(\theta) \, \mathrm{d}\theta$.

It’s what we predict about potential data from the model itself. The generating process in Steps 1 and 2 above follows this integral. Because $(\theta^{(0)}, y^{(0)})$ is drawn from the joint $p(\theta, y)$, we know that $y^{(0)}$ is drawn from the prior predictive $p(y)$. That is, we know $y^{(0)}$ was generated from the model in the simulations.
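Numerically, the two-step draws of $y^{(0)}$ do match the integral. For the beta-binomial model assumed here as an illustration, the marginal $p(y)$ has a closed form (the beta-binomial pmf), so we can compare it against simulated frequencies.

```python
import numpy as np
from scipy.special import betaln, gammaln

rng = np.random.default_rng(2)
a, b, n = 2.0, 2.0, 20   # assumed Beta(2, 2) prior, Binomial(20, theta) likelihood
sims = 20000

# Two-step process: marginally, each y is a draw from the prior predictive p(y).
theta = rng.beta(a, b, size=sims)
y = rng.binomial(n, theta)

# Closed-form prior predictive for this model: the beta-binomial pmf.
k = np.arange(n + 1)
log_comb = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
analytic = np.exp(log_comb + betaln(a + k, b + n - k) - betaln(a, b))

empirical = np.bincount(y, minlength=n + 1) / sims
print(np.max(np.abs(empirical - analytic)))  # small Monte Carlo error
```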

### Cook et al.’s application to testing

Cook et al. outline the above steps as a means to test MCMC software for Bayesian models. The idea is to generate a bunch of data sets $(\theta^{(0)}, y^{(0)})$, then for each one make a sequence of draws $\theta^{(1)}, \ldots, \theta^{(L)}$ according to the posterior $p(\theta \mid y^{(0)})$. If the software is functioning correctly, everything is calibrated and the quantile in which $\theta^{(0)}_k$ falls in $\theta^{(1)}_k, \ldots, \theta^{(L)}_k$ is uniform. The reason it’s uniform is that it’s equally likely to be ranked anywhere because all of the $\theta^{(\ell)}$ including $\theta^{(0)}$ are just random draws from the posterior.
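A sketch of that test (essentially what later came to be called simulation-based calibration); here exact conjugate posterior draws stand in for the MCMC output being tested, so the ranks should come out uniform.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, n = 2.0, 2.0, 20   # assumed conjugate beta-binomial model
L, sims = 99, 2000       # L posterior draws per simulated data set

ranks = []
for _ in range(sims):
    theta0 = rng.beta(a, b)                       # theta^(0) ~ p(theta)
    y0 = rng.binomial(n, theta0)                  # y^(0) ~ p(y | theta^(0))
    draws = rng.beta(a + y0, b + n - y0, size=L)  # exact draws from p(theta | y^(0))
    ranks.append(int(np.sum(draws < theta0)))     # rank of theta^(0) among the draws

ranks = np.array(ranks)
# If the "software" is correct, the rank is uniform on {0, 1, ..., L}.
print(ranks.mean())  # close to L / 2 = 49.5
```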

### What about overfitting?

Overfitting occurs when the model fits the data too closely and fails to generalize well to new observations. In theory, if we have the right model, we get calibration, and there can’t be overfitting of this type: predictions for new observations are automatically calibrated, because predictions are just another parameter in the model. In other words, we don’t overfit by construction.

In practice, we almost never believe we have the true generative process for our data. We often make do with convenient approximations of some underlying data generating process. For example, we might choose a logistic regression because we’re dealing with binary trial data, not because we believe the log odds are truly a linear function of the predictors.

We even more rarely believe we have the “true” priors. It’s not even clear what “true” would mean when the priors are about our knowledge of the world. The posteriors suffer the same philosophical fate, being more about our knowledge than about the world. But the posteriors have the redeeming quality that we can test them predictively on new data.

In the end, the theory gives us only cold comfort.

### But not all is lost!

To give ourselves some peace of mind that our inferences are not going astray, we try to calibrate against real data using cross-validation or actual hold-out testing. For an example, see my Stan case study on repeated binary trial data, which Ben and Jonah conveniently translated into RStanArm.

Basically, we treat Bayesian models like meteorologists (they are making probabilistic predictions, after all). To assess the competence of a meteorologist, one asks: on how many of the $N$ days for which we said it was 20% likely to rain did it actually rain? If the predictions are independent and calibrated, we’d expect the distribution of rainy days to be $\mathrm{Binom}(N, 0.2)$. To assess the competence of a predictive model, we can do exactly the same thing.
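The meteorologist check is easy to simulate; the forecaster and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10000  # hypothetical days on which the forecaster said "20% chance of rain"

# A calibrated forecaster: on those days it actually rains 20% of the time.
rained = rng.random(N) < 0.2
count = int(rained.sum())

# Under independence and calibration, count ~ Binom(N, 0.2):
# mean N * 0.2 = 2000, standard deviation sqrt(N * 0.2 * 0.8) = 40.
print(count, count / N)
```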

1. Ben Goodrich says:

We implemented this test for the models in the rstanarm package via its pp_validate() function. The theory of it is nice when the software is functioning correctly. The test does not seem to have a great deal of power, but if you draw from the prior predictive distribution and then plug in “approximate” posterior draws (drawn from the estimated sampling distribution of the ML estimator, or from ADVI’s multivariate normal approximation to the posterior in the unconstrained space), then you often get rejected by the test.

A lot of people seem to think that overfitting is a phenomenon due to using the same data to evaluate the model as you used to estimate the parameters, but it seems to have as much to do with the choice to use optimization to estimate the parameters, rather than using NUTS to draw from the posterior distribution of the parameters. That and specifying bad models.

2. Corey says:

Incidentally, this is why Nic Lewis’s claim, “I think the reason Bayesian updating can give poor probability matching, even when the original posterior used as the prior gives exact probability matching in relation to the original data, is that conditionality is not always applicable to posterior distributions,” doesn’t really make sense. And indeed when I read the paper I discovered that the thing he was calling the likelihood wasn’t actually the likelihood (and the thing he was calling the prior wasn’t actually the prior). That works fine for the original posterior because his “prior” times his “likelihood” is indeed an actual posterior — it’s the splitting of the posterior into two factors that’s wonky. But what he calls “Bayesian updating” doesn’t actually use the correct likelihood.

3. Overfitting is a bit more subtle than presented here: the calibration is guaranteed only over the _ensemble_ of true values and measurements, not for any particular value and measurement. Consequently, for a given true value and measurement the posterior can concentrate strongly away from the true value; it’s just that when averaged over all possible true values and measurements those biases cancel out to recover the correct calibration.

The ensemble is great for testing, but you need much stronger conditions to guarantee good performance on a given analysis. Then again, that’s why we have (both in-sample and out-of-sample) posterior predictive checks.

• Right. The lack of overfitting is only in expectation—that is, you expect 50% of the parameters across simulations to fall in their posterior 50% intervals if you are using the right model.

One problem in the literature is that “overfitting” is never precisely defined anywhere. Wikipedia is no help here: by the time they’ve moved on to machine learning, they’re talking about overfitting arising from not stopping an iterative algorithm soon enough (i.e., not enough ad hoc regularization).

• MJT says:

i’m lucky enough to hear rod little drive his point home about a separation of concerns (echoed in various gelman posts): model vs estimand vs estimate

When summarizing / approximating generated data, the umbrella cry of “overfitting” can be attributed to any of these sources.

4. So as a title for this post I’d prefer something like:

Sampling from the prior and the conditional likelihood produces a frequency-calibrated sample from the posterior

It’s not that the posterior is somehow “calibrated by definition” but rather that this sampling process converts probability into frequency.

• I think this is the same thing as Michael was saying—we don’t get guarantees for any particular data set, only in aggregate across repetitions.

It’s a slightly more subtle point. It’s not so much that the posterior is calibrated to this generating process; it’s more that given any distribution you can create a sampler that has that distribution as its frequency distribution. The calibration is an asymptotic property of the generator.

5. Leon says:

Not sure if this is 100% correct, but …

Nowadays I think of e.g. 90% frequentist calibration as “90% worst-case Bayesian *predictive* calibration”. To flesh this out: if I = I(y, theta) is the event that the interval formed from data y covers the hidden theta, then

1. Bayesian prior predictive is Pr{I(replicate y, theta)} = 90%

2. Bayesian posterior is Pr{I(observed y, theta) | observed y} = 90%

3. Bayesian posterior predictive is Pr{I(replicate y, theta) | observed y} = 90%

4. Frequentist coverage is Pr{I(replicate y, theta) | theta} = 90% for the worst-case choice of theta.

Note that the probability in (4.) is equivalent to

1. The Bayesian prior predictive with a point mass prior on the worst case theta

3. The Bayesian posterior predictive with a point mass prior on the worst case theta

because given theta, the replicate and observed data are iid.

On the other hand, I don’t think there’s any way to sensibly compare the frequentist and posterior coverage in general.

In my personal philosophy of statistics this translates to: “although frequentist coverage and Bayesian *predictive coverage* are about the same thing (worst vs. average case), frequentist coverage and posterior coverage are answering in-principle different questions, and so aren’t really comparable”.

• Corey says:

In my view, the distinction between worst-case vs. average-case analysis is at the root of the philosophical distinction between Bayesian and frequentist approaches. Frequentist techniques generally seek to minimize the maximum possible inferential error (e.g., over parameter space in some class of models) while Bayesian techniques seek to minimize the expected inferential error with respect to some prior.

• Leon says:

I think this is true of predictive error (what is usually referred to as “calibration”), but not of inferential error. The way I understand it is that sampling theory — unlike Bayes and likelihood — isn’t really inferential at all. So there’s no comparison.

• Carlos Ungil says:

Isn’t null hypothesis testing inferential?

• Carlos Ungil says:

Actually I agree that frequentist inference is not “valid” and only Bayes makes sense, so my question is why likelihood is ok. In the absence of a prior, do you get only likelihood ratios or is there something else to it?

6. Louis says:

At the risk of asking something stupid:

The post suggests that if I simulate \theta^{(0)} according to some prior p(\theta), then this is also a draw from the posterior p(\theta|y^{(0)}), regardless of what the data look like (see Consequence 2).

I am a bit puzzled by this.

• Carlos Ungil says:

In the end, you’re sampling from the joint distribution { theta, y } and it doesn’t matter if you do it one way
(a) sample first theta ~ p(theta) and then y ~ p(y|theta)
or the other
(b) sample first y ~ p(y) and then theta ~ p(theta|y)

If you repeat procedure (a) many times and look at the distribution of theta conditional on y, it will be indeed distributed as p(theta|y).
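This equivalence can be checked exactly on a toy discrete model (the numbers here are made up): building the joint as p(theta) p(y|theta) or as p(y) p(theta|y) gives the same table.

```python
import numpy as np

# Toy discrete model: theta in {0, 1}, y in {0, 1}; all numbers are made up.
prior = np.array([0.6, 0.4])          # p(theta)
lik = np.array([[0.8, 0.2],           # p(y | theta = 0)
                [0.1, 0.9]])          # p(y | theta = 1)

# Order (a): p(theta, y) = p(theta) p(y | theta)
joint_a = prior[:, None] * lik

# Order (b): p(theta, y) = p(y) p(theta | y)
marginal = joint_a.sum(axis=0)        # p(y)
posterior = joint_a / marginal        # p(theta | y); each column sums to 1
joint_b = posterior * marginal

print(np.allclose(joint_a, joint_b))  # True: same joint either way
```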

But I agree that the way it is presented in the post is confusing: how can theta be a random draw from a distribution conditional on the data which is in turn conditional on that very same value of the theta?

• It’s the wonder of Bayes’s rule, from which we derive

$p(\theta \mid y) \propto p(\theta, y) = p(y \mid \theta) \, p(\theta)$.

By the way, you can include LaTeX in WordPress blogs. The problem is there’s no way to edit comments on WordPress unless you’re a blog administrator.

• Alex says:

From the joint samples, say from Carlos’ procedure (a), how do you “look at the distribution of theta conditional on y”? You just have many joint samples $(\theta^{(k)}, y^{(k)})$. Granted, by Bayes the joint is proportional to the conditional distribution, but it seems misleading that you can “look at” the conditional distribution directly.

Louis, one subtlety in Consequence 2 is that it concerns $p(\theta \mid y^{(0)})$, not $p(\theta \mid y_1, \dots, y_n)$ based on actual data. The “posterior” in Consequence 2 is what you would obtain after updating with the single simulated datapoint $y^{(0)}$, so the actual data do not matter, as that is not what is being described. It’s quite a particular posterior being described in Consequence 2, not the posterior you normally get after updating with data.

I agree with Louis in that I don’t see how $\theta^{(0)}$ is a draw from $p(\theta \mid y^{(0)})$ because $\theta^{(0)}$ was drawn from the prior $p(\theta)$, so the “Bayes step” of multiplying prior by likelihood was not performed.

7. Keith O’Rourke says:

> The argument is really simple, but I don’t remember seeing it in BDA.
I think there is an important point underlying this: the argument is obvious, but it’s not obvious to many (most?), as well as being a likely taboo topic for many (i.e., mixing Bayesian and frequentist perspectives).

It’s unlikely to be discussed in BDA or in most other texts (some realize it clearly from the usual mathematical expositions of Bayes, but I think it’s rare).

When I used these ideas of two-stage sampling to explain Bayes to epidemiology students in 2005, they googled it, found nothing, and so announced in the next class that I must be wrong. The only published reference I could find on it at the time was a preprint of Cook et al. So in over 10 years nothing has changed http://andrewgelman.com/2006/12/18/example_of_a_co/

I do think there should be some pedagogical advantage of using this two-stage sampling approach with simulation in teaching Bayes even to aspiring professional statisticians.

Simulate a joint distribution and form 95% intervals:
1. Note, conditional on y’, these contain the theta that generated that y’ 95% of the time, for all y’ – probability intervals.
2. Note, conditional on theta’, these contain the theta’ that generated the data sometimes more/less than 95% of the time – not probability intervals or confidence intervals.
3. Note, you can search for a prior by re-weighting the prior (to change it and the intervals) so that it provides (approximate) 95% (or more) coverage conditional on every possible theta – an (approximate) confidence interval and a _reference_ prior.

At some point you will want to take them to Michael’s level of discourse if you can http://andrewgelman.com/2017/04/12/bayesian-posteriors-calibrated/#comment-464863 but only after they get what it’s all about.

3′ Note if you are successful in 3 and get a reference prior and confidence interval you can’t use that posterior as a prior for the next sample you get and still have confidence intervals (though they will still be probability intervals for that posterior taken as a prior).
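Keith’s point 2 can be seen in a small simulation; the beta-binomial model below is assumed for illustration, not taken from the thread. Conditional on a fixed theta’, the 95% posterior intervals need not cover it 95% of the time, and how far off they are depends on which theta’ you condition on.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(5)
a, b, n, sims = 2.0, 2.0, 20, 4000   # assumed Beta(2, 2) prior, Binomial(20, theta)

def coverage_given_theta(theta_true):
    # Conditional on a fixed theta', how often does the central 95%
    # posterior interval contain it?
    y = rng.binomial(n, theta_true, size=sims)
    lo = beta_dist.ppf(0.025, a + y, b + n - y)
    hi = beta_dist.ppf(0.975, a + y, b + n - y)
    return float(np.mean((lo <= theta_true) & (theta_true <= hi)))

cov_mid = coverage_given_theta(0.5)   # conditional coverage at the prior's center
cov_low = coverage_given_theta(0.05)  # conditional coverage in the prior's tail
print(cov_mid, cov_low)  # typically unequal, and neither need be exactly 0.95
```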

• I thought this was true, but also couldn’t find it anywhere. So I asked the oracle (aka Andrew). He only pointed me to Cook et al., so I’m really wondering where this was first derived in this form—hard to believe it was in a paper on software validation (alas, my oracle’s AWOL in old blighty at the moment).

8. psyoskeptic says:

See! Frequentist stats have a use! :)