Bayesian Posteriors are Calibrated by Definition

Time to get positive. I was asking Andrew whether it’s true that Bayesian posterior intervals have the right coverage if I generate the parameters from the prior and the data from the parameters. He replied that yes indeed that is true, and directed me to:

  • Cook, S.R., Gelman, A. and Rubin, D.B. 2006. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics 15(3):675–692.

The argument is really simple, but I don’t remember seeing it in BDA.

How to simulate data from a model

Suppose we have a Bayesian model composed of a prior with probability function $latex p(\theta)$ and sampling distribution with probability function $latex p(y \mid \theta)$. We then simulate parameters and data as follows.

Step 1. Generate parameters $latex \theta^{(0)}$ according to the prior $latex p(\theta)$.

Step 2. Generate data $latex y^{(0)}$ according to the sampling distribution $latex p(y \mid \theta^{(0)})$.
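
To make the two steps concrete, here is a minimal sketch in Python with NumPy, using a toy model of my own choosing rather than anything from the paper: a Beta(2, 2) prior on a chance of success $latex \theta$ and a binomial sampling distribution for $latex N$ trials.

    import numpy as np

    rng = np.random.default_rng(1234)
    N = 20  # number of binary trials per simulated data set

    # Step 1: draw the "true" parameter from the prior, theta ~ Beta(2, 2).
    theta_0 = rng.beta(2, 2)

    # Step 2: draw data from the sampling distribution, y ~ Binomial(N, theta_0).
    y_0 = rng.binomial(N, theta_0)

    print(theta_0, y_0)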

There are three important things to note about the draws:

Consequence 1. The chain rule shows $latex (\theta^{(0)}, y^{(0)})$ is generated from the joint $latex p(\theta, y)$.

Consequence 2. Bayes’s rule shows $latex \theta^{(0)}$ is a random draw from the posterior $latex p(\theta \mid y^{(0)})$.

Consequence 3. The law of total probability shows that $latex y^{(0)}$ is a draw from the prior predictive $latex p(y)$.
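
Spelled out in terms of densities, the first consequence is just the factorization

$latex p(\theta^{(0)}, y^{(0)}) = p(\theta^{(0)}) \, p(y^{(0)} \mid \theta^{(0)})$,

which is exactly how the pair was generated, and the second follows by Bayes’s rule,

$latex p(\theta^{(0)} \mid y^{(0)}) = \frac{p(\theta^{(0)}) \, p(y^{(0)} \mid \theta^{(0)})}{p(y^{(0)})}$,

with the marginal $latex p(y^{(0)})$ written out as an integral in the next section.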

Why does this matter?

Because it means the posterior is calibrated in the following sense: if we repeat this process, the true parameter value $latex \theta^{(0)}_k$ for any component $latex k$ of the parameter vector will fall in a 50% posterior interval for that component, computed from $latex p(\theta_k \mid y^{(0)})$, exactly 50% of the time. The same holds for an interval of any other probability. And of course it also holds for pairs or larger subsets of parameters, reproducing the correct dependencies among them.
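
Here is a quick Monte Carlo check of that coverage claim for the toy beta-binomial model above, where the posterior is available in closed form as Beta(2 + y, 2 + N - y). It is only a sketch using NumPy and SciPy under those assumptions, not code from the paper.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2024)
    N, num_sims = 20, 10_000
    covered = 0

    for _ in range(num_sims):
        theta_0 = rng.beta(2, 2)                 # Step 1: draw from the prior
        y_0 = rng.binomial(N, theta_0)           # Step 2: draw data given theta_0
        post = stats.beta(2 + y_0, 2 + N - y_0)  # conjugate posterior
        lo, hi = post.ppf(0.25), post.ppf(0.75)  # central 50% posterior interval
        covered += lo <= theta_0 <= hi

    print(covered / num_sims)  # should be close to 0.5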

What’s the catch?

The calibration guarantee only holds if the data $latex y^{(0)}$ actually was generated from the model, i.e., from the data marginal $latex p(y)$.

Consequence 3 assures us that this is so. The marginal distribution of the data in the model is also known as the prior predictive distribution; it is defined in terms of the prior and sampling distribution as

$latex p(y) = \int_{\Theta} p(y \mid \theta) \, p(\theta) \, \mathrm{d}\theta$.

It’s what we predict about potential data from the model itself. The generating process in Steps 1 and 2 above follows this integral. Because $latex (\theta^{(0)}, y^{(0)})$ is drawn from the joint $latex p(\theta, y)$, we know that $latex y^{(0)}$ is drawn from the prior predictive $latex p(y)$. That is, we know $latex y^{(0)}$ was generated from the model in the simulations.
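
For the toy beta-binomial model, this integral can be checked numerically: averaging the binomial probability of a fixed $latex y$ over draws from the prior gives a Monte Carlo estimate of $latex p(y)$, which should agree with the closed-form beta-binomial prior predictive. A sketch of that check, assuming a reasonably recent SciPy for betabinom:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    N, M = 20, 100_000
    y = 12  # an arbitrary data value at which to evaluate p(y)

    theta = rng.beta(2, 2, size=M)                # draws from the prior
    p_y_mc = stats.binom.pmf(y, N, theta).mean()  # (1/M) * sum_m p(y | theta_m)
    p_y_exact = stats.betabinom.pmf(y, N, 2, 2)   # closed-form prior predictive

    print(p_y_mc, p_y_exact)  # the two values should agree closely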

Cook et al.’s application to testing

Cook et al. outline the above steps as a means to test MCMC software for Bayesian models. The idea is to generate a bunch of simulated pairs $latex (\theta^{(0)}, y^{(0)})$, then for each one make a sequence of draws $latex \theta^{(1)}, \ldots, \theta^{(L)}$ from the posterior $latex p(\theta \mid y^{(0)})$. If the software is functioning correctly, everything is calibrated and the quantile (rank) of $latex \theta^{(0)}_k$ among $latex \theta^{(1)}_k, \ldots, \theta^{(L)}_k$ is uniformly distributed. The reason it’s uniform is that $latex \theta^{(0)}$ is equally likely to be ranked anywhere, because all of the $latex \theta^{(\ell)}$, including $latex \theta^{(0)}$, are just random draws from the same posterior.
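
Here is what that test looks like for the toy beta-binomial model, with exact conjugate posterior draws standing in for the MCMC output one would actually be testing. This is a sketch of the idea in NumPy/SciPy, not Cook et al.’s procedure verbatim.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(99)
    N, L, num_sims = 20, 99, 2_000
    ranks = np.empty(num_sims, dtype=int)

    for i in range(num_sims):
        theta_0 = rng.beta(2, 2)                        # Step 1: prior draw
        y_0 = rng.binomial(N, theta_0)                  # Step 2: data draw
        draws = rng.beta(2 + y_0, 2 + N - y_0, size=L)  # stand-in for posterior sampler output
        ranks[i] = np.sum(draws < theta_0)              # rank of theta_0 among the draws

    # With a correct "sampler", ranks are uniform on {0, ..., L}.
    counts = np.bincount(ranks, minlength=L + 1)
    print(stats.chisquare(counts))  # a small p-value would signal a problem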

What about overfitting?

What overfitting? Overfitting occurs when the model fits the data too closely and fails to generalize well to new observations. In theory, if we have the right model, we get calibration, and there can’t be overfitting of this type—predictions for new observations are automatically calibrated, because predictions are just another parameter in the model. In other words, we don’t overfit by construction.

In practice, we almost never believe we have the true generative process for our data. We often make do with convenient approximations of some underlying data generating process. For example, we might choose a logistic regression because we’re dealing with binary trial data, not because we believe the log odds are truly a linear function of the predictors.

We even more rarely believe we have the “true” priors. It’s not even clear what “true” would mean when the priors are about our knowledge of the world. The posteriors suffer the same philosophical fate, being more about our knowledge than about the world. But the posteriors have the redeeming quality that we can test them predictively on new data.

In the end, the theory gives us only cold comfort.

But not all is lost!

To give ourselves some peace of mind that our inferences are not going astray, we try to calibrate against real data using cross-validation or actual hold-out testing. For an example, see my Stan case study on repeated binary trial data, which Ben and Jonah conveniently translated into RStanArm.

Basically, we treat Bayesian models like meteorologists; they are, after all, making probabilistic predictions. To assess the competence of a meteorologist, we ask: on how many of the $latex N$ days on which they said it was 20% likely to rain did it actually rain? If the predictions are independent and calibrated, we’d expect the number of rainy days to be distributed as $latex \mathrm{Binom}(N, 0.2)$. To assess the competence of a predictive model, we can do exactly the same thing.
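
As a sketch of that check with made-up numbers: suppose a 20% chance of rain was forecast on $latex N$ days and it rained on some observed count of them; an exact binomial test against $latex \mathrm{Binom}(N, 0.2)$ tells us whether that count is consistent with calibration. This uses SciPy’s binomtest, available in recent SciPy versions.

    from scipy import stats

    N = 250      # days with a 20% chance-of-rain forecast (hypothetical)
    rainy = 56   # days among those on which it actually rained (hypothetical)

    print("expected rainy days:", 0.2 * N)   # 50 under calibration
    result = stats.binomtest(rainy, N, 0.2)  # two-sided exact binomial test
    print("p-value:", result.pvalue)         # a small p-value would indicate miscalibration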