Markov chain Monte Carlo standard errors

Galin Jones sent me this paper (by James Flegal, Murali Haran, and himself), which he said started with a suggestion I once made to him long ago. That’s pretty cool! Here’s the abstract:

Current reporting of results based on Markov chain Monte Carlo computations could be improved. In particular, a measure of the accuracy of the resulting estimates is rarely reported in the literature. Thus the reader has little ability to objectively assess the quality of the reported estimates. This paper is an attempt to address this issue in that we discuss why Monte Carlo standard errors are important, how they can be easily calculated in Markov chain Monte Carlo and how they can be used to decide when to stop the simulation. We compare their use to a popular alternative in the context of two examples.

This is a clear paper with some interesting results. My main suggestion is to distinguish two goals: estimating a parameter in a model and estimating an expectation. To use Bayesian notation, if we have simulations theta_1,…,theta_L from a posterior distribution p(theta|y), the two goals are estimating theta or estimating E(theta|y). (Assume for simplicity here that theta is a scalar, or a scalar summary of a vector parameter.)

Inference for theta or inference for E(theta)

When the goal is to estimate theta, then all you really need is to estimate theta to more accuracy than its standard error (in Bayesian terms, its posterior standard deviation). For example, if a parameter is estimated at 3.5 +/- 1.2, that’s fine. There’s no point in knowing that the posterior mean is 3.538. To put it another way, as we draw more simulations, we can estimate that “3.538” more precisely–our standard error on E(theta|y) will approach zero–but that 1.2 ain’t going down much. The standard error on theta (that is, sd(theta|y)) is what it is.
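To make that concrete, here is a minimal numerical sketch (mine, not from the paper), using independent normal draws with mean 3.5 and sd 1.2 as a stand-in for posterior simulations; real MCMC draws are correlated, so the naive sd/sqrt(L) formula below would need an effective-sample-size correction. The point is just that the Monte Carlo standard error of the mean shrinks as L grows, while the estimated posterior sd settles near 1.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent draws standing in for posterior simulations of theta.
for L in [100, 10_000, 1_000_000]:
    theta = rng.normal(loc=3.5, scale=1.2, size=L)
    post_sd = theta.std(ddof=1)        # estimate of sd(theta|y): stays near 1.2
    mcse_mean = post_sd / np.sqrt(L)   # Monte Carlo se of the posterior mean: shrinks
    print(f"L={L:>9,}   posterior sd ~ {post_sd:.3f}   MCSE of mean ~ {mcse_mean:.4f}")
```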

This is a general issue in simulation (not just using Markov chains), and we discuss it on page 277 of Bayesian Data Analysis (second edition): if the goal is inference about theta, and you have 100 or more independent simulation draws, then the Monte Carlo error adds almost nothing to the uncertainty coming from the actual posterior variance.
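To spell out the arithmetic behind that statement (this is the sqrt(1+1/L) factor that comes up again in the summary below), the total uncertainty about theta based on L roughly independent draws is approximately

```latex
\text{total sd} \;\approx\; \sqrt{1 + \tfrac{1}{L}}\;\mathrm{sd}(\theta \mid y),
\qquad \sqrt{1 + \tfrac{1}{100}} \approx 1.005,
```

so 100 independent draws widen the reported uncertainty by only about half a percent.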

On the other hand, if your goal is to estimate E(theta|y) to some precision, then you might want lots and lots of simulation draws. This would be the example where you actually want to know that it’s 3.538, rather than simply 3.5. In my applications, I want inference about theta and have no particular desire to pinpoint the mean (or other summary) of the distribution; however, in other settings such as simulating models in physics, these expectations can be of interest in their own right.

Flegal, Haran, and Jones focus on inference for E(theta|y), which is fine. I don’t, however, agree with their recommendation (on page 2 of their paper) that statistical analyses using MCMC routinely include Monte Carlo standard errors. If my inference for theta is 3.5 +/- 1.2, that’s my summary. (More simulations might make such inference more precise, and our potential scale reduction factor, R-hat, is an estimate of how much narrower the posterior interval might become from further simulation. In my experience, a potential reduction of 10% in interval width is not such a big deal, so I generally recommend stopping once R-hat is less than 1.1, at least for routine analyses.)
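For readers who haven’t seen it, here is a compact sketch of the potential scale reduction factor in its basic between/within-chain form. This is just the core idea, not the exact estimator in any particular paper or software (which would also split chains and discard warmup):

```python
import numpy as np

def rhat(chains):
    """Basic potential scale reduction factor (R-hat).

    chains: array of shape (m, n) -- m chains, n post-warmup draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled estimate of var(theta|y)
    return np.sqrt(var_plus / W)

# Four well-mixed chains give R-hat close to 1 (stop around R-hat < 1.1).
rng = np.random.default_rng(1)
print(rhat(rng.normal(size=(4, 1000))))
```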

For example, the Flegal et al. paper considers a hierarchical model in geostatistics with four parameters. The estimate of the posterior mean of each parameter is reported (for example, 23.23 with a Monte Carlo standard error of 0.04). But if I were fitting such a model to data, I’d actually be interested in posterior inference for the parameter–thus, I’d want the posterior standard deviation, not the Monte Carlo standard error. But they don’t report the posterior sd at all! This makes it clear to me that they are really working on a different sort of problem than I work on. In practice, the posterior standard deviation puts bounds on how accurately we need to estimate the mean, hence fewer simulations are needed for reasonable inference for theta than for super-precise inferences for E(theta). Keeping the total number of iterations low is important when using these methods routinely in applied work.
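A back-of-the-envelope version of that bound, for roughly independent draws (correlated draws would need an effective-sample-size correction): the Monte Carlo standard error of the posterior mean is about sd(theta|y)/sqrt(L), so requiring it to be within a fraction f of the posterior sd gives

```latex
\frac{\mathrm{sd}(\theta\mid y)}{\sqrt{L}} \;\le\; f \cdot \mathrm{sd}(\theta\mid y)
\quad\Longleftrightarrow\quad
L \;\ge\; \frac{1}{f^{2}} .
```

Estimating the mean to within a tenth of a posterior sd takes on the order of 100 effective draws; pinning it down to within a hundredth takes on the order of 10,000.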

Some cool things in our 1992 paper!

One other thing: Flegal et al. cite my 1992 paper with Rubin (thanks!). We did distinguish between posterior standard deviation and Monte Carlo error in that paper. As we wrote on pages 458-459 of that paper, “for any iterative simulation of finite length, valid inference for the target distribution must include a distributional estimate, which reflects uncertainty in the simulation, and also an estimate of how much this distributional estimate may improve as iterations continue.” Or, as Steve Brooks and I wrote, “‘convergence’ can be quantified in terms of the properties of the empirical interval, as compared to the true 95% interval from the target distribution, which would be attained in the limit as n approaches infinity.”

On pages 469-470, we discuss “estimating functionals of the target distribution” (e.g., estimating posterior means or quantiles) using the simulations. The simulations can indeed be used to get an estimate and Monte Carlo standard error. This only occupies a small part of the paper because it’s not something we often want in practice.
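For completeness, here is a rough sketch of the batch-means idea for attaching a Monte Carlo standard error to the mean of a single correlated chain. It is one simple version of the general approach, not a reproduction of Flegal et al.’s estimator (they discuss the choice of batch size and other refinements):

```python
import numpy as np

def batch_means_mcse(draws, n_batches=30):
    """Rough Monte Carlo standard error of the mean of a correlated chain,
    using non-overlapping batch means."""
    draws = np.asarray(draws, dtype=float)
    n = len(draws) // n_batches * n_batches        # trim to a multiple of the batch size
    batch_means = draws[:n].reshape(n_batches, -1).mean(axis=1)
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

# Example: an AR(1) chain as a stand-in for autocorrelated MCMC output.
rng = np.random.default_rng(2)
chain = np.empty(20_000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
print(chain.mean(), batch_means_mcse(chain))
```

A fixed-width stopping rule in the spirit of Flegal et al. keeps simulating until such a standard error falls below a target like 0.025; my point above is just that, for inference about theta itself, the posterior sd usually makes that extra precision unnecessary.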

Summary

Flegal et al. want inference for E(theta|y) and want to stop simulations when E(theta|y) can be estimated to some specified precision (e.g., 0.025).

We want inference for theta|y and want to stop simulations when the posterior inference for theta would not change much from further simulation (because of sqrt(1+1/L) behavior). Our R-hat statistic focuses on means and variances, but as Steve Brooks and I discussed on page 441 of our paper, it’s also possible to look at this nonparametrically using interval coverage.

For Bayesian inference, I don’t think it’s generally necessary or appropriate to report Monte Carlo standard errors of posterior means and quantiles, but it is helpful to know that the chains have mixed and to have a sense that further simulation will not change inferences appreciably. Diagnostics are far from perfect, and so it’s good that people are doing research on these things.