A wise statistician once told me that to succeed in statistics, one could either be really smart, or be Bayesian. He was joking (I think), but maybe an appropriate correlary to that sentiment is that to succeed in Bayesian statistics, one should either be really smart, or be a good programmer. There’s been an explosion in recent years in the number and type of algorithms Bayesian statisticians have available for fitting models (i.e., generating a sample from a posterior distribution), and it cycles: as computers get faster and more powerful, more complex model-fitting algorithms are developed, and we can then start thinking of more complicated models to fit, which may require even more advanced computational methods, which creates a demand for bigger better faster computers, and the process continues. As new computational methods are developed, there is rarely well-tested, publicly-available software for implementing the algorithms, and so statisticians spend a fair amount of time doing computer programming. Not that there’s anything wrong with that, but I know I at least have been guilty (once or twice, a long time ago) of being a little bit lax about making sure my programs actually work. It runs without crashing, it gives reasonable-looking results, it must be doing the right thing, right? Not necessarily, and this is why standard debugging methods from computer science and software engineering aren’t always helpful for testing statistical software. The point is that we don’t know exactly what the software is supposed output (if we knew what our parameter estimates were supposed to be, for example, we wouldn’t need to write the program in the first place), so if software has an error that doesn’t cause crashes or really crazy results, we might not notice.

So computing can be a problem. When fitting Bayesian models, however, it can also come to the rescue. (Like alcohol to Homer Simpson, computing power is both the cause of, and solution to, our problems. This particular problem, anyway.) The basic idea is that if we generate data from the model we want to fit and then analyze those data under the same model, we’ll know what the results should be (on average), and can therefore test that the software is written correctly. Consider a Bayesian model p(θ)p(y|θ), where p(y|θ) is the sampling distribution of the data and p(θ) is the prior distribution for the parameter vector θ. If you draw a “true” parameter value θ0 from p(θ), then draw data y from p(y|θ0), and then analyze the data (i.e., generate a sample from the posterior distribution, p(θ|y)) under this same model, θ0 and the posterior sample will both be drawn from the same distribution, p(θ|y), if the software works correctly. Testing that the software works then amounts to testing that θ0 looks like a draw from p(θ|y). There are various ways to do this. Our proposed method is based on the idea that if θ0 and the posterior sample are drawn from the same distribution, then the quantile of θ0 with respect to the posterior sample should follow a Uniform(0,1) distribution. Our method for testing software is as follows:

1. Generate θ0 from p(θ)

2. Generate y from p(y|θ0)

3. Generate a sample from p(θ|y) using the software to be tested.

4. Calculate the quantile of θ0 with respect to the posterior sample. (If θ is a vector, do this for each scalar component of θ.)

Steps 1-4 comprise one replication. Performing many replications gives a sample of quantiles that should be uniformly distributed if the software works. To test this, we recommend performing a z test (individually for each component of θ) on the following transformation of the quantiles: h(q) = (q-.5)2. If q is uniformly distributed, h(q) has mean 1/12 and variance 1/180. Click here for a draft of our paper on software validation [reference updated], which explains why we (we being me, Andrew Gelman, and Don Rubin) like this particular transformation of the quantiles, and also presents an omnibus test for all components of θ simultaneously. We also present examples and discuss design issues, the need for proper prior distributions, why you can’t really test software for implementing most frequentist methods this way, etc.

This may take a lot of computer time, but it doesn’t take much more programming time than that required to write the model-fitting program in the first place, and the payoff could be big.

Radford Neal commented:

Methods along these lines have also been investigated by John Geweke. See his paper on "Getting it right: joint distribution tests of posterior simulators". It's available at

http://www.biz.uiowa.edu/faculty/jgeweke/papers/p…

Interesting blog! Wish I had time for one…

Radford

Andrew commented:

Hi Radford. Thanks for the reference; it's good to know that other work is being done in this area. Our method is slightly different and (I think) more efficient than Geweke's. I think we get some benefit from using the same draws for the prior and the posterior, and from using the test summary (q-.5)^2, which in some of our examples has been more sensitive than simply using q.

Andrew commented:

further update… We looked at Geweke's paper further and realize that his method is different from ours. He is testing programs that simulate from the joint distribution p(y,theta), whereas we are testing programs that simulate from the posterior distribution p(theta|y). His method has the advantage of not requiring nested looping, and ours has the advantage of more realistically testing the program that's actually being used for posterior inference. So each method seems to have its potential advantages.