Validation of Software for Bayesian Models Using Posterior Quantiles

Every once in a while I get a question that I can answer directly from my published research. When that happens, it makes me so happy.

Here’s an example. Patrick Lam wrote,

Suppose one develops a Bayesian model to estimate a parameter theta. Now suppose one wants to evaluate the model via simulation by generating fake data where you know the value of theta and see how well you recover theta with your model, assuming that you use the posterior mean as the estimate. The traditional frequentist way of evaluating it might be to generate many datasets and see how well your estimator performs each time in terms of unbiasedness or mean squared error or something. But given that unbiasedness means nothing to a Bayesian and there is no repeated sampling interpretation in a Bayesian model, how would you suggest one would evaluate a Bayesian model?

My reply:

I actually have a paper on this! It is by Cook, Gelman, and Rubin. The idea is to draw theta from the prior distribution, simulate data given that theta, fit the model, and check that the posterior quantiles of the true theta are uniform. You can find the paper in the published papers section on my website.

P.S. Although unbiasedness doesn’t mean much to a Bayesian, calibration does.

We’re planning on implementing this in Stan at some point.
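
To make the idea concrete, here is a minimal sketch in Python of the posterior-quantile check (not the paper's code, and not whatever ends up in Stan; the function and settings below are made up for illustration). It uses a conjugate normal model so the exact posterior can stand in for the sampler being tested: if the inference code is correct, the posterior quantile of the true theta should be uniform across replications.

```python
import numpy as np
from scipy import stats

def posterior_quantile_check(n_reps=1000, n_obs=20, n_draws=500,
                             prior_mean=0.0, prior_sd=1.0, data_sd=2.0,
                             seed=1):
    """Cook-Gelman-Rubin style check for a normal mean with known data sd.

    Each replication: draw theta from the prior, simulate data, draw from
    the posterior (closed form here, standing in for the sampler under
    test), and record the posterior quantile of the true theta.
    """
    rng = np.random.default_rng(seed)
    quantiles = np.empty(n_reps)
    for i in range(n_reps):
        theta = rng.normal(prior_mean, prior_sd)              # theta ~ prior
        y = rng.normal(theta, data_sd, size=n_obs)            # y ~ p(y | theta)
        # Conjugate posterior for the normal mean
        post_prec = 1 / prior_sd**2 + n_obs / data_sd**2
        post_mean = (prior_mean / prior_sd**2 + y.sum() / data_sd**2) / post_prec
        draws = rng.normal(post_mean, np.sqrt(1 / post_prec), size=n_draws)
        quantiles[i] = np.mean(draws < theta)                 # quantile of the truth
    return quantiles

q = posterior_quantile_check()
print(stats.kstest(q, "uniform"))  # crude uniformity check
```

The paper itself works with transformed quantile statistics rather than a one-shot Kolmogorov-Smirnov test, but the logic is the same: an error in the prior, the likelihood, or the sampler shows up as non-uniform quantiles.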

15 thoughts on “Validation of Software for Bayesian Models Using Posterior Quantiles”

  1. I just took a brief look at the abstract of the paper you linked. I was surprised that you didn’t link to any of your published papers or book chapters on posterior predictive checks, but instead to a paper that seems to be about finding software errors. I mean, you always talk about using posterior predictive checks to check the fit of a model, and it seems to me this was a perfect fit for them!

    What am I missing?

    • Manoel:

      Predictive checks are for checking fit of model to data. The question addressed here is how to check that a Bayesian inference routine is doing what it is supposed to be doing, conditional on the model being true.

      • I know English is not my first language, but this is kind of ridiculous. I’m reading it again and I don’t see any mention in the question of checking a Bayesian inference routine, only of model evaluation. For instance, when he says “Now suppose one wants to evaluate the model”, I read “I want to evaluate the fit of the model”.

        I’m sure that I’m wrong (after all, I was the only one to interpret the question this way); I just can’t see what I am missing.

        • Manoel:

          “Inference engine” is just jargon. The point is that he wants to evaluate his model conditional on its assumptions being true. At least that’s what it looked like to me.

  2. The vast majority of Bayesian methodologists that I know are concerned about bias, as it is one component of mean squared error (the other component being variance).

    Generating “many datasets and see(ing) how well your estimator performs each time in terms of…mean squared error” is not exclusively in the domain of frequentist analysis. In fact, this is how one would compute the Bayes risk of an estimator: you would simulate parameter values from the prior, then data from the parameters (see the sketch at the end of this thread).

    If one develops a Bayesian methodology that uses a plug-in or “default” prior, then computing the Bayes risk, or some analogue, under a different prior is useful for evaluating how much worse your default-prior procedure is compared with the optimal Bayes procedure. For example, in many problems an empirical Bayes procedure has a Bayes risk that is close to the optimal Bayes risk.

    • I guess I should amend my question: it’s not whether Bayesians are concerned about bias/MSE, but rather how to evaluate a procedure while staying philosophically coherent with the Bayesian paradigm. One can simulate data over and over and compute the bias/MSE or some other quantity as you suggest, but that seems to be just applying a frequentist paradigm/toolkit to a Bayesian model, which I suppose is fine in practice but doesn’t seem all that “philosophically correct.”

      • I’m not really sure why you see this as philosophically incorrect. If anything, it is committing to using a point estimate that is philosophically incorrect. But once you have done so, defining a cost function (such as MSE) and measuring the cost of your procedure using simulated data seems (to me) perfectly coherent with the Bayesian approach.

        This can be done either by simulating under the prior (Andrew’s suggestion, which, as he points out, amounts to performing the evaluation conditional on the model being correct) or by simulating under a different prior (e.g., using a hand-chosen range of parameter values, perhaps based on point estimates obtained from real data sets). The latter approach can be thought of as incorporating a form of model checking: you condition on the correctness of the likelihood function but check the prior.

        • I guess my concern is that simulating data repeatedly is “frequentist”. Also it seems to be using an “objective” view of probability (# of successes/n) rather than the Bayesian “subjective” probability.

        • Patrick:

          I don’t think Bayesian probability is particularly subjective. See chapter 1 of BDA or various of my recent articles for several examples and much discussion of this point.

    • Peter:

      I pretty much agree with what you wrote—except that, if someone is interested in looking at departures from the default prior, I think they should even more so be looking at departures from the default likelihood!
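
For what it's worth, here is a minimal sketch of the simulation Peter describes above (the model, names, and settings are mine, purely for illustration): draw theta from the analysis prior to estimate the Bayes risk of the posterior mean, or from a different evaluation distribution as in the replies above, then simulate data and average the squared error. A conjugate normal model keeps the estimator in closed form.

```python
import numpy as np

def simulated_risk(n_reps=2000, n_obs=20, prior_mean=0.0, prior_sd=1.0,
                   data_sd=2.0, eval_sampler=None, seed=2):
    """Estimate the risk (MSE) of the posterior mean by simulation.

    theta is drawn from eval_sampler; the default is the analysis prior,
    which makes this an estimate of the Bayes risk.  The estimator is
    always the posterior mean under the analysis prior.
    """
    rng = np.random.default_rng(seed)
    if eval_sampler is None:
        eval_sampler = lambda: rng.normal(prior_mean, prior_sd)
    sq_err = np.empty(n_reps)
    for i in range(n_reps):
        theta = eval_sampler()
        y = rng.normal(theta, data_sd, size=n_obs)
        post_prec = 1 / prior_sd**2 + n_obs / data_sd**2
        post_mean = (prior_mean / prior_sd**2 + y.sum() / data_sd**2) / post_prec
        sq_err[i] = (post_mean - theta) ** 2
    return sq_err.mean()

print(simulated_risk())  # Bayes risk: theta drawn from the analysis prior
rng2 = np.random.default_rng(3)
print(simulated_risk(eval_sampler=lambda: rng2.normal(0.0, 5.0)))  # wider evaluation prior
```

The second call checks how the default-prior procedure holds up when theta is generated from a wider distribution than the prior assumes, which is the kind of prior check mentioned in the replies.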

  3. Regarding Peter’s concern (something other than just checking whether the assumed model has been properly implemented), this (“a diagnostic in the form of a calibration-sensitivity simulation analysis”) might be of interest: http://arxiv.org/pdf/1010.0306.pdf

    (With what they refer to as the omnipotent prior [and data model], I believe you are doing the same thing)

  4. Daniel Lee and Michael Betancourt supplied a great example from our current Stan testing for version 2.0. You can see the plots Michael made on this thread on our dev-list (no login required) in his message of 23 May 6:20 AM:

    https://groups.google.com/forum/?fromgroups#!topic/stan-dev/YLoqWCzyOWM

    It compares known parameter posterior means to our sampled values for versions 1.3 and 2.0, adjusting for Monte Carlo standard error, and also compares effective sample sizes.

    We wanted to do heavy testing because Michael completely rewrote all the code surrounding Hamiltonians so that it lines up much more neatly with the definitions and modularizes components like the Hamiltonian computation, the integrators, adaptation phases, all of the different mass matrix types, etc.
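
This is not the actual Stan test harness, just a rough sketch of the kind of comparison described above: a z-score for the difference between a sampler's estimated posterior mean and a known value, scaled by the Monte Carlo standard error. The helper name is made up, and by default it optimistically treats the draws as independent unless an effective sample size is supplied.

```python
import numpy as np

def mean_check_z(draws, true_mean, ess=None):
    """z-score for (estimated posterior mean - known mean) / MCSE.

    MCSE is sd / sqrt(ess); if no effective sample size is given, the
    draws are treated as independent.  |z| much larger than about 3
    across many parameters suggests a problem with the sampler.
    """
    draws = np.asarray(draws, dtype=float)
    if ess is None:
        ess = draws.size
    mcse = draws.std(ddof=1) / np.sqrt(ess)
    return (draws.mean() - true_mean) / mcse

# Toy usage: draws whose true posterior mean is known to be 0
rng = np.random.default_rng(0)
print(mean_check_z(rng.normal(0.0, 1.0, size=4000), true_mean=0.0))
```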

  5. Andrew,

    Do you like your adjectives?

    Can I say, to all people who studied some statistics at uni in the dim past, your blog is an absolute blessing.

    I do hope you are getting more visitors down under. Quality and quantity together are unusual!
