A centipede many times over

I recently came across this presentation by Charlie Geyer on diagnosing problems with Markov chain simulations. Charlie gives some good advice, and some advice that’s not so relevant to my own experience, and misses some other ideas that I and others have come across.

To a repeat a frequent theme of mine (see, for example, meta-principle #3 in this discussion), I suspect that many of the differences between Charlie’s approaches and those more familiar with me stem from the different sorts of problems we work on. Charlie works on big, tangled, discrete models in genetics–the sorts of problems with multiple modes and where, not only can computations be slow to converge, then can easily get stuck in completely unreasonable places. To solve these problems, you have to write complicated computer programs and solve problems specific to your model and your application. In contrast, I work on smaller, continuous problems in social science where, if you find yourself really far from where you want to be in parameter space, you’ll probably realize it. I develop relatively simple but general algorithms that are intended to work for a wide range of examples and applications.

Neither my approach nor Charlie’s is “better”; we just do different things.

Here are three examples.

First, on page 3 of his slides, Charlie writes:

Want to do expectation mu = E{g(X)}.

This reminds me of the gag about the difference between the theoretical and the applied statistician. More seriously, though, this statement–undoubtedly true for Charlie’s work–almost never applies in my own applied research or in most of the MCMC applications I’ve seen!

How could this be? Isn’t statistics all about expectations? No. In most of the Bayesian applications I’ve seen, the goal of simulations is not to estimate expectations but to summarize uncertainty. We give several examples in chapter 7 of ARM (as well as, of course, throughout BDA).

Why does this make a difference? Expectations or simulations, who cares? The reason why this matters is that affects the focus of what comes next. I’m concerned that my simulations cover the posterior distribution in the senses relevant to my application. Charlie focuses on point estimation which leads him to statements such as “First rule of MCMC: compute standard errors. If you don’t care how accurate your MCMC estimates are, then why should we take you seriously?” This makes sense for him but not so much for me. If I have the equivalent of 100 or 1000 independent draws from the posterior distribution, that’s generally all I need. And if my 50% posterior interval is, say, [1.1, 4.3], then I hardly care if that lower bound is really 1.083 or 1.147, or whether the posterior mean is 2.91384 or 2.86574, and I can’t really do anything useful with Monte Carlo standard errors, which in this setting will be much smaller than the uncertainty inherent in the posterior distribution. For some problems you care about Monte Carlo standard errors and for some problems you don’t.

My second example comes from page 13, where Charlie writes:

The literature is full of MCMC “diagnostics.” Except for perfect sampling, none of them diagnose nonconvergence that cannot easily be seen in simple plots.

I don’t disagree with this, but . . . What “simple plots” do you do? There are two parts of any MCMC diagnostic. More generally, there are two parts of any statistical method: the design and the analysis.

For MCMC diagnostics, the design part is to simulate multiple sequences, and the analysis part is to examine mixing between the sequences and stability within each sequence. I’m a big fan of graphical summaries (see what happens if you apply the plot() function to an object obtained by running bugs() in R), but we can do a lot better than trace plots. I strongly recommend summary plots to see where there’s lack of mixing, then trace plots to understand the problems in more detail.

In Charlie’s world, things might be different, though. I imagine he knows so much about the details of the models he’s fitting that he has a strong sense ahead of time of where there might be convergence problems. So he can focus right away on the trace plots he knows to look for. That’s fine.

My general point is that even if something “can be easily seen in simple plots,” we can work often more effectively with sophisticated plots. We sometimes fit models with thousands of parameters, and it’s hard to look carefully at thousands of simple plots. Our models are structured, so let’s take advantage of that structure in learning about our computations.

My third example also comes from Charlie’s page 13, where he writes:

In the black box situation, the best diagnostic is to run the chain for a very long time–like from the time the paper is submitted until referees reports arrive.

This captures a certain world within computational statistics, a world in which you know what model you want to fit, it’s a big model, and you run it overnight or for a week or even, as Charlie suggests, for a period of months. I can well believe this is how things work in some complex analyses in genetics, but it doesn’t really capture my world, or many others’ examples that I’ve seen. In my work, I’m never quite sure what model to fit, I’m always building up models and stripping them down, throwing in variables and checking model fit, going back and forth between subject-matter questions, data, model, and computation. My model’s always changing, and lack of MCMC convergence is just one of many related problems having to do with building confidence in model, its computation, and its fit to data.

In the second half of his presentation, Geyer considers various ideas for testing and debugging MCMC code. He gives the good advice to take apart your code and check that each part does what it says. The next step, I think, is the self-cleaning oven: set up your program so it can specify true parameter values and simulate fake data, then check that your statistical program recovers the truth (up to stated uncertainty) in the inferential step. (Geyer doesn’t mention our method for Bayesian software validation using posterior quantiles, but in this case I suspect he’s just unaware rather than disapproving of our efforts; I think this particular idea would fit in fine with his general framework.)

Finally, Geyer makes a good point that, in general, you can almost never be sure that your simulations are converging. One central idea in our work is to compare inferences to what we’ve learned from fitting simpler models. This won’t always work–when we add complexity to a model, indeed it can do new things–but can give us a sense of what we’re getting out of each new step.

I’d also like to emphasize the importance of simulating multiple sequences from different starting points and monitoring the mixing of the chains. Geyer does not mention this approach–perhaps because it’s so standard that he did not feel the need to bring it up in his talk–but it can be extremely effective. Mixing does not imply convergence, but lack of mixing surely reveals a problem.

And if I had a leg for every time that I’ve used multiple sequences to find a problem in my iterative simulation (or that someone has told me about such an experience themselves), I’d be a centipede many times over.

P.S. I hope it’s clear from the above discussion that, despite our different perspectives, I think Charlie’s experiences are valuable and his advice is worth thinking about. If I thought Charlie’s ideas were worthless, I’d ignore rather than publicize them.

2 thoughts on “A centipede many times over

  1. Does Charlie mean something more abstract by E{g(X)} such as when you want a credible interval for a parameter of focus and you take the expectation over the other parameters?

    For instance as I did for this plot
    http://www.stat.columbia.edu/~cook/movabletype/ml

    using numerical integration as described here http://www.stat.columbia.edu/~cook/movabletype/ar

    (which has an error which I'll correct soon)

    Also your http://www.stat.columbia.edu/~gelman/research/pub
    is the only place I have seen a draw from the joint distribution of (theta,X) used to get (theta|X)…

    K?

Comments are closed.