Monitor the efficiency of your Markov chain sampler using expected squared jumped distance!

Marc Tanguay writes in with a specific question that has a very general answer. First, the question:

I [Tanguay] am currently running an MCMC for which I have 3 parameters that are restricted to a specific space. Two are bounded between 0 and 1, while the third is binary and updated by a Beta-Binomial. Since my priors are also bounded, I notice that, conditional on all the rest (which covers both data and other parameters), the density does not vary much within the parameter space. As a result, the acceptance rate is high, about 85%, despite the fact that the whole parameter space is explored. Since the optimal acceptance rates prescribed in your book are lower than 50% (in the case of multiple parameters), do you think I should worry about getting 85%? Or is this normal given the restrictions on the parameters?

First off: Yes, my guess is that you should be taking bigger jumps. 85% seems like too high an acceptance rate for Metropolis jumping.

More generally, though, my recommendation is to monitor expected squared jumped distance (ESJD), which is a cleaner measure than the acceptance probability. Generally, the higher the ESJD, the happier you should be. See this paper with Cristian Pasarica.

The short story is that if you maximize ESJD, you’re minimizing the first-order autocorrelation. And, within any parameterized family of jumping rules, if you minimize the first-order autocorrelation, I think you’ll pretty much be minimizing all of your autocorrelations and maximizing your efficiency.
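To spell out the step behind that claim (this identity isn't in the post, and the notation is mine, but it is the standard one): for a stationary scalar chain $\theta^t$ with stationary variance $\sigma^2$ and lag-1 autocorrelation $\rho_1$,

$$\mathrm{ESJD} = E\big[(\theta^{t+1} - \theta^t)^2\big] = 2\sigma^2 (1 - \rho_1),$$

so, with the stationary variance fixed by the target distribution, maximizing ESJD is the same as minimizing $\rho_1$.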

As Cristian and I discuss in the paper, you can use a simple Rao-Blackwellization to compute expected squared jumped distance, rather than simply average squared jumped distance. We develop some tricks based on differentiation and importance sampling to adaptively optimize the algorithm to maximize ESJD, but you can always try the crude approach of trying a few different jumping scales, calculating ESJD for each, and then picking the best to go forward.
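To illustrate the crude approach, here's a minimal Python sketch (my own toy example, not code from the paper; the Gaussian target, the candidate scales, and the run length are placeholders): run random-walk Metropolis at a few proposal scales, accumulate a Rao-Blackwellized ESJD by weighting each proposed squared jump by its acceptance probability, and keep the scale with the largest value.

```python
import numpy as np

def log_target(theta):
    # Placeholder target: standard normal in 2 dimensions; swap in your own log posterior.
    return -0.5 * np.sum(theta ** 2)

def metropolis_esjd(scale, n_iter=5000, dim=2, seed=0):
    """Random-walk Metropolis; returns a Rao-Blackwellized estimate of ESJD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    logp = log_target(theta)
    esjd = 0.0
    for _ in range(n_iter):
        proposal = theta + scale * rng.standard_normal(dim)
        logp_prop = log_target(proposal)
        accept_prob = min(1.0, np.exp(logp_prop - logp))
        # Rao-Blackwellization: weight the proposed squared jump by its acceptance
        # probability, rather than counting it only when the proposal is accepted.
        esjd += accept_prob * np.sum((proposal - theta) ** 2)
        if rng.uniform() < accept_prob:
            theta, logp = proposal, logp_prop
    return esjd / n_iter

# Crude tuning: try a few jumping scales and keep the one with the largest ESJD.
scales = [0.1, 0.5, 1.0, 2.4, 5.0]
results = {s: metropolis_esjd(s) for s in scales}
print(results, "-> best scale:", max(results, key=results.get))
```

This is only the "try a few scales and pick the best" version; the differentiation and importance-sampling tricks in the paper adapt the scale within a single run.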

6 thoughts on “Monitor the efficiency of your Markov chain sampler using expected squared jumped distance!”

  1. Matt was just showing me cases on Wednesday where HMC oscillates across the parameter space, producing negatively correlated samples and a near 100% acceptance rate. Don't we ideally want the autocorrelation to be zero, not negative? With negative autocorrelation, the variance can be seriously misestimated.

    What Matt wound up doing was simulating the Hamiltonian dynamics until the squared distance (not expected squared distance) begins to decrease, but then backing off to a set of parameters that'd only get you halfway through the dynamics (in leapfrog steps, not distance, since the distance covered per step varies with the momentum). We could instead try to choose parameters that minimize the absolute value or square of the autocorrelation.
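    A toy sketch of that back-off heuristic (my own illustration, not Matt's actual code; the Gaussian target, step size, and function names are placeholders, and the Metropolis accept/reject step of a full HMC update is omitted):

    ```python
    import numpy as np

    def grad_log_target(theta):
        # Placeholder target: standard normal, so grad log p(theta) = -theta.
        return -theta

    def leapfrog(theta, momentum, step_size):
        """One leapfrog step of the Hamiltonian dynamics."""
        momentum = momentum + 0.5 * step_size * grad_log_target(theta)
        theta = theta + step_size * momentum
        momentum = momentum + 0.5 * step_size * grad_log_target(theta)
        return theta, momentum

    def half_back_proposal(theta0, step_size=0.1, max_steps=1000, seed=0):
        """Step until the squared distance from the start stops increasing,
        then back off to the state halfway along the trajectory (in leapfrog steps)."""
        rng = np.random.default_rng(seed)
        momentum = rng.standard_normal(theta0.shape)
        theta = theta0.copy()
        trajectory = [theta0.copy()]
        last_sq_dist = 0.0
        for _ in range(max_steps):
            theta, momentum = leapfrog(theta, momentum, step_size)
            sq_dist = np.sum((theta - theta0) ** 2)
            if sq_dist < last_sq_dist:  # squared distance from the start began to decrease
                break
            trajectory.append(theta.copy())
            last_sq_dist = sq_dist
        return trajectory[len(trajectory) // 2]  # back off to the halfway state

    print(half_back_proposal(np.zeros(2)))
    ```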

  2. How important are the covariances? At 2K parameters we're already at the limit of what we can do effectively with covariance calcs.

    I'll code up the 8-schools example and we can see how HMC compares with the adaptive Metropolis approach in your and Pasarica's paper.

  3. Bob:

    I expect HMC will be much more effective than the adaptive Metropolis that Pasarica and I did. I'm not recommending our algorithm; the only thing I'm recommending from that paper is the idea of monitoring expected squared jumped distance, for any MCMC algorithm.

  4. I don't see how maximizing expected squared jumped distance could possibly minimize autocorrelation time: if you make a large scale change or coordinate transformation in one of your dimensions, you can change the ESJD dramatically but not the autocorrelation time, and you can weight different dimensions differently (thereby making some parameters much more sensitive to the squared jumped distance and others much less so). Another way of putting it: the autocorrelation time is affine-invariant, but the ESJD is not.

  5. David:

    I see what you're saying, and I agree that rescaling will change ESJD. But it's not clear to me that rescaling will change the ranking of ESJD's, comparing one jumping scale to another. If you use maximum ESJD to optimize a parameterized jumping rule, I'd still think that would minimize autocorrelation time. To put it another way: at least in simple examples, if you minimize autocorrelation time in one dimension, you'll probably be minimizing it in another dimension.

    To keep things perfectly safe, you could monitor ESJD on one-dimensional margins. This would be equivalent to minimizing the first-order autocorrelation on each such margin.

    It seems like there's room for some more research here, beyond the sketchy ideas in my paper with Pasarica.
