Michael Betancourt writes:

Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is con- fined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important.

In this review I [Betancourt] provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather of any ex- haustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

This is great stuff. He has 38 figures! Read the whole thing.

I wish Mike’s paper had existed 25 years ago, as it contains more sophisticated and useful versions of various intuitions that my colleagues and I had to work so hard to develop when working on .234.

Sorry, what’s .234?

David:

.234 is my true theorem!

https://projecteuclid.org/euclid.aoap/1034625254

It’s a really good paper. I’m enjoying it. The illustrations are terrific (noticed a few color labeling issues). Looking forward to the Riemann part because you have to control/reduce/contain complex dimensions in any sectioning. Problem I always have is that it’s very difficult to recognize when you’ve shifted dimensionally especially when you’re relying on Markov concepts which have a tendency to over-simplify chaining or looping relative to any point or measured event outcome or path of an outcome treated as a distribution or as a scalar point value. Or in English, it’s kind of like being a member of the Fukawi tribe – or the Heckawi Tribe of F Troop for the more sensitive as in “where the heck are we” – when that meets the old camp song “we’re here because we’re here because we’re here because we’re here”, rendered as some spot in some Markov chain somewhere in a slice of a distribution you approximate because you need to assume at some level that it looks kind of like this.

Your “in English” didn’t help me. Can you do that again but in pseudo-math?

Actually the post from the following day to this sort of makes my point: it’s easy to construct a line that actually shifts dimensionally and you can’t see that shift or perhaps you imagine the new dimension – new attributes, new confounding, next level relations, etc. – can be examined or traversed with the same reliability. So you have 2 issues from the issue that nearly all data sets, result mappings, etc. you have different layers of probability density and distribution: that your path may take you places which you can connect but which are actually not very connected if you look at the different layers and that you are in a place, that you’ve described a place using certain terms or specifications and now you’re actually in a different place, one which isn’t very well described by your terms or specifications. This happens in evolution: an organism develops along pathways given its context – random mutation, genetic drift, etc. applied over time to and from context and thus naturally selected to continue with this or that adaptation and then context changes and what used to work means you fail. We tend to forget that most adaptations fail so most of these pathways lead to relatively hostile contexts. That’s a huge difficulty in traversing a space: you may connect paths together and not recognize how tenuous they are and you thus may actually find yourself in a space, at a point, where your connection to the surrounding context is tenuous. (I’ve met lots of people I’d characterize that way.)

I meant the post from 1/13 and its reference to incremental validity and latent factors.

Well given that many of the intuitions in the paper were learned from you and Gareth, it would have been hard to write it any earlier… :-p

Bravo Michael, the paper is extremely readable, and actually is helping me better understand some issues in Classical Physics as well ;-)

This reads like a manual of how to understand Hamiltonian Monte Carlo written by someone who actually cares about the reader’s success.

Disappointing lack of swearing though.

Question for you regarding the efficiency of HMC. In the paper you say more or less that the transition you’re using is to start at an initial point, select a momentum at random from the momentum distribution, and then integrate forward and backward from that point until an integration time criterion is met. You then select at random from among all the points you visited with weights proportional to the density at each location. (if I’m understanding correctly in the vicinity of pg 40)

This then becomes a metropolis proposal scheme, and you accept or reject the single new point.

The advantage to this is that the sampled point becomes part of a uniformly weighted sample. But the disadvantage is that if you did a fair amount of integration, you’re throwing away information about some large number of evaluations of your probability density function. Might it make sense to simply add ALL the points you saw along the integration paths into a sample, and then let the user of the sample do their expectations via weighted sums? Or, perhaps to create a scheme that uses more of the points you saw, I’m thinking maybe of something like for EACH point along the integration, propose it as a metropolis transition and accept or reject it. Then if you have say 15 integration points you’ll wind up with 15 new samples which are a mixture of copies of the initial point (when you rejected the transition) and newly accepted transition points. All of these points will be in the typical set, and it seems a shame to throw away points in the typical set.

obviously you’d have to think carefully about what that all means mathematically, but some scheme where you use more of or all of the integration points seems to be potentially quite valuable.

Firstly, when you sample from the trajectory according to the microcanonical weights you’re no longer doing a Metropolis scheme. Really it’s a Rao-Blackwellized Metropolis scheme because you’re considering not a single proposal but all of them at once. Conveniently, the math works out the same way if you want to include all of the points — you just weight them by the microcanonical weights when computing an empirical expectation. This can’t make the estimates any worse, but it turns out that it doesn’t actually offer much improvement because the points in the trajectory are so correlated with each other. In fact, the overhead for having to store and then use all of those points ends up overwhelming the small gains that you make.

I’m glad you thought about it at least. Thanks for the explanation.

To see if I understand your comment about Rao-Blackwellized Metropolis, are you saying that the total energy plays the role of a sufficient statistic, which then determines the distribution of points by picking out a level set. The points along the trajectory then play the role of data distributed according to the distribution determined by the sufficient statistic (the energy that determines the level set). Choosing a point at random along the trajectory according to the density at the points then allows you to construct any estimator for the average of a function along that trajectory by simply evaluating the function at that sample point?

Then, by sampling one point from each level set according to that above idea, you construct an estimator for the expectation over the entire distribution?

There was a notion of a concern for numerical divergences causing problems, and the use of the Metropolis “correction” to ensure that you’re sampling from the proper distribution. Is that Metropolis correction actually occurring in this context? Perhaps because the initial point is itself in the trajectory the probability of “rejecting” a point (that is, returning to the original starting point) winds up being correct?

There are subtleties here that it seems interesting to understand, and/or you might want to expand upon in the document if you think they could be helpful.

Yes, what you said in the third paragraph, only it’s still not quite a standard Metropolis. But it has the same kind of detailed balance proof as Metropolis, only complicated by the way the NUTS criterion works forward and backward in time and by the multinomial sampling over the trajectory (which is quite importantly weighted to the second half of the trajectories produced, thus producing a greater “jump distance”).

Absolutely a must read! Where was this paper five years ago when we were posting things like this 2011 post as we were all struggling to understand the properties of HMC.

It’s fantastic. This would have saved me a lot of confusion when I first encountered many of these ideas. Essential reading for Stan users and hopefully many others.

I do agree all above. A great work makes me feel I understand the problem better simply by reading it and this is certainly one of that kind.

Wow! What great exposition. Thanks for writing this and for posting the link.

Bob.

The world needs lots more work of this quality at this level. Thanks!