In a beautiful new paper, Betancourt writes:

The geometric foundations of Hamiltonian Monte Carlo implicitly identify the optimal choice of [tuning] parameters, especially the integration time. I then consider the practical consequences of these principles in both existing algorithms and a new implementation called Exhaustive Hamiltonian Monte Carlo [XHMC] before demonstrating the utility of these ideas in some illustrative examples.

The punch line is that, as measured by effective sample size, XHMC is about twice as fast as regular NUTS, with the gains coming from better use of the intermediate steps in the Hamiltonian paths. XHMC is already implemented in Stan, and it’s one reason that version 2.12 is faster than earlier versions of Stan.

Mike’s graphs are so pretty that I found it hard to pick just a few for this post.

As I said, this new, more efficient NUTS algorithm is already coded in Stan and hence it’s already being used in tons of applications, including for example Kremp’s open-source and fully reproducible poll aggregator. Bob and Michael are also planning to write it up in pseudocode and put it into the Stan manual.

You missed my favourite figure! The paper is worth reading for Figure 4 and its caption alone:

“Every momentum resampling induces a change in the Hamiltonian, which allows a Hamiltonian Markov chain to randomly walk amongst energy level sets. (a) When the expected variation, ∆H, is similar to the width of the marginal energy distribution this random walk will rapidly explore this distribution, but (b) when the expected variation is small the exploration will suffer from large autocorrelations. Optimizing the exploration of the marginal energy distribution provides an implicit criteria for selecting an optimal cotangent disintegration, and the energy autocorrelations define a constructive diagnostic for a poorly chosen cotangent disintegration.”

There’s a whole paper on that! https://arxiv.org/abs/1604.00695
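For readers who want to poke at the diagnostic in that caption: the statistic from that paper, the energy Bayesian fraction of missing information (E-BFMI), compares the step-to-step energy changes induced by momentum resampling to the overall spread of the marginal energy distribution. Here is a rough Python sketch on simulated energy series (the series and the illustrative threshold are made up, not taken from the paper):

```python
import numpy as np

def e_bfmi(energies):
    """Energy Bayesian fraction of missing information: ratio of the
    variation in energy between successive iterations (numerator) to
    the width of the marginal energy distribution (denominator).
    Values well below 1 flag the slow random walk of panel (b)."""
    energies = np.asarray(energies, dtype=float)
    num = np.sum(np.diff(energies) ** 2)
    den = np.sum((energies - energies.mean()) ** 2)
    return num / den

rng = np.random.default_rng(0)

# Healthy chain: the energy jumps freely between iterations.
good = rng.normal(size=1000)

# Pathological chain: the energy changes only a little per iteration,
# so the marginal energy distribution is explored by a slow random walk.
bad = np.cumsum(rng.normal(scale=0.1, size=1000))
```

In practice you would feed in the Hamiltonian value saved at each iteration of a real chain; for the white-noise series above the statistic comes out near 2, while the random-walk series gives something far smaller.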

A quick correction. XHMC is an algorithm similar to NUTS, but it doesn’t use the No-U-Turn criterion to determine when to stop integrating. XHMC instead uses a more theoretically-motivated criterion that can perform better, but it also introduces a new tuning parameter that has to be carefully set. The code is in the Stan library but hasn’t yet been exposed to users in the interfaces.

The big change was in how we sample from the trajectories that we generate. Matt had originally used a slice sampler but it turns out that sampling in a slightly different way leads to much higher performance, even with all of the other aspects of NUTS unchanged. In any case, all of these changes were made a few versions ago so if you’re a loyal Stan user then you’ve been taking advantage of the changes for a few months now.

This is fantastic! I’ll take a look at the paper and try to grok as much of it as I can. I have a model that takes ~ 1 hour to run until convergence, but it’s far and away better than any other model for this particular purpose (working on a paper to publish the method).

In the paper, I argue that the only downside is the time required to compute (1 hr vs. other ML methods, which take ~3 seconds); you can either obtain the wrong answer in 3 seconds or the right one in an hour :p. I then argue that the algorithm is correct, and that the model’s only downside will diminish as computers gain speed or the sampler improves. This sort of improvement proves my point.

Thanks as always for the fantastic work on Stan.

Stephen:

If you’d like to do maximum likelihood or maximum penalized likelihood you can do this in Stan just using the optimizing setting and it should run fast. We’re also working on adapting Stan to do marginal optimization (that is, marginal maximum likelihood or penalized likelihood).

I know you can. My particular model requires a Bayesian estimator, though. The ML estimate will spit out nonsense; the Bayesian output is fine, however. I’m fairly certain I know why this is, but this isn’t the place to type it all out.

Andrew, it kind of makes sense in many problems to do something like optimization instead of a long “warmup,” for efficiency’s sake. Warmup of course moves your state vector into the high-probability region of space, but it also does a lot of other stuff, like estimating mass matrices, step sizes, etc. The early samples from a warmup are probably not that informative unless you start from within the high-probability region. What’s the best way to tell Stan to first optimize your state vector into the high-probability region of space, then run (with a shorter warmup)?

Is this a technique that you use as standard practice?

I’ve got a model I’m working on right now that has something like 60,000 measurements and 24,000 parameters, and I suspect this technique would help me a lot but I haven’t delved into it much.

Like, for example, in rstan I don’t see that the “stan” function has a way to “init” the parameters from a previous stanfit… I guess maybe you could manually construct an initialization vector from extracting the stan samples etc… but a method init=”continue” which grabs the state from a stanfit object and continues from the same place would be great.

You can init the parameters and stepsize in RStan (the arguments are init=list(…), and control=list(stepsize=…)). I think it’s still a little awkward to extract parameters from one fit and then put them in the right list format to be input as an init, but there may be a neat way to do that in R.

What you can’t do yet is initialize the mass matrix. This is coming soon, because users keep asking us for it and we need it for some other algorithm development research.

In fact, it’d be great if the Stan team stuck an extract.init.value function into rstan that takes a stanfit object and an index i and returns a list suitable for initializing Stan, by extracting the ith sample from the stanfit.

The biggest annoyance is dealing with vector parameters, I think.

vector[N] foo;

may require parsing the parameter names in text form? I’m not sure.

Should I file a bug report / feature request against rstan?

Looks like you created the feature request. If that’s foo, then extract(fit)$foo will give you an M x N structure (I can never tell if something’s an array or matrix or some other thing in R) indexed by saved iteration and index of foo. So if you do extract(fit)$foo[12, ] you’ll get the vector value of foo at iteration 12 in the appropriate form to put back into RStan.

The right way to architect this for generality would be something like an extract_iteration function that gives you a list with keys equal to parameter names and values of the form required for initialization. The reason we don’t return values like that in the first place is that it’d be hugely wasteful to store all those structures. The underlying data is just a matrix indexed by iteration number and overall parameter number.
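A sketch of what such an extract_iteration could look like, written in Python for concreteness (the function name, the shapes argument, and the flat column-by-column storage layout are assumptions for illustration, not rstan’s actual internals):

```python
import numpy as np

def extract_iteration(draws, shapes, i):
    """Hypothetical helper: slice iteration i out of a flat draws matrix.

    draws  : (num_iterations, num_flat_params) array, with parameters
             stored column-by-column in declaration order.
    shapes : ordered dict mapping parameter name -> shape tuple
             (() for a scalar).
    i      : iteration index to extract.

    Returns a dict of name -> value, the form an init list would need.
    """
    out = {}
    offset = 0
    for name, shape in shapes.items():
        size = int(np.prod(shape)) if shape else 1
        chunk = draws[i, offset:offset + size]
        out[name] = chunk.reshape(shape) if shape else float(chunk[0])
        offset += size
    return out

# A scalar mu followed by a length-3 vector foo, 5 saved iterations.
shapes = {"mu": (), "foo": (3,)}
draws = np.arange(20.0).reshape(5, 4)  # iteration x flat parameter index
init = extract_iteration(draws, shapes, 2)  # {"mu": 8.0, "foo": [9, 10, 11]}
```

The key-value output is exactly the “list with keys equal to parameter names” shape described above, built on demand from the compact matrix rather than stored per iteration.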

But like I said earlier, this kind of thing may already exist in RStan. We’ll know if they close the issue with a note on how to do it :-)

It’s a tricky business and very sensitive to the mode. In many cases, like hierarchical models, the modes don’t make sense and in other cases they’re well defined but very far from the typical set (volume of posterior with high probability mass).

What we’d like to be able to do is get to the typical set as fast as possible. Once there, we’d then estimate the mass matrix and step size as we do now.

As is, we start with a unit mass matrix, move a little bit, estimate a new mass matrix and step size, then move some more. Each time, we only use the most recent half of the warmup draws to estimate the mass matrix. This is explained in more detail in the algorithms part of the manual.
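As a toy illustration of that windowing idea, here is a Python sketch that estimates a diagonal inverse mass matrix from only the most recent half of some simulated warmup draws (the fake data and the simple 50% window are made up for illustration; Stan’s actual adaptation schedule is more elaborate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake warmup draws: 200 iterations of a 3-parameter chain whose
# marginal scales differ wildly (sd = 1, 5, and 0.1).
warmup = rng.normal(size=(200, 3)) * np.array([1.0, 5.0, 0.1])

# Use only the most recent half of warmup, as described above, so the
# early pre-convergence draws don't contaminate the estimate.
recent = warmup[len(warmup) // 2:]

# A diagonal inverse mass matrix is just the per-parameter variance
# estimate; it rescales the dynamics so all directions look similar.
inv_mass = np.var(recent, axis=0, ddof=1)
```

The three entries come out roughly proportional to 1, 25, and 0.01, i.e. the squared marginal scales, which is what lets the sampler take comparable steps in every direction.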

Yes, in a situation like mine, I think it would be good to optimize first and then Hamiltonian-explore the region near the optimum. In other situations, the optimum could be far from the typical set.

Right now, I’m running like 30-50 warmup steps because it takes an hour and I’m still not done developing the model… but that’s hardly enough; still, it might be enough if I started near the optimum. I’m fine with running a 24-hour-long run once I have made all my modeling decisions and am happy with the model, but for model development, no way. Just seeing whether it will get to a reasonable mode and stay in that vicinity would be helpful.

In particular, I kind of expect this would be a good situation for models like mine, where there are a lot of parameters that are near zero and then a few that are far from zero. Using a Cauchy-type distribution I can model that situation, but I need Stan to figure out that a few of these parameters need to move out to their distant mode (say, on some scale, they’re maybe 3 ± 0.2).

The data may effectively constrain those parameters to be near their mode, but warmup first has to push them way out there, which means it wants a big step size, and then put on the brakes and step with small steps… it’s a tricky problem that could be helped by starting near the mode and having all the warmup steps be in that “good” space.

Please read the manual section on warmup in Stan. Before starting any adaptation, Stan runs a series of unadapted transitions that are _very_ good at getting the chain into the typical set. Really the only time this fails is when the model is ill-posed or sufficiently pathological that the Markov chain won’t fully explore it anyway. The slowdown for complicated models occurs once the Markov chain has entered the typical set, where the parameters are highly correlated and of different scales. This requires very small step sizes and long trajectories to adequately explore, and so the chains slow down as they try to fully explore and capture enough of the typical set to make a reasonable estimate of the covariance for the next stage of adaptation. If you plot transitions vs. time, you’ll see an initial slowdown before a rapid acceleration as adaptation kicks in in stages.

Consequently, trying to intelligently start the chains somewhere will not do anything useful. The potential speed-ups come from rescaling the model parameters so that the expected posterior variances are similar, and even better if the correlations can be muted with clever reparameterizations.
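A tiny numerical illustration of the rescaling point, in Python (the draws are simulated, not from a real model; the scale ratio is a crude stand-in for how much the step size must shrink to handle the stiffest direction):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy posterior draws with badly mismatched scales: sd 100 vs. sd 0.01.
draws = rng.normal(size=(1000, 2)) * np.array([100.0, 0.01])

# Ratio of the largest to smallest marginal sd: a rough proxy for the
# mismatch the integrator has to cope with.
sds = draws.std(axis=0)
before = sds.max() / sds.min()

# Rescale each parameter by its own sd -- what a well-chosen
# parameterization (or an adapted diagonal mass matrix) achieves.
rescaled = draws / sds
rescaled_sds = rescaled.std(axis=0)
after = rescaled_sds.max() / rescaled_sds.min()
# before is in the thousands; after is exactly 1.
```

After the rescaling, every direction has unit scale, so a single step size serves the whole space.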

Good to know! I definitely notice the slowdown during adaptation, but didn’t know what was going on, so I assumed maybe starting in a good spot might help; sounds like maybe not.

I have definitely improved my model’s performance by using some of the reparameterization tricks that are documented in the Stan manual. You guys have one of the best manuals for a technical piece of software in the whole world, not just the free-software world.

What about the situation where there are local modes that are not meaningful? For example, suppose x is a vector of 1000 parameters. You have

x ~ cauchy(0,1)

and 3 or 4 of the x values have true posterior peaks near say 10 or 20 due to the likelihood… but maybe if you are near 0 you still have a local mode. I know there’s no good solution to this multi-modal problem in general, but wouldn’t this be a case where you’d be likely to want to start with a good initial value?

Trying to explore just one mode is not fully Bayesian, so you’re not in a well-posed mathematical framework, and hence there’s no “correct” answer. Whatever heuristic does well for your problem is viable, provided that its limitations are recognized and acknowledged.

In problems where 2000 warmup steps takes 10 seconds… this just doesn’t arise, but if you need an overnight run to sample 100 warmup and 100 real samples… when you realize that you changed the constraint on the lower bound of vector foo, but forgot to change the lower bound of vector bar to correspond… you’d still like to not go back to a randomized initial vector. The information you got from the earlier run could be still very helpful in converging faster on the debugged model.

It sounds, however, like I’m probably better off working harder to reparameterize and hence speed up the way things run rather than try to start in equilibrium, especially since at the moment there’s no way to reuse the mass matrix etc.

Whoa, yeah. This paper is *way* above my head. I intuitively understand NUTS (in part due to Betancourt’s excellent presentations and figures), but I’m not “intuitively” grasping what XHMC is doing. Is it defining a region along the simulated particle that, when intersecting with a previous simulated path, will terminate the simulation as opposed to stopping the simulation upon hitting a reversal in trajectory?

A Hamiltonian system is one which moves along a constant energy curve in high dimensions. Think of a ball rolling along a single contour on a contour map. A constant energy curve is the set of points in a “microcanonical” (aka constant energy) ensemble. The average over all the points along a constant energy curve, repeated for multiple constant energy curves at different energy levels… is one way to think about the average over a bunch of HMC samples.

Staying on a single energy curve too long helps you refine the average over that energy curve… but doesn’t help you refine the overall average, because it’s giving you highly specific information about just one level of energy. (Note: in this analogy, energy = negative log probability density.)

Staying on a single energy curve for too short a time/distance prevents you from moving through space to get a wide variety of samples.

Basically he’s coming up with a way to decide when to stop moving along the constant energy curve and give your state vector a “kick in the pants” (by assigning it a random momentum) so it moves to another energy curve. Rather than looking for a place where it “turns around,” he’s thinking about “when does more sampling along this energy level give me diminishing returns?”

At least, that’s the take-away I’m getting out of it.

The energy has two components, the kinetic energy (kick in the pants!) which is initialized randomly each iteration and the potential energy, which at any given point is the negative log posterior density at that point (parameter value). The Hamiltonian (sum of the components) is what is conserved. The distribution on the kinetic energy (which is chi square, being the sum of squared normals) determines how much the log posterior can change. When you roll up the contour, the potential energy goes up and the kinetic energy goes down, then the kinetic goes to zero and you start rolling back the other way, gathering kinetic energy and losing potential energy.

To clarify, potential energy = negative log probability density, and kinetic energy comes from the random momentum you assign. Total energy is then constant during the integration. At some point, you stop integrating and slam your particle with a new momentum (changing the kinetic energy). Most of the time an HMC step should be accepted. But how long should you run before you check acceptance and then slam it with a new kinetic energy, in order to get effective motion throughout the high-probability space?
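The dynamics described above can be simulated with the standard leapfrog integrator. Here is a minimal Python sketch on a standard normal target, showing that the total energy H = U + K stays (nearly) constant along a trajectory; this is generic HMC machinery, not the paper’s new algorithm:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Leapfrog integration of Hamiltonian dynamics.

    q: position (parameters), p: momentum,
    grad_U: gradient of the potential energy U = -log density,
    eps: step size, L: number of steps.
    """
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)          # initial half step for momentum
    for _ in range(L - 1):
        q += eps * p                    # full step for position
        p -= eps * grad_U(q)            # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, p

# Standard normal target: U(q) = q.q / 2, so grad_U(q) = q.
grad_U = lambda q: q
U = lambda q: 0.5 * q @ q               # potential: -log density (+ const)
K = lambda p: 0.5 * p @ p               # kinetic: from the random momentum

rng = np.random.default_rng(1)
q0 = rng.normal(size=2)
p0 = rng.normal(size=2)                 # the random "kick in the pants"
H0 = U(q0) + K(p0)
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, L=50)
H1 = U(q1) + K(p1)
# H1 stays close to H0: the trajectory hugs one energy level set.
```

The tiny H1 − H0 discrepancy is the integrator error that the Metropolis acceptance check corrects for; resampling p0 is what jumps the chain to a different level set.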

On pages 3 and 4 he describes this as “lifting from the sample space to the cotangent bundle” and “projecting back to the sample space.” Think of “lifting” as whacking the particle with a hammer, thereby giving it a momentum, and “projecting back” as grabbing the particle and stopping it, thereby making it not move.

On page 10 he basically says that we define a “how much are these two points like each other” function kappa (the autocorrelation function) and then decide to stop the particle once it gets to a region that is “not very much like where it started,” test it for acceptance, and give it a whack with a hammer at that point.

Hopefully Michael will fix any inaccuracies in my interpretation here :-)

I think I understand the premise of HMC sampling, but I meant that I don’t understand XHMC specifically. NUTS uses the no-u-turn criterion in euclidean space as a simulation stopping point. I don’t understand what this new method, XHMC, is exactly doing. So I mean that I understand HMC, I understand NUTS, but I don’t understand XHMC and how it’s addressing the issue that NUTS addressed in a different way. Much of the paper is just over my head, but based on the figures alone it seems like the criterion isn’t “no u-turns allowed”, but rather “yup, we’ve been to a similar combination of these parameter values already in this path, so stop simulation” (the larger the test region, the more liberal the sampler would be about what it considers ‘similar combination of parameter values’, which as the paper says, can be problematic and may need tuning). I may be totally, totally off, but that was the point of my comment – I have no idea what XHMC (as opposed to HMC or NUTS) is intuitively doing.

The proposal is to stop moving the point along when some scalar function kappa, which decreases monotonically in time along a Hamiltonian path, gets small enough. If kappa is a function that decreases to zero through time as you integrate, then making kappa small enough is potentially a good stopping criterion.

So it’s not “we’ve been near here before” but rather “this point is pretty far away from where we started, because our function kappa(state(now), state(start)) < delta.”

But what kappa to use? On page 10 he suggests looking at the time rate of change of the “virial,” as it’s a generic scalar function of a Hamiltonian system.

In essence, he's reduced the problem of deciding when to stop to the problem of deciding what's a good threshold for a single parameter delta to determine if the virial is changing slow enough.
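Here is a toy Python sketch of that kind of stopping rule: integrate leapfrog steps and terminate once the time-averaged change in the virial p·q drops below a threshold delta. To be clear, the exact exhaustion statistic in the paper is more subtle; the target, the averaging, and the delta value below are all made up for illustration:

```python
import numpy as np

def integrate_until_exhausted(q, p, grad_U, eps, delta, max_steps=10000):
    """Toy stopping rule: run leapfrog until the time-averaged change in
    the virial p.q falls below delta (a sketch of the idea, not the
    paper's exact exhaustion criterion)."""
    virial_start = p @ q
    for step in range(1, max_steps + 1):
        # one leapfrog step
        p = p - 0.5 * eps * grad_U(q)
        q = q + eps * p
        p = p - 0.5 * eps * grad_U(q)
        elapsed = step * eps
        avg_rate = abs(p @ q - virial_start) / elapsed
        if avg_rate < delta:            # virial changing slowly enough: stop
            return q, p, step
    return q, p, max_steps

grad_U = lambda q: q                    # standard normal target: U(q) = q.q / 2
rng = np.random.default_rng(3)
q0, p0 = rng.normal(size=2), rng.normal(size=2)
q1, p1, steps = integrate_until_exhausted(q0, p0, grad_U, eps=0.1, delta=0.05)
# steps is where this trajectory would stop and the momentum be resampled
```

Because the virial is bounded on a level set, its time-averaged rate of change decays toward zero as the trajectory lengthens, so the rule always terminates; the whole tuning problem is then collapsed into choosing delta.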

Thank you for this explanation! I’m grokking it a bit better now.

There’s a more introductory review coming soon, so I wouldn’t struggle with this if you’re not comfortable with the math. It was meant to be as formal as possible to provide a foundation for any followup work.

The basic idea can be derived from Figure 3 in the paper. The trajectories explore the level sets in grey; the momentum resampling jumps the chain between level sets. The longer the integration time of each trajectory, the more we explore those level sets, but at some point that exploration leads to diminishing returns and it’s not worth the cost. The practical question is: what is the optimal choice of integration time? The No-U-Turn criterion is one heuristic condition that works well but is known to do worse in highly correlated models. In the paper I introduce exhaustions, which are a more theoretically motivated condition, albeit one with an additional tuning parameter that makes it awkward to use in practice.

Almost as importantly, you really want to focus on the trajectories as first-class objects when thinking about how each transition is constructed. You first sample a trajectory that contains the initial point _anywhere_ in it, then you sample a new point randomly from that trajectory. Once you have a new point, you resample the momentum and continue ad nauseam.

Looking forward to the review paper. Honestly, I think understanding MCMC, HMC, Gibbs, M-H, and NUTS has helped me better understand model construction and inference, so having better intuition about more novel methods can only further that understanding. Perhaps that shouldn’t be the case, but for me, understanding how HMC and NUTS sample has helped me create more effective models, diagnose potential problems in the parameterization, etc. Moreover, when I describe Bayesian inference, I oftentimes explain such algorithms, because I think they help novices of Bayesian inference better understand exactly what the output is telling them and how to specify models. Of course I understand the need for formal papers, even if I cannot currently understand them (I’m a social psych PhD candidate, unfortunately nowhere near trained enough in the mathematics required to understand such papers). I’m just grateful for the work the whole team is investing in Stan.

I was reading through “Identifying the Optimal Integration Time in Hamiltonian Monte Carlo” and had to go back to “The Geometric Foundations of Hamiltonian Monte Carlo” to follow (some of) the notation.

I am sure I will have plenty more questions, but can someone confirm that in 1.1 of the former and 3.1 of the latter the right-hand side is the Radon-Nikodym derivative? Here’s the equation in LaTeX, though I doubt this will get rendered :(

$$
H = -\log\frac{d(\pi^*\bar{\omega} \wedge \zeta)}{d\Omega}
$$

I think it has to be, because $\Omega$ is a volume form and the exterior derivative of such a form is 0, so using $d$ to denote the differential makes no sense (even if the ratio of k-forms made any sense).

But in that case, how is the Radon-Nikodym derivative being calculated? Presumably in some co-ordinate system (the canonical co-ordinate system?) we can associate (real-valued) functions on the cotangent bundle, now interpreted as a measure space, with the two volume forms? I assume that this concept is well defined with respect to changes in co-ordinates?

Great papers, BTW.

It *is* the Radon-Nikodym derivative.

[This is the first time I am posting on this blog while sitting in a hotel only a few minutes away from Columbia University.]

I highly recommend watching Betancourt’s awesome lecture from Stancon 2017 called “What you *should* have learnt about MCMC”. Especially if you are not a mathematician or statistician, I suggest people watch it several times and then read the HMC article he wrote.

Here is the recording of the livestream. Betancourt begins at about 4:37:50.

Sorry, I meant that it begins at 4:10:10. Here is the link again.

Sometime soon we’ll also cut up the long video of the conference and post a bunch of shorter videos so the different talks are easier to find.

This should take you right to the spot where Michael starts his talk: https://youtu.be/DJ0c7Bm5Djk?t=4h40m10s