I hate to keep bumping our scheduled posts but this is just too important and too exciting to wait. So it’s time to jump the queue.
The news is a paper from Michael Betancourt that presents a super-cool new way to compute normalizing constants:
A common strategy for inference in complex models is the relaxation of a simple model into the more complex target model, for example the prior into the posterior in Bayesian inference. Existing approaches that attempt to generate such transformations, however, are sensitive to the pathologies of complex distributions and can be difficult to implement in practice. Leveraging the geometry of thermodynamic processes I introduce a principled and robust approach to deforming measures that presents a powerful new tool for inference.
The idea is to generalize Hamiltonian Monte Carlo so that it moves through a family of distributions (that is, it transitions through an “inverse temperature” variable called beta that indexes the family) as a way of continuously sweeping through to efficiently compute the normalizing function. The graphs above show a single HMC path with the temperature going from 0 to 1 on a simple test example. When running the algorithm, this would represent one of many simulated paths. As Betancourt points out, “What’s cool about this is that once you have a sample from the base distribution you only need one transition to get a sample from the target distribution, albeit an expensive transition. This should make seeding and mode exploration really fast.”
The background for the algorithm is as follows. Methods such as simulated annealing and simulated tempering are based on altering a temperature parameter. The basic idea is that you want to simulate from some distribution p(theta) but it’s too difficult, so you start from a simpler distribution p_0(theta) that you understand well, and then you work with the class of distributions p(theta|beta) = p_0(theta)^(1-beta)*p(theta)^beta. You start at beta=0 (“high temperature”) and gradually move toward beta=1 (“low temperature”). In simulated annealing or tempering, you gradually increase beta, not too slowly (or you’ll be wasting your time) and not too fast (or you’ll break the smooth operation of your simulations, assuming this is all embedded in some Markov chain simulation algorithm). You’ll want to move fast in areas where the distributions are changing slowly as a function of beta, slowly in areas where the distributions are changing fast, and you’ll want to stay still for awhile where there are phase transitions. And then once you have all your simulations you can estimate the normalizing function using path sampling. It works, but it’s clunky as it requires lots of outside tuning. It’s nothing close to a black box.
The physical analogy of tempering or annealing is clear—but the interesting thing is that it has problems. In particular, you can’t actually set or change the temperature of a physical system. What you can do is couple your system to a heat bath or cold bath and then let the temperature change naturally. That is, you can connect your system to a heat pump or a refrigerator, you can’t stick it in the microwave oven or in a (nonexistent) “microwave cooler.”
What does this imply for statistical algorithms for simulating from difficult distributions and computing normalizing constants? It implies that if we want to follow the physics, we should not have our algorithm “change the temperature” as if there were some kind of knob to set this state variable. Instead, we should alter the temperature in a more natural, physical way (that is, respecting the underlying differential equations) by mathematically coupling the system with a heat bath or cold bath and letting the temperature then evolve as it will. This is what Betancourt’s thermodynamic Monte Carlo algorithm does. I don’t follow the details of the algorithm (I’m waiting till it appears in Stan) but the basic idea makes a lot of sense to me, as it follows the Folk Theorem and the Pinocchio principle.
And here’s some more intuition from Michael on what the algorithm is doing:
In the thermodynamic evolution we have three things going on. Firstly, we have a usual dynamic evolution that mixes the positions and the momenta. Secondly we have a heat bath that adds or removes momentum depending on current potential energy relative to the average potential energy. Thirdly we have a temperature evolution term that leaches kinetic energy in order to fuel changes in temperature.
When the system is in equilibrium the kinetic energy is nonzero, some of it can be drained away to change the temperature, and then the system re-equilibrates by adding/removing kinetic to balance the average potential energy and then mixing the positions and momenta again. But when the system is far away from equilibrium the kinetic energy will approach zero and the temperature evolution stalls until the positions and momenta equilibrate: the three evolution terms feed back into each other to maintain equilibrium.
Then let’s consider what happens in Figure 4. When the trajectory is in the bulk of probability mass we have equilibrium and the temperature can evolve forward smoothly. When the trajectory approaches a tail all of the energy converts to potential and there’s little kinetic energy to fuel the temperature evolution, which slows to a crawl. This gives the trajectory a chance to bounce back towards the bulk, converting the potential back into kinetic energy which then restarts the temperature evolution.