Tempering and modes

Gustavo writes:

Tempering should always be done in the spirit of *searching* for important modes of the distribution. If we assume that we know where they are, then there is no point to tempering. Now, tempering is actually a *bad* way of searching for important modes, it just happens to be easy to program. As always, my [Gustavo’s] prescription is to FIRST find the important modes (as a pre-processing step); THEN sample from each mode independently; and FINALLY weight the samples appropriately, based on the estimated probability mass of each mode, though things might get messy if you end up jumping between modes.

My reply:

1. Parallel tempering has always seemed like a great idea, but I have to admit that the only time I tried it (with Matt2 on the tree-ring example), it didn’t work for us.

2. You say you’d rather sample from the modes and then average over them. But that won’t work if you have a zillion modes. Also, if you know where the modes are, the quickest way to estimate their relative masses might well be an MCMC algorithm that jumps through them.

3. Finally, pre-processing to find modes is fine, but if pre-processing is so important, it probably needs its own serious algorithm too. I think some work has been done here but I’m not up on the latest.
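For readers who haven’t seen parallel tempering in code: the idea is to run chains at several inverse temperatures and occasionally swap states between adjacent chains, so mode-hopping at hot temperatures propagates down to the cold chain. Here is a minimal sketch on an invented toy bimodal target (nothing to do with the tree-ring example above); the temperature ladder and proposal scale are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1), unnormalized.
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

betas = [1.0, 0.5, 0.25, 0.1]   # inverse temperatures; beta = 1 is the target
chains = [0.0] * len(betas)     # current state of each tempered chain

def pt_step(chains):
    # Random-walk Metropolis within each tempered chain, targeting p(x)^beta.
    for i, beta in enumerate(betas):
        prop = chains[i] + rng.normal(scale=2.0)
        if np.log(rng.uniform()) < beta * (log_target(prop) - log_target(chains[i])):
            chains[i] = prop
    # Propose swapping states between adjacent temperatures.
    for i in range(len(betas) - 1):
        log_r = (betas[i] - betas[i + 1]) * (
            log_target(chains[i + 1]) - log_target(chains[i]))
        if np.log(rng.uniform()) < log_r:
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
    return chains

samples = []
for _ in range(20000):
    chains = pt_step(chains)
    samples.append(chains[0])   # keep only the cold (beta = 1) chain

samples = np.asarray(samples)
frac_right = (samples > 0).mean()   # near 0.5 if swaps let the cold chain mix
```

A plain random-walk chain on this target would typically sit in one mode for the entire run; with the swap moves the cold chain visits both. None of this, of course, tells you whether the ladder has found *all* the important modes, which is Gustavo’s point.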

6 thoughts on “Tempering and modes”

  1. This comment presumes that finding the locations of the modes provides important information about the distribution. In complex hierarchical models, this is often false. The modes may be at places with very small posterior probability mass, such as where hyperparameters specifying prior variances are very small (and hence prior densities for parameters are very high). There may be several regions that are not easily moved between by local MCMC steps, but these regions aren’t necessarily well-described by a mode location, or by any other simple quantity. Good tempering-style methods for moving between modes in simple problems may also handle these more complex situations, but methods based on explicitly finding modes won’t.

  2. Andrew, re:2 I’d rather use an approach that doesn’t jump between modes. If we have good samples from each mode, it may be better to estimate relative mass directly (as the average density times mode volume) than to see how much time the Markov chain spent at each mode. But now the tricky thing is estimating mode volume.

    Radford,
    There is a slight ambiguity in the word “mode”. I used the word “important mode” to mean modes (bumps) with significant mass, which is often roughly equivalent to the modes (peaks) having significant density. But I agree with your comment, and would be interested in references to such smart tempering methods. Is there a better way to use optimization algorithms in this field? I imagine that having a smoothed objective would make mode-finding more useful, but that’s not easy to compute either.

    • Gustavo: All the tempering methods (simulated tempering, parallel tempering, tempered transitions, annealed importance sampling) are “smart” in the sense that they are correct (in the usual asymptotic sense) regardless of whether the distribution has nice Gaussian-like modes or is much more complex, with, for instance, highly skewed modes, where there is little mass near the peak. Once the distribution is complex, all simple strategies like finding the “volume of a mode” (which seems ill-defined) aren’t going to work. To give a statistical physics example, when simulating a system of molecules, the mode will be at the state where all the molecules are in a regular array, forming a perfect crystal. This is of no relevance at all if you are interested in the properties of the system at a temperature where it is in a liquid state.

      • Radford, I know that they work, but I would like to see a little more systematicity in these tempering algorithms (I don’t like to run randomized algorithms if we can avoid it), hence my fondness for Mode-Oriented Stochastic Search (MOSS), quasi Monte Carlo, and mesh-based integration methods.

    • Isn’t one problem with this that it’s frequently the case that the “non-important” modes will integrate to a significant mass? To continue that stat physics analogy, there will only be a few favorable ordered states, but there will be many disordered states. The important modes may integrate to only a small fraction of the probability.

      Or is this a problem that arises more frequently in physical systems than estimation problems commonly encountered in statistics?

  3. I thought a key MCMC trick was to introduce unidentified variables such that, e.g., a linear combination or product is identified, to allow smooth mixing. Doesn’t that blow up mode-finding?

    I was surprised to see anti-tempering aka data cloning as a strategy for finding MLEs in hierarchical models. As I understand it, their strategy is just a fractional temperature.
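Radford’s “average density times mode volume” estimate above has a standard concrete instance: the Laplace approximation, which treats each mode as locally Gaussian and takes its mass to be the peak density times the Gaussian volume implied by the curvature at the peak. The sketch below uses an invented toy 1-D mixture; as the thread notes, this breaks down badly for skewed or ill-defined modes:

```python
import numpy as np
from scipy.optimize import minimize

def log_p(x):
    # Toy unnormalized density: 0.8 * N(3, 1) + 0.2 * N(-3, 1).
    return np.logaddexp(np.log(0.8) - 0.5 * (x - 3.0) ** 2,
                        np.log(0.2) - 0.5 * (x + 3.0) ** 2)

def laplace_mass(x0, eps=1e-4):
    # Find the mode near x0, then approximate the local mass as
    # p(mode) * sqrt(2*pi / |d^2 log p / dx^2 at the mode|).
    m = minimize(lambda x: -log_p(x[0]), [x0]).x[0]
    # Finite-difference second derivative of log p at the mode.
    h = (log_p(m + eps) - 2.0 * log_p(m) + log_p(m - eps)) / eps ** 2
    return np.exp(log_p(m)) * np.sqrt(2.0 * np.pi / -h)

weights = np.array([laplace_mass(3.0), laplace_mass(-3.0)])
weights /= weights.sum()
# For this near-Gaussian toy case the relative masses come out near 0.8 / 0.2.
```

The same idea extends to d dimensions with the Hessian determinant in place of the scalar second derivative. For the crystal-versus-liquid example above, this estimate would be essentially meaningless, which is exactly Radford’s point.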
