Burn-in for MCMC, why we prefer the term warm-up

Here’s what we say on p.282 of BDA3:

> In the simulation literature (including earlier editions of this book), the warm-up period is called burn-in, a term we now avoid because we feel it draws a misleading analogy to industrial processes in which products are stressed in order to reveal defects. We prefer the term ‘warm-up’ to describe the early phase of the simulations in which the sequences get closer to the mass of the distribution.

Stan does adaptation during the warm-up phase.
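
For concreteness, here is a minimal sketch of how the warm-up length and adaptation target are typically set when running Stan from Python via CmdStanPy. The model file and data below are placeholders, not anything from this post, and the argument names reflect the CmdStanPy interface rather than anything specific to BDA3.

```python
# Minimal sketch, assuming CmdStanPy is installed and a Stan program exists
# at the (hypothetical) path "bernoulli.stan".
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="bernoulli.stan")  # placeholder model

fit = model.sample(
    data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]},  # placeholder data
    chains=4,
    iter_warmup=1000,    # warm-up: step size and mass matrix are adapted here
    iter_sampling=1000,  # after warm-up: adaptation is frozen, draws are kept
    adapt_delta=0.8,     # target acceptance statistic for step-size adaptation
    seed=123,
)

# Warm-up draws are not saved by default; fit.draws() returns only the
# post-warm-up draws.
print(fit.draws().shape)
```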

9 thoughts on “Burn-in for MCMC, why we prefer the term warm-up”

  1. I’d thought of burn-in as waiting for the chain to reach stationarity, since chains get initialized somewhere decidedly not in the joint posterior, whereas warm-up did other things (like tuning parameters in your proposal mechanism to get desirable acceptance ratios; there’s probably more to it depending on the sampling algorithm, but I’ve never personally implemented, e.g., HMC, so I’m a bit foggy on exactly what happens during warm-up there). Some MCMC software (e.g., RevBayes) calls that burn-in too, though, so my division is pretty wishy-washy. I think I can trace this impression back to the following quote from Richard McElreath’s book (Statistical Rethinking, p. 256):

    > Rethinking: Warmup is not burn-in. Other MCMC algorithms and software often discuss burnin.
    With a sampling strategy like ordinary Metropolis, it is conventional and useful to trim off the
    front of the chain, the “burn-in” phase. This is done because it is unlikely that the chain has reached
    stationarity within the first few samples. Trimming off the front of the chain hopefully removes any
    influence of which starting value you chose for a parameter.

    > But Stan’s sampling algorithms use a different approach. What Stan does during warmup is quite
    different from what it does after warmup. The warmup samples are used to adapt sampling, and so
    are not actually part of the target posterior distribution at all, no matter how long warmup continues.
    They are not burning in, but rather more like cycling the motor to heat things up and get ready for
    sampling. When real sampling begins, the samples will be immediately from the target distribution,
    assuming adaptation was successful. Still, you can usually tell if adaptation was successful because
    the warmup samples will come to look very much like the real samples. But that isn’t always the case.
    For bad chains, the warmup will often look pretty good, but then actual sampling will demonstrate
    severe problems. You’ll see examples a bit later in the chapter.

    • Absolutely. Andrew ran the blog post by me first, and I wrote back, “In addition to your answer, I also try to stress that we do adaptation and thus aren’t even running a Markov chain during warmup.”

      Warmup/adaptation isn’t intrinsically part of HMC. It was part of the first NUTS algorithm in the JMLR paper. Algorithms like Metropolis can also be adaptive. So it’s all very confusing.

      As to McElreath’s explanation, let me elaborate that Stan needs to find the typical set (the region of high probability mass in the posterior) and then spend enough time there to estimate the adaptation parameters (mass matrix [metric] and step size [integration discretization interval]). So while HMC can often find the typical set pretty quickly, it needs to spend 100 iterations or so there to get a decent estimate of the mass matrix based on the draws from the last half of the warmup iterations. That means we usually need a warmup period of at least 200 iterations or so to ensure we get 100 iterations from which to estimate the posterior (co)variance. The step size is tuned to a target acceptance rate. (A toy sketch of this kind of acceptance-rate tuning, using plain random-walk Metropolis rather than HMC, appears after the comments.)

  2. I hope you realize that after switching off adaptation, you need to do some more “warm up” before you would expect to get points from (close to) the correct distribution, since the adaptation will in general cause you to sample from the wrong distribution, and therefore the non-adaptive period doesn’t start in equilibrium.

    (Though the need would be less if adaptation weren’t actually changing the tuning parameters much near the end of the adaptation period. I don’t know whether that’s how things work in Stan.)

    • There’s something weird about the idea that, having arrived at a certain point in the typical set, the chain has further to go to be in equilibrium. By definition, for a Markov chain the future state is independent of the past given the present state. In other words, having arrived at a given point in state space, it doesn’t matter how you got there: by some human picking the point by hand, by sampling from the actual stationary distribution, or by sampling a path from some non-Markov-chain stochastic process.

      • It might be more that, with adaptation, you could be biased into the tails of the distribution, so your starting point isn’t really in the typical set yet. Sure, in short order it might get there, but if you take only a few samples right after stopping adaptation, the adaptation bias, if it exists, will throw you astray. I assume this is highly dependent on the model and the adaptation procedure. So the point of running some more iterations after adaptation is to make sure you really are in the typical set before you trust the draws.

        • And now, having read Bob’s comment up above, I am reminded that Stan’s adaptation procedure at least kind of assumes it’s in the typical set and tunes the mass matrix based on that. So for Stan’s particular adaptation procedure, there might be a lot less bias in the initial post-adaptation draws than for other, more general adaptation procedures.

  3. I once worked for a company that built very stable, very accurate (and very expensive) oven-controlled quartz crystal oscillators into their products. If you walked over to the right spot on the production floor, you’d find racks of these oscillators plugged in and being monitored by strip-chart recorders, which compared their output frequency against an even better standard. The point was to let them run long enough for their output to stabilize as the components aged sufficiently. While some no doubt failed, the primary purpose was getting over the initial instability.

    We tended to talk about warmup when we turned on an instrument that contained such an oscillator, for the frequency would of course vary as the oven was settling into its setpoint.

    References:

    https://www.febo.com/pipermail/time-nuts/2011-September/059257.html speaks about burn-in periods of several weeks. (There’s no doubt a better reference, but I don’t know where it is.)

    http://hpmemoryproject.org/an/pdf/an_116.pdf#page=7 shows a warmup curve (time measured in seconds, not weeks).

    http://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1981-03.pdf#page=25 provides another discussion of warmup time, which applied every time you turned the unit on (suggesting you didn’t want to turn it off unnecessarily).
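
To make the adaptation-then-freeze idea in the comments above concrete, here is a toy sketch, not Stan’s algorithm: a random-walk Metropolis sampler whose proposal scale is tuned toward a target acceptance rate during warm-up and then frozen for sampling, with the warm-up draws discarded. The target density, tuning rule, and all constants are purely illustrative.

```python
# Toy sketch of warm-up with adaptation followed by fixed-kernel sampling.
# During warm-up the transition rule keeps changing (so those draws are not
# from a fixed Markov chain); after warm-up the tuned kernel is frozen.
import numpy as np

def log_target(x):
    # Illustrative target: standard normal log density, up to a constant.
    return -0.5 * x * x

def adaptive_metropolis(n_warmup=1000, n_sampling=1000, target_accept=0.44, seed=1):
    rng = np.random.default_rng(seed)
    x = 10.0          # deliberately bad starting point, far from the typical set
    log_scale = 0.0   # log of the proposal standard deviation, tuned during warm-up

    def step(x, scale):
        prop = x + scale * rng.normal()
        accept = np.log(rng.uniform()) < log_target(prop) - log_target(x)
        return (prop, True) if accept else (x, False)

    # Warm-up: nudge the proposal scale toward the target acceptance rate,
    # with a shrinking step size so the adaptation settles down. These draws
    # are simply discarded.
    for i in range(n_warmup):
        x, accepted = step(x, np.exp(log_scale))
        log_scale += (float(accepted) - target_accept) / (i + 1) ** 0.6

    # Sampling: the proposal scale is now fixed, so this is an honest Markov chain.
    scale = np.exp(log_scale)
    draws = np.empty(n_sampling)
    for i in range(n_sampling):
        x, _ = step(x, scale)
        draws[i] = x
    return draws, scale

draws, tuned_scale = adaptive_metropolis()
print(f"tuned proposal sd: {tuned_scale:.2f}, estimated mean: {draws.mean():.2f}")
```

The warm-up draws here are discarded both because the chain starts far from the typical set and because the transition rule is still changing while the scale is being tuned, which is the distinction the comments above draw between burn-in and adaptation.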
