Let’s play Twister, let’s play Risk

Alex Terenin, Dan Simpson, and David Draper write:

Some months ago we shared with you an arXiv draft of our paper, Asynchronous Distributed Gibbs Sampling.

Through comments we’ve received, for which we’re highly grateful, we came to understand that

(a) our convergence proof was wrong, and

(b) we actually have two algorithms, one exact and one approximate.

We’ve now remedied (a) and thoroughly explored (b).

The proof is constructive and interesting: it uses ideas from parallel tempering and from the Banach fixed-point contraction-mapping theorem.

The approximate algorithm runs much faster than the exact version at big-data scale, so we provide a diagnostic for checking whether the approximation is good enough for its speed advantage to be enjoyed on the problem currently being solved.

Wow, Gibbs sampling, that’s so ’90s. Do people still do this anymore? And parallel tempering, that’s what people used to do before adiabatic Monte Carlo, right?

All jokes aside, here’s the paper, and here’s their story:

The e-commerce company eBay, Inc. is interested in using the Bayesian paradigm for purposes of inference and decision-making, and employs a number of Bayesian models as part of its operations. One of the main practical challenges in using the Bayesian paradigm on a modern industrial scale is that the standard approach to Bayesian computation for the past 25 years – Markov Chain Monte Carlo (MCMC) [22, 33] – does not scale well, either with data set size or with model complexity. This is especially of concern in e-commerce applications, where typical data set sizes range from n = 1,000,000 to n = 10,000,000,000.

In this paper we [Terenin, Simpson, and Draper] offer a new algorithm – Asynchronous Gibbs sampling – which removes synchronicity barriers that hamper the efficient implementation of most MCMC methods in a distributed environment. This is a crucial element in improving the behavior of MCMC samplers for complex models that fall within the current “Big Data/Big Models” paradigm: Asynchronous Gibbs is well-suited for models where the dimensionality of the parameter space increases with sample size. We show that Asynchronous Gibbs performs well on both illustrative test cases and a real large-scale Bayesian application.
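To make the synchronicity point concrete, here’s a toy sketch, emphatically not the paper’s algorithm, of Gibbs-style coordinate updates that read stale values of the other coordinate, the way an unsynchronized worker might (the bivariate-normal target and the lag parameter are illustrative assumptions):

    import numpy as np

    # Toy illustration only, not Terenin et al.'s method: Gibbs updates for a
    # bivariate normal with correlation rho, where each full-conditional draw
    # reads a possibly stale value of the other coordinate, mimicking a worker
    # that doesn't wait at a synchronization barrier.
    rng = np.random.default_rng(1)
    rho, lag = 0.9, 3                          # lag = staleness in sweeps (hypothetical)
    sd = np.sqrt(1.0 - rho**2)                 # conditional sd of x given y (and y given x)
    xs, ys = [0.0], [0.0]
    for _ in range(10_000):
        y_stale = ys[max(0, len(ys) - 1 - lag)]            # stale read of y
        xs.append(rho * y_stale + sd * rng.standard_normal())
        x_stale = xs[max(0, len(xs) - 1 - lag)]            # stale read of x
        ys.append(rho * x_stale + sd * rng.standard_normal())
    # The marginals stay close to N(0, 1), but the sampled correlation is
    # attenuated below rho: the kind of approximation error a diagnostic
    # needs to catch.
    print(np.corrcoef(xs[1000:], ys[1000:])[0, 1])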

Just as a minor point, I think they’re kneecapping themselves by running only one chain. Check out this, for example:

[Screenshot from the paper]

This sort of thing really cries out for multiple chains. No big deal: it would be trivial to run their algorithm as multiple parallel chains, and it’s a good idea to do so; otherwise you’re likely to run your simulations way too long out of paranoia.

17 thoughts on “Let’s play Twister, let’s play Risk”

    • Of course, “I’m going to run it as long as I have time for, so why wouldn’t I want as many chains as I can get in that time?” (given that running chains in parallel doesn’t appreciably decrease each chain’s length)

    • Kit:

      On my laptop I can run 4 chains in essentially the same amount of time as running 1 chain. The advantage of 4 chains is that I can easily diagnose convergence. A realistic scenario (meaning that this actually happens) is that I can run 4 chains of length 1000 each and see good mixing, whereas with 1 chain, someone might feel the need to run 20,000 iterations just to get the same answer. That would take 20 times as long (or 5 times as long with no parallel implementation).

      We evaluate mixing using split R-hat: after discarding the first half of each chain, we split what remains into 2 parts. So with 4 chains each of length 1000, we discard the first halves to get 4 chains each of length 500, then split each of those in half to get 8 chains of length 250, and we compute R-hat and n_eff on that. With only one chain you can still do this, but slow-moving chains can cause problems there, hence the need to be really paranoid and run these things forever. Or people are in a hurry, and when they run just 1 chain they’ll stop, think they’re ok, and miss something.
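      For concreteness, here’s a minimal sketch of that split R-hat computation (a hypothetical NumPy version, not Stan’s actual implementation; n_eff is omitted for brevity):

          import numpy as np

          def split_rhat(chains):
              # chains: (m, n) array of m post-warmup chains with n draws each
              m, n = chains.shape
              half = n // 2
              split = chains[:, : 2 * half].reshape(2 * m, half)  # split each chain in two
              W = split.var(axis=1, ddof=1).mean()                # mean within-chain variance
              B = half * split.mean(axis=1).var(ddof=1)           # between-chain variance
              var_plus = (half - 1) / half * W + B / half         # pooled variance estimate
              return np.sqrt(var_plus / W)

          # 4 chains of 1000 draws: discard warmup to get 4 x 500, split to 8 x 250
          draws = np.random.default_rng(0).standard_normal((4, 1000))
          print(split_rhat(draws[:, 500:]))                       # ~1.0 for well-mixed chains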

      R-hat’s not perfect but it allows us to diagnose mixing in all sorts of problems. As I wrote in this earlier discussion, if I had a leg for every time that I’ve used multiple sequences to find a problem in my iterative simulation (or that someone has told me about such an experience themselves), I’d be a centipede many times over.

      Charlie’s discussion is annoying and, in my opinion, a bit unscholarly, in that he does not reference the relevant literature on this, such as my 1992 paper with Rubin (which he’s obviously aware of). If he wants to call my work “bogus” that’s his privilege, but it’s a bit tacky to do so without providing a link or reference. On the cited webpage he presents an example with 50 chains, each run for 200 iterations, and they clearly don’t mix. 50 chains is a bit of overkill, but the message is clear: 200 iterations is not enough. The multiple chains did the job in this example. Perhaps the problem is that Charlie is talking about “many short runs,” which is not something that we recommend.

      That’s one of the problems with a piece such as Charlie’s that gives no references. He’s attacking attacking attacking, but with no references it’s not at all clear what he’s attacking, or whether his attacks have any relevance.

        • I read Charlie’s advice as something like this: “Run as many diagnostics as you like, and try to figure out whether things are working ahead of time… it’s a bit futile because there could always be some problem that takes months of runs to detect… but it won’t hurt… but when it comes to making final inferences, run one chain per processor for as long as you can afford time-wise.”

          Right at the top of that page he says: “If you can’t get a good answer with one long run, then you can’t get a good answer with many short runs either.”

          Many short runs is something you don’t advocate either. I think there should be a lot of commonality between you two, but it seems like there’s just bad blood instead.

          Note also: “long” and “short” are relative terms. Pretty obviously, NUTS/HMC tends to do more exploration per sample than typical random-walk Metropolis. It also seems to do more exploration per CPU cycle, even though it takes a bunch of cycles to do the gradient calculation. So 200 samples of NUTS might explore as much as 20,000 samples of RW-Metropolis in some problems, and it’s not possible to make a blanket statement about how many samples you should draw or how much CPU time you should devote.

        • Daniel:

          I wouldn’t say there’s bad blood; it’s just kinda hard to respond to Charlie’s statements, as he doesn’t actually refer to any of the literature on convergence monitoring. We run 4 chains in parallel. I don’t know that there’s anyone running 50 chains each of length 200 as in Charlie’s example, but, sure, if anyone’s doing that, I agree with Charlie that it’s a bad idea, and I’d instead recommend they run 4 chains until there’s good mixing.

        • Asymptotically, Geyer’s right. But we only have finite amounts of time, and want to be able to diagnose when we haven’t run long enough for the asymptotics to kick in. And ideally, we’d like to minimize the finite amount of time we spend running our MCMC simulations while not being blindsided by easily diagnosable nonconvergence. Turning the argument around, the more sensitivity we can get with MCMC diagnostics and posterior checks, the more bogus results we’ll be able to throw out.

        • Bob:

          No, asymptotically if you run 4 chains you get 4 times the effective sample size, hence the asymptotic relative efficiency gain from running 4 chains is 4 (if they’re run in parallel) or 1 (if in serial).

        • I get it that diagnosing problems is useful. Geyer’s main point seems to be that if you diagnose a problem, then great, but if you don’t diagnose a problem… there’s really no guarantee that there isn’t a problem. So, you still need a big run.

          I am SURE Geyer isn’t saying “don’t run extra chains in parallel” since there’s no time-cost to extra chains in parallel. What he is saying is “don’t try to speed things up by starting from multiple starting points and hoping that this is going to explore different parts of the space, thereby fixing your startup transient problem”

          That is, don’t run 200 samples on each of 50 chains.

          The biggest problem with Geyer’s advice is perhaps that everyone has taken it and no one is actually doing that anymore ;-)

        • Andrew: if you run one chain for 1000 warmup cycles and 5000 real cycles and it’s converged by that first 1000 cycles, you get 5000 “good” samples for the cost of 6000 samples…

          if you run 5 chains for 2000 cycles and throw out 1000 from each, you get 5000 samples for the cost of 10000 samples….

          if you run the 5 chains in parallel, you get 5000 samples for the cost of 2000 samples

          In other words, better to elongate the first chain than run multiple chains in serial, and better to run multiple chains in parallel if you can.
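          In code, that bookkeeping is just (a trivial sketch, hypothetical numbers from above):

              # retained post-warmup draws vs. wall-clock cost, in iterations
              def cost(chains, iters, warmup, parallel):
                  kept = chains * (iters - warmup)                 # "good" samples retained
                  clock = iters if parallel else chains * iters    # wall-clock iterations
                  return kept, clock

              print(cost(1, 6000, 1000, parallel=False))  # (5000, 6000)   one long chain
              print(cost(5, 2000, 1000, parallel=False))  # (5000, 10000)  five chains, serial
              print(cost(5, 2000, 1000, parallel=True))   # (5000, 2000)   five chains, parallel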

          As for diagnosing nonconvergence: if the first 1000 samples seem to converge, then great. But the 6000 total samples in one chain DO have a greater chance of exploring a wider range than the 5 parallel chains of 2000 samples each.

          So, if your chains converge in a straightforward manner, a multi-chain run will show you this and let you estimate the warmup size you need, and this is good. If your chains converge to a metastable place and have a hard time getting into the “main” probability mass… only the really LONG chain will diagnose that, as you have very little hope of picking a “good” starting point in high dimensions, so re-starts aren’t going to help you.

          At least, that’s how I read Geyer’s thing.

    • Having started as an MCMC practitioner and having only recently learned a good bit of MCMC theory, I absolutely hate Geyer’s narrative and overall recommendations. The problem is that he’s basing his argument on very idealized assumptions that strongly cloud the discussion. Let me try to clear everything up.

      The most important aspect of MCMC that you have to recognize is that it doesn’t always give the right answer in practice. Yes, MCMC will get the right answer asymptotically, but it won’t always give a reasonably accurate answer in the finite time available to us. To guarantee good finite-time behavior we need our Markov transition to interact sufficiently well with the posterior distribution, a condition we call _geometric ergodicity_. In particular, effective sample size, Monte Carlo standard error, and the like are defined _only if geometric ergodicity holds_.

      Unfortunately this is rarely introduced to practitioners, who end up assuming that geometric ergodicity holds, which leaves them vulnerable to significant biases and other computational problems. Even more unfortunately, trying to establish geometric ergodicity theoretically is really, really, really hard, and it’s simply not feasible for the complex models that arise in applied analyses. What we really want is an empirical diagnostic with which we can identify failures of geometric ergodicity.

      Fortunately there is one general condition that is sensitive to geometric ergodicity. If geometric ergodicity holds for a given Markov transition and posterior, then after warmup _all chains_ started from _any initial condition_ will be distributed according to the same distribution. In other words, if we run multiple chains then we can check to see if their empirical distributions are consistent, which is exactly what Rhat does!
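      For reference, here is the quantity in question, in the notation of Gelman and Rubin (1992): with $M$ chains of $N$ post-warmup draws each, within-chain sample variances $s_m^2$, chain means $\bar\theta_m$, and grand mean $\bar\theta$,

      $$W = \frac{1}{M}\sum_{m=1}^{M} s_m^2, \qquad B = \frac{N}{M-1}\sum_{m=1}^{M}\left(\bar\theta_m - \bar\theta\right)^2,$$

      $$\widehat{\operatorname{var}}^{+} = \frac{N-1}{N}\,W + \frac{1}{N}\,B, \qquad \hat{R} = \sqrt{\widehat{\operatorname{var}}^{+}/W},$$

      so $\hat{R}$ approaches 1 as the chains’ empirical distributions come into agreement.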

      Geyer has an unfortunate habit of making arguments based on the assumption that geometric ergodicity holds (many of his papers became a lot clearer once I realized this). When geometric ergodicity holds and you have only one CPU, there is a mathematical argument that running one long chain is more computationally efficient than running multiple short chains.

      But the huge utility of running multiple chains, and checking Rhat, is about checking geometric ergodicity in the first place. In other words, Geyer is arguing about _performance assuming robustness_ whereas Andrew is arguing about _checking robustness in the first place_. In practice it’s the latter that is much, much more important. Besides, as Andrew notes, Geyer’s performance argument is invalidated when you have multiple cores and can run multiple long chains in the same time you could run one long chain.

      • “Fortunately there is one general condition that is sensitive to geometric ergodicity. If geometric ergodicity holds for a given Markov transition and posterior, then after warmup _all chains_ started from _any initial condition_ will be distributed according to the same distribution.”

        You don’t need geometric ergodicity for this to be true. Plain ergodicity is enough – in fact, the above is more-or-less the definition of plain (not necessarily geometric) ergodicity.

        Nor do you need geometric ergodicity to get the right answer, eventually. You do need geometric ergodicity to evaluate the accuracy of the answer using standard errors, the CLT, etc. And in practice, chains that aren’t geometrically ergodic usually aren’t too useful.

        But in any case, geometric ergodicity is not enough. It’s perfectly possible (indeed common) for a geometrically ergodic chain to converge slowly enough that you won’t get the right answer in any reasonable time. The reason you would use multiple chains is mostly the uncertainty about the quantitative speed of convergence, not about a qualitative aspect like geometric ergodicity.

        • Sorry, Radford, I’m not sure where you’re getting that. In an unbounded, continuous state space, geometric ergodicity is the minimal condition for a CLT to exist, and hence for effective sample size, standard errors, and the like to be well-defined (save for a few circumstances where polynomial ergodicity is sufficient, but those are rarely of practical interest). In particular, plain ergodicity makes no guarantee on _how_ MCMC estimators converge. This is discussed in the usual references, particularly Meyn and Tweedie, as well as Gareth and Jeff’s excellent papers.

          You are correct that geometric ergodicity does not guarantee rapid convergence/mixing (they are the same time scale in this ideal case), but what is important is that the CLT guaranteed by geometric ergodicity allows us to accurately measure the convergence empirically, admitting the robust use of MCMC estimators in practice.
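          (The CLT in question, stated minimally: for a geometrically ergodic chain $(\theta_n)$ targeting $\pi$ and a function $g$ with $\mathbb{E}_\pi|g|^{2+\delta} < \infty$ for some $\delta > 0$,

          $$\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^{N} g(\theta_n) - \mathbb{E}_\pi[g]\right) \xrightarrow{d} \mathcal{N}\!\left(0, \sigma_g^2\right), \qquad \sigma_g^2 = \operatorname{var}_\pi(g)\left(1 + 2\sum_{k=1}^{\infty}\rho_k\right),$$

          where $\rho_k$ is the lag-$k$ autocorrelation of $g(\theta_n)$ at stationarity. Effective sample size, $N/(1 + 2\sum_k \rho_k)$, and the Monte Carlo standard error, $\sigma_g/\sqrt{N}$, are defined through this limit.)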

        • If you look back at his comment he says “You **DO** need geometric ergodicity to evaluate the accuracy of the answer using standard errors, the CLT, etc.” (emphasis added), which agrees with your comment.

          He’s just saying that plain ergodicity is enough that *any* starting condition gets you to the same stationary distribution.

        • Thanks, Daniel, and sorry, Radford. I blame sleep deprivation from the last few days of cocktails and latent Gaussian models at a workshop.

          Yes, technically plain ergodicity is enough _asymptotically_. The point is that we don’t have infinite time to run Markov chains, so we need conditions like geometric ergodicity to be able to make any reasonable practical guarantees. Rhat will always go to 1 asymptotically, but in finite time it will usually reach 1 only for chains that are behaving sufficiently well that we can have confidence in our MCMC estimators and standard errors. And those particularly pathological exceptions where Rhat is consistent with 1 but geometric ergodicity doesn’t hold? That’s what the HMC diagnostics like divergences are for!

          And, as we both noted, this does not mean that you’ll get arbitrarily accurate answers within your finite computational budget, but it does mean that you can quantify the accuracy, which is the really important part.
