Stan uses Nuts!

We interrupt our usual program of Ed Wegman, Gregg Easterbrook, and Niall Ferguson mockery to deliver a serious update on our statistical computing project.

Stan (“Sampling Through Adaptive Neighborhoods”) is our new C++ program (written mostly by Bob Carpenter) that draws samples from Bayesian models. Stan can take different sorts of inputs: you can write the model in a Bugs-like syntax and it goes from there, or you can write the log-posterior directly as a C++ function.
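To give a flavor of that second option, here is a rough sketch of what a templated C++ log-posterior for a simple normal model might look like. This is purely illustrative (the function name and setup are made up for this example, not Stan’s actual interface); the point is that templated code like this can be instantiated with an automatic-differentiation scalar type so that gradients come essentially for free.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Illustrative sketch only, not Stan's actual interface: a templated log
    // posterior for y_i ~ Normal(mu, sigma), dropping additive constants.
    // With T = double this just evaluates the log density; with T set to an
    // automatic-differentiation scalar type, the same code yields gradients.
    template <typename T>
    T normal_log_posterior(const T& mu, const T& sigma,
                           const std::vector<double>& y) {
      using std::log;
      T lp = 0.0;
      for (std::size_t i = 0; i < y.size(); ++i) {
        T z = (y[i] - mu) / sigma;
        lp += -0.5 * z * z - log(sigma);
      }
      return lp;
    }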

Most of the computation is done using Hamiltonian Monte Carlo. HMC requires some tuning, so Matt Hoffman up and wrote a new algorithm, Nuts (the “No-U-Turn Sampler”), which adaptively tunes HMC. In many settings, Nuts is actually more computationally efficient than optimally tuned static HMC!
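For readers who haven’t run HMC themselves, the tuning in question is the leapfrog step size and the number of leapfrog steps per iteration. The sketch below is generic C++ (not Stan’s code) just to show where those two knobs enter; Nuts chooses the trajectory length adaptively, so you no longer have to pick the number of steps by hand.

    #include <cstddef>
    #include <vector>

    // Generic leapfrog integrator sketch (not Stan code). The two knobs that
    // plain HMC makes you tune by hand are the step size epsilon and the
    // number of steps num_steps; Nuts removes the need to tune the latter.
    // grad_log_p is any callable returning the gradient of the log posterior.
    template <typename Grad>
    void leapfrog(std::vector<double>& theta, std::vector<double>& r,
                  double epsilon, int num_steps, Grad grad_log_p) {
      std::vector<double> g = grad_log_p(theta);
      for (int s = 0; s < num_steps; ++s) {
        for (std::size_t i = 0; i < r.size(); ++i)
          r[i] += 0.5 * epsilon * g[i];      // half step for momentum
        for (std::size_t i = 0; i < theta.size(); ++i)
          theta[i] += epsilon * r[i];        // full step for position
        g = grad_log_p(theta);
        for (std::size_t i = 0; i < r.size(); ++i)
          r[i] += 0.5 * epsilon * g[i];      // second half step for momentum
      }
    }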

When the Nuts paper appeared on arXiv, Christian Robert noticed it and had some reactions.

In response to Xian’s comments, Matt writes:

Christian writes:

I wonder about the computing time (and the “unacceptably large amount of memory”, p.12) required by the doubling procedure: 2^j is growing fast with j! (If my intuition is right, the computing time should increase rather quickly with the dimension. And I do not get the argument within the paper that the costly part is the gradient computation: it seems to me the gradient must be computed for all of the 2^j points.)

2^j does grow quickly with j, but so does the length of the trajectory, and it’s impossible to run a Hamiltonian trajectory for a seriously long time without making a U-turn and stopping the doubling process. (Just like it’s impossible to throw a ball out of an infinitely deep pit with a finite amount of energy.) So j never gets very big. As far as memory goes, the “naive” implementation (algorithm 2) has to store all O(2^j) states it visits, but the more sophisticated implementation (algorithm 3) only needs to store O(j) states. Finally, the gradient computations dominate precisely because we must compute 2^j of them—NUTS introduces O(j 2^j) non-gradient overhead, which is usually trivial compared to O(2^j) gradient computations.

In summary, if we assume that NUTS generates trajectories that are about the optimal length of the corresponding HMC trajectory (which is hard to prove, but empirically seems not too far off), then NUTS adds a (usually negligible) computational overhead of order O(j 2^j) and an (acceptable for non-huge problems) memory overhead of order O(j) compared with HMC. These costs scale linearly with dimension, like HMC’s costs.

Trajectory lengths will generally increase faster than linearly with dimension, i.e., j will grow faster than log_2(number_of_dimensions). But the optimal trajectory length for HMC does as well, and in high dimensions HMC is pretty much the best we’ve got. (Unless you can exploit problem structure in some really really clever ways [or for some specific models with lots of independence structure–ed.].)
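[To make Matt’s stopping rule a bit more concrete: the doubling continues only while the two ends of the trajectory are still moving apart. Here is a rough sketch of that check, with illustrative names rather than the paper’s or Stan’s actual code. –ed.]

    #include <cstddef>
    #include <vector>

    // Illustrative sketch of the no-U-turn check (not the paper's or Stan's
    // actual code): doubling continues only while the ends of the trajectory
    // are still moving apart, i.e. while the dot product of the end-to-end
    // displacement with the momentum at each end stays non-negative.
    bool no_u_turn(const std::vector<double>& theta_minus,
                   const std::vector<double>& theta_plus,
                   const std::vector<double>& r_minus,
                   const std::vector<double>& r_plus) {
      double dot_minus = 0.0, dot_plus = 0.0;
      for (std::size_t i = 0; i < theta_minus.size(); ++i) {
        double d = theta_plus[i] - theta_minus[i];
        dot_minus += d * r_minus[i];
        dot_plus  += d * r_plus[i];
      }
      return dot_minus >= 0.0 && dot_plus >= 0.0;
    }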

Christian writes:

A final grumble: that the code is “only” available in the proprietary language Matlab!

Apologies to non-Matlab users! There’s also a C++ implementation in Stan, but it’s not (yet) well documented. I guess I [Matt] should also write Python and R implementations, although my R expertise doesn’t extend very far beyond what’s needed to use ggplot2. [Matt doesn’t need to translate into Python and R–anyone who wants to do that should be able to do so, easily enough. And in any case, Stan will have the fast C++ version. Matt was just using Matlab as a convenient platform for developing Nuts.–ed.]

See also Bob’s comments which he posted directly on Xian’s blog.

Stan’s nearly ready for release, and we’re also working on the paper. You’ll be able to run it from R just like you can run Bugs or Jags. Compared to Bugs and Jags, Stan should be faster (especially for big models) and should be able to fit a wider range of models (for example, varying-intercept, varying-slope multilevel models with multiple coefficients per group). It’s all open-source and others should feel free to steal the good parts for their own use. The work is partly funded by a U.S. Department of Energy grant.

29 thoughts on “Stan uses Nuts!”

  1. I agree with Christian’s sentiment. Can the Matlab code be converted to Octave? If you use only basic Matlab, that often works.

  2. I’m quite excited with the opportunity to use it with R…
    I have several models too complex to be fitted with Bugs and I guess now I’ll be able to fit them…
    Hope you release it soon…

    Manoel

  3. @Andrew: Just to clarify who wrote what, Matt wrote the really hard mathematical bits like NUTS and dual averaging optimization. I wrote the compiler from a BUGS-like language, all the transforms and I/O, and the basic distribution library. We collaborated on the design of the auto-dif; Matt wrote the first version, then I filled it out using an OO design.

    @Dave Backus: I have no idea about Matt’s MATLAB running in Octave, but we have a version in C++. You can either write a templated log probability density function and we can algorithmically differentiate for gradients, or you can write your own gradient function by hand. You can also write BUGS-like models (though statically typed) that are then compiled into C++ programs.

    @Corey: I’m afraid it won’t be possible to code the log posterior in R. It would’ve required us to write a compiler for pretty much the entire R language! I am working on the R interface so you can use the BUGS-like language through R in the same way as R2WinBUGS or R2jags.

  4. sounds very useful and might help me with a model that doesn’t really work when I use JAGS. Can you comment on how much work it is to move from BUGS code to Stan? So I have my BUGS model and want to switch to Stan. What steps do I need to take? I think that would be pretty helpful for all those people coming from BUGS….

    Thanks!

  5. Doesn’t the condition of equation (8) violate reversibility? This early stopping of simulation was suggested by Neal (J Comp Physics 1994) where he noted that to satisfy reversibility you need to also stop if the absolute change in energy exceeds some threshold. Also, the acceptance of states from within the entire tree seems very related to Neal’s windowed acceptance procedure with the additional advantage that you are able to automatically select the length of simulation.

    Regardless, very interesting and great work!

    • The condition in equation (8) would violate reversibility if we just proposed the point at which the condition was satisfied, as would the condition in equation (9). The trick is that we don’t consider as candidates for the next state any of the points in the subtree in which either condition is violated, which lets us preserve detailed balance.

      Choosing a state from a set of candidates is indeed related to Neal’s windowed acceptance procedure. As we mention in the discussion, it seems possible that some of the performance gains we see are because NUTS, like windowed HMC, is able to get away with a larger step size than vanilla HMC.

  6. @Dirk: I should’ve mentioned that we’re using Rcpp for the R/C++ interface for Stan.

    Using a combination of your clear documentation and help from R-whiz Michael Malecki, we were able to get the stubs up and running for compiling a Stan model from R, dynamically linking and then calling the Stan C++ code, and also implementing data interfaces so that Stan’s C++ can read from and write to R.

    If I have trouble fleshing out the rest of the app, I’ll definitely give you a shout through the mailing list.

    Thanks for the great software.

    • Sweet, so I didn’t even have to play travelling salesman and advertise it here! Nice. Looking forward to seeing it on CRAN, and of course for the beginning of a suite of pithy comments by Andrew about all the things wrong with Rcpp…

      • Andrew doesn’t operate at the C++ level — so you’re in the clear there.

        I’ll post a blog entry about all the tools and software packages we used for Stan, as well as citing them with the software itself. The short story is: emacs and vi for editing, bash shell for running, gnu make for builds, Instruments (Mac) for profiling, gdb for debugging, and both g++ and clang++ for compiling. Software packages we use include googletest for unit tests, Eigen for matrices and linear algebra, and Boost for template metaprogramming, error policy configuration, random number generation, parsing, and some special functions and cumulative densities. For doc we’re using doxygen. For hosting, Google Code. For continuous integration, Jenkins. In R, we have dependencies on R itself and Rcpp. We’ve been using some variogram estimators of effective sample size written in R by Kenny Shirley (paper with Andrew forthcoming), and have used ggplot2 and Coda to analyze the output, though there are no software dependencies and we’re not redistributing any of these as part of Stan (except perhaps the ESS estimators).

        • why both g++ and clang++? Seems like you’d pick one or the other…

          I was interested in clang++, but last I checked it was missing a lot of the C++11 stuff I like.

  7. Exciting stuff. It would be nice if in time a ‘dummy manual’ would become available. Something like you did in your book with Jennifer Hill or in BDA where you provide some basic models and implementation in Bugs & R.

    PS: Now I finally know exactly what Stan is… I had to wait for a while.

  8. Not absolutely on theme, maybe, but close enough, and certainly relevant for STAN…

    I followed up on the NUTS work recently, and I was left wondering whether you had also considered Riemannian Hamiltonian MC, since this seems – at least to me – to be the most principled model for cleaning up the loose-ends in standard HMC. I assume that you have looked at it (in fact I may have heard of it first on this blog) and have an opinion.

  9. Ha, and straight after posting, I go and look at the referenced paper, and I see that there is a citation of Girolami and Calderhead. I shall go look there.

    • Yup, the RMHMC paper has some very exciting ideas in it, but it deals more with the question of step sizes (and preconditioning) and doesn’t directly address the question of how to choose trajectory lengths. Happily, the two algorithms extend HMC in nearly orthogonal ways, so there’s no reason they couldn’t be profitably combined.

  10. By the way, I’m not sure where I came up with that O(j 2^j) business. It should just be O(2^j)—it’s not like we’re sorting the trajectories or anything crazy like that. (Thanks to Bob for pointing this out.)

  11. This looks great. I’ve been using HMC fairly successfully, but the tweaking can be difficult. I’m looking forward to trying this.

  12. @Everyone: I’ve just started writing a detailed manual — I didn’t want to get too far before the language itself stabilized.

    We’re translating all three volumes of BUGS example models. If you look on the Google code site for Stan, there are already a dozen or so of the BUGS models in place (under src/models/bugs_examples). The main difference is that everything in Stan is typed. Variable types include integer, real, bounded (below, above, or both) reals, vectors, row vectors, and matrices, with subtypes for simplex vectors (for multinomial parameters), ordered positive vectors (for ordinal logit cutpoints), and correlation/covariance matrices. We also provide arbitrary-dimensional arrays of any of these basic types. Further, variables are sorted into supplied data, derived data (values determined only based on supplied data), parameters (what gets sampled and returned), and derived parameters. The model code is order-specific as to the derived parameters, which allows more flexibility in using local variables in loops. We also supply more functions than BUGS or JAGS and will eventually have various parameterizations of the different densities (for now, we’re using the ones in Gelman et al.’s Bayesian Data Analysis book).

    • So basically it’s faster than JAGS and has more features than JAGS and uses a BUGS-style model description language and has an R interface. So …. should I just pack up shop now?

      • Martyn,

        I’m hoping that Stan will free both of us up to spend more time actually doing statistics. Realistically, though, it will probably just push forward the frontier of models we can easily compute, leading us to work on methods for improving the computation for new, larger, more complicated models.

        • Stan should improve in speed soon as we implement various improvements to the algorithm. But we have no “official” statement on speed yet. Wait for the (unfinished) paper!

      • Those “new, larger, more complicated models” Andrew points to are always going to be there.

        In contrast to CS Peirce pointing out that “It’s a good thing we die, otherwise we would eventually discover that anything we ever thought we understood, was wrong” …

        Until you die – you never need to worry about it not being worthwhile to develop “new, larger, more complicated models” ;-)

        The unofficial status, as of today, is that JAGS is still around an order of magnitude faster than Stan on the very first BUGS model, Rats. There’s another order of magnitude variation depending on how we evaluate, because of initialization and the difference between the speed of convergence versus the speed of generating effective samples (in the effective sample size sense).

        We (and by that I mean Matt Hoffman) are indeed working on some further tuning optimizations for NUTS. Andrew is hoping we can close the gap with JAGS on simple examples.

        Stan will also scale much better than JAGS or BUGS in terms of memory required for a given number of data items or parameters, especially for more data (our parameters are still pretty heavy due to the requirements of algorithmic differentiation).

        On the downside, it’ll be heavier overall (our distribution’s going to be around 70MB or so, mainly due to the dependence on Boost). It’ll also require a compilation step for the models, which is another tradeoff. For instance, it takes about 30 seconds to compile a simple model with -O0 (no optimization), and two or three times as long with -O3 (high optimization), but the resulting model runs MUCH faster with high optimization.

        The expressiveness of the models will also be slightly different. We don’t allow mixed vectors of parameters and data, for instance. And we require type declarations. Some truncated densities may be difficult (or even impossible) to code (at least in our first release), due to the way we’re transforming variables to unbounded ranges (we’ve come up with some workarounds, but don’t know if they’ll be fully general).

        And we don’t really have any advantage at all in discrete parameters — we’ll be using the same kind of technique as JAGS to sample them.

        Andrew teases us by saying the language for Stan models looks like JAGS with semicolons. (The reason for the semicolons is that I wanted the parser to be insensitive to whitespace, which JAGS and R are not; that is, in those languages, where you put a newline matters.)
