Skip to content
 

Stan is fast

10,000 iterations for 4 chains on the (precompiled) efficiently-parameterized 8-schools model:

> date ()
[1] "Thu Aug 30 22:12:53 2012"
> fit3 <- stan (fit=fit2, data = schools_dat, iter = 1e4, n_chains = 4)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Iteration: 10000 / 10000 [100%] (Sampling)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Iteration: 10000 / 10000 [100%] (Sampling)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Iteration: 10000 / 10000 [100%] (Sampling)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Iteration: 10000 / 10000 [100%] (Sampling)

> date ()
[1] "Thu Aug 30 22:12:55 2012"
> print (fit3)
Inference for Stan model: anon_model.
4 chains: each with iter=10000; warmup=5000; thin=1; 10000 iterations saved.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu 8.0 0.1 5.1 -2.0 4.7 8.0 11.3 18.4 4032 1
tau 6.7 0.1 5.6 0.3 2.5 5.4 9.3 21.2 2958 1
eta[1] 0.4 0.0 0.9 -1.5 -0.2 0.4 1.0 2.2 9691 1
eta[2] 0.0 0.0 0.9 -1.7 -0.6 0.0 0.6 1.7 9565 1
eta[3] -0.2 0.0 0.9 -2.0 -0.8 -0.2 0.4 1.7 10990 1
eta[4] 0.0 0.0 0.9 -1.8 -0.6 0.0 0.5 1.7 10719 1
eta[5] -0.3 0.0 0.9 -2.0 -0.9 -0.4 0.2 1.4 8876 1
eta[6] -0.2 0.0 0.9 -1.9 -0.8 -0.2 0.3 1.6 10369 1
eta[7] 0.3 0.0 0.9 -1.5 -0.2 0.4 0.9 2.1 10303 1
eta[8] 0.1 0.0 0.9 -1.8 -0.6 0.1 0.7 1.9 10428 1
theta[1] 11.6 0.1 8.5 -2.2 6.1 10.5 15.8 32.3 5414 1
theta[2] 8.0 0.1 6.3 -4.5 4.0 7.9 11.9 20.8 9549 1
theta[3] 6.2 0.1 7.9 -12.4 2.1 6.8 11.0 20.7 6547 1
theta[4] 7.7 0.1 6.4 -5.3 3.8 7.8 11.7 20.8 9097 1
theta[5] 5.1 0.1 6.3 -8.8 1.4 5.7 9.5 16.3 8700 1
theta[6] 6.2 0.1 6.8 -8.6 2.2 6.5 10.5 19.0 8518 1
theta[7] 10.8 0.1 6.8 -1.0 6.2 10.2 14.8 26.1 7236 1
theta[8] 8.6 0.1 7.8 -6.2 4.0 8.3 12.9 25.3 7375 1
lp__ -4.8 0.0 2.7 -10.6 -6.4 -4.6 -3.0 -0.3 3580 1

Sample were drawn using NUTS2 at Thu Aug 30 22:12:55 2012.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Check the timings. That took two seconds! And, as you can see from the R-hats and effective sample sizes, 10,000 iterations is overkill here. Damn that’s fast.

But, hey, 8 data points is pretty small. Let’s try the 800-schools problem. I’ll first simulate 800 schools worth of data, rerun, and see what happens.

It took 10 seconds. That’s right, 4 chains of 1000 iterations (enough for convergence) for the 800 schools problem, in 10 seconds.

Not bad, not bad at all.

Well, it’s pretty horrible if you’re planning to do something with a billion data points. But for the sorts of problems where we do full Bayes, yes, that’s just fine.

We’ll do more careful speed comparisons later. For now, let me just point out that the 8 schools is not the ideal model to show the strengths of Stan vs. Bugs. The 8 schools model is conditionally conjugate and so Gibbs can work efficiently there.

P.S. Just for laffs, I tried the (nonconjugate) Student-t model (or, as Stan puts it, student_t) with no added parameterizations, I just replaced normal with student_t with 4 df. The runs took 3 seconds for the 10,000 iterations of the 8 schools and 34 seconds for the 1000 iterations of the 800 schools. But I think the reason it took a bit longer is not the nonconjugacy but just that we haven’t vectorized the student_t model yet. Once it’s vectorized, it should be superfast too. That’s just a small implementation detail, nor requiring any tricks or changes to the algorithm.

P.P.S. These models did take 12 seconds each to compile. But that just needs to be done once per model. Once it’s compiled, you can fit it immediately on new data without needing to recompile.

7 Comments

  1. Dan says:

    So how fast does it run in BUGS? Forgive me if this is common knowledge. ;)

  2. G says:

    I’m currently running mixed effects models in R through (b)lmer with 200,000 to 500,000 observations of public opinion data. The way that my current logistic regressions are specified leads to convergence times of roughly 20-30 minutes, although I can decrease that by using fewer random effects. I need to run multiple models each day as new data comes in, which, even with multiple cores running separate models, takes up a lot of time. My understanding is that Bayesian models generally run slower. In your book with Jennifer Hill, for example, you write that it’s best to fit models with lmer before running models in BUGS for the sake of speed. Is this still the case? If Stan is quicker than lmer fitting relatively large amounts of data and 5-8 or so random effects with 4-6 categories, I’d gladly switch over to Stan.

    • Andrew says:

      G,

      Right now, lmer/glmer/blmer/bglmer is faster than Stan for logistic regression. There’s room for speed improvements in both programs for large models and datasets, making use of hierarchical structure. One of our plans, one Stan is up and running, has been to improve computation for hierarchical logistic regression beyond what we have now.

  3. G says:

    Great. Thanks for the reply Andrew. I’ll be sure to follow Stan’s progress. I think I’ll get Stan up and running for other less time-sensitive work in any case.

    • Andrew says:

      It’s so much fun to just be able to respond honestly to a question. In the world of journal articles and grant proposals, everything’s full of B.S., there’s a huge pressure to exaggerate progress and minimize difficulties. It feels great to just be able to say what we can do and what we can’t. Bob did a good job of that in writing the manual, too.

  4. Whit Armstrong says:

    Andrew, Stan looks very interesting.

    I was curious to see how the speed compares to my own bugs/jags substitute (rcppbugs) so I put this little script together:

    https://github.com/armstrtw/rcppbugs.examples/blob/master/eight.schools.stan.comparison.r

    This run is on debian linux with 100k iterations and 100k burn:

    running stan pre-compiled model (from previous fit)…
    SAMPLING FOR MODEL ‘schools_code’ NOW (CHAIN 1).


    Iteration: 200000 / 200000 [100%] (Sampling)

    user system elapsed
    3.200 0.036 3.247
    running cppbugs model…
    user system elapsed
    3.132 0.028 3.167

    So, putting aside the compile step for stan (rcppbugs models do not need to be compiled), the run times look roughly equivalent.

    I was also curious how the c++ version of cppbugs would hold up against the pre-compiled stan model, so I threw this model together:

    https://github.com/armstrtw/CppBugs/blob/master/test/eight.schools.stan.cpp

    Which runs in 0.65 seconds on the same machine (100k interations, 100k burn).

    CppBugs has a random walk sampler, so I would expect it to be somewhat faster, since it doesn’t have to do nearly as much work for the proposal distribution as Stan does.

    Perhaps there is some room for collaboration on our projects in the future.

    Cheers,
    Whit

    • Andrew says:

      Whit:

      Cool—thanks for doing this comparison! Two quick things:
      1. Relevant comparison is not time per iteration but time per effective sample size.
      2. Stan should scale well; its performance should improve relative to Metrop as the size and complexity of the problem increases.