How not to compare the speed of Stan to something else

Someone’s wrong on the internet

And I have to do something about it.

Following on from Dan’s post on Barry Gibb statistical model evaluation, here’s an example inspired by a paper I found on Google Scholar searching for Stan citations. The paper (which there is no point in citing) concluded that JAGS was faster than Stan by several orders of magnitude for a simple hierarchical model.

First problem: thinning

The first problem in their evaluation was thinning. JAGS has a long autocorrelation time for these models because of the correlated parameters in the posterior (Gibbs isn’t rotation invariant because of its componentwise sampling). So what the author did was run JAGS and Stan for a gazillion iterations and thin.

Thinning always loses information. Don’t do it unless you can’t afford the memory not to. The problem here is that it loses more information when autocorrelation time is low. Because Stan’s effective sample size per iteration is much higher, thinning disproportionately penalizes its true speed.

The trick is to figure out how much time it takes to get the effective sample size you need and compare that. Each system should be allowed to get there on its own terms. For Stan, that might mean 2K iterations and for JAGS it might mean 100K (remember JAGS iterations can be much much faster than Stan’s). Whatever you do, don’t run Stan for 100K iterations and thin!
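
In R, that comparison might look something like this sketch (using coda for effective sample size; the draws objects and run times are placeholders, not real benchmarks):

library(coda)

# Compare samplers by effective samples per second, not iterations per second.
# draws: an mcmc object or matrix of posterior draws; seconds: wall time.
ess_per_second <- function(draws, seconds) {
  min(effectiveSize(draws)) / seconds  # the slowest-mixing parameter is the bottleneck
}

# e.g., Stan: 2K post-warmup draws in 40s; JAGS: 100K draws in 30s
# ess_per_second(stan_draws, 40)
# ess_per_second(jags_draws, 30)
# Whichever system reaches your target effective sample size in less time wins.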

Second problem: convergence

The authors used centered parameterizations.

Of course, we don’t want to compare two programs that get the wrong answer. The second problem was that the models weren’t going to fit well with a centered parameterization in either Stan or JAGS. Stan will try hard to fit it anyway, cranking down its step size to get into the neck of the funnel (the portion of the posterior with low hierarchical variance and all lower-level coefficients thus near the hierarchical mean). That’s why Stan takes longer. But really, neither system is going to robustly get the right answer with a centered parameterization.

As Michael Betancourt often says, speed comparisons only make sense when conditioned on the models actually fitting. For Stan and for JAGS in these models, you want to use a non-centered parameterization. The non-centered parameterization was originally developed for Gibbs and Metropolis, not HMC. Matt Hoffman rediscovered it in the context of HMC, so we used to call it the “Matt trick” before we learned its name and history for Gibbs and Metropolis.
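
For a simple hierarchical normal model, the reparameterization looks something like this sketch, written as a Stan program held in an R string (the model and variable names are illustrative, not from the paper in question):

noncentered_code <- "
data {
  int<lower=1> J;            // number of groups
  vector[J] y;               // one observation per group
  vector<lower=0>[J] sigma;  // known observation noise
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;       // standardized group effects
}
transformed parameters {
  // implies theta ~ normal(mu, tau), but the sampler explores theta_raw
  vector[J] theta = mu + tau * theta_raw;
}
model {
  theta_raw ~ normal(0, 1);  // non-centered: no funnel between tau and theta
  y ~ normal(theta, sigma);
}
"
# The centered version would instead declare theta as a parameter and write
# theta ~ normal(mu, tau), recreating the funnel.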

With parameterizations that work and no thinning, the results should be rather different: Stan will do relatively better. It really helps with Stan to have models that are well specified for the data; when the models can’t be made to fit the data well, Stan really struggles to explore the posterior. This is a pure example of Andrew’s folk theorem in action.

Third problem: inefficient coding

Both programs need to be coded efficiently in order to compare the speed of the underlying systems (unless you’re making a meta-point about what “ordinary” users who don’t read the manual or ask on the forums do with a system, which can be a valid point to make).

It looks like you can vectorize JAGS models now. I don’t know how much that helps JAGS with speed. When Matt Hoffman and I were looking at JAGS before developing Stan, he managed to implement a vectorized Bernoulli-logit that was something like an order of magnitude faster than looping. Stan will also take advantage of shared computations, dropping constants, etc. in its vectorized distributions.
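
In Stan model code, the contrast looks like this sketch (model-block fragments held in R strings; here x is assumed to be an N-by-K predictor matrix and beta a K-vector):

# Looped form: one density evaluation per observation.
looped_code <- "
  for (n in 1:N)
    y[n] ~ bernoulli_logit(x[n] * beta);
"
# Vectorized form: a single call that shares computation and drops constants.
vectorized_code <- "
  y ~ bernoulli_logit(x * beta);
"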

Fourth problem: poor choice of priors

The examples use the diffuse inverse-gamma priors popularized by WinBUGS (and perpetuated by JAGS). Andrew’s written about their dangers at length. In summary, they pull estimates into the tails and wind up being much stronger than their users expect.

Inverse gammas are popular for variance parameters because they’re conjugate. That’s the ideal situation for JAGS, because JAGS can very efficiently exploit the conjugacy when doing Gibbs sampling. Try comparing the speed of JAGS with non-conjugate priors and you’ll see very different results.

Placing priors on variance or precision is tricky because their units are the square and inverse square of the units of the variate. It’s much easier to think in terms of scale (standard deviation) because it’s in the same units as the variate.
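
To see how strong InvGamma(0.001, 0.001) actually is as a prior on a variance V, note that V = 1/X with X ~ Gamma(0.001, 0.001), so P(V > v) = P(X < 1/v); a quick check in R:

# Tail mass of the "diffuse" inverse-gamma prior on a variance V.
sapply(c(1, 1e6, 1e100), function(v) pgamma(1 / v, shape = 0.001, rate = 0.001))
# roughly 0.99, 0.98, 0.79: most of the prior mass sits at astronomically
# large variances. A half-normal or half-Cauchy prior on the scale
# (standard deviation) is an easier-to-reason-about alternative.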

A final issue: comparing easy models

The models being used for comparison are very easy, especially for JAGS. They’re fairly low dimensional and use simple conjugate priors.

We recommend comparing performance on simulated data that can grow in both parameter size and data size. You’ll find Stan tends to scale fairly well.
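
A sketch of such a generator for a simple hierarchical normal model (all names and default values are illustrative):

# Simulate grouped data where the number of groups J (parameter count) and
# observations per group N (data size) can grow independently.
simulate_hier <- function(J, N, mu = 0, tau = 1, sigma = 2) {
  theta <- rnorm(J, mu, tau)  # group-level effects
  data.frame(group = rep(seq_len(J), each = N),
             y = rnorm(J * N, mean = rep(theta, each = N), sd = sigma))
}
# Then time each system to a fixed effective sample size on
# simulate_hier(10, 50), simulate_hier(100, 50), simulate_hier(1000, 500), ...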

Model complexity is where Stan really shines, yet these comparisons tend to stick to simple models.

ShinyStan v2.0.0

For those of you not familiar with ShinyStan, it is a graphical user interface for exploring Stan models (and more generally MCMC output from any software). For context, here’s the post on this blog first introducing ShinyStan (formerly shinyStan) from earlier this year.


ShinyStan v2.0.0 released

ShinyStan v2.0.0 is now available on CRAN. This is a major update with a new look and a lot of new features. It also has a new(ish) name: ShinyStan is the app/GUI and shinystan the R package (both had formerly been shinyStan for some reason apparently not important enough for me to remember). Like earlier versions, this version has enhanced functionality for Stan models but is compatible with MCMC output from other software packages too.

You can install the new version from CRAN like any other package:

install.packages("shinystan")

If you prefer a version with a few minor typos fixed, you can install from GitHub using the devtools package:

devtools::install_github("stan-dev/shinystan", build_vignettes = TRUE)

(Note: after installing the new version and checking that it works we recommend removing the old one by running remove.packages("shinyStan").)

If you install the package and want to try it out without having to first fit a model, you can launch the app using the preloaded demo model:

library(shinystan)
launch_shinystan_demo()

Notes

This update contains a lot of changes: new features, greater UI stability, and an entirely new look. Some release notes can be found on GitHub and there are also some instructions for getting started on the ShinyStan wiki page. Here are two highlights:

  • The new interactive diagnostic plots for Hamiltonian Monte Carlo. In particular, these are designed for models fit with Stan using NUTS (the No-U-Turn Sampler).

    [Screenshots: the new HMC/NUTS diagnostics plots]

  • The deploy_shinystan function, which lets you easily deploy ShinyStan apps for your models to RStudio’s ShinyApps hosting service. Each of your apps (i.e. each of your models) will have a unique URL. To use this feature please also install the shinyapps package: devtools::install_github("rstudio/shinyapps"). A minimal usage sketch follows below.
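
    A sketch of the workflow (the model object and appName value are placeholders, and the argument names are my assumption; check the package documentation for the exact signature):

    library(shinystan)
    sso <- as.shinystan(my_stanfit)              # shinystan object from a stanfit
    deploy_shinystan(sso, appName = "my-model")  # push the app to ShinyApps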

The plan is to release a minor update with bug fixes and other minor tweaks in a month or so. So if you find anything we should fix or change (or if you have any other suggestions) we’d appreciate the feedback.

Parallel JAGS RNGs

As a matter of convention, we usually run 3 or 4 chains in JAGS. By default, this gives rise to chains that draw samples from 3 or 4 distinct pseudorandom number generators. I didn’t go and check whether it does things 111,222,333 or 123,123,123, but in any event the “parallel chains” in JAGS are samples drawn from distinct RNGs computed on a single processor core.

But we all have multiple cores now, or we’re computing on a cluster or the cloud! So the behavior we’d like from rjags is to use the foreach package with each JAGS chain using a parallel-safe RNG. The default behavior with n.chains=1 will be that each parallel instance uses .RNG.name[1], the Wichmann-Hill RNG.

JAGS 2.2.0 includes a new lecuyer module (along with the glm module, which everyone should probably always use and which doesn’t have many undocumented tricks that I know of). But lecuyer is completely undocumented! I tried .RNG.name="lecuyer::Lecuyer", .RNG.name="lecuyer::lecuyer", and .RNG.name="lecuyer::LEcuyer", all to no avail. It ought to be .RNG.name="lecuyer::Lecuyer" to be consistent with the other .RNG.name values! I looked around in the source to find where it checks its name from the inits, and discovered that in fact it is

.RNG.name="lecuyer::RngStream"

So here’s how I set up 4 chains now. In sketch form, using foreach with doParallel (the package choice, model file, data, and monitored parameter are placeholders):
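
# A minimal sketch: 4 JAGS chains in parallel, each using the
# lecuyer::RngStream generator. Model file, data, and parameter names
# are placeholders.
library(rjags)
library(foreach)
library(doParallel)

load.module("lecuyer")        # provides lecuyer::RngStream
registerDoParallel(cores = 4)

chains <- foreach(i = 1:4) %dopar% {
  library(rjags)              # each worker loads rjags and the module
  load.module("lecuyer")
  inits <- list(.RNG.name = "lecuyer::RngStream",
                .RNG.seed = i)  # distinct seed per chain; newer rjags can
                                # generate non-overlapping streams with
                                # parallel.seeds()
  m <- jags.model("model.bug", data = jags_data,
                  inits = inits, n.chains = 1, quiet = TRUE)
  update(m, 1000)             # burn-in
  coda.samples(m, variable.names = c("theta"), n.iter = 10000)
}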