On the stan-users list, Richard McElreath reports:

With 2.4 out, I ran a quick test of how much speedup I could get by changing my old non-vectorized multi_normal sampling to the new vectorized form. I get a 40% time savings, without even trying hard. This is much better than I expected.

Timings with vectorized multi_normal:

# Elapsed Time: 96.8564 seconds (Warm-up)

# 87.8308 seconds (Sampling)

# 184.687 seconds (Total)

Timings without vectorized multi_normal:

# Elapsed Time: 168.544 seconds (Warm-up)

# 147.83 seconds (Sampling)

# 316.375 seconds (Total)

These timings are for a Gaussian mixed model with three random effects, 5000 cases, and 500 groups. The only difference between the two models was this line in the vectorized version:

vary ~ multi_normal( rep_vector(0,3) , SIGMA );

versus this explicit loop in the non-vectorized version:

for ( j in 1:N_id ) vary[j] ~ multi_normal( rep_vector(0,3) , SIGMA );

Really nice stuff.

And it’s only gonna be getting better.

We’ve also been talking about a Julia interface to go along with the R and Python interfaces.

What’s going on under the hood is that the vectorized version need only factor the matrix SIGMA once, and it can then reuse the result for each of the 5000 data points.

As I pointed out in a reply to the list, this is only a 3-dimensional model. The cost of the factorization is cubic in the dimensionality, so the savings should be even larger in higher dimensions, where the factorization is a larger fraction of the overall time.
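As a back-of-the-envelope sketch (not an exact operation count), with d the dimension and N the number of data points, the Cholesky factorization costs on the order of d^3/3 flops while each quadratic-form solve against the factor is only O(d^2), so the two versions scale roughly as:

```latex
\underbrace{N \cdot \tfrac{d^3}{3}}_{\text{loop: factor } \Sigma \text{ for every point}}
\qquad \text{vs.} \qquad
\underbrace{\tfrac{d^3}{3}}_{\text{factor once}}
\; + \;
\underbrace{N \cdot O(d^2)}_{\text{solves against the factor}}
```

With d = 3 the cubic term is small, which is why the 40% savings here understates what larger d should buy.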

To concur with Andrew, we are nowhere near maximum headroom in speed for these algorithms. The next logical step would be to compose the autodiff of the multivariate normal with the factorization of the matrix and do everything in one go. See, for example, the section on derivatives of multivariate normals in The Matrix Cookbook — they do the derivatives we need in section 8.4.2 in the general case of a mixture of multivariate Gaussians (which we could also implement).
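For reference, the derivatives in question are standard results of the sort the Cookbook collects. For the log density $\log p(x) = \log \mathcal{N}(x \mid \mu, \Sigma)$:

```latex
\frac{\partial \log p}{\partial \mu}
  = \Sigma^{-1}(x - \mu),
\qquad
\frac{\partial \log p}{\partial \Sigma}
  = \frac{1}{2}\left( \Sigma^{-1}(x - \mu)(x - \mu)^{\top}\Sigma^{-1} - \Sigma^{-1} \right).
```

Composing these with the factorization of $\Sigma$ means the autodiff pass can share the same Cholesky factor the density evaluation already computed.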

And as Ben is wont to point out, there are more efficient ways to express this model. We really need to do a blog post on the LKJ prior that we're now advocating for correlation matrices. It's not only more efficient than putting a prior on a full covariance matrix; it's also easier to understand, since a single parameter controls how strongly the prior concentrates mass near the unit correlation matrix.
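Pending that post, here's a sketch of the Cholesky-factored parameterization for the model above. The names (N_id, the 3 random effects) are carried over from Richard's model; the LKJ shape of 2 and the Cauchy scale prior are illustrative choices, not recommendations:

```stan
parameters {
  cholesky_factor_corr[3] L_Omega;  // Cholesky factor of the correlation matrix
  vector<lower=0>[3] tau;           // per-effect scales
  vector[3] vary[N_id];             // varying effects
}
model {
  L_Omega ~ lkj_corr_cholesky(2);   // shape > 1 pulls mass toward the identity
  tau ~ cauchy(0, 2.5);
  vary ~ multi_normal_cholesky(rep_vector(0, 3),
                               diag_pre_multiply(tau, L_Omega));
}
```

Because multi_normal_cholesky takes the factor directly, the density never has to factor a covariance matrix at all, vectorized or not.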

We’re about to reorganize the manual and provide a dedicated chapter on recommended priors for both Bayesian and maximum likelihood estimation. The current section on multivariate priors isn’t even our current best implementation practice — it’s hard to keep all the examples in the manual up to date.

Bob, I’m confused about this. The second version of the code is faster? So writing an explicit loop is faster than having your Stan compiler write the loop? That isn’t the kind of thing I would hope to hear.

Isn’t the first version the “vectorized” version? Doing all of “vary” at once?

The versions line up with the times. The first one is faster.

The thing to remember, though, is that Stan is not like an interpreted language such as R or BUGS or Python. Loops themselves are not slow; they all get expanded into C++ and then compiled. The speedup from vectorization isn’t because we can write the loops faster. It comes from two things: (1) through template metaprogramming, Stan only needs to factor the covariance matrix once rather than once per call, and (2) we can replace 5000 nodes in the expression tree with a single node, which cuts the number of virtual function calls to 1 and improves memory locality.

Some functions, such as elementwise products, can’t be vectorized efficiently for autodiff, because the loop form has no redundant derivative calculations or virtual function calls for the vectorized form to eliminate. And often there’s no speedup for data variables but there is for parameters, as with the sum function: on parameters, sum() gives you advantage (2) but not advantage (1).
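Schematically, the sum example looks like this in Stan code (alpha here is a hypothetical parameter vector of length N, not from Richard's model):

```stan
// one vectorized node in the expression graph:
increment_log_prob(sum(alpha));

// versus N nodes and N virtual function calls:
for (n in 1:N)
  increment_log_prob(alpha[n]);
```

Both compute the same log-probability contribution; the difference is entirely in the size of the autodiff expression graph.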

I thought Stan was pretty cool before I realized what was going on with the parsing-to-C++ machinery. Like some other fun parts of C++ this borders on magical. Now, how long will it take for this to get into the C++ standard … C++45? (_mostly_ kidding).

Good to know a little about what is going on under the hood. The advantage of not writing the explicit loop, from the human-factors perspective, is that the code is clean and has fewer potentially confusing parts, so I’m glad to hear the first, more concise version is also the faster one.

My report was confusing in that regard. As Bob says, the one without the explicit loop is faster. I’ll upload my test script to the user list, so there’s no confusion.

Not just confusing but actually exactly backwards. The posting describes changing the first version into the second version and getting a speedup. Perhaps someone can edit this posting to clarify the proper thing, for internet search posterity at least.