I’ve fixed it and made a pull request, so hopefully it’ll be fixed soon. Intrinsic models are annoying.

]]>You cannot talk about scalability until you have determined that Stan is recovering a reasonable result. Formally, we need NUTS to be geometrically ergodic with respect to the given posterior so that a Central Limit Theorem holds and we get unbiased MCMC estimators whose variances can be estimated using the effective sample size. If NUTS is not geometrically ergodic then the effective sample size _is a meaningless number_. Stan will compute its estimator and print it, but it will not correspond to any meaning quantity.

Verifying geometric ergodicity is really, really, really hard and so it often is simply assumed to hold, but we have decades of naively-implemented hierarchical models fit with Gibbs samplers that demonstrate how dangerous this is! Fortunately Stan provides two very powerful diagnostics. The first is the Gelman-Rubin statistic, i.e. Rhat — when calculated from many chains each initialized from random starting points, a lack of geometric ergodicity manifests as Rhat > 1.1 or so. The second are divergences. These are a unique to Hamiltonian Monte Carlo algorithms like NUTS, but they are incredibly sensitive to the kind of pathologies that can obstruct geometric ergodicity and hence valid MCMC estimators.

So before we even talk about performance and scalability you have to demonstrate validity. _Any_ fit being compared has to have Rhat ~ 1 using multiple chains (the more the better) and no divergences encountered in sampling. None. Do these CAR models satisfy those conditions? If so, then we can talk about performance.

In particular, the techniques combined into this model implementation speeds up the centered parameterization of the latent Gaussian model, but the centered parameterization is known to be very susceptible to breakdowns in geometric ergodicity. Ben is advocating the non-centered parameterization which needs the Cholesky decomposition and hence is more expensive, but in practice it tends to be much more statistically robust. Poor fits often run much faster than good fits! But, again, speed is irrelevant until the validity of the estimates has been demonstrated.

tl:dr; Does this fast CAR model have Rhat ~ 1 when using multiple chains and show no divergences in any of those chains? If so then the centered parameterization it uses is appropriate for the data set you’re fitting and this improved implementation will be a huge gain! If not then you’re just scaling a poor fit.

P.S. There is one other concern, and that is the effective sample size per transition. When Markov chains are mixing slowly the autocorrelations, and hence the effective sample size, can be really hard to accurately estimate, even when there are no concerns about statistical validity. This manifests in n_eff / n_iter < 0.001 or so, which is then a really good indication that you probably can't trust the n_eff estimation and hence any performance metric that depends on it. In practice I am comfortable with 0.01 < n_eff / n_iter < 1, provided that all the other diagnostics look okay.

]]>Good point about the assumptions behind n_eff – is there a better way to evaluate efficiency, or diagnose whether the estimated effective sample size is reliable?

]]>Good points (regarding Chapman and intrinsic CAR).

]]>It’s mainly that the way that you typically write the prior as a product of conditionals would involve a singular gaussian. If the observations are also gaussian, then you can just compute the posterior precision matrix and use that for both the determinant and the quadratic form. Otherwise, you have a prior (singular) precision matrix and some log-likelihood term. (in which case you need a full eigendecomposition of the precision matrix, which is too expensive for most models; the lip cancer dataset is really small!)

The easiest way to fix this is to mulitply and divide by a normal approximation to the likelihood. (This allows you to combine this with the prior and have an approximate posterior that is Gaussian and non-singular). But for less sophisticated methods than HMC, this only works well if the normal distribution is a normal approximation to the likelihood for the current values of the parameters. This is usually not available in closed form, so you’d probably need to think harder for Stan

]]>1. It’s not like I was looking for a Tracy Chapman reference—actually that’s the only song of hers I can think of—it just jumped out at me given the topic.

2. The intrinsic CAR model is sooo cool. I did not try to follow your comment in detail but one thing that’s confusing me is . . . it’s the posterior we’re working with, not the prior. Once you have some data, the posterior distribution integrates, and anything Gaussian will have a non-singular precision matrix, right?

]]>1) Tracy Chapman!!!!

2) A Sparse Matrix in Stan!!!! I’m so happy there is finally a working version of this!

3) For practical examples, you need to make an intrinsic version work as well (i.e. one where the precision matrix is singular). This is harder. The standard trick when you’re using MH is to propose using some sort of quadratic approximation (the laplace approximation, the matrix that comes from the information matrix etc) and then do a Hastings correction. I’m not completely sure how to do something similar for HMC. The problem is basically that for an intrinsic prior the determinant is zero, whereas the normalising constant is the product of the non-zero eigenvalues (which is not zero! and often depends on an unknown parameter). Multiplying and dividing by the right multivariate Gaussian will fix everything, but we know from MH experience (which may or may not be tranferable to HMC) that this Gaussian should be the taylor expansion of the likelihood at the mode, which usually doesn’t have a closed form or else your sampler behaves badly. Maybe freezing that distribution won’t have as much of a speed penalty with HMC.

4) The thing that Ben suggested (the non-centred parameterisation) is what you should do for this type of data (see http://homepages.inf.ed.ac.uk/imurray2/pub/10hypers/hypers.pdf)

5) You should be never compute a full eigen-decomposition of a sparse matrix (it costs N^3 so there’s no point having a sparse matrix). In this case you can work out what the pattern of the symbolic cholesky decomposition and then hard code it into every example. It would be easier if Stan supported sparse matrices directly (and given they’re in Eigen3, I assume it’s somewhere on the “to do list”).

]]>But having an effective sample size that is 20% of the nominal sample size is not great by Stan standards. For the Scottish Lip Cancer example, it is possible to get an effective sample size that is much higher by post-multiplying a row-vector of parameters (that have a standard normal prior) by the inverse of the Cholesky factor of the precision matrix to form phi (transposed) in the transformed parameters block. However, this triangular solve is an O(N^3) operation, so it is slow for large N (which could easily be much larger than in the Scottish Lip Cancer example).

The case study calculates effective sample size per second of post-warmup time, which, in fairness is exactly what Stan developers have recommended to compare MCMC efficiency. But I think we need to qualify that. Comparing n_eff / second across implementations only makes sense if the concept of effective sample size makes sense (which essentially requires the MCMC Central Limit Theorem to hold) for each implementation being compared. And it requires that our estimates of the effective sample size be reasonable, which often seems not to be the case particularly when the estimated effective sample size is low.

Matt-tricking phi is “worse” on the estimated n_eff / second metric than the impressive 120 achieved in the case study. But I would feel more comfortable with a larger estimated n_eff and a lower estimated n_eff / second as long as the time it takes to estimate the model is feasible.

]]>True – it does look distorted. These are the polygons associated with the CARBayes package which bundles up the lip cancer data as an example, but there doesn’t seem to be any projection information in there.

2. I can’t believe that the “tau ~ gamma(0.5, .0005);” prior is a good idea. It just looks weird to me. At the very least I’d suggest a reparameterization to make it scale-free.

This prior is a carry-over from the WinBUGS example – they do seem to stick out like sore thumbs nowadays!

3. Is there a way to package up all that code into a function? I don’t have enough experience with Stan functions to be sure, but the idea would be to have a CAR model with all that code inside the function so that, as a user, I could just say theta ~ CAR(…) with all the parameters, and then I wouldn’t have to worry about all those matrix manipulations.

That would be great. I would guess most users would want to input their adjacency matrix rather than computing eigenvalues etc. I’m not sure how to efficiently implement this in Stan, since some of the efficiency gains come from pre-processing the data (e.g., in the transformed parameters block), but a CAR(…) function would be called from within the model block.

]]>Another advantage of the case studies format is that everything’s all in one place, so an outsider such as myself can quickly look it over and raise some issues. See the P.S. I added to the above post.

]]>This also is good news for people who want this blog to have more actual statistical modeling!

]]>We’ll blog them here and post them on the Stan web page. Everyone wins, because this is how our users (and developers!) learn to program useful models in Stan. The case studies page explains what we need in the way of license and content (short version is that you can keep copyright and keep source on your own repository, but need to license the content open source so we can distribute it):

http://mc-stan.org/documentation/case-studies.html

I’ll put one out on movement HMMs for ecology and on typical sets as soon as I can find the time—they’re both “in process” now.

]]>