Stan 2.10 through Stan 2.13 produce biased samples

[Update: bug found! See the follow-up post, Michael found the bug in Stan’s new sampler]

[Update: rolled in info from comments.]

After all of our nagging of people to use samplers that produce unbiased samples, we are mortified to have to announce that Stan versions 2.10 through 2.13 produce biased samples.

The issue

Thanks to Matthew R. Becker for noticing this with a simple bivariate example and for filing the issue with a reproducible example.

The change to Stan

Stan 2.10 changed the NUTS algorithm from slice sampling along a Hamiltonian trajectory to a new algorithm that uses categorical sampling of points along the trajectory, with probability proportional to the density (plus a bias toward the second half of the trajectory, which is a subtle aspect carried over from the original NUTS algorithm). The new approach is described here:
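To give a rough sense of the categorical step (a Python sketch of the idea only, not Stan's actual C++ implementation, and omitting the sub-tree biasing):

```python
import numpy as np

def sample_trajectory_point(log_densities, rng):
    """Pick one point from a simulated trajectory, with probability
    proportional to its (unnormalized) density. This is the categorical
    step described above, minus the sub-tree biasing; the slice-sampling
    scheme it replaced would instead draw a random slice level and pick
    uniformly among the points whose density exceeds that level."""
    w = np.exp(log_densities - np.max(log_densities))  # stabilize the exp
    return rng.choice(len(w), p=w / w.sum())

rng = np.random.default_rng(0)
# three hypothetical trajectory points; the middle one has negligible density
idx = sample_trajectory_point(np.array([-1.0, -100.0, -1.0]), rng)
```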

From Michael Betancourt on Stan’s users group:

Let me temper the panic by saying that the bias is relatively small and affects only variances, not means, which is why it snuck through all our testing and application analyses. Ultimately posterior intervals are smaller than they should be, but not so much that the inferences are misleading, and the shrinkage will be noticeable only if you have more than thousands of effective samples, which is much more than we typically recommend.
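To put a rough number on "more than thousands of effective samples" (a back-of-envelope of mine, assuming approximately independent draws and the textbook approximation that the standard error of a sample sd from n draws is about sd/sqrt(2n)):

```python
def n_eff_to_detect(bias_in_sd, sd=1.0, z=2.0):
    # Solve z * sd / sqrt(2 n) < bias_in_sd for n: the effective sample
    # size needed before a shrinkage this large exceeds z standard errors
    # of Monte Carlo noise.
    return (z * sd / bias_in_sd) ** 2 / 2

# e.g. a 5% shrinkage of a unit-sd posterior needs on the order of
# 800 effective draws before it stands out at two standard errors
n = n_eff_to_detect(0.05)
```

So at the few hundred to few thousand effective draws of a typical run, this bias hides inside the Monte Carlo noise.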

What we’re doing to fix it

Michael and I are poring over the proofs and the code, but it's unfortunate timing with the holidays here, as everyone's traveling. We'll announce a fix and make a new release as soon as we can. Let's just say this is our only priority at the moment.

If all else fails, we'll roll back the sampler to the 2.09 version in a couple of days and do a new release with all the other language updates and bug fixes since then.

What you can do until then

While some people seem to think the error is of small enough magnitude not to be worrisome (see comments), we’d rather see you all getting the right answers. Until we get this fixed, the only thing I can recommend is using straight up static HMC (which is not broken in the Stan releases) or rolling back to Stan 2.09 (easy with CmdStan, not sure how to do that with other interfaces).

Even diagnosing problems like these is hard

Matthew Becker, the original poster, diagnosed the problem with fake data simulations, but it required a lot of effort.

The bug Matthew Becker reported was for this model:

parameters {
  vector[2] z;
}

model {
  matrix[2,2] sigma;
  vector[2] mu;

  mu[1] <- 0.0;
  mu[2] <- 3.0;
  sigma[1][1] <- 1.0 * 1.0;
  sigma[1][2] <- 0.5 * 1.0 * 2.0;
  sigma[2][1] <- 0.5 * 1.0 * 2.0;
  sigma[2][2] <- 2.0 * 2.0;

  z ~ multi_normal(mu, sigma);
}

So it's just a simple bivariate normal with 50% correlation and reasonably small locations and scales. Running four chains of 1M iterations each in Stan 2.13 led to this result:

Inference for Stan model: TWOD_Gaussian_c1141a5e1a103986068b426ecd9ef5d2.
4 chains, each with iter=1000000; warmup=100000; thin=1;
post-warmup draws per chain=900000, total post-warmup draws=3600000.

and led to this posterior summary:

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
z[0] -5.1e-4  1.9e-3   0.95  -1.89  -0.61 7.7e-4   0.61   1.88 235552    1.0
z[1]    3.0  3.9e-3    1.9  -0.77   1.77    3.0   4.23   6.77 234959    1.0
lp__  -1.48  2.3e-3   0.96  -4.09  -1.84  -1.18   -0.8  -0.57 179274    1.0

rather than the correct (as known analytically and verified by our static HMC implementation):

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
z[0] 6.7e-5  5.2e-4   0.99  -1.95  -0.67 -1.8e-4   0.67   1.95  3.6e6    1.0
z[1]    3.0  1.1e-3   1.99  -0.92   1.66    3.0   4.34   6.91  3.6e6    1.0
lp__  -1.54  6.8e-3    1.0  -4.25  -1.92  -1.23  -0.83  -0.57  21903    1.0

In particular, you can see that the posterior sd is too low for NUTS (not by much, 0.95 vs. 0.99), and the posterior central 95% interval is (-1.89, 1.88) rather than (-1.95, 1.95) for z[1] (here, for some reason listed with zero-based indexing as "z[0]").
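As a quick consistency check (mine, assuming Gaussian marginals): the central 95% interval of a normal is ±1.96 times the sd, which lines up with the correct run's sd of 0.99 and interval of about (-1.95, 1.95):

```python
from statistics import NormalDist

z975 = NormalDist().inv_cdf(0.975)   # 97.5% standard-normal quantile, ~1.96
for sd in (0.99, 0.95):              # correct vs. biased posterior sd
    lo, hi = -z975 * sd, z975 * sd
    print(f"sd={sd}: ({lo:.2f}, {hi:.2f})")
```

The biased run's reported (-1.89, 1.88) is slightly wider than the (-1.86, 1.86) a pure rescaling to sd 0.95 would give, so the shrinkage isn't an exact Gaussian rescale, but the direction and rough magnitude agree.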

We're really sorry about this

Again, our sincere apologies here for messing up so badly. I hope everyone can forgive us. It is going to cause us to focus considerable energy on functional tests that'll diagnose these issues---it's a challenging problem to balance sensitivity and specificity of such tests.

12 Comments

  1. Robert Kubinec says:

    Thanks for the heads up, although it seems the bias is measurably small.

  2. Andrew says:

    Bob:

    It’s good to announce this—but, as Robert says, the bias must be pretty small, at least in all the examples I’ve been working on for the past several months, as Stan’s inferences have matched my fake-data simulations.

    Of course it’s good to fix the bug! But I don’t know that, for practical work, it’s necessary for people to go back to an old version of Stan.

    I’d recommend that users stick with Stan 2.13 but just check their results using fake data simulations (which they should be doing anyway!).

    • Shravan says:

      I have had the same experience at least so far: my own fake data simulations show that I can recover the parameters, and for standard lmer type hierarchical models the parameter estimates are similar to the frequentist ones for a given data-set.

      • So glad to hear everyone’s doing fake data simulation. Diagnosing this bug requires quite a few iterations to get the MCMC standard error low enough. You can follow the link to the issue to see how Michael Betancourt coded a model for the null hypothesis that our draws were correct.

      I have no idea which of our interfaces is using this bad indexing from 0 in the output. It should match the Stan indexing and be "z[1]" and "z[2]"!

  3. Björn says:

    Now, this may be a really simple question, but do the version numbers between rstan and Stan match up? The release dates seem to match up, so I guess they do? If so, maybe I should be grateful to my company’s IT organization for not updating the version since rstan 2.7.0-1. Or does the rstan version not tell me which Stan version I am using?

  4. Claiming that a Markov chain Monte Carlo sampler is biased is a delicate issue and, unfortunately, requires more evidence than “numbers kind of look off”. The problem is that MCMC is an approximate algorithm and so there will be some deviation from the true value even when things are working properly. Fortunately, MCMC is one of the few algorithms where we can quantify the approximation error (formally this is a bit complicated, but if you heed all of the diagnostics in Stan then it’s pretty robust).
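Concretely, the MCMC standard error of a posterior-mean estimate is the posterior sd divided by the square root of the effective sample size. A quick check of mine against the z[0] row of the Stan 2.13 fit in the post:

```python
import math

def mcse(sd, n_eff):
    # Monte Carlo standard error of a posterior-mean estimate
    return sd / math.sqrt(n_eff)

# z[0] row of the Stan 2.13 fit above: sd 0.95, n_eff 235552
print(mcse(0.95, 235552))  # ~1.96e-3, close to the reported se_mean of 1.9e-3
```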

    We can see that quantification already in the results of Matthew’s example that Bob showed above — the means for both z[0] and z[1] are within a few MCMC standard errors (listed as MCSE in the fit output) of their true values, just as we’d expect them to be if everything were working correctly. And that is exactly what our tests look for!

    But the concern wasn’t the means but rather the variances. To examine those we need MCMC standard errors for the marginal variances of z[0] and z[1] and for their correlation. To that end I augmented Matthew’s example with a generated quantities block that calculates some functions whose means give the deviation of the estimated marginal variances and correlation from their true values.

    transformed data {
      vector[2] mu;
      real sigma1;
      real sigma2;
      real rho;
      matrix[2,2] Sigma;

      mu[1] = 0.0;
      mu[2] = 3.0;

      rho = 0.5;
      sigma1 = 1.0;
      sigma2 = 2.0;

      Sigma[1][1] = sigma1 * sigma1;
      Sigma[1][2] = rho * sigma1 * sigma2;
      Sigma[2][1] = rho * sigma1 * sigma2;
      Sigma[2][2] = sigma2 * sigma2;
    }

    parameters {
      vector[2] z;
    }

    model {
      z ~ multi_normal(mu, Sigma);
    }

    generated quantities {
      // The means of these quantities give the difference between the
      // estimated marginal variances and correlation and their true values.
      // If everything is working correctly then these means should be within
      // the respective MCMC-SE of zero.
      real delta_var1;
      real delta_var2;
      real delta_corr;

      delta_var1 = square(z[1] - mu[1]) - sigma1 * sigma1;
      delta_var2 = square(z[2] - mu[2]) - sigma2 * sigma2;
      delta_corr = (z[1] - mu[1]) * (z[2] - mu[2]) / (sigma1 * sigma2) - rho;
    }

    Now let’s run with the default Stan settings. What do we get?

    Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
    z[1] 3.8618e-02 3.5107e-02 9.3442e-01 -1.4819e+00 5.3343e-03 1.5026e+00 7.0843e+02 2.8641e+04 9.9906e-01
    z[2] 3.0579e+00 7.0029e-02 1.8025e+00 2.0751e-01 3.0059e+00 6.0329e+00 6.6249e+02 2.6784e+04 9.9929e-01
    delta_var1 -1.2624e-01 6.0583e-02 1.3145e+00 -9.9679e-01 -6.1142e-01 2.3132e+00 4.7077e+02 1.9032e+04 9.9923e-01
    delta_var2 -7.5097e-01 2.0546e-01 4.8717e+00 -3.9825e+00 -2.5859e+00 8.7380e+00 5.6222e+02 2.2730e+04 1.0002e+00
    delta_corr -1.6556e-01 4.2485e-02 9.3341e-01 -1.1511e+00 -4.0714e-01 1.3873e+00 4.8269e+02 1.9515e+04 1.0001e+00

    There are no sampler diagnostics indicating that something is wrong with the sampler, and the deviations are all pretty close to zero relative to their MCMC standard errors! BUT if we generate many more samples then we start to see a bias arise:

    Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
    z[1] 2.3706e-03 1.3732e-03 9.5861e-01 -1.5762e+00 3.2581e-03 1.5823e+00 4.8734e+05 1.3434e+04 1.0000e+00
    z[2] 3.0044e+00 2.7295e-03 1.9180e+00 -1.5162e-01 3.0024e+00 6.1645e+00 4.9376e+05 1.3610e+04 1.0000e+00
    delta_var1 -8.1054e-02 2.0532e-03 1.3161e+00 -9.9670e-01 -5.8907e-01 2.5470e+00 4.1089e+05 1.1326e+04 1.0000e+00
    delta_var2 -3.2145e-01 8.1639e-03 5.2735e+00 -3.9868e+00 -2.3553e+00 1.0200e+01 4.1725e+05 1.1502e+04 1.0000e+00
    delta_corr -4.0032e-02 1.6886e-03 1.0407e+00 -1.0857e+00 -3.5417e-01 1.9632e+00 3.7984e+05 1.0470e+04 1.0000e+00

    Now those deviations are all many MCMC standard errors away from zero, indicating that there’s a real bias exhibited by the sampler.
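To make "many MCMC standard errors" concrete, here is my arithmetic on the delta rows of the long run above:

```python
def z_score(mean, mcse):
    # how many MCMC standard errors the estimated deviation sits from zero
    return mean / mcse

# delta rows (Mean and MCSE columns) from the long-run output above
for name, mean, se in [
    ("delta_var1", -8.1054e-2, 2.0532e-3),
    ("delta_var2", -3.2145e-1, 8.1639e-3),
    ("delta_corr", -4.0032e-2, 1.6886e-3),
]:
    print(name, round(z_score(mean, se), 1))
```

All three deviations sit tens of standard errors below zero, far outside anything Monte Carlo noise could explain.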

    To summarize, when validating MCMC samplers you need to be very careful to not just claim a bias but quantify that bias relative to the expected error. There does seem to be a small bias in the breadth of the posterior distribution recovered by Stan. The location of the posterior, however, seems to be fine, and the bias in the breadth isn’t really significant if you’re running with the Stan defaults, so there’s no need to panic about the general results of any analyses you’ve made in the past few years.
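For contrast, here is a sketch of mine of what the same check looks like under an exact sampler: draw iid from the bivariate normal (via NumPy rather than Stan) and the delta means land within a few Monte Carlo standard errors of zero.

```python
import numpy as np

rng = np.random.default_rng(1234)
mu = np.array([0.0, 3.0])
s1, s2, rho = 1.0, 2.0, 0.5
Sigma = np.array([[s1 * s1, rho * s1 * s2],
                  [rho * s1 * s2, s2 * s2]])

n = 200_000
z = rng.multivariate_normal(mu, Sigma, size=n)  # exact, independent draws

# the same deviations as the generated quantities block above
deltas = {
    "delta_var1": (z[:, 0] - mu[0]) ** 2 - s1 * s1,
    "delta_var2": (z[:, 1] - mu[1]) ** 2 - s2 * s2,
    "delta_corr": (z[:, 0] - mu[0]) * (z[:, 1] - mu[1]) / (s1 * s2) - rho,
}
for name, d in deltas.items():
    se = d.std() / np.sqrt(n)  # iid Monte Carlo standard error
    print(name, d.mean(), se)
```

With an unbiased sampler each mean lands within a couple of standard errors of zero; the long Stan 2.13 run above instead sits tens of standard errors away.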

    This is a pretty subtle bias, which is why it got past all of our tests and analysis validations. As Bob noted, we’re working to figure out what’s going on as quickly as we can — as subtle as the bias is, its cause is even more subtle and seems to be due to a complicated interaction between the sampler components, maybe? — and we’ll have a fix as soon as we find the issue.
