Computational and statistical issues with uniform interval priors

There are two anti-patterns* for prior specification in Stan programs that can be traced directly to idioms developed for BUGS. One is the diffuse gamma priors that Andrew’s already written about at length. The second is interval-based priors, which brings us to today’s post.

Interval priors

An interval prior is something like this in Stan (and in standard mathematical notation):

sigma ~ uniform(0.1, 2);

In Stan, such a prior presupposes that the parameter sigma is declared with matching bounds:

real<lower=0.1, upper=2> sigma;

We see a lot of examples where users either don’t know or don’t remember to constrain sigma. Stan cannot infer the bounds in general because its underlying imperative language is Turing complete: the bounds might be computed inside a function call. BUGS’s restriction to directed graphical models lets it infer the bounds (at runtime).

Computational problems

Suppose the true value for sigma lies somewhere near one of the boundaries. The boundaries are mapped to the unconstrained scale using a log-odds (aka logit) transform [thanks to Vadim Kantorov for the correction to the first version]:

sigma = 0.1 + 1.9 * inv_logit(sigma_free)

sigma_free = logit((sigma - 0.1) / 1.9)

where logit(u) = log(u / (1 - u)). Stan actually works on the unconstrained space, sampling sigma_free and producing sigma by inverse transform based on the declared constraints.

When sigma approaches one of the boundaries, sigma_free moves toward positive or negative infinity.

This makes it difficult to set step sizes when the plausible values include both values near the boundary and values even a little bit away from it. We need very large step sizes to move near the boundary and relatively small step sizes to move elsewhere. Euclidean Hamiltonian Monte Carlo, as used in Stan, fixes a single step size; it’s very challenging to relax this assumption, and jittering the step size a bit rarely helps.
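The boundary behavior is easy to see by coding the transform directly. Here is a minimal sketch in Python (a standalone check, not part of the original post) of Stan’s lower/upper-bound transform for a parameter constrained to (0.1, 2):

```python
import math

def logit(u):
    # log-odds: maps (0, 1) to the whole real line
    return math.log(u / (1 - u))

def unconstrain(sigma, lo=0.1, hi=2.0):
    # Map a value in (lo, hi) to the unconstrained scale,
    # mirroring sigma_free = logit((sigma - lo) / (hi - lo)).
    return logit((sigma - lo) / (hi - lo))

def constrain(sigma_free, lo=0.1, hi=2.0):
    # Inverse transform back to (lo, hi):
    # sigma = lo + (hi - lo) * inv_logit(sigma_free)
    return lo + (hi - lo) / (1 + math.exp(-sigma_free))

# As sigma approaches the upper boundary, sigma_free races off to infinity:
for sigma in (1.05, 1.9, 1.99, 1.999):
    print(sigma, unconstrain(sigma))
```

Note that moving sigma from 1.99 to 1.999, a change of 0.009 on the constrained scale, moves sigma_free by more than 2 on the unconstrained scale where the sampler actually operates.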

Statistical problems

The more worrying problem is statistical. Often these interval constraints are imposed under the assumption that the model author knows the value lies in the interval. Suppose the true value lies near the upper end of the interval (0.1, 2). Then any posterior mass that would fall above 2 under a wider prior, say uniform on (0.1, 5), gets pushed back below 2. This reduces the posterior uncertainty and biases the mean estimate downward compared to the wider prior.

Illustration

A simple Stan program exemplifies the problem:

data {
  real L;
  real<lower=L> U;
}
parameters {
  real<lower=L, upper=U> y;
}
model {
  y ~ normal(0, 1);
}

The parameter y is given a lower bound and upper bound constraint where it is declared in the parameters block, then given a standard normal distribution in the model block. Without the uniform prior, y should clearly have a standard normal distribution.

Now let’s fit it with a very wide interval for bounds, say (-100, 100).

> fit1 <- stan("interval.stan", data = list(L = -100, U = 100), iter=10000)

> probs <- c(pnorm(-2:0))

> probs
[1] 0.02275013 0.15865525 0.50000000

> print(fit1, probs = probs)

     mean se_mean   sd 2.275013% 15.86553%  50% n_eff Rhat
y    0.00    0.01 0.99     -2.00     -1.00 0.00  7071    1

The pnorm() function is the cumulative distribution function for the standard normal (location zero, unit scale). So the values in probs are the cumulative probabilities at -2, -1, and 0 (roughly 0.023, 0.16, and 0.50).

The posterior appears to be standard normal, with Stan recovering the quantiles at those probabilities, namely -2, -1, and 0, to within two decimal places (about the accuracy we expect here given the reported standard error of 0.01).
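For readers without R handy, the same sanity check can be sketched with Python’s standard library (this is a standalone illustration, not part of the original post): the quantiles of a standard normal at those cumulative probabilities are indeed -2, -1, and 0.

```python
from statistics import NormalDist

std = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probabilities at -2, -1, 0 (the equivalent of R's pnorm(-2:0)):
probs = [std.cdf(x) for x in (-2, -1, 0)]
print(probs)

# Inverting the CDF recovers the original values, which is what the
# reported posterior quantiles should match for a standard normal:
print([round(std.inv_cdf(p), 6) for p in probs])
```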

What happens if we instead provide an interval prior with tighter bounds that is asymmetric around the mean of zero, say uniform(-1, 2)? Let’s see.

> fit2 <- stan("interval.stan", data = list(L = -1, U = 2), iter=10000)

> print(fit2, probs=probs)

      mean se_mean   sd 2.275013% 15.86553%   50% n_eff Rhat
y     0.23    0.01 0.73     -0.93     -0.56  0.17  7224    1

Now the posterior mean is estimated as 0.23 and the posterior median as 0.17. That’s not standard normal. What happened? The posterior is not only no longer symmetric (mean and median differ), it’s no longer centered around 0.

Even though we know the “true” posterior mean would be zero without the constraint, adding an interval constraint (-1, 2) modifies the posterior so that it is not symmetric, has a higher mean, and a lower standard deviation.

If we had chosen (-1, 1), the posterior would be symmetric, the posterior mean would still be zero, but the posterior standard deviations would be lower than with the (-100, 100) uniform prior.

The take home message

The whole posterior matters for calculating the posterior mean, the posterior variance, and posterior intervals. Imposing a narrow uniform interval prior can bias estimates relative to a wider interval prior.

The lesson is that uniform priors are dangerous if any posterior mass would extend past the boundaries if a wider uniform interval were used.

If you want a wide uniform prior, you can just use an improper uniform prior in Stan (as long as the posterior is proper).

If you think diffuse inverse gamma priors are the answer, that was the second anti-pattern I alluded to earlier. It’s described in Andrew’s paper Prior distributions for variance parameters in hierarchical models (published as a comment on another paper!) and in BDA3.

But wait, there’s more

If you want more advice from the Stan dev team on priors, check out our wiki page on prior choice recommendations.

Or you can wait a few years for Andrew and Aki to consolidate it all into BDA4.


* The Wikipedia page on anti-patterns requires “two key elements” of an anti-pattern:

  1. A commonly used process, structure, or pattern of action that despite initially appearing to be an appropriate and effective response to a problem, has more bad consequences than good ones.
  2. Another solution exists that is documented, repeatable, and proven to be effective.

Check and check.

 

A fistful of Stan case studies: divergences and bias, identifying mixtures, and weakly informative priors

Following on from his talk at StanCon, Michael Betancourt just wrote three Stan case studies, all of which are must-reads.

Reproducible R scripts

They all come with fully reproducible knitr scripts to run in RStan. The same lessons hold for the other interfaces, so don’t let the R put you off.

A spectator sport

It was particularly fun sitting in the office the day Michael went and figured out all the mixture modeling properties. It followed on from one of our Stan meetings and some of my own failed experiments.

Publish or perish

It’s really a shame that this kind of methodological study is so hard to publish, because all three of these deserve to be widely cited. Maybe Andrew has some ideas of how to turn these into “regular” papers. The main thing journal articles give us is a way to argue that we got research results from our research grants. Not a small thing!

Other case studies

We have a bunch of case studies up and are always looking for more. The full list and instructions for submitting are on the Stan case studies page.

Getting information out of the posterior fit object in R

And in case you didn’t see it, Jonah wrote up a guide to extracting the information you need from a Stan fit object in R.