Prior distributions for variance parameters

Sudipto Banerjee and I had a little exchange, inspired by a comment on the Bugs mailing list, which might be of interest to the hierarchical modelers out there. It started with Wayne Thogmartin, who had a set of questions about fitting, understanding, and checking complex multilevel models.

Wayne Thogmartin wrote:

I would appreciate it if interested folks might comment on the attached code, which attempts to model factors associated with the discrimination between four land cover states. The global model has a slew of random effects, which makes convergence difficult. The attached code converges more quickly but still has some mixing problems. I’m wondering whether I might make the code more efficient. I would also appreciate thoughts on approaches to interpretation; currently, I examine the results to identify patterns in the random effects (i.e., where some groups differ in interesting ways from the main). I’m also a bit unsure how I might assess model fit.

I responded:

1. Those inverse-gamma prior distributions can create problems; uniform prior dists on the sd’s would be better; see here.

2. If you have a lot of variance parameters, I suggest modeling them hierarchically using the half-Cauchy family; see Section 6 of the above-referenced paper.

3. If convergence is still slow after the above 2 steps, try parameter expansion (see pp. 597-598 in Appendix C of Bayesian Data Analysis); the appendix is also here.

4. To understand the model fit, I recommend plots of data (or estimated coefficients) and fitted curve, at each level of the model. For example, see Figures 1 and 2 of this paper or Figures 2 and 3 of this paper.

5. You can check model fit by simulating replicated datasets and comparing them to the actual data. Graphical comparisons are usually best, in my experience. See Chapter 6 of BDA, and p. 598 of the aforementioned appendix for an example in Bugs.
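Points 1 and 2 above can be illustrated with a quick prior simulation. This is just a sketch (in Python for concreteness, though the modeling discussion is in R/Bugs); the half-Cauchy scale of 5 and the "plausible" sd range of (0.1, 10) are arbitrary choices for illustration, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bugs-style "vague" prior: tau ~ Gamma(eps, eps) on the precision,
# so sd = 1/sqrt(tau).  The implied prior on the sd depends strongly on eps:
for eps in (0.001, 0.1):
    tau = rng.gamma(shape=eps, scale=1.0 / eps, size=100_000)
    with np.errstate(divide="ignore"):  # tiny-shape gamma draws can underflow to 0
        sd = 1.0 / np.sqrt(tau)
    # fraction of prior mass on sd values in an (arbitrary) plausible range
    print(eps, np.mean((sd > 0.1) & (sd < 10)))

# By contrast, half-Cauchy draws for the sd (scale 5 is just for illustration):
sd_hc = np.abs(5.0 * rng.standard_cauchy(100_000))
print(np.median(sd_hc))  # the median of a half-Cauchy equals its scale
```

The point of the first loop is that the supposedly noninformative Gamma(eps, eps) prior puts very different amounts of mass on reasonable sd values depending on eps, which is exactly the sensitivity that motivates moving to uniform or half-Cauchy priors on the sd scale.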

Sorry for referring only to my own stuff. These are topics that I’ve thought a lot about, and naturally I find my own writing on the topic easiest to understand!
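The replicated-data check in point 5 can be sketched in a few lines. This is a toy normal example of my own, not Wayne's model, with a crude posterior just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy posterior predictive check for y ~ Normal(mu, sigma):
y = rng.normal(0.0, 1.0, size=50)          # stand-in "observed" data
sigma_hat = y.std(ddof=1)
# crude posterior draws for mu (flat prior, sigma fixed at its estimate):
mu_draws = rng.normal(y.mean(), sigma_hat / np.sqrt(len(y)), size=1000)

# one replicated dataset per posterior draw; compare a test statistic
T_obs = y.max()
T_rep = np.array([rng.normal(mu, sigma_hat, size=len(y)).max() for mu in mu_draws])
p_ppc = np.mean(T_rep >= T_obs)            # posterior predictive p-value
print(p_ppc)                               # extreme values (near 0 or 1) flag misfit
```

In practice I'd plot the replicated datasets (or the T_rep values against T_obs) rather than just report the p-value, in keeping with the preference for graphical comparisons above.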

Sudipto then asked:

I downloaded and glanced over your article on the problems with the Inv-Gamma priors on random effect variances. It is an interesting article and I intend to give it a more thorough read, but here is a quick thought: The problem with the Inv-Gamma prior, I gather, is that an IG(alpha, alpha) prior with alpha really small (BUGS-style parametrization) can lead to unstable posteriors (theoretically, as alpha -> 0 the posterior becomes improper). But what if you opt for an IG(2, b) specification for the variance? Note that this is a “proper prior” but with infinite variance (hence “vague” in some sense), and yet it allows you to center around a finite mean. I also believe this will ensure propriety of the posterior (tho’ I haven’t proved it) for the random effect models. I was wondering if you ever experimented with this prior.
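Sudipto's "proper but vague" claim is easy to check by simulation: the mean of IG(a, b) is b/(a-1), which equals b at a = 2, while the variance formula b^2/((a-1)^2 (a-2)) blows up at a = 2. A quick sketch (mine, in Python; b = 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

b = 2.0
# Inverse-Gamma(shape=2, scale=b) draws: reciprocals of Gamma(shape=2, rate=b)
draws = 1.0 / rng.gamma(shape=2.0, scale=1.0 / b, size=1_000_000)
print(draws.mean())       # near b: mean of IG(a, b) is b/(a-1), which is b at a = 2
print(np.median(draws))   # well below the mean: the distribution is heavily right-skewed
```

The sample mean settles near b, but the sample variance never stabilizes across runs, which is the infinite-variance behavior that makes the prior "vague" in Sudipto's sense.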

My response:

I could see that it could help in some settings but I think I still prefer something like the half-t that is flat near zero. For example, do the following in R:

> a <- 1/rgamma(100000, shape=2, scale=1)
> hist(a)
> hist(a[a<10])

I'm not bothered by the behavior in the upper tail--a small chance of large values--but I don't like how quickly it falls off near zero. One can change the scale to .1, for example, but then the prior has a peak in the middle. Or one can let the scale itself be unknown--estimated from the data with a normal or flat prior distribution--and this leads you back to the half-t family that I recommend in the paper. So, given this, I don't see the inverse-gamma with alpha=2 as being so great.
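That last step--normal uncertainty on the scale of an inverse-gamma leading back to the half-t--can be checked numerically via the standard scale-mixture identity: a half-t_nu draw equals |z| times the square root of an Inv-Gamma(nu/2, nu/2) draw, with z standard normal. A sketch of mine (Python; nu = 4 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

nu, n = 4, 200_000
# scale mixture: sigma = |z| * sqrt(tau), z ~ N(0,1), tau ~ Inv-Gamma(nu/2, nu/2)
z = rng.normal(size=n)
tau = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)
sigma_mix = np.abs(z) * np.sqrt(tau)

# direct half-t draws for comparison
sigma_t = np.abs(rng.standard_t(nu, size=n))

print(np.quantile(sigma_mix, [0.25, 0.5, 0.75]))
print(np.quantile(sigma_t, [0.25, 0.5, 0.75]))  # the two sets should agree closely
```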

Finally, Sudipto wrote,

I see your point and I think your article should be particularly useful for practising Bayesians. I have encountered some of the issues you discussed in the paper in some spatial geostat models and have had the IG(2, b) (mean b, variance = Inf) sometimes bail me out. In particular, in Gaussian settings you can fit an OLS regression (no random effects) and obtain an estimate of the residual variance. Then you set b a priori as if, in your random effects model, you are “dividing” the OLS estimate a priori (say 50-50) between the spatial effect variances and the measurement error (or nugget) variances.
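Sudipto's recipe for setting b can be sketched as follows. This is a toy non-spatial regression of my own, not his code; the 50-50 split and all numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate toy data and fit OLS with no random effects
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - 2)          # OLS residual variance estimate

# Split s2 (say 50-50) as the IG(2, b) prior means for the two components
b_spatial = 0.5 * s2                  # prior mean for the spatial effect variance
b_nugget = 0.5 * s2                   # prior mean for the nugget variance
print(round(b_spatial, 2), round(b_nugget, 2))
```

Since IG(2, b) has mean b, centering each variance component at half the OLS residual variance encodes the "divide the OLS estimate apriori" idea while leaving the prior variance infinite.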