## Getting the right uncertainties when fitting multilevel models

Cesare Aloisi writes:

I am writing you regarding something I recently stumbled upon in your book Data Analysis Using Regression and Multilevel/Hierarchical Models that confused me, in the hope that you can help me understand it. This book has been my reference guide for many years now, and I am extremely grateful for everything I learnt from you.

On page 261, a 95% confidence interval for the intercept in a certain group (County 26) is calculated using only the standard error of the “random effect” (the county-level error). The code is as follows:

```r
coef(M1)$county[26,1] + c(-2,2)*se.ranef(M1)$county[26]
```

My understanding is that, since the group-level prediction (call it y.hat_j = coef(M1)$county[26,1]) is a linear combination of a global average and a group-level deviation from the average (y.hat_j = beta_0 + eta_j), the variance of y.hat_j should involve the variances and covariance of beta_0 and eta_j, not just the variance of eta_j, as the code on page 261 seems to imply. In other words:

Var(y.hat_j) = Var(beta_0) + Var(eta_j) + 2*Cov(beta_0, eta_j)

Admittedly, lme4 does not provide an estimate for the last term, the covariance between “fixed” and “random” effects. Was the code used in the book to simplify the calculations, or was there some deeper reason to it that I failed to grasp?

My reply: The short answer is that it’s difficult to get this correct in lmer but very easy when using stan_lmer() in the rstanarm package. That’s what I recommend, and that’s what we’ll be doing in the 2nd edition of our book.

### 13 Comments

1. Tim Mastny says:

What are your plans for the second edition? Can we expect it next year?

2. JR says:

I stumbled upon the exact same problem a while ago, scratched my head for a while, and decided on using a complicated double bootstrap method (described in Chatterjee S, Lahiri P. A simple computational method for estimating mean squared prediction error in general small-area model. In: Proceedings of the Section on Survey Research Methods. 2007: 3486–3493.). Nowadays I would use Stan without a moment of reflection and thank the lord for its existence.

• Alex says:

I find it mildly amusing, but not at all surprising, that the article calls the procedure “simple” while the person implementing it calls it “complicated”.

3. Paul says:

Same here. However, I used INLA back then; it has an easy way to compute the posterior marginals of linear combinations. It is sad that one is not able to do this with lme4.

4. bluepole says:

I’ve been trying to follow the same route for simple multilevel models with lme4, as laid out in the book BDA3 and the paper Why We (Usually) Don’t Have to Worry About Multiple Comparisons. The main reason I would like to adopt this approach is the heavy (sometimes even prohibitive) computational burden of full Bayesian fitting for large datasets. So, are the formulations for the posterior mean and variance under the approximate Bayesian approach not so accurate?

5. Shravan says:

I didn’t understand this question. How can there be a non-zero covariance between the estimator beta_0-hat and eta_j? Surely the code shown in the next line on that page of the book is what Cesare wanted:

```r
as.matrix(ranef(M1)$county)[26] + c(-2,2)*se.ranef(M1)$county[26]
```

Those two random variables are independent, surely?

Obviously, it’s 2017 so Stan is the way to go, but still.
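For what it’s worth, the stan_lmer() route makes this interval a one-liner on the posterior draws, since the draws carry the correlation between the intercept and the county effect, so no covariance terms need to be tracked by hand. A minimal sketch, assuming the radon model from the book; the exact parameter column labels depend on the fitted model, so they are illustrative here:

```r
library(rstanarm)

# Varying-intercept radon model, as in the book (data assumed available)
M1 <- stan_lmer(y ~ x + (1 | county), data = radon)

sims <- as.matrix(M1)   # one row per posterior draw, one column per parameter
# Column labels below follow rstanarm's usual naming, but check colnames(sims)
# for the actual labels in your fit.
a_26 <- sims[, "(Intercept)"] + sims[, "b[(Intercept) county:26]"]
quantile(a_26, c(0.025, 0.975))   # 95% posterior interval for county 26's intercept
```

Because the sum is formed draw by draw, the Var(beta_0) + Var(eta_j) + 2*Cov(beta_0, eta_j) bookkeeping happens automatically.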

6. Jens says:

Sounds pretty interesting. Do you know when we can expect the 2nd edition?

7. Titus von der Malsburg says:

One problem with using Stan in a future version of Gelman & Hill (I suppose that’s what you are referring to?) is that fitting Stan models analogous to those in the current book takes considerably more time. This would slow down learning and discourage students from conducting their own experiments. Fitting models in class would be pretty much impossible. I would prefer an lmer-based book with chapters on Stan in the second half. I think this would work okay because a lot of the material in the early chapters is largely orthogonal to the Bayesian vs. frequentist issue anyway (coding of predictors, transforms of the DVs, etc.).

8. Titus von der Malsburg says:

(This was supposed to be a reply to Tim. Not sure why it’s showing up as such.)

9. Cesare says:

To contextualise, the question stemmed from a practical case I was working on. I fit a three-level growth model of pupil scores taken at different time points, nested within pupils, themselves nested within classes. A quadratic equation provided the best fit. Someone asked me to compare the trajectories of two classes at each time point, which raised the question of how to calculate group-level confidence intervals.

If you think about the covariance matrix of a group-level estimate at a certain time point, it is not just a matter of the covariance between one single “fixed effect” and its “random component”, which are assumed to be orthogonal. It is more complicated than that, and I was not at all sure that I could dismiss all the covariances by treating them as zero. So I checked Andrew’s book to see if I could find a solution. The situation he presents has fewer terms, but it is conceptually similar, so I thought I would ask for clarification.

AFAIK, neither lme4 nor merTools provides a direct implementation of this, and Andrew confirmed it is a bit of a can of worms. Looking forward to the next edition of the book!
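With posterior draws in hand, this kind of class-by-class comparison reduces to forming the contrast draw by draw, which propagates every variance and covariance term in the quadratic trajectory. A hypothetical sketch only: the model object fit, the class labels, and all coefficient column names below are assumptions, not the actual fit (inspect colnames(sims) for the real labels):

```r
# Posterior draws from an assumed stan_lmer growth model with
# class-level intercept, slope, and quadratic deviations
sims <- as.matrix(fit)
t <- 3   # time point at which to compare the two classes

# Predicted trajectory value for one class at time t, combining the
# "fixed" and "random" parts; column labels are illustrative placeholders.
traj <- function(cl) {
  (sims[, "(Intercept)"] + sims[, paste0("b[(Intercept) class:", cl, "]")]) +
    (sims[, "time"] + sims[, paste0("b[time class:", cl, "]")]) * t +
    (sims[, "I(time^2)"] + sims[, paste0("b[I(time^2) class:", cl, "]")]) * t^2
}

contrast <- traj("A") - traj("B")    # class A minus class B at time t
quantile(contrast, c(0.025, 0.975))  # 95% posterior interval for the difference
```

Since each draw is an internally consistent set of parameter values, no covariance has to be dismissed or approximated.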

• Peter says:

Hi Cesare,

I am not totally clear on your specific model, but to some extent the appropriate confidence interval (CI) construction depends on how you want to interpret the CI. If you are looking for posterior coverage, then using the posterior estimate plus or minus two standard errors is reasonable (assuming the posterior distributions are approximately normal).

Such procedures have roughly 95% frequentist coverage *on average across groups*, but do not generally maintain this coverage for each group individually. This is partly because the posterior estimates are biased.
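This coverage point can be checked with a toy simulation in the normal-normal model with known variances, a simplified stand-in for a fitted multilevel model (all numbers here are made up for illustration): shrinkage intervals cover about 95% of the time on average over groups drawn from the population distribution, but under-cover for a group whose true mean sits in the tail.

```r
set.seed(1)
tau <- 1; sigma <- 1                  # group-level and data-level sd (known here)
lambda <- tau^2 / (tau^2 + sigma^2)   # shrinkage factor toward the grand mean (0)
post_sd <- sqrt(lambda) * sigma       # posterior sd for a group mean

# Frequentist coverage of the posterior interval for a fixed true group mean
covers <- function(theta, n = 2000) {
  y <- rnorm(n, theta, sigma)         # observed group estimates
  est <- lambda * y                   # posterior mean, shrunk toward 0
  mean(abs(est - theta) < 1.96 * post_sd)
}

mean(sapply(rnorm(200, 0, tau), covers))  # averaged over groups: close to 0.95
covers(2 * tau)                           # a group in the tail: noticeably below 0.95
```

The shrinkage that buys efficiency on average is exactly what biases the estimate for atypical groups, which is why per-group coverage suffers.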

If you want to share information across groups, but also want a procedure that maintains 95% coverage for a specific group mean or contrast, you may be interested in the following paper on hierarchical modeling:

https://arxiv.org/abs/1612.08287

This paper shows how to construct CIs in a hierarchical model that both share information across groups in a Bayesian-like way and maintain 95% frequentist coverage for each group individually.