Jeff Diez has a question about measures of explained variance (“R-squared”) for hierarchical logistic regression. He refers to my paper with Iain Pardoe, to be published in Technometrics, on R-squared for multilevel models. I’ll give his question and then my response and Iain Pardoe’s response.

Jeff writes:

A colleague, Sean McMahon, and I are writing a paper on the under-utilized potential of multi-level models for inference about ecological processes. We have developed a couple of examples based broadly on Raudenbush and Bryk’s progression through inference at different levels, starting with an unconditional model. We use two of our ecological datasets to illustrate a 2-level Normal model and a 3-level random-intercept logistic regression (for flowering data). The 2-level is fit with an EM algorithm and the 3-level is fit in WinBUGS. We have one covariate at each level that is significantly different from zero. In addition to all the coefficient estimates, we want to show how variance can be summarized at different levels. We calculate intraclass correlation coefficients and proportions of variation explained after adding a covariate to a model, using the fixed individual-level variance of 3.29 as suggested by Snijders and Bosker for logit models. We are finding it more useful, though, to calculate a level-specific R2 for the top two levels in the 3-level model, as described in your 2005 paper on Bayesian measures of explained variance. This approach seems better behaved (using the methods in Raudenbush we get negative variances explained at some levels using comparisons to the unconditional model) and your formulation of R2 makes good sense as a way to describe explanatory power that we think ecologists can relate to.
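To make the variance summaries in this question concrete, here is a minimal Python sketch of the two calculations Jeff describes: variance shares (intraclass correlations) with the level-1 variance fixed at pi^2/3, about 3.29, and the proportion of variance explained relative to the unconditional model. The function names and the variance components passed in are hypothetical, chosen only for illustration; they are not taken from Jeff's analysis.

```python
import math

# Fixed level-1 variance for a logit model: the variance of the
# standard logistic distribution, pi^2 / 3 (about 3.29).
LOGIT_LEVEL1_VAR = math.pi ** 2 / 3


def variance_shares(var_level2, var_level3):
    """Share of total latent-scale variance at each level of a
    3-level random-intercept logit (hypothetical helper)."""
    total = LOGIT_LEVEL1_VAR + var_level2 + var_level3
    return {
        "level1": LOGIT_LEVEL1_VAR / total,
        "level2": var_level2 / total,
        "level3": var_level3 / total,
    }


def proportion_explained(var_unconditional, var_conditional):
    """Proportional reduction in a variance component after adding a
    covariate. Negative when the estimated component grows."""
    return 1.0 - var_conditional / var_unconditional


# Illustrative (made-up) variance components.
shares = variance_shares(var_level2=0.8, var_level3=0.4)
print(shares)
print(proportion_explained(0.5, 0.6))  # negative: the component increased
```

The second function shows how the negative "variance explained" mentioned above can arise: if the estimated variance component is larger in the model with the covariate than in the unconditional model, the ratio exceeds 1.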

The question you might be anticipating by now is how best to estimate the level-1 explained variance in this logistic model. The last sentence of your paper nods toward an alternative for GLMs using deviances, but it is not immediately clear to us how to approach this. We wonder if you have hashed out an approach based on deviances, or could comment on the possibility of at least an approximation based on a similar method. It would seem something could be possible using similar assumptions to the Snijders reformulation as a threshold model, but we haven’t been able to work that out. Also, we are not currently trying to estimate any overdispersion parameters, but are interested in how a calculation of level-1 explained variance might be influenced by doing so.

My reply: Many people have asked about level-1 R-squared for logistic models. One idea we’ve thought of is to work with the latent-variable formulation of the logit (in which case, the data-level sd is 1.6 for the non-overdispersed logit). I’m not sure how this would work for a Poisson regression, however.
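As a rough illustration of the latent-variable idea, one simple plug-in version treats the variance of the linear predictor as the explained part and the logistic error variance as the unexplained part. This is a sketch in the spirit of the McKelvey and Zavoina measure, not necessarily what the reply has in mind, and the function name and inputs are hypothetical. Note that the standard logistic distribution has sd pi/sqrt(3), about 1.8 (variance 3.29, the figure in Jeff's question); 1.6 is the sd of the normal distribution commonly used to approximate it, so the sketch takes the residual sd as a parameter.

```python
import math
import statistics


def latent_r2(linear_predictor, residual_sd=math.pi / math.sqrt(3)):
    """Plug-in latent-scale R-squared for a logit model:
    var(linear predictor) / (var(linear predictor) + residual variance).
    Hypothetical helper; residual_sd defaults to the logistic sd."""
    explained = statistics.pvariance(linear_predictor)
    return explained / (explained + residual_sd ** 2)


# Made-up linear predictors (log-odds) for five observations.
eta = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(latent_r2(eta))                    # using sd = pi / sqrt(3)
print(latent_r2(eta, residual_sd=1.6))   # using the normal-approximation sd
```

In a Bayesian fit one would presumably compute this for each posterior draw of the linear predictor and summarize the resulting distribution, rather than plugging in a single point estimate.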

Iain Pardoe adds: My instinct says not to get too stuck on trying to formulate a sensible R-squared type measure for binary outcomes. For logistic regression, I find it easier to focus on other model fit summary measures or notions like how much better you can predict 0/1 with the model vs. without (e.g., in terms of “lift”).
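For readers unfamiliar with these summaries, here is a small Python sketch of two of them: lift in the top fraction of cases ranked by predicted probability, and the classification error rate with and without the model. The function names, the 0.5 cutoff, and the data are assumptions made for illustration only.

```python
def lift_at(y, p, fraction=0.1):
    """Rate of 1s among the top `fraction` of cases ranked by predicted
    probability, divided by the overall rate of 1s.
    Assumes y contains at least one 1."""
    n_top = max(1, round(len(y) * fraction))
    order = sorted(range(len(y)), key=lambda i: p[i], reverse=True)
    top_rate = sum(y[i] for i in order[:n_top]) / n_top
    base_rate = sum(y) / len(y)
    return top_rate / base_rate


def error_rates(y, p, cutoff=0.5):
    """Return (null_error, model_error): the error rate from always
    guessing the more common outcome, and the error rate from
    predicting 1 whenever p exceeds the cutoff."""
    base_rate = sum(y) / len(y)
    null_error = min(base_rate, 1 - base_rate)
    wrong = sum((pi > cutoff) != bool(yi) for yi, pi in zip(y, p))
    return null_error, wrong / len(y)


# Made-up outcomes and predicted probabilities.
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
print(lift_at(y, p, fraction=0.2))
print(error_rates(y, p))
```

Both summaries are computed here on the same data used for prediction; with a fitted model one would typically want held-out data or cross-validation so the comparison is not overly optimistic.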