Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes:

I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray.

The problem in working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects).

Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable.

Am I right in thinking that this is purely a matter of convenience, and that the matrix formulation of chapter 13 requires indicator variables, but that the matrix of indicators or the vector of indices yield otherwise identical results? I can’t see why they shouldn’t be the same, but my intuition is still developing around multi-level models.

I replied:

Yes, models can be formulated equivalently in terms of index or indicator variables. If a discrete variable can take on a bunch of different possible values (for example, 50 states), it makes sense to use a multilevel model rather than to include indicators as predictors with unmodeled coefficients. If the variable takes on only two or three values, you can still do a multilevel model but really it would be better at that point to use informative priors for any variance parameters. That’s a tactic we do not discuss in our book but which is easy to implement in Stan, and I’m hoping to do more of it in the future.

To which my correspondent wrote:

The main difference that occurs to me as I work through implementing this is that the matrix of indicator variables loses information about what the underlying variable was. So, for instance, if the matrix mixes an indicator for sex and n indicators for religion and m indicators for schools, we’d have Sigma_beta be an m+n+1 x m+n+1 matrix, when we really want a 3×3 matrix.

I could set up the basic structure of Sigma_beta, separately estimate the diagonal elements with a series of multilevel loops by sex, religion, and school, and eschew the matrix formulation in the individual model. So instead of y~N(X_iB_j[i],sigma^2_y) it would be (roughly, I’m doing this on my phone):

y_i~N(beta_sex[i]+beta_sex_country[country[i]]+beta_religion[i]+beta_religion_country[i,country[i]]+beta_school[i]+beta_school_country[i,country[i]],sigma^2_y)

And the group-level formulation unchanged. Sigma_beta becomes a 3×3 matrix rather than an m+n+1 matrix, which seems both more reasonable and more computationally tractable.

My reply:

Now I’m getting tangled in your notation. I’m not sure what Sigma_beta is.

If I had to take a guess given the input variables it’s probably the convergence of two sororities. Yes…. definitely sororities.

On the serious side… I’ve always been confused by the sometimes interchangeable use of ‘indicator’ as a term for a variable. I’ve heard the use of ‘dummy’, etc. Probably just my background speaking here, but what does this mean really? To me these are nothing more than binary variables (perhaps bernoulli, given probability of an outcome). Is it that by ‘indicator’ it’s meant that the variable is not included in a model and is only loosely (perhaps by 2 factor OLS) associated with some outcome? If so, why not just keep it simple? Modelled or unmodelled, predicts or is associated with.

I think the whole indicator/predictor/dummy/blah nomenclature gets a bit confusing (for which I remain in perpetuity anyway, so take me with a grain of salt here). Enter comp Sci folks into ‘data science’ and that appears to cloud things even further these days.