Clusters with very small numbers of observations

James O’Brien writes:

How would you explain, to a “classically-trained” hypothesis-tester, that “It’s OK to fit a multilevel model even if some groups have only one observation each”?

I [O’Brien] think I understand the logic and the statistical principles at work in this, but I’m having trouble being clear and persuasive. I also feel like I’m contending with some methodological conventional wisdom here.

My reply: I’m so used to this idea that I find it difficult to defend it in some sort of general conceptual way. So let me retreat to a more functional defense, which is that multilevel modeling gives good estimates, especially when the number of observations per group is small.

One way to see this in any particular example is through cross-validation. Another way is to consider the alternatives. If you try really hard, you can come up with a “classical hypothesis testing” approach that will do as well as the multilevel model; it would just take a lot of work. I’d rather put that effort into statistical modeling and data visualization instead.
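To make the “good estimates” claim concrete, here is a small simulation sketch (a toy example of my own, not from the original exchange): it compares raw group means against partially pooled estimates that shrink toward the overall mean, treating the variances and the grand mean as known for simplicity rather than fitting a full multilevel model. With many one-observation groups, the partially pooled estimates typically come out with noticeably lower error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (the numbers here are illustrative assumptions):
# J groups whose true effects come from a common distribution, with very
# uneven group sizes -- some groups have only a single observation.
J = 50
tau = 1.0      # between-group standard deviation
sigma = 2.0    # within-group standard deviation
true_theta = rng.normal(0.0, tau, size=J)
n = rng.integers(1, 6, size=J)   # group sizes from 1 to 5

# Observed group means (the sampling sd of a mean of n_j points is sigma/sqrt(n_j)).
ybar = np.array([rng.normal(t, sigma / np.sqrt(m)) for t, m in zip(true_theta, n)])

# No pooling: estimate each group by its own mean.
no_pool = ybar

# Partial pooling with sigma, tau, and the grand mean (0) taken as known:
# each group mean is shrunk toward the common mean in proportion to how
# little data the group has.  A real multilevel model would estimate these
# quantities from the data, but the shrinkage behavior is the same.
weight = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
partial_pool = weight * ybar

print("RMSE, no pooling:     ", np.sqrt(np.mean((no_pool - true_theta) ** 2)))
print("RMSE, partial pooling:", np.sqrt(np.mean((partial_pool - true_theta) ** 2)))
```

In a real analysis the variances and the overall mean would themselves be estimated (for example in lmer or Stan), but the qualitative point is the same: the smaller the group, the more its estimate borrows from the other groups.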

If you are in a situation where someone really doesn’t want to do the multilevel model, you could perhaps ask your skeptical colleague what his or her goals are in your particular statistical modeling problem. Then you can go from there.

2 thoughts on “Clusters with very small numbers of observations”

  1. I'm actually TFing a course on statistical computing at the moment, and I'm getting a lot of the same questions (especially about a Poisson / log-Normal model with very small groups that my students are working with).

    When cluster sizes are small, the gains from multilevel modeling can be very large in certain circumstances. Multilevel modeling uses information from multiple groups to estimate and predict parameters for each individual group. When you have very small groups (down to a single observation), the gains can be extremely large because the amount of information available from the other groups is much greater than the amount of information in a single observation. You therefore get better inference for the small groups by pooling information through multilevel modeling.

    However, there is a price to pay. These gains come from the assumption that the groups are similar; in the model, this is expressed by assuming that each group's parameters are drawn from a common distribution across groups. This assumed common distribution is what allows the combination of information and the more efficient inferences, but if the assumption is wrong, your inferences will suffer. (A minimal version of this setup is written out below, after this comment.)

    Model checking and validation are therefore very important in multilevel models, especially when you are interested in making inferences for small groups. Because the information behind those inferences comes almost exclusively from the other groups, your inference will be poor if the groups are not similar in the ways you have assumed.

    For example, in an application I am working on, our initial model assumed that the underlying intensity was independent of an exposure weight. This assumption proved quite poor and, as a result, the multilevel model gave poor answers in groups with little information. Subsequent revisions to the model have greatly improved our results.

    Best of luck with your research :)
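To spell out the structure described in the comment above (the “common distribution across groups”), here is a minimal normal-normal version of the model and the resulting partially pooled estimate; the commenter's Poisson / log-Normal case has the same structure with a different likelihood. These are the standard textbook forms, written with the variances and the overall mean treated as known for simplicity.

```latex
% Minimal normal-normal multilevel model: observations within group j,
% and group effects theta_j tied together by a common distribution.
\begin{align}
  y_{ij} &\sim \mathrm{N}(\theta_j, \sigma^2), \qquad i = 1, \dots, n_j, \\
  \theta_j &\sim \mathrm{N}(\mu, \tau^2),      \qquad j = 1, \dots, J.
\end{align}

% With sigma, tau, and mu treated as known, the posterior mean of theta_j
% is a precision-weighted compromise between the group mean and mu:
\begin{equation}
  \hat{\theta}_j
    = \frac{\dfrac{n_j}{\sigma^2}\,\bar{y}_j + \dfrac{1}{\tau^2}\,\mu}
           {\dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2}}.
\end{equation}
```

A group with a single observation is pulled strongly toward the overall mean, which is exactly where the borrowed information enters, and also exactly where a wrong similarity assumption does the most damage.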
