Kevin Van Horn sent me an email with the above title (ok, he wrote MRP, but it’s the same idea) and the following content:
I’m working on a problem that at first seemed like a clear case where multilevel modeling would be useful. As I’ve dug into it I’ve found that it doesn’t quite fit the usual pattern, because it seems to require a very difficult post-stratification.
Here is the problem. You have a question in a survey, and you need to estimate the proportion of positive responses to the question for a large number (100) of different subgroups of the total population at which the survey is aimed. The sample size for some of these subgroups can be rather small. If these were disjoint subgroups then this would be a standard multilevel modeling problem, but they are not disjoint: each subgroup is defined by one or two variables, but there are a total of over 30 variables used to define subgroups.
For example, if x[i], 1 <= i <= 30, are the variables used to define subgroups, subgroup i for i <= 30 might be defined as those individuals for which x[i] > 1, with the other subgroup definitions involving combinations of two or possibly three variables. Examples of these subgroup definitions include patterns such as
· x1 == 1 OR x2 == 1 OR x3 == 1
· (x1 == 1 OR x1 == 2) AND x3 < 4.
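To make the overlap concrete, here is a minimal sketch of the two subgroup patterns above written as boolean masks over respondents. The data values and the names `subgroups`, `membership`, and `overlap` are made up for illustration; nothing here comes from the actual survey.

```python
import numpy as np

# Hypothetical responses for 8 people on three ordinal variables coded 1..5.
x1 = np.array([1, 2, 3, 1, 5, 2, 4, 3])
x2 = np.array([2, 2, 1, 4, 5, 3, 3, 2])
x3 = np.array([3, 5, 2, 1, 4, 1, 4, 5])

# Each subgroup is a boolean mask over respondents.
subgroups = {
    "any_equals_1": (x1 == 1) | (x2 == 1) | (x3 == 1),
    "x1_low_and_x3_below_4": ((x1 == 1) | (x1 == 2)) & (x3 < 4),
}

# Rows are respondents, columns are subgroups. A row can have several True
# entries, which is the overlap that rules out a plain disjoint-group model.
membership = np.column_stack(list(subgroups.values()))
overlap = int((membership.sum(axis=1) > 1).sum())
print(membership.sum(axis=0), overlap)
```

In this toy data several respondents fall into both subgroups at once, so the subgroups cannot be treated as levels of a single grouping factor.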
You could do a multilevel regression with post-stratification, but that post-stratification step looks very difficult. It seems that you would need to model the 30-dimensional joint distribution for the 30 variables describing subgroups.
Have you encountered this kind of problem before, or know of some relevant papers to read?
I replied:
In your example, I agree that it sounds like it would be difficult to compute things on 2^30 cells or however many groups you have in the population. Maybe some analytic approach would be possible? What are your 30 variables?
And then he responded:
The 30+ variables are a mixture of categorical and ordinal survey responses indicating things like the person’s role in their organization, decision-making influence, familiarity with various products and services, and recognition of various ad campaigns. So you might have subgroups such as “people who recognize any of our ad campaigns,” or “people who recognize ad campaign W,” or “people with purchasing influence for product space X,” or more tightly defined subgroups such as “people with job description Y who are familiar with product Z.”
Here’s some more context. I’m looking for ways of getting better information out of tracking studies. In marketing research a tracking study is a survey that is run repeatedly to track how awareness and opinions change over time, often in the context of one or more advertising campaigns that are running during the study period. These surveys contain audience definition questions, as well as questions about familiarity with products, awareness of particular ads, and attitudes towards various products.
It’s hard to get clients to really understand just how large sampling error can be, so there tends to be a lot of upset and hand-wringing when they see an unexplained fluctuation from one month to the next. Thus, there’s significant value in finding ways to (legitimately) stabilize estimates.
Where things get interesting is when the client wants to push the envelope by
a) running surveys more often, but with a smaller sample size, so that the total number surveyed per month remains the same, or
b) tracking results for many different overlapping subgroups.
I’m seeing some good results for handling (a) by treating the responses in each subgroup over time as a time series and applying a simple state-space model with a binomial error model; this is based on the assumption that the quantities being tracked don’t typically change radically from one week to the next. This kind of modeling is less useful in the early stages of the study, however, when you don’t yet have much information on the typical degree of variation from one time period to the next. Multilevel modeling for (b) seems like a good candidate for the next improvement in estimation, and would help even in the early stages of the study, but as I mentioned, the post-stratification looks difficult.
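For readers who want to see what a state-space model with a binomial error model can look like, here is one possible sketch, not necessarily the model used in the email. It assumes a random walk on logit(p) and does the filtering on a grid. The function name `filter_proportions`, the random-walk scale `sigma`, and the example counts are all assumptions for illustration; in practice `sigma` would have to be estimated, which is exactly what is hard early in a study.

```python
import numpy as np

def filter_proportions(y, n, sigma=0.2, grid_size=401):
    """Forward-filter a random walk on logit(p) observed through binomial counts.

    y, n  : successes and sample sizes per period
    sigma : assumed random-walk standard deviation on the logit scale
            (a guess, not estimated here)
    Returns the filtered posterior mean of p for each period.
    """
    theta = np.linspace(-6.0, 6.0, grid_size)      # grid over logit(p)
    p = 1.0 / (1.0 + np.exp(-theta))
    # trans[i, j] is proportional to Normal(theta[j] | theta[i], sigma^2).
    diff = theta[None, :] - theta[:, None]
    trans = np.exp(-0.5 * (diff / sigma) ** 2)
    trans /= trans.sum(axis=1, keepdims=True)

    belief = np.full(grid_size, 1.0 / grid_size)   # flat prior on the grid
    means = []
    for y_t, n_t in zip(y, n):
        belief = belief @ trans                    # predict: diffuse the state
        log_lik = y_t * np.log(p) + (n_t - y_t) * np.log1p(-p)
        belief = belief * np.exp(log_lik - log_lik.max())  # binomial update
        belief /= belief.sum()
        means.append(float(belief @ p))
    return np.array(means)

# Made-up counts for one subgroup over four periods, 40 respondents each.
y = np.array([12, 18, 9, 15])
n = np.array([40, 40, 40, 40])
raw = y / n
filtered = filter_proportions(y, n)
print(raw, filtered)
```

The filtered series moves less from period to period than the raw proportions because each estimate borrows strength from earlier periods; a smaller `sigma` smooths more.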
Now here’s me again:
I see what you’re saying about the poststrat being difficult. In this case, one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells—for example, just use the proportion of respondents in the different categories in your sample—and then go from there. The point is that the poststrat would be giving you stability, even if it’s not matching quite to the population of interest.
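As a rough sketch of that starting point, suppose the cell sizes are simply taken from the sample counts. The cell values, counts, and fitted probabilities below are invented, and `poststratify` is a hypothetical helper; the fitted values stand in for whatever the multilevel model would produce.

```python
import numpy as np

# Hypothetical inputs: each row is one poststrat cell observed in the sample.
# cell_x   : values of the defining variables (two variables here)
# cell_n   : respondents in the cell, used as a stand-in for its population size
# cell_fit : model-based estimate of P(positive) for the cell (made-up numbers)
cell_x = np.array([[1, 1], [1, 2], [2, 1], [2, 2]])
cell_n = np.array([10, 30, 40, 20])
cell_fit = np.array([0.60, 0.50, 0.30, 0.20])

def poststratify(in_subgroup, weights, fitted):
    """Weighted average of fitted cell probabilities over cells in the subgroup."""
    w = weights * in_subgroup
    return float((w * fitted).sum() / w.sum())

# Subgroup "x1 == 1 OR x2 == 1" covers the first three cells.
mask = (cell_x[:, 0] == 1) | (cell_x[:, 1] == 1)
estimate = poststratify(mask, cell_n, cell_fit)
print(estimate)  # (10*0.6 + 30*0.5 + 40*0.3) / 80 = 0.4125
```

With sample counts as weights, only cells that actually appear in the sample get nonzero weight, so the sum runs over at most as many cells as there are respondents.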
And Van Horn came back with:
You write, “one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells.”
But there are millions of poststrat cells… Or are you thinking of doing some simple modeling of the distribution for the poststrat cells, e.g. treating the stratum-defining variables as independent?
That sounds like it could often be a workable approach.
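Here is a hedged sketch of the independence idea: rather than enumerating millions of cells, draw synthetic individuals from the product of the marginal distributions and average the model predictions over those who land in the subgroup. The functions `simulate_population` and `poststrat_independent`, the `predict` callback, and the toy marginals are assumptions for illustration, not a worked-out method.

```python
import numpy as np

def simulate_population(marginals, size, rng):
    """Draw synthetic individuals, treating the defining variables as independent.

    marginals : list of 1-D arrays; marginals[k][v] is the estimated share of
                category v + 1 for variable k (for example, sample proportions)
    """
    cols = [rng.choice(len(m), size=size, p=m) + 1 for m in marginals]
    return np.column_stack(cols)

def poststrat_independent(marginals, in_subgroup, predict, size=100_000, seed=0):
    """Monte Carlo poststratified estimate for one subgroup.

    in_subgroup : function mapping an (n, K) array to a boolean mask
    predict     : hypothetical stand-in for the fitted model; maps an (n, K)
                  array to P(positive) for each row
    """
    rng = np.random.default_rng(seed)
    x = simulate_population(marginals, size, rng)
    mask = in_subgroup(x)
    return float(predict(x[mask]).mean())

# Toy setup: two binary variables, each split 50/50, and a made-up model.
marginals = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
in_subgroup = lambda x: (x[:, 0] == 1) | (x[:, 1] == 1)
predict = lambda x: 0.2 + 0.2 * (x[:, 0] == 1) + 0.1 * (x[:, 1] == 1)

estimate = poststrat_independent(marginals, in_subgroup, predict)
print(estimate)  # close to the exact value (0.5 + 0.4 + 0.3) / 3 = 0.4
```

The cost scales with the number of draws rather than the number of cells, which is what makes it feasible with 30 variables. The independence assumption is the weak point: if the defining variables are strongly correlated, the simulated cell sizes will be off.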
Just to stir the pot, though . . . One could argue that a good solution should have good asymptotic behavior, in the sense that, in the limit of a large subgroup sample size, the estimate for the proportion should tend to the empirical proportion. Certainly if one of the subgroups is large, in which case one would expect the empirical proportion to be a good estimate for that subgroup, and my multilevel-model-with-poststrat gives an estimate that differs significantly from the “obvious” answer, this is likely to raise questions about the validity of the approach. It seems to me that, to achieve this asymptotic behavior, I’d need to be able to model the distribution of poststrat cells at arbitrary levels of detail as the sample size increases. This line of thought has me looking into Bayesian nonparametric modeling.
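The partial-pooling half of this asymptotic argument can be illustrated with a deliberately simple stand-in: a beta-binomial estimate for a single subgroup, not the full multilevel model and not the poststratification step. The function `shrunk_proportion` and the values of `prior_mean` and `prior_strength` are assumptions chosen for the illustration.

```python
def shrunk_proportion(y, n, prior_mean=0.3, prior_strength=20.0):
    """Posterior mean of a proportion under a Beta prior (partial pooling).

    prior_mean     : assumed overall proportion the subgroup is pulled toward
    prior_strength : assumed prior weight, in pseudo-observations
    """
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return (y + a) / (n + a + b)

# A subgroup whose empirical proportion stays at 0.5 as its sample grows.
sizes = [10, 100, 10_000]
estimates = [shrunk_proportion(0.5 * n, n) for n in sizes]
gaps = [abs(e - 0.5) for e in estimates]
print(estimates)
```

The weight on the data is n / (n + prior_strength), so the gap between the shrunk estimate and the empirical proportion vanishes as n grows. Whether the poststratified estimate has the same property depends on how well the cell distribution is modeled, which this sketch does not address.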