## “A hard case for Mister P”

Kevin Van Horn sent me an email with the above title (ok, he wrote MRP, but it’s the same idea) and the following content:

I’m working on a problem that at first seemed like a clear case where multilevel modeling would be useful. As I’ve dug into it I’ve found that it doesn’t quite fit the usual pattern, because it seems to require a very difficult post-stratification.

Here is the problem. You have a question in a survey, and you need to estimate the proportion of positive responses to the question for a large number (100) of different subgroups of the total population at which the survey is aimed. The sample size for some of these subgroups can be rather small. If these were disjoint subgroups then this would be a standard multi-level modeling problem, but they are not disjoint: each subgroup is defined by one or two variables, but there are a total of over 30 variables used to define subgroups.

For example, if x[i], 1 <= i <= 30, are the variables used to define subgroups, subgroup i for i <= 30 might be defined as those individuals for which x[i] > 1, with the other subgroup definitions involving combinations of two or possibly three variables. Examples of these subgroup definitions include patterns such as
· x1 == 1 OR x2 == 1 OR x3 == 1
· (x1 == 1 OR x1 == 2) AND x3 < 4

You could do a multilevel regression with post-stratification, but that post-stratification step looks very difficult. It seems that you would need to model the 30-dimensional joint distribution for the 30 variables describing subgroups. Have you encountered this kind of problem before, or know of some relevant papers to read?
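To make the overlap concrete, here is a minimal sketch of the two example subgroup definitions written as predicates, with the raw (unpooled) proportion computed for each. The respondent records, the outcome field `y`, and the subgroup names are all made up for illustration; only the two logical definitions come from the email.

```python
# Hypothetical toy data: one dict per respondent; "y" is the 0/1 survey response.
respondents = [
    {"x1": 1, "x2": 0, "x3": 2, "y": 1},
    {"x1": 2, "x2": 0, "x3": 3, "y": 0},
    {"x1": 0, "x2": 1, "x3": 5, "y": 1},
    {"x1": 3, "x2": 0, "x3": 1, "y": 0},
    {"x1": 2, "x2": 2, "x3": 4, "y": 1},
    {"x1": 1, "x2": 1, "x3": 1, "y": 1},
]

# The two subgroup definitions from the email, as membership predicates.
subgroups = {
    "any_of_x1_x2_x3": lambda r: r["x1"] == 1 or r["x2"] == 1 or r["x3"] == 1,
    "x1_in_1_2_and_x3_lt_4": lambda r: r["x1"] in (1, 2) and r["x3"] < 4,
}

def raw_estimates(rows, defs):
    """Return {name: (positives, sample size, raw proportion)} for each subgroup."""
    out = {}
    for name, is_member in defs.items():
        members = [r for r in rows if is_member(r)]
        n = len(members)
        k = sum(r["y"] for r in members)
        out[name] = (k, n, k / n if n else None)
    return out

estimates = raw_estimates(respondents, subgroups)

# Respondents who fall in both subgroups: the groups are not disjoint.
overlap = [r for r in respondents if all(f(r) for f in subgroups.values())]
```

Because the same respondent can appear in several subgroups, the subgroup estimates are not independent of one another, which is what takes this outside the usual disjoint-groups setup.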

I replied:

In your example, I agree that it sounds like it would be difficult to compute things on 2^30 cells or however many groups you have in the population. Maybe some analytic approach would be possible? What are your 30 variables?

And then he responded:

The 30+ variables are a mixture of categorical and ordinal survey responses indicating things like the person’s role in their organization, decision-making influence, familiarity with various products and services, and recognition of various ad campaigns. So you might have subgroups such as “people who recognize any of our ad campaigns,” or “people who recognize ad campaign W,” or “people with purchasing influence for product space X,” or more tightly defined subgroups such as “people with job description Y who are familiar with product Z.”

Here’s some more context. I’m looking for ways of getting better information out of tracking studies. In marketing research a tracking study is a survey that is run repeatedly to track how awareness and opinions change over time, often in the context of one or more advertising campaigns that are running during the study period. These surveys contain audience definition questions, as well as questions about familiarity with products, awareness of particular ads, and attitudes towards various products.

It’s hard to get clients to really understand just how large sampling error can be, so there tends to be a lot of upset and hand wringing when they see an unexplained fluctuation from one month to the next. Thus, there’s significant value in finding ways to (legitimately) stabilize estimates.

Where things get interesting is when the client wants to push the envelope by
a) running surveys more often, but with a smaller sample size, so that the total number surveyed per month remains the same, or
b) tracking results for many different overlapping subgroups.

I’m seeing some good results for handling (a) by treating the responses in each subgroup over time as a time series and applying a simple state-space model with a binomial error model; this is based on the assumption that the quantities being tracked don’t typically change radically from one week to the next. This kind of modeling is less useful in the early stages of the study, however, when you don’t yet have much information on the typical degree of variation from one time period to the next. Multilevel modeling for (b) seems like a good candidate for the next improvement in estimation, and would help even in the early stages of the study, but as I mentioned, the post-stratification looks difficult.
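The email does not say which state-space model is used, so the following is only one plausible reading: a random walk on the logit scale with binomial observations, filtered on a grid. The function name, the grid, the weekly counts, and above all the step size `sigma` are assumptions for illustration. In practice `sigma` would have to be estimated, which is exactly the difficulty in the early weeks of a study.

```python
import math

def filter_proportions(counts, sizes, sigma=0.3, grid_size=201):
    """Grid-based forward filter: logit-scale random walk, binomial observations.

    sigma is the assumed period-to-period sd on the logit scale (not estimated here).
    Returns the filtered posterior mean of the proportion at each time point.
    """
    # Discretize the latent logit on [-6, 6].
    grid = [-6.0 + 12.0 * i / (grid_size - 1) for i in range(grid_size)]
    probs = [1.0 / (1.0 + math.exp(-z)) for z in grid]

    # Gaussian random-walk transition matrix; each row is normalized to sum to 1.
    kernel = []
    for zi in grid:
        row = [math.exp(-0.5 * ((zj - zi) / sigma) ** 2) for zj in grid]
        total = sum(row)
        kernel.append([v / total for v in row])

    belief = [1.0 / grid_size] * grid_size  # flat prior over the logit grid
    means = []
    for t, (k, n) in enumerate(zip(counts, sizes)):
        if t > 0:
            # Predict: let the state drift by one random-walk step.
            belief = [sum(belief[i] * kernel[i][j] for i in range(grid_size))
                      for j in range(grid_size)]
        # Update: multiply by the binomial likelihood, computed on the log scale.
        loglik = [k * math.log(p) + (n - k) * math.log(1.0 - p) for p in probs]
        top = max(loglik)
        belief = [b * math.exp(ll - top) for b, ll in zip(belief, loglik)]
        norm = sum(belief)
        belief = [b / norm for b in belief]
        means.append(sum(b * p for b, p in zip(belief, probs)))
    return means

counts = [12, 18, 9, 15, 11]   # hypothetical weekly positive responses
sizes = [40, 40, 40, 40, 40]   # hypothetical weekly sample sizes
raw = [k / n for k, n in zip(counts, sizes)]
filtered = filter_proportions(counts, sizes)
```

With these made-up numbers the filtered series moves less from week to week than the raw proportions do, which is the kind of stabilization being described. A smaller `sigma` smooths more aggressively; a larger one tracks the raw data more closely.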

Now here’s me again:

I see what you’re saying about the poststrat being difficult. In this case, one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells—for example, just use the proportion of respondents in the different categories in your sample—and then go from there. The point is that the poststrat would be giving you stability, even if it’s not matching quite to the population of interest.

And Van Horn came back with:

You write, “one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells.”

But there are millions of poststrat cells… Or are you thinking of doing some simple modeling of the distribution for the poststrat cells, e.g. treating the stratum-defining variables as independent?

That sounds like it could often be a workable approach.
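As a rough sketch of the independence idea: take each cell's weight to be the product of the marginal sample proportions of its variables, then average the model's cell-level predictions over the cells that belong to the subgroup. Everything below is hypothetical (three variables instead of 30, made-up marginals, and a made-up `theta` standing in for the fitted multilevel model).

```python
import math
from itertools import product

# Hypothetical marginal sample proportions for three categorical variables.
marginals = {
    "x1": {0: 0.5, 1: 0.3, 2: 0.2},
    "x2": {0: 0.7, 1: 0.3},
    "x3": {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25},
}

def cell_weights(marginals):
    """List of (cell, weight); weight = product of marginals (independence assumption)."""
    names = list(marginals)
    cells = []
    for levels in product(*(marginals[v] for v in names)):
        cell = dict(zip(names, levels))
        weight = math.prod(marginals[v][cell[v]] for v in names)
        cells.append((cell, weight))
    return cells

def poststratify(theta, in_subgroup, marginals):
    """Weighted average of cell predictions theta(cell) over the subgroup's cells."""
    num = den = 0.0
    for cell, weight in cell_weights(marginals):
        if in_subgroup(cell):
            num += weight * theta(cell)
            den += weight
    return num / den

def theta(cell):
    # Hypothetical stand-in for a fitted model: additive effects on the logit scale.
    logit = -0.5 + 0.4 * (cell["x1"] == 1) + 0.3 * cell["x2"] - 0.1 * cell["x3"]
    return 1.0 / (1.0 + math.exp(-logit))

estimate = poststratify(theta, lambda c: c["x1"] in (1, 2) and c["x3"] < 4, marginals)
```

Full enumeration like this is only feasible for a handful of variables. Under independence, though, variables that appear neither in the subgroup definition nor in the model for the outcome sum out of both numerator and denominator, so only the variables that matter for a given subgroup need to be enumerated.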

Just to stir the pot, though . . . One could argue that a good solution should have good asymptotic behavior, in the sense that, in the limit of a large subgroup sample size, the estimate for the proportion should tend to the empirical proportion. Certainly if one of the subgroups is large, in which case one would expect the empirical proportion to be a good estimate for that subgroup, and my multilevel-model-with-poststrat gives an estimate that differs significantly from the “obvious” answer, this is likely to raise questions about the validity of the approach. It seems to me that, to achieve this asymptotic behavior, I’d need to be able to model the distribution of poststrat cells at arbitrary levels of detail as the sample size increases. This line of thought has me looking into Bayesian nonparametric modeling.
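The asymptotic property being asked for is easy to see in the simplest partial-pooling estimator, a beta-binomial posterior mean. This is a stand-in, not the multilevel-model-with-poststrat itself, and the prior mean and prior strength below are arbitrary assumptions.

```python
def pooled_estimate(k, n, prior_mean=0.4, prior_strength=20.0):
    """Beta-binomial posterior mean: the raw proportion k/n shrunk toward prior_mean.

    prior_strength acts like a number of pseudo-observations (assumed, not fitted).
    """
    return (k + prior_mean * prior_strength) / (n + prior_strength)

small = pooled_estimate(7, 10)         # small subgroup: pulled well away from 0.7
large = pooled_estimate(7000, 10000)   # large subgroup: essentially the raw 0.7
```

Here the pull toward the prior fades at rate `prior_strength / (n + prior_strength)`, so the estimate tends to the empirical proportion as the subgroup grows. The worry in the email is that poststratifying with misspecified cell sizes has no such guarantee unless the model for the cell distribution also becomes more detailed as data accumulate.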

Fun stuff.

1. Rahul says:

From the description this sounds like one of those cases which needs a brave, principled guy to tell the client that what they are expecting just cannot be done.

It seems awfully optimistic to use small samples and then expect stable estimates on the level of super-numerous, overlapping sub-groups, created by combinations of attributes and also expect the time-series of said estimates to respond measurably to an advertising campaign.

2. Nick Menzies says:

Can this be resolved by tackling the inference for each of the subgroups independently? If one was just interested in inferences for one group (out of the ~100 total), you would include the variables that defined that group, you would also want to include other variables that strongly predicted the outcome, but you would probably ignore a lot of the other dimensionality. Why not just iterate through this procedure for each of the 100 groups?

I realize this would make intergroup comparisons dangerous, but time trends within a particular group should be correct (and this seems to be the focus of inference if I am reading correctly).

Though I suggest this, I claim no MRP expertise, and am interested to hear from others as to whether this would fly.

3. Andrew McDavid says:

It seems like you need to make some simplifying assumptions for the 30-dimensional joint distribution. Of course, the simplest assumption is independence, but using log-linear/graphical models you can come up with a set of nested models ranging from complete independence (30 * |X| parameters, where |X| is the cardinality of the support of X) to complete saturation (|X|^30 parameters). Going the log-linear route, you could do model selection using the lasso and cross validation.
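To get a feel for the range being described, here is a small count of free parameters for three nested log-linear models, assuming (purely for illustration) that every variable has the same number of levels. Counting free parameters gives figures slightly below the round totals in the comment, since one level per variable is fixed by the sum-to-one constraint.

```python
from math import comb

def parameter_counts(num_vars=30, levels=2):
    """Free parameters for independence, all-two-way-interaction, and saturated models."""
    independence = num_vars * (levels - 1)
    pairwise = independence + comb(num_vars, 2) * (levels - 1) ** 2
    saturated = levels ** num_vars - 1
    return independence, pairwise, saturated

binary = parameter_counts(30, 2)      # 30 binary variables
four_level = parameter_counts(30, 4)  # 30 variables with four levels each
```

Even the all-pairs model stays in the hundreds or thousands of parameters, while the saturated model is already past a billion for binary variables, which is why some form of selection or penalization between the two extremes is attractive.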

4. Anonymous says:

If this makes no sense, forgive me. But could you in effect parse each case’s weight across all of its subgroups, with a case having more weight within a sparsely populated subgroup? So assume case i is a member of 5 different groups with 5, 10, 20, 40, and 100 responses in them. Case i’s weight would be allocated something like 100/175ths to the small group, 40/175ths to the next smallest group, etc. In a way this would be adding another level where case membership is assigned to the subgroups.
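One way to read the allocation in this comment (the comment says only "something like", so this is a guess): sort the case's groups by size and give the smallest group the share implied by the largest group's count, the second smallest the share implied by the second largest, and so on. The function name is made up.

```python
from fractions import Fraction

def allocate_weight(sizes):
    """Split one case's unit weight across its groups, favoring sparse groups.

    The i-th smallest group receives (size of the i-th largest group) / (sum of sizes).
    Returns weights in the same order as the input sizes.
    """
    total = sum(sizes)
    ascending = sorted(range(len(sizes)), key=lambda i: sizes[i])
    descending_sizes = sorted(sizes, reverse=True)
    weights = [None] * len(sizes)
    for rank, i in enumerate(ascending):
        weights[i] = Fraction(descending_sizes[rank], total)
    return weights

weights = allocate_weight([5, 10, 20, 40, 100])
```

For the group sizes in the comment this gives 100/175 to the group of 5 and 40/175 to the group of 10, matching the figures quoted, and the weights sum to one.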

5. zbicyclist says:

Isn’t this a bit complicated by the fact that the proportion for some of the stratification variables will change over time?

For example, “people who recognize any of our ad campaigns.”

6. D.O. says:

I have zero insight about MRP, but the problem may benefit most from clearer goals. What is it that the client wants to learn?

7. Corey says:

This line of thought has me looking into Bayesian nonparametric modeling.

Ooh, I smell Indian buffet! Bayesian nonparametrics are the tastiest nonparametrics.