
Regularized Prediction and Poststratification (the generalization of Mister P)

This came up in comments recently so I thought I’d clarify the point.

Mister P is MRP, multilevel regression and poststratification. The idea goes like this:

1. You want to adjust for differences between sample and population. Let y be your outcome of interest and X be your demographic and geographic variables you’d like to adjust for. Assume X is discrete so you can define a set of poststratification cells, j=1,…,J (for example, if you’re poststratifying on 4 age categories, 5 education categories, 4 ethnicity categories, and 50 states, then J=4*5*4*50, and the cells might go from 18-29-year-old no-high-school-education whites in Alabama, to over-65-year-old, post-graduate-education latinos in Wyoming). Each cell j has a population N_j from the census.

2. You fit a regression model y | X to data, to get a predicted average response for each person in the population, conditional on their demographic and geographic variables. You’re thus estimating theta_j, for j=1,…,J. The *regression* part of MRP comes in because you need to make these predictions.

3. Given point estimates of theta, you can estimate the population average as sum_j (N_j*theta_j) / sum_j (N_j). Or you can estimate various intermediate-level averages (for example, state-level results) using partial sums over the relevant subsets of the poststratification cells.

4. In the Bayesian version (for example, using Stan), you get a matrix of posterior simulations, with each row of the matrix representing one simulation draw of the vector theta; this then propagates to uncertainties in any poststrat averages.

5. The *multilevel* part of MRP comes in because you want to adjust for lots of cells j in your poststrat, so you’ll need to estimate lots of parameters theta_j in your regression, and multilevel regression is one way to get stable estimates with good predictive accuracy.
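Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under assumptions: the four cells, their populations, the two state labels, and the two "posterior draws" are made-up toy values, and the function names are hypothetical rather than taken from any package.

```python
from collections import defaultdict

def poststratify(theta, N):
    """Population-weighted average: sum_j N_j*theta_j / sum_j N_j."""
    return sum(n * t for n, t in zip(N, theta)) / sum(N)

def poststratify_by_group(theta, N, groups):
    """Partial sums over the cells that belong to each group (e.g. a state)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for t, n, g in zip(theta, N, groups):
        num[g] += n * t
        den[g] += n
    return {g: num[g] / den[g] for g in num}

def poststratify_draws(draws, N):
    """Poststratify each simulation draw (one row per draw of the vector theta)."""
    return [poststratify(row, N) for row in draws]

# Toy inputs (assumed): J = 4 cells spread over two states.
N = [100, 300, 200, 400]
groups = ["AL", "AL", "WY", "WY"]
theta_hat = [0.2, 0.4, 0.5, 0.7]

overall = poststratify(theta_hat, N)
by_state = poststratify_by_group(theta_hat, N, groups)

# Two made-up draws standing in for rows of a posterior simulation matrix.
draws = [[0.2, 0.4, 0.5, 0.7], [0.3, 0.3, 0.6, 0.6]]
overall_draws = poststratify_draws(draws, N)
```

The spread of `overall_draws` is what carries the uncertainty in theta through to the poststratified average; with real output from a fitted model there would be thousands of rows rather than two.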

OK, fine. The point is: poststratification is key. It’s all about (a) adjusting for many ways in which your sample isn’t representative of the population, and (b) getting estimates for population subgroups of interest.

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any *regularized prediction* method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.
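To make the "any regularized predictor" point concrete, here is a minimal sketch that swaps the multilevel regression for a very simple shrinkage rule: each cell's sample mean is pulled toward the grand mean with a fixed strength `k`. The function name, the value of `k`, and the toy data are all assumptions for illustration. With `k` read as a ratio of within-cell to between-cell variance, this is the posterior mean of a one-way normal model with known variances, so it is a stand-in for a fitted model, not a replacement for one.

```python
def shrunken_cell_means(y_by_cell, k=5.0):
    """Shrink each cell's sample mean toward the grand mean.

    y_by_cell: one list of observed outcomes per cell (a list may be empty).
    k: assumed shrinkage strength, acting like k pseudo-observations
       placed at the grand mean.
    """
    all_y = [y for cell in y_by_cell for y in cell]
    grand_mean = sum(all_y) / len(all_y)
    return [(sum(cell) + k * grand_mean) / (len(cell) + k) for cell in y_by_cell]

# Toy survey data (assumed): the third cell has no respondents at all.
y_by_cell = [[1, 1, 0, 1], [0], [], [1, 0, 0, 0, 0, 1]]
N = [100, 300, 200, 400]  # assumed census counts per cell

theta = shrunken_cell_means(y_by_cell, k=5.0)

# Poststratification step: same formula regardless of how theta was estimated.
estimate = sum(n * t for n, t in zip(N, theta)) / sum(N)
```

The empty cell simply receives the grand mean, and sparse cells are pulled strongly toward it, which is the stabilizing behavior that poststratification over many cells needs from whatever predictor is used.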


  1. With only four or five levels in the groups (like five education levels), it’s pretty much impossible to infer an informative hierarchical prior. How is the hyperprior on the hierarchical scale set and how sensitive are the effects such as education level to these scales?

    This post also reminded me to ask you if you’d ever fit the 50 states with a spatial model? We could use some kind of GP or ICAR prior for it rather than defining regional grouping (I was never sure how you came up with the regional groupings—that seems like another modeling degree of freedom that’s combinatorially intractable).

    • I fit Census PUMAs using their lat/long centroid and a smooth radial basis function, to develop a cost of living estimate based on housing costs and so forth. It sorta worked, but in Stan I had issues getting it to run without divergences. Also, with hundreds of thousands of observations from the ACS over a decade (it was also a time series), it ran pretty slowly. It was great though, don’t get me wrong, the inference made sense, I just really need to go back to it with some of the techniques developed for detecting and debugging divergences in the last year or so, and try it again… The kind of thing I’d be spending a lot of time doing if someone were paying me to do it ;-)

    • Andrew says:


      1. Yes, in the past we’ve been too casual about using noninformative prior distributions for the group-level scale parameters. I think it makes more sense to use informative priors. The same information that suggests using certain factors as predictors should also give us a sense of how much their coefficients should vary. Also we can think of all of this as an approximation to fitting a large model to data from several years, in which case there would be lots more information on the hyperparameters based on the variation in earlier years. The whole fit-each-dataset-from-scratch thing doesn’t generally make sense, and this is a point that Jennifer and I weren’t so clear on in our earlier book. One advantage of Stan and rstanarm is that it’s really easy to add informative priors.

      2. One could do spatial correlations of states. It always seemed clearer to me to just include relevant state-level predictors and groupings.

      • Dan C says:

        @Andrew @Bob Would you consider it a lost cause to try and estimate the hierarchical SD in the case described in Bob’s comment (i.e., say with only 3-4 levels in a group) in a cross-sectional design?

        Obviously, inferences would be very tentative in such cases, but is it worth the effort to apply a fairly regularizing prior, say a half-normal, on the SD? At least in my limited experience, it seems like these parameters are pretty sensitive to specification (or just lead to a bunch of divergent transitions).

  2. Justin says:

    Just to be certain I understand fully, in the example, you would stratify on 4*5*4*50=4000 cells. Predict on that, then sum at the state level, or overall. That’s pretty clear to me.

    In MRP researchers often use one macro variable; let’s say Republican support at the state level. In the multilevel regression, you would simply add a macro variable in the regression model.

    Now, you would still predict on the same 4000 rows of new data, but with one added column, Republican support, that takes 50 different values, each repeated 4*5*4=80 times in the prediction dataframe. Does that make sense?
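The prediction frame described in this comment can be sketched as follows. Only the category counts (4, 5, 4, 50) come from the post; the category labels, the placeholder state names, and the `rep_support` values are invented for illustration.

```python
from itertools import product

# Placeholder labels (assumed); only the counts 4, 5, 4, 50 come from the post.
ages = ["18-29", "30-44", "45-64", "65+"]
educations = ["no HS", "HS", "some college", "college", "post-grad"]
ethnicities = ["white", "black", "latino", "other"]
states = [f"state_{i:02d}" for i in range(1, 51)]

# Hypothetical state-level predictor: one made-up value per state.
rep_support = {s: 0.40 + 0.004 * i for i, s in enumerate(states)}

# One row per poststratification cell; the macro variable is looked up by
# state, so it is constant across all demographic cells within a state.
frame = [
    {"age": a, "educ": e, "eth": r, "state": s, "rep_support": rep_support[s]}
    for a, e, r, s in product(ages, educations, ethnicities, states)
]

rows_per_state = sum(1 for row in frame if row["state"] == states[0])
```

Here `len(frame)` is 4000 and each state's value appears in 80 rows, matching the arithmetic in the comment. The state-level variable changes the predictions but not the set of cells being summed over.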

  3. Brian Gawalt says:

    I like the MR-P framework a lot. But the “multilevel” aspect confuses me when I try to think about more than 2 traits.

    If we are investigating “likelihood of voting Dem based on voter’s income,” and we also have information about the voter’s state, it’s easy for me to state a model where the income coefficients for {Kansas, Nebraska, Wisconsin, …} voters can all be shrunk towards each other. I can adjust my model hyperparameters to require more shrinkage or allow less shrinkage, checking which results seem to make the most sense given the data we have. It’s a great way to encode my belief that voter behavior probably isn’t *wildly* different, state-to-state.

    But if I have state, income, and education, I get tripped up. I’d like to have the coefficient {middle-income, some college, Nebraska} be shrunk towards {mid-income, some college, Kansas}. But I also want to shrink {mid-income, some coll, NE} towards {high-inc, some coll, NE}. I’ve never figured out how to express that in a multilevel model. I believe that rich, no-college Nebraskans behave a lot like rich, no-college Kentuckians, but also that rich, no-college Nebraskans behave a lot like rich, college-degree Nebraskans.

    To think about it in this post’s terms — that the multilevel aspect is a special case of regularization — maybe what I’m after is to add a bunch of “fused Lasso” regularizers, like:

    min_theta loss(y, x, theta) +
    lambda_{same state} * |theta_{rich, college, NE} - theta_{low, college, NE}| +
    lambda_{same income} * |theta_{rich, college, NE} - theta_{rich, college, KY}| + ...

    But I haven’t been able to express that as a graphical model. Are there any tried-and-true examples of this kind of “partially pool across several dimensions” I should look at?
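One way to read the penalty proposed in this comment is as a sum over every pair of cells that differ in exactly one factor. The sketch below computes only that penalty term, not the full objective, and makes its own assumptions: the function name is hypothetical, the weights are indexed by the factor that differs (the comment instead labels them by a factor the pair shares), and the theta values are made up.

```python
from itertools import combinations

def fused_penalty(theta, lam):
    """Sum of lam[d] * |theta_a - theta_b| over pairs differing only in factor d.

    theta: dict mapping a cell tuple, e.g. (income, education, state), to a value.
    lam:   one weight per factor position; lam[d] applies to pairs of cells
           that agree on every factor except position d.
    """
    total = 0.0
    for a, b in combinations(theta, 2):
        differing = [d for d in range(len(a)) if a[d] != b[d]]
        if len(differing) == 1:
            total += lam[differing[0]] * abs(theta[a] - theta[b])
    return total

# Made-up cell values for a 2 x 1 x 2 layout (income, education, state).
theta = {
    ("rich", "college", "NE"): 0.30,
    ("low", "college", "NE"): 0.50,
    ("rich", "college", "KY"): 0.25,
    ("low", "college", "KY"): 0.55,
}
lam = (1.0, 1.0, 2.0)  # assumed weights for income, education, state

penalty = fused_penalty(theta, lam)
```

With these weights, differences across states are penalized twice as heavily as differences across income groups, so a fit that minimizes the loss plus this penalty would pool more strongly across states. Pairs that differ in two or more factors contribute nothing directly.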

    Thanks much for this post!

    (Also: We have a copy of this book sitting around my house and every time it catches my eye I think, “oh nice, Mr. P finally finished his degree.” )

    • Andrew says:


      See this paper and this one for examples of MRP with many factors.

    • MJT says:

      one thing that’s confusing when using the umbrella term ‘regularization’ is that some people are thinking about regularizing specific components of the right-hand-side term, say mean coefficien

      • MJT says:

        chopped off reply part 2

        while some people use the term to mean regularizing between the outcome MLE (LHS term) and a model (set of RHS terms)

        as in the comment above focusing on fused lasso for mean components to contrast with the mlm structure to regularize the outcome.

        of course the two concepts are related, it’s just helpful to think about what aspect you’re regularizing

        I’m about to post the summary review writeup onto arXiv or someplace, and people’s discussion about this would be great

        • MJT says:

          This post nudged me to put my opinionated intro summary to Small Area Estimation on arXiv. It goes through the utility of multilevel models and regularization. Would love to hear comments and critiques on how MrP fits into Small Area Estimation, and thoughts on Composite Estimators.

          As you can probably tell, a lot of the in-writeup themes are from this blog

          The way I framed it is, in the small area estimation applications, you typically follow a 3 step process

          1) Calibration 2) MLM Prediction 3) Benchmark
          I presented MrP as somewhere in between 2 and 3, say 2.5

          I think it’s a viable option to do 1, 2.5, and 3. If you have individual units i, strata cells s, ‘small groups’ c, and larger groups L, then you can fit a regularized MLM prediction for each cell s and post-stratify the cells into small groups c. This results in c-many MrP predictions. Then, you can optionally consider benchmarking the c-many MrP predictions at the larger groupings L. I think this would need to respect the nesting of the various resolutions.
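The poststratify-then-benchmark part of this proposal might look like the sketch below. It makes several assumptions not in the comment: benchmarking is done by simple ratio adjustment (one common choice, and not necessarily what the commenter has in mind), the cell predictions, populations, group labels, and the target for the larger group are toy values, and the function names are hypothetical. The calibration step is not shown.

```python
from collections import defaultdict

def weighted_means(values, weights, labels):
    """Weighted mean of values within each label, plus total weight per label."""
    num, den = defaultdict(float), defaultdict(float)
    for v, w, g in zip(values, weights, labels):
        num[g] += w * v
        den[g] += w
    return {g: num[g] / den[g] for g in num}, dict(den)

def ratio_benchmark(est_c, pop_c, parent_of, target_L):
    """Rescale small-group estimates so each larger group hits its target."""
    names = list(est_c)
    current_L, _ = weighted_means(
        [est_c[c] for c in names],
        [pop_c[c] for c in names],
        [parent_of[c] for c in names],
    )
    return {
        c: est_c[c] * target_L[parent_of[c]] / current_L[parent_of[c]]
        for c in names
    }

# Toy inputs (assumed): four cells s nested in two small groups c,
# both nested in a single larger group L1.
theta_s = [0.2, 0.4, 0.5, 0.7]      # regularized cell predictions
N_s = [100, 300, 200, 400]          # cell populations
cell_to_c = ["c1", "c1", "c2", "c2"]
parent_of = {"c1": "L1", "c2": "L1"}
target_L = {"L1": 0.50}             # external benchmark for the larger group

# Poststratify cells into small groups, then benchmark against L.
est_c, pop_c = weighted_means(theta_s, N_s, cell_to_c)
benchmarked = ratio_benchmark(est_c, pop_c, parent_of, target_L)

# Re-aggregate the benchmarked estimates to confirm they match the target.
check_L, _ = weighted_means(
    [benchmarked[c] for c in benchmarked],
    [pop_c[c] for c in benchmarked],
    [parent_of[c] for c in benchmarked],
)
```

Because every small group belongs to exactly one larger group, the rescaling respects the nesting mentioned above. One caveat of ratio adjustment is that it can push proportions outside [0, 1] when the target is far from the unbenchmarked aggregate.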

          Composite estimators:
          Also would be curious to hear about ‘composite’ predictions.
          In this blog, I’ve seen the idea of folding previous predictions into a secondary mlm model floated out a few times. I feel it falls under that category of ‘composite’ estimators which actually motivated SAE models.
