Mister P goes on a date

I recently wrote something on the much-discussed OK Cupid analysis of political attitudes of a huge sample of people in their dating database. My quick comment was that their analysis was interesting, but participants on an online dating site must certainly be far from a random sample of Americans.

But suppose I want to not just criticize but also think in a positive direction. OK Cupid’s database is huge, and one thing statistical methods are good at–Bayesian methods in particular–is combining a huge amount of noisy, biased data with a smaller amount of good data. This is what we did in our radon study, using a high-quality survey of 5000 houses in 125 counties to calibrate a set of crappier surveys totaling 80,000 houses in 3000 counties.
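To make the calibration idea concrete, here is a minimal sketch. It is a much simpler stand-in for the hierarchical model used in the radon study, not that model itself: it only estimates an average offset between the noisy and the good measurements in the counties where both exist, then applies that correction everywhere. The county names, the numbers, and the assumption of a constant additive bias on the log scale are all made up for illustration.

```python
from statistics import mean

# Made-up county-level measurements on the log scale.
# "good" stands in for the small high-quality survey,
# "noisy" for the large, biased one covering more counties.
good = {"A": 1.0, "B": 1.5}
noisy = {"A": 1.4, "B": 1.8, "C": 2.0, "D": 0.9}

# Estimate the bias from counties that appear in both sources.
overlap = [county for county in noisy if county in good]
offset = mean(noisy[county] - good[county] for county in overlap)

# Apply the estimated correction to every noisy measurement,
# including counties the good survey never reached.
calibrated = {county: value - offset for county, value in noisy.items()}

print(round(offset, 2))
print({county: round(value, 2) for county, value in calibrated.items()})
```

A real analysis would let the bias and the noise level vary and would partially pool the county estimates rather than subtracting a single number.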

How would it work for OK Cupid? We’d want to take their data and poststratify on:

Age
Sex
Marital/family status
Education
Income
Partisanship
Ideology
Political participation
Religion and religious attendance
State
Urban/rural/suburban
Probably some other key variables that I’m not thinking of right now.

We’d do multilevel regression and poststratification (MRP, “Mister P”), with enough cells that it’s reasonable to think of the OK Cupid people as being a random sample within each cell. This is not a trivial project–it would also involve bringing in Census data and large public opinion surveys such as Annenberg or Pew–but it could be worth it. The goal would be to get the flexibility and power of the OK Cupid analyses, but with the warm feelings that come from matching their sample to the U.S. population.
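As a rough sketch of the poststratification step only: suppose the multilevel regression has already produced an estimate for each cell. The population estimate is then the average of the cell estimates, weighted by how many people each cell has in the population rather than in the sample. The two-variable cells, the estimates, and the counts below are all hypothetical, and `poststratify` is an illustrative helper, not a function from any package.

```python
# Hypothetical cell estimates (share holding some attitude), as might
# come out of a multilevel regression. Cells are (age group, sex).
cell_estimate = {
    ("18-29", "F"): 0.60,
    ("18-29", "M"): 0.50,
    ("30+", "F"): 0.40,
    ("30+", "M"): 0.30,
}

# Hypothetical sample sizes: the dating-site sample skews young.
sample_count = {
    ("18-29", "F"): 400,
    ("18-29", "M"): 400,
    ("30+", "F"): 100,
    ("30+", "M"): 100,
}

# Hypothetical population counts (in practice, from Census data).
population_count = {
    ("18-29", "F"): 100,
    ("18-29", "M"): 100,
    ("30+", "F"): 400,
    ("30+", "M"): 400,
}


def poststratify(estimates, counts):
    """Average the cell estimates, weighting each cell by its count."""
    total = sum(counts[cell] for cell in estimates)
    return sum(estimates[cell] * counts[cell] for cell in estimates) / total


raw = poststratify(cell_estimate, sample_count)           # weights = sample
adjusted = poststratify(cell_estimate, population_count)  # weights = population

print(round(raw, 2), round(adjusted, 2))
```

With these made-up numbers the sample-weighted figure is 0.51 and the population-weighted figure is 0.39, which shows how much an unrepresentative mix of cells can move the headline number even when every cell estimate is right.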

Inferences would necessarily be strongly model-based–for example, any claims about married people would be essentially 100% based on regression-based extrapolation–but, hey, that’s the way it is. The goal is to be as honest as possible with the data available.

2 thoughts on “Mister P goes on a date”

  1. Very convenient indeed – that it's reasonable to think of the OK Cupid people as being a random sample within each cell – any way to check?

    And thanks for this "multilevel regression and poststratification (MRP, "Mister P")" – I was never sure what the "Mister P" actually stood for ;-)

    K?

  2. I'm curious how the cells would be set up. You say you would need large public opinion surveys (Annenberg or others), would those serve as the population values for ideology and religion? How do you combine those with Census data?

    I've read your work on MRP in ARM and other places, and I'm having trouble understanding how you would use two different sources of data for the poststratification.
