David Radwin asks a question which comes up fairly often in one form or another:
How should one respond to requests for statistical hypothesis tests for population (or universe) data?
I [Radwin] first encountered this issue as an undergraduate when a professor suggested a statistical significance test for my paper comparing roll call votes between freshman and veteran members of Congress. Later I learned that such tests apply only to samples because their purpose is to tell you whether the difference in the observed sample is likely to exist in the population. If you have data for the whole population, like all members of the 103rd House of Representatives, you do not need a test to discern the true difference in the population.
Sometimes researchers assume some sort of superpopulation like “all possible Congresses” or “Congresses across all time” and that the members of any given Congress constitute a sample. In my current work in education research, it is sometimes asserted that the students at a particular school or set of schools are a sample of the population of all students at similar schools nationwide. But even if such a population existed, it is not credible that the observed population is a representative sample of the larger superpopulation.
Can you suggest resources that might convincingly explain why hypothesis tests are inappropriate for population data?
First, let me set aside any concerns about hypothesis testing vs. other forms of inference. To keep things simple, I will consider estimates and standard errors.
In some settings, we can all agree that if you have the whole population, your standard error is zero. This is basic finite-population inference from survey sampling theory, applicable when your goal is to estimate the population average or total.
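To make that concrete, here is a toy sketch (with made-up numbers standing in for, say, one value per state). The finite-population correction multiplies the usual sampling variance by (1 - n/N), so when the "sample" is the entire population (n = N), the standard error is exactly zero:

```python
import random

# Hypothetical population: one made-up value per state.
random.seed(0)
population = [random.gauss(50, 10) for _ in range(50)]

# With the whole population in hand, the mean is known exactly.
pop_mean = sum(population) / len(population)

# Finite-population standard error for a sample of size n drawn
# without replacement from a population of size N:
#   sqrt((1 - n/N) * s^2 / n)
# When n = N, the correction factor (1 - n/N) is zero.
def fpc_standard_error(sample, N):
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return ((1 - n / N) * s2 / n) ** 0.5

se_full = fpc_standard_error(population, N=50)  # n = N, so this is 0.0
```

The function name and the numbers here are illustrative, not from any real dataset; the point is only that descriptive uncertainty vanishes once the whole population is observed.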
Let’s consider regressions. (And the comparison between freshman and veteran members of Congress, at the very beginning of the above question, is a special case of a regression on an indicator variable.)
You have the whole population: all the congressmembers, all 50 states, whatever. You run a regression and you get a standard error. Maybe the estimated coefficient is only 1 standard error from 0, so it’s not “statistically significant.” But what does that mean, if you have the whole population?
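A minimal sketch of the mechanics, using simulated (entirely made-up) data: regress an outcome on a freshman indicator for a full "population" of 435 members. The software reports a standard error regardless, computed from the residual variance; nothing in the formula knows or cares that every member is in the data, so it is up to us to say what replication that standard error refers to.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical full "population": 435 members, a freshman indicator,
# and a made-up roll-call outcome. None of this is real data.
n = 435
freshman = rng.integers(0, 2, size=n)
score = 0.5 * freshman + rng.normal(0.0, 1.0, size=n)

# Ordinary least squares by hand: X = [intercept, freshman indicator].
X = np.column_stack([np.ones(n), freshman])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# Classical standard errors from the residual variance -- computed and
# reported even though every member of the population is in the data.
resid = score - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
```

The coefficient on `freshman` is the comparison of freshman and veteran members mentioned at the top of the question, recast as a regression on an indicator variable.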
You can still consider the cases in which the regression will be used for prediction. For example, you have all 50 states, but you might use the model to understand these states in a different year.
Consider my papers with Gary King on estimating seats-votes curves (see here and here). We had data from the entire population of congressional elections in each year, but we got our standard error not from the variation between districts but rather from the unexplained year-to-year variation of elections within districts.
To put it another way, we would have gotten the wrong answer if we had tried to get uncertainties for our estimates by “bootstrapping” the 435 congressional elections. We wanted inferences for these same 435 districts under hypothetical alternative conditions, not inference for the entire population or for another sample of 435. (We did make population inferences, but that was to estimate the hyperparameters governing our inferences about individual district outcomes under hypothetical national swings.)
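To see why the two uncertainties differ, here is a hedged sketch (with fabricated district-level numbers) of what bootstrapping the districts would compute. Resampling the 435 districts quantifies cross-sectional variation across re-drawn sets of districts, which is not the replication of interest here:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up district-level vote shares for a single "year".
districts = rng.normal(0.55, 0.10, size=435)

# Bootstrapping the districts answers: "how would the estimate vary
# across re-drawn sets of 435 districts?" -- between-district variation.
boot_means = np.array([
    rng.choice(districts, size=districts.size, replace=True).mean()
    for _ in range(2000)
])
se_bootstrap = boot_means.std(ddof=1)

# If instead the question concerns these same 435 districts under
# hypothetical alternative conditions (e.g., a different election year),
# the relevant uncertainty is the year-to-year variation within
# districts, which this resampling scheme never touches.
```

The numbers and the estimand here are illustrative only; the point is that the bootstrap answers a question about a different replication than the one the seats-votes analysis needed.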
Sometimes it’s worth making the effort to think carefully about what replications you’re interested in. It’s sort of like the WWJD principle in causal inference: if you think seriously about your replications (with the goal of getting the right standard error), you might well get a better understanding of what you’re trying to do with your model. At least, that worked for us in the seats-votes example. Formalizing one’s intuitions, and then struggling through the technical challenges, can be a good thing.
P.S. I just reread the lexicon. I’d forgotten about the Foxhole Fallacy. That’s a good one!