David Radwin asks a question which comes up fairly often in one form or another:

How should one respond to requests for statistical hypothesis tests for population (or universe) data?

I [Radwin] first encountered this issue as an undergraduate when a professor suggested a statistical significance test for my paper comparing roll call votes between freshman and veteran members of Congress. Later I learned that such tests apply only to samples because their purpose is to tell you whether the difference in the observed sample is likely to exist in the population. If you have data for the whole population, like all members of the 103rd House of Representatives, you do not need a test to discern the true difference in the population.

Sometimes researchers assume some sort of superpopulation like “all possible Congresses” or “Congresses across all time” and that the members of any given Congress constitute a sample. In my current work in education research, it is sometimes asserted that the students at a particular school or set of schools are a sample of the population of all students at similar schools nationwide. But even if such a population existed, it is not credible that the observed population is a representative sample of the larger superpopulation.

Can you suggest resources that might convincingly explain why hypothesis tests are inappropriate for population data?

My reply:

First let me set aside any concerns about hypothesis testing vs. other forms of inference. To keep things simple, I will consider estimates and standard errors.

Sometimes we can all agree that if you have a whole population, your standard error is zero. This is basic finite population inference from survey sampling theory, if your goal is to estimate the population average or total.
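To make that finite-population point concrete, here is a minimal sketch (with made-up scores, using only Python’s standard library) of the usual finite population correction: the standard error of a sample mean shrinks to exactly zero once the sample covers the whole population.

```python
import statistics

def mean_se_with_fpc(sample, pop_size):
    """Standard error of a sample mean under simple random sampling
    without replacement, with the finite population correction (FPC)."""
    n = len(sample)
    fpc = 1 - n / pop_size          # fraction of the population NOT sampled
    return (statistics.variance(sample) / n * fpc) ** 0.5

scores = [88, 92, 75, 81, 95, 70, 84, 79]

# A sample of 8 from a population of 1000: the correction barely matters.
print(mean_se_with_fpc(scores, 1000))
# The same 8 values as a complete census (N = 8): the SE is exactly zero.
print(mean_se_with_fpc(scores, 8))   # -> 0.0
```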

Let’s consider regressions. (And the comparison between freshman and veteran members of Congress, at the very beginning of the above question, is a special case of a regression on an indicator variable.)

You have the whole population (all the congressmembers, all 50 states, whatever); you run a regression and you get a standard error. Maybe the estimated coefficient is only 1 standard error from 0, so it’s not “statistically significant.” But what does that mean, if you have the whole population?

You can still consider the cases in which the regression will be used for prediction. For example, you have all 50 states, but you might use the model to understand these states in a different year.
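A quick simulation (entirely hypothetical data and effect sizes) can show what that prediction framing means: fit a regression on the complete set of 50 states in one year, and the estimated slope still differs from what another year’s data would give, because the year-specific noise re-randomizes. That variation is the replication the standard error is about.

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: each of the 50 states has a fixed predictor x
# (say, an economic index) and an outcome combining a linear signal
# with year-specific noise.
x = [random.uniform(0, 10) for _ in range(50)]
true_slope = 1.5

def outcomes(year_seed):
    """One year's outcomes for all 50 states: signal plus that year's noise."""
    rng = random.Random(year_seed)
    return [true_slope * xi + rng.gauss(0, 3) for xi in x]

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Fit on one year's complete "population" of 50 states...
slope_2020 = ols_slope(x, outcomes(2020))
# ...and the estimate still differs from what another year would give.
slope_2021 = ols_slope(x, outcomes(2021))
```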

Consider my papers with Gary King on estimating seats-votes curves (see here and here). We had data from the entire population of congressional elections in each year, but we got our standard error not from the variation between districts but rather from the unexplained year-to-year variation of elections within districts.

To put it another way, we would’ve got the wrong answer if we had tried to get uncertainties for our estimates by “bootstrapping” the 435 congressional elections. We wanted inferences for these 435 under hypothetical alternative conditions, *not* inference for the entire population or for another sample of 435. (We did make population inferences, but that was to estimate the hyperparameters that governed our inferences over individual district outcomes under hypothetical national swings.)

Sometimes it’s worth making the effort to think carefully about what replications you’re interested in. It’s sort of like the WWJD principle in causal inference: if you think seriously about your replications (for the goal of getting the right standard error), you might well get a better understanding of what you’re trying to do with your model. At least, that worked for us in the seats-votes example. Formalizing one’s intuitions, and then struggling through the technical challenges, can be a good thing.

P.S. I just reread the lexicon. I’d forgotten about the Foxhole Fallacy. That’s a good one!

Isn’t this a good case for your heuristic of reversing the argument? If you don’t estimate the uncertainty in your analysis, then you are assuming that the data and your treatment of it are perfectly representative for the purposes of all the conclusions you draw. This is unlikely to be the case – as only very rarely are people able to restrict conclusions to descriptions of the data at hand. The point that “it is not credible that the observed population is a representative sample of the larger superpopulation” is important because this is probably always true in practice – how often do you get a sample that is perfectly representative? You nearly always want some measure of uncertainty – though it can sometimes be tough to figure out the right one.

“Can you suggest resources that might convincingly explain why hypothesis tests are inappropriate for population data?” No, since that isn’t true – at least for the examples of a “population” that you give, and that people usually have in mind when they ask this question.

WHY are you looking at freshman versus veteran members of Congress? Why not members whose names start with a vowel versus members whose names start with a consonant? I’m pretty sure the reason is that you want to draw some conclusions about how members behave because they are freshmen or veterans. So ask yourself: if you were looking at a much smaller legislative body, with only 10 members, would you be equally confident in your conclusions about how freshmen and veterans behave? I hope not. And the reason is that the standard errors would be much larger with only 10 members.

In my role as the biostatistics ‘expert’ where I work, I sometimes get hit with this attitude that confidence intervals (or hypothesis tests) are not appropriate for “population” data. Many people with this attitude are outspokenly dogmatic about it; the irony is that they claim this is the dogma of statistical theory, yet the people making this claim never have a background in statistical theory.

Here’s how I try to explain it (using education research as an example). If your goal is non-scientific, then you may not need to consider variation. Say, for example, you want to award a prize to the school that had the highest average score on a standardized test. Then you would just use the mean scores; you would not do a test to see whether the better-performing school was ‘significantly’ better than the other.

But let’s say that you are doing some research in which your outcome variable is the score on this standardized test. For example, you may want to determine whether students in schools with blue-painted walls do better than students in schools with red-painted walls. Student scores will be determined by many factors: wall color (possibly), students’ raw ability, their family life, their social life, their interaction with other students, the skill of their teachers, the nature of the test (which will not be a completely reliable measure of ability), how they felt on the day of the test, how their teachers felt in the week before the test, etc. With any imagination you can write a list of a few dozen things that will affect student scores. Most of these things can’t be measured, and even if they could be, most won’t be included in your analysis model; they will be subsumed in the error term.

In short, student score will be determined by wall color, plus a few confounders that you do measure and model, plus random variation. This will be true if you have drawn a random sample of students (in which case the error term includes sampling error), or if you have measured all the students in the world. Even if you have ‘population’ data, you can’t assess the influence of wall color unless you take the randomness in student scores into account.
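A small simulation (with an invented wall-color effect of 2 points and invented noise) illustrates the point: even when every run measures *all* the students, the estimated effect varies from one realization of the unmeasured factors to the next, so the randomness in the error term is real and must be accounted for.

```python
import random
import statistics

random.seed(1)

def estimated_wall_effect(n_per_group=500, wall_effect=2.0, noise_sd=15.0):
    """Simulate scores for every student in a complete 'population':
    half in blue-walled schools (true effect applied), half in red.
    All the unmeasured factors are lumped into the Gaussian noise term."""
    blue = [70 + wall_effect + random.gauss(0, noise_sd) for _ in range(n_per_group)]
    red = [70 + random.gauss(0, noise_sd) for _ in range(n_per_group)]
    return statistics.mean(blue) - statistics.mean(red)

# Re-running the "census" under fresh draws of the unmeasured factors
# gives noticeably different estimates each time.
estimates = [estimated_wall_effect() for _ in range(200)]
print(statistics.mean(estimates))   # centered near the true effect, 2.0
print(statistics.stdev(estimates))  # but with genuine run-to-run spread
```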

When you are doing research, you are typically interested in the underlying factors that lead to the outcome. The influence of these factors is never manifested without random variation.

Why do a hypothesis test? So that you can say “the probability that I would have gotten data this extreme or more extreme, given that the hypothesis is actually true, is such-and-such”?

What good does that do? First, you are making the implausible assumption that the hypothesis is actually true, when we know in real life that there are very, very few (point) hypotheses that are actually true, so (as Herman Rubin has often remarked), you don’t actually need any data at all to say with a high degree of confidence that the hypothesis you are testing isn’t true.

That’s empty.

Second, once you get your number, what, substantively, are you going to do with it?

The reason you might consider hypothesis testing is that you have a decision to make, that is, there are several actions under consideration, and you need to choose the best action to take. “Best” means that there are costs or benefits, i.e., a loss function, that also affect the action you’ll take. That means you shouldn’t be using hypothesis testing (which doesn’t take actions or losses into account at all); you should be using decision theory.

But then, as we know, it doesn’t matter if you choose to use frequentist or Bayesian decision theory, for as long as you stick to admissible decision rules (as is recommended), the frequentist decision rules will have a one-to-one correspondence with the Bayesian rules. The exceptions to this generally do not arise in practice.

So, ditch hypothesis testing. Go with decision theory. It’s harder, and requires careful consideration of all of the assumptions, but it’s the only sensible thing to do.
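As a toy illustration of that framing (the loss numbers and the product-stocking scenario are invented), decision theory replaces “is the difference significant?” with “which action has the smaller expected loss?”:

```python
def best_action(p_a_better, loss_if_a_wrong=4.0, loss_if_b_wrong=1.0):
    """Choose between stocking product A or product B by expected loss,
    given a posterior probability that A is really better.
    The asymmetric loss values are arbitrary illustration numbers."""
    expected_loss_a = (1 - p_a_better) * loss_if_a_wrong  # chose A, but B was better
    expected_loss_b = p_a_better * loss_if_b_wrong        # chose B, but A was better
    return "A" if expected_loss_a < expected_loss_b else "B"

# With these losses, choosing A requires being quite sure A is better:
print(best_action(0.9))   # -> A
print(best_action(0.6))   # -> B: 60% "confidence" isn't enough given the losses
```

Note that no significance threshold appears anywhere: the cutoff falls out of the losses themselves, which is exactly the sense in which the hard part is stating one’s assumptions.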

This is a question we get all the time, so I’m going to provide a typical context and a typical response.

Suppose you have weekly sales data for all stores of retail chain X, for brands A and B, for a year – 104 numbers. This is a meaningful population in itself. There is no sampling. If A sells 101 units per week and B sells 100.5 units per week, A sells more.

But there is still variability. The sales may be very steady (s=10) or they may be very variable (s=120) on a week to week basis.

It’s entirely meaningful to look at the difference in the means of A and B relative to those standard deviations, and relative to the uncertainty around those standard deviations (since the retail chain may have 4 stores or 4000).

We might, for example, divide chains into 3 groups: those where A sells “significantly” more than B, where B sells “significantly” more than A, and those that are roughly equal.
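A sketch of that three-way split (the two-sample z rule and the 2-standard-error cutoff are illustrative choices, not a recommendation):

```python
import statistics

def classify_chain(sales_a, sales_b, z=2.0):
    """Classify a chain by comparing mean weekly sales of A and B
    relative to the week-to-week variability (a rough two-sample z rule)."""
    n_a, n_b = len(sales_a), len(sales_b)
    diff = statistics.mean(sales_a) - statistics.mean(sales_b)
    se = (statistics.variance(sales_a) / n_a
          + statistics.variance(sales_b) / n_b) ** 0.5
    if diff > z * se:
        return "A ahead"
    if diff < -z * se:
        return "B ahead"
    return "roughly equal"

# Steady sales: a 1-unit weekly edge for A is unambiguous.
print(classify_chain([101, 102] * 26, [100, 101] * 26))   # -> A ahead
# Highly variable sales: a 0.5-unit edge is lost in the noise.
print(classify_chain([90, 112] * 26, [80, 121] * 26))     # -> roughly equal
```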

This is an issue that comes up fairly regularly in medicine. For example, you have all the inpatient or emergency room visits for a state over some period of time. You can look at year-to-year variation, but can you also posit a prior that each visit is, say, a Bernoulli trial with some probability of happening? This advice was given to medical education researchers in 2007: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1940260/pdf/1471-2288-7-35.pdf
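A toy version of that Bernoulli framing (the population size and visit probability are invented): even a complete count of every visit fluctuates from year to year, by roughly the binomial standard deviation.

```python
import random
import statistics

random.seed(42)

# Invented numbers: 100,000 residents, each with a 0.002 chance of an
# ER visit per year (independent Bernoulli trials).
n_people, p_visit = 100_000, 0.002

def yearly_total():
    """Complete count of visits for one year -- a full 'population' tally."""
    return sum(1 for _ in range(n_people) if random.random() < p_visit)

totals = [yearly_total() for _ in range(30)]

# Binomial SD implied by the Bernoulli model: sqrt(n * p * (1 - p)).
binomial_sd = (n_people * p_visit * (1 - p_visit)) ** 0.5
# The year-to-year spread in the complete counts comes out close to it:
print(statistics.stdev(totals))
```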

The link above is discouraging. It concludes, “Until a better case can be made, researchers can follow a simple rule. If they are studying an entire population (e.g., all program directors, all deans, all medical schools) and they are requesting factual information, then they do not need to perform statistical tests. Reporting percentages is sufficient and proper.”

How can such a simple issue be sooooo misunderstood?

Occasionally, the above advice may be correct. An example would be when the survey asks how many researchers are at the institution, and the purpose is to take the total amount of government research grants, divide by the total number of researchers, to see how much money was available per researcher. There is no point in computing any standard error for the number of researchers (assuming one believes that all the answers were correct), or considering that that number might have been something else.

I think such purposes are uncommon, however. More commonly, the purpose of the survey is such that standard errors ARE appropriate. For example, if the survey asks what the institution’s faculty/student ratio is, and what fraction of students graduate, and you then go on to compute a correlation between these, you DO need to ask whether any non-zero correlation found is statistically significant.

The paper linked to above does not consider the purposes of the studies it looks at, so it is clear that they don’t understand the issue.

Radford:

Perhaps rather than asking “what are the real questions, and what are the real uncertainties encountered when answering those?” they ask “what are the acceptable/conventional things one can do to avoid being criticized while appearing to do something?”

But as in many posts on this blog, even people who are very smart in some respects can misunderstand some really simple things.

Fortunately never me and very very seldom you ;-)