Daljit Dhadwal writes:

On the Ask Metafilter site, someone asked the following:

How does statistical analysis differ when analyzing the entire population rather than a sample? I need to do some statistical analysis on legal cases. I happen to have the entire population rather than a sample. I’m basically interested in the relationship between case outcomes and certain features (e.g., time, the appearance of certain words or phrases in the opinion, the presence or absence of certain issues). Should I do anything different than I would if I were using a sample? For example, is a p-value meaningful in this kind of case?

My reply:

This is a question that comes up a lot. For example, what if you’re running a regression on the 50 states? These aren’t a sample from a larger number of states; they’re the whole population.

To get back to the question at hand, it might be that you’re thinking of these cases as a sample from a larger population that includes future cases as well. Or, to put it another way, maybe you’re interested in making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation. That’s what we did when estimating the seats-votes curve: we set up a hierarchical model with year-to-year variation estimated from a separate analysis. (Original model is here, later version is here.)

So, one way of framing the problem is to think of your “entire population” as a sample from a larger population, potentially including future cases. Another frame is to think of there being an underlying probability model. If you’re trying to understand the factors that predict case outcomes, then the implicit full model includes unobserved factors (related to the notorious “error term”) that contribute to the outcome. If you set up a model including a probability distribution for these unobserved outcomes, standard errors will emerge.
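To make the second frame concrete, here is a minimal sketch (with simulated data, not real legal cases; all names are mine) of how standard errors emerge once you posit a probability model for the unobserved error term, even when every case is in hand:

```python
import numpy as np

# Illustrative sketch with simulated data: even with the "entire
# population" of cases observed, a probability model for the unobserved
# error term yields standard errors for the regression coefficients.
rng = np.random.default_rng(0)

n = 500                               # all cases, not a sample
x = rng.normal(size=n)                # an observed case feature
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)  # outcome + "error term"

# Ordinary least squares via the normal equations
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Model-based standard errors: the randomness lives in the error term,
# not in any sampling of cases.
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print(beta)  # roughly [1.0, 2.0]
print(se)
```

The standard errors here quantify uncertainty about the coefficients of the underlying process, which is meaningful even though no larger population of cases was sampled.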

For me the important point is that there is a process underlying the observations, and that's what is stochastic. The purpose of the probability model is to capture the important aspects of this process.

This still confuses me.

To follow up on Bob H's comment and perhaps to address Eric's confusion and to add to the conversation:

It may be useful to know that an underlying probability model can describe either (1) how outcomes occurred [e.g. the count of deaths due to horse kicks can be well described by a Poisson-based model] or (2) how some important explanatory variable occurred [e.g. assignment to the experimental treatment (of horse-kicks) is well described by a model in which all units have equal probability of assignment]. Inference based on sampling (from a finite- or super-population) also implies some underlying probability model by which sets of outcomes and explanatories (and covariates) appear in a dataset [e.g. a model which states that each row in the unseen population dataset had equal probability of arriving in the observed sample dataset].
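As a concrete illustration of mode (1), here is a sketch fitting a Poisson model to the classic horse-kick counts (the von Bortkiewicz data on deaths per corps-year, as usually reported); the variation here comes from the assumed Poisson process, not from any sampling scheme:

```python
import numpy as np
from math import exp, factorial

# The classic Prussian horse-kick data (von Bortkiewicz, as usually
# reported): deaths per corps-year, 200 corps-years in all.
deaths = np.array([0, 1, 2, 3, 4])
observed = np.array([109, 65, 22, 3, 1])

# MLE of the Poisson rate is just the mean count
lam = (deaths * observed).sum() / observed.sum()

# Expected counts under the fitted Poisson model
expected = observed.sum() * np.array(
    [exp(-lam) * lam**k / factorial(k) for k in deaths]
)
for k, o, e in zip(deaths, observed, expected):
    print(f"{k} deaths: observed {o:3d}, expected {e:6.1f}")
```

The close agreement between observed and expected counts is what justifies treating the Poisson model as a useful description of how these outcomes occurred.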

The probability model of outcomes approach is better developed. Yet, both Neyman and Fisher developed ways to do inference using probability models of treatment/sampling (very simple models that encapsulate the statement "I randomized assignment to treatment/I randomly sampled units to receive treatment using the following design."). Current survey statistics (see, for example, Kish 1965 and Lohr 1999) and randomization inference (see, for example, Rosenbaum's 2002 book) have extended Neyman and Fisher and tend to avoid models of outcomes in favor of models of assignment. Other work uses both (see Imbens and Rubin 1997 for a Bayesian approach; see Rosenbaum 2002 for a frequentist approach).

So, what about inference about a population?

To paraphrase Bob H.: inference is about that which varies. So the question is where the variation comes from. Take the Prussian horse-kicks example: (1) We could get variation in that outcome by taking a new sample (or imagining the possibility of doing so) and modeling said sampling. (2) If we assumed a Poisson process generating the deaths, we get variation by assumption. (3) We can also get variation by assumption by describing how horse-kicks are assigned to people/army units.
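A minimal sketch of mode (3), randomization inference, with simulated data: the only randomness invoked is the random assignment itself, so the null distribution is built by re-randomizing the treatment labels:

```python
import numpy as np

# Sketch of assignment-based (randomization) inference with simulated
# data. The probability model is just "assignment was randomized" --
# no model of outcomes and no sampling from a larger population.
rng = np.random.default_rng(1)

n = 100
treated = rng.permutation(np.repeat([True, False], n // 2))
outcome = np.where(treated, 1.0, 0.0) + rng.normal(size=n)  # true effect = 1

observed_diff = outcome[treated].mean() - outcome[~treated].mean()

# Re-randomize the labels many times to build the null distribution
null = np.empty(5000)
for i in range(5000):
    fake = rng.permutation(treated)
    null[i] = outcome[fake].mean() - outcome[~fake].mean()

p_value = np.mean(np.abs(null) >= abs(observed_diff))
print(observed_diff, p_value)
```

Note that the p-value here is meaningful even though all 100 units are the "entire population": the stochastic process being modeled is the assignment, not the sampling.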

In each case we have to justify why our model is meaningful and useful.

The nice thing in the end is that we can infer from a sample to a population via a probability model of sampling *and/or* a probability model of outcomes *and/or* a probability model of explanatories. Thus, someone with a "population" and "not a sample" has different options [3 distinct modes of statistical inference, plus others that I ignore here because of my own ignorance]. What is more, with large samples and "well-behaved" outcomes/assignments, these modes agree with each other (e.g. Lehmann, Cox). (See Rubin 1990, 1991 for nice descriptions of these modes of statistical inference in the context of the potential outcomes model of causal inference.)

Which mode to choose depends on the substance, design, data and purposes of the analysis. The good thing, however, is that statistical inference is possible and justifiable even when one's data has not arrived via some well-defined sampling model from some clear population.

Jake, I think you just confused Eric even more.

Daljit, no one ever has the entire population. It's like saying you know where infinity ends.

Instead just think of it as you have a giant sample.

First thing you might want to test for is whether your sample has a normal distribution or not.

This problem arises in the survey sampling literature regarding inferences about regression parameters.

The superpopulation approach, which Andrew describes, assumes that the finite population (50 states in this example) is generated by a probability model (e.g. y_i ~ N(a + bx_i, s^2)). You can make inferences about either the regression parameters (a, b, s^2) or the finite population parameters (e.g., the population total sum_i y_i). The inferences can be either frequentist or Bayesian. Inferences about regression parameters are done in the usual way. Inferences about finite population parameters are a prediction problem: you need to predict the values of y_i for the unsampled units using the regression model, and there are two sources of error, the estimates of the regression parameters and the unit-level error terms.
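A sketch (simulated data; variable names are mine) of the superpopulation prediction problem: estimate the finite-population total by keeping the observed y_i and predicting the unsampled ones from the fitted regression:

```python
import numpy as np

# Sketch of the superpopulation approach with simulated data: the
# finite population itself is treated as draws from y_i ~ N(a + b*x_i, s^2).
rng = np.random.default_rng(2)

N = 50                                 # e.g. the 50 states
x = rng.uniform(0, 10, size=N)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=N)

# Observe a sample of 30 states; x is known for every state
sampled = np.zeros(N, dtype=bool)
sampled[rng.choice(N, size=30, replace=False)] = True

X_s = np.column_stack([np.ones(sampled.sum()), x[sampled]])
beta = np.linalg.solve(X_s.T @ X_s, X_s.T @ y[sampled])

# Inference about the finite-population total is a prediction problem:
# observed y for sampled units, model predictions for the rest. The two
# error sources are the estimated beta and the unit-level error terms.
y_pred = beta[0] + beta[1] * x[~sampled]
total_hat = y[sampled].sum() + y_pred.sum()

print(total_hat, y.sum())  # estimate vs. the actual finite-population total
```

An interval for the total would have to account for both error sources; this sketch only shows the point prediction.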

The traditional (i.e., design-based) approach in survey sampling, however, does not employ a model. Instead, the parameters are usually defined to be what would be obtained by applying least squares to the finite population, so b = sum_i (x_i – xbar)y_i / sum_i (x_i – xbar)^2 and a = ybar – b * xbar (where the sums and means are calculated over the entire finite population, e.g. the 50 states). The primary estimating principles are frequentist (usually unbiasedness and equivariance). One odd consequence is that one performs weighted least squares with weights coming from the selection probabilities, not from the variance of the regression disturbances (in the design-based approach, the regression errors aren't even random variables).
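A sketch (simulated data) of the design-based estimator: the finite-population least-squares slope is the fixed target, and sampled units are weighted by the inverses of their selection probabilities:

```python
import numpy as np

# Sketch of design-based estimation with simulated data: the
# finite-population least-squares slope b_pop is a fixed (non-random)
# target, and the weights come from the sampling design, not from any
# model of the regression disturbances.
rng = np.random.default_rng(3)

N = 1000
x = rng.uniform(0, 10, size=N)
y = 2.0 + 0.5 * x + rng.normal(size=N)

# The fixed finite-population parameters (same formulas as in the text)
b_pop = ((x - x.mean()) * y).sum() / ((x - x.mean()) ** 2).sum()
a_pop = y.mean() - b_pop * x.mean()

# Unequal-probability sample: high-x units are twice as likely to be drawn
p = np.where(x > 5, 0.4, 0.2)          # selection probabilities
s = rng.random(N) < p                  # sampled indicator
w = 1.0 / p[s]                         # design weights (inverse probabilities)

# Weighted least squares with design weights, not variance weights
xw = (w * x[s]).sum() / w.sum()
b_hat = (w * (x[s] - xw) * y[s]).sum() / (w * (x[s] - xw) ** 2).sum()
print(b_hat, b_pop)
```

Without the weights, the overrepresentation of high-x units would not by itself bias the slope much in this simulation, but in general the design weights are what make the estimator consistent for the finite-population parameter under the sampling design.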

Neither approach is entirely satisfactory. Some people like the design-based approach because it doesn't involve a model, but the definition of the regression parameters is arbitrary (why not LAD or some other estimator applied to the population?) and it's not obvious how to interpret the parameters.

On the other hand, the superpopulation model is a bit awkward to describe (e.g., "imagine that the states are drawn from a hypothetical population of states"). I think the problem arises from thinking that randomness comes only from sampling (a common belief among many survey statisticians, who would strongly object to Jake's last sentence). However, you shouldn't be estimating regression coefficients unless you have some sort of model in mind, and any reasonable model should include those "notorious error terms."