Statistical analysis on a dataset that consists of a population

This is an oldie but a goodie.

Donna Towns writes:

I am wondering if you could help me solve an ongoing debate?

My colleagues and I are discussing (disagreeing about) the ability of a researcher to analyze information on a population. My colleagues are sure that a researcher is unable to perform statistical analysis on a dataset that consists of a population, whereas I believe that statistical analysis is appropriate if you are testing future outcomes. For example, a group of inmates in a detention centre receive a new program. As withholding it would contravene ethics, all offenders receive the program. Therefore, a researcher would need to compare them with a group of inmates from before the introduction of the program. Assuming, or after confirming, that these two populations are similar, are we able to apply statistical analysis to compare the outcomes of these two populations (such as time to return to detention)? If so, what would be the methodologies used? Do you happen to know of any articles that discuss this issue?

I replied with a link to this post from 2009, which concludes:

If you set up a model including a probability distribution for these unobserved outcomes, standard errors will emerge.
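
To see concretely what that means, here is a minimal sketch (my own illustration, with invented numbers, not anything from the original post or from Donna's data): model each inmate's return-to-detention outcome as a draw from a Bernoulli distribution, and a standard error for the before/after comparison emerges from the resulting posterior.

```python
# Minimal sketch: once the unobserved outcomes are given a probability model,
# a standard error for the before/after difference follows. All counts are
# hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cohorts: returns to detention within a year, out of n released.
n_before, returns_before = 250, 120   # cohort released before the program
n_after,  returns_after  = 240,  95   # cohort that went through the program

# Model: each outcome is Bernoulli(p); with a uniform Beta(1, 1) prior, the
# posterior for each cohort's return probability is Beta(returns + 1, n - returns + 1).
p_before = rng.beta(returns_before + 1, n_before - returns_before + 1, size=100_000)
p_after  = rng.beta(returns_after + 1, n_after - returns_after + 1, size=100_000)

diff = p_after - p_before
print(f"estimated change in return rate: {diff.mean():.3f}")
print(f"standard error (posterior sd):   {diff.std():.3f}")
print(f"95% interval: ({np.quantile(diff, 0.025):.3f}, {np.quantile(diff, 0.975):.3f})")
```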

16 thoughts on “Statistical analysis on a dataset that consists of a population”

  1. This is my area of research, corrections that is, and I get this frankly stupid perspective often from people who should know better. When someone says this the conversation goes:

    “Oh, is this the only correctional facility in the world?”

    “No.”

    “Well then it’s not a population. It’s observational data, not some magical entity that precludes data analysis.”

  2. I don’t think it is a “stupid” perspective. One must make some assumption about the nature of this sample – is it a random sample of all correctional facilities, is it a random sample of this particular facility over time? If it is nonrandom, can we specify something about the biases involved with this sample? I believe any data that is a population confronts these same questions. To use Andrew’s cited example, any study of the 50 states often consists of the population of states – but the study is being used to infer what will happen at a different point in time. If one makes the grandiose assumption (which is usually made, at least in economics) that the data is a “random” sample of the 50 states at all other times, then standard inferences follow (or perhaps even nonstandard inferences). But if you are not willing to make that claim, and if you are not willing or able to specify the ways the particular 50 state observations you have might be biased, then I don’t think you can do anything other than report what you found in the population data you had.

    Even if you do this, people will interpret your findings as if they were a random sample of other time periods – or worse yet, they will assume your point estimates are “truth” with no uncertainty at all! So, it seems to me that you might as well conduct your analysis as if it were a sample and clearly explain the assumptions you are making. Of course, this raises the (important) question of what type of analysis is most appropriate for this data (e.g., classical frequentist analysis, Bayesian, etc.).

    • Dale:

      Considering the states as a random sample of 50 is generally not such a good idea. But modeling the states’ variation over time can definitely be done. It’s not a matter of “specifying the ways in which the observations might be biased,” it’s modeling the change that you are actually interested in. My 1990 paper with King gives an example.

      Also, I agree with the point in your second paragraph, that the random-sampling analysis, although generally wrong, still provides a baseline and can be useful as such.

    • Thinking that a set of data from a particular facility is a population is stupid because people hear “population” and think “generalizable,” when it’s usually not. I reviewed a paper where the authors said that 200 or so low-level sex offenders supervised in a specific parole office somehow represented a population because they were all the sex offenders supervised in that office at that time. The authors then went on to say that these offenders were only a fraction of those supervised in that state, which had 50+ parole offices and 45,000 people on parole. If your data are a subset of any other data, they’re not a population.

      The authors then tried to make the case that this particular intervention, which used one provider and no control group, gave ample evidence that other correctional agencies could confidently use said treatment to lower recidivism, results made all the more convincing because they came from a population.

      So yes, that’s stupid. It’s generally better to think of any particular set of data as a sample and then define its characteristics so that its representativeness and limitations (e.g., nonrandom, observational) are clear, to avoid such errors.

  3. What if you are not interested in predictions but descriptions? For example, one of my students surveyed all butchers in a town and then described attributes using confidence intervals. I said that was incorrect because every element in the population was included in the survey.

  4. Two comments:

    1. It is not useful to think of _the_ population as something exogenously given. It is in the eye of the beholder. My population could be your sample. So it is important to define or enumerate the population of interest.

    2. I’d say 9 times out of 10, when people think they are working with a population, they aren’t. Ask enough questions and you’ll probably find they are trying to make inferences from some observables to some unobserved or unobservable quantity.

  5. The fact that there was a group of inmates before the program and one after tells you right away that these are selected from time periods. How long before was the before group? A month, a year, five years? What they are doing is representing a theoretical population of people who never received the intervention. The inmates who received the program are just part of the population of inmates who might receive the program. If you had started the intervention a few months earlier or later you would have had a different group. Surely you are doing the analysis not only because you are interested in those specific inmates.

    Also, I really don’t buy the argument that it’s an ethical problem not to provide the program to everyone if you don’t know whether the program works (not to mention whether it has a backfire effect or negative side effects). It’s a political and social problem. To me it’s an ethical problem to have inmates or anyone spend time in a program of unknown consequences when the data produced will be weaker than they should be. Of course this assumes planned research, as opposed to a prison-wide program being started and then later someone saying maybe we should see if it makes a difference.

  6. So, maybe I’m misunderstanding something here, but it’s never really possible to have a non-random sample if time is involved. At best, the probability of any event happening that affects the sample is very small but non-zero (although possibly unmeasurable). So I think this is actually because people have trouble thinking of time as a dimension similar to other dimensions.

    As a note, this study might be better considered as a longitudinal study, since the program will likely have effects similar to those of medicinal interventions.

    • The fact that there might be some unknown systematic difference between the groups because they are from different times will increase the uncertainty of the results. So if you construct a confidence interval based on an assumption that the two groups were randomly sampled from the same population, the resulting interval might be too small. But you should still calculate it, because it gives a lower bound on the uncertainty. If you can’t conclude anything interesting even assuming this confidence interval is the right size, then you certainly couldn’t if you took account of possible unknown systematic time effects.

      Treating the estimates as exact because you treated the entire “population” of inmates, when you are not actually interested in this “population”, is simply wrong. When people ask this question, they should be told that it is wrong to consider what is actually a sample to be a population. They do not need to be confused by additional possible complications when they aren’t clear on this very basic point.

  7. In the paper referenced in the comment by Dean Eckles (thanks, Dean!) we precisely want to avoid what Dale Lehman calls the “grandiose assumption that the data is a random sample of the 50 states at all other times,” because that assumption seems implausible. We develop an interpretation of the standard errors as capturing the uncertainty that results from not observing outcomes for the 50 states in the counterfactual worlds where some of their characteristics would have been different (e.g., different regulations): we observe the crime rate in Massachusetts given that Massachusetts does not have the death penalty, but we do not see what the crime rate would have been in the counterfactual world where Massachusetts did have the death penalty. Because we are interested not in the descriptive difference in average crime rates between states with and without the death penalty, but in the average causal effect of the death penalty, there is uncertainty even if we observe crime rates for all fifty states. As Winston Lin comments, this is related to the uncertainty captured in analyses of randomized experiments: not sampling uncertainty, but uncertainty about causal effects (rather than descriptive statistics) arising from alternative assignments. [A small simulated sketch of this design-based uncertainty appears after the comments below.]

    • A metaphysical question in the guise of a specific problem: Suppose I have a sample of schools, some of which were randomly assigned some pedagogical intervention. Suppose I use both standard robust inference calculations (which fail to reject the null hypothesis of no effect), and those proposed in this (Abadie et al.) paper (which do reject the null hypothesis of no effect*).

      Can I then claim an effect “in the sample schools” but not an effect “in the world”? I don’t mean this as a question about p-values and NHST, I mean this as a metaphysical question about the difference between a “sample” and a “population.” Maybe we are only concerned about those schools, in which case the sample _becomes_ a population and the Abadie et al. estimator makes sense, right?

      *Suppose the setting is one in which the estimates proposed in the Abadie et al. paper are more powerful because the robust estimates are too conservative. I didn’t get to an understanding of exactly when this is/isn’t the case, so maybe this doesn’t make sense.
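
Following up on the Abadie et al. comment in the thread above, here is a hedged numerical sketch (simulated data only, not taken from that paper and not real crime rates) of the design-based point: even when all 50 “states” are observed, the average causal effect is uncertain, because each state reveals only one of its two potential outcomes and a different treatment assignment would have produced a different estimate.

```python
# Design-based (Neyman-style) uncertainty for a fixed, fully observed finite
# population. All potential outcomes below are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
N = 50  # a fixed finite population of "states"

# Hypothetical potential outcomes: what each state's outcome would be
# without and with the policy. In reality only one of the two is ever seen.
y_control = rng.normal(5.0, 1.0, size=N)
y_treated = y_control + rng.normal(-0.5, 0.5, size=N)
true_ate = np.mean(y_treated - y_control)  # the estimand: average causal effect

def one_assignment():
    """Randomly assign half the states to 'treatment' and estimate the effect."""
    treated = rng.permutation(N) < N // 2
    y_obs_t = y_treated[treated]
    y_obs_c = y_control[~treated]
    est = y_obs_t.mean() - y_obs_c.mean()
    # Conservative Neyman variance estimate: s_t^2 / n_t + s_c^2 / n_c.
    se = np.sqrt(y_obs_t.var(ddof=1) / treated.sum()
                 + y_obs_c.var(ddof=1) / (~treated).sum())
    return est, se

est, se = one_assignment()
print(f"true ATE {true_ate:.3f}, one estimate {est:.3f}, Neyman SE {se:.3f}")

# The spread of the estimator over alternative assignments is the uncertainty
# that remains even though no unit was "sampled" from a larger population.
draws = np.array([one_assignment()[0] for _ in range(5_000)])
print(f"sd of estimates over re-randomizations: {draws.std():.3f}")
```

The point of the sketch is only that the standard error here comes from the randomization of assignments over a fixed population, not from sampling units out of a larger one; purely descriptive averages for the 50 observed states would indeed carry no sampling uncertainty.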
