## Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.

Luiz Caseiro writes:

1. P-values and Confidence Intervals are used to draw inferences about a population from a sample. Is that right?

2. As far as I have found, standard statistical software usually computes confidence intervals (CI) and p-values assuming that we have a simple random sample. Is that right?

3. If we have another kind of representative sample, different from a simple random sample (i.e. a complex sample), we should take into account our sample design before calculating CI and p-values. Is that right?

4. If we do not have a representative sample, as is often the case in political science (especially when the sample is a convenience sample, made of some countries for which data is available), would it not be irrelevant and even misleading to report CI and p-values?

This question comes up from time to time (for example, in 2009, 2011, 2014, and 2014), so I’m well prepared to reply to this one.

In response to Caseiro: Yes, the starting point in statistical theory is the assumption of simple random sampling, but there are methods for dealing with stratified samples, cluster samples, etc. There are textbooks on this and statistical packages that do it. If you have a convenience sample, it’s still a good idea to report standard errors etc.; you just need to make assumptions.
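As a rough illustration of what the methods for stratified samples compute, here is a minimal Python sketch of the textbook stratified estimator of a population mean and its standard error. The function name `stratified_mean_se`, the dictionary inputs, and the toy numbers are assumptions made up for illustration; the sketch also assumes simple random sampling within each stratum and known stratum population sizes.

```python
import math
from statistics import mean, variance

def stratified_mean_se(samples, pop_sizes):
    """Estimate a population mean and its standard error from a stratified sample.

    samples:   dict mapping stratum -> list of observed values (at least 2 each)
    pop_sizes: dict mapping stratum -> known population size N_h
    """
    N = sum(pop_sizes.values())
    estimate = 0.0
    var = 0.0
    for h, y in samples.items():
        W = pop_sizes[h] / N              # population share of stratum h
        n_h = len(y)
        fpc = 1 - n_h / pop_sizes[h]      # finite population correction
        estimate += W * mean(y)
        var += W**2 * fpc * variance(y) / n_h
    return estimate, math.sqrt(var)

# Toy data (hypothetical): two strata with very different means.
samples = {"a": [1, 2, 3], "b": [10, 12, 14, 16]}
pop_sizes = {"a": 300, "b": 700}
est, se = stratified_mean_se(samples, pop_sizes)
print(f"estimate = {est:.2f}, standard error = {se:.2f}")
```

The point of the design-based formula is that each stratum contributes in proportion to its population share, not its share of the sample. With a convenience sample the arithmetic is the same, but the "random within stratum" part becomes an assumption rather than a fact about the design.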

Caseiro follows up:

1. If I have a convenience sample the assumption that I need to make when reporting standard errors, CI, etc. is that my convenience sample is not very different from a random sample? This sounds like a very strong assumption.

2. Would not it be more accurate to just say that I cannot reach external validity from my sample?

3. If I do not claim external validity, then the standard errors become unnecessary?

My reply: It all depends on what questions you want to answer. If you simply want to describe the data you have, go for it. But usually we gather data to understand something about unobservables or to make predictions about new situations. In that case, you’ll have to make some assumptions. If your data are weak, your assumptions need to be correspondingly stronger.

To put it another way: Sure, it’s fine to say that you “cannot reach external validity” from your sample alone. But in the meantime you still need to make decisions. We don’t throw away the entire polling industry just cos their response rates are below 10%; we work on doing better. Our samples are never perfect but we can make them closer to the population.

Remember the Chestertonian principle that extreme skepticism is a form of credulity.

1. Taeblin says:

“…we work on doing better.”

well …. an extremely unconvincing overall response to the major issue of non-random samples. Sounds like a football coach impromptu pep talk, not a science perspective.

• Andrew says:

Taeblin:

If you want some convincing, take a look at the entire history of polling, since every public opinion poll ever done has been from a non-random sample. If you don’t want to call it science, fine, I don’t care; call it engineering. A poll from a non-random sample—that is, any poll—is somewhere in quality between a true random sample (which we never have) and a wild guess. Where it falls in the spectrum depends on the difficulty of getting accurate responses, the quality of the data collection, and the quality of the adjustment. As the data get worse, we need to work harder on adjustment.

Apparently “we work on doing better” is something that football coaches say. It’s also something that engineers say, and that statisticians say. We don’t have all the answers, and we’re working on terrain that’s continually changing. Methods of adjustment that worked ok in the 1950s might not work so well now, with our much noisier media environment and much lower response rates.
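One common form of adjustment is poststratification: estimate the outcome within cells defined by known characteristics, then reweight the cell estimates to the population shares. The sketch below is a bare-bones version of that idea. The function name `poststratify`, the two age cells, and the numbers are hypothetical, and it is not the adjustment any particular pollster uses. It assumes that, within each cell, respondents look like nonrespondents.

```python
from collections import defaultdict
from statistics import mean

def poststratify(respondents, pop_shares):
    """Reweight cell means to known population shares.

    respondents: list of (cell, outcome) pairs
    pop_shares:  dict mapping cell -> share of the target population
    """
    by_cell = defaultdict(list)
    for cell, y in respondents:
        by_cell[cell].append(y)
    # The adjustment is undefined for a population cell with no respondents.
    missing = [c for c in pop_shares if c not in by_cell]
    if missing:
        raise ValueError(f"no respondents in cells: {missing}")
    total = sum(pop_shares.values())
    return sum(pop_shares[c] * mean(by_cell[c]) for c in pop_shares) / total

# Hypothetical sample that overrepresents the young (8 of 12 respondents),
# while the young are assumed to be only 30% of the population.
sample = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1)] + [("old", 0)] * 3
shares = {"young": 0.3, "old": 0.7}

raw = mean(y for _, y in sample)
adjusted = poststratify(sample, shares)
print(f"raw mean = {raw:.2f}, poststratified = {adjusted:.2f}")
```

The worse the data, the more cells and the more modeling are needed to make the within-cell assumption plausible, which is where the harder work on adjustment comes in.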

• Regarding “Working on doing better”, there are many techniques for fields which are far more substantial (?) than polling. My favorites, which I borrow when I do Internet measurements work, come from field and quantitative biology. There are delightfully creative ways of doing field sampling. Here are a couple, which I grabbed from my bookshelf for illustration:

* S. T. Buckland, D. R. Anderson, K. P. Burnham, J. L. Laake, D. L. Borchers, L. Thomas, INTRODUCTION TO DISTANCE SAMPLING, Oxford, 2001.
* (same authors), ADVANCED DISTANCE SAMPLING, Oxford, 2007.
* S. C. Amstrup, T. L. McDonald, B. F. J. Manly, HANDBOOK OF CAPTURE-RECAPTURE ANALYSIS, Princeton, 2005.
* W. L. Thompson (ed.), SAMPLING RARE OR ELUSIVE SPECIES, Island Press, 2004.

And, of course:
* C.-E. Särndal, B. Swensson, J. Wretman, MODEL ASSISTED SURVEY SAMPLING, Springer, 1992.
* M. Ghosh, G. Meeden, BAYESIAN METHODS FOR FINITE POPULATION SAMPLING, Chapman & Hall/CRC, 1998.

2. awm says:

This is something I’ve thought a lot about, and I think one big problem is that many practitioners aren’t clear themselves on what they’re assuming when they use nonrandom samples (or random samples with nonresponse, etc…), and the assumptions that relate to bias (strong ignorability) tend to get conflated with the assumptions related to variance (that the data generating process is IID or if not, it’s modeled correctly). If you’ve got a real random sample, you don’t really have to think as hard about any of this stuff, and since that’s what tends to get taught, that’s how people look at data.

I couldn’t say whether or not these would be the kinds of things that need to make their way into intro stats classes for undergrads, or if it’s better to leave them to more advanced stats classes or subject specific methods classes. For me, they certainly weren’t concepts that I encountered at all in any sort of systematic way until grad school. But it seems like we need to do a better job teaching social scientists about some of these concepts.

• Peter Dorman says:

As an undergrad stats (among other things) teacher, I strongly relate to this. I typically spend a lot of time on the topic of sample bias and how to test for it. In the end, and with the techniques I can muster at this level, there’s not much that can be done correctively beyond cultivating a qualitative sense of the direction(s) and intensity of the bias and qualifying external inferences accordingly. But I also preach multiple data sources, so triangulating around bias becomes feasible too.

• WHAT IS NOT BEING taught tho? It seems as if we are now suggesting that we should question most all theories and practices. Well gee, that’s confidence inducing.

• Keith O'Rourke says:

That is the challenge. I made sure my undergrads at Duke (2007/8) understood that with bias the confidence interval coverage approaches zero as the sample size increases. It took more than one class and pretty sure most got it – but it seemed to make them very upset.
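The coverage claim can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions: normal data with known standard deviation, a fixed additive bias, and the sample mean drawn directly from its sampling distribution as a shortcut. The function name `coverage` and all parameter values are made up for this example.

```python
import math
import random

def coverage(n, bias, sigma=1.0, true_mean=0.0, reps=2000, seed=0):
    """Fraction of nominal 95% intervals that contain the true mean.

    The data are assumed to be centered at true_mean + bias, so the
    sample mean is N(true_mean + bias, sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    half_width = 1.96 * se
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(true_mean + bias, se)   # biased sample mean
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / reps

for n in (10, 1000, 100000):
    print(n, coverage(n, bias=0.1))
```

The interval shrinks like 1/sqrt(n) while the bias stays put, so once the interval is narrower than the bias it almost never contains the truth. With no bias the same code gives coverage near the nominal 95% at every sample size.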

The problem with Peter’s idea of triangulating around bias with multiple data sources is that the bias could be the same/similar in all studies (e.g. sometimes/often in epidemiology). On the other hand, if it’s haphazard, it’s unlikely to be centered about zero and the observed variability is thus challenging to make any sense of. Uncertainty can be broken into haphazard about zero (random or random like) versus haphazard about non-zero (where this can be constant or variable by study).

The only sensible (but confidence deflating) realization perhaps being that you cannot become certain about uncertainty with any statistical approach. Rather just less uncertain but often less enough that the learning is not overly risky.

• Peter Dorman says:

Yes, but the point of triangulation is to actively seek out multiple sources with at least partially offsetting biases. It’s not always possible, but in the areas I work in (not epi), it’s more often possible than not.

3. Fritz Strack says:

What if you don’t have a (finite) population from which a sample could have been drawn, like in most (if not all) psychology experiments?

• Justin says:

You are confusing issues. Per the p-values in psychology experiments, the relevant population is the pre-assignment population of participants, from which the treatment and control groups are subsequently drawn. Those experimental groups are random samples from this population, which is definitely finite.
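The randomization view in this comment can be made concrete with a permutation test, where the only randomness is the assignment of the participants at hand to groups. This is a minimal sketch, assuming the sharp null hypothesis of no effect for any participant; the function name `permutation_p_value` and the data are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(treated, control, reps=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)                      # re-randomize group labels
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (reps + 1)

# Made-up outcomes for six treated and six control participants.
treated = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1]
control = [4.0, 4.2, 3.9, 4.5, 4.1, 4.3]
print(permutation_p_value(treated, control))
```

A p-value of this kind speaks only to the participants in the experiment. Generalizing beyond them is a separate step that needs further assumptions.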

• Keith O'Rourke says:

Perhaps Rob Kass’ big picture http://www.stat.cmu.edu/~kass/papers/bigpic.pdf will address your (I believe thoughtful) question.

4. Erikson says:

Freedman’s ‘Statistical models and shoe leather’ has already been mentioned in this blog, but I think it’s worth mentioning it again in this context.

Bottom line: strong modeling assumptions might be nice, but they won’t take us very far on their own. We also need careful design, meaningful data collection and strong reality checks (comparing predictions from the model with other samples from the population we would like to generalize to, e.g.).

5. Martha (Smith) says:

I’m not sure how to interpret Andrew’s statement “Yes, the starting point in statistical theory is the assumption of simple random sampling, but there are methods for dealing with stratified samples, cluster samples, etc.”

If by “The starting point” he means that the types of statistical inference textbooks start with are those requiring simple random samples, then I agree. (His statement “but there are methods for dealing with stratified samples, cluster samples, etc. There are textbooks on this and statistical packages that do it.” suggests that this might indeed be what he is talking about.)

But I think it is important to emphasize from early on that every statistical method has model assumptions, and that some of these assumptions concern what type of sample is appropriate for the model to fit– so “appropriate sample” means, ideally, a sample fitting the assumptions of the model. However, in practice, in our imperfect world, sometimes the best we can do is to give a plausible argument that these assumptions are close to reasonably satisfied — and the credibility of our results depends (among other things) on how well they are satisfied.

• Andrew says:

Martha:

I wrote, “the starting point in statistical theory,” not “the starting point in statistical practice.”

6. Jacob says:

I’ve found it hard in my social psych-influenced field to convince many people to take sample quality seriously. It seems that in certain fields, especially experimental psychology, the burden of proof is on the person who wants to argue that the convenience sample (usually students) is not generalizable. In some other fields, like much of political science, the burden of proof is on the researcher to argue/demonstrate why the convenience sample is appropriate. Andrew is right to say that there’s no true representative sample, but I think there’s something to be said for at least having people in all the relevant cells if you think of it in the sense of a cross-tabs of the relevant characteristics for the question at hand (age, sex, race, education, maybe political party, etc.).

One nut I’ve yet to crack is how to use the Mr. P logic to derive regression coefficient (and standard error) estimates.

7. Andre says:

Andrew wrote: “methods for dealing with stratified samples, cluster samples, etc.”

Can you point me to references? Given that most all samples contain some kind of bias or non-random component, I would like to include more of these techniques when modeling. Things I’m thinking about are confidence intervals (perhaps of coefficients in a linear model, to start) for non-random samples. MCMC and areas of ‘highest posterior density’ are great, but not practical for real-time applications.

8. Andre says:

Ok –

question regarding the Xbox paper. I enjoy flexibility of hierarchical models, but in a high dimensional space with a lot of latent parameters, waiting for a sampler (yes, even with stan, and using techniques like parameter expansion) can take a year (I’m thinking fMRI applications), or trying to use the samplers when doing real time applications, is no good.

Anyway in xbox: “However, for the sake of computational convenience, we use the approximate marginal maximum likelihood estimates obtained from the glmer()…”

How much does my model estimation suffer if I’m using a technique like this as a go-to for things I need done quickly (i.e. is the bias bounded)?

I don’t expect a typed response but if you can point me to a paper that goes in detail, that would be appreciated.

• Andrew says:

Andre:

It does not take a year to fit these models in Stan. Stan is much faster than it was when we did the Xbox analysis several years ago, and we’re making it even faster. If we were doing the Xbox thing right now, we’d just fit in Stan directly.

• Andre says:

Hey thanks,

I was referring to some models in fMRI, not the xbox model, but I’ll give it a shot!