Luiz Caseiro writes:

1. P-values and Confidence Intervals are used to draw inferences about a population from a sample. Is that right?

2. As far as I researched, standard statistical software usually computes confidence intervals (CI) and p-values assuming that we have a simple random sample. Is that right?

3. If we have another kind of representative sample, different from a simple random sample (i.e. a complex sample), we should take into account our sample design before calculating CI and p-values. Is that right?

4. If we do not have a representative sample, as is often the case in political science (especially when the sample is a convenience sample, made up of some countries for which data are available), would it not be irrelevant and even misleading to report CI and p-values?

This question comes up from time to time (for example, in 2009, 2011, 2014, and 2014), so I’m well prepared to reply to this one.

In response to Caseiro: Yes, the starting point in statistical theory is the assumption of simple random sampling, but there are methods for dealing with stratified samples, cluster samples, etc. There are textbooks on this and statistical packages that do it. If you have a convenience sample, it’s still a good idea to report standard errors etc.; you just need to make assumptions.
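As a rough illustration of what "methods for dealing with stratified samples" can look like, here is a minimal sketch of the textbook stratified estimator of a mean and its standard error. It assumes simple random sampling within each stratum, known population shares, and no finite-population correction; the function name and the data are hypothetical and are not taken from the post.

```python
import math
import statistics


def stratified_mean_se(strata):
    """Estimate a population mean from a stratified sample.

    strata: list of (population_share, sample_values) pairs.
    Assumes simple random sampling within each stratum and ignores
    the finite-population correction.
    """
    total_share = sum(share for share, _ in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    estimate = 0.0
    variance = 0.0
    for share, values in strata:
        n = len(values)
        # Each stratum mean is weighted by its share of the population.
        estimate += share * statistics.mean(values)
        # Variance of a stratum mean is s^2 / n; shares enter squared.
        variance += share ** 2 * statistics.variance(values) / n
    return estimate, math.sqrt(variance)


# Hypothetical data: the second stratum is 30% of the population
# but half of the sample, so the unweighted mean is pulled toward it.
strata = [
    (0.7, [1.0, 2.0, 3.0, 2.0]),
    (0.3, [8.0, 9.0, 10.0, 9.0]),
]
estimate, se = stratified_mean_se(strata)
pooled = statistics.mean([y for _, values in strata for y in values])
print(f"stratified estimate {estimate:.2f} (se {se:.3f}), unweighted mean {pooled:.2f}")
```

With a convenience sample the same arithmetic can be run, but the within-stratum random sampling becomes an assumption rather than a design fact.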

Caseiro follows up:

1. If I have a convenience sample the assumption that I need to make when reporting standard errors, CI, etc. is that my convenience sample is not very different from a random sample? This sounds like a very strong assumption.

2. Would not it be more accurate to just say that I cannot reach external validity from my sample?

3. If I do not claim external validity, then the standard errors become unnecessary?

My reply: It all depends on what questions you want to answer. If you simply want to describe the data you have, go for it. But usually we gather data to understand something about unobservables or to make predictions about new situations. In that case, you’ll have to make some assumptions. If your data are weak, your assumptions need to be correspondingly stronger.

To put it another way: Sure, it’s fine to say that you “cannot reach external validity” from your sample alone. But in the meantime you still need to make decisions. We don’t throw away the entire polling industry just cos their response rates are below 10%; we work on doing better. Our samples are never perfect but we can make them closer to the population.

Remember the Chestertonian principle that extreme skepticism is a form of credulity.

“…we work on doing better.”

well …. an extremely unconvincing overall response to the major issue of non-random samples. Sounds like a football coach impromptu pep talk, not a science perspective.

Taeblin:

If you want some convincing, take a look at the entire history of polling, since every public opinion poll ever done has been from a non-random sample. If you don’t want to call it science, fine, I don’t care; call it engineering. A poll from a non-random sample—that is, any poll—is somewhere in quality between a true random sample (which we never have) and a wild guess. Where it falls in the spectrum depends on the difficulty of getting accurate responses, the quality of the data collection, and the quality of the adjustment. As the data get worse, we need to work harder on adjustment.

Apparently “we work on doing better” is something that football coaches say. It’s also something that engineers say, and that statisticians say. We don’t have all the answers, and we’re working on terrain that’s continually changing. Methods of adjustment that worked ok in the 1950s might not work so well now, with our much noisier media environment and much lower response rates.

Regarding “Working on doing better”, there are many techniques for fields which are far more substantial (?) than polling. My favorites, which I borrow when I do Internet measurements work, come from field and quantitative biology. There are delightfully creative ways of doing field sampling. Here are a couple, which I grabbed from my bookshelf for illustration:

* S. T. Buckland, D. R. Anderson, K. P. Burnham, J. L. Laake, D. L. Borchers, L. Thomas, INTRODUCTION TO DISTANCE SAMPLING, Oxford, 2001.

* (same authors), ADVANCED DISTANCE SAMPLING, Oxford, 2007.

* S. C. Amstrup, T. L. McDonald, B. F. J. Manly, HANDBOOK OF CAPTURE-RECAPTURE ANALYSIS, Princeton, 2005.

* W. L. Thompson (ed.), SAMPLING RARE OR ELUSIVE SPECIES, Island Press, 2004.

And, of course:

* C.-E. Särndal, B. Swensson, J. Wretman, MODEL ASSISTED SURVEY SAMPLING, Springer, 1992.

* M. Ghosh, G. Meeden, BAYESIAN METHODS FOR FINITE POPULATION SAMPLING, Chapman & Hall/CRC, 1998.

This is something I’ve thought a lot about, and I think one big problem is that many practitioners aren’t clear themselves on what they’re assuming when they use nonrandom samples (or random samples with nonresponse, etc…), and the assumptions that relate to bias (strong ignorability) tend to get conflated with the assumptions related to variance (that the data generating process is IID or if not, it’s modeled correctly). If you’ve got a real random sample, you don’t really have to think as hard about any of this stuff, and since that’s what tends to get taught, that’s how people look at data.

I couldn’t say whether these are the kinds of things that need to make their way into intro stats classes for undergrads, or if it’s better to leave them to more advanced stats classes or subject-specific methods classes. For me, they certainly weren’t concepts that I encountered in any sort of systematic way until grad school. But it seems like we need to do a better job teaching social scientists about some of these concepts.

As an undergrad stats (among other things) teacher, I strongly relate to this. I typically spend a lot of time on the topic of sample bias and how to test for it. In the end, and with the techniques I can muster at this level, there’s not much that can be done correctively beyond cultivating a qualitative sense of the direction(s) and intensity of the bias and qualifying external inferences accordingly. But I also preach multiple data sources, so triangulating around bias becomes feasible too.

WHAT IS NOT BEING taught tho? It seems as if we are now suggesting that we should question almost all theories and practices. Well gee, that’s confidence inducing.

That is the challenge. I made sure my undergrads at Duke (2007/8) understood that with bias the confidence interval coverage approaches zero as the sample size increases. It took more than one class and pretty sure most got it – but it seemed to make them very upset.
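The claim about coverage can be checked with a small simulation. This is only a sketch under assumed numbers: the true mean is 0, the data are standard normal, and the selection mechanism is modeled crudely as a constant shift of 0.2 in every observed value. The function name, the bias size, and the sample sizes are all illustrative choices.

```python
import math
import random


def coverage(n, bias, reps=300, seed=0):
    """Share of nominal 95% intervals that contain the true mean (0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Observed values are shifted by a constant bias (an assumption
        # standing in for whatever the selection mechanism does).
        sample = [rng.gauss(0.0, 1.0) + bias for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        se = math.sqrt(var / n)
        # The interval mean +/- 1.96 * se covers 0 when |mean| <= 1.96 * se.
        if abs(mean) <= 1.96 * se:
            hits += 1
    return hits / reps


sizes = [25, 100, 400, 1600]
biased = {n: coverage(n, bias=0.2) for n in sizes}
unbiased = {n: coverage(n, bias=0.0) for n in sizes}
for n in sizes:
    print(f"n={n:5d}  coverage with bias: {biased[n]:.2f}  without: {unbiased[n]:.2f}")
```

The interval shrinks like 1/sqrt(n) while the bias stays put, so the biased intervals end up tightly centered on the wrong value.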

The problem with Peter’s idea of triangulating around bias with multiple data sources is that the bias could be the same or similar in all studies (e.g. sometimes/often in epidemiology). On the other hand, if it’s haphazard, it’s unlikely to be centered about zero, and the observed variability is thus challenging to make any sense of. Uncertainty can be broken into haphazard about zero (random or random-like) versus haphazard about non-zero (where this can be constant or variable by study).

The only sensible (but confidence deflating) realization perhaps being that you cannot become certain about uncertainty with any statistical approach. Rather just less uncertain but often less enough that the learning is not overly risky.

Yes, but the point of triangulation is to actively seek out multiple sources with at least partially offsetting biases. It’s not always possible, but in the areas I work in (not epi), it’s more often possible than not.

What if you don’t have a (finite) population from which a sample could have been drawn, like in most (if not all) psychology experiments?

You are confusing issues. For the p-values in psychology experiments, the relevant population is the pre-assignment population of participants, from which the treatment and control groups are subsequently drawn. Those experimental groups are random samples from this population, which is definitely finite.

Perhaps Rob Kass’ big picture http://www.stat.cmu.edu/~kass/papers/bigpic.pdf will address your (I believe thoughtful) question.

Freedman’s ‘Statistical models and shoe leather’ has already been mentioned in this blog, but I think it’s worth mentioning it again in this context.

https://web.math.rochester.edu/people/faculty/cmlr/Advice-Files/Freedman-Shoe-Leather.pdf

Bottom line: strong modeling assumptions might be nice, but they won’t take us very far on their own. We also need careful design, meaningful data collection and strong reality checks (comparing predictions from the model with other samples from the population we would like to generalize to, e.g.).

+1

I’m not sure how to interpret Andrew’s statement “Yes, the starting point in statistical theory is the assumption of simple random sampling, but there are methods for dealing with stratified samples, cluster samples, etc.”

If by “The starting point” he means that the types of statistical inference textbooks start with are those requiring simple random samples, then I agree. (His statement “but there are methods for dealing with stratified samples, cluster samples, etc. There are textbooks on this and statistical packages that do it.” suggests that this might indeed be what he is talking about.)

But I think it is important to emphasize from early on that every statistical method has model assumptions, and that some of these assumptions concern what type of sample is appropriate for the model to fit– so “appropriate sample” means, ideally, a sample fitting the assumptions of the model. However, in practice, in our imperfect world, sometimes the best we can do is to give a plausible argument that these assumptions are close to reasonably satisfied — and the credibility of our results depends (among other things) on how well they are satisfied.

Martha:

I wrote, “the starting point in statistical theory,” not “the starting point in statistical practice.”

I’ve found it hard in my social psych-influenced field to convince many people to take sample quality seriously. It seems that in certain fields, especially experimental psychology, the burden of proof is on the person who wants to argue that the convenience sample (usually students) is not generalizable. In some other fields, like much of political science, the burden of proof is on the researcher to argue/demonstrate why the convenience sample is appropriate. Andrew is right to say that there’s no true representative sample, but I think there’s something to be said for at least having people in all the relevant cells if you think of it in the sense of a cross-tabs of the relevant characteristics for the question at hand (age, sex, race, education, maybe political party, etc.).

One nut I’ve yet to crack is how to use the Mr. P logic to derive regression coefficient (and standard error) estimates.

Andrew wrote: “methods for dealing with stratified samples, cluster samples, etc.”

Can you point me to references? Given that almost all samples contain some kind of bias or non-random component, I would like to include more of these techniques when modeling. Things I’m thinking about are confidence intervals (perhaps of coefficients in a linear model, to start) for non-random samples. MCMC and regions of ‘highest posterior density’ are great but not practical for real-time applications.

Andre:

Some examples are here:

http://www.stat.columbia.edu/~gelman/research/published/rayleigh_final.pdf

http://www.stat.columbia.edu/~gelman/research/published/forecasting-with-nonrepresentative-polls.pdf

http://www.stat.columbia.edu/~gelman/research/unpublished/clustersamplingpaper_draft_Oct2.pdf

We fit these models in Stan nowadays, which works just fine. We don’t use highest posterior density areas, we just use the posterior simulations directly.

There is, of course, also a frequentist literature on confidence interval computation and coverage with clustered sampling and treatment assignment – several actually. In applied micro the so-called “cluster robust methods”, in particular sandwich estimators that are generalizations of the White robust standard error calculation, are most common. It is just another way to think about the problem – instead of modeling the error term directly, you let the residuals and covariances in the data tell you (and then you use the magic of averaging, or something like that).
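To make the sandwich idea concrete, here is a minimal sketch for the simplest possible case, an intercept-only regression (that is, estimating a mean). Residuals are summed within each cluster before squaring, so within-cluster correlation shows up in the variance. The G/(G-1) small-sample factor is one simple choice and software conventions differ; the function name and the data are hypothetical.

```python
import math


def mean_with_cluster_se(clusters):
    """Return (mean, naive iid SE, cluster-robust SE) for clustered data.

    clusters: list of lists, one inner list of observations per cluster.
    """
    values = [y for cluster in clusters for y in cluster]
    n = len(values)
    g = len(clusters)
    if g < 2:
        raise ValueError("need at least two clusters")
    mean = sum(values) / n
    # Naive variance of the mean: treats all n observations as independent.
    naive_var = sum((y - mean) ** 2 for y in values) / (n - 1) / n
    # "Meat" of the sandwich: squared within-cluster sums of residuals.
    meat = sum(sum(y - mean for y in cluster) ** 2 for cluster in clusters)
    # The "bread" for an intercept-only model is 1/n on each side.
    cluster_var = (g / (g - 1)) * meat / n ** 2
    return mean, math.sqrt(naive_var), math.sqrt(cluster_var)


# Hypothetical data: observations are very similar within each cluster.
clusters = [[1, 1, 1, 2], [5, 5, 6, 5], [9, 10, 9, 9]]
mean, naive_se, cluster_se = mean_with_cluster_se(clusters)
print(f"mean {mean:.2f}, naive se {naive_se:.3f}, cluster-robust se {cluster_se:.3f}")
```

In this toy data the twelve observations carry little more information than three cluster averages, which is why the cluster-robust standard error is more than twice the naive one. With only three clusters the estimate is itself very noisy, so this is an illustration of the arithmetic, not a recommended analysis.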

An overview of cluster robust methods: http://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf

Rejection rates when clustering is not properly accounted for: https://economics.mit.edu/files/750

Alwyn Young on permutation-type inference with clusters: http://personal.lse.ac.uk/YoungA/ChannellingFisher.pdf

….wow, Andrew must hate these papers in increasing order of “no no no no no”.

I would also just note one thing about your comment – these methods are about computing appropriately sized confidence intervals/generating appropriate rejection rates under the null. None of them address “bias” due to a non-representative sample or due to selection into treatment. Re-weighting (or the Mr. P stuff, I think) can be used to get back certain kinds of population-representative parameter estimates, but the “clustering” literature here is about getting the standard errors right, not getting BetaHat right.
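For contrast with the standard-error methods, here is a minimal sketch of re-weighting by poststratification, which targets the point estimate. It assumes the population shares of the cells are known and that, within a cell, respondents resemble non-respondents, which is a strong assumption with a convenience sample. The function name, cells, and numbers are hypothetical, and this is plain cell weighting, not the multilevel version (Mr. P).

```python
from collections import Counter


def poststratify(records, population_shares):
    """Weighted mean of y, weighting each cell up to its population share.

    records: list of (cell, y) pairs; population_shares: dict cell -> share.
    """
    counts = Counter(cell for cell, _ in records)
    missing = set(population_shares) - set(counts)
    if missing:
        # An empty cell cannot be weighted up; it needs a model instead.
        raise ValueError(f"no respondents in cells: {sorted(missing)}")
    n = len(records)
    # Weight = population share of the cell / sample share of the cell.
    weights = {cell: population_shares[cell] / (counts[cell] / n) for cell in counts}
    weighted_sum = sum(weights[cell] * y for cell, y in records)
    total_weight = sum(weights[cell] for cell, _ in records)
    return weighted_sum / total_weight


# Hypothetical convenience sample: 80% young, while the population is 50/50.
records = (
    [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1)] + [("old", 0)]
)
population_shares = {"young": 0.5, "old": 0.5}
raw = sum(y for _, y in records) / len(records)
adjusted = poststratify(records, population_shares)
print(f"raw proportion {raw:.3f}, poststratified {adjusted:.3f}")
```

The adjustment moves the estimate but says nothing about its uncertainty, which is the division of labor described above.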

Ok –

question regarding the Xbox paper. I enjoy the flexibility of hierarchical models, but in a high-dimensional space with a lot of latent parameters, waiting for a sampler (yes, even with Stan, and using techniques like parameter expansion) can take a year (I’m thinking fMRI applications), and trying to use the samplers for real-time applications is no good.

Anyway in Xbox: “However, for the sake of computational convenience, we use the approximate marginal maximum likelihood estimates obtained from the glmer()…”

How much does my model estimation suffer if I’m using a technique like this as a go-to for things I need done quickly (i.e. is the bias bounded)?

I don’t expect a typed response but if you can point me to a paper that goes in detail, that would be appreciated.

Andre:

It does not take a year to fit these models in Stan. Stan is much faster than it was when we did the Xbox analysis several years ago, and we’re making it even faster. If we were doing the Xbox thing right now, we’d just fit in Stan directly.

Hey thanks,

I was referring to some models in fMRI, not the xbox model, but I’ll give it a shot!