Alison McCook from Retraction Watch interviewed Eric Loken and me regarding our recent article, “Measurement error and the replication crisis.” We talked about why traditional statistics are often counterproductive to research in the human sciences.
Retraction Watch: Your article focuses on the “noise” that’s present in research studies. What is “noise” and how is it created during an experiment?
Andrew Gelman: Noise is random error that interferes with our ability to observe a clear signal. It can take many forms, including sampling variability from small samples, unexplained error from unmeasured factors, or measurement error from poor instruments for the things you do want to measure. In everyday life we take measurement for granted – a pound of onions is a pound of onions. But in science, and maybe especially social science, we observe phenomena that vary from person to person, that are affected by multiple factors, and that aren’t transparent to measure (things like attitudes, dispositions, abilities). So our observations are much more variable.
Noise is all the variation that you don’t happen to be currently interested in. In psychology experiments, noise typically includes measurement error (for example, ask the same person the same question on two different days, and you can get two different answers, something that’s been well known in social science for many decades) and also variation among people.
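To make that test-retest point concrete, here’s a toy simulation (my own sketch with invented numbers, not anything from our paper): each day’s answer is the true attitude plus independent noise, so the same question asked twice correlates well below 1.

```python
# Toy sketch of test-retest measurement error (invented numbers).
import random

random.seed(1)

n = 10_000
true_attitude = [random.gauss(0, 1) for _ in range(n)]
# Ask the same question on two different days: answer = truth + noise.
day1 = [t + random.gauss(0, 1) for t in true_attitude]
day2 = [t + random.gauss(0, 1) for t in true_attitude]

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# With equal signal and noise variance, the expected test-retest
# correlation is 0.5, not 1.
print(round(corr(day1, day2), 2))
```

The ratio of signal variance to total variance here is just the classical reliability coefficient: shrink the noise variance and the correlation climbs back toward 1.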
RW: In your article, you “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.” What do you mean by that?
AG: We blogged about the “What does not kill my statistical significance makes it stronger” fallacy here. As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. And we also know that noisy data and small sample sizes make statistical significance even harder to attain. So if you do get statistical significance under such inauspicious conditions, it’s tempting to think of this as even stronger evidence that you’ve found something real. This reasoning is erroneous, however. Statistically speaking, a statistically significant result obtained under highly noisy conditions is more likely to be an overestimate and can even be in the wrong direction. In short: a finding from a low-noise study can be informative, while a finding at the same significance level from a high-noise study is likely to be little more than . . . noise.
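Here’s a quick simulation of that claim (a sketch with invented numbers, not the analysis from our paper): a small true effect, noisy measurements, a small sample, and a filter that keeps only the estimates that reach significance.

```python
# Significance filtering under noise: the kept estimates exaggerate the
# true effect and sometimes have the wrong sign. Numbers are invented.
import math
import random

random.seed(2)

true_effect = 0.1   # small real effect
sd = 1.0            # noisy measurement
n = 30              # small sample
se = sd / math.sqrt(n)

kept = []
for _ in range(5_000):
    est = sum(random.gauss(true_effect, sd) for _ in range(n)) / n
    if abs(est / se) > 1.96:        # "statistically significant"
        kept.append(est)

exaggeration = sum(abs(e) for e in kept) / len(kept) / true_effect
wrong_sign = sum(e < 0 for e in kept) / len(kept)
print(f"exaggeration ~{exaggeration:.1f}x, wrong sign in {wrong_sign:.0%}")
```

With these made-up settings, the significant estimates average several times the true effect (a Type M, or magnitude, error), and a non-trivial share point in the wrong direction (a Type S, or sign, error).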
RW: Which fields of research are most affected by this assumption, and the influence of noise?
AG: The human sciences feature lots of variation among people, and difficulty of accurate measurements. So psychology, education, and also much of political science, economics, and sociology can have big issues with variation and measurement error. Not always — social science also deals in aggregates — but when you get to individual data, it’s easy for researchers to be fooled by noise — especially when they’re coming to their data with a research agenda, with the goal of wanting to find something statistically significant that can get published too.
We’re not experts in medical research but, from what we’ve heard, noise is a problem there too. The workings of the human body might not differ so much from person to person, but when effects are small and measurement is variable, researchers have to be careful. Any example where the outcome is binary — life or death, or recovery from disease or not — will be tough, because yes/no data are inherently variable when there’s no in-between state to measure.
A recent example from the news was the PACE study of treatments for chronic fatigue syndrome: there’s been lots of controversy about outcome measurements, statistical significance, and specific choices made in data processing and data analysis — but at the fundamental level this is a difficult problem because measures of success are noisy and are connected only weakly to the treatments and to researchers’ understanding of the disease or condition.
RW: How do your arguments fit into discussions of replications — ie, the ongoing struggle to address why it’s so difficult to replicate previous findings?
AG: When a result comes from little more than noise mining, it’s not likely to show up in a preregistered replication. I support the idea of replication if for no other reason than the potential for replication can keep researchers honest. Consider the strategy employed by some researchers of twisting their data this way and that in order to find a “p less than .05” result which, when draped in a catchy theory, can get published in a top journal and then get publicized on NPR, Gladwell, TED talks, etc. The threat of replication changes the cost-benefit calculation for this research strategy. The short- and medium-term benefits (publication, publicity, jobs for students) are still there, but there’s now the medium-term risk that someone will try to replicate and fail. And the more publicity your study gets, the more likely someone will notice and try that replication. That’s what happened with “power pose.” And, long-term, enough failed replications and not too many people outside the National Academy of Sciences and your publisher’s publicity department are going to think what you’re doing is even science.
That said, in many cases we are loath to recommend pre-registered replication. This is for two reasons: First, some studies look like pure noise. What’s the point of replicating a study that is, for statistical reasons, dead on arrival? Better to just move on. Second, suppose someone is studying something for which there is an underlying effect, but his or her measurements are so noisy, or the underlying phenomenon is so variable, that it is essentially undetectable given the existing research design. In that case, we think the appropriate solution is not to run the replication, which is unlikely to produce anything interesting (even if the replication is a “success” in having a statistically significant result, that result itself is likely to be a non-replicable fluke). It’s also not a good idea to run an experiment with much larger sample size (yes, this will reduce variance but it won’t get rid of bias in research design, for example when data-gatherers or coders know what they are looking for). The best solution is to step back and rethink the study design with a focus on control of variation.
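That point about sample size and bias can be sketched in a few lines (a hypothetical bias value, not data from any real study): the standard error shrinks with n, but a systematic bias — say, from coders who know what they’re looking for — does not.

```python
# Bigger samples shrink variance but not systematic bias (invented numbers).
import random

random.seed(3)

true_effect = 0.0
coder_bias = 0.2    # hypothetical systematic bias in the measurements

def estimate(n):
    xs = [random.gauss(true_effect + coder_bias, 1.0) for _ in range(n)]
    return sum(xs) / n

for n in (50, 5_000, 500_000):
    print(n, round(estimate(n), 2))
# As n grows, the estimate settles near 0.2 (the bias),
# not near the true effect of 0.
```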
RW: Anything else you’d like to add?
AG: In many ways, we think traditional statistics, with its context-free focus on distributions and inferences and tests, has been counterproductive to research in the human sciences. Here’s the problem: A researcher does a small-N study with noisy measurements, in a setting with high variation. That’s not because the researcher’s a bad guy; there are good reasons for these choices: Small-N is faster, cheaper, and less of a burden on participants; noisy measurements are what happen if you take measurements on people and you’re not really really careful; and high variation is just the way things are for most outcomes of interest. So, the researcher does this study and, through careful analysis (what we might call p-hacking or the garden of forking paths), gets a statistically significant result. The natural attitude is then that noise was not such a problem; after all, the standard error was low enough that the observed result was detected. Thus, retroactively, the researcher decides that the study was just fine. Then, when it does not replicate, lots of scrambling and desperate explanations. But the problem — the original sin, as it were — was the high noise level. It turns out that the attainment of statistical significance cannot and should not be taken as retroactive evidence that a study’s design was efficient for research purposes. And that’s where the “What does not kill my statistical significance makes it stronger” fallacy comes back in.
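The forking-paths part of that story is easy to simulate (my own toy version, with an invented number of forks): even with pure noise, an analyst free to test any of ten outcomes will find something “significant” far more often than the nominal 5%.

```python
# Pure noise plus forking paths: the chance of at least one p < .05
# result is far above 5%. All settings are invented for illustration.
import math
import random

random.seed(4)

def significant(n=30):
    xs = [random.gauss(0, 1) for _ in range(n)]     # no real effect
    z = (sum(xs) / n) / (1 / math.sqrt(n))
    return abs(z) > 1.96

sims = 2_000
forks = 10    # hypothetical number of outcomes the analyst may test
rate = sum(any(significant() for _ in range(forks))
           for _ in range(sims)) / sims
print(round(rate, 2))   # roughly 1 - 0.95**10, i.e. about 0.4
```

And this understates the problem: real forking paths include subgroups, covariate choices, and data-exclusion rules, not just a fixed menu of outcomes.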
P.S. Discussion in comments below reveals some ambiguity in the term “measurement error,” so for convenience I’ll point to the Wikipedia definition, which pretty much captures what Eric and I are talking about:
Observational error (or measurement error) is the difference between a measured value of a quantity and its true value. In statistics, an error is not a “mistake”. Variability is an inherent part of things being measured and of the measurement process. Measurement errors can be divided into two components: random error and systematic error. Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measures of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (as of observation or measurement) inherent in the system. Systematic error may also refer to an error having a nonzero mean, so that its effect is not reduced when observations are averaged.
In addition, many of the points we make regarding measurement error also apply to variation. Indeed, it’s not generally possible to draw a sharp line between variation and measurement error. This can be seen from the definition on Wikipedia: “Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.” The demarcation between variation and measurement error depends on how the true value is defined. For example, suppose you define the true value as a person’s average daily consumption of fish over a period of a year. Then if you ask someone how much fish they ate yesterday, this could be a precise measurement of yesterday’s fish consumption but a noisy measure of their average daily consumption over the year.
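That fish example can be turned into a short simulation (all quantities invented): yesterday’s consumption is an exact record of yesterday, but only a noisy proxy for the annual average, so the two correlate well below 1.

```python
# One day's fish consumption as a noisy measure of the annual average.
# All numbers invented; negative values are an artifact of the toy noise.
import random

random.seed(5)

n = 5_000
annual_avg = [random.gauss(40, 10) for _ in range(n)]   # grams/day
yesterday = [a + random.gauss(0, 20) for a in annual_avg]

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Day-to-day variation swamps between-person variation, so the
# correlation with the "true value" is well under 1.
print(round(corr(annual_avg, yesterday), 2))
```

Averaging a week or a month of such reports would shrink the day-level noise and pull the correlation back up — which is exactly why the choice of “true value” matters: relative to yesterday’s intake, the report has almost no error; relative to the annual average, it is mostly noise.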
P.P.S. Here is some R code related to our paper.