As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. Lots of our best ideas don’t pan out, and even if a hypothesis seems to be supported by the data, the magic “p less than .05” can be elusive.
And we also know that noisy data and small sample sizes make statistical significance even harder to attain. In statistics jargon, noisy studies have low “power.”
Now suppose you’re working in a setting such as educational psychology where the underlying objects of study are highly variable and difficult to measure, so that high noise is inevitable. Also, it’s costly or time-consuming to collect data, so sample sizes are small. But it’s an important topic, so you bite the bullet and accept that your research will be noisy. And you conduct your study . . . and it’s a success! You find a comparison of interest that is statistically significant.
At this point, it’s natural to reason as follows: “We got statistical significance under inauspicious conditions, and that’s an impressive feat. The underlying effect must be really strong to have shown up in a setting where it was so hard to find.” The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.
This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.
What went wrong? Why is it a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S (sign) error says that a statistically significant estimate can have a high probability of getting the sign wrong.
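Here is a minimal simulation sketch of that claim, assuming the estimate is normally distributed around a positive true effect with a known standard error. The function name simulate_significant, the true effect of 1.0, and the grid of standard errors are all made up for illustration; nothing here comes from a real study.

```python
import random

def simulate_significant(true_effect, se, n_sims=50_000, z_crit=1.96, seed=1):
    """Draw estimates ~ Normal(true_effect, se) and keep the significant ones.

    Returns (share significant, Type S rate, mean exaggeration ratio).
    Assumes true_effect > 0, so a negative estimate has the wrong sign.
    """
    rng = random.Random(seed)
    threshold = z_crit * se  # an estimate must exceed this in absolute value
    kept = [est for est in (rng.gauss(true_effect, se) for _ in range(n_sims))
            if abs(est) > threshold]
    share_significant = len(kept) / n_sims
    # Type S: fraction of significant estimates with the wrong sign
    type_s = sum(est < 0 for est in kept) / len(kept)
    # Type M: average |significant estimate| relative to the true effect
    type_m = sum(abs(est) for est in kept) / len(kept) / true_effect
    return share_significant, type_s, type_m

# Hypothetical true effect of 1.0, with increasingly noisy estimates
results = {se: simulate_significant(true_effect=1.0, se=se)
           for se in (0.5, 1.0, 2.0, 4.0, 8.0)}
for se, (power, type_s, type_m) in results.items():
    print(f"se={se:>4}: power={power:.3f}  Type S={type_s:.3f}  exaggeration={type_m:.1f}x")
```

As the standard error grows relative to the true effect, fewer simulated estimates clear the significance threshold, and those that do are more exaggerated and more often have the wrong sign.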
We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward. The example was a paper reporting a correlation between certain women’s political attitudes and the time of the month.
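To see how numbers like these can arise, here is a rough closed-form sketch under a normal approximation. The true effect of 2 and standard error of 8.1 are assumptions chosen for illustration (they are not stated in this post), and power_and_type_s is a hypothetical helper name.

```python
from statistics import NormalDist

def power_and_type_s(true_effect, se, z_crit=1.96):
    """Power and Type S rate for an estimate ~ Normal(true_effect, se).

    Assumes true_effect > 0 and a two-sided test at the given z_crit.
    """
    z = NormalDist()  # standard normal
    # Probability of a significant estimate with the correct (positive) sign
    p_right_sign = 1 - z.cdf(z_crit - true_effect / se)
    # Probability of a significant estimate with the wrong (negative) sign
    p_wrong_sign = z.cdf(-z_crit - true_effect / se)
    power = p_right_sign + p_wrong_sign
    return power, p_wrong_sign / power

# Assumed values for illustration only
power, type_s = power_and_type_s(true_effect=2.0, se=8.1)
print(f"power = {power:.3f}, Type S rate = {type_s:.3f}")
```

Under these assumed inputs the power comes out near .06 and the Type S rate near 24%, in line with the figures quoted above.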
So, we’ve seen from statistical analysis that the “What does not kill my statistical significance makes it stronger” attitude is a fallacy: Actually, the noisier the study, the less we learn from statistical significance. And we can also see the intuition that led to the fallacy, the idea that statistical significance under challenging conditions is an impressive accomplishment. That intuition is wrong because it neglects the issue of selection, which we also call the garden of forking paths.
Even experienced researchers can fall for the “What does not kill my statistical significance makes it stronger” fallacy. For example, in an exchange about potential biases in summaries of some well-studied, but relatively small, early childhood intervention programs, economist James Heckman wrote:
“The effects reported for the programs I discuss survive batteries of rigorous testing procedures. They are conducted by independent analysts who did not perform or design the original experiments. The fact that samples are small works against finding any effects for the programs, much less the statistically significant and substantial effects that have been found.”
Yes, the fact that samples are small works against finding any [statistically significant] effects. But no, this does not imply that effect estimates obtained from small, noisy studies are to be trusted. In addition, the phrase, “much less the statistically significant and substantial effects” is misleading, in that when samples are small and measurements are noisy, any statistically significant estimates will be necessarily “substantial,” as that’s what it takes for them to be at least two standard errors from zero.
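The “necessarily substantial” point is just arithmetic, and a small sketch makes it concrete. This assumes a simple comparison of two equal-sized groups with a common between-person standard deviation, so the standard error of the difference is sd * sqrt(2/n); the function name and the sample sizes are hypothetical.

```python
import math

def min_significant_difference(n_per_group, sd=1.0, z_crit=1.96):
    """Smallest |difference in means| that can reach significance.

    Assumes two groups of n_per_group each and a common standard
    deviation sd, so se = sd * sqrt(2 / n_per_group).
    """
    se = sd * math.sqrt(2.0 / n_per_group)
    return z_crit * se

# Outcomes measured in standard-deviation units (sd = 1)
for n in (10, 25, 100, 1000):
    print(f"n per group = {n:>4}: estimate must exceed "
          f"{min_significant_difference(n):.2f} sd to be significant")
```

With small groups, only very large estimated differences can be statistically significant at all, whatever the true effect is, so finding a “substantial” significant estimate in a small study is not extra evidence.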
My point here is not to pick on Heckman, any more than my point a few years ago was to pick on Kahneman and Turing. No, it’s the opposite. Here you have James Heckman, a brilliant economist who’s done celebrated research on selection bias, who’s following a natural but erroneous line of reasoning that doesn’t account for selection. He’s making the “What does not kill my statistical significance makes it stronger” fallacy.
It’s an easy fallacy to make: if a world-renowned expert on selection bias can get this wrong, we can too. Hence this post.
P.S. Regarding the discussion of the Heckman quote above: He did say, and it’s true, that the measurements of academic achievement and similar outcomes are good. These aren’t ambiguous self-reports, or arbitrarily coded things. So the small sample point is still relevant, but it’s not appropriate to label those measurements as noisy. What’s relevant for this sort of study is not that they are noisy but that they are highly variable, and these are between-student comparisons, so between-student variance goes into the error term. The point is that the fallacy can arise when the underlying phenomenon is highly variable, even if the measurements themselves are not noisy.
P.P.S. More here. Eric and I published an article on this in Science.