## The “What does not kill my statistical significance makes it stronger” fallacy

[cat picture]

As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. Lots of our best ideas don’t pan out, and even if a hypothesis seems to be supported by the data, the magic “p less than .05” can be elusive.

And we also know that noisy data and small sample sizes make statistical significance even harder to attain. In statistics jargon, noisy studies have low “power.”

Now suppose you’re working in a setting such as educational psychology where the underlying objects of study are highly variable and difficult to measure, so that high noise is inevitable. Also, it’s costly or time-consuming to collect data, so sample sizes are small. But it’s an important topic, so you bite the bullet and accept that your research will be noisy. And you conduct your study . . . and it’s a success! You find a comparison of interest that is statistically significant.

At this point, it’s natural to reason as follows: “We got statistical significance under inauspicious conditions, and that’s an impressive feat. The underlying effect must be really strong to have shown up in a setting where it was so hard to find.” The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.

This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.

What went wrong? Why is it a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S (sign) error says that a statistically significant estimate can have a high probability of getting the sign wrong.
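To make the two error types concrete, here is a minimal sketch that computes power, the Type S error rate, and the expected exaggeration ratio (Type M) in closed form. It assumes the estimate is normally distributed around a positive true effect with a known standard error and that significance means a two-sided test at the 5% level. The function name `retrodesign` and the numbers in the demo are illustrative assumptions, not values from any study discussed here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, type_s, exaggeration) for an estimate ~ Normal(true_effect, se).

    Assumes true_effect > 0 and a two-sided test that declares
    significance when |estimate| > z * se.
    """
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    d = true_effect / se                   # true effect in standard-error units
    a = z - d                              # standardized cutoff, positive side
    b = -z - d                             # standardized cutoff, negative side

    p_pos = 1 - STD_NORMAL.cdf(a)          # significant with the correct sign
    p_neg = STD_NORMAL.cdf(b)              # significant with the wrong sign
    power = p_pos + p_neg
    type_s = p_neg / power

    # E[|estimate|; significant], using the mean of a truncated normal
    mean_abs = (true_effect * (p_pos - p_neg)
                + se * (STD_NORMAL.pdf(a) + STD_NORMAL.pdf(b)))
    exaggeration = mean_abs / power / true_effect
    return power, type_s, exaggeration


if __name__ == "__main__":
    # Hypothetical true effect of 1.0 measured with increasing noise
    for se in (0.25, 1.0, 4.0):
        power, type_s, exaggeration = retrodesign(1.0, se)
        print(f"se={se}: power={power:.3f}, "
              f"type S={type_s:.3f}, exaggeration={exaggeration:.2f}")
```

Holding the true effect fixed, raising the standard error lowers power and raises both the Type S rate and the exaggeration ratio, which is the pattern the paragraph above describes.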

We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward. The example was a paper reporting a correlation between certain women’s political attitudes and the time of the month.

So, we’ve seen from statistical analysis that the “What does not kill my statistical significance makes it stronger” attitude is a fallacy: Actually, the noisier the study, the less we learn from statistical significance. And we can also see the intuition that led to the fallacy, the idea that statistical significance under challenging conditions is an impressive accomplishment. That intuition is wrong because it neglects the issue of selection, which we also call the garden of forking paths.

### An example

Even experienced researchers can fall for the “What does not kill my statistical significance makes it stronger” fallacy. For example, in an exchange about potential biases in summaries of some well-studied, but relatively small, early childhood intervention programs, economist James Heckman wrote:

The effects reported for the programs I discuss survive batteries of rigorous testing procedures. They are conducted by independent analysts who did not perform or design the original experiments. The fact that samples are small works against finding any effects for the programs, much less the statistically significant and substantial effects that have been found.

Yes, the fact that samples are small works against finding any [statistically significant] effects. But no, this does not imply that effect estimates obtained from small, noisy studies are to be trusted. In addition, the phrase, “much less the statistically significant and substantial effects” is misleading, in that when samples are small and measurements are noisy, any statistically significant estimates will be necessarily “substantial,” as that’s what it takes for them to be at least two standard errors from zero.
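The last point is simple arithmetic, and a minimal sketch may help. It assumes a comparison of two equal-sized groups, a known outcome standard deviation, and a normal approximation; the sample sizes are hypothetical and are not those of the programs Heckman discusses. The function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def min_significant_difference(n_per_group, sd=1.0, alpha=0.05):
    """Smallest |difference in means| that reaches two-sided significance."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2.0 / n_per_group)  # standard error of a difference in two means
    return z * se


if __name__ == "__main__":
    # With sd=1.0 the result is in units of the outcome's standard deviation
    for n in (20, 50, 200, 1000):
        print(n, round(min_significant_difference(n), 3))
```

Under these assumptions, the smaller the groups, the larger an estimate has to be before it can be called statistically significant at all, so "significant and substantial" is a single finding rather than two independent ones.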

My point here is not to pick on Heckman, any more than my point a few years ago was to pick on Kahneman and Turing. No, it’s the opposite. Here you have James Heckman, a brilliant economist who’s done celebrated research on selection bias, who’s following a natural but erroneous line of reasoning that doesn’t account for selection. He’s making the “What does not kill my statistical significance makes it stronger” fallacy.

It’s an easy fallacy to make: if a world-renowned expert on selection bias can get this wrong, we can too. Hence this post.

P.S. Regarding the discussion of the Heckman quote above: He did say, and it’s true, that the measurements are good for the academic achievement etc. These aren’t ambiguous self-reports, or arbitrarily coded things. So the small sample point is still relevant, but it’s not appropriate to label those measurements as noisy. What’s relevant for this sort of study is not that they are noisy but that they are highly variable—and these are between-student comparisons, so between-student variance goes into the error term. The point is that the fallacy can arise when the underlying phenomenon is highly variable, even if the measurements themselves are not noisy.

P.P.S. More here. Eric and I published an article on this in Science.

1. Perhaps I am wrong, but it seems that this fallacy isn’t confined to “statistical significance” – it can include Bayesian analysis as well.

It seems the error is just as problematic and as conceptually likely when using a summary statistic of a Bayesian analysis. If, say, the posterior p(x<N%) is high in a noisy dataset, this is less informative, but likely to be misinterpreted in similar ways. You can diagnose such a problem in many ways, but you can do the same in a NHST setting. So perhaps it is worth noting explicitly that this isn't actually a p-values issue, despite a name referencing "Statistical Significance" and p-values comprising all of the examples noted here.

• Z says:

I think Andrew would say that if you’re in a situation where the effect size is likely to be small relative to the noise, a suitable prior would save you from this problem. But it is true (I think) that if you really don’t have any reason a priori to suspect that the effect is small then an honest prior would not save you.

• Rahul says:

So the choice of prior isn’t really important,……till it really is!

• Andrew says:

Rahul:

The weaker your data, the more important is your prior. In my above post I’m specifically talking about weak-data scenarios.

• Rahul says:

Indeed, and I think weak data scenarios are a dominant class of problems. Clearly not something one can ignore.

I feel the prudent assumption when faced with a generic problem always is to go with the thinking that your choice of prior is going to matter critically.

• Start with the assumption that no matter how much data you have, your choice of Likelihood matters CRITICALLY. It’s no good explaining the outcome of some dental hygiene problem in terms of the war between blue fairies and red fairies for control of the space between your teeth and then spending a long time interviewing people about the field’s consensus belief about the relative strength of blue fairies vs red fairies.

Building models where you’re assuming “as if a random number generator with unknown mean and SD but perfectly known distributional shape” is pretty much blue and red fairies.

• Martha (Smith) says:

+1

• Jens Astrom says:

I’ve heard this point being stressed in a course by Keith Beven several years ago. As far as I recall, he meant it is a common mistake to assume to know, or to be able to make good guesses of the distributions to use in many statistical analyses (also within a Bayesian framework). The choice of distribution is often done by following tradition, and may affect your inferences. In some systems/models it may not even be possible to know a priori, or to correctly specify the distributions. Hence the field of “informal likelihoods”, such as in Beven’s method GLUE, perhaps most used in hydrology.

I’m curious to hear what Andrew or someone else here thinks about this “problem” in the spirit of re-evaluating old statistical traditions. Is the problem perhaps dependent on model “complexity”? How often have you found that your inferences depended on choice of distributions, and that there were several plausible candidates to choose between? Is this something we should care more about?

• Rahul says:

Isn’t that a straw man? Sure, models matter critically.

Does anyone argue that the choice of model does not matter much in a problem?

• No, it’s not a straw man, lots of people just assume a standard model like “normally distributed measurement errors” or “linear regression” or “regression discontinuity with a high degree polynomial”

• Or even just assuming that it makes sense to think of health outcome as a function of distance from a river where a policy changes, 50 years after the policy was put in place.

• Andrew says:

David:

I agree with you. That’s why I wrote an article called, “The problems with p-values are not just with p-values”!

• Shravan says:

I remember reading this sentence: “…to move toward a greater acceptance of uncertainty and embracing of variation.” (as an alternative to using stat. sig.)

and thinking: what does that mean in practice? Maybe you should spell this out with an example data analysis, once with a p-value based analysis, and then with an analysis embracing uncertainty and variation. Otherwise people will not see the point. All they see is a rejection of a clear algorithm that leads to a publication, replaced by a touchy-feely alternative (seemingly—I am taking their position) that has no clear path laid out.

• Rahul says:

+1.

“embrace variation” sounds like one of those cliches that are totally non-actionable.

Heck, I accept uncertainty. It is all around me. But acceptance is the easy part. *Reducing* uncertainty is where the worthwhile challenge lies.

“Embrace variation” is the dictum the Indian train system seems to run on. “Eschew variation” is more like Deutsche Bahn. That’s what we want.

• shravan says:

Rahul, you clearly have little experience with Deutsche Bahn!

• Rahul says:

I’m absolutely fascinated when they announce on board delays of like three minutes.

• Andrew says:

Shravan, Rahul:

I have a zillion examples in my applied research of embracing variation and accepting uncertainty. For example this paper from 1999 on decision making for home radon exposure. In the past, people had tried to identify houses as being at risk or not, or as high radon or low radon. In our paper we (a) accepted the uncertainty which allowed flexible decision recommendations, and (b) embraced variation by using multiple levels of variation to fit our model.

Or this paper from 2008 on estimating incumbency advantage and its variation.

Reducing uncertainty is great, but often you have to accept the uncertainty that remains. Also, when I say “embrace variation”: sure, if variation is controllable it can be a good idea to reduce it; that’s one of the fundamental principles of quality control. But if you’re studying humans, that’s not always a possibility.

A specific way that understanding of variation can help is in psychology experiments, where for the past few years Eric Loken, I, and others, have recommended within-person designs instead of between-person designs so as to better align data collection with unavoidable variation.

• Rahul says:

Eschew variation by measuring variation: Reduce the variation you can and quantify whatever variation you can’t.

• Rahul says:

Is “embrace variation” analogous to saying “use continuous models instead of discrete”? Your radon example makes it seem so.

• Shravan says:

I haven’t read the radon paper yet but I know the work from Gelman and Hill. I’m also familiar with cost-benefit analyses, which give a clear decision criterion for policy decisions. But this advice is essentially useless and not actionable for the Wansinks and Cuddys and Fiskes and Gilberts of the psych* world. They can’t do a cost-benefit analysis of power posing to decide whether one should act as if it’s useful. Or if they can, it’s not clear how.

• Shravan says:

Andrew, if your answer is that they’re asking the wrong questions, I would agree with you. But these kinds of research questions (power posing, Stroop task, etc.) are the entire foundation of research in psychology-type areas, including my own area, psycholinguistics.

2. anon says:

I’d love to see Heckman respond to this.

3. Tom Passin says:

Actually, there is no such thing as a noisy data set *in the abstract*. It’s always noisy (or not) *with respect to what you want to measure*.

If you set your standards according to what you find, you will often find something “significant”. If that’s what you did, you had better take that criterion and try the whole thing again with a different data set *with this same criterion*.

If you set your standards according to your findings, you are doing exploratory data analysis. If you think otherwise, you are fooling yourself.

If it’s an observational data set that can’t be repeated, too bad but it’s still exploratory data analysis. Get over it!

• This is a good point. The p value is typically a nonlinear function of a dimensionless ratio, namely the observed average effect divided by the standard deviation of the measurement / sampling error. That’s often called the “t” statistic.

This t statistic is one possible way to define a meaningful dimensionless ratio, but it’s usually by no means the only dimensionless ratio of interest in a study. For example a meaningful ratio in many scenarios is something you might call U for Usefulness, namely the observed average divided by the size of the effect necessary for the study to produce an economically beneficial result. For example in a drug designed as a replacement for pseudoephedrine for use in OTC decongestant pills, U = m/S where m is the average across patients of some kind of area under a curve of decongestant effect vs time, and S is the area under the curve for Pseudoephedrine (S for Standard).

You could well have a statistically significant result (that is, you can detect that there is some decongestant effect that isn’t zero) while utterly failing to have even 1/4 of the effect of the drug you’re trying to replace.

If you set your standards at “statistical significance” you can get your drug approved by simply doing a good job of measuring and using large samples. If you set your standards at actually helping people… not so much.
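The contrast between the two ratios can be sketched in a few lines. Every number below is hypothetical and chosen only to illustrate the arithmetic; none of it is data on phenylephrine or pseudoephedrine, and the function names are made up for this example.

```python
from math import sqrt


def t_statistic(mean_effect, sd, n):
    """Observed average effect divided by its standard error."""
    return mean_effect / (sd / sqrt(n))


def usefulness(mean_effect, standard_effect):
    """U = m / S: observed average effect relative to the effect of the standard."""
    return mean_effect / standard_effect


if __name__ == "__main__":
    m = 0.2     # hypothetical average effect of the new drug
    sd = 1.0    # hypothetical between-patient standard deviation
    n = 10_000  # hypothetical (large) sample size
    S = 1.0     # hypothetical effect of the standard drug, same units as m

    print("t =", round(t_statistic(m, sd, n), 2))  # large: easily significant
    print("U =", round(usefulness(m, S), 2))       # small: far below the standard
```

With a large enough sample the t statistic can be made as big as you like while U stays exactly where it was, which is the gap between detecting an effect and having a useful one.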

Not surprisingly, this is exactly the case for Phenylephrine, the actual drug that does replace Pseudoephedrine in OTC stuff after they made Pseudoephedrine a behind-the-counter-and-register-your-drivers-license drug.

The Pseudoephedrine law resulted in about a 90% decrease in sales of pseudoephedrine in some places: http://media.arkansasonline.com/img/photos/2016/02/13/0214pseudoephidrine_t630.png?30004eeab9fb5f824ff65e51d525728c55cf3980

Unsurprisingly since the replacement drug is “statistically significant” but not actually useful, the incidence of chronic sinusitis has skyrocketed since the 2006 law, so that now something like 10% of the population suffers from the chronic form.

yay statistics!

• Rahul says:

@Tom

As an aside you do use the term “exploratory data analysis”.

But I see resistance sometimes at the suggestion that authors self-label all studies as exploratory or not.

I’ve never really understood that.

4. Jonathan (another one) says:

While I agree with this as a matter of mathematics, sometimes you *do* have big effects and sometimes you *can’t* make the sample bigger. I realize that your background assumption here is that the big effects have either all been found, or alternatively, are so obvious that they don’t need p values. (Won’t a survey of size 4 suffice to demonstrate that male sumo wrestlers are larger than female gymnasts?) But for those of us toiling in one-off studies to assist courts, small noisy datasets may be all we have. While it is still true that the inferences one can get are surely limited by Type S error, a mere 24% chance of getting the sign backward looks pretty good! If we have good theoretical reasons to think the size of the effect *might* be large, but aren’t allowed to instantiate those reasons into a prior because Courts find Bayesian methods “subjective,” this “fallacy” looks like a reasonable way of backing into truth.

• Andrew says:

Jonathan:

I agree that sometimes you do have big effects and sometimes you can’t make the sample bigger. Lots of important examples in political science and economics are like that. But I still object to the reasoning that a statistically significant result is more informative if obtained under noisy conditions. If noisy data are all you have, that’s fine, but don’t treat the noise as an argument in favor of the conclusion.

• Jonathan (another one) says:

I think it’s a matter of rhetoric. It’s not that the noise is a virtue, it’s simply that the effect stands out even against the noise. You just have to be honest about the fragility of the result. So much of what you’ve been talking about in the last year hasn’t really been about inference; it’s been about a failure to admit to the weakness of particular inferences out of professional bravado.

• Tom Passin says:

Well, if I’m going to be in court, the *last* thing I want is for the case against me to be supported by fragile conclusions. A 24% chance of an adverse decision that would otherwise be in my favor? This would not be “Justice”, I say.

• Jonathan (another one) says:

In criminal court, a 24% probability of error is deadly. In civil court, under the “preponderance of the evidence” standard, it may be fine, depending on the auxiliary evidence that can be brought to bear. The Court has to decide between plaintiff and defendant, even where the evidence on both sides is weak.

• I think the solution is to educate Courts on Bayesian methods, not to do Bayesian analyses and then back-hack p values until they tell you what you learned from the Bayesian version.

Also, doing an analysis in which you explain the effect of the prior is hugely helpful. For example, suppose the effect of some law enforcement intervention is of interest. You run a Bayesian analysis with a broad prior allowing for potentially large or small, positive or negative effects. Combining that prior with the noisy data, you find an effect of 1 on some scale where 1 is a very good effect.

Now, you ask “how strongly do I need to believe that the real effect is near zero to overcome what my data tells me?” Run the analysis with a prior on effect size of normal(0,.1), and normal(0,.01) and normal(0,.001) etc. Suppose that the expected effect size drops to 0.1 only with normal(0,.001) prior…

Then you can explain to the court: “unless you go into this analysis believing strongly that there’s a 90% chance that your effect is between -.002 and +.002 you have to come out of the analysis believing that the effect is at least of size 0.1” and “if you believe that it is conceivable that any size between -2 and 2 is possible, then the most likely thing is that the real effect is between .9 and 1.4 (or whatever your high probability interval is under your broad prior).”

Courts understand the concept of prejudice and keeping an open mind to many possibilities pretty well. Putting it in terms they can understand will help.
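The prior-sensitivity exercise described above can be sketched with a conjugate normal model. This is a simplification: it assumes the data enter only through an estimate with a known standard error, and that the prior on the effect is normal and centered at zero. The estimate of 1 follows the example in the comment; the standard error of 0.4 and the list of prior scales are hypothetical choices made purely for illustration.

```python
from math import sqrt


def normal_posterior(estimate, se, prior_sd, prior_mean=0.0):
    """Posterior mean and sd for a normal prior combined with a normal likelihood."""
    data_precision = 1.0 / se ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * estimate + prior_precision * prior_mean)
    return post_mean, sqrt(post_var)


if __name__ == "__main__":
    estimate, se = 1.0, 0.4  # se is a hypothetical value
    for prior_sd in (2.0, 1.0, 0.5, 0.1, 0.01):
        post_mean, post_sd = normal_posterior(estimate, se, prior_sd)
        print(f"prior sd={prior_sd}: posterior mean={post_mean:.3f}, sd={post_sd:.3f}")
```

Reading down the output shows how tight the prior has to be before the posterior mean is pulled most of the way back to zero, which is the statement one would then translate for the court. A real analysis would more likely fit the full model under each prior rather than rely on this shortcut.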

• Jonathan (another one) says:

I don’t disagree with any of this, except to say that what courts ought to be interested in and what they’re actually interested in are often completely different.

• Chris Wilson says:

This is a great point. I stumbled into this way of playing around with priors when I was tinkering with Lasso models for a paper. I had started out using Stan but then ran up against the question of how to set a default prior scale for non-exchangeable effects that yielded enough shrinkage to control noisy estimates, but not too much (an open-ended goal if ever there was one!). I ended up going with the ‘glmnet’ package and using cross-validation to tune the parameter, but I think the more Bayesian interpretation is very useful. Next time around, rather than cross-validation, I think trying a high-medium-low comparison might be more informative for inference (similar to what you sketch out above).

• Andrew says:

Chris:

We have some discussion of default priors in this wiki. Feel free to add your thoughts and questions to it. If you have specific issues, this could be very helpful.

Also if you’re doing variable-selection models, you might try the horseshoe prior which I believe is now implemented in rstanarm.

• Chris Wilson says:

Thanks for the link, very helpful resource! I need to do my homework on horseshoe priors…One thing unsatisfying about ‘glmnet’ is the lack of confidence intervals. I gather boot-strapping is an option, but I’d rather just use full Bayes. In that context, I like Daniel Lakeland’s idea of comparing inference under various strengths of prior scale. I think that packages like ‘glmnet’ (and rstanarm of course) are going to make Bayesian inference much more widespread – once you can link penalization/regularization to putting priors on coefficients, it makes it easier to overcome the old “priors are subjective” canard…

5. Paul Alper says:

One of the reasons the term “statistical significance” is so difficult to deal with is perhaps linguistic. There is a “natural” tendency to feel that totally unique is more alone than unique, “epicenter” is more than just the middle of something and “penultimate” is above ultimate; adding an adjective appears to glorify the noun. In like manner, attaching the mystical term “statistical” in front of significant magically enhances what it attaches to.

• Martha (Smith) says:

But then what about “practically significant”? Or “Practically significant but not statistically significant”? Or “Statistically significant but not practically significant”?

6. Rob MacCoun says:

I have a question about this: Over the years, I’ve had many students doing program evaluations of agencies with fairly small client loads. Their power analyses suggest that with alpha=.05 and anything but very large effects, they’d need more units than exist. I suppose one could say “too bad; life is not fair.” But based on arguments years ago from Shelly Zedeck and others, I have counseled them to relax the alpha level, with an explicit statement that they are increasing the risk of Type I errors to reduce the risk of a Type II error. I’d be curious as to whether readers here agree with that advice, or have better advice, e.g., from a Bayesian perspective.

• Martha (Smith) says:

In these cases with fairly small client loads, would it be possible to do a “census” of all the cases, rather than take a sample? Then there’s no need for statistical inference for evaluating the programs that have actually been carried out. (Though I realize the interest may be in evaluating the programs for future applications — but then there is also the problem that future situations in which the program might be considered may be different from the situations in which the program has already been applied, in which case statistical inference is also inappropriate for prediction)

• Z says:

You still need to do statistical inference if you don’t know what each client’s counterfactual outcome would have been under no intervention from the program.

• Andrew says:

Martha:

I disagree 100% with your statement that statistical inference is “inappropriate for prediction” for new scenarios that are different from the old. Of course statistical inference is appropriate for prediction here! In the real world, conditions change all the time. Statistical inference is for the real world, not just for idealized random samples and roulette wheels.

• I’m going to guess that what she meant here was extrapolation without accounting for extrapolation error. Of course statistical inference is the only way to go when doing extrapolation. But it’s also wrong to do in-sample inference, quantify the uncertainty in the in-sample inference, and then apply it to all other situations as if they HAD to be the same. Unfortunately that is done a lot.

• Martha (Smith) says:

Yes, Daniel is pretty much saying what I was thinking. A prediction from the sample you have may be the best you can do, but is inherently extrapolation, so to be intellectually honest, you need to emphasize that changes may affect what happens in a future situation and give serious thought to how possible changed circumstances might affect the quality of the prediction.

It’s analogous to doing a medical study on only subjects with European ancestry, then realizing that the situation may be different for people with African ancestry. (There was a case in the news today on how some blood sugar tests may be different for people of African vs European ancestry, especially those who have sickle cell trait).

• Anoneuoid says:

In practice, alpha is the expected value of p. As a rule of thumb, the “industry standard” is to choose a cutoff so that a bit less than ~50% of the analyses yield statistical significance (maybe about ~1/3).

That is why for rare climate data they use alpha = 1e-1, in biomed studies: alpha = 5e-2 (unless there happens to be a lot of data, then alpha = 1e-2), and in particle physics we can drop it to alpha = 3e-7.

7. A.P. Salverda says:

A related fallacy is the erroneous belief that as sample size increases, the probability of a false positive decreases.

(For instance, that a p-value of .02 is more likely to be a false positive when N = 10 than when N = 2,000.)

8. Michael Lew says:

These discussions would be better with a little less hyperbole. This is an example: “if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect”. The reality is less alarming in most circumstances, with the exaggeration being a function of the true effect size and the observed variance. If the true effect size is large then there will be little exaggeration of effect among significant results. When the variance is underestimated by the data then the scaled effect size (mean difference divided by SEM) might be exaggerated even when the observed mean difference is exactly correct. There is no guarantee.

• Andrew says:

Michael:

Here’s what I wrote: “We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward.” This statement is not hyperbole. It is literally true in this case because there is no way the true effect size is large. That is the point.

The sad thing is that a statement as extreme as mine, which sounds like hyperbole, isn’t! That’s how bad things are in some fields of empirical science.

• Kyle C says:

Prof, I have learned many things from your blog that I can’t say in polite company without sounding like a crackpot.

• Carlos Ungil says:

I think the point Michael is missing is that “any true effect” means “any true effect smaller than the effect assumed for the 0.06 power calculation.”

9. ojm says:

This is an important point and it also applies to, e.g., Royall’s bound on likelihood ratios: something like P(l1/l2>k) < 1/k when sampling from model 2.

The implication: in small n/noisy situations the apparent effects, conditional on them being found, will have to be much exaggerated.

The only 'solution' I know of is – make sure you collect enough data relative to the expected variability of the phenomenon of interest.
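For readers who want to see the bound in action, here is a minimal sketch for one specific case. It assumes normal data with known standard deviation, where model 2 (the true one) has mean zero and model 1 has mean delta; in that case the probability of a likelihood ratio of at least k depends only on c = delta * sqrt(n) / sigma. The general 1/k bound follows from Markov's inequality; the normal setup and the function names here are assumptions made for illustration.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf


def prob_misleading(k, c):
    """P(L1/L2 >= k) under model 2 for normal means, with c = delta*sqrt(n)/sigma > 0."""
    return PHI(-c / 2 - log(k) / c)


def worst_case(k, grid_size=2000, c_max=20.0):
    """Largest probability of misleading evidence over a grid of c values."""
    return max(prob_misleading(k, c_max * i / grid_size)
               for i in range(1, grid_size + 1))


if __name__ == "__main__":
    for k in (8, 32):
        print(f"k={k}: worst case on grid={worst_case(k):.4f}, "
              f"universal bound 1/k={1 / k:.4f}")
```

In this normal case the worst-case probability sits well below 1/k, and it is reached at an intermediate value of c rather than at the extremes.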

10. Art Owen says:

To get 5% significance at 6% power requires at least 6.65 fold exaggeration (not 9 fold). Of course, settings with 6 percent power are still dicey at 6.65 exaggeration (and 20% wrong sign), just not quite as bad as 9-fold exaggeration, which is more like 5.5% power.

https://arxiv.org/abs/1610.10028
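These figures can be checked with a short calculation. The sketch below takes one reading that appears consistent with them: a normally distributed estimate, a two-sided 5% test, and "exaggeration" defined as the smallest significant estimate (about 1.96 standard errors) divided by the true effect. That reading, and the function names, are assumptions; see the linked paper for the actual derivation.

```python
from statistics import NormalDist

N = NormalDist()
Z = N.inv_cdf(0.975)  # two-sided 5% critical value, about 1.96


def power(d):
    """Two-sided power when the true effect is d standard errors (d >= 0)."""
    return N.cdf(d - Z) + N.cdf(-d - Z)


def effect_for_power(target, lo=0.0, hi=10.0):
    """Bisection for the d that gives the target power (power is increasing in d)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if power(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def min_exaggeration(d):
    """Smallest significant |estimate| divided by the true effect."""
    return Z / d


def wrong_sign_share(d):
    """Share of significant results that have the wrong sign."""
    return N.cdf(-d - Z) / power(d)


if __name__ == "__main__":
    d = effect_for_power(0.06)
    print("min exaggeration at 6% power:", round(min_exaggeration(d), 2))
    print("wrong-sign share at 6% power:", round(wrong_sign_share(d), 3))
    print("power at 9-fold min exaggeration:", round(power(Z / 9), 4))
```

Under these assumptions the three printed values should land close to the 6.65-fold, 20%, and 5.5% quoted in the comment.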

11. Oliver Dechant says:

Significance testing is saying that there is less than a 1/20 chance of a contrary result. To get a significant result you only need to get that one statistic. Variation may not have anything to do with this effect. When we are observing a non-random sample of the results it may only be a misleading result of statistical significance and not of the treatment effect.

12. Jordan Anaya says:

This is the hardest post I’ve ever had to write. I’m sure there will be a backlash that misconstrues me as a bully, or even a terrorist.

But I’d like to humbly point out that the great and glorious Food and Brand Lab previously conducted a similar pizza study to the pizzagate study where they obtained the exact opposite results.

This study, http://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00057, showed that diners who paid more ate more, and those who paid less rated the pizza higher. Both of these findings are contradicted in the pizzagate study.