## The statistical significance filter

Statistically significant findings tend to overestimate the magnitude of effects. This holds in general (because E(|x|) > |E(x)|) but even more so if you restrict to statistically significant results.

Here’s an example. Suppose a true effect of theta is unbiasedly estimated by y ~ N(theta, 1). Further suppose that we will only consider statistically significant results, that is, cases in which |y| > 2.

The estimate “|y| conditional on |y|>2” is clearly an overestimate of |theta|. First off, if |theta|<2, the estimate |y| conditional on statistical significance is not only too high in expectation, it's always too high. This is a problem, given that |theta| is in reality probably less than 2. (The low-hanging fruit have already been picked, remember?)

But even if |theta|>2, the estimate |y| conditional on statistical significance will still be too high in expectation.
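As a rough check on these two claims, here is a minimal sketch that evaluates E(|y| given |y|>2) in closed form for y ~ N(theta, 1). The function names and the grid of theta values are my own choices for illustration, not anything from the articles linked in this post; the formula is just the standard partial expectation of a normal distribution over the two tails.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def conditional_mean_abs(theta, cutoff=2.0):
    """E(|y| given |y| > cutoff) for y ~ N(theta, 1)."""
    p_upper = normal_upper_tail(cutoff - theta)  # P(y > cutoff)
    p_lower = normal_upper_tail(cutoff + theta)  # P(y < -cutoff)
    # Partial expectations of |y| over the upper and lower tails.
    upper = theta * p_upper + normal_pdf(cutoff - theta)
    lower = -theta * p_lower + normal_pdf(cutoff + theta)
    return (upper + lower) / (p_upper + p_lower)


if __name__ == "__main__":
    # Illustrative grid of true effects (an assumption, not from the post).
    for theta in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0):
        m = conditional_mean_abs(theta)
        print(f"theta={theta:.1f}  E(|y| given significance)={m:.2f}")
```

For every theta in the grid the conditional mean exceeds |theta|: by a lot when |theta| is small, and by a shrinking margin once |theta| is well above 2.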

For a discussion of the statistical significance filter in the context of a dramatic example, see this article or the first part of this presentation.

I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures, which select at an even higher threshold, make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. This might not be a concern in some problems (for example, in identifying candidate genes in a gene-association study) but it arises in any analysis (including just about anything in social or environmental science) where the magnitude of the effect is important.
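To see how a higher threshold worsens the type M error, here is a small simulation sketch. It assumes a Bonferroni correction as the stand-in for a "classical multiple comparisons procedure", a true effect of theta = 1 on the standard-error scale, and 100 comparisons; all three are illustrative assumptions of mine, as are the function names.

```python
import random
from statistics import NormalDist


def two_sided_cutoff(alpha, n_tests=1):
    """z threshold for a two-sided test at level alpha, Bonferroni-adjusted
    for n_tests comparisons (Bonferroni is an assumption of this sketch)."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


def exaggeration_ratio(theta, cutoff, n_draws=200_000, seed=1):
    """Simulated E(|y| given |y| > cutoff) / |theta| for y ~ N(theta, 1)."""
    rng = random.Random(seed)
    draws = (rng.gauss(theta, 1.0) for _ in range(n_draws))
    kept = [abs(y) for y in draws if abs(y) > cutoff]
    if not kept:
        return float("nan")  # nothing passed the filter
    return (sum(kept) / len(kept)) / abs(theta)


if __name__ == "__main__":
    theta = 1.0  # hypothetical true effect, in standard-error units
    for n_tests in (1, 100):
        cutoff = two_sided_cutoff(0.05, n_tests)
        ratio = exaggeration_ratio(theta, cutoff)
        print(f"tests={n_tests:3d}  cutoff={cutoff:.2f}  exaggeration={ratio:.2f}")
```

Because every retained estimate must exceed the cutoff, raising the cutoff mechanically raises the average retained estimate, while the true effect stays where it was.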

[This is part of a series of posts analyzing the properties of statistical procedures as they are actually done rather than as they might be described in theory. Earlier I wrote about the problems of inverting a family of hypothesis tests to get a confidence interval and how this falls apart given the way that empty intervals are treated in practice. Here I consider the statistical properties of an estimate conditional on it being statistically significant, in contrast to the usual unconditional analysis.]

1. Perhaps this explains why many published academic articles cannot be replicated in an industrial setting? I couldn’t help but think about the “Of Beauty, Sex, and Power” article mentioned above when I read the following post by Tyler Cowen. While academics have a strong incentive to find a significant effect, industry researchers have an incentive to get the best estimate.

2. John Ioannidis has a nice article on this problem.

http://dcscience.net/ioannidis-associations-2008.pdf

3. numeric says:

Would you care to relate these points to GWAS (genome-wide association studies)? Or perhaps you can provide a link to a relevant article or two.

4. revo11 says:

@numeric In GWAS studies, the effect magnitude is usually secondary to making a binary decision on whether a gene is significant or not (effect sizes are usually pitifully small anyway). These issues are definitely there, but then again GWAS studies have all kinds of interpretation problems that are probably higher on the list than over-estimating effect sizes.

In my opinion, many of the assumptions underlying GWAS are long overdue for re-evaluation – “common disease common variant”, looking for individual SNPs with large effect sizes, the confirmatory/causal, rather than exploratory, style of analysis, presentation and interpretation…

5. This problem occurs frequently in astronomy, when there is multi-exposure imaging. If a source is only “detected significantly” in a subset of the exposures, the mean of the brightness measurements of the significant detections is an over-estimate of the source brightness. This is obvious when you think about it for this simple case (where the significance cut happens right before the measurement), but if the “significance cut” happened long ago in the data analysis chain, maybe even by another investigator who passed on the results in “reduced” form, it is hard to see it and note its possible effect.

6. […] that 50% of academic studies are wrong. I wonder if this is related to Andrew Gelman’s idea that statistically significant estimates are too large in expectation. Joseph at Observational […]

7. […] size, and cannot itself be checked because of company confidentiality concerns . . . More here and here. […]

8. […] make the biggest claims and generate more citations. Editors demand overestimates by functioning as statistical significance filters. The editorial policies encourage scholars who find large effects, many of which are overestimates, […]

9. […] findings that cannot be reproduced. This was in part motivated by Andrew Gelman’s recent post making me think that journal editors work as statistical significance filters, thus creating […]

10. […] Gelman kicked things off with a post about the statistical significance filter, and following up by commenting on type M errors in the lab and featuring a post about data […]

12. […] they can be on small sample sizes? My reply: In increasing order of mathematical sophistication, see this blog post, this semi-popular article, and this scholarly article. I think there’s room for […]

13. […] These are beside the point with respect to this blog post, and there is a lot of stuff online about it, so I’m not feeling inclined to retrace the steps I took so many months ago to come to this conclusion. The short version is that I remember being convinced that the sample sizes in the experiments reported by Bem were reasonably likely to be indicating that the data had been probed repeatedly and that the experiments had been stopped when statistically significant results were found. That is, it seems reasonably likely to me that the effects found by Bem are due, at least in part, to the statistical significance filter. […]