The statistical significance filter

Posted on September 10, 2011 9:33 AM by Andrew

I’ve talked about this a bit but it’s never had its own blog entry (until now).

Statistically significant findings tend to overestimate the magnitude of effects. This holds in general (because E(|x|) > |E(x)|) but even more so if you restrict to statistically significant results.

Here’s an example. Suppose a true effect of theta is unbiasedly estimated by y ~ N (theta, 1). Further suppose that we will only consider statistically significant results, that is, cases in which |y| > 2.

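If |theta|<2, the estimate |y| conditional on statistical significance is always too high, since every retained estimate satisfies |y| > 2 > |theta|. But even if |theta|>2, the estimate |y| conditional on statistical significance will still be too high in expectation.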
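The example above can be checked numerically. Here is a minimal sketch, assuming the same setup (y ~ N(theta, 1), significance cutoff |y| > 2) and using the closed-form mean of a normal distribution truncated to the significant region. The function name and the particular theta values are illustrative choices, not part of the original argument.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: supplies cdf and pdf


def mean_abs_given_significant(theta, cutoff=2.0):
    """E[|y| given |y| > cutoff] for y ~ N(theta, 1), in closed form."""
    upper = 1.0 - Z.cdf(cutoff - theta)  # P(y > cutoff)
    lower = Z.cdf(-cutoff - theta)       # P(y < -cutoff)
    # Contribution of the upper tail: E[y ; y > cutoff]
    upper_part = theta * upper + Z.pdf(cutoff - theta)
    # Contribution of the lower tail: E[-y ; y < -cutoff]
    lower_part = -theta * lower + Z.pdf(cutoff + theta)
    return (upper_part + lower_part) / (upper + lower)


# Illustrative true effects, both below and above the cutoff of 2
for theta in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0):
    m = mean_abs_given_significant(theta)
    print(f"theta = {theta:.1f}   E[|y| given significant] = {m:.2f}")
```

For each theta in this list, the conditional mean comes out larger than |theta|, including the cases with |theta| > 2; the gap shrinks as theta grows but does not vanish.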

For a discussion of the statistical significance filter in the context of a dramatic example, see this article or the first part of this presentation.

I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures, which select at an even higher threshold, make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. This might not be a concern in some problems (for example, in identifying candidate genes in a gene-association study) but it arises in any analysis (including just about anything in social or environmental science) where the magnitude of the effect is important.
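To see how a stricter threshold worsens the overestimate, here is a small simulation sketch. The true effect theta = 1 and the stricter cutoff of 3 are assumptions chosen for illustration (the cutoff stands in for whatever a multiple comparisons adjustment would produce), and the helper name is hypothetical.

```python
import random


def exaggeration_ratio(theta, cutoff, n_draws=200_000, seed=0):
    """Simulated E[|y| given |y| > cutoff] / |theta| for y ~ N(theta, 1).

    theta must be nonzero, since the ratio divides by |theta|.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    draws = (rng.gauss(theta, 1.0) for _ in range(n_draws))
    # Keep only the "statistically significant" estimates
    kept = [abs(y) for y in draws if abs(y) > cutoff]
    return (sum(kept) / len(kept)) / abs(theta)


theta = 1.0  # assumed true effect, in standard-error units
for cutoff in (2.0, 3.0):  # usual cutoff vs. an assumed stricter one
    ratio = exaggeration_ratio(theta, cutoff)
    print(f"cutoff = {cutoff:.0f}   significant estimates overstate theta by a factor of about {ratio:.2f}")
```

Under these assumptions the ratio is larger at the stricter cutoff: raising the threshold filters out more of the estimates near the true value and keeps only the more extreme ones.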

[This is part of a series of posts analyzing the properties of statistical procedures as they are actually done rather than as they might be described in theory. Earlier I wrote about the problems of inverting a family of hypothesis tests to get a confidence interval and how this falls apart given the way that empty intervals are treated in practice. Here I consider the statistical properties of an estimate conditional on it being statistically significant, in contrast to the usual unconditional analysis.]

15 thoughts on “The statistical significance filter”

Carlisle Rainey on September 10, 2011 at 9:48 am said:

Perhaps this explains why many published academic articles cannot be replicated in an industrial setting? I couldn’t help but think about the “Of Beauty, Sex, and Power” article mentioned above when I read the following post by Tyler Cowen. While academics have a strong incentive to find a significant effect, industry researchers have an incentive to get the best estimate.

http://marginalrevolution.com/marginalrevolution/2011/09/how-good-is-published-academic-research.html

Uri Simonsohn on September 10, 2011 at 11:46 am said:

John Ioannidis has a nice article on this problem.

http://dcscience.net/ioannidis-associations-2008.pdf

numeric on September 10, 2011 at 7:54 pm said:

Would you care to relate these points to GWAS (genome-wide association studies)? Or perhaps you can provide a link to a relevant article or two.
- Corey on September 10, 2011 at 11:16 pm said:
  
  Here’s one. I first linked to it in response to a previous post on Type M errors.
- worzel on September 10, 2011 at 11:29 pm said:
  
  In GWAS (where testing takes priority over estimation) the regression to the mean problem is well known; researchers call it the “Winners’ Curse”, a term borrowed from economics and game theory.
  
  Of course, estimation is still useful in GWAS, so bias-correction methods have been developed; e.g.
  
  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796696/
  
  http://www.sph.umich.edu/csg/boehnke/pdf/ge33-453.pdf

revo11 on September 11, 2011 at 12:14 am said:

@numeric In GWAS studies, the effect magnitude is usually secondary to making a binary decision on whether a gene is significant or not (effect sizes are usually pitifully small anyway). These issues are definitely there, but then again GWAS studies have all kinds of interpretation problems that are probably higher on the list than over-estimating effect sizes.

In my opinion, many of the assumptions underlying GWAS are long overdue for re-evaluation: “common disease common variant”, looking for individual SNPs with large effect sizes, the confirmatory/causal, rather than exploratory, style of analysis, presentation and interpretation…

David W. Hogg on September 11, 2011 at 8:40 am said:

This problem occurs frequently in astronomy, when there is multi-exposure imaging. If a source is only “detected significantly” in a subset of the exposures, the mean of the brightness measurements of the significant detections is an over-estimate of the source brightness. This is obvious when you think about it for this simple case (where the significance cut happens right before the measurement), but if the “significance cut” happened long ago in the data analysis chain, maybe even by another investigator who passed on the results in “reduced” form, it is hard to see it and note its possible effect.
