
I got 99 comparisons but multiplicity ain’t one

After I gave my talk at an econ seminar on Why We (Usually) Don’t Care About Multiple Comparisons, I got the following comment:

One question that came up later was whether your argument is really with testing in general, rather than only with testing in multiple comparison settings.

My reply:

Yes, my argument is with testing in general. But it arises with particular force in multiple comparisons. With a single test, we can just say we dislike testing so we use confidence intervals or Bayesian inference instead, and it’s no problem—really more of a change in emphasis than a change in methods. But with multiple tests, the classical advice is not simply to look at type 1 error rates but more specifically to make a multiplicity adjustment, for example to make confidence intervals wider to account for multiplicity. I don’t want to do this! So here there is a real battle to fight.
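For concreteness, the widening that a Bonferroni-style adjustment produces can be sketched in a couple of lines. This is a generic numeric illustration (SciPy; the helper name is mine, not from any particular text):

```python
from scipy.stats import norm

def ci_halfwidth(se, m, alpha=0.05):
    """Half-width of a two-sided normal CI, Bonferroni-adjusted for m comparisons."""
    # With m comparisons, each interval gets alpha/m instead of alpha.
    return norm.ppf(1 - alpha / (2 * m)) * se

single = ci_halfwidth(se=1.0, m=1)      # ~1.96 standard errors
adjusted = ci_halfwidth(se=1.0, m=100)  # ~3.48 standard errors: same data, much wider interval
```

Going from one comparison to a hundred stretches every interval by nearly a factor of two, which is exactly the adjustment at issue.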

P.S. Here’s the article (with Jennifer and Masanao), to appear in the Journal of Research on Educational Effectiveness. (Sounds like an obscure outlet but according to Jennifer it’s read by the right people. Education researchers are very interested in multiple comparisons.)


  1. Robert Birkelbach says:

    Andrew, have you written an article on this issue? I think this is a highly interesting topic, especially since I only knew various correction methods before.

  2. Jeff says:

    I like the talk! I think one thing that is worth pointing out is that multiple testing can be really useful as a measure of noise among discoveries and for resource allocation. In genomics, a typical study will look for associations between an outcome (say cancer status) and one of 20,000 variables (say gene expression levels). In this setting, relatively little prior knowledge about which genes will be associated with the outcome is available. The goal of the study is usually hypothesis generation, and some subset of the discoveries will be subjected to expensive and time-consuming follow-up studies/validation.

    When performing experiments like this, the false discovery rate (the expected proportion of false discoveries among the discoveries) can be really useful when interpreted as a level of noise among the discoveries. In particular, if you have a set of discoveries with an estimated FDR of 5%, then you expect that, on average, no more than 5% of them will be false. The FDR is really useful for scientists in this situation, because they are going to follow up on some (or all) of the discoveries. The FDR can be thought of as a measure of “potentially wasted graduate student time on follow-ups”. Then they can choose an error rate threshold based on what they have time/resources to follow up and how willing they are to risk wasted effort.
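    The FDR logic above can be made concrete with the Benjamini-Hochberg step-up procedure, the standard way to turn a target FDR into a p-value cutoff. A minimal NumPy sketch (the function name and example p-values are mine, not from any particular genomics pipeline):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: mark discoveries at target FDR level q.

    Sort the p-values, find the largest rank k with p_(k) <= (k/m) * q,
    and reject every hypothesis at or below that rank."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()      # largest qualifying rank (0-based)
        discoveries[order[: k + 1]] = True  # reject all p-values up to that rank
    return discoveries

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
hits = benjamini_hochberg(pvals, q=0.05)  # only the two smallest p-values survive
```

    Here q = 0.05 is roughly the “5% of follow-ups wasted” budget described above.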

    I also noticed you mention the problem of correlation in your talk. In genomics, this is a problem that keeps us up at night. In particular, we worry about latent sources of correlation like batch effects, which can have a major impact on the conclusions of a study. Since this is such a problem, there has been a lot of effort, and some pretty useful solutions have been developed, for handling latent sources of correlation in multiple testing. Since these solutions have primarily been developed for genomic applications, I’m not sure how well known they are in neuroscience or the social sciences.

  3. revo11 says:

    I’d like to learn more about this.

    One thing that’s come up frequently is that it’s often easy to critique whether the test statistics are identically distributed (this probably leads to issues with classical corrections as well). For example, there may be multiple tests on multiple contingency tables, each with different numbers of total counts, varying amounts of missing data, or other issues. Are there any resources (particularly example case studies) on how one should proceed in those situations?

    Regarding correlations, I wonder if Brad Efron’s large scale inference methods would help? At least they would capture the deviation of the distribution of test statistics from the theoretical null due to correlations. Although I wonder if correlations can lead to more subtle bias issues that are not addressed by fitting the distribution of test statistics.
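    Efron's empirical-null idea (estimating the null's center and spread from the bulk of the observed z-scores, instead of assuming N(0,1)) can be caricatured in a few lines. This is only a sketch of the central-matching intuition using robust location and scale estimates, not Efron's actual estimators:

```python
import numpy as np

def empirical_null(z):
    """Rough empirical-null estimate: treat the bulk of the z-scores as null
    and estimate its center and spread robustly, so a minority of true
    signals in the tails barely moves the fit."""
    z = np.asarray(z, dtype=float)
    mu = np.median(z)                       # robust center of the bulk
    q25, q75 = np.percentile(z, [25, 75])
    sigma = (q75 - q25) / 1.349             # IQR of N(0,1) is about 1.349
    return mu, sigma

# Simulated "null" z-scores that are shifted and overdispersed relative to N(0, 1),
# the kind of thing correlation/batch effects can produce.
rng = np.random.default_rng(1)
z = rng.normal(0.3, 1.4, size=50000)
mu_hat, sigma_hat = empirical_null(z)       # close to (0.3, 1.4), not (0, 1)
```

    Correlation among tests typically shows up as an estimated spread well above 1, which is exactly the deviation from the theoretical null at issue.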

    I’ll have to read that paper…

    • fred says:

      Genovese, Roeder, and Wasserman developed some methods that allow one to use information about the different tests.

      The Efron approach of doing (clever) things with the distribution of p-values is unlikely to be the best solution for correlated tests. As Jeff’s work points out, it really pays to think hard about sources of correlation, and where possible to use data that directly addresses correlation between observations. Trying to address such correlation using just the p-values is pretty painful; p-values are crude and unhelpful summaries for this purpose. Happily, better summaries are often available, even when the full data isn’t.

      • revo11 says:

        Usually I’ve seen him write about it in terms of the normalized test statistics rather than p-values, but yeah, I’m sure such a general method probably has its limitations.

  4. bxg says:

    Not on multiple comparisons, but prompted by your slide disclaiming having ever suffered either Type 1 or Type 2 errors…

    There is (as you frequently state or come close to stating) nothing of interest in the social sciences where the hypothesis theta_1 = theta_2 is serious; we always know it’s false. I’ve often been tempted to be even more assertive than this: to suspect that there has never been an interesting academic question – whether in the social sciences or not – where the question of X = Y (X truly, truly equals Y) is all of (a) interesting, (b) about the real world rather than just pure mathematics, and (c) usefully approachable using statistical methods on experimental data. This skepticism hinges on whether there are such “real world” questions where X = Y has some anomalously large prior relative to, say, (|X-Y| < 1/(10^(10^(10^(10^(10^(10^10))))))) – and yet where there is still noise and statistical variability (otherwise it's a mathematical/logical puzzle). I hadn't run across a useful counterexample that I couldn't caveat away.

    Until, I think, today:
    Matter and antimatter might truly be equivalent (in an absolute sense) or not, but if they are, it's a fact that can usefully be questioned empirically and statistically.

    I think I'm very unlikely to have made my bemusement at all clear, but to anyone who understands —
    can you propose any other similar examples?