After I gave my talk at an econ seminar on Why We (Usually) Don’t Care About Multiple Comparisons, I got the following comment:

One question that came up later was whether your argument is really with testing in general, rather than only with testing in multiple comparison settings.

My reply:

Yes, my argument is with testing in general. But it arises with particular force in multiple comparisons. With a single test, we can just say we dislike testing so we use confidence intervals or Bayesian inference instead, and it’s no problem—really more of a change in emphasis than a change in methods. But with multiple tests, the classical advice is not simply to look at type 1 error rates but more specifically to make a multiplicity adjustment, for example to make confidence intervals wider to account for multiplicity. I don’t want to do this! So here there is a real battle to fight.

P.S. Here’s the article (with Jennifer and Masanao), to appear in the Journal of Research on Educational Effectiveness. (Sounds like an obscure outlet but according to Jennifer it’s read by the right people. Education researchers are very interested in multiple comparisons.)

Andrew, have you written an article on this issue? I think this is a highly interesting topic, especially since I only knew various correction methods before.

I added the link above.

Thanks a lot!

I like the talk! I think one thing that is worth pointing out is that multiple testing can be really useful as a measure of noise among discoveries and for resource allocation. In genomics, a typical study will look for associations between an outcome (say, cancer status) and each of 20,000 variables (say, gene expression levels). In this setting, relatively little prior knowledge is available about which genes will be associated with the outcome. The goal of the study is usually hypothesis generation, and some subset of the discoveries will be subjected to expensive and time-consuming follow-up studies/validation.

When performing experiments like this, the false discovery rate (the expected proportion of false discoveries among the discoveries) can be really useful when interpreted as a level of noise among the discoveries. In particular, if you have a set of discoveries with an estimated FDR of 5%, then on average you expect no more than 5% of them to be false. The FDR is really useful for scientists in this situation, because they are going to follow up on some (or all) of the discoveries. The FDR can be thought of as a measure of “potentially wasted graduate student time on follow-ups”. Then they can choose an error rate threshold based on what they have time/resources to follow up and how willing they are to risk wasted effort.
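To make the “wasted follow-up effort” interpretation concrete, here is a minimal sketch of selecting a discovery set at a target FDR with the standard Benjamini–Hochberg step-up procedure. The simulated p-values (19,900 nulls plus 100 genes with real signal) are made up purely for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Benjamini-Hochberg step-up: return a boolean mask of discoveries
    controlling the false discovery rate at the given level."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # compare the sorted p-values to the BH thresholds fdr * k / m
    below = p[order] <= fdr * np.arange(1, m + 1) / m
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True  # reject the k smallest p-values
    return mask

# toy genomics-scale example: 19,900 null genes plus 100 with real signal
rng = np.random.default_rng(0)
p_null = rng.uniform(size=19_900)         # nulls: uniform p-values
p_signal = rng.beta(0.1, 10.0, size=100)  # signal: p-values piled near 0
discoveries = benjamini_hochberg(np.concatenate([p_null, p_signal]), fdr=0.05)
print(discoveries.sum(), "genes selected for follow-up")
```

At a 5% target, on average about 5% of whatever the procedure returns is expected to be a wasted follow-up, which is exactly the budgeting interpretation above.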

I also noticed you mention the problem of correlation in your talk. In genomics, this is a problem that keeps us up at night. In particular, we worry about latent sources of correlation like batch effects (http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html), which can have a major impact on the conclusions of a study. Since this is such a problem, there has been a lot of effort and some pretty useful solutions developed for handling latent sources of correlation in multiple testing (http://www.pnas.org/content/105/48/18718). Since these solutions have primarily been developed for genomic applications, I’m not sure how well known they are in neuroscience or the social sciences.

Are there any studies that have tried to calibrate the accuracy of their hypothesis tests? What I’ve seen instead is biologists using the multiple-comparison-adjusted hypothesis test scores out of something like edgeR ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/ ) to rank hypotheses. Then they only have resources to evaluate a handful or maybe a couple dozen. If that’s all you want to do, you don’t even need to adjust!
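The point about ranking is easy to check: standard adjustments such as Bonferroni (and the Benjamini–Hochberg adjusted p-values) are monotone in the raw p-values, so the top-k list is the same with or without them. A toy illustration with made-up numbers:

```python
import numpy as np

# five raw p-values for five hypotheses
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20])
# Bonferroni adjustment: multiply by the number of tests, cap at 1
bonf = np.minimum(pvals * len(pvals), 1.0)
# the adjustment is monotone, so the ranking of hypotheses is unchanged
assert list(np.argsort(pvals)) == list(np.argsort(bonf))
```

So if the budget is “follow up the top ten”, the adjusted and unadjusted lists are identical.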

Although it’s not in the paper Andrew linked, I helped Masanao develop a Bayesian multiple comparisons version of the edgeR model for differential gene expression. You can compute expected false discovery rate directly from the posterior.
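The posterior expected FDR calculation itself is straightforward once you have a per-gene posterior probability of being null; the probabilities below are made-up numbers, since fitting the hierarchical model itself is beyond a blog comment:

```python
import numpy as np

def bayesian_fdr(post_null, threshold):
    """Expected FDR of the set {genes with P(null | data) <= threshold}:
    just the average posterior null probability over the reported set."""
    post_null = np.asarray(post_null, dtype=float)
    selected = post_null <= threshold
    if not selected.any():
        return 0.0  # empty discovery set: nothing can be falsely discovered
    return float(post_null[selected].mean())

# toy posterior null probabilities for five genes
post = [0.01, 0.03, 0.04, 0.60, 0.95]
print(bayesian_fdr(post, threshold=0.05))  # average of 0.01, 0.03, 0.04
```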

If you couple the hierarchical expression model with the Bayesian approach to gene (splice variant) expression I proposed ( http://lingpipe-blog.com/2010/02/05/inferring-splice-variant-mrna-expression-rna-seq/ ), which is similar to that in the Li et al. paper ( http://lingpipe-blog.com/2010/06/09/li-ruotti-stewart-thomson-and-dewey-2010-rna-seq-gene-expression-estimation-with-read-mapping-uncertainty/ ), you even account for covariance among the expression estimates.

I couldn’t agree more about the latent sources of correlation. We’d treat them exactly this way — as random effects using latent parameters. For instance, you can use a factor model to try to model correlations in high dimensions. In some cases, they’re not entirely latent. For instance, we can model effects like differential hexamer priming during sample prep in the alignment model and in the expression model ( http://lingpipe-blog.com/2010/04/25/sequence-alignment-with-conditional-random-fields/ ). Similarly, we can model the bias that fluorescence induces in the base-level read confusions in sequencers like Illumina or SOLiD.

“Are there any studies that have tried to calibrate the accuracy of their hypothesis tests?”

I feel a little silly linking to a bunch of my papers, but we actually just published a paper on this one too (http://www.biostat.jhsph.edu/~jleek/papers/jointnull.pdf). We also have an R package that can be used to calibrate multiple testing procedures called dks (http://bioconductor.case.edu/bioconductor/2.9/bioc/html/dks.html).

“For instance, you can use a factor model to try to model correlations in high dimensions.”

Right, but the key point is that this factor model has to be supervised by the comparison of interest. If not, you will likely either (a) wash out all the signal from the true positives or (b) strongly anti-conservatively bias your significance results.
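One simple way to sketch the “supervised” idea (this is a deliberately crude caricature of the surrogate variable analysis approach, not the actual algorithm): estimate the latent factors from the residuals of the primary model, so the factors cannot simply soak up the signal of the comparison of interest.

```python
import numpy as np

def surrogate_factors(expr, design, n_factors=1):
    """Crude sketch of the supervised-factor idea: estimate latent
    factors from the residuals of the primary model, so they cannot
    absorb the signal of the comparison of interest.

    expr:   genes x samples expression matrix
    design: samples x covariates primary design (e.g. intercept + group)
    """
    # projection onto the column space of the primary design
    hat = design @ np.linalg.pinv(design)
    # per-gene residuals after removing the primary-variable fit
    resid = expr - expr @ hat.T
    # leading right-singular vectors of the residuals = candidate factors
    _, _, vt = np.linalg.svd(resid, full_matrices=False)
    return vt[:n_factors].T  # samples x n_factors
```

Because the factors are built from residuals, they are orthogonal to the primary design by construction; the real methods are more careful than this, but the sketch shows why an unsupervised factor fit on the raw data would behave differently.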

Scott and Berger have an interesting paper on this, from a decision-theoretic point of view:

http://www.stat.duke.edu/~berger/papers/multcomp.pdf

I’d like to learn more about this.

One thing that’s come up frequently is that it’s often easy to question whether the test statistics are identically distributed (which probably causes problems for classical corrections as well). For example, there may be multiple tests on multiple contingency tables, each with different total counts, varying amounts of missing data, or other issues. Are there any resources (particularly example case studies) on how one should proceed in those situations?

Regarding correlations, I wonder if Brad Efron’s large scale inference methods would help? At least they would capture the deviation of the distribution of test statistics from the theoretical null due to correlations. Although I wonder if correlations can lead to more subtle bias issues that are not addressed by fitting the distribution of test statistics.
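For what it’s worth, the core of the empirical-null idea is simple to sketch (this is a rough quantile-based version, not Efron’s actual fitting procedure): estimate the null’s center and spread from the middle of the observed z-scores, which are assumed to be mostly nulls, rather than assuming the theoretical N(0, 1).

```python
import numpy as np
from statistics import NormalDist

def empirical_null(z, central_quantile=0.5):
    """Rough sketch of the empirical-null idea: estimate the null mean
    and sd from the central part of the observed z-scores instead of
    assuming the theoretical N(0, 1)."""
    z = np.asarray(z, dtype=float)
    lo, hi = np.quantile(z, [(1 - central_quantile) / 2,
                             (1 + central_quantile) / 2])
    # center of the null: mean of the central block of z-scores
    mu = z[(z >= lo) & (z <= hi)].mean()
    # rescale the central interquantile range to a normal sd
    half_width = NormalDist().inv_cdf((1 + central_quantile) / 2)
    sd = (hi - lo) / (2 * half_width)
    return float(mu), float(sd)
```

If correlation (or a batch effect) has shifted and widened the bulk of the z-scores, this recovers that shifted null instead of the theoretical one, though, as you say, it only captures what shows up in the marginal distribution of the statistics.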

I’ll have to read that paper…

Genovese, Roeder, and Wasserman developed some methods that allow one to use information about the different tests:

http://biomet.oxfordjournals.org/content/93/3/509.full.pdf

The Efron approach of doing (clever) things with the distribution of p-values is unlikely to be the best solution for correlated tests. As Jeff’s work points out, it really pays to think hard about sources of correlation, and where possible to use data that directly addresses correlation between observations. Trying to address such correlation using just the p-values is pretty painful; p-values are crude and unhelpful summaries, for this purpose. Happily, better summaries are often available, even when the full data isn’t.

Usually I’ve seen him write about it in terms of the normalized test statistics rather than p-values, but yeah, I’m sure such a general method probably has its limitations.

Not on multiple comparisons, but prompted by your slide disclaiming having ever suffered either Type 1 or Type 2 errors…

There is (as you frequently state or come close to stating) nothing of interest in the social sciences where the hypothesis theta_1 = theta_2 is serious; we always know it’s false. I’ve often been tempted to be even more assertive than this: to suspect that there has never been an interesting academic question – whether in the social sciences or not – where the question of X = Y (X truly, truly equals Y) is all of (a) interesting, (b) about the real world rather than just pure mathematics, and (c) usefully approachable using statistical methods on experimental data. This skepticism hinges on whether there are such “real world” questions where X = Y has some anomalously large prior relative to the question of, say, |X − Y| < 1/10^(10^(10^(10^(10^(10^10))))) – and yet where there is still noise and statistical variability (otherwise it’s a mathematical/logical puzzle). I hadn’t run across a useful counterexample that I couldn’t caveat away.

Until, I think today: http://www.bbc.co.uk/news/science-environment-15734668

Matter and antimatter might truly be equivalent (in an absolute sense) or not, but if they are, it’s a fact that can usefully be questioned empirically and statistically.
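One way to put the point in symbols (notation mine): a point null is worth testing only when it carries non-negligible prior mass pi_0, as in a spike-and-slab prior:

```latex
p(\theta) = \pi_0\,\delta_0(\theta) + (1-\pi_0)\,g(\theta),
\qquad
\Pr(\theta = 0 \mid y)
  = \frac{\pi_0\, p(y \mid \theta = 0)}
         {\pi_0\, p(y \mid \theta = 0) + (1-\pi_0)\int p(y \mid \theta)\,g(\theta)\,d\theta}.
```

In the social sciences theta = 0 exactly gets pi_0 essentially zero, so the test answers a question nobody is asking; a fundamental symmetry like matter–antimatter equivalence is a rare case where pi_0 is plausibly substantial.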

I think I'm very unlikely to have made my bemusement at all clear, but to anyone who understands —

can you propose any other similar examples?