A lot of statistical methods have this flavor, that they are a solution to a mathematical problem that has been posed without a careful enough sense of whether the problem is worth solving in the first place

Stuart Hurlbert writes:

A colleague recently forwarded to me your 2012 paper with Hill and Yajima on the multiple comparison “non-problem”, as I call it.

You and your colleagues might find of interest the attached 2012 paper [with Celia Lombardi] that a colleague and I wrote, which reaches similar conclusions. Similar but not identical, as we are a bit Bayesian-shy after seeing so many exaggerated claims made for Bayesian approaches over recent decades.

I take pride in having for a few decades defended many colleagues against editors (and many graduate students against faculty members) who demanded “corrections” for multiple comparisons. We’ve gotten no small number of editors and professors to back off their unreasonable demands. Paper tigers all!

My reply:

I agree that those lopsided tests are too clever by half. I think a lot of statistical methods have this flavor, that they are a solution to a mathematical problem that has been posed without a careful enough sense of whether the problem is worth solving in the first place. Another example is tests of contingency tables with fixed margins. People have written many, many papers on the topic, sometimes with much technical sophistication, but from an applied perspective it’s almost never the right question, given that it’s extremely rare to have an experiment or observational study in which it would make sense for the margins of a table to be fixed.
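To make the fixed-margins point concrete, here is a minimal sketch in Python (the 2x2 table and the probabilities are invented for illustration, and scipy is assumed to be available). Fisher’s exact test conditions on both the row and column totals of the table being fixed; in a typical design only the group sizes are fixed, and the short simulation at the end shows that the “fixed” column total is actually a random quantity that varies from replication to replication.

```python
# Minimal sketch of the fixed-margins point (illustrative numbers only).
# Fisher's exact test conditions on both the row and column totals being
# fixed; in most real designs only the group sizes (row totals) are fixed
# and the column total is random.

import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 table: rows = two groups of 10, columns = success/failure.
table = np.array([[7, 3],
                  [2, 8]])

# Test that conditions on both margins being fixed.
odds_ratio, p_fisher = fisher_exact(table)

# Large-sample test that does not rely on that conditioning.
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Fisher exact p = {p_fisher:.3f}, chi-squared p = {p_chi2:.3f}")

# If only the group sizes were fixed by design, how often would the observed
# column margin (9 total successes) even recur, taking the observed
# proportions (0.7 and 0.2) as the true success probabilities?
rng = np.random.default_rng(0)
sims = rng.binomial(10, 0.7, 100_000) + rng.binomial(10, 0.2, 100_000)
same_margin = np.mean(sims == 9)
print(f"fraction of replications with the same column margin: {same_margin:.3f}")
```

Conditioning on a margin the design never fixed is not wrong as mathematics; the complaint is that it answers a question the experiment did not ask.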

9 thoughts on “A lot of statistical methods have this flavor, that they are a solution to a mathematical problem that has been posed without a careful enough sense of whether the problem is worth solving in the first place”

  1. Rosenbaum’s “Testing hypotheses in order” and “Sensitivity analysis for equivalence and difference in an observational study of neonatal intensive care units” might be useful for some of these instances where there are clearly many interrelated tests in the same paper, but a worst-case-scenario correction seems inappropriate. Often one can think of the tests as a sequence of refinements and use the naive p-values without shame.

    [1] http://www-stat.wharton.upenn.edu/~rosenbap/inorder.pdf
    [2] http://stat.wharton.upenn.edu/~rosenbap/senequivdif.pdf

  2. I admire the authors for blasting away with both barrels. That said, I can think of a multiplicity-ish objection to their neo-Fisherian framework*: winner’s curse (statistical version, not auction version). The authors write that they don’t think the new large-scale studies (they give the examples of microarray and neuroimaging studies, to which I would add genome-wide association studies) are so different that new methods are needed, and invite those who disagree to take up the challenge. Here’s my nickel version.

    In such large-scale studies, effects are estimated not so much for their intrinsic inferential value as for providing a basis on which to screen in targets for more labor-intensive follow-up. Right away, it is apparent why decision-theory-influenced thinking might work better in this setting than in the settings considered by the authors — the main point is to make a decision. But these studies are also used to set expectations for future results, which is an inferential sort of task; the problem is that the ranking (and/or selection) that is the main purpose of large-scale studies induces a propensity for Type M error in effect size estimators that would be unobjectionable in smaller studies. I see no hint in the paper that the neo-Fisherian framework is set up to cope with this problem. (A simulation sketch of this selection effect appears below, after the replies to this comment.)

    The paper calls Type S errors “Type III” error — yuck. The whole “Type {Roman numeral}” nomenclature is a crime against statisticianity.

    * as introduced and described in the paper; I’ve done no other background reading.

    • The genome-wide association literature needs some reform. It should take a decision-theoretic approach and treat the associations for the exploratory value that they provide when used alongside mechanistically-motivated hypothesis-driven research. Instead, p-values are bludgeoned to provide confirmatory legitimacy, with multiple comparison adjustments added to be “extra-sure”.

    • speaking of GWA, “gene set enrichment analysis” is another great example of mathematical solutions to problems we shouldn’t be solving in the first place. Taking the test statistics from marginal models as input to a statistical test? Should have stopped right there and estimated the correct test statistics in the first place!
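Here is a minimal simulation sketch, in Python, of the winner’s-curse point raised in the comment above (all the numbers are made up: how many effects are screened, the size of the true effects, the standard error, and the cutoff). Screen many noisy estimates, keep the ones that clear a threshold, and the survivors are exaggerated on average (Type M) and occasionally of the wrong sign (Type S), even though the estimator is unbiased over the full set.

```python
# Minimal winner's-curse / Type M simulation (illustrative numbers only):
# estimate many small true effects with noise, keep only the estimates that
# clear a stringent cutoff, and compare the selected estimates to their
# true values.

import numpy as np

rng = np.random.default_rng(42)

n = 10_000                                   # e.g. SNPs or genes screened
true_effects = rng.normal(0.0, 0.1, n)       # mostly small true effects
se = 0.2                                     # standard error of each estimate
estimates = true_effects + rng.normal(0.0, se, n)

# Screening step: keep estimates that clear a stringent z cutoff
# (a stand-in for a genome-wide-style significance threshold).
selected = np.abs(estimates / se) > 3.0

wrong_sign = np.mean(np.sign(estimates[selected]) != np.sign(true_effects[selected]))

print(f"selected {selected.sum()} of {n} effects")
print(f"mean |estimate| among selected:    {np.abs(estimates[selected]).mean():.2f}")
print(f"mean |true effect| among selected: {np.abs(true_effects[selected]).mean():.2f}")
print(f"fraction of selected with the wrong sign (Type S): {wrong_sign:.2f}")
print(f"overall bias of the unscreened estimates: {np.mean(estimates - true_effects):+.4f}")
```

Under these made-up numbers the estimates that survive the screen overstate their true effects severalfold, which is exactly the expectation-setting problem the comment describes: the full set of estimates is unbiased, but conditioning on selection is not.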

  3. What? Is there a problem with finding a:

    solution to a mathematical problem that has been posed without a careful enough sense of whether the problem is worth solving in the first place

    Are you guys trying to devalue my Master’s degree?

    Observer
    PS. I did much better in my PhD thesis. I formulated and studied, but did not solve, a problem worth solving. Did not bother to publish, unfortunately, as it turns out to be a big deal some decades later.
