Jeff Leek just posted the discussions of his paper (with Leah Jager), “An estimate of the science-wise false discovery rate and application to the top medical literature,” along with some further comments of his own.
Here are my original thoughts on an earlier version of their article. Keith O’Rourke and I expanded these thoughts into a formal comment for the journal. We’re pretty much in agreement with John Ioannidis (you can find his discussion in the top link above).
In quick summary, I agree with Jager and Leek that this is an important topic. I think there are two key places where Keith and I disagree with them:
1. They take published p-values at face value whereas we consider them as the result of a complicated process of selection. This is something I didn’t use to think much about, but now I’ve become increasingly convinced that the problem with published p-values is not a simple file-drawer effect or a few p=0.051 values nudged toward p=0.049, but rather an ongoing process in which tests are performed contingent on the data (see the first sketch after this list). As Keith and I discussed in our article, Jager and Leek’s model of p-values could make sense in a large genetics study in which many, many comparisons are performed and the entire set of comparisons is analyzed (in essence, a hierarchical model), but I don’t think it works so well when you’re analyzing many different single p-values published in many different studies.
2. Jager and Leek talk about things such as “the science-wise false discovery rate.” I don’t think such a concept is so well defined. To start with, I don’t think people are usually studying zero effects. I think what is happening is that there are many effects that are small, and these effects can vary (so that, for example, a treatment could have a small positive effect in one scenario and a small negative effect in another scenario). Errors can be defined in various ways, but as a start I like to think about Type S (sign) and Type M (magnitude) errors. I certainly believe that the Type S error rate is less than 50%: we’d expect to get the sign of any comparison correct more than half the time, if there’s any signal whatsoever! How high is the Type M error rate? That depends on how large an error has to be to count as a Type M error. Are true effects overestimated by more than a factor of 2, more than 50% of the time? Possibly. This could be worth studying (see the second sketch below).
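To make point 1 a bit more concrete, here’s a back-of-the-envelope simulation in Python. This is not Jager and Leek’s code, and all the numbers (effect size, number of tests tried per study, sample sizes) are made up for illustration. The idea is that if each published p-value is really the smallest of several data-contingent tests, and only “significant” results get written up, then the published p-values no longer behave like the face-value distribution that a simple mixture calculation assumes:

```python
# Minimal sketch (hypothetical numbers throughout) of how data-contingent
# testing plus publication selection distorts the distribution of published
# p-values, compared to taking them at face value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 10_000      # hypothetical studies
n_looks = 5             # tests tried per study (researcher degrees of freedom)
n_per_group = 50        # observations per arm
true_effect = 0.1       # small nonzero true effect, in sd units (assumed)

published = []
for _ in range(n_studies):
    pvals = []
    for _ in range(n_looks):
        # each "look" is a slightly different comparison on noisy data
        y_treat = rng.normal(true_effect, 1, n_per_group)
        y_ctrl = rng.normal(0, 1, n_per_group)
        pvals.append(stats.ttest_ind(y_treat, y_ctrl).pvalue)
    p_best = min(pvals)
    if p_best < 0.05:           # only "significant" results get written up
        published.append(p_best)

published = np.array(published)
print(f"share of studies yielding a published p < 0.05: {len(published) / n_studies:.2f}")
print(f"share of published p-values below 0.01:         {np.mean(published < 0.01):.2f}")
# A face-value analysis of `published` sees an excess of small p-values
# produced by selection, not by strong underlying effects.
```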
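And to make point 2 concrete, here’s a similar sketch for Type S and Type M error rates among statistically significant estimates, assuming a small true effect measured noisily. Again, the specific numbers (a true effect of 0.1 sd and a standard error of 0.5) are hypothetical, chosen only to show how the rates would be computed:

```python
# Rough sketch of Type S (wrong sign) and Type M (exaggeration) error rates
# among statistically significant estimates, under assumed (hypothetical) values
# for the true effect and the standard error.

import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1       # small positive true effect (assumed)
se = 0.5                # standard error of each study's estimate (assumed)
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates) > 1.96 * se           # two-sided z-test at 5%

sig_estimates = estimates[significant]
type_s_rate = np.mean(sig_estimates < 0)              # wrong sign, given significance
exaggeration = np.abs(sig_estimates) / true_effect    # Type M: |estimate| relative to truth

print(f"share of estimates reaching significance (power): {np.mean(significant):.3f}")
print(f"Type S error rate among significant results:      {type_s_rate:.3f}")
print(f"median exaggeration factor (Type M):              {np.median(exaggeration):.1f}")
print(f"share of significant results off by more than 2x: {np.mean(exaggeration > 2):.2f}")
# With these assumed numbers the design is badly underpowered, so the
# statistically significant estimates greatly exaggerate the true effect.
```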
In short, I think there are a few more steps needed before their method maps to science as practiced. But it’s great to see all this discussion. Simple calculations have their place, as long as their limitations are understood, and I believe that this sort of discussion pushes the field forward.
A separate but related issue is an idea that I think underlies all of Jager and Leek’s work here: that scientists are generally pretty reasonable people and that science as a whole is pretty sensible. I’ll buy that. The Stapels and Hausers and Wegmans and Dr. Anil Pottis of the world are the exceptions; that’s what makes their stories so striking. And even the routine finding of statistical significance amid noise is, I’m willing to believe, usually done in the service of some true underlying effects. This makes it hard to believe claims that most published papers are false.
I wonder whether this particular issue can be resolved by considering areas of research rather than single papers. Suppose it’s true (as John Ioannidis, Keith O’Rourke, and I suspect) that most scientific papers have a lot more noise than is usually believed, that statistically significant results go in the wrong direction far more than 5% of the time, and that most published claims are overestimates, sometimes by a lot. This can be OK if these scientific subfields are lurching toward the truth in some way. I think this could be a useful way forward, to see if it’s possible to reconcile the feeling that science is basically OK with the evidence that individual claims are quite noisy.