
I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim.

David Austin pointed me to this article by Leah Jager and Jeffrey Leek. The title is funny but the article is serious:

The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported P‐values as the data. We then collect P‐values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, P = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, P = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.

Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. But I don’t think the statistical framework they are using is appropriate for the questions they are asking. My biggest problem is the identification of scientific hypotheses and statistical “hypotheses” of the “theta = 0” variety.

Here’s what I think is going on. Medical researchers are mostly studying real effects (certain wacky examples aside). But there’s a lot of variation. A new treatment will help in some cases and hurt in others. Also, studies are not perfect, there are various sorts of measurement error and selection bias that creep in, hence even the occasionally truly zero effect will not be zero in statistical expectation (i.e., with a large enough study, effects will be found). Nonetheless, there is such a thing as an error. It’s not a type 1 or type 2 error in the classical sense (and as considered by Jager and Leek), rather there are Type S errors (someone says an effect is positive when it’s actually negative) and Type M errors (someone says an effect is large when it’s actually small, or vice versa). For example, the notorious study of beauty and sex ratios was a Type M error: the claim was an 8 percentage point difference in the probability of a girl (comparing the children of beautiful and non-beautiful parents), but I’m pretty sure any actual difference is 0.3 percentage points or less, it could go in either direction, and there’s no reason to suppose it persists over time. The point in that example is not that the true effect is or is not zero (thus making the original claim “false” or “true”) but rather that the study is noninformative. If it got the sign right it’s by luck, and in any case it’s overestimating the magnitude of any difference by more than an order of magnitude.
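The Type S and Type M argument can be made concrete with a small simulation. The sketch below is in R and rests on stated assumptions: the true difference of 0.3 percentage points is the upper bound mentioned above, while the standard error of 3 percentage points is a hypothetical value chosen for illustration, not a figure taken from the original study.

```r
# Illustrative numbers: true_diff comes from the text (0.3 percentage points);
# se is an ASSUMED standard error, not a value reported by the original study.
set.seed(123)
true_diff <- 0.3     # true difference, in percentage points
se        <- 3       # assumed standard error of a single study's estimate
n_sims    <- 100000  # number of hypothetical replications

# Each replication yields one noisy estimate of the true difference.
est <- rnorm(n_sims, mean = true_diff, sd = se)

# Two-sided test at the 0.05 level.
significant <- abs(est) > qnorm(0.975) * se

power        <- mean(significant)                        # chance of "significance"
type_s_rate  <- mean(est[significant] < 0)               # wrong sign, given significance
exaggeration <- mean(abs(est[significant])) / true_diff  # Type M: average overestimate factor

round(c(power = power, type_s = type_s_rate, exaggeration = exaggeration), 2)
```

Under these assumed numbers, an estimate has to exceed about 1.96 × 3 ≈ 5.9 percentage points in absolute value to be declared significant, so every significant estimate overstates the true difference by a factor of roughly 20 or more, whatever its sign.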

Yes, I recognize that my own impressions may be too strongly influenced by my own experiences (very non-statistical of me); nonetheless, I see this whole false-positive, true-positive framework as a dead end.

Now to the details of the paper. Based on the word “empirical” in the title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05. But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lost in the noise and are essentially undetectable given the way they are studied.
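To make the modeling step concrete, here is a minimal sketch in R of a uniform-plus-beta mixture truncated at 0.05 and fitted by maximum likelihood. It is not the authors' code: the parameterization, the sample size, the null share of 0.14 used to simulate, and the beta shape values are all hypothetical choices for illustration.

```r
set.seed(1)
alpha <- 0.05                   # only p-values below this are assumed to be reported

# --- simulate "reported" p-values from the ASSUMED model ---------------------
n      <- 5000                  # hypothetical number of reported p-values
w_true <- 0.14                  # assumed share of reported results that are true nulls
a_true <- 0.5; b_true <- 50     # hypothetical beta shape parameters
n_null <- rbinom(1, n, w_true)
p_null <- runif(n_null, 0, alpha)                        # uniform, truncated at alpha
u      <- runif(n - n_null, 0, pbeta(alpha, a_true, b_true))
p_alt  <- qbeta(u, a_true, b_true)                       # beta, truncated at alpha
p      <- c(p_null, p_alt)

# --- negative log-likelihood of the truncated mixture ------------------------
nll <- function(par, p) {
  w <- plogis(par[1])           # null weight, kept inside (0, 1)
  a <- exp(par[2])              # beta shapes, kept positive
  b <- exp(par[3])
  dens <- w / alpha + (1 - w) * dbeta(p, a, b) / pbeta(alpha, a, b)
  -sum(log(dens))
}

# Nelder-Mead (optim's default) from a rough starting point.
fit   <- optim(c(0, log(0.5), log(10)), nll, p = p)
w_hat <- plogis(fit$par[1])     # estimated share of nulls among reported p-values
w_hat
```

The fit here is run on data generated from the very model being fitted, which is the favorable case. The estimated weight is read entirely off the shape of the p-value distribution, so it says nothing about how the estimate behaves when reported p-values do not follow the assumed uniform and beta components.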

Jager and Leek write that their model is commonly used to study hypotheses in genetics and imaging. I could see how this model could make sense in those fields: First, at least in genetics I could imagine a very sharp division between a small number of nonzero effects and a large number of effects that are essentially null. Second, in these fields, a researcher is analyzing a big data dump and gets to see all the estimates and all the p-values at once, so at that initial stage there is no p-hacking or selection bias. But I don’t see this model applying to published medical research, for two reasons. First, as noted above, I don’t think there would be a sharp division between null and non-null effects; and, second, there’s just too much selection going on for me to believe that the conditional distributions of the p-values would be anything like the theoretical distributions suggested by Neyman-Pearson theory.
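The selection point can be illustrated with one simple form of p-hacking, optional stopping. The R sketch below assumes a true effect of exactly zero and a hypothetical analyst who re-tests after every batch of new observations; the batch sizes, the stopping rule, and the function name `one_study` are illustrative assumptions, not a description of any real study.

```r
set.seed(2)

# One hypothetical study with a true effect of exactly zero. The analyst
# runs a t-test after each batch and stops at the first p < 0.05, or when
# the maximum sample size is reached (optional stopping).
one_study <- function(n_start = 10, n_max = 50, step = 5) {
  x <- rnorm(n_start)
  repeat {
    p <- t.test(x)$p.value
    if (p < 0.05 || length(x) >= n_max) return(p)
    x <- c(x, rnorm(step))
  }
}

p_all <- replicate(2000, one_study())
p_sig <- p_all[p_all < 0.05]   # the p-values that would be "published"

# Share of null studies that end up significant; the nominal rate is 0.05.
mean(p_all < 0.05)

# Counts of significant p-values in five equal-width bins on (0, 0.05].
table(cut(p_sig, breaks = seq(0, 0.05, by = 0.01)))
```

If the significant p-values from null effects were uniform on (0, 0.05), the five bins would hold roughly equal counts. In simulations of this kind the counts tend instead to pile up toward 0.05, which is one way the conditional distribution of published p-values can depart from the textbook one.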

So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.

I hate to be so negative—they have a clever idea and I think they mean well. But I think this sort of analysis reveals little more than the problems that arise when you take statistical jargon such as “hypothesis” too seriously.

P.S. Jager and Leek note that they’ve put all their data online so that others can do their own analyses. Also see Leek’s reply in comments.

P.P.S. More from Leek. To respond briefly to Leek’s comments: (1) No, my point about Type 1 errors is not primarily “semantics” or “philosophy.” I agree with Leek that his framework is clear—my problem is that I don’t think it applies well to reality. As noted above, I don’t think his statistical model of hypotheses corresponds to actual scientific hypotheses in general. (2) When I remarked that Jager and Leek did not follow up the published studies to see which were true (however “true” is defined), I was criticizing their claim to be “empirical.” They write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time”—but I don’t see this as an empirical estimate at all, I see it as almost entirely model-based. To me, an empirical estimate of the rate of false positives would use empirical data on positives and negatives. (3) Leek does some new simulation studies. That seems like a good direction to pursue.

P.P.P.S. Just to clarify: I think what Jager and Leek are trying to do is hopeless. So it’s not a matter of them doing it wrong, I just don’t think it’s possible to analyze a collection of published p-values and, from that alone, infer anything interesting about the distribution of true effects. It’s just too assumption-driven. You’re basically trying to learn things from the shape of the distribution, and to get anywhere you have to make really strong, inherently implausible assumptions. These estimates just can’t be “empirical” in any real sense of the word. It’s fine to do some simulations and see what pops up, but I think it’s silly to claim that this has any direct bearing on claims of scientific truth or progress.

56 Comments

  1. Jeff Leek says:

    Andrew-

    First of all, thanks for reading our paper and pointing out that you think it is a “serious” paper. We think so too and are very serious about our statistics. I have read your blog for a long time and appreciate your disagreement with hypothesis testing. That being said, our paper is a direct response to the original work, which defined “correct” and “incorrect” in the medical literature by the truth of the null hypothesis. We totally agree that that is a very debatable definition of correct.

    However, we felt it was important to point out that when using that definition you can actually estimate the rate of false discoveries with principled methods. These methods are well justified in the statistical literature and we took pains to point out our assumptions in both the paper and the supplemental material. Whether you agree with those assumptions is of course, a totally reasonable thing to talk about.

    However, this is a source of frustration for me and my co-author as we have had trouble with reviewers, who, like you, may not like one or the other of our assumptions and have rejected our paper without giving us a chance to discuss those assumptions in detail. We believe that we have performed a careful analysis and it deserves to be published.

    So while I respect your right to say what you want on your blog (lord knows I do), it is more than a little frustrating that you may have just biased every statistical reviewer that we might get in the future with one fast and loose post.

    Jeff

    • Andrew says:

      Jeff:

      My post may be fast but I don’t think it’s loose! I am completely serious that (a) I don’t think your model (in which some effects are exactly zero) is appropriate, and (b) your analysis appears to take p-values at face value, which, as noted above, could make sense in a large genomic or imaging study but I don’t think is an appropriate model for published p-values.

      I definitely want to give your paper a fair hearing. But you write, “this is a source of frustration for me and my co-author as we have had trouble with reviewers, who, like you, may not like one or the other of our assumptions and have rejected our paper without giving us a chance to discuss those assumptions in detail.” Indeed, I think that disagreeing with an unreasonable assumption is an excellent reason to reject a paper! You can of course discuss those assumptions in detail in the paper itself.

      In any case, by blogging this I hope I’ve brought your paper to the attention of many more readers. It may very well be that many readers will like your paper and judge my criticisms of it to be off base. Or perhaps you or others will be inspired by these and other criticisms to upgrade the model in some way. I agree that this is an important topic and you have to start somewhere!

      • Jeff Leek says:

        Andrew,

        I think that dismissing a paper on spec based on previous biases/assumptions is maybe a little “loose” (maybe there is a better word). But hey, it’s the blogosphere. Thanks for responding.

        Jeff

        • Andrew says:

          Jeff:

          I’m not sure what you mean by “on spec,” but in any case the blogosphere has nothing to do with it. I’d have no problem publishing my remarks in a scientific journal. My words being on a blog should give them no lower status than your words being on Arxiv.

          Regarding the assumptions in your paper, check out Uri Simonsohn’s work on p-hacking: there’s a lot of evidence that published p-values are selected from a larger population. Beyond this, yes, I think it’s fair to evaluate your assumptions based on my experiences (which are not the same as “biases/assumptions”). The story of beauty and sex ratio is something I observed. It’s an anecdote and it would be fair enough to call it n=1, but my recounting of it represents neither a bias nor an assumption.

          In any case, I appreciate your engaging in comments here. I much prefer open disagreement to people just ignoring each other.

    • Rahul says:

      Jeff:

      You can’t post a paper on Arxiv and expect people to comment about it only positively. I think you need to take Andrew’s criticism in a better light.

      Biasing reviewers is a silly argument; if that’s such a pressing concern, don’t pre-post on Arxiv, IMHO. Calling criticism you disagree with “fast and loose” is also a bit rude, I think.

      • Jeff Leek says:

        Thanks for pointing that out. I don’t mean to be rude.

        Jeff

        • Rahul says:

          If I may suggest a modification: Your abstract mentions the number 77,430; a formidable number indeed. But on closer reading, apparently only ~5000 abstracts actually contained extractable p-values.

          I think that number deserves mention at the abstract level.

          Also, might there be a selection bias here? About what sort of papers choose to prominently report p-values in abstracts?

  2. Jeff Leek says:

    Andrew-

    One more point I would mention is that we have made an effort to make all of our code freely available online from Github:

    https://github.com/jtleek/swfdr

    So others can consider alternative assumptions/modeling strategies.

    Jeff

  3. Neuroskeptic says:

    “It’s the blogosphere” – but I can point to at least a half dozen examples of arguments I made on my blog, that later appeared in peer-reviewed journals. I’m sure I’m not alone.

    • David Manheim says:

      That’s clearly a biased sample. Of course you have some arguments that made it into published papers, but if you don’t look at the full universe of arguments, you’re just “p-hacking,” in a slightly different context.

      (Sorry for criticizing, I just wanted to make the discussion recursive.)

  4. Hal Pashler says:

    A few quick thoughts:

    –Whether or not the model the authors use to estimate “errors” is quite right, it seems to me important and reassuring that the p-curves do not show the rising slope just below .05 which everyone seems to agree is a warning sign of p-hacking (Simonsohn/Nelson/Simmons group has been exploring this in depth.) Sad to say, some literatures are already found to show such a bulge: http://www.tandfonline.com/doi/abs/10.1080/17470218.2012.711335

    –But before anybody thinks “oh OK, the rule is: Biomedicine credible, Psychology not”, is it perhaps the case that a big proportion of the studies in the Jager & Leek sample are clinical trials, an arena in which mandatory registration and regulation pretty well stomp out p-hacking? It would be interesting to see the p-curves for other areas of life sciences where investigators are captains of their own ships–such as the areas where Amgen and Bayer reported very low rates of replication.

  5. K? O'Rourke says:

    Primary assumption is not true

    “the P-values for false positive findings are uniformly distributed between 0 and 1”

    is not true for non-randomized studies (with confounding it could be anything)

    and only (approx?) true for ideally conducted RCTs

    and I have never noticed an ideally conducted RCT in clinical research.

  6. […] was a little surprised to see it appear on Andrew Gelman’s blog with the disheartening title, “I don’t believe the paper, “Empirical estimates suggest most published medical research i… I responded briefly this morning to his post, but then had to run off to teach class. After […]

  7. EJ Wagenmakers says:

    Without having read the paper by Jager and Leek itself (for which I apologize), I wonder how this method deals with the prior proportion of zero or near-zero effects. This proportion was one of the important variables in the analysis by Ioannidis, but it seems as if the present analysis does not take this into account. I also wonder, with Andrew, how the approach deals with publication bias and p-hacking. I am just now reading the book “Bad Pharma” by Ben Goldacre and it seems that the authors’ estimate is too good to be true! But I might be biased, “Bad Pharma” makes its case in a rather compelling fashion. Hal: it seems that the medical field requires preregistration, but it does not require publication; also, the data analysis stage apparently leaves a lot of freedom.

  8. Larry Wasserman says:

    Jeff, I liked your paper but I agree with Andrew: even when H0 is true I doubt that the p-value has a uniform distribution, due to hidden biases.

    Larry Wasserman

    • Jeff Leek says:

      Larry – I’ve actually done a little work on that. I worked out how the FDR is affected by long range dependence and proposed a general framework for addressing that dependence. The dependence manifests as changes in the p-value distribution just like you mention. So just to be sure we did a basic sensitivity analysis. Check it out on my blog, I give the link above. -Jeff

      • K? O'Rourke says:

        Jeff – to suggest you have finessed the unknown distribution of p-values given no effect in non-randomised studies (you included an Epi journal) or even flawed RCTs, to _me_ sounds like claiming perpetual motion.

        You need to listen to the criticisms carefully and perhaps not suggest you have already addressed them all.

  9. John Khademi says:

    I’m confused by this statement from their paper:

    “The [Ioannidis] claim is based on the assumption that most hypotheses considered by researchers have a low pre-study probability of being successful. The suggested reasons for this low pre-study probability are small sample sizes…”

    How does a small sample size influence the pre-study probability?

  10. James Thorniley says:

    I have a question:

    “You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on.”

    Then later:

    “I think what Jager and Leek are trying to do is hopeless. So it’s not a matter of them doing it wrong, I just don’t think it’s possible to analyze a collection of published p-values and, from that alone, infer anything interesting about the distribution of true effects.”

    How is Jager and Leek’s approach in principle any different to Simonsohn’s? Simonsohn didn’t necessarily look at p-values, but the basic idea is the same surely?

    • Andrew says:

      James:

      In this context, I think it’s easier to do what Simonsohn’s doing (say that a set of p-values does not seem consistent with a posited model of scientific reporting) than what Jager and Leek are trying to do (make a claim that some percentage of claims are true). To be highly assumption-based can be ok if you’re refuting a model, not so much if you’re trying to make inferences if the model is actually inappropriate. Also, Simonsohn analyzed a lot of specific cases, which I find more compelling than Jager and Leek’s approach of just throwing together this diverse set of numbers. (As noted above, I can see such an approach making sense in a single study of many p-values coming from a genetics or imaging experiment, which is apparently where these methods come from.)

    • Andrew says:

      To put it another way: roughly speaking, Simonsohn is looking at reliability, Jager and Leek are trying to learn about validity. In general, reliability is easier to study than validity.

  11. K? O'Rourke says:

    Andrew: Thanks for the P.P.P.S. above.

    I recalled last night, that it was well into 1960s? before WG Cochrane first pointed out to folks that in non-randomised studies, confidence interval coverage decreased rapidly with increasing sample size.

    So I thought I should make that particular instance of the problems you raised easier for some of your readers to see.

    The p-values simulated with the R code below for the observational study are very non-uniform for a bias of 25% of an SD (that could be larger or smaller depending on the context)

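    # 1000 simulated studies of n = 20 each, with a true effect of zero
    RCT  <- replicate(1000, rnorm(20))
    # Unbiased ("RCT") p-values: one-sample t-test of mean = 0 per study
    RCTp <- apply(RCT, 2, function(x) t.test(x)$p.value)
    # Biased ("observational") p-values: same data shifted by 0.25 SD
    BIAS <- 0.25
    OBSp <- apply(RCT, 2, function(x) t.test(x + BIAS)$p.value)
    # Side-by-side histograms of the two sets of p-values
    par(mfrow = c(1, 2))
    hist(RCTp)
    hist(OBSp)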

    Readers who wish, may easily vary the size of the bias and study sample sizes.

    • gwern says:

      > I recalled last night, that it was well into 1960s? before WG Cochrane first pointed out to folks that in non-randomised studies, confidence interval coverage decreased rapidly with increasing sample size.

      Could you enlarge on this? I went googling and read up on Cochrane and did an interlibrary loan for the only paper in the ’60s which seemed to match (“Designing Clinical Trials”; http://dl.dropbox.com/u/29304719/Papers/designing%20clinical%20trials.pdf ), but it doesn’t seem to make this point anywhere in it.

  12. […] Are 80% of scientific findings wrong and impossible to replicate, or is it just 14%? A statistical argument. My argument: if most scientific findings are wrong, why do so many people assume that the finding that most scientific findings are wrong is right? And more: An argument that that statistical argument is wrong, […]

  13. Chris Auld says:

    Suppose we put aside statistical issues such as Andrew raises. Assume all p-values generated by researchers are void of “hacking,” methodological errors, and all selection biases other than that assumed by the authors: only values less than 0.05 get published. All researchers know the true distribution of their test statistics under the null and in the alternative direction. The methods in the paper still do not reveal the rate of “false positives.”

    The model is identified entirely off functional form: everything hinges on the assumption that the distribution of p-values when the null is false is beta. Even small violations of that assumption will yield biased estimates of the rate of “false positives.” The references the paper gives supporting this functional form are not convincing, e.g., Pounds and Morris (2003) offer that the beta-uniform mixture “provides a reasonable model for the distribution of p-values arising from a microarray experiment.” But this is a context in which a “reasonable model” doesn’t suffice.

    See Jim Heckman’s paper “The Effect of Prayer on God’s Attitude Toward Mankind” (http://ftp.iza.org/dp3636.pdf) for an amusing illustration of the pitfalls of identification purely off functional form.

    I would also question the premise of the paper that it’s useful to split hypotheses in the medical (or social science) literatures into “true” or “false.” “All nulls are false” may not quite be literally true in these literatures, but surely the rate of “true nulls” is vanishingly small.

    • Andrew says:

      Chris:

      Exactly. That’s what I was getting at in my P.P.P.S.: “You’re basically trying to learn things from the shape of the distribution, and to get anywhere you have to make really strong, inherently implausible assumptions.”

      Regarding Heckman’s paper, see here.

    • conchis says:

      I agree that the over-reliance on functional form assumptions is an issue, but I think this is less because the beta functional form is wrong / too restrictive than because it’s too flexible. In the literature that the Jager-Leek technique builds on, the beta components of the mixing distribution are intended as a flexible catch-all for *anything that isn’t uniform*.

      To my mind, this creates a problematic double standard: the assumed distribution of p-values from false positives is very restrictive, but the assumed distribution of p-values from true positives is much more flexible (the beta distribution even includes the uniform as a special case). All this means that any error in the uniformity assumption will bias downwards the estimate of false positives.

      • conchis says:

        Meant to add that, in this context, I see the JL paper as providing a lower bound on the false positive rate, rather than an unbiased estimate.

  14. Kent Lyon says:

    So, if published medical studies are true (is he including the original paper on autism linked to childhood vaccines?), why do they so frequently contradict each other?
    Actually, the great body of medical literature lags current practice, is irrelevant, hokum, or fraudulent. Take the paper in the NEJM by Dr. Nissen on Avandia, released early to great fanfare and political acclaim, particularly by Henry Waxman who got a copy before even early release on the internet (i.e., he saw it before any doctor in America saw it) and immediately held a press conference to proclaim it as the poster paper for post-marketing regulatory power for the FDA, as the paper purported to show, via a meta-analysis (the paper violated all the requirements for an actual meta-analysis), that Avandia caused heart disease (it doesn’t). The leading peer reviewer on the paper (Dr. Stephen Hafner of UT San Antonio) made the following statement on the paper’s release and publication (he told the NEJM not to publish the paper, that it didn’t merit publication): “The New England Journal of Medicine has become like a British Tabloid, minus the picture of the bare-chested woman on page 3.” Dr. Nissen used an invalid statistical method for the data he had to analyze that data. He even referenced the article in Medical Statistics that showed that his statistical analysis was invalid. But the NEJM published it for political reasons, to support the efforts of Democrats in the Congress to get post-marketing regulatory authority for the FDA. The paper is still cited as if it were valid by other papers.
    The medical literature is a travesty, politicized, misleading, erroneous, fraudulent, irrelevant, etc., etc., etc.
    Anyone who would consider such a question is hopelessly naive and uninformed.

    • george says:

      Kent, your comment is dangerous nonsense.

      Avandia has been severely restricted by the FDA and other regulatory agencies worldwide, and is the subject of thousands of adverse events lawsuits, of which the majority have been settled by GSK. In addition, GSK pled guilty and was fined $3bn for withholding results from early safety studies indicating that Avandia caused dangerous cardiovascular complications.

      If you want to pick on the weaknesses of the medical literature, and its ability to monitor drugs post-approval, Avandia is really not the place to start.

  15. Seth Roberts says:

    The original paper (“Why most published research findings are false”) that Jager and Leek criticize contains the statement “It can be proven that most claimed research findings are false.” Proven. The original paper contains something like a proof but no data supporting the assumptions of that proof. Jager and Leek’s paper contains a substantial amount of data. I haven’t read it but the mere use of data strikes me as a big step forward.

    • sweed says:

      All steps are not necessarily “forward.” The nihilistic argument, ‘you’ve got to start somewhere’ is a nice comfort blanket if you think that expended effort is a virtue in and of itself. Sadly, most published articles in my field (endodontics) follow that dictum. And these types of papers are not simply “useless”….they are harmful.

      • Re:sweed says:

        1. to think Seth’s comment is “nihilistic” suggests that you do not understand what “nihilistic” means

        2. The original paper by Ioannidis has no data to support his completely arbitrary choice of “numbers” — did you read both papers?

        3. you think most published data-driven articles in endodontics are useless or harmful. then do you simply follow authors who present their hypotheses or theories without providing any evidence?

        • sweed says:

          1. I did not say Seth’s comments are nihilistic. Starting “somewhere” means “anywhere” means “it doesn’t matter.”

          2. Yes. His paper is conceptual. The problems with the current paper have been discussed here. The author seems loath to take criticism despite apparently being rejected already from peer review.

          3. Ask your dentist next time you need work done. It gets pretty personal, doesn’t it? The disease model we use in my specialty is 50 years old and full of holes. These holes are still chalked up to “cognitive dissonance.” The harm comes from the increased confidence the practitioner gets from reading this literature. The entire body of work in the past 5 years on computed tomography comes to mind. You start by reducing your confidence.

  16. Brian Caffo says:

    I think that an important aspect of this manuscript is missing from this discussion. The manuscript is in direct response to Ioannidis’ paper and approaches the problem largely from the parameters and terms set forward in that work. What’s important is that the authors were able to provide a counterargument within the framework of the original paper. The work would still stand when stipulating issues in hypothesis testing, study biases, P-hacking and other potential reasons for incorrect conclusions in published work.

    • Andrew says:

      Brian:

      I see your point. From that perspective, the key mistake in Jager and Leek’s paper is the title. An appropriate title would’ve been, “A theoretical model that could have produced the observed distribution of p-values even with most published findings being true.” On the other hand, had they used that title, the paper would’ve received zero attention.

      Instead they called it, “Empirical estimates suggest most published medical research is true,” a title which I do not think is reasonable (for reasons discussed above, I think it’s a stretch to call their estimates “empirical”). The work may still stand as a response to Ioannidis’s paper but I don’t see it as any kind of estimate of the empirical correctness of scientific publications. Take away the paper’s title and some of its more dramatic claims and I’m much happier with it.

    • K? O'Rourke says:

      It is this point of Andrew’s (and the P.P.P.S.) that’s the primary concern.

      When making claims about the world (that some people might act on) there is a responsibility on the researcher for the credibility and relevance of the assumptions that are critical to the claims. And on reviewers to raise concerns about the credibility and relevance.

      It is not like in theoretical work where one can wash their hands by simply stating the assumptions clearly and reviewers should not question them but only suggest alternatives.

      But what has me much more concerned is the possibility that more than just a few statisticians think it is possible or even likely that the p-values from non-randomized studies are anywhere close to uniformly distributed. If that was true, there would be no advantage to randomising. You have to _do something_ other than make assumptions to make the (empirical) distribution of p-values uniform (given no effect).

      • Mark says:

        Agreed!

      • revo11 says:

        This confusion is related to the use of “null” being overloaded again – the cliche of “practical” vs. “statistical” significance. Under a specified null model, it is true that p-values are uniformly distributed. However, practically, a confounded effect estimate is still untrue. With confounding, one is simultaneously correct in rejecting the null and incorrect regarding the specific mechanism by which the null hypothesis is rejected. The distribution of p-values associated with “untrue” results like these is going to be highly non-uniform.

        Brad Efron tries to bridge the gap with his work on “empirical null” distributions (your example code with the BIAS parameter is actually an ideal case for those methods), but they only work under certain conditions.

        • K? O'Rourke says:

          Revo11: I would put it differently.

          > The distribution of p-values associated with “untrue” results like these is going to be highly non-uniform.

          Non-randomised studies are (almost) _always_ estimating and testing “untrue” [true effect + unknown bias] results, that is why their p-values will be very non-uniformly distributed.

          As Don Rubin pointed out (though mistakenly attributing precedence to RA Fisher) it is the physical act of randomization that makes the distribution of p-values uniform – not any (mathematical) assumption, like covariates equal in distribution.

          From a technical perspective “true effect + unknown bias” (given data usually available) makes the parameters non-identified – Sander Greenland and Paul Gustafson have written clearly on this. You can’t get anywhere without extra-study based information (i.e. priors) and untestable assumptions and I am pretty sure Brad would agree with this.

          It is also why Judea Pearl’s web site is really worth visiting if you are going to deal with non-randomized studies: you need to discern what assumptions are required to make the “true effect” identified (e.g. something you can learn about from study data). But in making empirical claims you need to argue the relevance and credibility of those assumptions.

          (Maynard Keynes took Karl Pearson to task over this a long time ago, Pearson lost)

          • revo11 says:

            I agree with these points. When I say “Under a specified null model, it is true that p-values are uniformly distributed”, I should emphasize that I don’t think those “specified null models” in non-randomized studies have much to do with reality in most cases. Identifiability in observational studies is definitely something that needs to be more widely appreciated and discussed.

            I do follow and enjoy both SG and JP’s contributions to this general discussion.

  17. Martha Smith says:

    I’ve regarded Ioannidis’s 2005 paper as a “back of the envelope” argument that helps draw attention to the many problems many of us are aware of in the use of statistics (ascertainment bias, multiple testing, publication bias, using models that don’t fit, aiming to get “significance” by hook or by crook, ignoring “practical significance,” the resulting sloppy treatment of power, sweeping uncertainty under the rug, extrapolation, etc., etc.). So I was surprised to see a paper that focused on the estimate of the rate of false positives, instead of on the many underlying problems – all the more so since the paper itself seemed to display some of these problems.

    For example, I agree with Andrew and Chris’s questioning Jager and Leek’s use of the beta distribution just because it had been reported to fit in micro-array data. In addition, use of a “theoretical model” with published p-values seems to be a form of ignoring ascertainment bias. And Andrew makes a good point in criticizing the use of “empirical” in the title.

    From the first part of the paper, I had been expecting an analysis along the lines of Efron’s empirical null distributions work that revo11 commented on. But revo11’s remark that “but they only work under certain conditions” warrants elaboration: Efron mentions that one obstacle to using the null distribution approach with gene testing is what he calls “filtration”: only testing genes (or whatever) that look promising. This impedes the ability to get an empirical null distribution, since the filtration filters out the data that would allow one to construct it. It seems that the same problem would occur in studying p-values – the ones that aren’t published are needed to establish an empirical null distribution.

    What I believe needs more focus in the discussion of the scientific enterprise is the examples where the all-too-human tendency to believe what the herd thinks blinds us to possible signals to the contrary. For example, recently we are hearing that chemotherapy may in fact prompt metastasis, and that perhaps focusing on living with stable tumors rather than trying to kill them completely might be better in the long run; that what we so readily and happily called “junk DNA” in fact is crucial in regulating the protein-coding genes; and that anti-oxidants may in fact have a deleterious effect.

    • K? O'Rourke says:

      Very nicely put.

      By the way, in meta-analysis “filtration” is referred to as selection modelling and John Copas at Warwick has a series of neat papers on it, highlighting the challenges.

    • Martha Smith says:

      Another “soft data point” I forgot to include: From an interview with Deborah Zarin, the director of ClinicalTrials.gov, in the July 7, 2011 issue of Science (p. 154):

      “Q: Were you surprised by what you have been learning from the data?
      A: I call it my introduction to the sausage factory. It appears that there are a number of practices in the world of clinical trials that I hadn’t been aware of; it surprised a lot of people. For example, researchers might say, this is a trial of 400 subjects, 200 in each arm, and when they came to report results, they would be talking about 600 people. We would ask them to explain. They would say, “We are including 200 people from this other study because we had always intended to do that.” … There were a lot of — what would I call it? — nonrigorous practices.

      Q: Were the lapses more than clerical errors?
      A: We are finding that in some cases, investigators cannot explain their trial, cannot explain their data. Many of them rely on the biostatistician, but some bio-statisticians can’t explain the trial design. So there is a disturbing sense of some trials being done with no clear intellectual leader. That may be too strong a statement, but that’s the feeling we are left with.”

      • K? O'Rourke says:

        Martha:

        On a number of occasions I got to audit RCTs (though confidentially) and I only seemed to get sausage factories (biased sample?). The problems seemed to be largely due to the biostatistician role being delegated too much responsibility while being under-supervised, their work under-documented and their employment intermittent along with a mad rush to publish anything that looks publishable. The problem being when a sloppy error leads to an interesting publishable finding the investigators can almost be totally intolerant of any further delays like careful checking or verification.

        This is coming out now, likely because more information about trials is becoming externally available.

        (I had offered to do a systematic audit of a random sample of studies funded by a given funder in 2005/6 – they seemed completely uninterested. They probably (are starting to) feel that they now _have to_)

        • Martha says:

          K:
          Yes, the types of things you mention are presumably part of the “why” behind what Zarin observed — and have been known for a while among those who have worked with clinical trials or listen to those who do (e.g., although I have never worked with clinical trials, I am aware of the problem from ASA chapter meetings, ASA publications, and a few books by people who have worked in the field). So it is good to see Zarin’s comments bringing these into the wider scientific community. But the point for the current discussion is that this kind of “soft data” needs to be taken into account in any estimate of the proportion of false claims made in medical research. The Jager-Leek paper does not do this.

  18. […] Gelman’s January 24 blog I don’t believe the paper “Empirical estimates … has an interesting discussion of a paper by Jager and Leek using modeling to get a smaller estimate […]

  19. Chris Auld says:

    Today I happened upon this old paper (“Are all economic hypotheses false?”) by Brad DeLong and Kevin Lang and thought of this thread:

    http://www.givewell.org/files/methods/De%20Long,%20Branford%20and%20Lang%201989.pdf

    They undertake the same exercise as Jager and Leek, using data from economics papers. However, instead of assuming a functional form for the distribution of p-values for false nulls, they take a semiparametric approach and impose only the condition that the density of p-values under false nulls is decreasing, and calculate bounds.
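
    A rough sketch of how a bound of this general kind can be computed, and not necessarily DeLong and Lang’s exact procedure: if p-values from true nulls are uniform and p-values from false nulls pile up near zero, then the share of p-values above a cutoff gives an upper bound on the fraction of true nulls. The function name and the cutoff `lam` are illustrative assumptions, and the argument presumes the p-values have not been selected for significance.

```python
def upper_bound_true_nulls(pvalues, lam=0.5):
    """Upper bound on the fraction of true nulls among a set of p-values.

    Under a true null, a p-value lands above `lam` with probability 1 - lam,
    so the observed share above `lam`, divided by 1 - lam, can only
    overstate the true-null fraction (false nulls also contribute a few
    large p-values). The result is capped at 1.
    """
    if not 0 <= lam < 1:
        raise ValueError("lam must be in [0, 1)")
    if not pvalues:
        raise ValueError("need at least one p-value")
    share_above = sum(p > lam for p in pvalues) / len(pvalues)
    return min(1.0, share_above / (1 - lam))


if __name__ == "__main__":
    # Made-up p-values, purely for illustration.
    example = [0.01, 0.02, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9]
    print(upper_bound_true_nulls(example, lam=0.5))
```

    With the made-up p-values in the example, two of the eight exceed 0.5, so the bound on the fraction of true nulls is 2 / (8 × 0.5) = 0.5.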

  20. K? O'Rourke says:

    Chris: You might also be interested in John Copas’ work on selection modeling of published p-values.

    http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/copas/publications

  21. ezra abrams says:

    as a, sort of, biomedical researcher, i suspect most of my colleagues would agree that there is a lot of garbage in the literature
    however, we need some bayesianism (or what i think is bayesianism)
    iirc, half of all published, peer reviewed papers get one or zero citations (the paper is cited by a later paper 1 or 2 times)
    that has to tell you something
    also, we all know that despite the rules about republishing data…it happens,
    etc

    i would also say, if i may, that there is something wrong when, in designing and performing large $$ clinical studies, the same errors get made over and over…and i would say, the blame has to go at least in part to the statisticians, for not clearly communicating

  22. ezra abrams says:

    As a scientist, I will bet you dollars to donuts that if you go and ask other scientists, they will tell you that articles in top, top journals like NEJM and so forth are highly non-representative
    They will also wonder why you do a statistical data dredge when you could do a real experiment – say, take a paper from time X, and look for papers from time Y after X that disprove X
    Another way to do this is look at the citation index results; it is hard to define the control group, but the outlier neg papers are probably false and the outlier pos papers probably true
    etc
    not clear to me how on any planet you can derive a theory that accounts for all the weird stuff between doing an experiment and getting it published
    In theory, theory and practice are the same
    In practice they aren’t (substitute “math” for theory)

  23. […] dealing with some criticisms Mr. Leek made a good point in his […]