
“A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies”

Jessica Franklin writes:

Given your interest in post-publication peer review, I thought you might be interested in our recent experience criticizing a paper published in BMJ last year by Hemkens et al. I realized that the method used for the primary analysis was biased, so we published a criticism with mathematical proof of the bias (we tried to publish in BMJ, but it was a no go). Now there has been some back and forth between the Hemkens group and us on the BMJ rapid response page, and BMJ is considering a retraction, but no action yet. I don’t really want to comment too much on the specifics, as I don’t want to escalate the tension here, but this has all been pretty interesting, at least to me.

Interesting, in part because both sides in the dispute include well-known figures in epidemiology: John Ioannidis is a coauthor on the Hemkens et al. paper, and Kenneth Rothman is a coauthor on the Franklin et al. criticism.


The story starts with the paper by Hemkens et al., who performed a meta-analysis on “16 eligible RCD studies [observational studies using ‘routinely collected data’], and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths),” and they found that the observational studies overestimated efficacy of treatments compared to the later randomized experiments.

Their message: be careful when interpreting observational studies.

One thing I wonder about, though, is how much of this is due to the time ordering of the studies. Forget for a moment about which studies are observational and which are experimental. In any case, I’d expect the first published study on a topic to show statistically significant results—otherwise it’s less likely to be published in the first place—whereas anything could happen in a follow-up. Thus, I’d expect to see earlier studies overestimate effect sizes relative to later studies, irrespective of which studies are observational and which are experimental. This is related to the time-reversal heuristic.

To put it another way: The Hemkens et al. project is itself an observational study, and in their study there is complete confounding between two predictors: (a) whether a result came from an observational study or an experiment, and (b) whether the result was published first or second. So I think it’s impossible to disentangle the predictive value of (a) and (b).
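The time-ordering argument can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration, not anything taken from Hemkens et al.: the true log odds ratio (`true_log_or`), the common standard error (`se`), and the rule that a first study is published only if it is statistically significant. Both the first study and the follow-up are drawn from the same distribution, so nothing here distinguishes observational from experimental designs.

```python
import random
import statistics


def simulate_time_ordering(n_topics=2000, true_log_or=-0.1, se=0.3, seed=1):
    """Simulate (first study, follow-up) log-odds-ratio pairs.

    Hypothetical selection rule: the first study on a topic is published
    only if |estimate / se| > 1.96; the follow-up is published regardless.
    """
    rng = random.Random(seed)
    first, followup = [], []
    while len(first) < n_topics:
        est = rng.gauss(true_log_or, se)
        if abs(est / se) > 1.96:  # significance filter on the first study
            first.append(est)
            # The follow-up comes from the very same distribution.
            followup.append(rng.gauss(true_log_or, se))
    return first, followup


first, followup = simulate_time_ordering()
mean_abs_first = statistics.mean(abs(x) for x in first)
mean_abs_followup = statistics.mean(abs(x) for x in followup)
print("mean |log OR|, first published studies:", round(mean_abs_first, 3))
print("mean |log OR|, follow-up studies:      ", round(mean_abs_followup, 3))
```

Under these assumptions the first published estimates are larger in magnitude than the follow-ups purely because of selection, even though the design of the two studies is identical.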

The criticism and the controversy

Here are the data from Hemkens et al.:

Franklin et al. expressed the following concern:

In a recent meta-analysis by Hemkens et al. (Hemkens et al. 2016), the authors compared published RCD [routinely collected data] studies and subsequent RCTs [randomized controlled trials] using the ROR, but inverted the clinical question and corresponding treatment effect estimates for all study questions where the RCD estimate was > 1, thereby ensuring that all RCD estimates indicated protective effects.

Here’s the relevant bit from Hemkens et al.:

For consistency, we inverted the RCD effect estimates where necessary so that each RCD study indicated an odds ratio less than 1 (that is, swapping the study groups so that the first study group has lower mortality risk than the second).

So, yeah, that’s what they did.

On one hand, I can see where Hemkens et al. were coming from. To the extent that the original studies purported to be definitive, it makes sense to code them in the same direction, so that you’re asking how the replications compared to what was expected.

On the other hand, Franklin et al. have a point, that in the absence of any differences, the procedure of flipping all initial estimates to have odds ratios less than 1 will bias the estimate of the difference.
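Here is a minimal sketch of that bias under the null, written as an illustration rather than a reproduction of either paper's analysis. It assumes that the RCD and RCT estimates are independent, unbiased, normally distributed log odds ratios with a true effect of zero, that the ROR is the RCT odds ratio divided by the RCD odds ratio, and that pooling is a plain unweighted mean of log RORs (Hemkens et al. used a weighted procedure). The function name and parameters are hypothetical.

```python
import math
import random
import statistics


def mean_log_ror(n_pairs=20000, se_rcd=0.3, se_rct=0.3, invert=True, seed=2):
    """Average log(ROR) = log(OR_RCT) - log(OR_RCD) when nothing differs.

    Both estimates are centered on a true log odds ratio of 0, so any
    nonzero average is produced by the coding rule alone.
    """
    rng = random.Random(seed)
    log_rors = []
    for _ in range(n_pairs):
        rcd = rng.gauss(0.0, se_rcd)
        rct = rng.gauss(0.0, se_rct)
        if invert and rcd > 0:
            # Inversion rule: swap the study groups so the RCD odds ratio
            # is below 1; the RCT estimate is flipped along with it.
            rcd, rct = -rcd, -rct
        log_rors.append(rct - rcd)
    return statistics.mean(log_rors)


with_inversion = mean_log_ror(invert=True)
without_inversion = mean_log_ror(invert=False)
print("mean log ROR with inversion:   ", round(with_inversion, 3))
print("mean log ROR without inversion:", round(without_inversion, 3))
print("implied ROR with inversion:    ", round(math.exp(with_inversion), 2))
```

With the inversion, the RCD log odds ratio is always negative while the flipped RCT estimate still averages zero, so the mean log ROR is pushed above zero by roughly the mean absolute RCD estimate (for a normal estimate, `se_rcd * sqrt(2 / pi)`). Without the inversion the average stays near zero.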

Beyond this, the above graph shows a high level of noise in the comparisons, as some of the follow-up randomized trials have standard errors that are essentially infinite. (What do you say about an estimated odds ratio that can be anywhere from 0.2 to 5?) Hemkens et al. appear to be using some sort of weighting procedure, but the relevant point here is that only a few of these studies have enough data to tell us anything at all.

My take on these papers

The above figure tells the story: The 16 observational studies appear to show a strong correlation between standard error and estimated effect size. This makes sense. Go, for example, to the bottom of the graph: I don’t know anything about Hahn 2010, Fonarow 2008, Moss 2003, Kim 2008, and Cabell 2005, but all these studies are estimated to cut mortality by 50% or more, which seems like a lot, especially considering the big standard errors. It’s no surprise that these big estimates fail to reappear under independent replication. Indeed, as noted above, I’d expect that big estimates from randomized experiments would also generally fail to reappear under independent replication.

Franklin et al. raise a valid criticism: Even if there is no effect at all, the method used by Hemkens et al. will create the appearance of an effect: in short, the Hemkens et al. estimate is indeed biased.

Put it all together, and I think that the sort of meta-analysis performed by Hemkens et al. is potentially valuable, but maybe it would’ve been enough for them to stop with the graph on the left in the above image. It’s not clear that anything is gained from their averaging; also there’s complete confounding in their data between timing (which of the two studies came first) and mode (observational or experimental).

The discussion

Here are some juicy bits from the online discussion at the BMJ site:

02 August 2017, José G Merino, US Research Editor, The BMJ:

Last August, a group led by Jessica Franklin submitted to us a criticism of the methods used by the authors of this paper, calling into question some of the assumptions and conclusion reached by Lars Hemkens and his team. We invited Franklin and colleagues to submit their comments as a rapid response rather than as a separate paper but they declined and instead published the paper in Epidemiological Methods (Epidem Meth 2-17;20160018, DOI 10.1515/em-2016-0018.) We would like to alert the BMJ’s readers about the paper, which can be found here:

We asked Hemkens and his colleagues to submit a response to the criticism. That report is undergoing statistical review at The BMJ. We will post the response shortly.

14 September 2017, Lars G Hemkens, senior researcher, Despina G Contopoulos-Ioannidis, John P A Ioannidis:

The arguments and analyses of Franklin et al. [1] are flawed and misleading. . . . It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is also essential that there must be a rule for consistent coining of the direction of comparisons. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds ratio method doesn’t invalidate the approach in any way. . . . We took in our study the perspective of clinicians facing new evidence, having no randomized trials, and having to decide whether they use a new promising treatment. In this situation, a treatment would be seen as promising when there are indications for beneficial effects in the RCD-study, which we defined as having better survival than the comparator (that is a OR < 1 for mortality in the RCD-study) . . . it is the only reasonable and useful selection rule in real life . . . The theoretical simulation of Franklin et al. to make all relative risk estimates <1 in RCTs makes no sense in real life and is without any relevance for patient care or health-care decision making. . . . Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously making it impossible to clearly determine which one is the first (Gnerlich 2007). Franklin et al. selected the data which better fit to their claim. . . .

21 September 2017, Susan Gruber, Biostatistician:

The rapid response of Hemkens, Contopoulos-Ioannidis, and Ioannidis overlooks the fact that a metric of comparison can be systematic, transparent, replicable, and also wrong. Franklin et. al. clearly explains and demonstrates that inverting the OR based on RCD study result (or on the RCT result) yields a misleading statistic. . . .

02 October 2017, Jessica M. Franklin, Assistant Professor of Medicine, Sara Dejene, Krista F. Huybrechts, Shirley V. Wang, Martin Kulldorff, and Kenneth J. Rothman:

In a recent paper [1], we provided mathematical proof that the inversion rule used in the analysis of Hemkens et al. [2] results in positive bias of the pooled relative odds ratio . . . In their response, Hemkens et al [3] do not address this core statistical problem with their analysis. . . .

We applaud the transparency with which Hemkens et al reported their analyses, which allowed us to replicate their findings independently as well as to illustrate the inherent bias in their statistical method. Our paper was originally submitted to BMJ, as recently revealed by a journal editor [4], and it was reviewed there by two prominent biostatisticians and an epidemiologist. All three reviewers recognized that we had described a fundamental flaw in the statistical approach invented and used by Hemkens et al. We believe that everyone makes mistakes, and acknowledging an honest mistake is a badge of honor. Thus, based on our paper and those three reviews, we expected Hemkens et al. and the journal editors simply to acknowledge the problem and to retract the paper. Their reaction to date is disappointing.

13 November 2017, José G Merino, US Research Editor, Elizabeth Loder, Head of Research, The BMJ:

We acknowledge receipt of this letter that includes a request for retraction of the paper. We take this request very seriously. Before we make a decision on this request, we -The BMJ’s editors and statisticians – are reviewing all the available information. We hope to reach a decision that will maintain the integrity of the scientific literature, acknowledge legitimate differences of opinion about the methods used in the analysis of data, and is fair to all the participants in the debate. We will post a rapid response once we make a decision on this issue.

The discussion also includes contributions from others on unrelated aspects of the problem; here I’m focusing on the Franklin et al. critique and the Hemkens et al. paper.

Good on ya, BMJ

I love how the BMJ is handling this. The discussion is completely open, and the journal editor is completely non-judgmental. All so much better than my recent experience with the Association for Psychological Science, where the journal editor brushed me off in a polite but content-free way, and then the chair of the journal’s publication board followed up with some gratuitous rudeness. The BMJ is doing it right, and the psychology society has a few things to learn from them.

Also, just to make my position on this clear: I don’t see why anyone would think the Hemkens et al. paper should be retracted; a link to the criticisms would seem to be enough.

P.S. Franklin adds:

Just last week I got an email from someone who thought that our conclusion in our Epi Methods paper that use of the pooled ROR without inversion is “just as flawed” was too strong. I think they are right, so we will now be preparing a correction to our paper to modify this statement. So the circle of post-publication peer review continues…

Yes, exactly!


  1. Fantastic, It’s a subject that I have been exploring myself. thank you.

  2. “I love how the BMJ is handling this. The discussion is completely open,(…).”

    Yup. It is even no problem for BMJ to publish a ‘rapid response’ in which I refer to a highly controversial paper in another journal in which I reflect on the acting of BMJ in regard to issues around the unavailability of the ICMJE form of Elizabeth Moylan, the guarantor of the original study.

    See (‘A new paper about the unavailability of the ICMJE disclosure form of the guarantor of this paper’).

  3. Tom Passin says:

    “Beyond this, the above graph shows a high level of noise in the comparisons, as some of the follow-up randomized trials have standard errors that are essentially infinite”.

    I notice that in every case where the error bars of the randomized trial are smaller than for the RCD, the odds ratio of the randomized trial is also closer to unity. That’s pretty interesting in itself.

  4. Simon Gates says:

    Agree that it’s good that the discussion is all open, but should the BMJ be more active here and take a view one way or the other? And maybe retract the paper? I’m not sure what I think – on the one hand, the error seems pretty clear and the authors don’t seem to be taking it well (something we all need to get better at), on the other, maybe the authors should be the ones to decide if they want their names attached to a mistake for ever?

    Reminds me very much of the arguments about this paper a few years ago (which didn’t lead to a retraction I think) – and it still gets cited despite the errors (it was used in a conference presentation that I saw this year, without comment).
    So there are risks with allowing it to stay out there uncorrected.

  5. Anders says:

    I have written a draft response to this discussion, in which I attempt to construct a meaningful interpretation of a minor variation of Hemkens analysis. At this stage, I feel 75% confident in my line of reasoning. If anyone wants to give me feedback before I submit it as a rapid response, I would very much appreciate it. If anyone here has time to look at this, could you please contact me at ahuitfel at gmail dot com ?

  6. Anders says:

    Actually, I will just link to my draft response here:

    Caveat emptor: This is a work in progress, and I would very much appreciate feedback. Feedback can be sent either anonymously through , or by email at ahuitfel at gmail dot com. I invoke Crocker’s Rules for all feedback on this draft.

  7. bjs12 says:

    I’m surprised there aren’t more comments on this one. Is everyone breaking for Thanksgiving already?

    It is not obvious to me that Hemkens, et al. are wrong. I think it depends on how the ROR is calculated in the presence of the inversion. Franklin, et al. use the following example in their Section 3: “one clinical question compared CABG to Stent and the associated ROR was 2.08, indicating over-estimation of the relative effect. However, if the authors had instead reported Stent vs CABG, the ROR would instead be 0.48, indicating under-estimation of the effect.” [quote lightly edited for brevity]

    (A small point: the value of 2.08 has been pulled from the wrong column, representing ‘weight’ rather than ‘ROR’. The correct number is 1.69 and its inverse is 0.59. Using the correct numbers below.)

    I’m not convinced this is true. Since the RCD occurs first, one treatment exits this study as the presumed ‘better’ choice. I can’t tell from either Hemkens or Franklin which direction the example points, so for sake of argument let’s specify that CABG turned out better than Stent in the RCD. Reading from Figure 4 Hemkens (or Figure 2 Franklin) it looks like OR=0.7. Then, the RCT occurs (presumably, in an effort to ‘confirm’ the outcome of the earlier RCD). In this example, it turns out that CABG does worse than Stent in the RCT, OR=1.2 (again, eyeballing figures). The ROR is calculated as 1.2/0.7 = 1.69.

    Hemkens is using ROR as evidence for or against the candidate treatment of the RCD. So, ROR>1 is indicative of the RCT arriving at an answer ‘to the right’ of the estimate from the RCD. That is, given this directionality, RCT OR estimates of 0.6, 0.5, etc. would support the RCD conclusion of ‘CABG better than Stent’ and result in ROR<1.

    Now, invert the ORs to report Stent vs CABG…. In the RCD, Stent did worse than CABG, OR = 1/0.7 = 1.4. In the RCT, Stent did better than CABG, OR = 1/1.2 = 0.8. How would the ROR be calculated in this situation? Franklin posits (see quote above) that ROR will be 0.8/1.4 = 1/1.69 = 0.59. I don’t think that is right. With the direction inverted, evidence against the candidate treatment of the RCD is now found ‘to the left’ of the estimate from the RCD. RCT OR estimates of 1.5, 1.6, etc. would support the RCD conclusion of ‘Stent worse than CABG’ and should result in ROR<1. In this interpretation, the ROR would be 1.4/0.8 and remain 1.69, the same as when the ORs had opposite direction.

    Am I right about how ROR is calculated? This seems like a question that Hemkens can just answer. If I’m right, does this answer the challenge of Franklin? Not sure. But think of the ‘no effect’ situation (as Franklin does). If there is no effect (i.e., no bias between RCD and RCT treatment estimates), then there should be balance between >1 RORs for some clinical questions and <1 RORs for others. I think this is consistent with my formulation, but would appreciate confirmation/refutation.

    Franklin also makes another point, in Section 2, that I think is different than this issue above. Franklin states that, "a ROR 1.” I confess that I don’t follow their argument for why this would be true. It seems, at least in part, to depend on Figure 1, though. And Figure 1 appears problematic to me. It purports to show distributions of ORs; however, they look like normal distributions to me. OR distributions must be right skew, of course. If the symmetry of the distribution is a key point of their argument, then it should be revisited. If not, the distributions should at least be re-plotted to avoid confusion. And, finally, which point (the one I spent paragraphs on or the skew one from this paragraph) drives their conclusion about the problem with Hemkens’ work? Or is it both complementing each other? Or what?

    • Andrew says:


      I don’t think the analysis of Hemkens et al. is necessarily “wrong,” but I do think Franklin et al. are right that the Hemkens et al. estimate is biased. In any case, I don’t think much can be learned from these data given the confounding and variation discussed in my post.

      • bjs12 says:


        Hmm. You’re right — it doesn’t seem that ‘wrong’ is the proper word. I guess I remain unconvinced that Hemkens’ analysis is biased. I think if ROR is calculated consistently with respect to the directionality that does (or doesn’t) support the treatment favored by the RCD, then it won’t be biased. It is distinctly possible that it isn’t worth it for anyone to expend the energy to convince me otherwise, however, because as you point out there may not be much to learn here overall. Although… meta-analyses are common, RCDs are common, RCTs are common, it is common to believe that RCTs give a ‘better’ answer than RCDs, etc. Maybe it is worth it to understand this method better for future use (or non-use).

        Lastly, some of my less-than symbols were eaten in my first post, taking other text with them — I’m generally a more coherent writer than some of those paragraphs suggest.


  8. Haven’t Hemkens et al.’s findings been validated by other tools that you consider unbiased?

    • Andrew says:


      What findings are you talking about? The abstract concludes with this: “Overall, RCD studies showed significantly more favorable mortality estimates by 31% than subsequent trials (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65; I²=0%)).” I’m not quite sure what it would mean for this statement to be validated, as I expect any such claim would depend on context. I have no reason to believe that observational studies in general give more favorable mortality estimates by 31%.

      I’m not saying the Hemkens et al. paper is bad, necessarily; it’s just not clear to me how I should try to interpret this estimate or the claim of statistical significance.

      • My apologies, I meant the full range of assumptions that were in Hemkens et al. and in the BMJ Rapid Responses. You also threw out somewhat generalized observations, pointing to biases. I’ll gather them all. I am not a statistician. However, nearly everyone is making generalized comments that are not as specific as they can be.

  9. What I really do not understand is that obsession with retraction, as well as what is almost akin to a Messiah complex of researchers assuming they spotted errors where it is considered failure unless you got the original paper retracted. I do not mean the cases of misconduct or lack of data integrity etc, many of which we have seen in psychology research. But cases where there is this discussion about the analysis and its impact on the conclusions: what benefit would there be from a retraction? Except that it would exclude all of us from this conversation and possibly individuals in the future, who would have less of a chance to come across such exchanges. This applies to this example, several mentioned in the comments and several others.
    As much as I try to see genuine and not self-interested motives behind it, I cannot. What I do see, in a lot of cases, is an almost obsessive desire to purge the literature. I think many of us learn more from these discussions and reading different discussions of the methodologies and how these could impact conclusions than from the tidying up of the scientific record. Again, I mean papers where there is no suspicion of misconduct and lack of data integrity.
    As someone who has done my very modest share of re-analyses leading to different conclusions than the original authors, it never occurred to me I should have wanted a retraction. Particularly in cases such as the one above, where as this post discusses, it’s all observational and as such highly uncertain data.

    • Andrew says:


      I think that to say that people asking to retract articles have a “self-interested . . . almost obsessive desire to purge the literature” makes about as much sense as saying that people asking to publish have a “self-interested . . . almost obsessive desire to add to the literature.” I don’t see what’s gained from this sort of general claim.

      Regarding this specific case, I have no idea who is calling for that paper to be retracted. As I wrote above, I don’t think it should be retracted, at least not based on the information available to me.

      Speaking more generally, I wrote “No Retractions, Only Corrections: A manifesto.” and I agree with you that open discussion and correction is better than retraction.

      • Keith O'Rourke says:

        Also more well motivated engagement by authors being criticized (and then critics responding to that).

        It _should_ be an opportunity to learn how arguments could have been better given that would have avoided the criticism, what those making the criticism lack adequate knowledge of, what further developments could have been given and just maybe some misunderstandings and even errors actually in the paper.

        What one often sees is the authors’ rebuking the perceived underlying claim of the critics of being smart and knowing important things with often authoritative proclamations of – those things you are claiming are wrong or not that important and you are not at all very smart.

        That is, obsessing on minimizing any loss of prestige or even trying to gain some in the exchange.

        • Andrew’s blog has been serving that function, fortunately. But for those of us that are not statisticians, but fairly good logically, we try to make sense of the uses of terminology in specific contexts.

        • Keith, Hemkens et al.’s responses in BMJ’s Rapid Response feature were lacking what, specifically?

          • Keith O'Rourke says:

            The “What one often sees” was meant to be general – I only glanced at some of the back and forth but it looked like the usual. As an aside, I suspect the prospect of learning anything substantive (economy of research) will be less than even Andrew thinks.

            For a specific, there is this exchange

            For instance anyone who has any familiarity with what goes on in drug regulatory agencies would see the second sentence here as ultra naive –
            “Regulators may catch some of this, but regulatory agencies are typically understaffed and extreme selective reporting gets through despite even valiant regulatory efforts. Therefore, it may not be that uncommon for one to end up with situations that are equivalent of having 2 significant results out of 20 trials or 20 (or even many more) analyses of outcomes.”

            I believe it best when engaging with critics to pretend they might have a point – no matter how hard or ridiculous that pretending is – so that you can discover how you might actually be somewhat wrong about something.

            • Ioana Cristea says:

              I think a fair characterization should have also included some relevant quotes from the initial critical comments. It’s also extremely naive to say the very least to tout equivalence testing or equivalence trials as a solution for the already extremely low bar of drug regulation. I find myself more often than not on the side of the critics, but I disagree the presumption of ‘being right’ is by default on that side. Frequently both sides gave valid points and what one side considered a ‘mistake’, the other saw as an acceptable ‘limitation’/trade-off.

              • Keith O'Rourke says:

                > critics, but I disagree the presumption of ‘being right’ is by default on that side.
                In my first comment in this thread, in my opportunity to learn comment, 3 of the 5 were the critics learning they were wrong and only 2 of the 5 were the authors learning they were wrong. Now I perhaps should have been explicit about how critics could also be contributing to poor engagement by having a presumption of ‘being right’ and even asking for retraction.

                > one side considered a ‘mistake’, the other saw as an acceptable ‘limitation’/trade-off.
                The specific example I gave is about a factual matter – what usually is done in regulatory review of drugs (in particular at the FDA). Having worked with the FDA and at other agencies in the past I know it’s rare for published papers to be considered other than in a purely supportive role. That was what I meant by the “ultra naive” sense but that was poor wording. I should have said it was a factual matter that could be answered by those with actual experience.

                Now John Ioannidis had one of the same mentors as I did (we’re not enemies) and I am hoping someone has or will point out what actually happens in regulatory reviews – but an opportunity was missed to learn that earlier.

                > very least to tout equivalence testing or equivalence trials as a solution
                I am not aware of saying anything about that.

              • Keith, I cannot reply to your comment for some reason. Just to clarify: I did not say you said that. But you posted from a critical exchange which I went and read and this was one core argument the critic was making. My point was that your comment did, in my view, give a slanted view about the authors with statements that based on your knowledge you deem wrong. I was underscoring that so did the critic in this specific example.
                I only mean this as support for my initial point which is that this panicked rush whenever somebody thinks they spotted an error and this unreasonable expectation that something serious needs to happen (which, yes, is pretty religious). We say science is all about disagreement, but it seems to me we sure do not embrace disagreement (I mean this on both sides).
                Twitter and the social media really do not help. The comments on this very post on one of the largest facebook groups focused on, one would have hoped, issues of open science, were somewhat pitiful.
                Personally, I think I disagree with you in that I like that the authors and the critics continue to disagree and not necessarily come together to some compromise. It forces them to come with arguments they would not have found otherwise and which are interesting for the rest of us to read and think about. I did academic debate and that is what we did. (Of course if it gets repetitive that’s no longer true, also when one side is not offering content etc). To stay to this example, the exchange in the BMJ rapid responses is richer than the papers.

              • Keith O'Rourke says:


                By engagement, I meant profitable debate whereas academic debate often seems to just be mostly prestige hoarding and broadcasting. Any actual exchange is a mix.

                > which are interesting for the rest of us to read and think about.
                Totally agree with that – perhaps the worst of all would be behind-the-scenes compromising – so very important the profitable debate happens in public.

            • Keith, are you referring to the specific proposition: “Therefore, it may not be that uncommon for one to end up with situations that are equivalent of having 2 significant results out of 20 trials or 20 (or even many more) analyses of outcomes.” Is it not an evidentially supported inference in the context in which it is suggested? Or even more generally?

              • Keith O'Rourke says:

                Please see the comment to Ioana; it’s about what usually is done in a regulatory agency such as the FDA.

      • Yes, but which ‘bias’ or ‘biases’ do you think are undermining the Hemkens paper? Put another way, what would you subtract or add to strengthen the paper?

    • Ioana, thank you. Not only an obsessive need to purge the literature but being glib and pejorative of such outstanding talent.

    • Anders says:

      Ioana, while I am not yet convinced that this paper needs to be retracted, I have been on the other side of similar arguments, and so might be able to offer some insight in the psychology behind requests for retraction.

      When I write a scientific paper, I see this as a contribution to a conversation that attempts to construct an intellectual edifice, a consensus around a web of interlinked ideas upon which future knowledge may be built. If I believe that someone has introduced an element in the literature which is incoherent with the ideas that I am building on, the foundations become shaky and it becomes almost impossible to continue building. If an invalid paper stands uncontested, I will have to expect incoherent ideas to impact on the quality of discussion around my future contributions in the same domain of scientific inquiry.

      For example, last year, someone wrote a paper that proposed a new interpretation of odds ratios, as “risk ratios conditional on treatment and control resulting in different outcomes”. I believe I have shown conclusively that this interpretation is invalid. Their framework is certainly not compatible with my own ideas about related issues. I have written about this at .

      From my perspective, I would very much like to see this paper retracted. It does not feel as if this is motivated by a messiah complex or by a desire to punish the authors. Rather, it is important for me that there is a clear signal from the scientific community that these ideas are wrong, that they are not part of the intellectual edifice upon which future contributions are to be built.

      As long as we insist on keeping an outdated, print-based publishing model as the platform for scientific discussion, where being “published” is the only widely available information about whether an idea is currently part of the scientific consensus, we end up in an unfortunate position where retraction is the only credible signal that the scientific community can send about whether future researchers should consider the idea as being part of the intellectual canon, part of what future work must strive for consistency with.

      Keeping a record of discussion that followed from an invalid idea may be very useful. In an ideal world, one would find a way to credibly send a signal that according to the current consensus the ideas are wrong, but allow the discussion that led to this conclusion to stand as part of the scientific record. Really, being “wrong” should not be interpreted as being shameful, the willingness to risk being wrong is fundamental to any real scientist. In my view, if someone writes an invalid paper that is persuasive enough to be published, and someone else later points out why it is wrong, then all sides have contributed positively to the accumulation of knowledge. They should be recognized for this. However, I am not sure we can get to this ideal world without an ab initio reboot of the scientific publishing model.

      (Of course, there is an obvious “Caveat Hanson” to be noted here; I cannot really trust what my brain tells me about my motivations. That said, this explanation feels as if it might at least be plausible)

      • Keith O'Rourke says:


        I think the down side here is that, similar to multiple people converging on the same new idea/invention, the interpretation that is invalid will again be happened upon by someone else and published again. On the other hand, a correction can make the invalid interpretation more widely known and prevent it from being re-published elsewhere.

        The opposite problem I ran into was the Cochrane Collaboration’s Statistical Methods Group refusing to provide a link to my DPhil thesis (about 7 years ago) as it was not peer review published.

        Unfortunately, it was the only new work done in the stalled problem of dealing analytically rather than approximately with mixed summaries and likelihood methods more generally. So over the years as I notice related publications (as should be expected) I email the thesis to the authors and most of the time get a reply that they wished they had been aware of the material. Peer review publishing leaves a lot to be desired!

        Thesis is here

  10. Elin says:

    I was wondering if it makes sense to think about time ordering more (it would be cool to actually have data on this). First, if the first publication is likely to be that an effect exists (which I buy), would it make sense that after that, it is the failure to replicate that becomes the interesting result that will get published? Except that some other papers may be in the pipeline anyway. I’m also thinking about when publication actually happens, when a paper is presented at a conference, when it is a working paper, when it appears in press etc. It reminds me about some research on the impact of legal changes, do you date the changes from when the law is passed, from the date it goes into effect, from when the people it impacts first hear about it or something else. (In my observation some of this is rationalizing not finding any deterrent impact of changing laws but that’s cynical.) Still, even if time ordering is not so easy to determine I wonder if there is a kind of reverse file drawer issue at a certain point (and maybe it’s related to Arrow’s Other Theorem).
