Replication backlash

Raghuveer Parthasarathy pointed me to an article in Nature by Mina Bissell, who writes, “The push to replicate findings could shelve promising research and unfairly damage the reputations of careful, meticulous scientists.”

I can see where she’s coming from: if you work hard day after day in the lab, it’s gotta be a bit frustrating to find all your work questioned, for the frauds of the Dr. Anil Pottis and Diederik Stapels to be treated as a reason for everyone else’s work to be considered guilty until proven innocent.

That said, I pretty much disagree with Bissell’s article, and really the best thing I can say about it is that I think it’s a good sign that the push for replication is so strong that now there’s a backlash against it. Traditionally, leading scientists have been able to simply ignore the push for replication. If they are feeling that the replication movement is strong enough that they need to fight it, that to me is good news.

I’ll explain a bit in the context of Bissell’s article. She writes:

Articles in both the scientific and popular press have addressed how frequently biologists are unable to repeat each other’s experiments, even when using the same materials and methods. But I am concerned about the latest drive by some in biology to have results replicated by an independent, self-appointed entity that will charge for the service. The US National Institutes of Health is considering making validation routine for certain types of experiments, including the basic science that leads to clinical trials.

But, as she points out, such replications will be costly. As she puts it:

Isn’t reproducibility the bedrock of the scientific process? Yes, up to a point. But it is sometimes much easier not to replicate than to replicate studies, because the techniques and reagents are sophisticated, time-consuming and difficult to master. In the past ten years, every paper published on which I have been senior author has taken between four and six years to complete, and at times much longer. People in my lab often need months — if not a year — to replicate some of the experiments we have done . . .

So, yes, if we require everything to be replicated, it will reduce the resources that are available to do new research.

Replication is always a concern when dealing with systems as complex as the three-dimensional cell cultures routinely used in my lab. But with time and careful consideration of experimental conditions, they [Bissell's students and postdocs], and others, have always managed to replicate our previous data.

If all science were like Bissell’s, I guess we’d be in great shape. In fact, given her track record, perhaps we could give some sort of lifetime seal of approval to the work in her lab, and agree in the future to trust all her data without need for replication.

The problem is that there appear to be labs without 100% successful replication rates. Not just fraud (although, yes, that does exist); and not just people cutting corners, for example, improperly excluding cases in a clinical trial (although, yes, that does exist); and not just selection bias and measurement error (although, yes, these do exist too); but just the usual story of results that don’t hold up under replication, perhaps because the published results just happened to stand out in an initial dataset (as Vul et al. pointed out in the context of imaging studies in neuroscience) or because certain effects are variable and appear in some settings and not in others. Lots of reasons. In any case, replications do fail, even with time and careful consideration of experimental conditions. In that sense, Bissell indeed has to pay for the sins of others, but I think that’s inevitable: in any system that is less than 100% perfect, some effort ends up being spent on checking things that, retrospectively, turned out to be ok.
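
The "stood out in an initial dataset" mechanism can be sketched with a small simulation. This is only an illustration, not anything taken from Bissell's article or from Vul et al.: the true effect, the standard error, and the rule that a result is published only if it is statistically significant are all assumptions chosen for the example, and `simulate_selection` is a hypothetical helper name.

```python
import random
import statistics


def simulate_selection(true_effect=0.1, se=0.5, n_studies=10000, seed=0):
    """Simulate noisy studies of a small true effect, 'publishing' only
    the statistically significant ones, then replicate each published
    study once under identical conditions.

    All numbers here are illustrative assumptions, not estimates from
    any real literature.
    """
    rng = random.Random(seed)
    threshold = 1.96 * se  # two-sided 5% significance cutoff on the estimate

    published = []
    replicated = 0
    for _ in range(n_studies):
        original = rng.gauss(true_effect, se)
        if abs(original) <= threshold:
            continue  # non-significant results stay in the file drawer
        published.append(original)

        # An honest, exact replication: same true effect, same noise level.
        replication = rng.gauss(true_effect, se)
        same_sign = (replication > 0) == (original > 0)
        if abs(replication) > threshold and same_sign:
            replicated += 1

    return {
        "true_effect": true_effect,
        "n_published": len(published),
        "mean_abs_published": statistics.mean(abs(x) for x in published),
        "replication_rate": replicated / len(published),
    }


if __name__ == "__main__":
    for key, value in simulate_selection().items():
        print(f"{key}: {value}")
```

Under these made-up settings, every published estimate is much larger in magnitude than the true effect (it has to be, to clear the significance cutoff), and only a small fraction of the exact replications come out significant in the same direction, with no fraud or corner-cutting anywhere in the process.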

Later on, Bissell writes:

The right thing to do as a replicator of someone else’s findings is to consult the original authors thoughtfully. If e-mails and phone calls don’t solve the problems in replication, ask either to go to the original lab to reproduce the data together, or invite someone from their lab to come to yours. Of course replicators must pay for all this, but it is a small price in relation to the time one will save, or the suffering one might otherwise cause by declaring a finding irreproducible.

Hmmmm . . . maybe . . . but maybe a simpler approach would be for the authors of the article to describe their methods clearly (with videos, for example, if that is necessary to demonstrate details of lab procedure) in the public record.

After all, a central purpose of scientific publication is to communicate with other scientists. If your published material is not clear—if a paper can’t be replicated without emails, phone calls, and a lab visit—this seems like a problem to me! If outsiders can’t replicate the exact study you’ve reported, they could well have trouble using your results in future research. To put it another way, if certain findings are hard to get, requiring lots of lab technique that is nowhere published—and I accept that this is just the way things can be in modern biology—then these findings won’t necessarily apply in future work, and this seems like a serious concern.

To me, the solution is not to require e-mails, phone calls, and lab visits—which, really, would be needed not just for potential replicators but for anyone doing further research in the field—but rather to expand the idea of “publication” to go beyond the current standard telegraphic description of methods and results, and beyond the current standard supplementary material (which is not typically a set of information allowing you to replicate the study; rather, it’s extra analyses needed to placate the journal referees), to include a full description of methods and data, including videos and as much raw data as is possible (with some scrambling if human subjects is an issue). No limits—whatever it takes! This isn’t about replication or about pesky reporting requirements, it’s about science. If you publish a result, you should want others to be able to use it.

Of course, I think replicators should act in good faith. If certain aspects of a study are standard practice and have been published elsewhere, maybe they don’t need to be described in detail in the paper or the supplementary material; a reference to the literature could be enough. Indeed, to the extent that full descriptions of research methods are required, this will make life easier for people to describe their setups in future papers.

Bissell points out that describing research methods isn’t always easy:

Twenty years ago . . . Biologists were using relatively simple tools and materials, such as pre-made media and embryonic fibroblasts from chickens and mice. The techniques available were inexpensive and easy to learn, thus most experiments would have been fairly easy to double-check. But today, biologists use large data sets, engineered animals and complex culture models . . . Many scientists use epithelial cell lines that are exquisitely sensitive. The slightest shift in their microenvironment can alter the results — something a newcomer might not spot. It is common for even a seasoned scientist to struggle with cell lines and culture conditions, and unknowingly introduce changes that will make it seem that a study cannot be reproduced. . . .

If the microenvironment is important, record as much of it as you can for the publication! Again, if it really takes a year for a study to be reproduced, if your finding is that fragile, this is something that researchers should know about right away from reading the article.

Bissell gives an example of “a non-malignant human breast cell line that is now used by many for three-dimensional experiments”:

A collaborator noticed that her group could not reproduce its own data convincingly when using cells from a cell bank. She had obtained the original cells from another investigator. And they had been cultured under conditions in which they had drifted. Rather than despairing, the group analysed the reasons behind the differences and identified crucial changes in cell-cycle regulation in the drifted cells. This finding led to an exciting, new interpretation of the data that were subsequently published.

That’s great! And that’s why it’s good to publish all the information necessary so that a study can be replicated. That way, this sort of exciting research could be done all the time.

Costs and benefits

The other issue that Bissell is (implicitly) raising is a cost-benefit calculation. When she writes of the suffering caused by declaring a finding irreproducible, I assume that ultimately she’s talking about a patient who will get sick or even die because some potential treatment never gets developed or never becomes available because some promising bit of research got dinged. On the other hand, when research is published in a top journal but does not hold up, this can waste thousands of hours of researchers’ time, spending resources that otherwise could have been used on productive research.

Indeed, even when we talk about reporting requirements, we are really talking about tradeoffs. Clearly writing up one’s experimental protocol (and maybe including a YouTube video) and setting up data in archival form takes work; it represents time and effort that could otherwise be spent on research (or even on internal replication). On the other hand, when methods and data are not clearly set out in the public record, this can result in wasted effort by lots of other labs, following false leads as they try to figure out exactly how the experiment was done.

I can’t be sure, but my guess is that, for important, high-profile research, on balance it’s a benefit to put all the details in the public record. Sure, that takes some effort by the originating lab, but it might save lots more effort for each of dozens of other labs that are trying to move forward from the published finding.

Here’s an example. Bissell writes:

When researchers at Amgen, a pharmaceutical company in Thousand Oaks, California, failed to replicate many important studies in preclinical cancer research, they tried to contact the authors and exchange materials. They could confirm only 11% of the papers. I think that if more biotech companies had the patience to send someone to the original labs, perhaps the percentage of reproducibility would be much higher.

I worry about this. If people can’t replicate a published result, what are we supposed to make of it? If the result is so fragile that it only works under some conditions that have never been written down, what is the scientific community supposed to do with it?

And there’s this:

It is true that, in some cases, no matter how meticulous one is, some papers do not hold up. But if the steps above are taken and the research still cannot be reproduced, then these non-valid findings will eventually be weeded out naturally when other careful scientists repeatedly fail to reproduce them. But sooner or later, the paper should be withdrawn from the literature by its authors.

Yeah, right. Tell it to Daryl Bem.

What happened?

I think that where Bissell went wrong is by thinking of replication in a defensive way, and thinking of the result being to “damage the reputations of careful, meticulous scientists.” Instead, I recommend she take a forward-looking view, and think of replicability as a way of moving science forward faster. If other researchers can’t replicate what you did, they might well have problems extending your results. The easier you make it for them to replicate, indeed the more replications that people have done of your work, the more they will be able, and motivated, to carry on the torch.

Nothing magic about publication

Bissell seems to be saying that if a biology paper is published, it should be treated as correct, even if outsiders can’t replicate it, all the way until the non-replicators “consult the original authors thoughtfully,” send emails and phone calls, and “either to go to the original lab to reproduce the data together, or invite someone from their lab to come to yours.” After all of this, if the results still don’t hold up, they can be “weeded out naturally from the literature”—but, even then, only after other scientists “repeatedly fail to reproduce them.”

This seems pretty clear: you need multiple failed replications, each involving thoughtful conversation, email, phone, and a physical lab visit. Until then, you treat the published claim as true.

OK, fine. Suppose we accept this principle. How, then, do we treat an unpublished paper? Suppose someone with a Ph.D. in biology posts a paper on Arxiv (or whatever is the biology equivalent), and it can’t be replicated? Is it ok to question the original paper, to treat it as only provisional, to label it as unreplicated? That’s ok, right? I mean, you can’t just post something on the web and automatically get the benefit of the doubt that you didn’t make any mistakes. Ph.D.’s make errors all the time (just like everyone else).

Now we can engage in some salami slicing. According to Bissell (as I interpret her), if you publish an article in Cell or some top journal like that, you get the benefit of the doubt and your claims get treated as correct until there are multiple costly, failed replications. But if you post a paper on your website, all you’ve done is make a claim. Now suppose you publish in a middling journal, say, the Journal of Theoretical Biology. Does that give you the benefit of the doubt? What about Nature Neuroscience? PNAS? Plos-One? I think you get my point. A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time. Sure, approval by 3 referees or 6 referees or whatever is something, but all they did is read some words and look at some pictures.

It’s a strange view of science in which a few referee reports are enough to put something into a default-believe-it mode, but a failed replication doesn’t count for anything. Bissell is criticizing replicators for not having long talks and visits with the original researchers, but the referees don’t do any emails, phone calls, or lab visits at all! If their judgments, based simply on reading the article, carry weight, then it seems odd to me to discount failed replications that are also based on the published record.

My view that we should focus on the published record (including references, as appropriate) is not legalistic or nitpicking. I’m not trying to say: Hey, you didn’t include that in the paper, gotcha! I’m just saying that, if somebody reads your paper and can’t figure out what you did, and can only do that through lengthy emails, phone conversations, and lab visits, then this is going to limit the contribution your paper can make.

As C. Glenn Begley wrote in a comment:

A result that is not sufficiently robust that it can be independently reproduced will not provide the basis for an effective therapy in an outbred human population. A result that is not able to be independently reproduced, that cannot be translated to another lab using what most would regard as standard laboratory procedures (blinding, controls, validated reagents etc) is not a result. It is simply a ‘scientific allegation’.

To which I would add: Everyone would agree that the above paragraph applies to an unpublished article. I’m with Begley that it also applies to published articles, even those published in top journals.

A solution that should make everyone happy

Or, to put it another way, maybe Bissell is right that if someone can’t replicate your paper, it’s no big deal. But it’s information I’d like to have. So maybe we can all be happy: all failed replications can be listed on the website of the original paper (then grumps and skeptics like me will be satisfied), but Bissell and others can continue to believe published results on the grounds that the replications weren’t careful enough. And, yes, published replications should be held to the same high standard. If you fail to replicate a result and you want your failed replication to be published, it should contain full details of your lab setup, with videos as necessary.


  1. jkhartshorne says:

    Nice post, and agreed. BTW Schachner and I argued that replications need to not just be published but tracked. This way, people whose work is easy to replicate get credit for that, journals that tend to publish highly replicable work (not just highly cited work) get credit for that, etc. Dealing with the issue of “true but nearly impossible to replicate” is tricky. But I suspect that if replications were systematically published and tracked, people would do a better job of helping others replicate their work (providing more detail in manuscripts, online, etc.).

  2. BenK says:

    In terms of cost benefit, I think a more elegant solution is to not replicate at first, but generalize instead. Was the initial study done in white mice? Do the verification in rats or pigs, or, if nothing else, brown mice. If it works there, roll on. If not, try a bit harder and communicate with the authors. If it still won’t work, revisit the original animal model. If that works, then you have a useful comparison that suggests a specific result; if not, depending on the degree of original author cooperation, either work with the author to figure out what went wrong or publish the new results and consult with the journal about putting warning labels on the papers (i.e., follow-up research suggests this result is not robust… see paper X for details).

    This could apply to non-animal research just as well. Pick a new pond. Do the study on a different sky field. There are only a few cases where it becomes difficult – a study of a unique historical event, for example.

    • Wonks Anonymous says:

      If the generalization does not give the same result, how do you know whether the reason was because brown mice are different from white, the replication differed in some unwritten lab technique, or the original result was wrong?

      • gwern says:

        That’s a good question. This reminds me of some of the issues in experimental design, like adaptive trials: you don’t always want to use very large high-powered trials, nor very small under-powered trials, but varying based on the effect so far, the cost-benefit analysis, etc.

        Seems like we may have a similar problem here: doing a 100% pure replication of every experiment doesn’t seem optimal, but instantly jumping to a more general population and a more conceptual replication also doesn’t seem optimal. So what’s the equivalent of adaptive trials here?

      • Nony says:

        I really like the generalization approach. It gives the replicator a paper (unit of work in academia). It is easy for the journals to publish. And it even comes across as a less direct pure “checking up on you”. If you have done the slugging to attempt to generalize and it did not work, you have a strong rationale to repeat the earlier experiment. In addition, you are expanding knowledge itself.

        Let’s say I think the statistics or metrics or data handling or whatever were wrong on a dendrochronology project…I can nitpick the work or I can go resample new trees. And it’s not even necessary to have the same trees…since after all the first work was a sample of the population anyway!

  3. K? O'Rourke says:

    > considered guilty until proven innocent

    Currently readers pretty much have no option but to consider work untrustworthy until proven trustworthy, especially for _terminal_ decisions on such things as whether a medical treatment is effective and safe.

    Having often painted overly bleak pictures of clinical research practice, I do believe care in conducting and reporting research is not uniformly poor. Now one of the groups I worked with was very, very careful, but also for various reasons more adequately funded than most, and therefore the carefulness was _affordable_.

    Most academic groups are likely competing vigorously with limited resources where there are few to no incentives or penalties for not being careful. This is especially so when research is complicated, lengthy and expensive (lack of replication less likely to be noticed). As jkhartshorne points out, it would be better if carefulness that made replicating easier was getting rewarded. This seems to be something some of the granting agencies are starting to pick up on – that could be one of their important roles – (funding) quality assurance of what they fund by making replication anticipated to those they fund (somewhere between random tax audits and ongoing tracking of performance). Most academic groups are winners (or at least non-losers) in the current system and I am not surprised that there will be a fair amount of grumbling.

    • Andrew says:


      In that link, Leek writes: “the hype about the failures of science are, at the very least, premature.” Well, sure, once you call something “hype” you can be pretty sure it’s overstated. I think there’s a lot of solid science and a lot of shaky science. My impression is, until recently, if a result was statistically significant and published in a serious journal, the default reaction was to treat it as true. No longer. Now we realize that lots of published papers by serious researchers in serious journals have Type S and Type M errors. To me, the issue is not that “We need a culture that doesn’t use reproducibility as a weapon” (as Leek put it) but rather that a lot of published claims do not represent true patterns in the larger population; they are just artifacts of particular data sets (often involving selection, as in the notorious three unblinded mice example). But, sure, lots of science doesn’t have this problem, and that’s good to remember too.

      • K? O'Rourke says:

        > lots of science doesn’t have this problem

        Most likely, but some sense of exactly where and when certainly would be helpful!

      • Bill Harris says:

        When I read that article by Leek, I remembered a prof in grad school who quipped that Claude Shannon had written down many a theorem in information theory, and almost all the proofs were wrong, giving a future generation of grad students great dissertation topics. I don’t like wrong answers in print, either, but (assuming that prof was correct) the process did apparently self correct.

        How are these days different? For one, I presume no one dies as a result of incorrect proofs of valid theorems in information theory, but the topic here does include clinical trials. Is what you’re saying, though, not that non-reproducibility is automatically a sign of bad science but that reproducibility enables good science to evolve as it should? That it’s not a crime to publish something that doesn’t hold up over time (at least on occasion), but it is problematic if people don’t make it easy to test their claims? I do like the “cargo cult science” reference.

        It was arguably easier to attempt to replicate Shannon’s manual mathematical proofs than some of the examples mentioned here.

  4. Jonathan Gilligan says:

    About the need to visit other researchers’ labs and communicate personally in order to replicate research: There is an important role in laboratory science for “tacit knowledge,” a term coined by Michael Polanyi and popularized more recently by Harry Collins. It is more the rule than the exception that there is an enormous amount of skill involved in successful laboratory work that cannot be written down, but must be acquired through experience, usually under the guidance of an expert. You can’t write down instructions in a cookbook that unambiguously define the proper texture or aroma of a dish. Experience cooking is essential to decoding what’s on the page. Similarly, trying to learn to play tennis from a book cannot provide what a coach can.

    Much of what we do training graduate students is guiding them in acquiring the tacit knowledge necessary to perform research on their own.

    Some time ago, I spent a couple of years working with molecular biologists working on microtubules and there was a lot of tacit knowledge in producing the tubulin proteins. People did their best to publish the details of their preparations (the details don’t fit into journal papers, and are most often found in Ph.D. dissertations, so one aspect of facilitating replication is to make all dissertations more easily accessible online), but still it was often necessary, when we wanted to use a new preparation that had been published, for my senior colleagues to fly out to the authors’ lab and work side by side with them to learn how to reproduce it.

    These difficulties and the importance of visiting peers’ labs were pretty much part of the culture of the group I worked with. Thus, since replication of biology experiments is important, the right way to do it is probably to provide more funding to support the kind of travel and training that Bissell describes. And there, the dilemma is in allocating funds between replicating previous work and pushing forward with new work. But this is something of a false dichotomy.

    One thing that goes unsaid in Bissell’s Nature piece is that replicating the original experiment is not just about checking for fraud or error. People will want to build on previous results or perform variants in order to test new hypotheses. To do that, they need to start by reproducing the original to make sure that they have a control with which to compare their variant (see Feynman’s “Cargo Cult Science” about needing to reproduce an earlier nuclear physics experiment with protons before running a variant with deuterons). Thus, the only reason not to want people to reproduce your results would be if you think they’re so uninteresting that no one would want to use them or add to them.

    • Andrew says:


      I agree. Indeed, one of the central threads in my own statistics research is the formalization of methods that represent tacit knowledge of statistics.

      • george says:

        Andrew: could you comment on the tension between recording every detail of a study, and modeling approaches that result in long, twisty paths through data analysis, where every detail is unlikely to be recorded? (at least by some analysts)

        • Andrew says:


          I don’t think it’s necessary to record every preliminary analysis that you did. I think it is a good idea to supply the exact survey questions, the exact rules for excluding cases, etc. It should be possible to replicate the published analysis, I don’t think it’s necessary to replicate all the intermediate steps that didn’t make it into the article.

    • Loved this blog post, Andrew. You should turn it into an article or column somewhere.

      Jonathan Gilligan already said what I was going to say — reproducing bio experiments is like reproducing a dish from a recipe. A lot is implicit, and almost necessarily so. Videos help, but aren’t really enough because the pans, dishes, and especially the basic ingredients will vary (also mentioned in comments and in the original article).

      I think recipes and cooking are very good analogies for the way bio experiments are done, at least judging from my wife’s reports from her time in bio labs. Having “good lab hands” (search it) is like being a good cook. At the risk of offending our newly empowered Title IX enforcers, the danger Andrew’s trying to guard against is the dreaded “mother-in-law recipe” (traditionally passed down from a man’s mother to his wife with intentionally misleading instructions or missing ingredients so the daughter-in-law can’t reproduce her mother-in-law’s cooking for her husband).

      Jonathan’s reference to tacit knowledge is spot on. It’s often used in cognitive psych to describe things we know how to do but don’t know how we do them, the canonical examples being riding a bicycle or speaking a language.

      But my fave part of the whole essay is “A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time.” Haven’t all the senior people joining this discussion been on editorial boards and review boards for grants? To go back to cooking metaphors, once you see how the sausage is made, you’re not so keen to eat it.

      • Andrew says:

        Hey, that’s a great line: “A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time.” And it turns out that I wrote it! Cool. I didn’t remember this bit at all, cos I actually wrote it 3 weeks ago…

        • Rahul says:

          I don’t like that line. It’s too snarky & somewhat nihilistic. I’m not sure about Cell, but surely the quality / importance / interesting-ness of a Nature paper is much better than the median Arxiv?

          Do you also think of a Michelin star restaurant as only having bumped into the right food critic on the right day? Or a football star as only having bumped into the right selector at the right practice?

          I think there’s a curatorial aspect to a top notch journal, which may not be perfect, but nevertheless does exist.

          • Andrew says:


            I can only assume that the average quality (however that is measured) of Cell articles is higher than the average quality of Arxiv articles. So, sure, I think it makes sense that we treat any claims published in a Cell article with more respect than we treat the claims of a random Arxiv article.

            The difficulty comes in the next step, when people can’t replicate the Cell article or where they note problems in the data collection or analysis. At this point, I think you have to weigh the evidence, and what is the evidence in favor of the disputed article being correct? The evidence is that 3 referees (or however many referees Cell uses) read some words and looked at some graphs and thought the article was worth publishing (which doesn’t, by the way, mean they thought it was correct; as a frequent reviewer I am aware that we often accept articles because we think they might be worth looking into even if they are likely to be wrong). That’s some evidence, but I don’t think it should overwhelm the evidence supplied by a failure to replicate. Again, Bissell criticizes nonreplicators for not doing emails, phone calls, and lab visits—but she expresses no problem with referees who accept a paper entirely based on the materials that are submitted to the public record.

            To go on to your analogies, the Michelin people don’t just give the star and stop there. They come back and re-evaluate with some frequency (I don’t know if it’s every year or every three years or what). In contrast, once something’s in the scientific literature it stays there forever, with very rare exceptions. Similarly, in football, if a player doesn’t perform, he is pushed out. Even Tim Tebow eventually stopped getting playing time.

            • Rahul says:


              The status quo is that neither Arxiv nor Cell demands independent replication. Ergo, the Cell articles are probably better because they were curated by experts and at least methodologically validated etc.

              Now would failure to replicate be damning? Yes. Do I agree with Bissell’s defensiveness? Not at all.

              My only point is that, in absence of outside validation mechanisms, there is indeed strong reason to treat Cell / Nature as better than a random Arxiv. Hence I do not like your line.

            • K? O'Rourke says:

              I was suggesting that we need this sort of Michelin thing for publishing researchers. Without it, many important decision makers feel they have almost no choice but to treat published papers as hearsay (and those numbers are likely growing fast).

              Quality assessment and analysis based on current publicly available data is largely hopeless* – some auditing of processes and records likely will be required but much less if _reproducible documentation_ becomes more the norm and this possibility of audits leads to relatively unbiased documentation.

              * For some discussion of hopeless see: On the bias produced by quality scores in meta‐analysis, and a hierarchical view of proposed solutions. S Greenland, K O’Rourke – Biostatistics, 2001

          • Fernando says:


            We have had this discussion in the past

            Arguably top journals publish good papers not bc they provide good reviews or anything but simply bc there is a professional norm whereby we send our best papers to top journals. If anything editorial decisions make the journal worse by selecting on sexy. IOW, it is selection bias rather than procedures at top journals that explains their quality.

            I’m not saying this is what goes on all the time but it is not a crazy scenario either.

    • Steve Sailer says:

      How much of the time do medical labs have an incentive to not explain fully all their techniques so that they remain the state of the art place to go? For example, when I had cancer in the 1990s, I asked around and found that if I needed a stem cell transplant, there were about a half-dozen cancer centers in the country that were famously good at not killing their stem-cell transplant patients. (And there were a lot of other places that wanted to climb the learning curve, one dead patient at a time.)

      Presumably, lots of scientific papers have been published on how to do stem cell transplants, but there appeared to be a big difference in results between reading about how to do it and actually having a lot of experience doing it: tacit knowledge.

      Is that because it’s just too hard to put the How-To section down in writing, or because the people who really know How-To have an incentive to keep a few secrets secret, or because scientific journals don’t have the right format and they need to publish much more How-To supplemental material?

      • mpledger says:

        That was my thought of reading this. If you teach someone to replicate your work then you’ve just made a competitor for funding in your area of work.

        The other point is that the effort and care to make the original breakthrough is (most likely) very, very high. The motivation for the people doing the replication is not going to be the same and they are more likely to make trivial errors that mean the replication isn’t going to work.

        (And if you are trying to replicate a competitor’s work then it may even be worthwhile to “accidentally” not be able to replicate their work, to cast doubt on their research quality.)

        • Steve Sailer says:

          Yes, that reminds me that they often tell you when you get cancer to try to get into a clinical trial not just so that you get access to the latest medicine, but because you get more careful care in a clinical trial than in just routine medical care. For example, I was the first person in North America with my precise version of non-Hodgkin’s lymphoma to try the new monoclonal antibody Rituxan, which went on to become a blockbuster drug. When I developed a strong shivering reaction to it upon first getting dosed with Rituxan, about 15 medical personnel crowded into the room to keep an eye on me. The doctors and nurses really didn’t want to have me die on them and wreck the reputation of one of the most promising drugs of the 1990s.

        • K? O'Rourke says:

          Unfortunately, I do have firsthand experience of this happening in clinical research (holding back on how research was actually carried out to lessen possible competition for grants) and Steve has provided us with some sense of how seriously it can impact patients.

          I think we should do our best to minimise this sort of injurious behaviour, especially given how _understandable_ it is and how undesirable the only alternative to being a patient at some point in time is.

    • patrick caldon says:

      An excellent point. And what happens when the authors themselves don’t know the important steps required for replication?

      Suppose one researcher happens to be a bit rougher with handling their samples than another. And it so happens that rough handling at one particular point in the “recipe” gives the samples that little bit of extra mixing that makes the experiments work. It’s not something that can ever be written down because the researcher is completely unconscious of it, it’s not something that will show up in a video, the only way to detect it is to attach little accelerometers to all the equipment.

      The research is correct, but not replicable.

      Of course it’s very important to learn that the rough handling is an ingredient of what makes the experiment work. But it’s not something that you’ll ever discover through publication, videos, identical reagent suppliers etc. And the only way (I think) you will discover it is by going into “default believe it mode”, getting the two researchers, one who can do the experiment and the other who cannot, side by side in a kind of dumb-show to work out the precise step at which they differ. And the only way to get them side-by-side is to publish the apparently non-replicable result.

      • question says:

        While knowing about the effects of rough handling would be interesting information to have, if some result is conditional upon such exact experimental conditions it is probably not great support for the overall narrative the researchers are attempting to paint.

      • Andrew says:


        I’d think that some aspect of the rough handling would come across in the video.

        But, in any case, I have no problem with someone publishing the non-replicable result; my problem is with the attitude that publication imposes a duty on others to consider the result as correct unless proven otherwise.

    • Nony says:

      1. Little bit of a red herring. Issues with replication bedevil plenty of “professionals”. Don’t you think the Amgen crew knew how to do basic lab practices and had familiarity with the tools of the trade?

      2. Y’all’s cooking analogy is WACKED. Cookbooks are very useful. If dishes turned out as different or were basically not even what you meant to make as flawed papers, we would have a serious issue. People buy cookbooks exactly because they plan to cook the recipes! Science papers on the other hand…many researchers hope to make it sexy enough to help career while still obscure or difficult enough not to get exposed by replication (or just the stuff is all minutia anyway). Heck the term “cookbook” is a common synonym for an easy to follow procedure!

  5. LemmusLemmus says:

    I also agree. I think generally in this debate there seem to be two views of which question a replication attempt answers:

    1. Is the effect true? You can rephrase this in terms of the size of the effect or the likelihood that the effect is true and add all kinds of discussions about what “effect is true” even means, but the bottom line is that this is a question about the subject matter of the research

    2. Are the researchers incompetent – or even frauds? This question, in contrast, is about the character of the original researchers.

    The view that replication is about the second question typically remains implicit, but it seems to be underlying postulates such as that a replication attempt (in laboratory psychology) should use at least three times as many subjects as the original study and the like. Perhaps it is also the basis for Prof. Bissell’s statements about phone calls, visiting labs, etc. (Having said that, at least in some cases the blame for not enough details about procedures needs to be placed on journals’, not authors’ doorsteps.)

    I can understand that researchers care about their reputation (question 2), but of course what the research community and the world at large really care about is question 1.

  6. jrkrideau says:

    It is interesting that psychologists are rather aggressively addressing the reproduction problem.

    The results do seem surprising but at least they seem to be attempting something.

    • question says:

      Were those results really replicated? Look how far away the original effect sizes are from the edges of the 99% confidence intervals. Looks more like they only replicated 2 or 3 of the results.

  7. Kaiser says:

    Sounds like Bissell is another scientist who is guilty of overconfidence in the statistical significance toolset. A 5% significance guarantees at least some of the published findings to be false positive results. If her lab has a 100% replication record, either it is an outlier or it has not generated enough results yet.
    Further the more complex is the experiment or the experimental materials, the more chance of an error, and the more we need to replicate!
    Finally, much better for a third-party to replicate.
    The key issue to understand is that failure to replicate should be a normal part of science, there should not be shame unless it’s fraud.

    • Andrew says:


      You write, “A 5% significance guarantees at least some of the published findings to be false positive results.” That’s not correct. It depends on the distribution of the true effects. In any case, I prefer a framing in terms of Type S and Type M errors rather than false positive and false negative.

      I agree with your last sentence. A key point of publication is that future researchers can take what you’ve done and then do more.
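
      The dependence on the distribution of true effects can be illustrated with a small simulation. This is only a sketch under assumed numbers, not an analysis of any real literature: every study is a single normally distributed estimate with known standard error, and the function names, the share of true nulls, and the effect sizes (2.8 and 0.5 standard errors) are all hypothetical choices made for illustration.

```python
import random

Z_CRIT = 1.96  # two-sided 5% threshold for a z-test


def share_of_significant_that_are_null(prop_null, effect, n_studies=20000,
                                       se=1.0, seed=1):
    """Among 'significant' estimates, return the share whose true effect is zero.

    prop_null and effect are hypothetical inputs: the fraction of studies
    with no true effect, and the true effect in the remaining studies.
    """
    rng = random.Random(seed)
    sig_total = 0
    sig_null = 0
    for _ in range(n_studies):
        is_null = rng.random() < prop_null
        true_effect = 0.0 if is_null else effect
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > Z_CRIT * se:
            sig_total += 1
            sig_null += is_null
    return sig_null / sig_total if sig_total else 0.0


def type_s_and_m(true_effect, se=1.0, n_studies=50000, seed=2):
    """For a nonzero true effect, return (Type S rate, exaggeration ratio).

    Type S: share of significant estimates with the wrong sign.
    Type M: mean |estimate| among significant results, divided by |true effect|.
    """
    rng = random.Random(seed)
    n_sig = 0
    wrong_sign = 0
    abs_sum = 0.0
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > Z_CRIT * se:
            n_sig += 1
            abs_sum += abs(estimate)
            if estimate * true_effect < 0:
                wrong_sign += 1
    return wrong_sign / n_sig, (abs_sum / n_sig) / abs(true_effect)


if __name__ == "__main__":
    # No true nulls at all: no significant result can be a false positive.
    print(share_of_significant_that_are_null(prop_null=0.0, effect=2.8))
    # Mostly nulls: a sizeable share of significant results are false positives.
    print(share_of_significant_that_are_null(prop_null=0.9, effect=2.8))
    # Small true effect, noisy study: wrong signs and inflated magnitudes.
    print(type_s_and_m(true_effect=0.5))
```

      The first two calls differ only in the assumed share of true nulls, which is the point: the 5% threshold alone does not determine how many published findings are false positives. The last call shows why a Type S / Type M framing can be more informative when true effects are small but not exactly zero.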

    • In contrast to research in social sciences, research in biology is often directly connected to real causal effects. The experiments involve things like turning on a gene or knocking-out the gene and seeing the effect on whatever, measuring the concentration of the protein, proving that it’s actually been knocked out etc… I suspect in good careful science labs that many of the effects are such that they have p much lower than 0.05. For example I recently helped a colleague do a Bayesian model for the results of a firefly luciferase assay. The idea is that you grow cells that are genetically modified and they produce a fluorescent protein under the control of a particular DNA sequence that causes transcription to occur. The idea is to see that there are differences between the intensity of the transcription and whether or not a particular hormone is present. In the important cases highlighted in the paper I think the effect sizes were so large that a p value would have been maybe 0.00001 or something like that.

  8. Erin Jonaitis says:

    Thanks for addressing this opinion piece. I agree with your take; my reaction to the paper was similar to yours, but with more swearing.

  9. Joel says:

    Though I am essentially in agreement with you on replication issues, I think that your post actually includes some of the features that make “good” researchers defensive about replication. A big part of the issue in thinking about replication is to think about 1) why we’re replicating and 2) what a failure to replicate means. Part 1 inevitably has a big effect on how we think about Part 2.

    Many discussions of replication, including your post here, lead with prominent cases of outright fraud (Potti & Stapel). To the extent that we think of replication as dealing with issues introduced by complete fraud and fakery, the implication is that a failure to replicate means that the original scientist is dishonest, and perhaps even a criminal who stole grant money.

    To the extent that most scientists are honest, hard-working people, who are also smart enough to know that for any of a thousand reasons, any given replication may fail, it’s very alarming to think that a failed replication might lead one to be mentioned in the same breath as the prominent fraudsters. Such fraud is exceptionally rare and isn’t really a good way to frame the replication debate and sets the whole thing up very antagonistically.

    This, I think, is where the issue of peer review comes up. To the extent that you designed a reasonable protocol (which is what peer review ought to be discriminating on) and you followed that protocol, your conduct is not blameworthy even if the result is actually false; we all want to be trusted. On the other hand, I think we all know that we’re clever people who like our theories, so there are a large number of subtle ways that we influence our experiments; I think these are nearly always well-intentioned. People that exclude subjects, etc., do so because they honestly think it’s the right thing to do, not because they’re trying to pull the wool over anyone else’s eyes. Framing replication as a response to these inevitable biases and the fact that noise in the world means we’ll sometimes be wrong even if the design, execution, and analysis are flawless takes away the connotation that failing to have one of your findings replicated is somehow morally blameworthy and sets replication up as a friendly, natural part of the scientific process.

    I think, actually, that fields which rely very heavily on observational data (e.g., political science and econ) are much better about adopting this attitude than experimentally-driven fields. No one implies that an observational study that left out some variable or rested on a bad exogeneity assumption or used a bad proxy to measure something and consequently got something wrong is the result of a malicious or fraudulent researcher. Instead, it’s seen as natural that researchers who do perfectly solid work will often, through no fault of their own, be wrong. This goes hand in hand with the issue of retraction. No one would ever suggest that early work on education in labor economics that totally ignored selection effects ought to now be retracted (because that carries the stigma of fraud). Instead, we just accept that research advances and old conclusions often turn out to be wrong, which is the healthy way to think about the issue.

    • Commenter says:

      This is a good point. I think the fear of condemnation if good faith results simply don’t hold up is a major factor that discourages replication, data sharing, etc.

    • Andrew says:


      The trouble is that peer review is just three busy people reading a few words and looking at a few pictures. I don’t think anyone thinks that a claim published in a lab notebook or an Arxiv paper should be taken as correct without verification. I think a similar air of skepticism has to be taken for Arxiv papers that happen to have appeared in Cell, etc.

    • Rahul says:

      “Such fraud is exceptionally rare”

      Is there a good reason to believe this? I wonder. It is true that detected instances of fraud are very rare but is that because they indeed are rare or because we rarely dig deep enough to discover them?

      Is there any empirical reason to believe that academics are any more honest than the average man or industrial executive?

      My hunch is fraud is far more common than we think it is. But in the absence of checks & validation it so rarely gets discovered.

    • Nony says:

      I getcha. And I don’t like someone going over my expense account. But still…probably a good idea in general that there is some checking of expense accounts at times, no? ;_

  10. Fernando says:

    Great post. I agree that peer review is not that informative, and certainly no guarantee of quality. Replication is better.

    My minor contribution to this debate is trying to parse out what people mean by replication. It has many different meanings

    Not the last word, perhaps not the best typology, but a start. Replication ought to be a teachable _method_. I illustrate with an application of one method.

    • K? O'Rourke says:

      I think you are right (and your paper looks interesting).

      There seem to be quite varied ideas here about what it is and what it is for.

      Perhaps not surprising as there are few courses or materials on dealing with uncertainties of multiple _similar_ studies.

      (Used to be called the cult of a single study.)

      My main take was that replication could provide (strong) encouragement for doing better studies (and here _blame_ is mainly an impediment and potentially highly error prone.)

    • jrc says:


      This is a really nice paper, and definitely helpful for thinking about the many different ways we can usefully do replication work. Personally, I’ve always liked the stuff that (I think I’m reading this right) falls between “statistical replication” and “procedural replication”, the testing of the methods that constitute a “scientific standard”. I call this stuff “experimental econometrics/statistics” – testing out estimators on real data and seeing how well they perform (unbiased, appropriately-sized confidence intervals). You cite the classic LaLonde paper, but I’m also thinking of the standard error work in, say, “How much should we trust difference-in-difference models?” I think those kinds of papers fall somewhere in the replication spectrum too, but they aren’t exactly replication. Or maybe I just read too fast and you have a category for them too.

      • Fernando says:

        Thanks for the comments. I will take a look at the papers you mentioned. My goal was not to be definitive but only methodical; to highlight the importance of research practice to scientific progress.

  11. […] Gelman also reacted to Mina Bissell’s post on December 17, 2013 on his blog. There’s an interesting discussion going on below his […]

  12. I also read Mina Bissell’s piece and I do not agree with many of her points. Here are some thoughts from the perspective of someone doing replications in the social sciences, and who teaches a replication workshop where graduate students replicate a published paper.

    First of all, I’m not sure I like the word ‘newcomer’ Bissell uses. It sounds as if those trying to replicate work are the ‘juniors’ who are not quite sure of what they are doing, while the ‘seniors’ worked for years on a topic and deserve special protection against reputational damage.

    It goes without saying that anyone trying to replicate work should try to cooperate with the original authors. I agree. However, I would like to point out that original authors don’t always show the willingness or capacity to invest time into helping someone else reproduce the results.

    It is in the interest of original authors to clearly report the experimental conditions, so that others are not thrown off track due to tiny differences. This goes for replication based on re-analysing data as well as for experiments. The responsibility of paying attention to details lies not only with those trying to replicate work. Original research might take years, but it really should not take years to replicate them, just because not all information was given.

    Bissell states that replicators should bear the costs of visiting research labs and cooperating with the original authors (“Of course replicators must pay for all this, but it is a small price in relation to the time one will save, or the suffering one might otherwise cause by declaring a finding irreproducible.”). I’m not quite sure of this. As Bissell points out herself at the beginning of her article, it is often students who replicate work. Will their lab pay for expensive research trips? Will they get travel grants for a replication? And for the social sciences that often works without a lab structure, who will pay for the replication costs? I feel the statement that a replicator must bear all costs – even though original authors profit from cooperation as well, can be off-putting for many students engaging in replication.

    I wrote a longer piece here: on Nov 24.

    • Andrew says:

      Regarding your point #2, I’d really like to push back against the norm of requiring outsiders to contact the paper’s authors. For one thing, as you note, authors often simply refuse to share data (see here for an example, from back when I was a student). But, beyond this, the published record is all that most of us have to work with. If a key aspect of the paper can only be learned by contacting the authors, that represents a real weakness in the published record. Often, though, it’s not the authors’ fault. Many journals nowadays seem to be moving toward the “tabloid” goals of crispness, to the extent that there’s just no room to give all the details in the published paper. Ideally it could go in the supplemental material, but typically the supp material is used not to make a complete record but rather to include various alternative analyses to make reviewers happy. In any case, if the material’s not in the paper, it’s not in the supplemental material, and it can’t be found in the references, then as far as most users of the paper are concerned, it’s not there.

      • I see your point. I agree that “If a key aspect of the paper can only be learned by contacting the authors, that represents a real weakness in the published record.”

        But is there really a trade-off between contacting an author, and pushing for more transparency of methods and data description? Ideally the latter would be the case, and contacting the author could be an additional benefit to further understand their thinking, right?

        By the way, in the replication course that I’m teaching this year I required that all students pick ONLY those published papers for replication that uploaded data, and describe methods clearly. That implies that they don’t necessarily have to contact the author and ask for help with their projects (which has caused considerable delay and frustration in last year’s course). A few students trusted that they can get help from the authors anyways and did not meet the requirement. Until now (3 weeks into the course) they haven’t heard back yet and could not start their projects.

        What I learned from this is:
        a) I have to make the requirement more clear next year because it causes delays for the student projects.
        b) My procedure means that only honest, transparent authors are being replicated and have to deal with potential negative effects that Bissell talks about. I usually praise authors who are transparent on my blog and I tell my students that this is the ‘gold standard’ – but now I wonder if some might think that the focus on authors who are already providing data and code could be seen as somewhat unfair?

        • Nony says:

          Have the student do one of each (a transparent uploaded and one who is not). Try to control for other variables. It would then be possible with the complete class results to make some judgment about the two populations.

          For that matter, it seems like common sense that you are going to be more careful if you upload code and data, if you annotate it for replication etc. Heck, there’s more motive to have it right, plus the time spent will generally help find bugs, transcription errors, etc.

          • Nony says:

            And if the students don’t get the data from the nontransparent ones, so be it. Perhaps you could sort them in groups then to look at the few nontransparent ones who reply. (I bet you get less than 50% willingness to share data…which in itself shows Bissell is full of #$%& to suggest lab trips and all that for replicators to original publishers.)

  13. Dan Wright says:

    Here is a “backlash” to a “replication backlash”. Bargh was critical of some researchers who failed to replicate a finding of his until they tweaked it in a manner which would undermine the conclusions (and critical of the Ed Yong who wrote the article in the link below). This post in National Geographic is arguing with Bargh’s ‘backlash’.

    • K? O'Rourke says:

      Thanks for posting this – given a choice between “listening and getting less wrong” versus “claiming you could not have been wrong – at least importantly so”, most seem to choose the worst.

  14. […] of the replication drive. Now that American statistician and political scientist Andrew Gelman blogged about the topic, the discussion continues. According to Gelman, “the push for replication is […]

  15. I agree with Bissell that failure to replicate often is failure on the part of the replicator and not a problem with the original experiment. I also agree with all those that mention the importance of “tacit knowledge” and apprenticeship in learning techniques in biological sciences. I also think that attempting direct replication is most often interesting when it follows a failure to successfully generalize. I want to talk about a different issue.

    It seems to me that Andrew is attacking a different point than the one discussed by Bissell. Bissell seems to be worried about the possibility that replication by professional replicators (or in some other way) would become a *condition for publication*. Andrew seems to be primarily interested in supporting and encouraging people who try to replicate *already published* studies. I think that a lot of the apparent space between them is the result of this gap.

    So part of the distance may result from a difference in what replication means in different fields, part may result from addressing different issues. Another part may arise because of a difference in the way they are viewing the role of the scientific literature. Is scientific literature primarily a place where scientists can report their findings so other scientists can think about them, use them for inspiration, and pick and choose the results that help them put together a meaningful model of the world? In that case, the cost associated with publishing irreproducible results is relatively small. They may mislead us, but they may also inspire us or challenge us. We prefer to err on the side of getting stuff out there. The alternative is to think of the scientific literature as playing its primary role in guiding policy. In that case, of course, you’d like it to be more trustworthy.

    • Andrew says:


      See the last section of my post above. I think it’s fine to “err on the side of getting stuff out there.” I just don’t think that, just cos something’s published, we should have the presumption that it is correct. And I don’t think there should be such high barriers to publishing corrections. See the discussion here. The flip side of making it free and easy to publish is that we should make it free and easy to publish replications, non-replications, corrections, etc.

    • Rahul says:

      I’m curious why you’d assign more blame to replicator for failure to replicate and not the original? What’s your logic here?

      Obviously not for every sector, but in sub-areas found to be historically notoriously hard to replicate findings in, what’s the harm in having professional replicators?

  16. Lior Pachter says:

    Thanks for your thoughtful post. I agree completely, and just wanted to add two things. First, the point you make about emails, phone calls and a lab visit is precisely why when I blog about published papers I do not consult the authors beforehand (I’ve posted this policy on my site). Second, the idea that researchers are either complete fraudsters or completely honest is a false dichotomy. There is a lot of grey in between, and the constant refrain that “of course, the vast majority of scientists are honest” rings hollow to me. I don’t really see how they are different from bankers. Some are honest, some commit crimes, some just skim 5% off the top- almost none go to jail.

  17. Steve Sailer says:

    “Stereotype Threat” doesn’t seem to replicate in the hands of skeptics, such as John List, Steven Levitt, and Roland Fryer:

    But maybe Stereotype Threat really does replicate well in the hands of true believers. Maybe the subjects can tell if the experimenters deeply want them to perform in a certain way, so they make it come true to make the experimenters happy?

    • RudigerVT says:

      Howdy Steve et al. I don’t think that you’re talking about the difference in the outcomes of experiments that are correlated with the researchers’ expectations. Rather, this study found (again) that stereotype threat does not appear to emerge in bona-fide high-stakes testing situations.

      Why not? Because it’s basically impossible to manipulate in the field: there is no ‘low-threat’ condition. The stereotype will, on average, be activated for all members of groups that are stigmatized as being, well, stupid. Larry Stricker, at ETS, found this in the late 1990s in a field experiment on the SAT.

      So it’s really an apples-to-oranges comparison between lab and field conditions. BTW, it’s a somewhat delicate effect to elicit (as per my PhD dissertation) but it’s not *that* hard (heck, even I did it!) It’s a very, very widely-replicated effect and has been for nearly 20 years–as long as you’re talking about lab-based experiments. Fun fact: you can get high-performing Caucasian men to under-perform if you ask them their *ethnicity* before the test. Asians, you know.

      Anyway, stereotype threat is not so much an explanation for performance gaps on real-world tests. Rather, it is a mechanism that contributes (in a bad way) to stigmatized students’ becoming disconnected from the *process* of education. Because, in the short term, for these students, failure is potentially a signal that the stereotype applies to them. For that reason, stigmatized learners are more likely to withdraw effort on learning tasks, so as to make (failure) feedback ambiguous. But even if they *are* trying their ‘hardest,’ it’s the distraction of the stereotype that shaves off precious, limited cognitive resources.

      Over years of schooling, left unchecked, these forces (withdrawing effort, and, basically, distraction during learning tasks and tests) lead to lower levels of educational attainment, which (at least in part) would show up as lower performance on standardized tests. If you want to know more, click my name for the compilation at the website,


  18. Paul Gronke says:

    Another section from the article that is worth highlighting is here:

    To me, the solution is not to require e-mails, phone calls, and lab visits—which, really, would be needed not just for potential replicators but for anyone doing further research in the field—but rather to expand the idea of “publication” to go beyond the current standard telegraphic description of methods and results, and beyond the current standard supplementary material (which is not typically a set of information allowing you to replicate the study; rather, it’s extra analyses needed to placate the journal referees), to include a full description of methods and data, including videos and as much raw data as is possible

    This is do-able only if we change the publication incentives in the sciences and social sciences, but change would be good. Right now, the pressure on junior scholars in particular is to publish as many “significant” findings as possible within a short amount of time.

    There is no professional incentive to document carefully, replicate your own work, or make available materials that allow others to replicate.

    *IF* funding entities would actually fund replication (or require submission to replication archives); *IF* journals would require replication materials before they’d publish no matter if the author is from Harvard, Howard, or Alabama Huntsville; *IF* tenure and review committees would evaluate work not based solely on H-indices and citation counts but also on replicability;

    If all of this happened, it would slow down the publication cycle (a good thing), assure that our scientific findings are actually findings (another good thing), and I think may also improve the quality of life enjoyed by junior and senior scholars.

  19. […] "It’s a good sign that the push for replication is so strong that there’s now a backlash against […]

  20. Kieran says:

    The discussion seems to have centred on laboratory techniques and how difficult they are to do well. OK, fair enough, I’ll accept that, but it highlights what I think is the most important problem with the reporting of laboratory-based experiments: the methods sections give detailed descriptions of the laboratory techniques used, but virtually no description of the design of the experiment.

    How can anyone replicate an experiment if the experiment is not described?

    Take the following as an example. I came across this paper because of a media report titled “New study gives hope on autism”. It’s a paper in Cell:
    Hsiao et al. (2014). Microbiota Modulate Behavioral and Physiological Abnormalities Associated with Neurodevelopmental Disorders. Cell.

    Briefly, they induce maternal immune activation (MIA) in mice by injecting pregnant dams with the viral mimic poly(I:C). This produces offspring that exhibit the core communicative, social, and stereotyped impairments typical of autism, including a common autism neuropathology. They then look at the offspring of MIA mice that display behavioural abnormalities to see if they have defects in their gut analogous to those found in people with autism. They then randomly assign the offspring to control or treatment with a gut bacterium and look at its effect on autism-related GI and behavioural abnormalities.

    In the Experimental Procedures section (page 10), there is no mention of the number of pregnant dams used or how they were assigned to either the control or MIA group, nor the number of offspring assigned to control or treatment with the gut bacterium.

    The statistical analysis (page 11) involved t-tests and ANOVA which are not appropriate in this study because it’s a clustered experimental design and the data analysis will need to account for this.
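
    A minimal sketch of why the clustering matters, assuming the standard design-effect approximation 1 + (m - 1) × ICC for equal-sized litters. The litter counts and the intraclass correlation (ICC) below are made-up illustration values, not numbers from the Hsiao et al. paper, and the function names are hypothetical:

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation incurred when clustered observations
    (e.g. pups within a litter) are treated as independent."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_total: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical design: 10 litters of 8 pups each, within-litter ICC of 0.3.
n_litters, pups_per_litter, icc = 10, 8, 0.3
n_total = n_litters * pups_per_litter

deff = design_effect(pups_per_litter, icc)
n_eff = effective_sample_size(n_total, pups_per_litter, icc)

print(f"pups measured:         {n_total}")
print(f"design effect:         {deff:.2f}")
print(f"effective sample size: {n_eff:.1f}")
```

    Under these assumed values, a t-test that counts every pup as an independent observation overstates the amount of information in the data by roughly a factor of three. A mixed model with a litter-level random effect, or an analysis of litter means, are the usual ways to account for this.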

    In the Supplemental Information (page S2) there is this:
    Behavioral Testing
    Mice were tested beginning at 6 weeks of age for PPI, open field exploration, marble burying, social interaction and adult ultrasonic vocalizations, in that order, with at least 5 days between behavioral tests. Behavioral data for B. Fragilis treatment and control groups (Figure 5) represent cumulative results collected from multiple litters of 3-5 independent cohorts of mice for PPI and open field tests, 2-4 cohorts for marble burying, 2 cohorts for adult male ultrasonic vocalization and 1 cohort for social interaction. Discrepancies in sample size across behavioral tests reflect differences in when during our experimental study a particular test was implemented.

    What does this mean? What exactly does “multiple litters of 3-5 independent cohorts of mice” mean? I have read this a few times and I am none the wiser.

  21. Rahul says:

    Mina Bissell leaves unsaid a dirty secret beneath all this: In a lot of these high-competition research environments, leaving out key ingredients & steps has long been a weapon to stymie your competitors and force anyone doing derivative research to make you a co-author by means of a collaboration.

    Strict replication would force full disclosure & many of the ultra-competitive personalities hate that.

  22. Peter says:

    Great post. While much of the replication debate involves the hard sciences, I think it is relevant to the social sciences, even ones (like mine, political science) that tend to not conduct labwork or experiments. It’s really relevant in the area of data coding. Replicating someone’s analysis is fine; most journals require replication data and commands to be available. But the big trick is replicating coding. Could I take the same information and codebook that the researchers used and come up with the same information? This is where issues of falsifying or tweaking data to fit desired outcomes come in. I think that should be the standard that people in my field, and related ones, aspire to.

  23. […] via Replication backlash « Statistical Modeling, Causal Inference, and Social Science Statistical Model…. […]

  24. Eli Rabett says:

    As Theodore Sturgeon said, 90% of everything is crap, and Eli would add that at least another 5% is obvious and another few percent are only interesting to the people doing the work. Thus, very few key results will need replication in order to advance any field and these are the places where, willy nilly, people work the hardest to replicate.

    • Andrew says:


      Yes, but the trouble is that a lot of smart people don’t know ahead of time what are the key results, and they can waste years or even entire careers on dead ends. Mistakes are unavoidable, but what is perhaps avoidable is for researchers and the scientific community, because of a misunderstanding of statistics, to hold with inappropriate certainty to speculative findings.

  25. […] “… if it really takes a year for a study to be reproduced, if your finding is that fragile, this is something that researchers should know about right away from reading the article.” — Formal replication is hard, but it’s still important. […]

    • Andrew says:

      I followed the link to see this quote:

      You know that thing, where you read an article on the internet, and then you look at the comments, and the first comment is something super judgmental, or scolding, or something defensive and fawning? Yet it is blindingly obvious from the comment that the commenter did not actually finish reading the article?

      That dynamic pretty well describes the faculty discussion of candidates in every job search in academia.

      Except, on the internet, there is usually some other commenter who points out that the first one did not read the whole article. Now imagine an internet comment thread where no one finished reading the article, but where everyone felt compelled to express an opinion.

      That, kids, is how tenure-track positions are filled.

      That is sooooo true!

  26. […] Andrew Gelman on the backlash against replication. […]

  27. […] In December 7’s Weekend Reads, we highlighted a piece by Mina Bissell in Nature arguing that scientists shouldn’t push so hard to replicate findings. Andrew Gelman critiques that piece […]

  28. […] a good sign that the push for replication is so strong that there’s now a backlash against […]

  29. See M R Munafò et al. “Bias in genetic association studies and impact factor”, Molecular Psychiatry 14: 119–120, 2009:

    “We divided the individual study odds ratio (OR) by the pooled OR, to arrive at an estimate of the degree to which each individual study over- or underestimated the true effect size, as estimated in the corresponding meta-analysis. [...] Data were analysed using meta-regression of individual study bias score against journal impact factor. This indicated a significant correlation between impact factor and bias score (R² = +0.13, z = 4.27, P = 0.00002).”

    In other words, results (of the relevant kind) published in higher-impact-factor journals are somewhat LESS likely to be replicable.
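
    The bias score described in the quote is just a ratio, which can be sketched as follows. The odds ratios here are invented for illustration and are not data from Munafò et al.; the function name is hypothetical:

```python
def bias_score(study_or: float, pooled_or: float) -> float:
    """Individual study odds ratio divided by the pooled (meta-analytic) OR.

    A score above 1 means the study's OR exceeds the pooled estimate;
    a score below 1 means it falls short of it.
    """
    if pooled_or <= 0 or study_or <= 0:
        raise ValueError("odds ratios must be positive")
    return study_or / pooled_or


# Hypothetical meta-analysis: pooled OR of 1.2 and three individual studies.
pooled_or = 1.2
study_ors = {"study A": 1.8, "study B": 1.2, "study C": 0.9}

scores = {name: bias_score(or_, pooled_or) for name, or_ in study_ors.items()}
for name, score in scores.items():
    print(f"{name}: bias score {score:.2f}")
```

    The meta-regression in the paper then relates these scores to journal impact factor; that step is not reproduced here.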

  30. Alice Allen says:

    One area not specifically mentioned so far in this discussion (unless I missed it) is the reproducibility of results generated by computational methods. Many times, the programs (source codes) used to produce the results are not available for examination for a variety of reasons (see Barnes, Shamir et al., Ince et al., for example). Publishers in a few fields encourage code release, but many don’t; there is no journal in astrophysics which requires it as a condition of publication, for example. Providing the source codes in the supplemental materials or making them otherwise discoverable is (or should be) an important part of making the research properly examinable and reproducible.

  31. […] Gelman summed up the debate, and provided a good perspective on it, in a recent post on his site. Gelman comes down pretty strongly in favor of replication: scholars should manage their labs and […]

  32. me says:

    As a reviewer, my contribution is to ask authors to replicate data that I feel isn’t thoroughly done. It’s usually pretty easy to run the calculations to identify an underpowered experiment. Catch these things before they get beyond the gates.
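
    One way such a calculation might look, as a rough sketch: this uses a normal approximation to the power of a two-sided, two-sample comparison of means, which is an assumption on my part and not necessarily the calculation the commenter runs. The effect size and group sizes are hypothetical:

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of means.

    effect_size is the standardized difference (Cohen's d). The normal
    approximation slightly overstates power for very small groups,
    where a t-distribution would be more accurate.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical check: a medium effect (d = 0.5) at two group sizes.
for n in (10, 64):
    print(f"n per group = {n:3d}: power = {two_sample_power(0.5, n):.2f}")
```

    With these assumed inputs, 10 per group gives power well below the conventional 80% target, while 64 per group is close to it.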

    As for building on what’s in the literature, if you’ve been in the game long enough, you can pretty easily tell the difference between solid work and the work of grinders who are just pumping it out. The bad stuff is easy to identify. And to avoid.

    In the big picture, the tree of discovery trims itself well enough. If I want to build on a finding that someone else has published, I (actually, a member of my team) first try to replicate their key observation. If I can’t, I move on to another problem.

    If it does replicate, hopefully I can make some hay from it. And then I cite their work, and that branch of discovery grows a little longer, and maybe I get cited, too.

    But I just don’t see the point of trimming the tree systematically. Replication for the sake of replication? It just doesn’t make sense. There’s just so much incremental stuff out there that just is NOT worth seeing in print again, replicable or not.

    And these stories of students who are wasting a year trying to replicate something they read in a journal article…. as a general rule, they are being horribly mismanaged by their advisers.

    • Rahul says:

      Isn’t one solution to try & not have so much incremental stuff published in the first place?

      Another option is to force replication on the most sensationalist / counterintuitive / high impact results.

      • me says:

        Replication of the most high impact results happens anyway.

        I entered my own field, as a postdoc, just after a notorious study was published that nobody could replicate. In fact, when I interviewed, I asked my boss about the study, because he was hiring me to do what this study had claimed to have done. He said, “Well, we think they are wrong. The problem we want you to solve hasn’t been solved.” He pointed me to some literature, and when I looked at that and at the study very carefully, I could tell the latter was probably nonsense.

        So I packed up my truck and moved.

        You bet. I ended up solving it correctly almost within a year. I was actually helped to some degree by the false discovery, because it had distracted a lot of the herd, and I was able to work on my own solution in something of a vacuum.

        Some of those who were trying to replicate the original false finding didn’t end their misbegotten quest until my work was announced, which was the only way to prove that the original was not replicable. If something is important, and it can’t be replicated, you can’t know anything until you know the alternative.

        Many of those people were not going to stop replicating/building on a falsehood until someone like me came along and nailed the problem. In fact, people were still publishing incremental, and bad, extensions of the false finding for a few more years. Such is the behavior of herds.

        The whole affair made for some interesting Gordon Conferences, I must say.

        My point is, however, that I don’t see how a systematic replication regime accomplishes anything more than what we already accomplish in our own messy way. I don’t see how such a regime gets implemented in any sort of pragmatic way.

        I DO think existing standards of submission and peer review of manuscripts and grant applications could be much much improved and that there are ways to force replication prior to publication/funding. I DO think that standards of experimental design and data analysis in laboratory groups have become so widely poor that something along the lines of indoctrination camps might be in order…and still not enough. (which explains why I stalk this particular blog)

        But once a study has been published and released into the wild, it will be challenged or not and survive or not on its own merits. That’s the one part of the process that really isn’t broken.

        The problem is not that people are publishing non-replicable work. The problem is that they are conducting non-replicable research in the first place.

        • K? O'Rourke says:

          It differs by area of research, but in clinical research a published (faulty) study may make other studies harder to do (even apparently unethical) and cause immediate changes in health care choices. Even if other studies are eventually done that later suggest the earlier study was _misleading_, it’s never clear-cut and the field is left unsure which is less wrong.

          This was the holy grail of the quality assessment and _downweighting_ work I did: faulty studies tend to be outliers, and it takes more and larger studies to undo the damage unless they can be downweighted. Unfortunately, credible assessment and downweighting seem beyond reach.

      • For junior researchers pre-tenure, that’s simply not a tenable option.

  33. question says:

    “People in my lab often need months — if not a year — to replicate some of the experiments we have done . . .”

    “Replication is always a concern when dealing with systems as complex as the three-dimensional cell cultures routinely used in my lab. But with time and careful consideration of experimental conditions, they [Bissell's students and postdocs], and others, have always managed to replicate our previous data.”

    Well, looking at her most recent paper*, they only report representative images and “mean ± SD of triplicate experiments”. So how many results get thrown away during this process because they don’t fit the narrative? It seems strange to spend months/years optimizing the conditions and then do the experiment only three times once this has been done.

    Also I am often left wondering what is meant by replication these days. Do they mean “significant with the effect in the same direction”, or “similar results”? The latter is much more convincing evidence of a real phenomenon than the former.

    *Increased sugar uptake promotes oncogenesis via EPAC/RAP1 and O-GlcNAc pathways.

  34. […] This impacts directly on the growing debate on the reproducibility of science, also called the replication problem, which has recently elicited a fair amount of discussion, e.g., here and here. […]

  35. […] Can the replication movement be harmful to research? Do we unfairly damage the reputations of scientist by declaring a finding irreproducible? In November 2013, a comment piece by Mina Bissell on lists provoking arguments against replication. Bissell’s article starts a discussion about challenges in replication and reproducibility. For example, American statistician and political scientist Andrew Gelman says that “the push for replication is so strong that now there’s a backlash against it”. […]

  36. […] First up we have a study showing how everything causes cancer.  Then we have an editorial on replication of findings in […]

  37. Nony says:

    The other issue is that by and large scientists WON’T share data, details of methods, samples or allow lab visits for replicators. Plenty of work has shown more than 50% of such inquiries are blown off. Bissell has to know this unless she is clueless about her colleagues and/or has not read the literature on data requests.
