Reproducing biological research is harder than you’d think

Mark Tuttle points us to this news article by Monya Baker and Elie Dolgin, which goes as follows:

Cancer reproducibility project releases first results

An open-science effort to replicate dozens of cancer-biology studies is off to a confusing start.

Purists will tell you that science is about what scientists don’t know, which is true but not much of a basis on which to develop new cancer drugs. Hence the importance of knowledge: how crucial this mutation or that cell-surface receptor really is to cancer growth. These are the findings that launch companies and clinical trials — provided, of course, that they have been published in research papers in peer-reviewed journals.

As we report in a News story this week, a systematic effort to check some of these findings by repeating an initial five published cancer studies has reported that none could be completely reproduced. The significance of this divergence — how the specific experiments were selected and what the results mean for the broader agenda of reproducibility in research — is already hotly contested.

Perhaps the most influential aspect of the exercise, called the Reproducibility Project: Cancer Biology, has nothing to do with those arguments. It lies beneath the surface, in the peer reviews of the project teams’ replication plans, which were published before the studies began. . . .

Again and again, the peer reviewers and the replicators clash. The reviewers are eager to produce the best experiment to test a publication’s conclusions; they want to correct deficiencies in the design of the original high-impact studies. The replicators do, on several occasions, agree to add an extra measurement, particularly of positive and negative controls that had originally been neglected. Often, however, they resist calls for more definitive studies. . . .

This is a frustrating, if understandable, response. It is easier to compare the results of highly similar experiments than to assess a conclusion. Thus, the replication efforts are not especially interested in, say, the big question of whether public gene-expression data can point to surprising uses for existing drugs. They focus instead on narrower points — such as whether a specific finding that an ulcer drug stalls the growth of lung cancer in mice holds up (it did, more or less; see I. Kandela et al. eLife 6, e17044; 2017). Even so, the results are not definitive. . . .

One aspect that merits increased focus is how markedly the results of control experiments varied between the original work and replications. In one case, mice in the control group of an original study survived for nine weeks after being engrafted with tumours, whereas those in the replication survived for just one. . . .

More than 50 years ago, the philosopher Thomas Kuhn defined ‘normal science’ as the kind of work that faithfully supports or chisels away at current hypotheses. It is easy to dismiss this as workmanlike and uninteresting. But only by doing such normal science — and doing it well — can we recognize when revolutions are necessary.

That’s good! And the point about normal science and revolutions is important; Shalizi and I discuss it in our paper.


  1. Martha (Smith) says:

    There’s a mix-up with link(s) to the Nature article(s) quoted. The first two sentences of the quote are from the link given, but the rest is from

  2. Anoneuoid says:

    Again and again, the peer reviewers and the replicators clash. The reviewers are eager to produce the best experiment to test a publication’s conclusions; they want to correct deficiencies in the design of the original high-impact studies.

    Are the experimental conditions and analysis steps understood and described well enough to get similar results on demand? There is absolutely no reason reviewers should be asking for additional controls for a replication experiment. How it is that people who don’t understand the purpose of a replication end up reviewing papers should be investigated.

  3. Replication in biology is hard, and it’s not just a matter of cost. I think an analogy to cooking might be helpful, though they call it “lab hands” not “cooking skill”.

    Consider a restaurant turning out a perfect soufflé for every order. Could the chefs write out a clear enough recipe that someone on the other side of the world can replicate it only by reading the instructions? Probably not. Is the soufflé replicable? The restaurant does it every day, but the eight other chefs who tried it a couple times all failed, even when they ordered the exact same eggs, stoves, pans, and whisks as those used at the restaurant. They might’ve even had enough previous experience that they turned out something that was clearly a soufflé rather than scrambled eggs. But it wasn’t a replication of the restaurant’s perfect soufflé.

    Whether that soufflé is replicable is merely a matter of semantics—how narrowly do you want to define “replicable”?

    • a reader says:

      This is an important note, and one where science in general could learn from programming best practices.

      When writing code, you’re not supposed to be “clever”…because being clever greatly limits others from using or editing your code later. The best code is simple and reliable. The same holds true for science. You might have the best lab hands, and so only you can reproduce your experiment…but you probably haven’t contributed as much to science as someone else who came up with a very simple procedure that anyone can use.

      On the other hand, if you have a procedure that’s really difficult to replicate but your lab reliably can, this can give you a competitive edge over other labs, at the cost to general scientific advancement.

      • John Jumper says:

        Sure, everyone would prefer simple and easy, but research biology is by its nature at the edge of what is possible. Leaving out details from papers is not an uncommon practice (and ethically dubious to be sure), but I don’t think anyone is overcomplicating experiments to achieve a competitive edge.

        That being said, I am looking forward to the rise of fully-automated experiments. Only when we have a programming language for scientific experiments will reproducibility be an easy thing. I expect the scientific benefit to be enormous.

      • Code is much easier to reproduce than a recipe.

        As to leaving things out of a recipe, that’s what people suspected my grandmother was doing in her buttercream frosting recipe. Nope, she just didn’t mention bringing the eggs and butter up to room temp and was a little fuzzy on how milk and eggs turned into custard. She was happy to show Mitzi how she did it. It’s an old tale that a parent would jealously leave out an ingredient or step when passing down a recipe so that their child’s partner can’t reproduce their child’s favorite recipe. I think it’s usually just another case where Hanlon’s razor is useful.

    • I disagree that this is a good analogy. I think typically the ‘chef’ involved can’t reliably produce the ‘soufflé’ but if they do it enough times they can get one. Much like many model-based papers aren’t honest about how fragile the model is, many labs aren’t honest about how fragile a given set of experiments is. That doesn’t mean the experiment is wrong or non-reproducible in the broad sense (that’s a separate issue), but it won’t readily replicate on any given try. Sure “lab hands” can matter but it’s more typical that (esp. with complex systems like time-sensitive cell cultures and animals) orchestrating the right set of conditions is not straightforward and there’s a lot of blood, sweat, and (grad student) tears involved.

      • Jason Yamada-Hanff says:

        It’s not either-or, though. Both the “lab hands” and “experimental fragility” are going on all the time. The problem is figuring out which one is going on, and in any given experiment, there is probably a combination of both.

        • Sure but once you say it like that there’s not much to talk about.

        • Jason Yamada-Hanff says:

          Yes, sure. I was just noting that I don’t think one is “more typical” than the other. There really is a lot of skilled souffle-making, and a lot of things that the experienced chef is doing that they don’t even realize or don’t know how to communicate. Often that shades into ceremony/superstition like on which side of which shelf of which freezer the bowls chill. If the experienced chef gets her souffles to rise consistently under these conditions, is that skilled know-how or is it that souffle-making doesn’t “really work” because it shouldn’t matter in principle how bowls get to the right temperature?

          The point is that it really is just hard to disambiguate, and I don’t see that the replication projects have an approach that actually helps us figure out what’s going on. We already know that even very good experiments are fragile in all sorts of ways. I prefer, for actually learning stuff, to test that fragility against *better/different/more stringent* conditions rather than trying to do direct replications.

          • Anoneuoid says:

            The point of a replication is not to see whether “something works” (or is “true”, etc), it is to see if enough is understood about the phenomenon so that any theories people come up with must be consistent with the results.

            • Jason Yamada-Hanff says:

              Can you clarify what you mean? I can’t quite see how you relate all those pieces together, and how a replication is supposed to achieve it.

              What’s the relation between a replication result and understanding enough about a phenomenon? How does that determine whether a theory must be consistent with results?

              To be very clear, I’m honestly asking.

      • Ask Breck to demonstrate his chocolate soufflé. It’s much more reliable than you might think once you get the hang of it. Like I can pretty much make a perfect custard every time after a couple hundred replications (I used to make a lot of ice cream!).

        But yes, some of them might fail and get tossed.

  4. Martha (Smith) says:

    The full quote starting “Often, however, they resist calls for more definitive studies. . . .” is noteworthy; it continues as,

    “Testing “the experiment’s underlying hypothesis,” they insist, “is not an aim of the project.””

  5. Tom Passin says:

    What you say is very true. All the same, there is an important difference between the restaurant and the science. For the restaurant, you only need to know how to do it time after time. Some of the things that the successful chef does may be only superstition, for all we know. If you want to understand a soufflé scientifically, you really need to know what those apparently intangible factors are and how important they are to proper understanding and replication.

    • Thus food science programs. They probably could tell you how to make a perfect X repeatedly (although it’ll never be as good and probably full of trans fats since they appear to be key to reproducibility).

      • jrc says:

        From what I understand, food science is so advanced it can make a fish stick into a soufflé! Errr…con people into thinking a fish stick is a soufflé. Errr…… mash up some numbers that say a fish stick is better than a soufflé, when called a fancy soufflé.

        “One of the things we did was we simply over the course of six weeks we took a bunch of recipes and we gave them descriptive names, descriptive labels, or we just called them their normal name. For instance, for a two week time period we would have a seafood soufflé out, then for two weeks it would go off the menu, and then two weeks after that it would come back, and we would rename it something descriptive like, Succulent Italian Seafood Soufflé. That’s the exact same recipe, it’s basically a dried fish stick.

        But what ended up happening was that not only were sales much higher here, but people’s evaluation of the seafood soufflé was a lot higher, they rated the restaurant as being more trendy and up to date, and they rated the cook as having had more culinary experience overseas. In reality, the guy had been fired from like Arby’s two weeks before, but it didn’t matter!”

        You know, that Brian Wansink really is hilarious. There hasn’t been an Arby’s line that good since Kreayshawn. But Wansink got the funnys and the staying power.

        (hat tip to google for “brian wansink soufflé”)

        • Andrew says:


          You missed the best quote from the transcript, right at the beginning, from Prof. Kelly Brownell who does the introduction:

          I’m very happy to introduce our speaker today, Brian Wansink, Dr. Brian Wansink. You’re just about to hear one of the most engaging speakers I think you ever will, but he couples the engaging style with an incredible amount of scientific rigor in the work he does.

          “Incredible” . . . that’s one way to put it!

          • Will says:

            From the Google dictionary:

            adjective: incredible

            impossible to believe.

            In reference to the “secret agent tools” – lasers and hidden scales to measure height and weight – that he sometimes claims to have used, this meaning of incredible seems very appropriate.

        • Dan Jurafsky wrote a whole book on the linguistics of menus (with lots of nice data analytics—he has a dual appointment as a computer scientist).

  6. Statsgirl says:

    I’m not usually a fan of your posts where you quote for multiple paragraphs and add only one sentence, but this one is pretty interesting, so I’ll let it slide.

  7. Hans says:

    Skimmed through the discussion. One aspect may be missing: biology is in constant evolution, and for some biological systems reproducibility is inherently impossible. E.g., consider the emergence of resistant pathogens.

  8. Thanatos Savehn says:

    Maybe scientists should follow the lead of George Davey Smith and aim their sights a little lower. Smith and Shah Ebrahim wrote a great editorial back in 2000 titled “Epidemiology – is it time to call it a day?” They lamented how the public health approach to finding causes via NHST had led to coffee being declared a carcinogen on the morning news, a chemopreventive by lunch, a cancer cure by the evening news and the whole thing mockingly recapitulated on the late night talk shows. And all while getting the causes of some very big diseases very much wrong (e.g. stress versus H. pylori in peptic ulcer). A more promising avenue, they hoped, was to use the tools of epi to uncover genes or patterns of genes that cause cancer. Well, though things are getting somewhat better, all that the GWASs that followed managed to prove, save in a very few cases, was that genes are not destiny and most discovered associations are poor predictors of anything. So now GD Smith has focused on something rather modest. We can now, it seems, more accurately measure smoking history:

    Estimating prior smoking exposure is, as you might imagine, fraught. Most people lie to themselves and everyone else about how much they smoked (especially once they’re diagnosed with non-small cell lung cancer, which is the point at which the data for most of the risk estimates is collected). One result is that confidence in a patient’s risk given smoking history (and thus the recommendations for him/her re: CT and other screening) is very low. Assessing DNA methylation appears to be a marked improvement over self-reporting when it comes to exposure assessment. It ain’t a cure and it ain’t about causation, it just makes for better predictions. And maybe that’s good enough.

    • Anoneuoid says:

      Because there is no reason there needs to be a small set of genes that “cause cancer” when mutated. Like most current interpretational issues with biomed, this is probably something that *does happen*, but it doesn’t account for most cases (consider aneuploidy, which also seems to be a much more common error than indels btw). If you note that karyotyping (to look for non-diploid cells) is basically a gold standard for cancer diagnosis it is amazing this “few genes” idea took off the way it did.

    • Keith O'Rourke says:

      > “It ain’t a cure and it ain’t about causation”
      The expectation that individual researchers should be able to do that is a large part of the problem.

      Rather they should be expected or even mandated to design studies as well as possible while being well informed of past efforts in the area, open to criticism from others prior to committing to the design, implement it well, and document and report what happened with complete reporting of methods and data, with conclusions (inferences) reserved for meta-analyses and pooled analyses later (as Sander Greenland and I argued here and since). As Peirce put it, true scientists only hope to clear some brush so that those who come after can make some real progress.

      Since this goes against current research incentives to maximize funding and publicity by promising and claiming delivery of ‘important’ findings, it’s not surprising that Smith and Shah Ebrahim’s work (such as improving measurement of past smoking) needs to be argued for as “good enough”. (Personally I think the “we need to be _brush clearers_ first and foremost” view is more Shah than George, at least when I interacted with them 15 years ago.)

      • Kyle C says:

        “Sander Greenland and I”

        I’m confused, is Keith O’Rourke a pseudonym?

        • Will says:

          Keith O’Rourke coauthored the chapter on Meta-Analysis in Modern Epidemiology.

        • Carlos Ungil says:

          You can see the table of contents following the link and using the “look inside” feature, the entry for chapter 33 will clear the confusion.

          • Kyle C says:

            Thank you both! [Well of course, chapter 33 on p. 652, how could I miss it. Parenthetically also, listing the editors on the cover without noting, “Editors,” seems odd and is not something I’m familiar with from other fields, e.g., philosophy, law.]

            • Keith O'Rourke says:

              Sorry, I should have put chapter 33 of …

              Being the last chapter, it is the least read chapter in the book and my guess is that most if not all of the instructors that use the book as a textbook never cover that topic.

              Making sense of multiple studies has been and continues to be an under-covered topic in statistics – in a sense it should be the first topic to discuss, as everything should anticipate and feed into it – but as it’s the last thing that is done (if at all), it’s put at the end.

              • Martha (Smith) says:


              • +1 to discussing multiple studies in one context. It has two pleasant side effects. It leads to consideration of the data collection process (and measurement error), because now there’s one underlying scientific process and multiple data streams. It also helps thinking about priors and their relation to meta-analysis.
