The necessity—and the difficulty—of admitting failure in research and clinical practice

Bill Jefferys sends along this excellent newspaper article by Siddhartha Mukherjee, “A failure to heal,” about the necessity—and the difficulty—of admitting failure in research and clinical practice. Mukherjee writes:

What happens when a clinical trial fails? This year, the Food and Drug Administration approved some 40 new medicines to treat human illnesses, including 13 for cancer, three for heart and blood diseases and one for Parkinson’s. . . . Yet the vastly more common experience in the life of a clinical scientist is failure: A pivotal trial does not meet its expected outcome. What happens then? . . .

The first thing you feel when a trial fails is a sense of shame. You’ve let your patients down. You know, of course, that experimental drugs have a poor track record — but even so, this drug had seemed so promising (you cannot erase the image of the cancer cells dying under the microscope). You feel as if you’ve shortchanged the Hippocratic oath. . . .

There’s also a more existential shame. In an era when Big Pharma might have macerated the last drips of wonder out of us, it’s worth reiterating the fact: Medicines are notoriously hard to discover. The cosmos yields human drugs rarely and begrudgingly — and when a promising candidate fails to work, it is as if yet another chemical morsel of the universe has been thrown into the Dumpster. The meniscus of disappointment rises inside you . . .

And then a second instinct takes over: Why not try to find the people for whom the drug did work? . . . This kind of search-and-rescue mission is called “post hoc” analysis. It’s exhilarating — and dangerous. . . . The reasoning is fatally circular — a just-so story. You go hunting for groups of patients that happened to respond — and then you turn around and claim that the drug “worked” on, um, those very patients that you found. (It’s quite different if the subgroups are defined before the trial. There’s still the statistical danger of overparsing the groups, but the reasoning is fundamentally less circular.) . . .

Perhaps the most stinging reminder of these pitfalls comes from a timeless paper published by the statistician Richard Peto. In 1988, Peto and colleagues had finished an enormous randomized trial on 17,000 patients that proved the benefit of aspirin after a heart attack. The Lancet agreed to publish the data, but with a catch: The editors wanted to determine which patients had benefited the most. Older or younger subjects? Men or women?

Peto, a statistical rigorist, refused — such analyses would inevitably lead to artifactual conclusions — but the editors persisted, declining to advance the paper otherwise. Peto sent the paper back, but with a prank buried inside. The clinical subgroups were there, as requested — but he had inserted an additional one: “The patients were subdivided into 12 … groups according to their medieval astrological birth signs.” When the tongue-in-cheek zodiac subgroups were analyzed, Geminis and Libras were found to have no benefit from aspirin, but the drug “produced halving of risk if you were born under Capricorn.” Peto now insisted that the “astrological subgroups” also be included in the paper — in part to serve as a moral lesson for posterity.

I actually disagree with Peto—not necessarily for that particular study, but considering the subgroup problem more generally. I mean, sure, I agree that raw comparisons can be noisy, but with a multilevel model it should be possible to study lots of comparisons and just partially pool these toward zero.
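To make that concrete, here is a minimal sketch (in Python, with made-up numbers rather than data from any trial discussed here) of what partial pooling does to a batch of noisy subgroup comparisons whose true effects are all zero, using a simple normal-normal hierarchical model with the between-group variance estimated by the method of moments:

```python
# Toy illustration of partially pooling raw subgroup comparisons toward zero.
# The model is an assumption for this sketch, not the analysis from any study above:
#   y_j ~ Normal(theta_j, se_j^2),   theta_j ~ Normal(0, tau^2)
import numpy as np

rng = np.random.default_rng(0)

J = 12                          # e.g., twelve zodiac-style subgroups
theta_true = np.zeros(J)        # true subgroup effects: all zero here
se = np.full(J, 0.5)            # standard errors of the raw subgroup estimates
y = rng.normal(theta_true, se)  # observed (noisy) raw comparisons

# Method-of-moments estimate of the between-group variance tau^2:
# Var(y) is approximately tau^2 + mean(se^2); truncate at zero.
tau2 = max(y.var(ddof=1) - np.mean(se**2), 0.0)

# Posterior mean under the normal-normal model: each raw estimate is
# shrunk toward zero by the factor tau^2 / (tau^2 + se_j^2).
theta_pooled = (tau2 / (tau2 + se**2)) * y

print("raw subgroup estimates:", np.round(y, 2))
print("partially pooled:      ", np.round(theta_pooled, 2))
```

With purely null effects, the estimated between-group variance is typically near zero, so the apparent Capricorn-sized effects in the raw comparisons get pulled most or all of the way back toward zero, while a real, consistent subgroup effect would largely survive the pooling.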

That said, I agree with the author’s larger point that it would be good if researchers could just admit that sometimes an experiment is just a failure, that their hypothesis didn’t work and it’s time to move on.

I recently encountered an example in political science where the researcher had a preregistered hypothesis, did the experiment, and got a result that was in the wrong direction and not statistically significant: a classic case of a null finding. But the researcher didn’t give up, instead reporting that the result was statistically significant at the 10% level, explaining that even though the result was in the wrong direction it was consistent with theory as well, and also reporting some interactions. That’s a case where the appropriate multilevel model would’ve partially pooled everything toward zero, or, alternatively, Peto’s just-give-up strategy would’ve been fine too. Or, not giving up but being clear that your claims are not strongly supported by the data, that’s ok. But it was not ok to claim strong evidence in this case; that’s a case of people using statistical methods to fool themselves.
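As a rough illustration of how much room that kind of flexibility leaves, here is a toy simulation (hypothetical numbers and a hypothetical design, not the study in question) in which the outcome is pure noise, yet counting either direction of the main effect plus a couple of interaction cuts at the 10% level gives plenty of chances to find something to report:

```python
# Toy simulation: with a 10% threshold, either direction counted as
# "consistent with theory," and a few interaction cuts, pure-noise data
# will often produce at least one reportable result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims = 200, 2000
hits = 0

for _ in range(n_sims):
    treat = rng.integers(0, 2, n)   # randomized binary treatment
    x1 = rng.integers(0, 2, n)      # two pre-treatment moderators
    x2 = rng.integers(0, 2, n)
    y = rng.normal(size=n)          # outcome unrelated to everything

    # main effect, two-sided (so either direction can count)
    pvals = [stats.ttest_ind(y[treat == 1], y[treat == 0]).pvalue]
    # "interactions": treatment effect within each moderator level
    for x in (x1, x2):
        for level in (0, 1):
            sel = x == level
            pvals.append(stats.ttest_ind(y[sel & (treat == 1)],
                                         y[sel & (treat == 0)]).pvalue)
    if min(pvals) < 0.10:           # anything crossing the 10% line is a "finding"
        hits += 1

print(f"share of pure-noise datasets with at least one p < 0.10: {hits / n_sims:.2f}")
```

Even with only five loosely related tests per dataset, something crosses the 10% line in a large fraction of the pure-noise replications, far more often than the nominal 10 percent.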

To return to Mukherjee’s article:

Why do we do it then? Why do we persist in parsing a dead study — “data dredging,” as it’s pejoratively known? One answer — unpleasant but real — is that pharmaceutical companies want to put a positive spin on their drugs, even when the trials fail to show benefit. . . .

The less cynical answer is that we genuinely want to understand why a medicine doesn’t work. Perhaps, we reason, the analysis will yield an insight on how to mount a second study — this time focusing the treatment on, say, just men over 60 who carry a genetic marker. We try to make sense of the biology: Maybe the drug was uniquely metabolized in those men, or maybe some physiological feature of elderly patients made them particularly susceptible.

Occasionally, this dredging will indeed lead to a successful follow-up trial (in the case of O, there’s now a new study focused on the sickest patients). But sometimes, as Peto reminds us, we’ll end up chasing mirages . . .

I think Mukherjee’s right: it’s not all about cynicism. Researchers really do believe. The trouble is that raw estimates selected on statistical significance give biased estimates (see section 2.1 of this paper). To put it another way: if you have the longer-term goal of finding interesting avenues to pursue for future research, that’s great—and the way to do this is not to hunt for “statistically significant” differences in your data, but rather to model the entire pattern of your results. Running your data through a statistical significance filter is just a way to add noise.
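A quick simulation makes the bias concrete; the numbers below are made up for illustration (a small true effect measured with a typical amount of noise), not taken from that paper:

```python
# Toy illustration of the statistical-significance filter: estimates that
# clear the threshold systematically overstate the (here, small) true effect.
import numpy as np

rng = np.random.default_rng(2)

true_effect = 0.1      # small true effect (illustrative)
se = 0.25              # standard error of each replication's estimate
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates) > 1.96 * se   # two-sided 5% significance filter

print("true effect:                    ", true_effect)
print("mean of all estimates:          ", round(estimates.mean(), 3))
print("mean of 'significant' estimates:", round(estimates[significant].mean(), 3))
print("share reaching significance:    ", round(significant.mean(), 3))
```

Conditional on clearing the threshold, the average estimate comes out several times larger than the true effect, which is exactly the kind of exaggeration that makes a "significant" subgroup look like a more promising lead than it is.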

11 thoughts on “The necessity—and the difficulty—of admitting failure in research and clinical practice”

  1. “That said, I agree with the author’s larger point that it would be good if researchers could just admit that sometimes an experiment is just a failure, that their hypothesis didn’t work and it’s time to move on.”

    I had a hard time admitting my little study was negative, and I was just a student with nothing to lose! Now imagine researchers who have spent years on a discovery, with jobs hanging on their performance, and companies that spent millions of dollars on phase 1 and 2 trials showing positive results, only to have the phase 3 trial come out negative. So it is not at all an easy task for researchers, and their reluctance is understandable.

  2. “You go hunting for groups of patients that happened to respond — and then you turn around and claim that the drug “worked” on, um, those very patients that you found.”

    I think “respond” is a poor word choice here. The drug really did work on a patient if they “responded”, at least according to my understanding of the definition. I think he should have said, “you go hunting for groups of treated patients that happened to get better”. Or, “you go hunting for categories of patients who did better in the treated group than in the control.”

    • I agree with you that “respond” is a poor choice of word. However, it is a word that is routinely used to mean “happened to get better” — i.e., with no information in individual cases about whether the drug caused the patient to get better.

  3. Isn’t subgroup analysis using RCT data effectively an observational rather than experimental analysis if the subgroups were not prespecified and sampling was not stratified by subgroup? If those groups happen to have higher adherence, or are older or sicker, won’t the estimates be both noisy and biased? I see how multilevel modelling could be used to shrink noisy estimates, but don’t we still need to be concerned about bias?

      • Hi Andrew,

        How can a subgroup that is analyzed post hoc be randomized? For example, suppose I look at a subgroup of people taking a supplement in an RCT. Obviously, the people who take supplements are the ones who are worried about their health (healthy-user bias), are highly educated, and so forth. These biases would be minimized, if not eliminated, if the subgroup had been randomized. Am I wrong here?

  4. In my career I heard lots of “positive” studies presented. No group of patients is so uniform that all meet their demise on the same day, although I was often asked for a prognosis in ways that implied the questioner thought I had some incredibly precise information. The human emotions that Dr. Mukherjee describes certainly led to investigators proposing that above-median survival among, say, the left-handed, blue-eyed people in the study was due to a real treatment effect. The investigators were neither insincere nor trying to be deceptive, and their genuine belief and general trustworthiness often made them convincing. Sir Peto had a dry, acerbic British wit that allowed him to offer appropriate criticism without arousing the defensiveness that lies in all of us when our work is being examined. I always enjoyed hearing his analysis.

    • > Sir Peto had a dry, acerbic British wit that allowed him to offer appropriate criticism without arousing the defensiveness that lies in all of us
      And ignore any criticism of his work ;-)

      By the way, he once told us that on a plane he sat beside a famous astrologer who tried to convince him that those subgroup results made perfect sense.

  5. The Series of Unsurprising Results in Economics (SURE) publishes results like this.
    http://davegiles.blogspot.com/2018/06/the-series-of-unsurprising-results-in.html

    “The Series of Unsurprising Results in Economics (SURE) is an e-journal of high-quality research with “unsurprising” findings. We publish scientifically important and carefully-executed studies with statistically insignificant or otherwise unsurprising results. Studies from all fields of Economics will be considered. SURE is an open-access journal and there are no submission charges.”

    Not sure if something similar would be useful in other disciplines. For example, if such results were published (with data), then someone could take a “scientifically important and carefully-executed” study with statistically insignificant results and extend it with the multilevel model Andrew suggests.

  6. I have a hobbyist’s interest in emerging and experimental anti-depressant medicines. The established results on the conventional and widely used anti-depressants are so peculiar and paradoxical — drugs with radically different structures that work on entirely different neurotransmitters are nonetheless all indistinguishable in effectiveness, with outcomes all delayed by a comparable and oddly long period for no clearly understood reason, and with outcomes uncorrelated with any feature of symptomatology except perhaps overall severity — and the practical consequences are so great and the resources devoted to the problem so huge, that the field almost seems to be a case study in the failure of the scientific method, or at least a particularly fascinating enigma.

    Anyway, I have looked for the outcomes of something over 50 phase II trials of new anti-depressant drugs over the last six or seven years. This has been by no means a random or representative sample — I have been following particular approaches that seem to me to be novel or to be based on some interesting theory. But the main thing I have observed is that well over half of these trials never report any results at all.

    This seems to me to be seriously bad science, and I think it bad public medical policy to allow it. I think that not only should preregistration of trials and experimental protocols be required, as I believe they are, but publication of negative results should be required as well, as I believe they are not. Or at least, such results are not made generally available.

    I think one reason that cherry-picking of negative results seems as prevalent as it does is that the cherry-picked results are the only negative outcomes we see, outside of federally funded trials.
