On house arrest for p-hacking

People keep pointing me to this excellent news article by David Brown, about a scientist who was convicted of data manipulation:

In all, 330 patients were randomly assigned to get either interferon gamma-1b or placebo injections. Disease progression or death occurred in 46 percent of those on the drug and 52 percent of those on placebo. That was not a significant difference, statistically speaking. When only survival was considered, however, the drug looked better: 10 percent of people getting the drug died, compared with 17 percent of those on placebo. However, that difference wasn’t “statistically significant,” either.

Specifically, the so-called P value — a mathematical measure of the strength of the evidence that there’s a true difference between a treatment and placebo — was 0.08. . . . Technically, the study was a bust, although the results leaned toward a benefit from interferon gamma-1b. Was there a group of patients in which the results tipped? Harkonen asked the statisticians to look.

It turns out that people with mild to moderate cases of the disease (as measured by lung function) had a dramatic difference in survival. Only 5 percent of those taking the drug died, compared with 16 percent of those on placebo. The P value was 0.004 — highly significant. . . . But there was a problem. This mild-to-moderate subgroup wasn’t one the researchers said they would analyze when they set up the study. . . .
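As a quick sanity check on the numbers in that passage (my own sketch, not part of the article: I assume an even 165/165 split between arms and use a crude two-proportion z-test, whereas the trial used survival methods):

```python
import math

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # standard normal tail probability via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 330 patients, assumed split 165/165; ~10% of the drug arm died vs ~17% on placebo
p = two_prop_p(17, 165, 28, 165)
print(f"approximate p-value: {p:.3f}")  # lands near the reported 0.08
```

So the "barely missed significance" characterization checks out even under this rough approximation.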

Brown reports the other side:

Harkonen’s defenders also think that context is important.

The press release said the results for the primary endpoint weren’t statistically significant. . . . When the study was published in the New England Journal of Medicine in January 2004, the authors wrote that “a clinically significant survival benefit could not be ruled out.”

In short, Harkonen’s defense team argued that nobody was deceived by the press release. . . . Steven Goodman, the pediatrician and biostatistician, believes that “context” also includes what IPF patients are thinking.

“Part of the issue goes to the level of proof a patient would need when facing a fatal disease with no treatment,” he said. The fact that the study’s main result barely missed statistical significance “is far from proof that the treatment didn’t work. And if I were a patient, I would want to know that.”

And here’s where the story ends:

Things haven’t turned out too well for interferon gamma-1b, either.

InterMune did run another trial. It was big — 826 patients at 81 hospitals — in order to maximize the chance of getting clear-cut results. It enrolled only people with mild to moderate lung damage, the subgroup whose success was touted in the press release.

And it failed. A little more than a year into the study, more people on the drug had died (15 percent) than people on placebo (13 percent). . . . It’s possible that there’s a subgroup of patients, not yet fully identified, who benefit. For example, data suggest that people whose cells still make a fair amount of interferon gamma are helped by the high-priced drug. But it’s unlikely anyone will be trying to figure that out soon.

Certainly not Scott Harkonen.

Things could be worse, though. They used to have the death penalty for forgery.

P.S. More here from Patti Zettler:

A wire fraud conviction requires the jury to find that there was a “knowing participation in a scheme to defraud,” and “a specific intent to deceive or defraud.” United States v. Harkonen, 510 F. App’x 633, 636 (9th Cir. Mar. 4, 2013). This means that Dr. Harkonen was not convicted just because the government did not agree with his interpretation of the clinical trial results. Instead, he was convicted because a jury found that he intentionally and knowingly sought to defraud someone (presumably, investors) through the statements made in the press release.

And, in this case, there seems to be ample evidence to support the jury’s finding—evidence that distinguishes the communication at issue in Dr. Harkonen’s case from the scientific debate that commenters are worried about protecting. . . .

Because of his position as CEO, Dr. Harkonen had an obvious financial motive for describing the clinical trial results, which would affect InterMune’s bottom line, in a misleadingly positive light. In addition, there was testimony at his trial that, among other things, Dr. Harkonen: said he would “cut that data and slice it until [he] got the kind of results [he was] looking for”; departed from InterMune’s usual procedures by preventing clinical and statistical staff at InterMune from seeing the press release before it issued; and was told by FDA medical officers that they did not believe the data supported the use of Actimmune in IPF patients (which suggests FDA was not likely to approve Actimmune for IPF based on these data, casting doubt on the accuracy of the press release statements about sales). See Harkonen, 510 F. App’x at 636; Harkonen, 2010 WL 2985257 at *12-13. In sum, there seems to be a fair amount of evidence that Dr. Harkonen intentionally crafted the press release to be misleading for financial reasons.

20 thoughts on “On house arrest for p-hacking”

  1. There’s got to be more to this story. Nothing in the article suggests anything “fraudulent” in what the guy did. Statistically questionable, sure, but fraud?

    It’s his bad luck that the p-value for improved survival came out to 0.08 instead of 0.05. So close to that magical, mystical number that would have made all the difference in the world.

    • Within the clinical-science community, it is established practice that searching for subgroups after the fact is considered wrong.
      It should be obvious why this is so: there are a lot of different subgroups that could have occurred.
      I won’t comment on the fitness of the punishment, but will reiterate that going back and looking for subgroups (data dredging) is widely considered wrong.
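The multiplicity point can be illustrated with a quick simulation (my own sketch, not from the thread): even when drug and placebo are identical, scanning a handful of arbitrary post-hoc subgroups will turn up a p < 0.05 “finding” far more often than 5 percent of the time.

```python
import math
import random

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def null_trial_finds_subgroup(n=330, death_rate=0.15, n_subgroups=10, rng=random):
    """One simulated trial where the drug does nothing: randomize n patients,
    then scan n_subgroups arbitrary half-sample subgroups for p < 0.05."""
    arm = [rng.random() < 0.5 for _ in range(n)]      # True = drug arm
    died = [rng.random() < death_rate for _ in range(n)]
    for _ in range(n_subgroups):
        sub = [rng.random() < 0.5 for _ in range(n)]  # an arbitrary subgroup
        n1 = sum(a and s for a, s in zip(arm, sub))
        x1 = sum(a and s and d for a, s, d in zip(arm, sub, died))
        n2 = sum((not a) and s for a, s in zip(arm, sub))
        x2 = sum((not a) and s and d for a, s, d in zip(arm, sub, died))
        if min(n1, n2) > 0 and two_prop_p(x1, n1, x2, n2) < 0.05:
            return True
    return False

random.seed(1)
n_sims = 1000
rate = sum(null_trial_finds_subgroup() for _ in range(n_sims)) / n_sims
print(f"share of no-effect trials with a 'significant' subgroup: {rate:.2f}")
```

With ten looks at the data, the chance of at least one nominally significant subgroup is several times the nominal 5 percent, which is why pre-specifying subgroups matters.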

  2. I read that article a day or so ago and thought the same. Either that or the legal authorities are innumerate—which is quite possible.

  3. I have mixed concerns.

    Steve is right that this is fairly common scientific practice in clinical research, but maybe it’s not common to be publicising results as part of an attempt at financial gain.

    If, as I heard in MBA school, there are large investment firms that maintain 20 divisions so they can promote the most successful return each year as a _for instance_ of their ability to _regularly_ earn above-market returns – that is fraud.

    If I heard that a treatment comparison with p=.08 and a promising subgroup led to a second study just on that subgroup to confirm the effect (and especially if the trial was large) – I would short-sell the stock (at least absent any background knowledge).

    But if I had the disease I would want the drug (unless side effects were really awful).

    • I have mixed feelings about this too.

      How is a single posthoc analysis any worse than the thousands of retrospective statistical analyses on convenience samples and observational data that are done all the time in public health? So long as they’re up front that it is posthoc and about the interpretation… “a clinically significant survival benefit could not be ruled out” is not an unreasonable thing to say – it’s not a claim that there is a survival benefit, although I can see how it could be misinterpreted by the public.

      They did a follow-up study which disproved their hypothesis. If I had an untreatable disease, I certainly would have at least pushed for a follow-up study too. Again this is more than can be said of a lot of conclusions derived from retrospective convenience sample research. (as a side note, I think the use of “the study was a failure” when a null hypothesis doesn’t get rejected isn’t helping things).

      Yes, the interpretation of a posthoc analysis isn’t clean, so the right thing to do is report it as such and do a follow-up. As far as I can tell from the above summary this sounds like what they did, at least on the scientific side of things. There are p-values reported in the public health literature left and right with far worse interpretation problems. Now the marketing/public relations part of this with the press release (linked by steve) on the other hand – that needs to be punished.

      p.s. please people – enough with the Dune jokes!

  4. One of the commenters links to this Note for Publication from the 9th circuit appeals court. Highlights:

    Harkonen prevented Intermune’s clinical personnel from viewing the Press Release prior to its publication, even when they asked to see it, at one point becoming “visibly” upset and “castigat[ing]” the head of the communications firm that helped prepare the Press Release for permitting Intermune’s Vice President of Regulatory Affairs to view a draft of the Press Release. Harkonen also did not want the FDA to know about all his post-hoc analyses—the analyses on which the Press Release was based—because he “didn’t want to make it look like we were doing repeated analyses looking for a better result.” … Harkonen stated he would “cut that data and slice it until [he] got the kind of results [he was] looking for,” and requested the final post-hoc analysis “simply . . . to see what that did to the p-value.”

    Mens rea!

    • What I want to know is, did his employees call him “Baron” Harkonen? As in, “Hey, the Baron wants us to run some more tests,” “the Baron wants to exclude the data from day 5 of the experiment,” etc. Cos if they did, that would be funny. We had a professor once who was listed in the course catalog as “Cook, C.,” and we’d refer to him as “the captain.”

  5. The case has at least two parts, the current issue concerns “free speech”. http://errorstatistics.com/2012/12/13/bad-statistics-crime-or-free-speech/
    Some exchanges with Schachtman who is involved first hand in writing some briefs in his defense. http://errorstatistics.com/2012/12/19/philstatlawstock-more-on-bad-statistics-schachtman/
    But the whole thing also connects to the controversial case where the Supreme Court (is thought to have) passed judgment on significance tests in relation to the Matrixx case.

      • Liked this snippet but would replace “juror” with “best expert witness”

        “even the smartest and most attentive juror will be challenged by the parties’ assertions of observation bias, selection bias, information bias, sampling error, confounding, low statistical power, insufficient odds ratio, excessive confidence intervals, miscalculation, design flaws, and other alleged shortcomings of all of the epidemiological studies”

        Sounds like it would have been a fun trial for Dr Wells.

  6. What about Sir Arthur Eddington’s “confirmation” of general relativity (in 1919, by measuring how much the sun would bend light from a star – which was possible only during a total eclipse; telescopes set up all over the world, etc. – made Einstein famous)? p > 0.05…

    My guess is that the likelihood ratio for “general relativity is right” vs. “Newtonian mechanics is right” was still super duper high — although I’m not sure!

    http://books.google.com/books?id=n66STNFNO7IC&pg=PA170&dq=%22statistically+significant%22+eddington+einstein&hl=en&sa=X&ei=tdJOUtKXD9DD4APhioHYDw&ved=0CC0Q6AEwAA#v=onepage&q=%22statistically%20significant%22%20eddington%20einstein&f=false

  7. Interferon, despite the cool-sounding name that got it mentioned in a Superman comic book in the 1960s, has mostly turned out over many decades to be bad news. When I was diagnosed with terminal-sounding lymphatic cancer in 1997, the first non-Hodgkin’s lymphoma specialist I talked to wanted to enroll me right away in her interferon clinical trial. I procrastinated and wound up in somebody else’s Rituxan trial. And here I am.

  8. As some of the commentators above have recognized, this case is remarkable for its aggressive prosecution of a scientist stating a causal conclusion. You can find people (and scientists) any day of the week making stronger claims on lesser evidence.

    A little bit of background. The “whistle blower” was Thomas Fleming, a well-known statistician who served on the data safety monitoring board. Although the board’s function was completed when it handed over the data to the company, Fleming was incensed by the interpretation given to the data by Dr. Harkonen in a press release. Fleming was the government’s key witness, and he advanced his ultra-orthodox statistical views, tested only by cross-examination. Harkonen did not testify, and his counsel did not call an expert witness.

    After trial, statisticians Donald Rubin and Steve Goodman both filed affidavits in support of Harkonen, but the trial judge regarded the “falsity” as established by the jury’s decision. (Judge Patel could have revisited the issue as to whether falsity was established beyond a reasonable doubt on the evidence in the case, but she was unwilling to do so.) The trial judge was, however, hard pressed to find anyone who was harmed by the press release, and she acknowledged that some may have been helped. She sentenced Harkonen to 6 months of home incarceration; the government wanted TEN YEARS in prison. Both sides appealed, and the 9th Circuit affirmed in a cursory opinion. Harkonen has asked the Court to take the case. The government has till next week to submit a brief to argue that the Court should not take the case.

    For more details on the statistical issues, see the amicus brief I filed in the Supreme Court, with Professors Kenneth Rothman and Timothy Lash. See http://schachtmanlaw.com/wp-content/uploads/2010/03/KJR-TLL-NAS-Amicus-Brief-in-US-v-Harkonen-090413A.pdf I am NOT defense counsel in the case. I have filed a “friend of the court” brief because I believe that the government’s prosecution was wrong on the law and the facts, and that it has set a precedent that will cause a great deal of mischief.

    Nathan

Comments are closed.