“Bombshell” statistical evidence for research misconduct, and what to do about it?

Someone pointed me to this post by Nick Brown discussing a recent article by John Carlisle regarding scientific misconduct.

Here’s Brown:

[Carlisle] claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance. . . . the implication is that some percentage of these impossible numbers are the result of fraud. . . .

I thought I’d spend some time trying to understand exactly what Carlisle did. This post is a summary of what I’ve found out so far. I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such “post publication analysis” techniques . . .

I agree with Brown that these things are worth studying. The funny thing is, it’s hard for me to get excited about this particular story, even though Brown, who I respect, calls it a “bombshell” that he anticipates will “have quite an impact.”

There are two reasons this new paper doesn’t excite me.

1. Dog bites man. By now, we know there’s lots of research misconduct in published papers. I use “misconduct” rather than “fraud” because, from the user’s perspective, I don’t really care so much whether Brian Wansink, for example, was fabricating data tables, or had students make up raw data, or was counting his carrots in three different ways, or was incompetent in data management, or was actually trying his best all along and just didn’t realize that it can be detrimental to scientific progress to be fast and loose with your data. Or some combination of all of these. Clarke’s Law.

Anyway, the point is, it’s no longer news when someone goes into a literature of p-value-based papers in a field with noisy data, and finds that people have been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” At this point, it’s to be expected.

2. As Stalin may have said, “When one man dies it’s a tragedy. When thousands die it’s statistics.” Similarly, the story of Satoshi Kanazawa or Brian Wansink or Daryl Bem has human interest. And even the stories without direct human interest have some sociological interest, one might say. For example, I can’t even remember who wrote the himmicanes paper or the ages-ending-in-9 paper, but in each case I’m still interested in the interplay between the plausible-but-oh-so-flexible theory, the weak data analysis, the poor performance of the science establishment, and the media hype. This new paper by Carlisle, though: it’s so general, so it’s hard to grab onto the specifics of any single paper or set of papers. Also, for me, medical research is less interesting than social science.

Finally, I want to briefly discuss the current and future reactions to this study. I did a quick google and found it was covered on Retraction Watch, where Ivan Oransky quotes Andrew Klein, editor of Anaesthesia, as saying:

No doubt some of the data issues identified will be due to simple errors that can be easily corrected such as typos or decimal points in the wrong place, or incorrect descriptions of statistical analyses. It is important to clarify and correct these in the first instance. Other data issues will be more complex and will require close inspection/re-analysis of the original data.

This is all fine, and, sure, simple typos should just be corrected. But . . . if a paper has real mistakes I think the entire paper should be flagged as suspect. If the authors have so little control over their data and methods, then we may have no good reason to believe their claims about what their data and methods imply about the external world.

One of the frustrating things about the Richard Tol saga was that we became aware of more and more errors in his published article, but the journal never retracted it. Or, to take a more mild case, Cuddy, Norton, and Fiske published a paper with a bunch of errors. Fiske assures us that correction of the errors doesn’t change the paper’s substantive conclusions, and maybe that’s true and maybe it’s not. But . . . why should we believe her? On what evidence should we believe the claims of a paper where the data are mishandled?

To put it another way, I think it’s unfortunate that retractions and corrections are considered to be such a big deal. If a paper has errors in its representation of data or research procedures, that should be enough for the journal to want to put a big fat WARNING on it. That’s fine, it’s not so horrible. I’ve published mistakes too. Publishing mistakes doesn’t mean you have to be a bad person, nobody’s perfect.

So, if Anaesthesia and other journals want to correct incorrect descriptions of statistical analyses, numbers that don’t add up, etc., that’s fine. But I hope that when making these corrections—and when identifying suspicious patterns in reported data—they also put some watermark on the article so that future readers will know to be suspicious. Maybe something like this:

The authors of the present paper were not careful with their data. Their main claims were supported by t statistics reported as 5.03 and 11.14, but the actual values were 1.8 and 3.3.

Or whatever. The burden of proof should not be on the people who discovered the error to demonstrate that it’s consequential. Rather, the revelation of the error provides information about the quality of the data collection and analysis. And, again, I say this as a person who’s published erroneous claims myself.


  1. Martha (Smith) says:

    It sounds like you didn’t read very much of what Brown wrote. So here is his “Conclusion” section:


    The above analyses show how easy it can be to misinterpret published articles when conducting systematic forensic analyses. I can’t know what was going through Carlisle’s mind when he was reading the articles that I selected to check, but having myself been through the exercise of reading several hundred articles over the course of a few evenings looking for GRIM problems, I can imagine that obtaining a full understanding of the relations between each of the baseline variables may not always have been possible.

    I want to make it very clear that this post is not intended as a “debunking” or “takedown” of Carlisle’s article, for several reasons. First, I could have misunderstood something about his procedure (my description of it in this post is guaranteed to be incomplete). Second, Carlisle has clearly put a phenomenal amount of effort—thousands of hours, I would guess—into these analyses, for which he deserves a vast amount of credit (and does not deserve to be the subject of nitpicking). Third, Carlisle himself noted in his article (p. 8) that it was inevitable that he had made a certain number of mistakes. Fourth, I am currently in a very similar line of business myself at least part of the time, with GRIM and the Cornell Food and Brand Lab saga, and I know that I have made multiple errors, sometimes in public, where I was convinced that I had found a problem and someone gently pointed out that I had missed something (and that something was usually pretty straightforward). I should also point out that the quotes around the word “bombshell” in the title of this post are not meant to belittle the results of Carlisle’s article, but merely to indicate that this is how some media outlets will probably refer to it (using a word that I try to avoid like the plague).

    If I had a takeaway message, I think it would be that this technique of examining the distribution of p values from baseline variable comparisons is likely to be less reliable as a predictor of genuine problems (such as fraud) when the number of variables is small. In theory the overall probability that the results are legitimate and correctly reported is completely taken care of by the p values and Stouffer’s formula for combining them, but in practice when there are only a few variables it only takes a small issue—such as a typo, or some unforeseen non-independence—to distort the results and make it appear as if there is something untoward when there probably isn’t.

    I would also suggest that when looking for fabrication, clusters of small p values—particularly those below .05—may not be as good an indication as clusters of large p values. This is just a continuation of my argument about the p value of .07 (or .0000007) from Article 2, above. I think that Carlisle’s technique is very clever and will surely catch many people who do not realise that their “boring” numbers showing no difference will produce p values that need to follow a certain distribution, but I question whether many people are fabricating data that (even accidentally) shows a significant baseline difference between groups, when such differences might be likely to attract the attention of the reviewers.

    To conclude: One of the reasons that science is hard is that it requires a lot of attention to detail, which humans are not always very good at. Even people who are obviously phenomenally good at it (including John Carlisle!) make mistakes. We learned when writing our GRIM article what an error-prone process the collection and analysis of data can be, whether this be empirical data gathered from subjects (some of the stories about how their data were collected or curated that were volunteered by the authors whom we contacted to ask for their datasets were both amusing and slightly terrifying) or data extracted from published articles for the purposes of meta-analysis or forensic investigation. I have a back burner project to develop a “data hygiene” course, and hope to get round to actually developing and giving it one day!

    • Andrew says:


      I read all of Brown’s post before writing mine. It’s all interesting, and Brown makes good points. One reason I posted on this was so if anyone happens to come to this blog to see more about this story, they’d be sent straight to Brown’s thorough post. I just didn’t focus on this particular aspect of Brown’s post because it wasn’t so connected to the aspects of this story that interest me. I’m assuming that anyone who follows up on Carlisle to look at individual papers in Anaesthesia etc. will check the details as Brown did.

  2. Cliff AB says:

    Wait, I’m actually totally confused by this original article (Carlisle).

    Quote from the last paragraph: “In summary, the distribution of means for baseline variables in randomised, controlled trials was inconsistent with random sampling, due to an excess of very similar means and an excess of very dissimilar means”

    …but randomized trials ALWAYS have enrollment criteria. If your enrollment criteria is “males age 20-30 with hip pain”, your baseline demographics are going to be “unusually” (quotes because that is wrong…) similar to another study in which the enrollment criteria is “males age 21-29 with hip problems” and “unusually” similar to another study which is “females over 60”. As far as I can tell, this is mentioned nowhere in Carlisle’s paper.

    Or am I missing something very obvious?

    • Cliff AB says:

      last one should be ‘…”unusually” dissimilar to…’

    • Andrew says:


      I was assuming that Carlisle was comparing treatment to control within each experiment, not comparing enrolled people in different studies. But I did not try to read Carlisle’s paper in detail as I found it confusing; I relied on Brown’s description.

      • Cliff AB says:

        Ah, that would make a little more sense. I didn’t see anything in the article explicitly writing out exactly what was being compared.

        But even that is still questionable. Medical ethics dictate intervention, meaning that even though the treatment may be randomly assigned at the start, that doesn’t mean the final treatment/control group has to be distributed in the same manner, especially if there is a strong treatment effect. And it seems reasonable that the outliers will be the ones most likely to need intervention and thus cause a larger change in the baseline scores.

        Also, Carlisle claims that variables that were stratified were excluded. That seems to imply Carlisle would have had to closely read 5087 papers to write this paper (rather than using some sort of web scraping tool).

        Personally, I find the paper to be extremely weak evidence of an extremely serious crime. I’m sure there are RCTs that have not followed their SOPs, but I would not be any more suspicious of the trials highlighted with this method than a randomly sampled RCT.

      • Jordan Anaya says:

        Cliff: I’m pretty sure Carlisle is only looking at whether the randomization within a trial actually appears random.

        Andrew: I’ve looked at a couple of Carlisle’s papers, and I find all of them to be confusing. If he hopes for this technique to become standard practice he probably needs to make a web application like I did for granularity testing.

        Regarding the Cuddy paper you mentioned, I always found it interesting that in their correction they never addressed the granularity errors pointed out by Nick.

        • Andrew says:


          Regarding your last point: I’ve written and published a lot of scientific papers through the peer-review process, and the typical experience is: Once we get the review reports, we figure out the minimum effort needed to address the comments, so we can get the damn paper published and move on. OK, not always. And, in many cases, the comments are good and our revisions make the paper better, sometimes much better (as in my popular 2006 paper on prior distributions for variance parameters, where we introduced the 3-schools example and the half-Cauchy distribution only in response to reviewers being unsatisfied with what we’d had earlier). And, sure, sometimes the reviewer comments are serious and cause us to rethink the entire point we’re trying to make. But usually we see the reviews as a hoop we need to jump through, in that final stage of getting the paper done.

          I could well imagine that Cuddy, Norton, and Fiske (or whatever subset of them handled the correction) thought of the error reports in the same way, as a sort of post-publication referee report, and extra annoying for coming after they thought the paper was safely in its final resting place. So they did the minimum needed to satisfy the journal editor and then got out of there. I have no idea, but I could well imagine that they never even considered the possibility that any of their substantive conclusions would need to change.

          I agree with you that in their correction they should’ve addressed all criticisms, and they should’ve gone back with an open mind and reconsidered all their empirical claims—and that goes double for Richard Tol, Brian Wansink, Satoshi Kanazawa, and others whose published work was even more fundamentally flawed—but I can see how all these people have behaved as they did, based on the general principle that peer-reviewed publication is such a pain in the ass that you just do the minimum needed to make the reviewers and journal editors happy.

          I find it much different when I’m writing a book, or a blog, or a blog comment: Then there’s no reviewers, and the responsibility is all on me not to screw up.

          • Jordan Anaya says:

            When people don’t take criticisms of their work seriously I can’t help but conclude they don’t take their work seriously. If the possibility that something you did is wrong doesn’t bother you, then clearly you don’t think it matters if what you published is correct or not.

            Or maybe these people are just completely delusional and do mental gymnastics to rationalize what they are seeing. In the case of Wansink if the lab gets unexpected results it’s not because the theory was wrong, it’s because music was playing, the lighting was too bright, the packages were too small or too large, other food items were offered, etc.

            On the one hand, the fact that Wansink’s work doesn’t seem to fit together very well suggests we’re not looking at a Stapel-like case, but then you have some studies like the 770 person surveys which use Census data which couldn’t have contained the data they claim to be looking at. Or you have studies by the lab which are claimed to have occurred, but citations of the study just lead you to his books, which cite his papers, which cite his books.

            • Andrew says:


              Regarding your first paragraph: I do take criticisms seriously, but journal referee reports often seem to just be a bunch of hoops to jump through, and it’s natural after seeing a few thousand of these, to just jump to the end and try to resolve the issues as quickly as possible. I could well imagine that this is how Cuddy, Norton, and Fiske perceived the error reports, as picky little objections, not so different from complaints that the margins are too narrow or the font is too small or that the wrong citation style is being used. I don’t agree with that attitude, but I can see how it can arise. I’m sure it matters to Cuddy, Norton, and Fiske that their published work is corrected, but it may be that they don’t fully understand the connection between their published claims, their data, and their data analysis. I say all this not as an attempt to get them off the hook (or to condemn them) but in an introspective attempt to understand their behavior, which I agree seems odd given their larger goal of using scientific experimentation to learn about external, reproducible reality.

              Regarding your final paragraph: Stapel seems to have been a sort of purist, repeatedly using data fabrication as his strategy to obtain statistically significant and publishable results. It is possible that Wansink is more eclectic, sometimes p-hacking with clean data, sometimes delegating this task to students, sometimes misrepresenting or misunderstanding his own data collection procedures, sometimes using a single dataset multiple times and representing it differently, etc.

              • Anoneuoid says:

                I could well imagine that this is how Cuddy, Norton, and Fiske perceived the error reports, as picky little objections, not so different from complaints that the margins are too narrow or the font is too small or that the wrong citation style is being used.

                Interesting perspective. It fits with the perennial claim that the errors “do not affect the conclusions”. It is as if any single step of data processing/analysis is irrelevant but somehow the “sum of the evidence” amounts to something (ie claiming 0*x != 0). That is true for the typesetting issues though…

                Also, in agreement with what you say, in my experience much peer review does amount to bikeshedding. So the credibility increase due to “peer review” is way overblown. Of course getting feedback is important, but usually you are better off asking a colleague than some random person who doesn’t care too much.

              • Martha (Smith) says:

                Anoneuoid said, “Also, in agreement with what you say, in my experience much peer review does amount to bikeshedding. So the credibility increase due to “peer review” is way overblown.”

                So maybe one way to improve research quality might be to engage in “journal shaming,” rather than individual shaming – e.g., have a “wall of shame” for journals that have published a lot of papers with “questionable research practices” or other low quality practices.

            • Martha (Smith) says:

              Come to think of it, Andrew has been journal shaming with “PPNAS”.

        • Cliff AB says:

          “Cliff: I’m pretty sure Carlisle is only looking at whether the randomization within a trial actually appears random.”

          Jordan: that’s correct. But Carlisle is strongly implying that this is evidence of fraud; the title is “Data fabrication and other reasons…”. My point is that there are plenty of reasons listed in a valid SOP for why the final baseline summary statistics of treatment and control might not be two samples from the same distribution. And Carlisle’s complaint is about the tails of a QQ-plot. Lots of assumptions need to be correct before tails of QQ plots look right!

          Reasons other than those I mentioned above: matched pairs designs (Carlisle mentions dropping variables that are stratified, but never mentions matched pairs), non-normal data.

      • Nick says:

        Andrew: What I blogged about was based on reading Carlisle’s article (which is not always as clear as it could be) and reproducing his analyses as far as I could. All I can say for sure is that in the three articles that I looked at:

        1. The p values from the original RCT articles that I was able to recalculate (from the differences between groups) were broadly similar to the p values that Carlisle recalculated. This ought not to be a huge surprise as I think that Carlisle and I are both using the same software (function ind.twoway.second() in the rpsychi package) to calculate F and t tests.

        2. I was able to reproduce the overall p values that Carlisle presented in his S1 file exactly, using my implementation of the Stouffer formula for combining p values via z scores.

        At this point I am reasonably confident that I understand Carlisle’s method, but again, I have only examined three of the more than 5,000 original articles. On that basis, I think you are correct: the analyses are all within-article examinations of the distribution of the p values that correspond to baseline comparisons. The question of whether or not the use of Stouffer’s formula is a valid way of determining whether that set of p values could have occurred by chance is a very long way beyond my statistical competence to answer. A few people are discussing this, and related questions, on Twitter right now.
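
        For readers who have not met it, the Stouffer combination referred to above can be sketched in a few lines of R. This is only an illustration: the helper name stouffer_p, the tail convention (small p mapped to large z), and all of the p values are assumptions made up for the example, not taken from Carlisle's article or his S1 file. It also shows the sensitivity discussed earlier in this thread, where one mistyped value can dominate the result when only a few baseline variables are combined.

        ```r
        # Stouffer's method: turn each p value into a z score, sum the z scores,
        # rescale by sqrt(k), and convert back to a p value.
        # Hypothetical helper; the tail convention is an assumption.
        stouffer_p <- function(p) {
          z <- qnorm(1 - p)                        # one z score per comparison
          z_combined <- sum(z) / sqrt(length(p))   # standard normal if z's are independent
          1 - pnorm(z_combined)                    # combined one-sided p value
        }

        # Illustrative p values only, not from any real trial.
        p_few <- c(0.40, 0.55, 0.62)
        p_few_combined <- stouffer_p(p_few)

        # Same list with a single typo (0.40 entered as 0.0004).
        p_typo_combined <- stouffer_p(c(0.0004, 0.55, 0.62))

        # The same typo diluted among many unremarkable comparisons.
        p_many <- c(0.0004, 0.55, 0.62, rep(c(0.45, 0.50, 0.58), 5))
        p_many_combined <- stouffer_p(p_many)

        c(few = p_few_combined, typo = p_typo_combined, many = p_many_combined)
        ```

        With only three variables the typo alone is enough to push the combined value below .05, while among eighteen variables the same typo leaves it unremarkable.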

    • The fact that Carlisle finds more similar and more dissimilar mean differences (using his method of combining p-values using Stouffer’s method) could be explained by the violation of the assumption that the combined p-values must be from independent tests. If you have correlated baseline measures, then even if the null is true, the distribution of p-values becomes bi-modal.
      See this R simulation below. In highly correlated settings (baseline covs correlated), the p-value distribution under a TRUE null is not uniform, but bimodal which would then be falsely interpreted as many means being too similar or too dissimilar.

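      #R code below
      library(plyr)     # raply()
      library(ggplot2)  # ggplot(), geom_histogram(), facet_wrap()

      baselinefraud <- function(n,a) {
        # u is a shared factor, so x1 and x2 are correlated (correlation a^2)
        u <- rnorm(n,0,1)
        x1 <- a*u + rnorm(n,0,sqrt(1-a^2))
        x2 <- a*u + rnorm(n,0,sqrt(1-a^2))
        # random assignment, unrelated to either baseline variable
        tr <- rbinom(n,1,.5)
        x1p <- t.test(x1~tr)$p.value
        x2p <- t.test(x2~tr)$p.value
        plist <- c(x1p,x2p)
        #stouffers p
        sp <- 1-pnorm(sum(sapply(plist,qnorm))/sqrt(length(plist)))
        # one row per simulated trial: sample size, loading, combined p
        c(n,a,sp)
      }

      # a is drawn from {0, 0.1, ..., 0.9}
      res <- data.frame(raply(50000,baselinefraud(100,round(runif(1,-.04,.94),1)),.progress = "text"))
      names(res) <- c("N","a","p")
      ggplot(res,aes(p)) + geom_histogram(binwidth = .02) + facet_wrap(~a)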
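
      A rough way to see why correlation matters: the combined Stouffer z score has variance 1 only when the component z scores are independent. Positive dependence between them inflates that variance, which pushes combined p values toward 0 and 1. The base-R sketch below checks this numerically under the same data-generating setup as the simulation above. The helper name stouffer_z_var, the loading of 0.9, the sample size, and the number of replications are arbitrary choices for illustration.

      ```r
      set.seed(1)

      # Hypothetical helper: variance of the combined Stouffer z score across
      # simulated trials whose two baseline variables share a factor with loading a.
      stouffer_z_var <- function(a, n = 100, reps = 2000) {
        z <- replicate(reps, {
          u  <- rnorm(n)
          x1 <- a * u + rnorm(n, 0, sqrt(1 - a^2))
          x2 <- a * u + rnorm(n, 0, sqrt(1 - a^2))
          tr <- rbinom(n, 1, 0.5)                 # random assignment
          p  <- c(t.test(x1 ~ tr)$p.value, t.test(x2 ~ tr)$p.value)
          sum(qnorm(p)) / sqrt(length(p))         # combined z score
        })
        var(z)
      }

      v_indep <- stouffer_z_var(a = 0)     # uncorrelated baseline variables
      v_corr  <- stouffer_z_var(a = 0.9)   # strongly correlated baseline variables
      c(independent = v_indep, correlated = v_corr)
      ```

      If the independence assumption held in both cases, the two variances would both sit near 1; the gap between them is a direct measure of how far the combined p values depart from uniform.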

  3. Thomas says:

    “The burden of proof should not be on the people who discovered the error to demonstrate that it’s consequential.”

    +1 Errors should simply be acknowledged and fixed. If authors don’t take responsibility for this, we can’t cite their results.

  4. Markus says:

    I have the following concerns or questions regarding the Carlisle paper.
    1. It appears that he has conducted multiple tests. Is it possible that some of the significant p-values are due to chance? Is this something worth mentioning?
    2. We do not know the actual positive or negative predictive value of his method. Just because it worked well in the case of some recent researchers with all of their papers containing errors does not make it work as a screening tool in detecting papers with fabricated data.
    The case against all previous fraudsters was made by the fact that all their papers had the same problem.
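
    To put rough numbers on both concerns, here is a small sketch in R. The 5087 figure is the number of trials cited earlier in this thread. The flagging threshold, prevalence, sensitivity, and specificity are made-up values chosen only for illustration, since the real ones are unknown, and ppv is a hypothetical helper name.

    ```r
    n_trials <- 5087   # number of trials, as cited earlier in this thread
    alpha    <- 0.01   # hypothetical flagging threshold (an assumption)

    # Trials expected to be flagged by chance alone, assuming every trial is
    # sound and its combined p value is uniformly distributed.
    expected_by_chance <- n_trials * alpha

    # Positive predictive value of a screening flag (Bayes' rule).
    ppv <- function(prevalence, sensitivity, specificity) {
      true_pos  <- prevalence * sensitivity
      false_pos <- (1 - prevalence) * (1 - specificity)
      true_pos / (true_pos + false_pos)
    }

    # Made-up inputs: 2% of trials problematic, half of those detected,
    # 1% of sound trials flagged anyway.
    example_ppv <- ppv(prevalence = 0.02, sensitivity = 0.50, specificity = 0.99)

    c(expected_by_chance = expected_by_chance, ppv = example_ppv)
    ```

    Under these invented inputs roughly half of the flagged trials would be sound, which is why the predictive value needs to be estimated before the method is used for screening.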

  5. Nick says:

    John Carlisle has just left a couple of comments on my blog post, with some useful background information:

  6. Paul Alper says:


    “Also, for me, medical research is less interesting than social science.”

    Alas, medical research misconduct may be less interesting but its effect is far more important. As evidence I offer up a bête noire of this blog, a Ted Talk. Watch how Ben Goldacre proves his points:


  7. numeric says:

    The death of one man is a tragedy. The death of millions is a statistic.
    –Iosif Vissarionovich Dzhugashvili (usually known by his party name, Joseph Stalin)

    Not “As Stalin may have said, “When one man dies it’s a tragedy. When thousands die it’s statistics.””. If you don’t want to look up a quote, just say “to paraphrase…”

    • Andrew says:


      Don’t be rude. I can use google too; indeed I did. Google “The death of one man is a tragedy. The death of millions is a statistic” and you get to this page, then if you scroll down you see that the statement “The death of one man is a tragedy, the death of millions is a statistic” is listed under Misattributions, but they do have some variants, including this one:

      In Портрет тирана (1981) (Portrait of a Tyrant),[1] Soviet historian Anton Antonov-Ovseyenko attributes the following version to Stalin: “When one man dies it’s a tragedy. When thousands die it’s statistics.” This is the alleged response of Stalin during the 1943 Tehran conference when Churchill objected to an early opening of a second front in France.

      I don’t know Russian but I can do searches on the internet (and waste time responding to blog comments) just like everyone else.

      • numeric says:

        If you are concerned about wasting time I refer you to your numerous forays into literary criticism. As far as rudeness, I have corrected your phraseology in the past (for example, rather than using “hard cases make bad law” you used some synonym for hard–I corrected and you thanked me–different times, different circumstances, I suppose). Interestingly, if you had read a little further in that google search you did, you would find that the earliest written record of the epigram was in Remarque’s “The Black Obelisk” (1956):

        (for the younger readers of this blog, Remarque is best known for “All Quiet on the Western Front”, a favorite of Nazi book burnings). I suspect that Stalin actually never said this but it has been misattributed to him ever since–but read the link Andrew provided and read the link above and come to your own conclusions.

        Anyway, without this little exchange, I never would have tracked this down and could not have corrected my thinking (a literary Ellsberg paradox–Ellsberg used to go around to statisticians and described the setup to them and had them give the sub-optimal answer–Lindley, coherent as always, thanked him for the correction–something I’m clearly not going to get from Andrew). And, unfortunately, I set my google preferences to German to track this down and now (not speaking German) I can’t figure out how to reset them.

  8. BD says:

    Interesting critique of Carlisle’s methods and conclusions by a Yale researcher…
