Skip to content

Cracks in the thin blue line

When people screw up or cheat in their research, what do their collaborators say?

The simplest case is when coauthors admit their error, as Cexun Jeffrey Cai and I did when it turned out that we’d miscoded a key variable in an analysis, invalidating the empirical claims of our award-winning paper.

On the other extreme, coauthors can hold a united front, as Neil Anderson and Deniz Ones did after some outside researchers found a data-coding error in their paper. Instead of admitting it and simply recognizing that some portion of their research was in error, Anderson and Ones destroyed their reputation by refusing to admit anything. This particular case continues to bother me because there’s no good reason for them not to want to get the right answer. Weggy was accused of plagiarism, which is serious academic misconduct, so it makes sense for him to stonewall and run out the clock until retirement. But Anderson and Ones simply made an error: Is admitting a mistake so painful as all that?

In other cases, researchers mount a vigorous defense in a more reasonable way. For example, after that Excel error was found, Reinhart and Rogoff admitted they made a mistake, and the remaining discussion turned on (a) the implications of the error for their substantive conclusions, and (b) the practice of data sharing. I think both sides had reasonable points in this discussion; in particular, yes the data were public and always available but not the particular data file used by Reinhart and Rogoff was not accessible for outsiders. The resulting discussion moved forward in a useful way, toward a position that researchers who publish data should make their scripts and datasets available, even when they are working with public data. Here’s an example.

But what I want to talk about today is when coauthors do not take a completely united front.

In the case of disgraced primatologist Marc Hauser, collaborator Noam Chomsky escalated with: “Marc Hauser is a fine scientist with an outstanding record of accomplishment. His resignation is a serious loss for Harvard, and given the nature of the attack on him, for science generally.” On the upside, I don’t think Chomsky actually defended Hauser’s practice of trying to tell his research assistants how not code his monkey data. I’m assuming that Chomsky kept his distance from the controversial research studies, allowing him to engage in an aggressive defense on principle alone.

Another option is to just keep quiet. The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05. This last sort of error has also been found in at least one other paper of Cuddy. Upon publication of the non-replication, all three of Carney, Cuddy, and Yap responded in a defensive way that implied a lack of understanding of the basic statistical principles of statistical significance and replication. But after that, Carney and Yap appear to have kept quiet. [Not quite; see P.P.S. below.] Cuddy issued loud attacks on her critics but her coauthors perhaps have decided to stay out of the limelight. I’m glad they’re not going on the attack but I’m disappointed that they seem to want to hold on to their discredited claims. But that’s one strategy to follow when your work is found lacking: just stay silent and hope the storm blows over.

A final option, and the one I find most interesting, is when a researcher commits fraud or gross incompetence and does not admit it, but his or her coauthor will not sit still and accept this.

The most famous recent example was the gay-marriage-persuasion study of Michael Lacour and Don Green. When outsiders found out that the data were faked, Lacour denied it but Green pulled the plug. He told the scientific journal and the press that he had no trust in the data. Green did the right thing.

Another example is biologist Robert Trivers, who found out about problems in a paper he had coauthored—one coauthor had faked the data and another was defending the fraud. It took years until Trivers could get the journal to retract it.

My final example, which motivated me to write this post, came today in a blog comment from Randall Rose, a coauthor, with Promothesh Chatterjee and Jayati Sinha, of a social psychology study that was utterly destroyed by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, to the extent that that Pashler et al. concluded that the data could not have happened as claimed in the paper and were consistent with fraud. Chatterjee and Sinha wrote horrible, Richard Tol-like defenses of their work (here’s a sample: “Although 8 coding errors were discovered in Study 3 data and this particular study has been retracted from that article, as I show in this article, the arguments being put forth by the critics are untenable”), but Rose did not join in:

I have ceased trying to defend the data in this paper, particularly Study 3, a long time ago. I am not certain what happened to generate the odd results (other than clear sloppiness in study execution, data coding, and reporting) but I am certain that the data in Study 3 should not be relied on . . .

I appreciate that. Instead of the usual the-best-defense-is-a-good-offense attitude, Rose openly admits that he did not handle the data himself and that he has no reason to vouch for the data quality or claim that the results still stand.

Wouldn’t it be great if everyone could do that?

It’s not an easy position, to be a coauthor in a study that has been found wanting, either through fraud, serious data errors, or simply a subtle statistical misunderstanding (such as that which led Satoshi Kanazawa to think that he could possibly learn anything about variation in sex ratios from a sample of size 3000). I find the behavior of Trivers, Green, and Rose in this setting to be exemplary, but I recognize the personal and professional difficulties here.

For someone like Carney or Yap, it’s a tough call. On one hand, to distance themselves from this work and abandon their claims would represent a serious hit on their careers, not to mention the pain involved in having to reassess their understanding of psychology. On the other hand, the work really is wrong, the experiment really wasn’t replicated, the data really are too noisy to learn what they were hoping to learn, and unlike Cuddy they’ve kept a lower profile so it doesn’t seem too late for them to admit error, accept the sunk cost, and move on.

P.S. See here for a discussion of a similar situation.

P.P.S. Commenter Bernoulli writes:

It is not true that Carney has remained totally silent.

From Carney’s webpage, under Selected Publications:






  1. Shravan says:

    Here is how psychologists (and psycholinguists) defend these objections from you Andrew:

    “The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for,”

    But the key p-value was less than 0.05, the effect was *reliable*.

    “…second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05.”

    Even though it was miscalculated, the effect was *going in the right direction*. And a failed replication is not informative.

    Hence, no problem.

    • Shravan says:

      I hear the word “reliable” a lot, when people base their conclusions on p<0.05. It is generally used to end an argument ("but the effect was reliable").

      • Martha (Smith) says:


        Maybe “reliable” needs to be added as an appendix to the list of circumlocutions for p >.05 at — if p > .05 warrants circumlocutions, it’s only fair that p < .05 have some!

        • Shravan says:

          I have not yet asked these people what they mean by “reliable”. I will next time. However, a senior psychologist in a major university in Scotland who reviewed a paper of mine that used p-values mentioned in his review that the p-values were not low enough, which raises questions about the reproducibility of the effects. In other words, the low-ness of the p-value apparently tells you whether the effect is reproducible. I guess that’s what they teach you in psych and that’s why “major” journals want to see “robust” effects (i.e., effect with low p-values).

          • Ben Prytherch says:

            Shravan – there was an “alternative” to the p-value called “p-rep” that was required for all articles submitted to Psychological Science for a few years. I discovered it when reading through original papers that were subjected to replication for the Reproducibility Project: Psychology. It is a function of the p-value and nothing else, but its alleged interpretation is “the probability of a significant effect in replication”. So at least one major psychology journal spent a while telling everyone that the reproducibility of an observed effect can be calculated from it’s p-value.

  2. Anoneuoid says:

    >”in particular, yes the data were public and always available but not the particular data file used by Reinhart and Rogoff was not accessible for outsiders.”

    Even the actual file is sometimes not enough. Many times I have seen excel automatically change values because it detects a date, etc. It also sometimes displays a different value than it saves, so opening the file (eg csv) in different software will give different results. Just one person resaving the file after looking at it is enough to mess up the data, download some genetics data from wherever and search for eg the OCT1 gene. It wont take long before you come across the problem…

    The entire data processing pipeline needs to be reproducible.

  3. mark says:

    My own favorite example are all of the co-authors on the couple of dozen paper by Fred Walumbwa that have been retracted, had corrections/corrigendums published, or continue to go uncorrected despite being widely known to based on statistically impossible results. Almost all of the authors continue to proudly tout the articles on their websites and none of them have asked for corrections to be made or to ohave their name to be removed from the papers.
    Many of these papers are discussed at length on

  4. Shravan says:

    In an amusing twist, Susan Fiske and John Bargh have also signed the “tenor (yes, that’s tenor, not terror) of discussions” petition:

    • Andrew says:


      Key bit from that statement:

      In addition, when our own work or other work we care about is on the receiving end of scientific criticism, we must be careful not to misperceive that criticism as an attack or harassment. . . . We must try our best to accept and learn from criticism, if we are to advance our shared goal of contributing to a cumulative and self-correcting science.

      But, later on, there’s this:

      By signing below, we are indicating not that we each agree wholeheartedly with every single sentence above . . .

      So I think Fiske and Bargh are OK, as they can just clarify that they did not agree wholeheartedly with those first two sentences above.

  5. Travis says:

    I think it’s important to remember that co-authors are not all necessarily equal in terms of their ability to disagree with their collaborators. It’s all well and good for Don Green to admit there’s a problem, contradicting Lacour, but it would take some serious gumption for a junior co author to break with his or her senior collaborators. Andy Yap, for example was just a grad student when the power posing paper was publishe. Although, it looks like Yap and Carney have continued with this line of work, implicitly suggesting that they don’t see any problem with the original paper. Still, I think the point about power differential is important to keep in mind.

    • Nathan says:

      As others have noted, Carney seems to have turned her back on the power posing research. However, I think your point about the power differential to be pretty important. I think it is interesting that of the three, Cuddy seems to have managed to leverage the research into not only a TED talk but a book deal, and probably a whole host of other gigs.

  6. Chris Crandall says:

    This posting recklessly mischaracterizes its targets in ways that undermine the reader’s faith in Gelman’s points. Here are two important examples of inaccuracies or elisions that leave a genuinely false impression on the reader.

    First, Gelman describes a paper by Randall Rose, Promothesh Chatterjee and Jayati Sinha as “a social psychology study that was utterly destroyed by . . . ” The reader is obviously left with the notion that this particularly social psychology study is crap. Crappy as it may be, there is absolutely nothing about social psychological about this paper–it is written by business faculty from business schools in a marketing journal (Marketing Letters) studying use of cash vs. credit cards with a cognitive psychological account. This is akin to using a drone to bomb a terrorist but hitting a wedding party instead–it’s only acceptable to people who don’t care about the wedding party in the first place. But these are my people.

    Second, Gelman writes that the “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed . . . “

    It’s true that the original study did not have high power. But the study had TWO main findings–one physiological, and one psychological. The physiological effect has not held up well in replication, but the psychological effect has. I quote from this “failed” replication:

    “Using two-tailed t tests, we replicated Carney et al.’s finding that participants in the high-power condition self-reported, on average, higher feelings of power than did participants in the low-power condition (mean difference = 0.245, 95% confidence interval, or CI = [0.044, 0.446]), t(193) = 2.399, p = .017, Cohen’s d = 0.344. [Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the Robustness of Power Posing No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women. Psychological science, 0956797614553946.]

    Certainly, replicating the physiological effect would have been exciting. But I assure you that psychological DVs are quite exciting to psychologists, they are quite meaningful to our theories, and are acceptable as good science in a journal named “Psychological Science.”

    When lobbing your live ammunition, please aim carefully, please aim accurately, or expect to lose the very credibility some of these blog entries are designed to undermine in others.

    • Andrew says:

      Hey Chris, awesome use of the terrorism analogy. You rock, dude!

      • Chris Crandall says:

        Thanks. But I have along way to go to call *you* a terrorist. (For the irony-impaired, I am not.)

      • Terri Vescio says:

        What about the psychological variable – subjective feelings of power? Are there gender differences on this variable? I did think Crandall made an interesting point. There are two variables here. Does replication on her psychological variable justify continued attention to power posing? And the bigger question, when and what scientific evidence should exist as a basis for promoting work as meaningful to people beyond academics? what I do not know, has Cuddy’s discussions of power posing changed in light of the cumulative findings and response to critiques? I think these may be the more important questions, in this particular case.

    • Carol says:

      Hello Chris Crandall,

      If “there is absolutely nothing social psychological about this paper,” as you say, why did Pashler et al.’s critique appear in the journal BASIC AND APPLIED SOCIAL PSYCHOLOGY with the title “A social priming data set with troubling oddities”?

      • Chris Crandall says:

        I cannot explain to you, Carol, which BASP printed this criticism. Prof. Pashler labels a lot of things as “social priming” when they have nothing social about them. I encourage you to read the original paper, and find what’s social psychological about it. I have read it, I have read Pashler et al., and did not discover the why of it.

        • 11th Recon says:

          While I agree that “social priming” is a poor choice of label, the original study has the look and feel of a Social Psychology Priming study all over it. It uses a money prime to look at altruism. I mean for crying out loud the abstract has this gem “This is paradoxical because recent research suggests that mentioning money primes a self-sufficient mindset, thus undermining the very behaviors these organizations desire to elicit.” It’s as much Social Psychology priming research as Kathleen Vohs research that they cite.

          So I disagree with your conclusion that “Crappy as it may be, there is absolutely nothing about social psychological about this paper” and I also disagree with your assertion that these facts “–it is written by business faculty from business schools in a marketing journal (Marketing Letters) studying use of cash vs. credit cards with a cognitive psychological account.” support that conclusion. Social psychologists have been making inroads into Business College’s for some time. For example, Vlad Griskevicius and Kathleen Vohs at Carlson School of Management, University of Minnesota. Yexin Jessica Li at School of Business, University of Kansas are all Social Psychologists.

          This isn’t so much a matter of the drone pilot missing as much as Social Psychology having for better or worse a very big tent.

        • Carol says:

          Hi Chris Crandall,

          I’ve already read the original article (now retracted), Pashler et al.’s critique, the reply by each of the three authors, and the rejoinder by Pashler at al.

          I’ve just sent Hal Pashler and David Trafimow (the editor of BASP) a note asking them to respond; that seems the most efficient way to proceed.

        • Alex says:

          I agree with 11th Recon that there could be a ‘big tent’ issue, but even then it seems to fall fairly solidly in the social psych realm. The original paper is looking at the psychological factors that affect donations; seems fairly social. As the authors say, “priming cash concepts reduces willingness to help others”. The articles by Vohs et al that they contrast themselves to are couched in social psych terms. It strikes me as being at least as much social psych as it might be cognitive psych.

        • Alex says:

          It’s also worth noting that even if this was published in a marketing journal by people in business schools, all three of Cuddy, Carney, and Yap have degrees in psychology (bachelor’s for all, Ph.D.’s in social for Cuddy and Carney) and the top two authors have had positions in psych departments (post doc and joint affiliation for Carney, assistant professor and then joint affiliation for Cuddy). Yap’s web page specifically says that he studies “the psychology of power, status, and hierarchy” ( They’re all psychologists, and I think would describe themselves as social psychologists.

          • Chris Crandall says:

            My comment about “no social psychology” referred to Chatterjee, Rose & Sinha, “Why Money Meanings Matter in Decisions to Donate Time and Money”, from Marketing Letters.

            The power pose work is certainly done by social psychologists.

    • mark says:

      Ah – but the effect on psychological variables only works for men – in both the Carney et al study and also in the Ranehill replication. Further, the authors were well aware of this gender moderation and even wrote about in an earlier draft of their manuscript (which I have a copy of). Funnily enough Amy Cuddy has gone around telling women all over the world how power posing will help them when she was completely aware of the fact that the psychological effect did not apply to women.

  7. Chris Crandall says:

    Mark: I found nothing in Ranehill et al (2015) that suggests that the psychological effect on felt power was different by gender. Perhaps it is buried deeply within the supplemental files?

    • mark says:

      The data is publicly posted. It is there. Across all three studies (two by Carney et al. and one by Ranehill) the effect for men is fourt times as strong as for women.

    • mark says:

      As you may know, the Carney et al. paper also misreports the effects of the treatment on the willingness to gamble as significant – when it is non-significant. Their design also has a basic confound inasmuch as participants had the opportunity to gamble before the post-manipulation hormone measurements were taken – when we have tons of evidence that gambling (and even the opportunity to gamble) raises cortisol and testosterone. Basic stuff that the Psych Science reviewers and editor should have caught.

  8. Bernoulli says:

    Andrew, it is not true that Carney has remained totally silent. If you visit you will see her statements remarking that the power pose effect is not likely real.

  9. Simon Gates says:

    I’ve just had pretty much the situation that Andrew describes; I don’t know if I handled it well or not, and I’d be interested in others’ comments.

    A junior clinician/researcher did some reanalyses of data from a clinical trial that I had been closely involved with, in collaboration with a statistician. I wasn’t involved in the reanalysis, which I thought was terrible, but followed the prevailing methodology of the field. It was submitted to a journal, and unsurprisingly (to me), the journal’s referees didn’t pick up on any statistical issues and the paper was accepted. I considered withdrawing as an author but eventually decided not to, but to include in the “contributions of authors” section that I wasn’t involved in the analysis. I didn’t insist on putting in that I disagreed with the analysis – maybe I should have? I found the situation difficult as some of the other authors are friends as well as colleagues, and were keen for me to stay involved. I wasn’t too bothered from a career point of view (for myself) but for the junior researcher it’s obviously a much bigger deal and I wanted to try to be supportive. After all, I might be wrong about the statistical issues. So maybe we’ve just ended up with a bit of a fudge and unsatisfactory compromise?

  10. mark says:

    Dana Carney has just posted this:

    Basically a complete repudiation of power pose research. Major kudos to her.

  11. Steve Sailer says:

    Here’s a question:

    What if some easy priming trick actually works for 5% of the population, but makes 5% equally worse off, and has no effect on 90%? Would you say that that is phony? Or that it’s something people might try for themselves and see if it works?

    I’m not saying that Power Posing or whatever has this distribution of effects, but it doesn’t seem implausible that some things really do work for a small fraction of the population, while being bad for a comparable small fraction. But would that be phony?

    • Rahul says:

      Only if you can tell me a priori *which* 5% it will work on. Is there an identifiable attribute of the cohort that it works on?

      • Steve Sailer says:

        Maybe, maybe not.

        For example, drinking echinacea tea helps me fend off colds. I’ve been doing it for 20 years and it’s made my life better. On the other hand, nobody I’ve talked into trying it has reported that back to me that it works for them. Furthermore, echinacea is not sweeping the American marketplace. But, then again, it’s not disappearing either.

        My guess is that my good response to echinacea tea has something to with the idiosyncrasies of my immune system.

        My hunch would is that echinacea helps a single digit percentage of the population. That wouldn’t be much, but then it wouldn’t be nothing either.

        Perhaps in a few decades genomic analysis will have advanced to the point where we can get our DNA analyzed and be told: “You should try echinacea tea.”

        Or maybe right now we can go to Whole Foods and buy a box of echinacea tea for $5.49 and try out whether it works for you or (probably) it doesn’t work for you.

        • Steve Sailer says:

          It strikes me that philosophy of science hasn’t caught up with human diversity. Everybody looks to physics and astronomy for examples of how science works: if the General Theory of Relativity doesn’t work in every solar system in the galaxy, then it’s not very general, right?

          But the human sciences may not be all that much like that. Perhaps hypotheses in the human sciences sometimes tend to work for some people in some circumstances but not for others in others?

          Granted, most of the studies that don’t replicate well are probably just basically bogus, but after we clear out the obvious deadwood, what happens when we find things that are truly idiosyncratic? So, what I’m talking about is more theoretical, but it seems like a reasonable thing to be interested in.

          Also, it could provide a soft, non-humiliating fallback for academics whose research doesn’t replicate all that often: maybe it works for some people but not others?

          • Andrew says:


            Social science researchers in fields such as psychology, economics, and political science, are aware of the idea of varying treatment effects. This idea is not well understood, though: the usual textbook discussions are in terms of constant effects. Indeed, one of my criticisms of the recent Imbens and Rubin book on causal inference is I felt they talked too much about constant treatment effects. Imbens and Rubin understand that treatment effects can and do vary, but I think they felt that the simpler approach worked better for an introductory textbook.

            My take on all this is that, whether or not a particular study is bogus, there are true underlying effects. These effects are just a lot smaller and more variable than many people seem to think, and studying them scientifically will require a lot more work than the simple surveys and lab experiments we’ve been seeing.

            • Rahul says:

              When people like Amy Cuddy push their advice, I think, their implicit message is that it helps everyone or at least a majority of people.

              That’s the mis-information they peddle and that’s annoying.

              It’d be a wholely different message if they declared something like, “We think you should try power poses but we think they only help 1 person out of every 1000 that tries them and right now we don’t know if it will help you or not”

              But that isn’t how they sell their message.

            • Anon says:


              To be fair to Imbens and Rubin, they were actually amongst the first to make a substantial contribution to varying treatment effects when they wrote that LATE paper (together with Angrist), which discusses what IV methods might estimate when treatment effects vary.

              • Andrew says:


                Yes, Imbens and Rubin are great; I just had a difference of opinion with them regarding what material to present in an introductory book.

            • elin says:

              I think we would try to find the variable that explains why something (tea, prayer) “works” for some and not others. Otherwise we assume that given enough people rolling dice, someone will get 10 6s in a row. Of course belief that it works will always be one of those potential variables. Also in the case of something like tea, one might say let’s control for that by doing a double blind study.

          • albatross says:

            People are *horrible* at untangling effects like this by personal observation. (See the whole history of medicine, the current vitamin and supplement industry, etc.) So maybe echinacea tea actually does help 1% of people with their colds, but it’s probably far more likely that it kinda seems like it might be helping you because of random stuff (you happened to get better a little faster than usual a couple of times when you drank it) combined with confirmation bias.

        • Rahul says:

          If echinacea tea helps 1% of the population, hurts 1% of the population (maybe gives them heartburn) & just wastes money for the remaining 98% of the population then why is it a good idea to make a blanket recommendation: “Drink echinacea tea to fend off colds”?

          Unless you can identify the 1% population, a blanket recommendation makes little sense, a la Cuddy. If power poses indeed help 1% of the population she better tell us who they are.

          • Andrew says:


            It’s not so much that power pose helps 1% of the population. Rather, I’d guess that it helps some people in some situations. And I doubt it has anything to do with the pose itself.

            • Rahul says:

              0.005% of the population in 0.005% of their daily situations?

              Whatever be the numbers, till we know which people & in exactly which situations, the advice seems not very useful.

            • Rahul says:

              PS. Not sure what you mean by “has anything to do with the pose itself.”

              • Simon Gates says:

                – PS. Not sure what you mean by “has anything to do with the pose itself.”

                The way acupuncture and homeopethy “work” for some people. Effect is created by doing something/getting attention – it’s not the thing you do that produces the effect but the fact of doing something, getting attention from a health professional etc.
                (sorry not very well expresed)

              • Rahul says:


                Fair enough. But if the effect is still real why should we care?

                i.e. Say mere distraction or novelty of acupuncture makes a person with chronic pain feel better, so be it; why should I care (so long as the effect is reproducible, no other worse side effects etc.)

              • Simon Gates says:

                Rahul – yes, that’s a fair point. I guess it comes down to whether being right for the wrong reason matters. A few years ago there was an interesting case when the NICE guideline developent group for back pain ended up recommending acupuncture. Certainly at least some of the group (maybe all) were not acupucture believers, but took the pragmatic view that although it was just a theatrical placebo, there was evidence that it made people feel better, and was unliekly to have any bad side effects. They got a lot of flak for appearing to legitimise a therapy that doesn’t work (or certainly doesn’t work in the way it is believed to – I guess we can’t rule out that it does something).

              • Rahul says:


                My impression is that we shouldn’t mix the two problems (a) Does it work? vs (b) Why does it work?

                If the evidence is strong & repeatable that it works that’d be good enough for a lot of purposes. The academics can then argue about why.

  12. Clyde Schechter says:

    Sorry this is a few days late. But apparently Carney has now repudiated Power Poses. Rather completely and unequivocally.

  13. Liz B says:

    I’m not an academic so have an outsider non-expert perspective.

    However, I find it pretty worrying that there seem to be so many errors in peer reviewed papers in respected journals – data coding errors, miscalculations.

    Whatever happened to checklists – known to be effective in reducing error in a number of fields such as aviation and surgery? A mistake in an academic paper is not life threatening, but nevertheless the use of a checklist to double check all calculations by a co-author or reviewer would have resulted in considerably less time and money being wasted, reputations being shredded….

    • R says:

      I agree completely that much better “data hygiene” guidelines and practices are needed.

      I used to work for an org where any analysis needed to be reproduced at least once by an outsider to the project before it was allowed to leave our doors. What I learned from repeating this exercise dozens of times is that even the most careful work — involving high financial or public relations consequences, the expectation that an adversarial party would be scrutinizing it, use of scripting and not a point-and-click workflow, and subject to hours of discussion and dissection of the numbers — will still contain some errors. Fortunately these errors were usually second-order minor issues, but not always.

      My experience in academic research is that people are far less careful about checking their results than this, and many groups don’t even have anyone with auditing skills. People who aren’t experienced in programming often haven’t developed a debugging or edge-case hunting instinct. It takes solid attention to detail to observe and act on red flags like number of observations changing due to missingness. Sanity checks like comparing observed group means to those implied by a regression model require that the user understand the method, but many researchers and their assistants lack the quantitative background or knowledge of how to use their software to adequately explore their results and diagnose big issues like mixed up coding. All of this is compounded by non-reproducible workflows involving data processing in Excel, point-and-click software or entry on website calculators, and manual entry and rounding of numbers in manuscripts that are not consistently updated when data processing decisions change.

      The peer review system is not equipped to address these kinds of issues. Reviewers provide useful feedback at the level of topic novelty or experimental appropriateness, perhaps the choice of analysis methods, but rarely question implementation specifics because there’s not enough in the paper to raise the issue. Information that could tip a reviewer off to data errors, poor selection of analyses, or “forking paths” tours are typically left out, and a manuscript that did include extensive exploratory work would be tonally jarring and quite long compared with current journal norms. (Look at all the details Carney revealed in her recent power posing confessional that weren’t in their paper — can you imagine how the methods section would have read if they were included?) As such, mostly what we can do is flag post-publication errors, but someone has to be motivated to dig for them (step 0: obtain raw data can stop them in their tracks) and that motivation is not necessarily benign. That leaves researchers feeling much more attacked and defensive about their errors than if these had been observed pre-submission by a friendly party, and so these disputes appear political rather than technical. Automated tools like statcheck (which checks results reported in APA format for internal inconsistencies and posts the results on PubPeer) are a step in the right direction in not targeting individuals, but comes too late in the research lifecycle, and these problems are of course not exclusive to psychology or hypothesis test reporting.

      For these reasons, I am concerned that the overall correctness of data analyses in published research is much lower than we want to it be, without even thinking about multiple comparisons and such. Here’s what I recommend to reduce the incidence of errors (which many people are already practicing but is not widespread enough):

      – Raw data is sacrosanct. Any corrections, subsetting, or aggregations should be made programmatically and not to the original file. How the sample was collected, variable meaning, and value scales must all be documented.

      – Plots with individual observations should be included in any submission, as a supplement if not in the main body of a paper. It is not acceptable that graphics are currently used primarily for reporting aggregated results. (For instance, I believe the Lacour data fabrication could have been detected earlier if graphs showing individual subjects at the longitudinal timepoints were part of the submission.) Ban “dynamite” plots.

      – All analyses should be performed by scripting. Results reported in manuscripts should be automatically inserted using something like knitr. The skill in debugging and checking needed to become proficient in this should be viewed as highly desirable side effect, not a time sink obstacle to producing more research.

      – All analyses should be reproduced start-to-finish by another person before submission, ideally one not heavily involved in their original development and who is knowledgable enough to question the appropriateness of modeling and data processing choices.

      • Rahul says:

        The other side to this is that it costs to get the kind of code / analysis quality you describe. Multiple highly trained, experienced programmers, spending hours or days poring over code and motivated to catch deep bugs doesn’t come cheap.

        Considering that the average grad student makes $25k a year and that the average Prof. churns out upwards of 10 papers a year we are probably getting the code quality we deserve?

        • mark says:

          10+ papers per year? Wow! You must hang out with far more productive faculty than I do. I just checked the google scholar profiles of a couple of super stars in my area and they average four to five peer-reviewed journal articles a year. I would have estimated an average of two to three per year and I know that faculty at some institutions hardly ever publish at all.

          • Rahul says:

            No, really, I don’t think I’m exaggerating. Look at Andrew’s publication list:


            18 papers in 2015.

            In science / engineering 10 per year is not at all unusual. Bob Langer from MIT has 37 papers listed for 2015

            Of course, he’s more high-flyer than average, but that’s why we have 37 vs 10.

            For Ioannidis I counted an incredible 25 publications already listed for 2016 and there’s still many months to go:


          • Shravan says:

            I think I can produce that many every year. I have eight or so PhD students, and if each reports on their work once a year with a paper, and I write two, that’s an easy goal to reach. There are people in Germany who produce 40-45 a year. No way that is possible without major automation and churning out of stuff. This is how you get 350+ papers (as Susan Fiske has). By contrast, I think that Goedel only produced 7-8 papers in his lifetime. Someone had suggested on this blog that each researcher should produce about that many in their lifetime. I agree with that, but it’s impossible to live like that. My university decides on my annual budget based on the *number* of publications I produce, they do not care what’s in them or where they come out. A landmark paper that took years to produce has the same value as a paper that has two experiments that were run in one month and quickly written up. For training graduate students to do research, I have no choice but to get them to start simple (replicate some claims in the literature is my usual starting point with them), and they must publish their results no matter what if they want a job at the end of the rainbow that is the PhD program. That’s how you end up with 10+ papers a year. This year I have 10, last year 11, only 3 in 2014 (I didn’t need a university budget that year as I had negotiated a budget for 2011-14).

            • Shravan says:

              The university and German NSF funding system demands that we write faster than we can think. The German NSF (which claims on paper that they don’t count publications) also counts number of publications; they rejected one of my proposals some years ago on the grounds that I had only three or four publications from previous funding that lasted four years. They expected that if I have funding for four years, I would produce more than 3-4 publications. Think about that. Running one experiment takes up to six months unless you are running with power 0.05. You need at least two experiments to publish a paper unless you are from MIT or Harvard (then, even one experiment is enough). One can run at most two expeirments in parallel. So one takes six months to run two experiments, three months to analyze and write up and submit to a journal. Reviews including revise and resubmit actions can take up to 9 months to a year (more realistic: 1.5 years). If there are dependencies between experiments, i.e., if you cannot start an experiment until the results of the first one are known, then you are even slower. It is physically impossible to write more than 3 papers in a four year project, unless one is just churning out stuff mindlessly. That is what is wanted today.

              I know that the US NSF also does that; I have seen projects rejected there on these grounds (too few publications). The European Research Council (which gives out 2 million Euros like they are cups of take-away coffee) wants to see your h-index when deciding on your personal performance. Unless you are already a superstar, the only way to game an h-index is to churn out stuff in large quantities and then get your friends to cite you and you cite them back. This is also why one needs to remain friendly with everyone in your field; if you antagonize people, they won’t cite you, h-index will stagnate, funding will dry up, and you’ll be surfing on your smartphone all day taking selfies and admiring yourself. In that sense, since *is* a community, even though it should be a method.

              • Shravan says:

                Sorry, one more rant.

                In Germany we have something called the cumulative dissertation. You can get your PhD by pasting together three published articles (they can be under revision after submission, but not rejected). The informal number for accepting a cumulative is three journal articles. That’s something that has to be produced in three years, which is the usual official PhD period in Germany (in reality it is five; we don’t have official coursework).

                It is absolutely absurd to demand three journal articles in 3-5 years from a beginning graduate student, without also demanding that they just churn out whatever they can put on paper. The way it is set up, the setup actively discourages careful thinking.

              • Rahul says:

                In short, we deserve the crap we get. The system incentivizes quantity over quality. Academics produce what the system rewards.

                It’s naive & useless for us to compare the (crappy) product we generate to other obviously higher quality analysis & code.

              • Shravan says:

                Rahul wrote: “In short, we deserve the crap we get. The system incentivizes quantity over quality. Academics produce what the system rewards.
                It’s naive & useless for us to compare the (crappy) product we generate to other obviously higher quality analysis & code.”


              • Shravan says:

                Rahul, not only do academics produce what the system rewards, academics get actively punished if they don’t play the game. If there was zero cost to not playing the game, it would still be an improvement. There is only one situation when an academic can refuse to play the game and suffer no consequences, and that is when they are so famous that it doesn’t matter any more.

              • Rahul says:


                I don’t think academics are as helpless at this as you describe. You do have a choice. I can understand if Junior tenure-track faculty felt helpless but what about senior, tenured, relatively accomplished Profs?

                Why is Ioannidis writing 30+ papers a year? Why do famous people review papers cursorily in 15 mins? Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data? Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. Why do people deride peer review & yet continue to publish in peer reviewed journals?

                I think it’s easier to complain than to act. Plenty of double talk & agonizing. There’s a lot of lip-service to the idealized scientific enterprise but very few willing to match it with actions.

              • Shravan says:

                Rahul said:

                “Why is Ioannidis writing 30+ papers a year? Why do famous people review papers cursorily in 15 mins? Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data? Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. Why do people deride peer review & yet continue to publish in peer reviewed journals?”

                I can answer almost all these questions. Rahul, you would do the same in my position. You can’t see this because you don’t have to do any of this.

                1. “Why do famous people review papers cursorily in 15 mins? “

                This is unacceptable and I never do this. I spend a lot of time and effort on reviewing.

                2. “Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data?”

                Because they usually publish with untenured students (with these as first author), and these students will not get jobs if they don’t show a paper in a major journal. Open access journals don’t necessarily have poorer quality than top journals; but hiring panels look at the brand name.

                3. “Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. “

                The better way is to lead by example. I put up all my data on my home page (in most cases since 2010).

                4. “Why do people deride peer review & yet continue to publish in peer reviewed journals?”

                See 2.

              • Keith O'Rourke says:

                Thanks for posting this – such details help to make sense of many things I’ve experienced – how and why those hidden cliques are so prevalent “game an h-index is to churn out stuff in large quantities and then get your friends to cite you and you cite them back. This is also why one needs to remain friendly with everyone in your field; if you antagonize people, they won’t cite you, h-index will stagnate, funding will dry up”

                So to avoid this you would have to be senior and famous and also not want any students ever again.

              • Shravan says:

                Keith wrote: “So to avoid this you would have to be senior and famous and also not want any students ever again.”

                Also not care whether you get any third-party funding.

              • Hooray Shravan for putting down exactly how broken the system is. A back of the envelope calculation such as yours is all that is needed to see how stupid the publication counting is. To Rahul and those discussing publication count variability, also note that it’s much easier to publish many papers if someone else did all the data collection / experimenting. If you need to breed mice for 1 year to get 20 mice with a certain genetic defect, that’s a very different situation than if you have 45 different public access data sets published by government entities, the Pew Foundation, or the like. It’s not trivial to put a thoughtful query into the CDC Wonder mortality data and get back a useful dataset and build a model about say cancer incidence, but it’s a LOT easier, faster, and lower-cost than a year of breeding and screening mice.

              • Rahul says:


                No matter what you do I think it is humanely impossible for any Prof. to publish 65 papers in just over half a year (see Keith’s analysis of Ioannidis’ publication history for 2016) & yet exercise any reasonable degree of what are traditionally regarded as authorship responsibilities.

      • Martha (Smith) says:

        +1 to R — although Rahul brings up some good points as well. I think one part of the solution needs to be to try to change the culture (admittedly a big job) to value quality over quantity.

  14. Keith O'Rourke says:

    This stuff probably got into every business school and paper … its everywhere and everything to management.

    After her recent speech [April, 2016] at the University of Toronto’s Rotman School of Management, I got to ask Cuddy about what her research means to entrepreneurs. The answer: everything.

Leave a Reply