Cracks in the thin blue line

When people screw up or cheat in their research, what do their collaborators say?

The simplest case is when coauthors admit their error, as Cexun Jeffrey Cai and I did when it turned out that we’d miscoded a key variable in an analysis, invalidating the empirical claims of our award-winning paper.

At the other extreme, coauthors can hold a united front, as Neil Anderson and Deniz Ones did after some outside researchers found a data-coding error in their paper. Instead of admitting it and simply recognizing that some portion of their research was in error, Anderson and Ones destroyed their reputations by refusing to admit anything. This particular case continues to bother me because there’s no good reason for them not to want to get the right answer. Weggy was accused of plagiarism, which is serious academic misconduct, so it makes sense for him to stonewall and run out the clock until retirement. But Anderson and Ones simply made an error: Is admitting a mistake so painful as all that?

In other cases, researchers mount a vigorous defense in a more reasonable way. For example, after that Excel error was found, Reinhart and Rogoff admitted they made a mistake, and the remaining discussion turned on (a) the implications of the error for their substantive conclusions, and (b) the practice of data sharing. I think both sides had reasonable points in this discussion; in particular, yes, the data were public and always available, but the particular data file used by Reinhart and Rogoff was not accessible to outsiders. The resulting discussion moved forward in a useful way, toward a position that researchers who publish data analyses should make their scripts and datasets available, even when they are working with public data. Here’s an example.

But what I want to talk about today is when coauthors do not take a completely united front.

In the case of disgraced primatologist Marc Hauser, collaborator Noam Chomsky escalated with: “Marc Hauser is a fine scientist with an outstanding record of accomplishment. His resignation is a serious loss for Harvard, and given the nature of the attack on him, for science generally.” On the upside, I don’t think Chomsky actually defended Hauser’s practice of trying to tell his research assistants how to code his monkey data. I’m assuming that Chomsky kept his distance from the controversial research studies, allowing him to engage in an aggressive defense on principle alone.

Another option is to just keep quiet. The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05. This last sort of error has also been found in at least one other paper of Cuddy’s. Upon publication of the non-replication, all three of Carney, Cuddy, and Yap responded in a defensive way that implied a lack of understanding of the basic principles of statistical significance and replication. But after that, Carney and Yap appear to have kept quiet. [Not quite; see P.P.S. below.] Cuddy issued loud attacks on her critics, but her coauthors have perhaps decided to stay out of the limelight. I’m glad they’re not going on the attack, but I’m disappointed that they seem to want to hold on to their discredited claims. But that’s one strategy to follow when your work is found lacking: just stay silent and hope the storm blows over.

A final option, and the one I find most interesting, is when a researcher commits fraud or gross incompetence and does not admit it, but his or her coauthor will not sit still and accept this.

The most famous recent example was the gay-marriage-persuasion study of Michael Lacour and Don Green. When outsiders found out that the data were faked, Lacour denied it but Green pulled the plug. He told the scientific journal and the press that he had no trust in the data. Green did the right thing.

Another example is biologist Robert Trivers, who found out about problems in a paper he had coauthored—one coauthor had faked the data and another was defending the fraud. It took years for Trivers to get the journal to retract it.

My final example, which motivated me to write this post, came today in a blog comment from Randall Rose, a coauthor, with Promothesh Chatterjee and Jayati Sinha, of a social psychology study that was utterly destroyed by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, to the extent that Pashler et al. concluded that the data could not have happened as claimed in the paper and were consistent with fraud. Chatterjee and Sinha wrote horrible, Richard Tol-like defenses of their work (here’s a sample: “Although 8 coding errors were discovered in Study 3 data and this particular study has been retracted from that article, as I show in this article, the arguments being put forth by the critics are untenable”), but Rose did not join in:

I have ceased trying to defend the data in this paper, particularly Study 3, a long time ago. I am not certain what happened to generate the odd results (other than clear sloppiness in study execution, data coding, and reporting) but I am certain that the data in Study 3 should not be relied on . . .

I appreciate that. Instead of the usual the-best-defense-is-a-good-offense attitude, Rose openly admits that he did not handle the data himself and that he has no reason to vouch for the data quality or claim that the results still stand.

Wouldn’t it be great if everyone could do that?

It’s not an easy position, to be a coauthor in a study that has been found wanting, either through fraud, serious data errors, or simply a subtle statistical misunderstanding (such as that which led Satoshi Kanazawa to think that he could possibly learn anything about variation in sex ratios from a sample of size 3000). I find the behavior of Trivers, Green, and Rose in this setting to be exemplary, but I recognize the personal and professional difficulties here.
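To see why a sample of 3,000 is hopeless for studying variation in sex ratios, here is a back-of-the-envelope sketch in Python. The assumed size of real between-group differences (a few tenths of a percentage point) is an illustrative assumption, not a figure from this post:

```python
# Back-of-the-envelope check on the sex-ratio example: how precise is an estimate
# of the proportion of boy births from n = 3000, compared with the size of
# difference one might hope to detect? The 0.3-percentage-point "difference of
# interest" is an assumption for illustration.
import math

n = 3000
p = 0.515                          # roughly the population share of boy births
se = math.sqrt(p * (1 - p) / n)    # standard error of the estimated proportion

difference_of_interest = 0.003     # assumed size of real between-group differences

print(f"standard error with n = {n}: {100 * se:.2f} percentage points")
print(f"assumed difference of interest: {100 * difference_of_interest:.2f} percentage points")
print(f"difference / SE: {difference_of_interest / se:.2f}")
```

The difference being chased is only about a third of the standard error of a single group’s estimate, so any comparison at this sample size is essentially noise.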

For someone like Carney or Yap, it’s a tough call. On one hand, to distance themselves from this work and abandon their claims would represent a serious hit on their careers, not to mention the pain involved in having to reassess their understanding of psychology. On the other hand, the work really is wrong, the experiment really wasn’t replicated, the data really are too noisy to learn what they were hoping to learn, and unlike Cuddy they’ve kept a lower profile so it doesn’t seem too late for them to admit error, accept the sunk cost, and move on.

P.S. See here for a discussion of a similar situation.

P.P.S. Commenter Bernoulli writes:

It is not true that Carney has remained totally silent.


From Carney’s webpage, under Selected Publications:

[Four screenshots from Carney’s webpage, under Selected Publications, omitted.]

122 thoughts on “Cracks in the thin blue line”

  1. Here is how psychologists (and psycholinguists) defend against these objections from you, Andrew:

    “The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for,”

    But the key p-value was less than 0.05, the effect was *reliable*.

    “…second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05.”

    Even though it was miscalculated, the effect was *going in the right direction*. And a failed replication is not informative.

    Hence, no problem.

        • I have not yet asked these people what they mean by “reliable”. I will next time. However, a senior psychologist at a major university in Scotland who reviewed a paper of mine that used p-values wrote in his review that the p-values were not low enough, which, he said, raised questions about the reproducibility of the effects. In other words, the lowness of the p-value apparently tells you whether the effect is reproducible. I guess that’s what they teach you in psych, and that’s why “major” journals want to see “robust” effects (i.e., effects with low p-values).

        • Shravan – there was an “alternative” to the p-value called “p-rep” that was required for all articles submitted to Psychological Science for a few years. I discovered it when reading through original papers that were subjected to replication for the Reproducibility Project: Psychology. It is a function of the p-value and nothing else, but its alleged interpretation is “the probability of a significant effect in replication”. So at least one major psychology journal spent a while telling everyone that the reproducibility of an observed effect can be calculated from its p-value.
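          For readers who never met p-rep: it really is just a transformation of the p-value. Here is a minimal sketch of the usual style of calculation (the one-tailed/two-tailed convention and normal approximation are assumptions for illustration; the published formula may differ in detail):

```python
# Minimal sketch of a p-rep-style calculation. Everything is derived from the
# p-value alone, which is the point of the comment above. The two-tailed
# convention and normal approximation are assumptions for illustration.
from scipy.stats import norm

def p_rep(p, two_tailed=True):
    """Approximate 'probability that a replication goes in the same direction'."""
    p_one = p / 2 if two_tailed else p   # convert to a one-tailed p-value
    z = norm.ppf(1 - p_one)              # z-score implied by that p-value
    return norm.cdf(z / 2 ** 0.5)        # replication estimate has twice the sampling variance

for p in (0.05, 0.01, 0.001):
    print(f"p = {p:.3f}  ->  p_rep ~ {p_rep(p):.3f}")
```

          Whatever the exact convention, p-rep is a deterministic function of the p-value, so it cannot contain any information about reproducibility that the p-value does not already carry.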

  2. >”in particular, yes, the data were public and always available, but the particular data file used by Reinhart and Rogoff was not accessible to outsiders.”

    Even the actual file is sometimes not enough. Many times I have seen Excel automatically change values because it detects a date, etc. It also sometimes displays a different value than it saves, so opening the file (e.g., a csv) in different software will give different results. Just one person resaving the file after looking at it is enough to mess up the data. Download some genetics data from wherever and search for, e.g., the OCT1 gene; it won’t take long before you come across the problem…

    The entire data processing pipeline needs to be reproducible.
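    As a minimal scripted sketch of that point (the file name, column names, and date-mangling pattern below are made up for illustration):

```python
# If the pipeline is scripted, nothing silently reinterprets values the way a
# spreadsheet might (e.g., gene symbols such as OCT1 or SEPT2 turned into dates
# on open/resave). File name and column names here are hypothetical.
import pandas as pd

# Read every column as text first, so nothing is coerced behind your back...
raw = pd.read_csv("expression_data.csv", dtype=str)

# ...then convert only the columns you know are numeric, failing loudly on surprises.
for col in ["expression_level", "fold_change"]:
    raw[col] = pd.to_numeric(raw[col], errors="raise")

# Flag gene symbols that look like they were already date-mangled upstream.
suspicious = raw["gene"].str.match(r"^\d{1,2}-(Jan|Feb|Mar|Sep|Oct|Dec)$", na=False)
print(raw.loc[suspicious, "gene"])
```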

  3. My own favorite example is all of the co-authors on the couple of dozen papers by Fred Walumbwa that have been retracted, had corrections/corrigenda published, or continue to go uncorrected despite being widely known to be based on statistically impossible results. Almost all of the authors continue to proudly tout the articles on their websites, and none of them have asked for corrections to be made or to have their names removed from the papers.
    Many of these papers are discussed at length on pubpeer.com
    https://pubpeer.com/search?q=walumbwa&sessionid=A9E513C2E85EFC7FD36F&commit=Search+Publications

    • Shravan:

      Key bit from that statement:

      In addition, when our own work or other work we care about is on the receiving end of scientific criticism, we must be careful not to misperceive that criticism as an attack or harassment. . . . We must try our best to accept and learn from criticism, if we are to advance our shared goal of contributing to a cumulative and self-correcting science.

      But, later on, there’s this:

      By signing below, we are indicating not that we each agree wholeheartedly with every single sentence above . . .

      So I think Fiske and Bargh are OK, as they can just clarify that they did not agree wholeheartedly with those first two sentences above.

  4. I think it’s important to remember that co-authors are not all necessarily equal in terms of their ability to disagree with their collaborators. It’s all well and good for Don Green to admit there’s a problem, contradicting Lacour, but it would take some serious gumption for a junior co-author to break with his or her senior collaborators. Andy Yap, for example, was just a grad student when the power posing paper was published. That said, it looks like Yap and Carney have continued with this line of work, implicitly suggesting that they don’t see any problem with the original paper. Still, I think the point about the power differential is important to keep in mind.

    • As others have noted, Carney seems to have turned her back on the power posing research. However, I think your point about the power differential is pretty important. It is interesting that, of the three, Cuddy seems to have managed to leverage the research into not only a TED talk but a book deal, and probably a whole host of other gigs.

  5. This posting recklessly mischaracterizes its targets in ways that undermine the reader’s faith in Gelman’s points. Here are two important examples of inaccuracies or elisions that leave a genuinely false impression on the reader.

    First, Gelman describes a paper by Randall Rose, Promothesh Chatterjee and Jayati Sinha as “a social psychology study that was utterly destroyed by . . . ” The reader is obviously left with the notion that this particular social psychology study is crap. Crappy as it may be, there is absolutely nothing social psychological about this paper–it is written by business faculty from business schools in a marketing journal (Marketing Letters) studying use of cash vs. credit cards with a cognitive psychological account. This is akin to using a drone to bomb a terrorist but hitting a wedding party instead–it’s only acceptable to people who don’t care about the wedding party in the first place. But these are my people.

    Second, Gelman writes that the “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed . . . ”

    It’s true that the original study did not have high power. But the study had TWO main findings–one physiological, and one psychological. The physiological effect has not held up well in replication, but the psychological effect has. I quote from this “failed” replication:

    “Using two-tailed t tests, we replicated Carney et al.’s finding that participants in the high-power condition self-reported, on average, higher feelings of power than did participants in the low-power condition (mean difference = 0.245, 95% confidence interval, or CI = [0.044, 0.446]), t(193) = 2.399, p = .017, Cohen’s d = 0.344.” [Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women. Psychological Science, 0956797614553946.]

    Certainly, replicating the physiological effect would have been exciting. But I assure you that psychological DVs are quite exciting to psychologists, they are quite meaningful to our theories, and are acceptable as good science in a journal named “Psychological Science.”

    When lobbing your live ammunition, please aim carefully, please aim accurately, or expect to lose the very credibility some of these blog entries are designed to undermine in others.

      • What about the psychological variable – subjective feelings of power? Are there gender differences on this variable? I did think Crandall made an interesting point: there are two variables here. Does replication on the psychological variable justify continued attention to power posing? And the bigger question: when, and on what scientific evidence, should work be promoted as meaningful to people beyond academics? What I do not know: have Cuddy’s discussions of power posing changed in light of the cumulative findings and the response to critiques? I think these may be the more important questions in this particular case.

    • Hello Chris Crandall,

      If “there is absolutely nothing social psychological about this paper,” as you say, why did Pashler et al.’s critique appear in the journal BASIC AND APPLIED SOCIAL PSYCHOLOGY with the title “A social priming data set with troubling oddities”?

        • I cannot explain to you, Carol, why BASP printed this criticism. Prof. Pashler labels a lot of things as “social priming” when they have nothing social about them. I encourage you to read the original paper and find what’s social psychological about it. I have read it, I have read Pashler et al., and did not discover the why of it.

        • While I agree that “social priming” is a poor choice of label, the original study has the look and feel of a Social Psychology Priming study all over it. It uses a money prime to look at altruism. I mean for crying out loud the abstract has this gem “This is paradoxical because recent research suggests that mentioning money primes a self-sufficient mindset, thus undermining the very behaviors these organizations desire to elicit.” It’s as much Social Psychology priming research as Kathleen Vohs research that they cite.

          So I disagree with your conclusion that “Crappy as it may be, there is absolutely nothing social psychological about this paper,” and I also disagree that the facts you cite–“it is written by business faculty from business schools in a marketing journal (Marketing Letters) studying use of cash vs. credit cards with a cognitive psychological account”–support that conclusion. Social psychologists have been making inroads into business colleges for some time. For example, Vlad Griskevicius and Kathleen Vohs at the Carlson School of Management, University of Minnesota, and Yexin Jessica Li at the School of Business, University of Kansas, are all social psychologists.

          This isn’t so much a matter of the drone pilot missing as much as Social Psychology having for better or worse a very big tent.

        • Hi Chris Crandall,

          I’ve already read the original article (now retracted), Pashler et al.’s critique, the reply by each of the three authors, and the rejoinder by Pashler et al.

          I’ve just sent Hal Pashler and David Trafimow (the editor of BASP) a note asking them to respond; that seems the most efficient way to proceed.

        • I agree with 11th Recon that there could be a ‘big tent’ issue, but even then it seems to fall fairly solidly in the social psych realm. The original paper is looking at the psychological factors that affect donations; seems fairly social. As the authors say, “priming cash concepts reduces willingness to help others”. The articles by Vohs et al that they contrast themselves to are couched in social psych terms. It strikes me as being at least as much social psych as it might be cognitive psych.

        • It’s also worth noting that even if this was published in a marketing journal by people in business schools, all three of Cuddy, Carney, and Yap have degrees in psychology (bachelor’s for all, Ph.D.’s in social for Cuddy and Carney) and the top two authors have had positions in psych departments (post doc and joint affiliation for Carney, assistant professor and then joint affiliation for Cuddy). Yap’s web page specifically says that he studies “the psychology of power, status, and hierarchy” (http://www.andyjyap.com/about). They’re all psychologists, and I think would describe themselves as social psychologists.

        • My comment about “no social psychology” referred to Chatterjee, Rose & Sinha, “Why Money Meanings Matter in Decisions to Donate Time and Money”, from Marketing Letters.

          The power pose work is certainly done by social psychologists.

        • Sorry, I jumped controversial papers there between posts. For what it’s worth, Rose’s Ph.D. is in the psychology of marketing and Sinha has a number of publications in psychology journals.

    • Ah – but the effect on psychological variables only works for men – in both the Carney et al study and also in the Ranehill replication. Further, the authors were well aware of this gender moderation and even wrote about it in an earlier draft of their manuscript (which I have a copy of). Funnily enough, Amy Cuddy has gone around telling women all over the world how power posing will help them when she was completely aware of the fact that the psychological effect did not apply to women.

  6. Mark: I found nothing in Ranehill et al (2015) that suggests that the psychological effect on felt power was different by gender. Perhaps it is buried deeply within the supplemental files?

    • The data are publicly posted; they are there. Across all three studies (two by Carney et al. and one by Ranehill) the effect for men is four times as strong as for women.

    • As you may know, the Carney et al. paper also misreports the effects of the treatment on the willingness to gamble as significant – when it is non-significant. Their design also has a basic confound inasmuch as participants had the opportunity to gamble before the post-manipulation hormone measurements were taken – when we have tons of evidence that gambling (and even the opportunity to gamble) raises cortisol and testosterone. Basic stuff that the Psych Science reviewers and editor should have caught.

        • Not everyone actually gambled. The biggest effects on testosterone and cortisol are for actually gambling – something that not everyone did. The mere opportunity to do so has more modest effects.

  7. I’ve just had pretty much the situation that Andrew describes; I don’t know if I handled it well or not, and I’d be interested in others’ comments.

    A junior clinician/researcher did some reanalyses of data from a clinical trial that I had been closely involved with, in collaboration with a statistician. I wasn’t involved in the reanalysis, which I thought was terrible, though it followed the prevailing methodology of the field. It was submitted to a journal, and unsurprisingly (to me), the journal’s referees didn’t pick up on any statistical issues and the paper was accepted. I considered withdrawing as an author but eventually decided not to, but to include in the “contributions of authors” section that I wasn’t involved in the analysis. I didn’t insist on putting in that I disagreed with the analysis – maybe I should have? I found the situation difficult as some of the other authors are friends as well as colleagues, and were keen for me to stay involved. I wasn’t too bothered from a career point of view (for myself) but for the junior researcher it’s obviously a much bigger deal and I wanted to try to be supportive. After all, I might be wrong about the statistical issues. So maybe we’ve just ended up with a bit of a fudge and unsatisfactory compromise?

      • Mark and Andrew: In an article on RetractionWatch today, Carney states that she believes that the article should *NOT* be retracted.

        I wonder what Cuddy thinks of all this. And I wonder if this has anything to do with Cuddy’s turning down tenure at Harvard, as we discussed earlier.

        • Carol:

          As I’ve written elsewhere (I think it was a guest post on Retraction Watch, actually), I think that, in practice, retraction is not an effective tool for cleaning the scientific literature, as it has been associated so strongly with scientific misconduct. I think corrections work better.

        • Hi Andrew,

          I wasn’t arguing that it should be retracted. I was just mentioning that Carney’s opinion on this issue had been reported on RetractionWatch today.

          It will be interesting to see what PSYCHOLOGICAL SCIENCE decides to do, if anything.

        • I think that it needs to be retracted. There are literally millions of people (close to 50 million views on YouTube and TED) who think that this is a real effect. Further millions have been spent on purchasing the resultant book and on speaking fees for Amy Cuddy. A correction will not register with these consumers of the paper.

        • I reason it’s part of science to self-correct and having this be/remain part of the literature goes with that. I think that also allows researchers, and the general public, to keep reminding themselves that not everything that is published is necessarily “true”.

          If you have a look at the Wikipedia page of Cuddy and the section on “power posing”, there is already information regarding the statements from Carney, and recent replications, added to it. Similar information is posted in the comment section to the TED talk itself. I think that will help inform the general public, or those who bought the book, watched the TED talk, etc.

          I understand your expressed sentiment, but I wonder if important lessons can be learned from all this that outweigh it in the long run. I think the original article should stay part of the record. I think it serves as a very important reminder, for both scientists and the general public, of how science can/should work. I think the whole case is excellent teaching material as well.

      • Hi Mark,

        On an update to the Retraction Watch article about Carney, the editor of Psychological Science (Stephen Lindsay) says that there are currently no plans to retract the Carney, Cuddy, and Yap paper.

    • Mark:

      Wow. I wonder what Andy Yap thinks about all this. From his webpage:

      As predicted, results revealed that posing in high-power (vs. low-power) nonverbal displays caused neuroendocrine and behavioral changes for both male and female participants: High-power posers experienced elevations in testosterone, decreases in cortisol, and increased feelings of power and tolerance for risk; low-power posers exhibited the opposite pattern. In short, posing in powerful displays caused advantaged and adaptive psychological, physiological, and behavioral changes — findings that suggest that embodiment extends beyond mere thinking and feeling, to physiology and subsequent behavioral choices. That a person can, via a simple two-minute pose, embody power and instantly become more powerful has real-world, actionable implications.

  8. Here’s a question:

    What if some easy priming trick actually works for 5% of the population, but makes 5% equally worse off, and has no effect on 90%? Would you say that that is phony? Or that it’s something people might try for themselves and see if it works?

    I’m not saying that Power Posing or whatever has this distribution of effects, but it doesn’t seem implausible that some things really do work for a small fraction of the population, while being bad for a comparable small fraction. But would that be phony?

      • Maybe, maybe not.

        For example, drinking echinacea tea helps me fend off colds. I’ve been doing it for 20 years and it’s made my life better. On the other hand, nobody I’ve talked into trying it has reported that back to me that it works for them. Furthermore, echinacea is not sweeping the American marketplace. But, then again, it’s not disappearing either.

        My guess is that my good response to echinacea tea has something to do with the idiosyncrasies of my immune system.

        My hunch is that echinacea helps a single-digit percentage of the population. That wouldn’t be much, but then it wouldn’t be nothing either.

        Perhaps in a few decades genomic analysis will have advanced to the point where we can get our DNA analyzed and be told: “You should try echinacea tea.”

        Or maybe right now we can go to Whole Foods and buy a box of echinacea tea for $5.49 and try out whether it works for you or (probably) it doesn’t work for you.

        • It strikes me that philosophy of science hasn’t caught up with human diversity. Everybody looks to physics and astronomy for examples of how science works: if the General Theory of Relativity doesn’t work in every solar system in the galaxy, then it’s not very general, right?

          But the human sciences may not be all that much like that. Perhaps hypotheses in the human sciences sometimes tend to work for some people in some circumstances but not for others in others?

          Granted, most of the studies that don’t replicate well are probably just basically bogus, but after we clear out the obvious deadwood, what happens when we find things that are truly idiosyncratic? So, what I’m talking about is more theoretical, but it seems like a reasonable thing to be interested in.

          Also, it could provide a soft, non-humiliating fallback for academics whose research doesn’t replicate all that often: maybe it works for some people but not others?

        • Steve:

          Social science researchers in fields such as psychology, economics, and political science are aware of the idea of varying treatment effects. This idea is not well understood, though: the usual textbook discussions are in terms of constant effects. Indeed, one of my criticisms of the recent Imbens and Rubin book on causal inference is that I felt they talked too much about constant treatment effects. Imbens and Rubin understand that treatment effects can and do vary, but I think they felt that the simpler approach worked better for an introductory textbook.

          My take on all this is that, whether or not a particular study is bogus, there are true underlying effects. These effects are just a lot smaller and more variable than many people seem to think, and studying them scientifically will require a lot more work than the simple surveys and lab experiments we’ve been seeing.
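          As a toy illustration of the 5%/5%/90% hypothetical upthread (all numbers are arbitrary choices for the sketch, not estimates of any real effect), here is what such heterogeneity looks like in aggregate:

```python
# Toy simulation of the hypothetical above: a treatment that helps 5% of people,
# hurts 5% by the same amount, and does nothing for the remaining 90%. Effect
# size, noise level, and sample size are arbitrary choices for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200          # participants per arm, a typical-ish lab study
effect = 0.5     # benefit/harm in SD units for the affected subgroups

def one_study():
    # latent responder type: +1 helped (5%), -1 hurt (5%), 0 unaffected (90%)
    responder = rng.choice([1, -1, 0], size=n, p=[0.05, 0.05, 0.90])
    treated = rng.normal(0, 1, n) + effect * responder
    control = rng.normal(0, 1, n)
    return stats.ttest_ind(treated, control).pvalue

pvals = np.array([one_study() for _ in range(2000)])
print("average effect built into the simulation:", 0.05 * effect - 0.05 * effect)  # zero by construction
print("share of simulated studies with p < .05:", (pvals < 0.05).mean())           # about the 5% false-positive rate
```

          The average effect is exactly zero by construction, so a standard comparison of group means has nothing to find; detecting the subgroups themselves would require far more data and a model of who responds.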

        • When people like Amy Cuddy push their advice, I think, their implicit message is that it helps everyone or at least a majority of people.

          That’s the misinformation they peddle, and that’s annoying.

          It’d be a wholly different message if they declared something like, “We think you should try power poses but we think they only help 1 person out of every 1000 that tries them and right now we don’t know if it will help you or not.”

          But that isn’t how they sell their message.

        • Andrew,

          To be fair to Imbens and Rubin, they were actually amongst the first to make a substantial contribution to varying treatment effects when they wrote that LATE paper (together with Angrist), which discusses what IV methods might estimate when treatment effects vary.

        • I think we would try to find the variable that explains why something (tea, prayer) “works” for some and not others. Otherwise we assume that given enough people rolling dice, someone will get 10 6s in a row. Of course belief that it works will always be one of those potential variables. Also in the case of something like tea, one might say let’s control for that by doing a double blind study.

        • People are *horrible* at untangling effects like this by personal observation. (See the whole history of medicine, the current vitamin and supplement industry, etc.) So maybe echinacea tea actually does help 1% of people with their colds, but it’s probably far more likely that it kinda seems like it might be helping you because of random stuff (you happened to get better a little faster than usual a couple of times when you drank it) combined with confirmation bias.

        • If echinacea tea helps 1% of the population, hurts 1% of the population (maybe gives them heartburn) & just wastes money for the remaining 98% of the population then why is it a good idea to make a blanket recommendation: “Drink echinacea tea to fend off colds”?

          Unless you can identify the 1% population, a blanket recommendation makes little sense, a la Cuddy. If power poses indeed help 1% of the population she better tell us who they are.

        • Rahul:

          It’s not so much that power pose helps 1% of the population. Rather, I’d guess that it helps some people in some situations. And I doubt it has anything to do with the pose itself.

        • 0.005% of the population in 0.005% of their daily situations?

          Whatever the numbers, till we know which people & in exactly which situations, the advice seems not very useful.

        • – PS. Not sure what you mean by “has anything to do with the pose itself.”

          The way acupuncture and homeopathy “work” for some people: the effect is created by doing something/getting attention – it’s not the thing you do that produces the effect but the fact of doing something, getting attention from a health professional, etc.
          (sorry, not very well expressed)

        • @Simon

          Fair enough. But if the effect is still real why should we care?

          i.e. Say mere distraction or novelty of acupuncture makes a person with chronic pain feel better, so be it; why should I care (so long as the effect is reproducible, no other worse side effects etc.)

        • Rahul – yes, that’s a fair point. I guess it comes down to whether being right for the wrong reason matters. A few years ago there was an interesting case when the NICE guideline development group for back pain ended up recommending acupuncture. Certainly at least some of the group (maybe all) were not acupuncture believers, but took the pragmatic view that although it was just a theatrical placebo, there was evidence that it made people feel better, and it was unlikely to have any bad side effects. They got a lot of flak for appearing to legitimise a therapy that doesn’t work (or certainly doesn’t work in the way it is believed to – I guess we can’t rule out that it does something).

        • @Simon

          My impression is that we shouldn’t mix the two problems (a) Does it work? vs (b) Why does it work?

          If the evidence is strong & repeatable that it works that’d be good enough for a lot of purposes. The academics can then argue about why.

  9. I’m not an academic so have an outsider non-expert perspective.

    However, I find it pretty worrying that there seem to be so many errors in peer reviewed papers in respected journals – data coding errors, miscalculations.

    Whatever happened to checklists – known to be effective in reducing error in a number of fields such as aviation and surgery? A mistake in an academic paper is not life threatening, but nevertheless the use of a checklist to double check all calculations by a co-author or reviewer would have resulted in considerably less time and money being wasted, reputations being shredded….

    • I agree completely that much better “data hygiene” guidelines and practices are needed.

      I used to work for an org where any analysis needed to be reproduced at least once by an outsider to the project before it was allowed to leave our doors. What I learned from repeating this exercise dozens of times is that even the most careful work — involving high financial or public relations consequences, the expectation that an adversarial party would be scrutinizing it, use of scripting and not a point-and-click workflow, and subject to hours of discussion and dissection of the numbers — will still contain some errors. Fortunately these errors were usually second-order minor issues, but not always.

      My experience in academic research is that people are far less careful about checking their results than this, and many groups don’t even have anyone with auditing skills. People who aren’t experienced in programming often haven’t developed a debugging or edge-case hunting instinct. It takes solid attention to detail to observe and act on red flags like number of observations changing due to missingness. Sanity checks like comparing observed group means to those implied by a regression model require that the user understand the method, but many researchers and their assistants lack the quantitative background or knowledge of how to use their software to adequately explore their results and diagnose big issues like mixed up coding. All of this is compounded by non-reproducible workflows involving data processing in Excel, point-and-click software or entry on website calculators, and manual entry and rounding of numbers in manuscripts that are not consistently updated when data processing decisions change.

      The peer review system is not equipped to address these kinds of issues. Reviewers provide useful feedback at the level of topic novelty or experimental appropriateness, perhaps the choice of analysis methods, but rarely question implementation specifics because there’s not enough in the paper to raise the issue. Information that could tip a reviewer off to data errors, poor selection of analyses, or “forking paths” tours is typically left out, and a manuscript that did include extensive exploratory work would be tonally jarring and quite long compared with current journal norms. (Look at all the details Carney revealed in her recent power posing confessional that weren’t in their paper — can you imagine how the methods section would have read if they were included?) As such, mostly what we can do is flag post-publication errors, but someone has to be motivated to dig for them (step 0, obtaining the raw data, can stop them in their tracks) and that motivation is not necessarily benign. That leaves researchers feeling much more attacked and defensive about their errors than if these had been observed pre-submission by a friendly party, and so these disputes appear political rather than technical. Automated tools like statcheck (which checks results reported in APA format for internal inconsistencies and posts the results on PubPeer) are a step in the right direction in not targeting individuals, but they come too late in the research lifecycle, and these problems are of course not exclusive to psychology or hypothesis test reporting. (A minimal example of such a consistency check is sketched after the recommendations below.)

      For these reasons, I am concerned that the overall correctness of data analyses in published research is much lower than we want it to be, without even thinking about multiple comparisons and such. Here’s what I recommend to reduce the incidence of errors (much of which many people are already practicing, but not widely enough):

      – Raw data is sacrosanct. Any corrections, subsetting, or aggregations should be made programmatically and not to the original file. How the sample was collected, variable meaning, and value scales must all be documented.

      – Plots with individual observations should be included in any submission, as a supplement if not in the main body of a paper. It is not acceptable that graphics are currently used primarily for reporting aggregated results. (For instance, I believe the Lacour data fabrication could have been detected earlier if graphs showing individual subjects at the longitudinal timepoints were part of the submission.) Ban “dynamite” plots.

      – All analyses should be performed by scripting. Results reported in manuscripts should be automatically inserted using something like knitr. The skill in debugging and checking needed to become proficient in this should be viewed as a highly desirable side effect, not a time-sink obstacle to producing more research.

      – All analyses should be reproduced start-to-finish by another person before submission, ideally one not heavily involved in their original development and who is knowledgable enough to question the appropriateness of modeling and data processing choices.
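      For concreteness, here is a minimal sketch of the statcheck-style consistency check mentioned above; the reported statistic, degrees of freedom, and p-value are hypothetical:

```python
# Recompute the p-value implied by a reported test statistic and degrees of
# freedom, and compare it with the p-value printed in the paper. The numbers
# in the example call are made up for illustration.
from scipy import stats

def check_t_report(t_value, df, reported_p, two_tailed=True, tol=0.005):
    """Return (recomputed p, whether it matches the reported p within tol)."""
    p = stats.t.sf(abs(t_value), df)
    if two_tailed:
        p *= 2
    return p, abs(p - reported_p) <= tol

recomputed, consistent = check_t_report(t_value=2.40, df=193, reported_p=0.017)
print(f"recomputed p = {recomputed:.4f}, consistent with report: {consistent}")
```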

      • The other side to this is that it costs to get the kind of code / analysis quality you describe. Multiple highly trained, experienced programmers, spending hours or days poring over code and motivated to catch deep bugs doesn’t come cheap.

        Considering that the average grad student makes $25k a year and that the average Prof. churns out upwards of 10 papers a year we are probably getting the code quality we deserve?

        • 10+ papers per year? Wow! You must hang out with far more productive faculty than I do. I just checked the google scholar profiles of a couple of super stars in my area and they average four to five peer-reviewed journal articles a year. I would have estimated an average of two to three per year and I know that faculty at some institutions hardly ever publish at all.

        • No, really, I don’t think I’m exaggerating. Look at Andrew’s publication list:

          http://www.stat.columbia.edu/~gelman/research/published/

          18 papers in 2015.

          In science / engineering 10 per year is not at all unusual. Bob Langer from MIT has 37 papers listed for 2015 http://web.mit.edu/langerlab/publications/2015.html

          Of course, he’s more high-flyer than average, but that’s why we have 37 vs 10.

          For Ioannidis I counted an incredible 25 publications already listed for 2016 and there’s still many months to go:

          https://med.stanford.edu/profiles/john-ioannidis?tab=publications

        • Maybe someone needs to study if there is a correlation between publishing crappy Psych studies and a high number of publications per year.

          It seems like there’s a huge range in the publications per year across faculty.

        • We need to ask whether having your name on a paper just means “This work was performed in a lab / group managed by me but I’ve little responsibility for the details”

          Otherwise it’s hard to imagine how any Prof. might ever closely scrutinize that many papers alongside the gazillion other responsibilities they shoulder.

          Basically, having your name on a paper has become more about career reward points than authorship.

        • I think I can produce that many every year. I have eight or so PhD students, and if each reports on their work once a year with a paper, and I write two, that’s an easy goal to reach. There are people in Germany who produce 40-45 a year. No way that is possible without major automation and churning out of stuff. This is how you get 350+ papers (as Susan Fiske has). By contrast, I think that Goedel only produced 7-8 papers in his lifetime. Someone had suggested on this blog that each researcher should produce about that many in their lifetime. I agree with that, but it’s impossible to live like that. My university decides on my annual budget based on the *number* of publications I produce; they do not care what’s in them or where they come out. A landmark paper that took years to produce has the same value as a paper that has two experiments that were run in one month and quickly written up. For training graduate students to do research, I have no choice but to get them to start simple (replicating some claims in the literature is my usual starting point with them), and they must publish their results no matter what if they want a job at the end of the rainbow that is the PhD program. That’s how you end up with 10+ papers a year. This year I have 10, last year 11, only 3 in 2014 (I didn’t need a university budget that year as I had negotiated a budget for 2011-14).

        • The university and German NSF funding system demands that we write faster than we can think. The German NSF (which claims on paper that they don’t count publications) also counts number of publications; they rejected one of my proposals some years ago on the grounds that I had only three or four publications from previous funding that lasted four years. They expected that if I have funding for four years, I would produce more than 3-4 publications. Think about that. Running one experiment takes up to six months unless you are running with power 0.05. You need at least two experiments to publish a paper unless you are from MIT or Harvard (then, even one experiment is enough). One can run at most two experiments in parallel. So one takes six months to run two experiments, three months to analyze and write up and submit to a journal. Reviews including revise and resubmit actions can take up to 9 months to a year (more realistic: 1.5 years). If there are dependencies between experiments, i.e., if you cannot start an experiment until the results of the first one are known, then you are even slower. It is physically impossible to write more than 3 papers in a four year project, unless one is just churning out stuff mindlessly. That is what is wanted today.

          I know that the US NSF also does that; I have seen projects rejected there on these grounds (too few publications). The European Research Council (which gives out 2 million Euros like they are cups of take-away coffee) wants to see your h-index when deciding on your personal performance. Unless you are already a superstar, the only way to game an h-index is to churn out stuff in large quantities and then get your friends to cite you and you cite them back. This is also why one needs to remain friendly with everyone in your field; if you antagonize people, they won’t cite you, h-index will stagnate, funding will dry up, and you’ll be surfing on your smartphone all day taking selfies and admiring yourself. In that sense, science *is* a community, even though it should be a method.

        • Sorry, one more rant.

          In Germany we have something called the cumulative dissertation. You can get your PhD by pasting together three published articles (they can be under revision after submission, but not rejected). The informal number for accepting a cumulative is three journal articles. That’s something that has to be produced in three years, which is the usual official PhD period in Germany (in reality it is five; we don’t have official coursework).

          It is absolutely absurd to demand three journal articles in 3-5 years from a beginning graduate student without, in effect, demanding that they just churn out whatever they can put on paper. The way it is set up actively discourages careful thinking.

        • In short, we deserve the crap we get. The system incentivizes quantity over quality. Academics produce what the system rewards.

          It’s naive & useless for us to compare the (crappy) product we generate to other obviously higher quality analysis & code.

        • Rahul wrote: “In short, we deserve the crap we get. The system incentivizes quantity over quality. Academics produce what the system rewards.
          It’s naive & useless for us to compare the (crappy) product we generate to other obviously higher quality analysis & code.”

          Exactly.

        • Rahul, not only do academics produce what the system rewards, academics get actively punished if they don’t play the game. If there were zero cost to not playing the game, that would already be an improvement. There is only one situation in which an academic can refuse to play the game and suffer no consequences, and that is when they are so famous that it doesn’t matter any more.

        • Shravan:

          I don’t think academics are as helpless at this as you describe. You do have a choice. I can understand if junior tenure-track faculty felt helpless, but what about senior, tenured, relatively accomplished Profs?

          Why is Ioannidis writing 30+ papers a year? Why do famous people review papers cursorily in 15 mins? Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data? Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. Why do people deride peer review & yet continue to publish in peer reviewed journals?

          I think it’s easier to complain than to act. Plenty of double talk & agonizing. There’s a lot of lip-service to the idealized scientific enterprise but very few willing to match it with actions.

        • Rahul said:

          “Why is Ioannidis writing 30+ papers a year? Why do famous people review papers cursorily in 15 mins? Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data? Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. Why do people deride peer review & yet continue to publish in peer reviewed journals?”

          I can answer almost all these questions. Rahul, you would do the same in my position. You can’t see this because you don’t have to do any of this.

          1. “Why do famous people review papers cursorily in 15 mins? ”

          This is unacceptable and I never do this. I spend a lot of time and effort on reviewing.

          2. “Why do tenured, senior Profs. persist to publish in Journals that don’t demand open data?”

          Because they usually publish with untenured students (with these as first author), and these students will not get jobs if they don’t show a paper in a major journal. Open access journals don’t necessarily have poorer quality than top journals; but hiring panels look at the brand name.

          3. “Why do Profs. accept to review papers for Journals that don’t force authors to release data / code. ”

          The better way is to lead by example. I put up all my data on my home page (in most cases since 2010).

          4. “Why do people deride peer review & yet continue to publish in peer reviewed journals?”

          See 2.

        • Thanks for posting this – such details help to make sense of many things I’ve experienced – how and why those hidden cliques are so prevalent “game an h-index is to churn out stuff in large quantities and then get your friends to cite you and you cite them back. This is also why one needs to remain friendly with everyone in your field; if you antagonize people, they won’t cite you, h-index will stagnate, funding will dry up”

          So to avoid this you would have to be senior and famous and also not want any students ever again.

        • Keith wrote: “So to avoid this you would have to be senior and famous and also not want any students ever again.”

          Also not care whether you get any third-party funding.

        • Hooray Shravan for putting down exactly how broken the system is. A back of the envelope calculation such as yours is all that is needed to see how stupid the publication counting is. To Rahul and those discussing publication count variability, also note that it’s much easier to publish many papers if someone else did all the data collection / experimenting. If you need to breed mice for 1 year to get 20 mice with a certain genetic defect, that’s a very different situation than if you have 45 different public access data sets published by government entities, the Pew Foundation, or the like. It’s not trivial to put a thoughtful query into the CDC Wonder mortality data and get back a useful dataset and build a model about say cancer incidence, but it’s a LOT easier, faster, and lower-cost than a year of breeding and screening mice.

        • @Daniel

          No matter what you do, I think it is humanly impossible for any Prof. to publish 65 papers in just over half a year (see Keith’s analysis of Ioannidis’ publication history for 2016) & yet exercise any reasonable degree of what are traditionally regarded as authorship responsibilities.

      • +1 to R — although Rahul brings up some good points as well. I think one part of the solution needs to be to try to change the culture (admittedly a big job) to value quality over quantity.

  10. This stuff probably got into every business school and paper … it’s everywhere and everything to management.

    After her recent speech [April, 2016] at the University of Toronto’s Rotman School of Management, I got to ask Cuddy about what her research means to entrepreneurs. The answer: everything.

    http://business.financialpost.com/entrepreneur/fp-startups/how-adopting-wonder-womans-power-pose-might-change-the-outcome-of-a-meeting-or-even-your-destiny?__lsa=8e58-bd82
