A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.

In 2005, Michael Kosfeld, Markus Heinrichs, Paul Zak, Urs Fischbacher, and Ernst Fehr published a paper, “Oxytocin increases trust in humans.” According to Google, that paper has been cited 3389 times.

In 2015, Gideon Nave, Colin Camerer, and Michael McCullough published a paper, “Does Oxytocin Increase Trust in Humans? A Critical Review of Research,” where they reported:

Behavioral neuroscientists have shown that the neuropeptide oxytocin (OT) plays a key role in social attachment and affiliation in nonhuman mammals. Inspired by this initial research, many social scientists proceeded to examine the associations of OT with trust in humans over the past decade. . . . Unfortunately, the simplest promising finding associating intranasal OT with higher trust [that 2005 paper] has not replicated well. Moreover, the plasma OT evidence is flawed by how OT is measured in peripheral bodily fluids. Finally, in recent large-sample studies, researchers failed to find consistent associations of specific OT-related genetic polymorphisms and trust. We conclude that the cumulative evidence does not provide robust convergent evidence that human trust is reliably associated with OT (or caused by it). . . .

Nave et al. has been cited 101 times.

OK, fine. The paper’s only been out 3 years. Let’s look at recent citations, since 2017:

“Oxytocin increases trust in humans”: 377 citations
“Does Oxytocin Increase Trust in Humans? A Critical Review of Research”: 49 citations

OK, I’m not the world’s smoothest googler, so maybe I miscounted a bit. But the pattern is clear: New paper revises consensus, but, even now, old paper gets cited much more frequently.

Just to be clear, I’m not saying the old paper should be getting zero citations. It may well have made an important contribution for its time, and even if its results don’t stand up to replication, it could be worth citing for historical reasons. But, in that case, you’d want to also cite the 2015 article pointing out that the result did not replicate.

The pattern of citations suggests that, instead, the original finding is still hanging on, with lots of people not realizing the problem.

For example, in my Google scholar search of references since 2017, the first published article that came up was this paper, “Survival of the Friendliest: Homo sapiens Evolved via Selection for Prosociality,” in the Annual Review of Psychology. I searched for the reference and found this sentence:

This may explain increases in trust during cooperative games in subjects that have been given intranasal oxytocin (Kosfeld et al. 2005).

Complete acceptance of the claim, no reference to problems with the study.

My point here is not to pick on the author of this Annual Review paper—even when writing a review article, it can be hard to track down all the literature on every point you’re making—nor is it to criticize Kosfeld et al., who did what they did back in 2005. Not every study replicates; that’s just how things go. If it were easy, it wouldn’t be research. No, I just think it’s sad. There’s so much publication going on that these dead research results fill up the literature and seem to lead to endless confusion. Like a harbor clotted with sunken vessels.

Things can get much uglier when researchers whose studies don’t replicate refuse to admit it. But even if everyone is playing it straight, it can be hard to untangle the research literature. Mistakes have a life of their own.

21 Comments

  1. Ruben Arslan says:

    This unfortunately doesn’t seem to be an isolated case at all in psychology.
    See my brief citation analysis of the studies that did and did not replicate in the Reproducibility Project Psychology: https://rubenarslan.github.io/posts/2018-09-23-are-studies-that-replicate-cited-more/
    Not only do citations not predict replicability (even I wasn't so optimistic), there isn't even a drop in citations for those studies that didn't replicate compared to those that did (as we might have expected from the stronger signal of retraction which does lead to a drop). Might not be the best set of studies to look at though (the RPP didn't particularly highlight individual studies), but you'd think that if someone builds on a study replicated in the RPP, they would check the details.

    • Anonymous says:

      “Not only do citations not predict replicability (even I wasn’t so optimistic), there isn’t even a drop in citations for those studies that didn’t replicate compared to those that did (as we might have expected from the stronger signal of retraction which does lead to a drop)”

      Should you not be familiar with ’em yet, the following two blog posts from 2012 & 2015 by Jussim might be relevant/interesting here:

      https://www.psychologytoday.com/us/blog/rabble-rouser/201207/social-psychological-unicorns-do-failed-replications-dispel-scientific

      https://www.psychologytoday.com/us/blog/rabble-rouser/201505/slow-nonexistent-scientific-self-correction-in-psychology

      I reason psychology is/has been in such a bad state (e.g. publication bias/file-drawing, questionable research practices) that the “input” (e.g. prior reasoning, theorizing, and studies) doesn’t matter much for the “output” (e.g. coming up with, performing, and writing up new studies). I reason this might be reflected by your quote above, and (the gist of) the two blogs I linked to.

      I tried to find a solution to (among other things) the problem you (and the blog post by Jussim) possibly make clear by trying to make it so that the “input” actually matters for the “output”. I reason this all especially (and perhaps only) begins to matter when some “higher” standards (e.g. higher powered research, pre-registration, using and interpreting “non-significant” findings, etc.) will be used. I thus also reason that it might be especially useful for those researchers that want to adhere to these “higher” standards to work together in relatively small groups of researchers (my guess would be around 5 could be “optimal”).

      I reason that a research and -publication format that focuses not only on both “original” studies and “replication”, but more importantly perhaps, what happens AFTER both of these types of research have been performed, could result in improving many possible current problematic issues in psychological science. I hereby also reason the starting point of a research process (e.g. an “original” or “replication” study) might be of less importance than that what happens after that. You can find the idea/format here: https://andrewgelman.com/2017/12/17/stranger-than-fiction/#comment-628652

  2. Martin Modrák says:

    Ruben Arslan did a nice exploratory analysis of whether studies that replicate get more citations than studies that don’t. His conclusion is that non-replication is not visible in the citation record: https://rubenarslan.github.io/posts/2018-09-23-are-studies-that-replicate-cited-more/

  3. currentevents says:

    Replication demands are a form of scientific bullying. Amy Cuddy says that she will not be silenced any more and that her career is thriving:

    https://twitter.com/amyjccuddy/status/1055146199093862401

    She says “I’m speaking up now about things that happened over the last few years because I’ve finally found the courage and sense of purpose to do so. Yes, I want to stand up for myself. I also want to demonstrate that others can do the same — and for each other.”

    https://twitter.com/amyjccuddy/status/1054863971902197760
    She says “I have a lot of feelings abt how my field, outside @Harvard, failed to stand up to clear bullying of a junior female scientist who’d already made significant contributions to her field and who clearly supported people from under-represented groups in several ways.”

    • Andrew says:

      Currentevents:

      Leaving aside questions about whether Cuddy has already made significant contributions to her field—it’s not my field and I’ll leave it to others to judge which contributions are significant and which are not—let me emphasize that I have not ever made a “replication demand.” Indeed, in many situations where replications are possible, I don’t recommend replication, as I think the designs are too noisy for replications to tell us anything useful (except for telling us the negative information that the study was so noisy).

      The original replication of power pose by Ranehill et al., which got this particular story going, was done by people who thought that power pose had a large effect. If you think something has a large effect, it makes sense to replicate the study to understand it further. If you’re skeptical about a study, you could recommend replication, but it can also make sense to recommend an investment in careful measurement instead. In any case, I don’t think it’s appropriate to demand replication. If you really want a study replicated, you can replicate it yourself.

      Finally, I fully respect Cuddy’s inclination to speak up for herself. I think that in speaking up she should be careful to accurately represent the actions of others. For example, she wrote that Joseph Simmons and Uri Simonsohn “are flat-out wrong. Their analyses are riddled with mistakes . . .” I never saw Cuddy present any evidence for these claims, and I don’t think it’s true that their analyses were flat-out wrong or riddled with mistakes. To the extent that I’ve inaccurately represented others, I’d like to correct that too.

    • Anonymous says:

      Another guy who was really bullied was that Martin Shkreli guy. I mean, they actually put him in JAIL when his financial reports of his hedge fund didn’t replicate. Jeez people, he’s already made substantial contributions to the pharma industry by buying up a relatively unknown drug, ensuring that the political system prevented anyone from competing with him even though the patent was gone, and then raising prices to hundreds of times what they used to be, which is good because…. reasons duh.

      • Andrew says:

        Anon, Nick:

        Just to be clear, yes, when it comes to power pose and replication issues more generally, I agree with Dana Carney, Anna Dreber, etc., and I do get annoyed when people present, as empirical statements, claims that are not supported by the data (for example, the claim that the effect of power posing on feelings of power has replicated 17 times). But I don’t think that the analogy between Cuddy and Shkreli is so great: Shkreli committed crimes and, at least in the short term, made a drug less available to people, which all seems much much much worse than anything that Cuddy did. I guess that Brian Wansink’s behavior falls somewhere in between: he didn’t mess around with anyone’s drugs, but he did his part to misallocate millions of dollars of public and private funds, which I’d expect to have negative consequences.

        • I don’t think the point is that Cuddy and Shkreli are equivalent; it’s that pretending Cuddy is being bullied because people want her scientific claims to be replicated is about as outrageous as claiming that Shkreli is being bullied because he said he had millions in his hedge fund when in fact the right number was $35 or whatever.

          Demanding that people not make false or misleading claims is not bullying.

    • Nick Danger says:

      “Replication demands are a form of scientific bullying.”

      Hmm, I suppose soon also demands that mathematical proofs be logical will be seen as “bullying” as well.

      • Martha (Smith) says:

        However, I don’t think that any woman mathematician would see such demands as “bullying” — after all, we expect our students to give logical mathematical proofs, so we’d be accusing ourselves and our colleagues of being bullies. (There might be some students who make such bullying claims — they either shape up or flunk.)

  4. Will Sorenson says:

    It seems like the issue we have is a result of the tools we have being limited and a lack of willingness to adopt new tools. What most researchers do when they look for a new paper is go to Google Scholar.

    What do they see? They see *all* papers that cite it regardless of the nature of that citation. For papers that have thousands of citations, it is not easy to find the small handful of papers that critically examine the claim of the paper.

    Researchers (and normal people!) would be far better served if, when searching Google Scholar, they saw links for two different types of citations instead of one:

    – Citations: (References that do not critically examine whether the claim of the paper is true)
    – Examinations: (References that do critically examine the validity of the claim). Someone should be able to think of a better word than examinations.

    So if someone is deciding whether to use a paper as evidence in their own papers, they can quickly look through the citations that are *examinations* and see whether the results have been replicated.
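
    A minimal sketch of what such a split could look like as a data structure (everything here is a hypothetical illustration; Google Scholar exposes no such field, so the "is_examination" flag would have to come from curators or classifiers):

    ```python
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Citation:
        citing_paper: str
        # Hypothetical flag: True if the citing paper critically examines the
        # cited claim (replication, re-analysis, critique); False if it merely
        # reuses the result.
        is_examination: bool

    def split_citations(citations: List[Citation]) -> Tuple[List[Citation], List[Citation]]:
        """Separate plain citations from critical examinations."""
        plain = [c for c in citations if not c.is_examination]
        examinations = [c for c in citations if c.is_examination]
        return plain, examinations

    # Example: a vastly simplified slice of the oxytocin-trust literature.
    records = [
        Citation("Survival of the Friendliest (2017)", is_examination=False),
        Citation("Nave, Camerer & McCullough (2015)", is_examination=True),
    ]
    plain, examinations = split_citations(records)
    print(f"{len(plain)} plain citation(s), {len(examinations)} critical examination(s)")
    ```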

  5. The story of intranasal oxytocin and trust is more than sad; it’s a scientific tragedy.

    The original Kosfeld paper drew very strong conclusions from noisy data (p = 0.029 *one-sided* -> title of “Oxytocin increases trust in humans”) and failed to make a key statistical comparison to establish a selective effect on trust.

    Kosfeld et al. (2005) spurred a wave of research on intranasal oxytocin and human trust, but most of it was conducted with insufficient sample sizes. Despite this, nearly all published results were statistically significant, a highly implausible state of affairs. Reviewing the literature in 2016, Walum, Waldman & Young concluded: “…intranasal OT studies are generally underpowered and that there is a high probability that most of the published intranasal OT findings do not represent true effects.”
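
    To make the “highly implausible” point concrete, here is a minimal simulation sketch; the 20% power and 25-study figures are illustrative assumptions, not numbers from Walum et al.:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative assumptions (not taken from Walum et al.): each study has
    # roughly 20% power, and the literature contains 25 independent studies.
    power, n_studies, n_sims = 0.20, 25, 100_000

    # Simulate how many studies per "literature" come out significant if every
    # study is run and reported honestly.
    n_significant = rng.binomial(n_studies, power, size=n_sims)

    # Chance that at least 90% of the studies are significant.
    frac = np.mean(n_significant >= 0.9 * n_studies)
    print(f"P(at least 90% of studies significant | 20% power) ~= {frac:.6f}")
    # Essentially zero: a literature of underpowered studies that is nearly all
    # "significant" points to selective reporting rather than a true effect.
    ```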

    As the published literature became more rife with seemingly positive results, it became nearly impossible to challenge the idea of oxytocin improving trust. One lab which had reported an initial set of ‘successful’ experiments had manuscripts reporting subsequent failures repeatedly rejected on the basis that the effect was now well established. This lab managed to publish just 39% of all the work it had conducted, all of it suggesting positive effects. Pooling over *all* their data, however, suggested little-to-no effect (see Lane et al., 2016).

    From this manufactured certainty, numerous clinical trials have been launched trying to improve social function in children with autism through intranasal oxytocin. These trials have not yet yielded strong evidence of a benefit. Unfortunately, almost 40% of the 261 children so far treated with oxytocin have suffered an adverse event (DeMayo et al. 2017; compared with only 12% of the 170 children assigned a placebo). Thankfully, most (93) of these adverse events were mild; but 6 were moderate, and 3 severe.
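
    Spelling out the arithmetic behind that comparison (event counts reconstructed approximately from the percentages quoted above):

    ```python
    # Approximate counts back-calculated from the quoted percentages.
    oxytocin_n, oxytocin_events = 261, round(0.40 * 261)   # ~104 adverse events
    placebo_n, placebo_events = 170, round(0.12 * 170)      # ~20 adverse events

    p_oxy = oxytocin_events / oxytocin_n
    p_pla = placebo_events / placebo_n
    diff = p_oxy - p_pla

    # Normal-approximation 95% interval for the difference in proportions.
    se = (p_oxy * (1 - p_oxy) / oxytocin_n + p_pla * (1 - p_pla) / placebo_n) ** 0.5
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"adverse-event rates: {p_oxy:.0%} vs {p_pla:.0%}; "
          f"difference {diff:.0%} (95% CI {lo:.0%} to {hi:.0%})")
    ```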

    Here’s the kicker for this story: oxytocin delivered through the nose is probably inert in terms of brain function; it may not be able to pass through the blood-brain barrier (reviewed by Leng & Ludwig, 2016). Some still dispute this, but it seems likely that the very large literature claiming behavioral effects of intranasal oxytocin on human behavior is completely and totally spurious. It’s been a colossal waste of money and time. It gave false hope to those with autism and needlessly harmed clinical trial participants. And the nightmare drags on as oxytocin->trust is *still* being cited and marketed as well-established science.

    It should have been possible to know better and do better… but somehow the illusions of certainty held up.

    If anyone is interested, I have a post-mortem of this field in press at The American Statistician; a pre-print is here: https://psyarxiv.com/3mztg/

    And here are the key references:

    DeMayo, M. M., Song, Y. J. C., Hickie, I. B., and Guastella, A. J. (2017), “A Review of the Safety, Efficacy and Mechanisms of Delivery of Nasal Oxytocin in Children: Therapeutic Potential for Autism and Prader-Willi Syndrome, and Recommendations for Future Research,” Pediatric Drugs, Springer International Publishing, 19, 391–410. https://doi.org/10.1007/s40272-017-0248-y.

    Lane, A., Luminet, O., Nave, G., and Mikolajczak, M. (2016), “Is there a Publication Bias in Behavioural Intranasal Oxytocin Research on Humans? Opening the File Drawer of One Laboratory,” Journal of Neuroendocrinology, 28. https://doi.org/10.1111/jne.12384.

    Leng, G., and Ludwig, M. (2016), “Intranasal Oxytocin: Myths and Delusions,” Biological Psychiatry, 79, 243–250. https://doi.org/10.1016/j.biopsych.2015.05.003.

    Walum, H., Waldman, I. D., and Young, L. J. (2016), “Statistical and Methodological Considerations for the Interpretation of Intranasal Oxytocin Studies,” Biological Psychiatry, Elsevier, 79, 251–257. https://doi.org/10.1016/j.biopsych.2015.06.016.

  6. Dan F. says:

    The fundamental, basic premise of the use of citation counts to measure research productivity is that citation count correlates with article quality.

    However, there is much evidence that the contrary is true. The article mentioned in this post is an extreme example, but perhaps it is the case that many (most?) highly cited articles are easily accessible to those who know little and are of little profundity. This does not contradict the claim that very useful articles and very deep articles are sometimes highly cited. The question is one of inference, of the simplest kind, and a good illustration that naive, pure thought reasoning can lead to bad priors.

    Most administrators would infer from 3000+ citations that an article was truly important and its authors worthy of promotion, pay raises, and other cheaper sorts of adoration. However, perhaps the correct inference, absent other information, is that they are charlatans, self-promoters, and cheaters.

    This issue needs far more attention than it is usually given.

  7. Mikhail says:

    Isn’t replication the simplest form of statistical inference? Like, after your study produces results, you want to check how reliable these results are. And the simplest way to do it is to run the whole study again.

    But during the last 100 years we have developed a lot of advanced statistical tools so you don’t need to waste your time performing every study multiple times. All you need is to ask your “appropriate statistical inference” how reliable your results are, and “statistical inference” would reply something like “wow, this is totally sure,” “maybe, a little bit uncertain,” or “you have not learned anything.” And then you go and collect more data or do additional experiments, depending on where your uncertainty is.

    But that is not how it works today. Your “statistical inference” tells you that p<0.000001, and this is interpreted as “your results are as reliable as 1+1=2.” And if you really trust your statistics, demands for replication look like bullying indeed.

  8. Jonathan (another one) says:

    “Data suggest that problematic research was approvingly cited more frequently after the problem was publicized.”
    http://www.gsood.com/research/papers/error.pdf

  9. Andrew [not Gelman] says:

    You should check out “sans forgetica”, the font designed to boost memory: https://www.washingtonpost.com/business/2018/10/05/introducing-sans-forgetica-font-designed-boost-your-memory/?noredirect=on&utm_term=.cbe85e8e4972

    You should also check out these recent review papers:
    1) Meyer, A., Frederick, S., Burnham, T. C., Guevara Pinto, J. D., Boyer, T. W., Ball, L. J., … & Schuldt, J. P. (2015). Disfluent fonts don’t help people solve math problems. Journal of Experimental Psychology: General, 144(2), e16.

    2) Xie, H., Zhou, Z., & Liu, Q. (2018). Null Effects of Perceptual Disfluency on Learning Outcomes in a Text-Based Educational Context: a Meta-analysis.

  10. Oliver C. Schultheiss says:

    Robert (further up) already referred readers to the excellent paper by Lane et al:

    Lane, A., Luminet, O., Nave, G., and Mikolajczak, M. (2016), “Is there a Publication Bias in Behavioural Intranasal Oxytocin Research on Humans? Opening the File Drawer of One Laboratory,” Journal of Neuroendocrinology, 28. https://doi.org/10.1111/jne.12384.

    For a further analysis of the state of the field in social neuroendocrinology, including an update on power posing, a self-critical review of my own research in this area and, perhaps most importantly, pointers to how things can get better in the future, this chapter may be of interest:

    Schultheiss, O. C., & Mehta, P. H. (in press). Reproducibility in social neuroendocrinology: Past, present, and future. In O. C. Schultheiss & P. H. Mehta (Eds.), Routledge international handbook of social neuroendocrinology. Abingdon, UK: Routledge. Preprint URL: http://www.psych2.phil.uni-erlangen.de/%7Eoschult/humanlab/publications/Schultheiss_Mehta_in_press.pdf

    The book also features many chapters that paint a more nuanced picture of oxytocin and social behavior (hint: it can also be associated with increased aggression!). The bottom line is that oxytocin is unlikely to be viewed as a “cuddle hormone” in the future. Just like cortisol is not simply the “stress hormone” or testosterone the “dominance hormone”. Psychophysiological measures are unlikely to have such a simple 1-to-1 relationship with psychological constructs and should have never been portrayed in this manner.

  11. Michael Nelson says:

    The solution here is remarkably simple and relatively easy to implement. There just needs to be a wiki, edited and contributed to by individuals with relevant and verified qualifications in the given field, with “provenance” pages for peer-reviewed papers. The initial objective would be to have a page for every paper that has a minimum of n citations, with n starting large and decreasing as more people contribute. Each entry would only need to provide three types of information for its paper: citations to prior papers that developed the concept being extended or limited in the current paper, citations to direct criticism or praise of the current paper, and citations to later papers that also extended or limited the concept in the current paper. Contributors could then embellish pages with narrative summaries, timelines, notes on citations, drop-down boxes for seeing the “family tree” in detail, links to other wiki articles, etc. Keep it open source, have professors encourage their grad students to read and contribute, and encourage researchers and journal reviewers to use it as a resource. Ideally, the main incentive for using it would be the fear of embarrassment of citing a paper that one’s colleagues can easily see is outdated.

    This sort of model (minus the wiki format) is already used in the antiquities field, where knowing the provenance or biography of a piece can be the difference between making 1 dollar or 1 million. Under our current system, the incentive is reversed: journals are the only ones making bank on old information, and they make more money when related research is scattered across a hundred pay-walled journals. I suspect researchers have tolerated it because ambiguity as to the validity of a widely-cited paper can benefit a famous author by allowing his or her work to eclipse later critiques or contradictions. Incentives are now moving in the opposite direction for researchers–if everyone loses faith in science, whence the famous scientist?–and the technology is here. Hopefully this type of solution either is or will soon be implemented.
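
    A minimal sketch of what one such “provenance” entry could look like as a data record; the field names are hypothetical labels for the three citation types described above:

    ```python
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ProvenancePage:
        """One wiki entry per peer-reviewed paper, as sketched above."""
        paper: str
        # Prior papers that developed the concept this paper extends or limits.
        antecedents: List[str] = field(default_factory=list)
        # Direct criticism or praise of this paper.
        appraisals: List[str] = field(default_factory=list)
        # Later papers that also extended or limited the concept.
        descendants: List[str] = field(default_factory=list)

    # Hypothetical entry for the paper discussed in this post.
    kosfeld_2005 = ProvenancePage(
        paper="Kosfeld et al. (2005), 'Oxytocin increases trust in humans'",
        appraisals=["Nave, Camerer & McCullough (2015): critical review"],
        descendants=["Lane et al. (2016): file-drawer analysis of one lab"],
    )
    ```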
