Sniffing tears perhaps not as effective as claimed

Marcel van Assen has a story to share:

In 2011 a rather amazing article was published in Science where the authors claim that “We found that merely sniffing negative-emotion-related odorless tears obtained from women donors induced reductions in sexual appeal attributed by men to pictures of women’s faces.”
The article is this:
Gelstein, S., Yeshurun, Y., Rozenkrantz, L., Shushan, S., Frumin, I., Roth, Y., & Sobel, N. (2011). Human tears contain a chemosignal. Science, 331(6014), 226-230.

Ad Vingerhoets, an expert on crying, and a coworker Asmir Gračanin were amazed by this result and decided to replicate the study in several ways (my role in this paper was minor, i.e. doing and reporting some statistical analyses when the paper was already largely written). This resulted in:
Gračanin, A., van Assen, M. A., Omrčen, V., Koraj, I., & Vingerhoets, A. J. (2016). Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness? Cognition and Emotion, 1-12.

The paper failed to replicate the findings in the original study.

Original findings that do not get replicated are not special, but unfortunately core business. What IS striking, however, is the response of Sobel to the article of Gračanin et al (2016). See …
Sobel, N. (2016). Revisiting the revisit: added evidence for a social chemosignal in human emotional tears. Cognition and Emotion, 1-7.

Sobel re-analyzes the data of Gracanin et al, and after extensive fishing (with p-values just below .05) he concludes that the original study was right and the Gracanin et al study bad. Irrespective of whether chemosignalling actually exists, Sobel’s response is imo a beautiful and honest defense, where p-hacking is explicitly acknowledged and its consequences not understood.

We also wrote a short response to Sobel’s comment, commenting on the p-hacking of Sobel.
Gračanin, A., Vingerhoets, A. J., & van Assen, M. A. (2016). Response to comment on “Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness?”. Cognition and Emotion, 1-2.

To save time, if you’re interested, I recommend reading Sobel (2016) first.

I asked Assen why he characterized Sobel’s horrible bit of p-hacking as “a beautiful and honest defense,” and he [Assen] responded:

I think it is beautiful (in the sense that I like it) because it is honest. I also think it is a beautiful and excellent example of how one should NOT react to a failed replication, and of NOT understanding how p-hacking works.

This is about emotions; although I was involved in this project, I ENJOYED the comment of Sobel because of its tone and content, even though I did not agree with its content at all.

Our response to Sobel’s comment supports the fact that Sobel has been p-hacking. Vingerhoets asked BEFORE the replication if it mattered Tilburg had no lab, and Sobel says ‘no’, and AFTERWARDS when the replication fails he believes it IS a problem.

None of this is new, of course. By this time we should not be surprised that Science publishes a paper with no real scientific content. As we’ve discussed many times, newsworthiness rather than correctness is the key desideratum in publication in these so-called tabloid journals. The reviewers just assume the claims in submitted papers are correct and then move on to the more important (to them) problem of deciding whether the story is big and important enough for their major journal.

I agree with Assen that this particular case is notable in that the author of the original study flat-out admits to p-hacking and still doesn’t care.
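To see why digging through a null dataset until something dips below .05 is not evidence, here is a minimal simulation sketch. It is not Sobel's analysis or anyone's actual data: it assumes independent subgroup comparisons, unit-variance normal outcomes, and a simple z-test, and the function names are made up for illustration. With no true effect anywhere, one pre-specified comparison comes up "significant" about 5% of the time, while the chance that at least one of ten comparisons does is 1 - 0.95^10, or about 0.40.

```python
import random
from statistics import NormalDist, mean


def p_value_two_groups(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def analytic_rate(n_comparisons, alpha=0.05):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_comparisons


def false_positive_rate(n_comparisons, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=1):
    """Share of simulated null studies with at least one 'significant' comparison."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        found = False
        for _ in range(n_comparisons):
            # Both groups come from the same distribution: there is no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if p_value_two_groups(a, b) < alpha:
                found = True
                break  # the "fisher" stops at the first p < alpha
        hits += found
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, round(analytic_rate(k), 3), false_positive_rate(k))
```

Real forking paths are correlated rather than independent, so the exact numbers will differ, but the direction is the same: the more ways you allow yourself to slice the data after seeing it, the less a p-value just under .05 means.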

Gračanin et al. tell it well in their response:

Generally, a causal theory should state that “under conditions X, it holds that if A then B”. Relevant to our discussion in particular and evaluating results of replications in general are conditions X, which are called scope conditions. Suppose an original study concludes that “if A then B”, but fails to specify conditions X, while the hypothesis was tested under condition XO. The replication study subsequently tested under condition XR and concludes that “if A then B” does NOT hold. Leaving aside statistical errors, two different conclusions can be drawn. First, the theory holds in condition XO (and perhaps many other conditions) but not in condition XR. Second, the theory is not valid. We argue that the second explanation should be taken very seriously . . .

They continue:

What seems remarkable and inconsistent is that Sobel regards some of our as well as Oh, Kim, Park, and Cho’s (2012; Oh) findings as strong support for his theory, despite the fact that there was no sad context present in these studies. Apparently, in case of a failure to find corroborating results, the sad context is regarded crucial, but if some of our and Oh’s findings point in the same direction as his original findings, the lack of sad context and exact procedures are no longer important issues.

And this:

Sobel concludes that we did not dig very deep in our data to probe for a possible effect. That is true. We did not try to dig at all. Our aim was to test if human emotional tears act as a social chemosignal, using a different research methodology and with more statistical power than the original study; we were not on a fishing expedition.

I find the defensive reaction of Sobel to be understandable but disappointing. I’m just so so so tired of researchers who use inappropriate statistical methods and then can’t let go of their mistakes.

It makes me want to cry.

14 thoughts on “Sniffing tears perhaps not as effective as claimed”

  1. I am not clear on why the replication study did not attempt to be an exact replication. Why not just attempt to replicate the work as exactly as possible?

    This is far from the first time I am seeing this problem. It seems like the consistent failure of psychologists to do that as part of this replication effort is somehow creating an even bigger mess than originally generated by the NHST + p-hacking “method”. I didn’t think there was any room for things to get worse…

    • Anon:

      I don’t actually think exact replication is so important. What is of interest are the larger scientific questions, not the particular experiment done to measure it. We want good measurements that are tied to theoretical understanding. If a research team publishes a paper that is full of noise, it could still motivate others to perform follow-up studies—that’s fine—but I don’t see why the later researchers should feel any push toward replicating the original study exactly. Better, I’d say, to learn from the mistakes and do something better.

      • But if the researchers alter the replication design, then we wouldn’t really know if the initial study was replicable or not, hence whether the claim is approximately true or not. Shouldn’t the core objective of replication be to confirm or reject whether a claim stands in other contexts? If we don’t replicate then we can’t really learn from any mistakes.

        Of course I follow your point in saying that we shouldn’t be tied to a particular design just to prove that point. I think the ideal scenario would be if we could easily replicate any experiment quickly and rigorously, so we can move to other alternative experiments of the same claim (which is what you’re saying by stating that replication is not important but rather concentrating on larger scientific questions.)

        • Jorge:

          I think the mistake is to frame the question as, “whether the claim is approximately true or not.” If a study is done with noisy measurements and forking paths, at best we can consider it as exploratory work, in which case it makes sense in any future study to combine theory with the ideas from that exploratory work to decide what to look at next.

          Here I’m pushing against the attitude that the original published claims are considered to be correct until proven otherwise. If the original experiment were done cleanly, then, sure, it can make sense to replicate it exactly and work from there. But this experiment was not at all clean. Look at the paper and you’ll see multiple forking paths and a loosely defined theory that would be consistent with just about any data the researchers might find. Back in 2010, though, we weren’t so aware of such issues and so it’s not such a surprise that the paper got published in a prestigious tabloid.

        • Andrew, I actually think that this is a good example of how conclusions from replications should (also) be qualified. I read Sobel’s response and in this case it seems that the replicators did a rather lousy job. Most striking is the combination of two samples in Exp 2 after the first led to (undesirable?) replication. What is this? Null hacking? Also, I disagree with you on this one. It’s not a “general case” but rather a specific experiment that, like any other, is replete with auxiliary assumptions. You wouldn’t be blogging about this one if the replication had been done in the dark, right? I think that Sobel’s point about the difference in attractiveness of the stimuli may be key, and any rigorous attempt to verify his results should have used the exact same stimuli. It is possible that the reported effect was merely noise, but this replication adds very little evidence that this is the case (and gets a publication in Emotion on the fly).

    • One has to remember that these studies purport to have applicability beyond the exact circumstances — i.e. that they generalize to a population that is of interest.

      So, we see studies which were done using freshmen psychology students forced to volunteer being touted as applying to people in general.

      So, the replicators need to stay close to the original study in intent (i.e. well within the framework of applicability that the original researchers have implied), but don’t need to replicate exactly.

  2. The original study actually seems kind of interesting in the ideas it tried to explore. While the evolutionary “what is the function of tears?” approach is debatable, it would have been interesting to see if tears transmit some information to those around us. And, if so, how is this information transmitted? So it was disappointing that the study leapt straight to assuming that there was a link but never really had me convinced by any data or priors! And so, surely the replicable science stopped here… “Participants failed to discriminate the smell of tears from the smell of saline [mean correct = 31 ± 14%, t(23) = 0.81, P < 0.659; Fig. 1C], indicating that emotional tears did not have a discernable odor.” In the absence of a mechanism to link VAS ratings to the chemical makeup of tears, one is without a plausible mechanism linking the two neurological processes, irrespective of what later was observed.

    Similarly, sexual arousal might have been better measured using a PPG/polygraph mix (relatively standard as part of the assessment of deviant arousal here in NZ).

    It was interesting later that, “Finally, and critically, levels of salivary testosterone were progressively lower after sniffing tears as compared to the baseline period [baseline testosterone = 151.96 ± 76 pg/ml, last testosterone = 132.66 ± 63.1 pg/ml, t(49) = 3.3, P < 0.001] (Fig. 3G), an effect not evident for saline [baseline testosterone = 154.8 ± 74.4 pg/ml, last testosterone = 154.34 ± 101.8 pg/ml, t(49) = 0.81, P = 0.96]” But there is a lot of overlap and variability in results and still no linkage mechanism. Also, on measurement accuracy: an endocrinologist colleague explains that this may not be as robust a measure of anything as the impression of doing a chemical assay implies…

    It seems a shame that measurement accuracy and linkage mechanisms neurologically were not the areas of debate, and instead there were statistical justifications — data trumps (am I allowed to use that word still?!) inference every time.

    • “surely the replicable science stopped here… “Participants failed to discriminate the smell of tears from the smell of saline [mean correct = 31 ± 14%, t(23) = 0.81, P < 0.659; Fig. 1C], indicating that emotional tears did not have a discernable odor.” In the absence of a mechanism to link VAS ratings to the chemical makeup of tears, one is without a plausible mechanism linking the two neurological processes, irrespective of what later was observed.”

      Reminiscent of homeopathy (where multiple dilutions mean you may have a bottle that does not contain even a single molecule of the original therapeutic substance).

      • > “Reminiscent of homeopathy (where multiple dilutions mean you may have a bottle that does not contain even a single molecule of the original therapeutic substance).”

        Except they (are supposed to; I am sure some are just scammers) “dilute” by skimming off the top or side of the container after shaking it. Not a fan of homeopathy at all, just saying the physical model all the debunkers use is wrong.

  3. So I got to the end of this post Andrew and it seems to me several things are often not discussed in the pressure to persist in the truth of a pretty well falsified result. In this particular case it seems like a relatively minor issue but often times the non-replications are undercutting very large motivators. With bilingualism and power pose it’s easy to see how the principals have huge motivation to stick to their guns.

    It’s not just that a researcher doesn’t want to admit they are wrong. They’re sitting on perhaps a million dollar grant that was justified by these initial findings and that requires future findings to maintain. Often times the conditions of receiving this money depend on sticking with the original research plan and confirming the results. You’ve got employees, part of your salary, your reputation, a huge interconnected network of motivations driving the researcher not to admit that they were incorrect.

    And yeah, makes me want to cry too.
