I’ll reorder this week’s posts a bit to continue a topic that came up yesterday.
A couple of days ago, a reporter wrote to me asking what I thought of this paper on Money, Status, and the Ovulatory Cycle. I responded:
Given the quality of the earlier paper by these researchers, I’m not inclined to believe anything these people write. But, to be specific, I can point out some things:
- The authors define low fertility as days 8-14. Oddly enough, in their earlier paper these same authors used days 7-14. But according to womenshealth.gov, the most fertile days are between days 10 and 17. The choice of days affects the analysis, and it is not a good sign that the authors use different definitions in different papers. (See more on this point in sections 2.3 and 3.1 of this paper: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf)
- They perform a lot of different analyses, and many others could be performed. For example, “Study 1 indicates that ovulation boosts women’s tendency to seek relative gains when given the opportunity to possess products superior to those of other women. However, we expect that ovulation should not have the same effect on women’s choices if they have the opportunity to possess better products than men who are potential mates.” But if they found the pattern in both groups they could argue that this is consistent with their theory in a different way. They’re essentially playing a lottery where they make the rules, and they can keep coming up with ways in which they win.
- For another example, “Ethnicity, relationship status, and income had no effect on the dependent measures.” But if they had found something, this could’ve fit the story. Recall that their previous paper was all about relationship status!
- Yet another example: “A repeated measures logistic regression . . . revealed a significant interaction . . . however, when women compared their house relative to that of men, there was no difference . . .” Again, any pattern here would fit their story.
- Here’s another example: “As in Study 1, we next examined women’s choices relative to women across the full 28-day cycle.” But in their earlier paper, they did not do this: rather, they only compared days 7-14 to days 17-25, completely excluding days 1-6, 15-16, and 26-28.
- The authors say their results fit their model, for example, “Consistent with H3, we predicted that ovulation would lead women to give smaller financial offers to other women but not to men.” Where exactly is this “prediction”? Did the authors really predict this ahead of time in a public way, or are they just saying they predicted it? In either case, how many other things did they predict? It wouldn’t be so impressive if they predicted (or could have predicted) thousands of possible comparisons and then showed only the ones that worked.
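The multiple-comparisons concern running through the points above is easy to demonstrate with a quick simulation (a hypothetical sketch of my own, not the authors’ data or analysis): run many comparisons on pure noise, and a predictable fraction will come out “statistically significant.”

```python
import random
import statistics

random.seed(0)

def looks_significant(a, b, threshold=2.0):
    """Crude two-sample check: |t| > ~2 roughly corresponds to p < 0.05."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / se > threshold

# 1000 comparisons in which the true difference is exactly zero.
n_comparisons = 1000
hits = 0
for _ in range(n_comparisons):
    group_a = [random.gauss(0, 1) for _ in range(30)]
    group_b = [random.gauss(0, 1) for _ in range(30)]
    if looks_significant(group_a, group_b):
        hits += 1

print(f"{hits} of {n_comparisons} pure-noise comparisons look 'significant'")
```

Roughly 5% of the pure-noise comparisons clear the bar, so researchers who can choose among many day windows, subgroups, and outcome measures will rarely come up empty-handed.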
Perhaps the biggest problem with this study is that it purports to be all about effects within women, but it uses a between-subjects design. That is, they do _not_ interview women at multiple points during their cycles; instead, they compare different women. That makes this sort of study close to hopeless: there’s just too much variation from person to person. What they’re doing is finding patterns in noise. If you look hard enough—and they do—you’ll find statistically significant patterns. On top of that, they have a flexible theory that can explain just about anything.
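To see why the between-subjects design is such a problem here, consider a toy simulation (my own made-up numbers, not the paper’s): give each woman a small within-person fertility effect but much larger person-to-person variation. Comparing different women across phases buries the effect in between-person noise, while measuring the same women twice isolates it.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2  # small within-person shift at high fertility (hypothetical)
PERSON_SD = 2.0    # large person-to-person variation (hypothetical)
NOISE_SD = 0.5     # measurement noise
N = 100            # women per group

# Between-subjects: different women in each fertility group.
low = [random.gauss(0, PERSON_SD) + random.gauss(0, NOISE_SD) for _ in range(N)]
high = [random.gauss(0, PERSON_SD) + TRUE_EFFECT + random.gauss(0, NOISE_SD)
        for _ in range(N)]
between_est = statistics.mean(high) - statistics.mean(low)
between_se = (statistics.variance(low) / N + statistics.variance(high) / N) ** 0.5

# Within-subjects: the same women measured in both phases; each woman's
# personal baseline cancels when we take her own difference.
baselines = [random.gauss(0, PERSON_SD) for _ in range(N)]
diffs = [(b + TRUE_EFFECT + random.gauss(0, NOISE_SD))
         - (b + random.gauss(0, NOISE_SD)) for b in baselines]
within_est = statistics.mean(diffs)
within_se = (statistics.variance(diffs) / N) ** 0.5

print(f"between-subjects estimate: {between_est:+.2f} (se ~ {between_se:.2f})")
print(f"within-subjects  estimate: {within_est:+.2f} (se ~ {within_se:.2f})")
```

With these numbers the between-subjects standard error is larger than the true effect itself, while the within-subjects design estimates the effect with several times the precision, which is exactly why comparing different women tells you very little about effects within women.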
To conclude, I’m not saying I think the authors “cheated” or “fished for statistical significance.” I have no idea how they did their analysis, but they very well could’ve made various data coding and data analysis decisions after seeing the data, and in addition they have a lot of choices in how to interpret their results in light of their theories. I think it’s hopeless. They might as well be reading tea leaves.
The reporter thanked me and wrote that he was still trying to figure out whether to write about the paper.
This made me think a bit: it’s not an easy question. I’m negative on the paper, but others might be positive. Beyond that, there’s a selection issue. Suppose you happen to be convinced that the article is worthless, and so you decide not to run the story. But somewhere else there is a reporter who swallows the press release hook, line, and sinker. This other reporter would of course run a big story. Hence the selection bias: the stories that do get published are likely to repeat the hype. Which in turn gives researchers and public relations people a motivation to do the hype in the first place. On the other hand, if you do run a skeptical story, then you’re continuing to give press to this silly study, giving it more attention than (in my opinion) it deserves.
This is a reporter’s dilemma that echoes the discussion I was having with Jeff Leek about when to shoot down what we believe to be bad work and when to ignore it. If lots of other people are paying attention to the paper, then I think a journalist is doing a service by shooting it down. But if the paper is basically being ignored (or perhaps just being treated as a “politically incorrect” oddity), then there’s no point in pulling it up from obscurity just to go on about what’s wrong with it.
The problem, as I see it, is when a claim presented with (essentially) no evidence is taken as truth and then treated as a stylized fact. And the norms of scientific publication, as well as the norms of science journalism, push toward this. If you sound too uncertain in your scientific report, I think it becomes harder to get it published in a top journal (after all, journals want to present “discoveries,” not “speculations”). And science journalism often seems to follow the researcher-as-Galileo mold.