Last week I published in Slate a critique of a paper that appeared in the journal Psychological Science. That paper, by Alec Beall and Jessica Tracy, found that women at peak fertility were three times more likely to wear red or pink shirts than women at other points in their menstrual cycles. The study was based on 100 internet participants and 24 college students. In my critique, I argued that we had no reason to believe the results generalized to the larger population, because (1) the samples were not representative, (2) the measurements were noisy, (3) the researchers did not use the correct dates of peak fertility, and (4) there were many different comparisons that could have been reported from the data, so there was nothing special about any particular comparison being statistically significant. I likened their paper to other work that I considered flawed because of multiple comparisons (too many researcher degrees of freedom), including a claimed relation between men’s upper-body strength and political attitudes and the notoriously unreplicated work by Daryl Bem on ESP. These were two papers that, like Beall and Tracy’s, were published in top peer-reviewed psychology journals.
Tracy and Beall responded to me, and I thought it only fair to post this response on my website. I will also ask Slate to add a paragraph at the end of my article, linking to their response.
Below is Tracy and Beall’s response, along with my brief comments. (I give the above summary of my argument to provide background for those who are coming to this story for the first time.)
OK, here are Tracy and Beall:
While we agree with several of Andrew Gelman’s broad concerns about current research practices in social psychology (see “Too Good to Be True”), much of what he said about our article, “Women are more likely to wear red or pink at peak fertility,” recently published in Psychological Science, was incorrect. Unfortunately, Gelman did not contact us before posting his article. Had he done so, we could have clarified these issues, and he would not have had to make the numerous flawed assumptions that appeared in his article. Here, we take the opportunity to make these clarifications, and also to encourage those who read Gelman’s post to read our published article, available here, and the Online Supplement, available here.
We want to begin with the issue that received the greatest attention, and which Gelman suggests (and we agree) is most potentially problematic: that of researcher degrees of freedom. Gelman makes several points on this issue; we respond to each in turn below.
a) Gelman suggests that we might have benefited from researcher degrees of freedom by asking participants to report the color of each item of clothing they wore, then choosing to report results for shirt color only. In fact, we did no such thing; we asked participants about the color of their shirts because we assumed that shirts would be the clothing item most likely to vary in color.
b) We categorized shirts that were red and pink together because pink is a shade of red; it is light red. The theory we were testing is based on the idea that red and shades of red (such as the pinkish swellings seen in ovulating chimpanzees, or the pinkish skin tone observed in attractive and healthy human faces) are associated with sexual interest and attractiveness (e.g., Coetzee et al., 2012; Deschner et al., 2004; Re, Whitehead, Xiao, & Perrett, 2011; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012; Whitehead, Ozakinci, & Perrett, 2012). Thus, our decision to combine red and pink in our analyses was a theoretical one.
c) We are confused by Gelman’s comment that, “other colors didn’t yield statistically significant differences, but the point here is that these differences could have been notable.” That these differences could have been notable is part of what makes the theory we were testing falsifiable. A large body of evidence suggests that red and pink are associated with attractiveness and health, and may function as a sexual signal at both a biological and cultural level (e.g., Burtin, Kaluza, Klingenberg, Straube, & Utecht, 2011; Coetzee et al., 2012; Elliot, Tracy, Pazda, & Beall, 2012; Elliot & Pazda, 2012; Guéguen, 2012a; Guéguen, 2012b; Guéguen, 2012c; Guéguen & Jacob, 2012; 2013a; 2013b; Jung, Kim, & Han, 2011a; Jung et al., 2011b; Meier et al., 2012; Oberzaucher, Katina, Schmehl, Holzleitner, & Mehu-Blantar, 2012; Pazda, Elliot, & Greitemeyer, 2012; 2013; Re, Whitehead, Xiao, & Perrett, 2011; Roberts, Owen, & Havlicek, 2010; Schwarz & Singer, 2013; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012). In order to test the specific prediction emerging from this literature, that fertility would affect women’s tendency to wear red/pink but not their tendency to wear other colors, we ran analyses comparing the frequency of women in high- and low-conception-risk groups wearing a large number of different-colored shirts. The results of these analyses are reported in detail in the Online Supplement to our article (which includes a figure showing all frequencies). If any of these analyses other than those of pink and red had produced significant differences, we would have failed to support our hypothesis.
Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case. We conducted these studies with the sole purpose of testing one specific hypothesis: that conception risk would increase women’s tendency to dress in red or pink. This hypothesis emerges quite clearly from the large body of work mentioned above, which includes a prior paper we co-authored (Elliot, Tracy, Pazda, & Beall, 2012). We came up with the hypothesis while working on that paper, and were in fact surprised that it hadn’t been tested previously, because it seemed to us like such an obvious possibility given the extant literature. The existence of this prior published article provides clear evidence that we set out to test a specific theory, not to conduct a fishing expedition. (See also Murayama, Pekrun, & Fiedler, in press, for more on the role of theory testing in reducing Type I errors).
d) Our choice of which days to include as low-risk and high-risk was based on prior research, and, importantly, was determined before we ran any analyses. Gelman is right that there is a good deal of debate about which days best reflect a high conception-risk period, and this is a legitimate criticism of all research that assesses fertility without directly measuring hormone levels. Given this debate, we followed the standard practice in our field, which is to make this decision on the basis of what prior researchers have done. We adopted the Day 6-14 categorization after finding that this is the categorization used by a large body of previously published, well-run studies on conception risk (e.g., Penton-Voak et al., 1999; Penton-Voak & Perrett, 2000; Little, Jones, & Burriss, 2007; Little & Jones, 2012; Little, Jones, & DeBruine, 2008; Little, Jones, Burt, & Perrett, 2007; Farrelly, 2011; Durante, Griskevicius, Hill, & Perilloux, 2011; DeBruine, Jones, & Perrett, 2005; Guéguen, 2009; Gangestad & Thornhill, 1998). Although the exact timing of each of these windows is debatable, it is not debatable that days 0-5 and 15-28 represent a window of lower conception risk than days 6-14.
Furthermore, if our categorization did result in some women being mis-categorized as low-risk when in fact they were high risk, or vice-versa, this would increase error and decrease the size of any effects found. Most importantly, we did not decide to use this categorization after comparing various options and examining which produced significant effects. Rather, we adopted it a priori and used it and only it in analyzing our data; no researcher degrees of freedom came into play.
e) In any study that assesses conception risk using a self-report measure, certain women must be excluded to ensure that those for whom risk was not accurately captured do not erroneously influence results. All of the exclusions we made were based on those suggested by prior researchers studying the psychological effects of conception risk, such as excluding women with irregular cycles (as it is more difficult to accurately determine when they are likely to be at risk), excluding pregnant women and women taking hormonal birth control (as they do not regularly ovulate), and excluding women currently experiencing pre-menstrual or menstrual symptoms (to ensure that effects observed cannot be attributed to these symptoms; see Haselton & Gildersleeve, 2011; Little, Jones, & DeBruine, 2008). Although most of these exclusion criteria are necessary to accurately gauge fertility risk, several fall into a gray area (e.g., excluding women with atypical cycles). The decision of whether to exclude women on the basis of these gray-area criteria does lead to the possibility of researcher degrees of freedom. Because we were aware of this concern, we reported (in endnotes) results when these exclusions were not made. This is the solution recommended by Simmons, Nelson, and Simonsohn (2011), who write: “If observations are eliminated, authors must also report what the statistical results are if those observations are included” (p. 1363). Thus, while we did make a decision about the most appropriate way to analyze our data, we also made that decision clear, reported results as they would have emerged if we had made the alternate decision, and gave the article’s reviewers, editor, and readers the information they needed to judge this issue.
In addition to the degrees of freedom concern, Gelman also raises concerns about representativeness and measurement. We have addressed these issues in a longer version of this response, posted here, and we encourage those who are interested to read the longer version. In an effort to keep this response concise, however, we wish to close by mentioning a few broader issues relevant to Gelman’s piece.
First, like any published set of empirical studies, our article should not be viewed as the ultimate conclusion on the question of whether women are more likely to wear red or pink when at high risk for conception. We submitted our article for publication because we believed that the evidence from the two studies we conducted was strong enough to suggest that there is a real effect of women’s fertility on their clothing choices, at least under certain conditions, but not because we believe there is no need for additional studies. Indeed, many questions remain about this effect, such as its generalizability, its moderators, and its mediators. We look forward to seeing new research address these questions, both from our own lab (where follow-up and additional replication studies are already underway) and others.
Second, setting the ubiquitous need for additional research aside for the moment, Gelman’s claim that our two studies provide “essentially no evidence for the researchers’ hypotheses” is both inflammatory and unfair. For one thing, it is important to bear in mind that our research went through the standard peer review process—a process that is by no means quick or easy, especially at a top-tier journal like Psychological Science. This means that our methods and results have been closely scrutinized and given a stamp of approval by at least three leading experts in the areas of research relevant to our findings (in this case, social and evolutionary psychology). This does not mean that questions should not be raised; indeed, questioning and critiquing published work is an important part of the scientific process, and Gelman is correct that the review process often fails to take into account researcher degrees of freedom. But research critics—especially those who publish their critiques in widely dispersed forums like Slate blog posts—must ensure that they get the facts right, even if that means contacting an article’s authors for more information, or explicitly mentioning additional information that the authors provided in endnotes.
Indeed, a statistician like Gelman could go well beyond simply mentioning possible places where additional degrees of freedom might have come into play and then making assumptions about the validity of our findings on that basis. He could, and should, instead find out exactly where researcher degrees of freedom did come into play, then calculate the precise likelihood that they would have resulted in the two significant effects that emerged in our studies if these effects were not in fact true. In other words, additional researcher degrees of freedom increase the chance that we will find a significant effect where none exists. But by how much? The chance of obtaining the same significant effect across two independent consecutive studies is .0025 (Murayama et al., in press). How many researcher degrees of freedom would it take for this to become a figure that would reasonably allow Gelman to suggest that our effect is most likely a false positive? This is a basic math problem, and one that Gelman could solve. Without such a calculation, the conclusion that our findings provide no support for our hypothesis would never pass the standards of scientific peer review. Researchers do have certain responsibilities, such as avoiding, to whatever extent possible, taking advantage of researcher degrees of freedom and being honest about it when they do, but critics of research have certain responsibilities too.
This is particularly important because there is a very real possibility that most readers of posts such as these will assume they are accurate without checking them against the original research reports. Indeed, most Slate readers do not have access to academic journal articles and so must rely on media summaries to form an assessment of the research. Combined with the viral power of the internet, this places a very real burden on critics and others who discuss scientific research in popular media forums to make serious efforts to maintain accuracy.
The field of psychology—and social psychology in particular—is currently experiencing an intense period of self-reflection. On the whole, this is a very good thing: psychologists are interested in finding and reporting true effects, and increased scrutiny of problematic research practices will help us do so. At the same time, it would be unfortunate if one consequence of this self-reflection is that researchers become afraid to publish certain findings for fear of reputational damage. Research articles that follow good research practices should not become suspect simply because their findings are unexpected.
[This is followed by a list of references which can be found at the end of the post here.]
And here’s my response:
Regarding researcher degrees of freedom, the fundamental issue is that many different plausible hypotheses could have been tested; indeed, the supplementary material reports tests for each color. Yes, Beall and Tracy found their desired pattern with the red-pink combination, but had they found it only for red, or only for pink, this would have fit their theories too. Consider their reference to “pinkish swellings” and “pinkish skin tones.” Had their data popped out with a statistically significant difference on pink and not on red, that would have been news too. And suppose that white and gray had come up as the more frequent colors? One could easily argue that blander colors serve to highlight the pink colors of a (European-colored) face. With so many possibilities, it is not particularly striking that one particular comparison happened to come out large.
Tracy and Beall write, “If any of these analyses other than those of pink and red had produced significant differences, we would have failed to support our hypothesis.” I think that other findings would have supported their hypothesis in different ways. Data can fool people. A factor-of-3 difference for pink but nothing for red, or a factor-of-3 difference for white and gray but nothing for any other color, and so on: any such pattern would have fit just fine into their larger theory. The point is that many degrees of freedom were available, even if, with the particular data that happened to occur, the researchers performed only one particular analysis.
Similarly, Beall and Tracy found a pattern in their internet sample and their college students, but a pattern in just one group and not the other could also have been notable and explainable under the larger theory, given the different ages of the two groups of participants. And it would have seemed reasonable to combine the results from the two samples, or even gather a third sample, if a striking but not-quite statistically significant pattern were observed. Again, their data-analysis choices seem clear, conditional on the data they saw, but other choices would have been just as reasonable given other data, allowing many different possible roads to statistical significance.
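To make this concrete, here is a rough simulation sketch, in Python, of the forking-paths problem. It is my own illustration, not a reanalysis of Beall and Tracy’s data: the sample sizes are roughly theirs, but the base rates of wearing each color are numbers I made up. Shirt color is generated with no relation at all to fertility, and the simulation counts how often at least one of several plausible comparisons still comes out statistically significant:

```python
# A rough sketch (my own illustration, not Beall and Tracy's analysis; the
# base rates of wearing each color are made-up assumptions). Shirt color is
# generated independently of fertility, and we count how often at least one
# of several plausible comparisons (red alone, pink alone, or red/pink
# combined, in either sample or in the pooled data) reaches p < .05.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
p_red, p_pink = 0.08, 0.08                  # assumed base rates (illustrative)
n_sims, hits = 2000, 0

def significant(high, low, color):
    """Fisher exact test comparing one color code between the two groups."""
    table = [[int(np.sum(high == color)), int(np.sum(high != color))],
             [int(np.sum(low == color)), int(np.sum(low != color))]]
    return fisher_exact(table)[1] < 0.05

for _ in range(n_sims):
    found = False
    samples = []
    for n in (100, 24):                     # two samples, roughly the paper's sizes
        # 0 = other, 1 = red, 2 = pink; same distribution in both fertility groups
        probs = [1 - p_red - p_pink, p_red, p_pink]
        high = rng.choice([0, 1, 2], size=n // 2, p=probs)
        low = rng.choice([0, 1, 2], size=n - n // 2, p=probs)
        samples.append((high, low))
    pooled = (np.concatenate([s[0] for s in samples]),
              np.concatenate([s[1] for s in samples]))
    for high, low in samples + [pooled]:
        red_pink_high = (high > 0).astype(int)   # collapse red and pink into one code
        red_pink_low = (low > 0).astype(int)
        if (significant(high, low, 1) or significant(high, low, 2)
                or significant(red_pink_high, red_pink_low, 1)):
            found = True
    hits += found

print(f"'significant' somewhere in {hits / n_sims:.0%} of null datasets")
```

The exact figure depends on the assumed rates and sample sizes; the point is simply that every additional plausible comparison pushes the effective false-positive rate above the nominal 5% of a single pre-specified test.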
Regarding fertility, it is accepted that (a) the dates of peak fertility vary from woman to woman, and (b) to the extent that there is a general recommendation, it is days 10-17 or something close to that, not days 6-14. As discussed in the Slate article, my best guess as to what happened is that they were following a paper from 2000 whose authors misread a paper from 1996.
The trouble is, if the effect size is small (which it will have to be, given all the measurement error involved here), any statistically significant patterns in a small-sample study are likely to be noise. As has been demonstrated many times, if you start with a scientific hypothesis and then gather data, it is all too possible to find statistically significant patterns that are consistent with your hypothesis even when the underlying effect is negligible or nonexistent.
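Here is a similar rough sketch, again with made-up numbers rather than the paper’s data, showing why a large estimate from a small, noisy study is not reassuring even if some small true effect exists:

```python
# A rough sketch of the small-effect, small-sample problem (assumed numbers,
# not the paper's data): simulate a genuinely small difference in the rate of
# wearing red/pink, then look at the estimated rate ratio among only those
# simulated datasets in which the comparison happens to reach significance.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n_high, n_low = 50, 74                 # illustrative group sizes
p_high, p_low = 0.12, 0.10             # assumed true rates: a small real effect
ratios = []

for _ in range(5000):
    x_high = rng.binomial(n_high, p_high)
    x_low = rng.binomial(n_low, p_low)
    table = [[x_high, n_high - x_high], [x_low, n_low - x_low]]
    if fisher_exact(table)[1] < 0.05 and x_low > 0:
        ratios.append((x_high / n_high) / (x_low / n_low))

if ratios:
    print(f"true rate ratio = {p_high / p_low:.2f}; "
          f"average estimated ratio among 'significant' results = {np.mean(ratios):.1f}")
```

In this kind of simulation, the estimates that happen to clear the significance threshold are systematically far larger than the small true ratio. That is what it means to say that significant patterns in a small, noisy study are likely to be noise: the significance filter selects the exaggerated estimates.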
To conclude, let me repeat what I wrote in my earlier article:
I don’t mean to be singling out this particular research team for following what are, unfortunately, standard practices in experimental research. Indeed, that this article was published in a leading journal is evidence that its statistical methods were considered acceptable.
And I meant it. I hope that this discussion motivates these and other researchers to carefully read the work of Simonsohn et al. on p-values and researcher degrees of freedom, and the related work on the hopelessness of trying to learn about small effects from small sample size; see, for example, the paper discussed here and the “50 shades of gray” paper of Nosek, Spies, and Motyl.
As Tracy and Beall point out, their paper was accepted by subject-matter experts to appear in a top journal in psychology. This is what worries me (and others such as Simonsohn, Francis, Nosek, and Ioannidis, although I can’t comment on their reactions to this particular paper). My point in writing the Slate article was not to pick on this research on fertility and dress but to use it as an example to discuss a larger problem in social-science and public-health research. I do not want researchers to “become afraid to publish certain findings,” but I would like authors and journals to be more cautious about claims that patterns in small unrepresentative samples generalize to the larger population.
I stand by my conclusion that the system of scientific publication is set up to encourage publication of spurious findings, and at the same time I would like to thank Tracy and Beall for their gracious response to my article. I remain hopeful that open discussion of research methods will help move us forward.