Response by Jessica Tracy and Alec Beall to my critique of the methods in their paper, “Women Are More Likely to Wear Red or Pink at Peak Fertility”

Last week I published in Slate a critique of a paper that appeared in the journal Psychological Science. That paper, by Alec Beall and Jessica Tracy, found that women who were at peak fertility were three times more likely to wear red or pink shirts, compared to women at other points in their menstrual cycles. The study was based on 100 internet participants and 24 college students. In my critique, I argued that we had no reason to believe the results generalized to the larger population, because (1) the samples were not representative, (2) the measurements were noisy, (3) the researchers did not use the correct dates of peak fertility, and (4) there were many different comparisons that could have been reported in the data, so there was nothing special about a particular comparison being statistically significant. I likened their paper to other work that I considered flawed because of multiple comparisons (too many researcher degrees of freedom), including a claimed relation between men’s upper-body strength and political attitudes and the notoriously unreplicated work by Daryl Bem on ESP. These were two papers that, like Beall and Tracy’s, were published in top peer-reviewed psychology journals.

Tracy and Beall responded to me, and I thought it only fair to post this response on my website. I will also ask Slate to add a paragraph at the end of my article, linking to their response.

Below is Tracy and Beall’s response, along with my brief comments. (I give the above summary of my argument to provide background for those who are coming to this story for the first time.)

OK, here are Tracy and Beall:

While we agree with several of Andrew Gelman’s broad concerns about current research practices in social psychology (see “Too Good to Be True”), much of what he said about our article, “Women are more likely to wear red or pink at peak fertility”, recently published in Psychological Science, was incorrect. Unfortunately, Gelman did not contact us before posting his article. Had he done so, we could have clarified these issues, and he would not have had to make the numerous flawed assumptions that appeared in his article. Here, we take the opportunity to make these clarifications, and also to encourage those who read Gelman’s post to read our published article, available here, and the Online Supplement, available here.

We want to begin with the issue that received the greatest attention, and which Gelman suggests (and we agree) is most potentially problematic: that of researcher degrees of freedom. Gelman makes several points on this issue; we respond to each in turn below.

a) Gelman suggests that we might have benefited from researcher degrees of freedom by asking participants to report the color of each item of clothing they wore, then choosing to report results for shirt color only. In fact, we did no such thing; we asked participants about the color of their shirts because we assumed that shirts would be the clothing item most likely to vary in color.

b) We categorized shirts that were red and pink together because pink is a shade of red; it is light red. The theory we were testing is based on the idea that red and shades of red (such as the pinkish swellings seen in ovulating chimpanzees, or the pinkish skin tone observed in attractive and healthy human faces) are associated with sexual interest and attractiveness (e.g., Coetzee et al., 2012; Deschner et al., 2004; Re, Whitehead, Xiao, & Perrett, 2011; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012; Whitehead, Ozakinci, & Perrett, 2012). Thus, our decision to combine red and pink in our analyses was a theoretical one.

c) We are confused by Gelman’s comment that, “other colors didn’t yield statistically significant differences, but the point here is that these differences could have been notable.” That these differences could have been notable is part of what makes the theory we were testing falsifiable. A large body of evidence suggests that red and pink are associated with attractiveness and health, and may function as a sexual signal at both a biological and cultural level (e.g., Burtin, Kaluza, Klingenberg, Straube, and Utecht 2011; Coetzee et al., 2012; Elliot, Tracy, Pazda, & Beall, 2012; Elliot & Pazda 2012; Guéguen, 2012a; Guéguen, 2012b; Guéguen, 2012c; Guéguen & Jacob, 2012; 2013a; 2013b; Jung, Kim, & Han, 2011a; Jung et al., 2011b; Meier et al., 2012; Oberzaucher, Katina, Schmehl, Holzleitner, & Mehu-Blantar, 2012; Pazda, Elliot, & Greitmeyer, 2012; 2013; Re, Whitehead, Xiao, & Perrett, 2011; Roberts, Owen, & Havilcek, 2010; Schwarz & Singer, 2013; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012). In order to test the specific prediction emerging from this literature, that fertility would affect women’s tendency to wear red/pink but not their tendency to wear other colors, we ran analyses comparing the frequency of women in high- and low-conception risk groups wearing a large number of different colored shirts. The results of these analyses are reported in detail in the Online Supplement to our article (which includes a Figure showing all frequencies). If any of these analyses other than those of pink and red had produced significant differences, we would have failed to support our hypothesis.

Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case. We conducted these studies with the sole purpose of testing one specific hypothesis: that conception risk would increase women’s tendency to dress in red or pink. This hypothesis emerges quite clearly from the large body of work mentioned above, which includes a prior paper we co-authored (Elliot, Tracy, Pazda, & Beall, 2012). We came up with the hypothesis while working on that paper, and were in fact surprised that it hadn’t been tested previously, because it seemed to us like such an obvious possibility given the extant literature. The existence of this prior published article provides clear evidence that we set out to test a specific theory, not to conduct a fishing expedition. (See also Murayama, Pekrun, & Fiedler, in press, for more on the role of theory testing in reducing Type I errors).

d) Our choice of which days to include as low-risk and high-risk was based on prior research, and, importantly, was determined before we ran any analyses. Gelman is right that there is a good deal of debate about which days best reflect a high conception risk period, and this is a legitimate criticism of all research that assesses fertility without directly measuring hormone levels. Given this debate, we followed the standard practice in our field, which is to make this decision on the basis of what prior researchers have done. We adopted the Day 6-14 categorization period after finding that this is the categorization used by a large body of previously published, well-run studies on conception risk (e.g., Penton-Voak et al., 1999; Penton-Voak & Perrett, 2000; Little, Jones, Burris, 2007; Little & Jones, 2012; Little, Jones & DeBruine, 2008; Little, Jones, Burt, & Perrett, 2007; Farrelly 2011; Durante, Griskevicius, Hill, & Perilloux, 2011; DeBruine, Jones, & Perrett, 2005; Gueguen, 2009; Gangestad & Thornhill, 1998). Although the exact timing of each of these windows is debatable, it is not debatable that Days 0-5 and 15-28 represent a window of lower conception risk than days 6-14.

Furthermore, if our categorization did result in some women being mis-categorized as low-risk when in fact they were high risk, or vice-versa, this would increase error and decrease the size of any effects found. Most importantly, we did not decide to use this categorization after comparing various options and examining which produced significant effects. Rather, we adopted it a priori and used it and only it in analyzing our data; no researcher degrees of freedom came into play.

e) In any study that assesses conception risk using a self-report measure, certain women must be excluded to ensure that those for whom risk was not accurately captured do not erroneously influence results. All of the exclusions we made were based on those suggested by prior researchers studying the psychological effects of conception risk, such as excluding women with irregular cycles (as it is more difficult to accurately determine when they are likely to be at risk), excluding pregnant women and women taking hormonal birth control (as they do not regularly ovulate), and excluding women currently experiencing pre-menstrual or menstrual symptoms (to ensure that effects observed cannot be attributed to these symptoms; see Haselton & Gildersleeve, 2011; Little, Jones, & DeBruine, 2008). Although most of these exclusion criteria are necessary to accurately gauge fertility risk, several fall into a gray area (e.g., excluding women with atypical cycles). The decision of whether to exclude women on the basis of these gray-area criteria does lead to the possibility of researcher degrees of freedom. Because we were aware of this concern, we reported (in endnotes) results when these exclusions were not made. This is the solution recommended by Simmons, Nelson, and Simonsohn (2011), who write: “If observations are eliminated, authors must also report what the statistical results are if those observations are included” (p. 1363). Thus, while we did make a decision about the most appropriate way to analyze our data, we also made that decision clear, reported results as they would have emerged if we had made the alternate decision, and gave the article’s reviewers, editor, and readers the information they needed to judge this issue.

In addition to the degrees of freedom concern, Gelman also raises concerns about representativeness and measurement. We have addressed these issues in a longer version of this response, posted here, and we encourage those who are interested to read the longer version. In an effort to keep this response concise, however, we wish to close by mentioning a few broader issues relevant to Gelman’s piece.

First, like any published set of empirical studies, our article should not be viewed as the ultimate conclusion on the question of whether women are more likely to wear red or pink when at high risk for conception. We submitted our article for publication because we believed that the evidence from the two studies we conducted was strong enough to suggest that there is a real effect of women’s fertility on their clothing choices, at least under certain conditions, but not because we believe there is no need for additional studies. Indeed, many questions remain about this effect, such as its generalizability, its moderators, and its mediators. We look forward to seeing new research address these questions, both from our own lab (where follow-up and additional replication studies are already underway) and others.

Second, setting the ubiquitous need for additional research aside for the moment, Gelman’s claim that our two studies provide “essentially no evidence for the researchers’ hypotheses” is both inflammatory and unfair. For one thing, it is important to bear in mind that our research went through the standard peer review process—a process that is by no means quick or easy, especially at a top-tier journal like Psychological Science. This means that our methods and results have been closely scrutinized and given a stamp of approval by at least three leading experts in the areas of research relevant to our findings (in this case, social and evolutionary psychology). This does not mean that questions should not be raised; indeed, questioning and critiquing published work is an important part of the scientific process, and Gelman is correct that the review process often fails to take into account researcher degrees of freedom. But research critics—especially those who publish their critiques in widely dispersed forums like Slate blog posts—must ensure that they get the facts right, even if that means contacting an article’s authors for more information, or explicitly mentioning additional information that the authors provided in endnotes.

Indeed, a statistician like Gelman could go well beyond simply mentioning possible places where additional degrees of freedom might have come into play and then making assumptions about the validity of our findings on that basis. He could, and should, instead find out exactly the places where researcher degrees of freedom did come into play, then calculate the precise likelihood that they would have resulted in the two significant effects that emerged in our studies if these effects were not in fact true. In other words, additional researcher degrees of freedom increase the chance that we will find a significant effect where none exists. But by how much? The chance of obtaining the same significant effect across two independent consecutive studies is .0025 (Murayama et al., in press). How many researcher degrees of freedom would it take for this to become a figure that would reasonably allow Gelman to suggest that our effect is most likely a false positive? This is a basic math problem, and one that Gelman could solve. Without such calculation, the conclusion that our findings provide no support for our hypothesis would never pass the standards of scientific peer review. Researchers do have certain responsibilities—such as avoiding, to whatever extent possible, taking advantage of researcher degrees of freedom and being honest about it when they do—but critics of research have certain responsibilities too.

This is particularly important because there is a very real possibility that most readers of posts such as these will assume that they are accurate without checking against the original research reports. Indeed, most Slate readers do not have access to academic journal articles, so must rely on media summaries to form an assessment of the research. Added to the viral power of the internet, this creates a very real burden on critics and others who discuss scientific research in popular media forums to make serious efforts to maintain accuracy.

The field of psychology—and social psychology in particular—is currently experiencing an intense period of self-reflection. On the whole, this is a very good thing: psychologists are interested in finding and reporting true effects, and increased scrutiny of problematic research practices will help us do so. At the same time, it would be unfortunate if one consequence of this self-reflection is that researchers become afraid to publish certain findings for fear of reputational damage. Research articles that follow good research practices should not become suspect simply because their findings are unexpected.

[This is followed by a list of references which can be found at the end of the post here.]

And here’s my response:

Regarding researcher degrees of freedom, the fundamental issue is that many different plausible hypotheses could have been tested; indeed, the supplementary material reports tests for each color. Yes, Beall and Tracy found their desired pattern with the red-pink combination, but had they found it only for red, or only for pink, this would have fit their theories too. Consider their reference to “pinkish swellings” and “pinkish skin tones.” Had their data popped out with a statistically significant difference on pink and not on red, that would have been news too. And suppose that white and gray had come up as the more frequent colors? One could easily argue that blander colors serve to highlight the pink colors of a (European-colored) face. With so many possibilities, it is not particularly striking that one particular comparison happened to come out large and statistically significant.

Tracy and Beall write, “If any of these analyses other than those of pink and red had produced significant differences, we would have failed to support our hypothesis.” I think that other findings would have supported their hypothesis in different ways. Data can fool people. A factor-of-3 difference for pink but nothing for red, or a factor-of-3 difference for white and gray but nothing for any other color, and so on—any such pattern would have fit just fine into their larger theory. The point is that there are many degrees of freedom available, even if, with the particular data that happened to occur, the researchers did only one particular analysis.

Similarly, Beall and Tracy found a pattern in their internet sample and their college students, but a pattern in just one group and not the other could also have been notable and explainable under the larger theory, given the different ages of the two groups of participants. And it would have seemed reasonable to combine the results from the two samples, or even gather a third sample, if a striking but not-quite statistically significant pattern were observed. Again, their data-analysis choices seem clear, conditional on the data they saw, but other choices would have been just as reasonable given other data, allowing many different possible roads to statistical significance.
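To make the “many roads” point concrete, here is a minimal simulation sketch (Python; the sample sizes, the fifty-fifty fertility split, and the shirt-color frequencies are assumptions for illustration, not the study’s data). It generates data in which shirt color has no relationship to fertility at all, and then checks whether at least one of the comparisons a reasonable analyst might have reported (red/pink, red only, pink only, or white/gray, in Sample A, Sample B, or the pooled data) comes out with p < .05:

import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
colors = ["red", "pink", "white", "gray", "other"]
base_probs = [0.07, 0.07, 0.15, 0.10, 0.61]              # assumed, fertility-independent
groupings = [("red", "pink"), ("red",), ("pink",), ("white", "gray")]

def any_significant(n_a=100, n_b=24):
    # simulate two null samples: fertility status and shirt color are independent
    draws = {}
    for name, n in (("A", n_a), ("B", n_b)):
        high = rng.random(n) < 0.5                        # high-fertility flag
        color = rng.choice(colors, size=n, p=base_probs)  # shirt color
        draws[name] = (high, color)
    draws["pooled"] = (np.concatenate([draws["A"][0], draws["B"][0]]),
                       np.concatenate([draws["A"][1], draws["B"][1]]))
    # try every comparison a well-meaning analyst might plausibly have reported
    pvals = []
    for high, color in draws.values():
        for grp in groupings:
            in_grp = np.isin(color, grp)
            table = [[np.sum(in_grp & high), np.sum(in_grp & ~high)],
                     [np.sum(~in_grp & high), np.sum(~in_grp & ~high)]]
            pvals.append(fisher_exact(table)[1])
    return min(pvals) < 0.05

hits = sum(any_significant() for _ in range(2000))
print(f"some plausible comparison reached p < .05 in {hits / 2000:.0%} of null datasets")

Because the comparison that ends up being reported is chosen in light of the data, the rate of “significant” findings runs well above the nominal 5%, even though each individual test is perfectly valid on its own.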

Regarding fertility, it is accepted that (a) the dates of peak fertility vary from woman to woman, and (b) to the extent that there are general recommendations, this would be days 10-17 or something close to that. Not days 6-14. As discussed in the Slate article, my best guess as to what happened was that they were following a paper from 2000 whose authors misread a paper from 1996.

The trouble is, if the effect size is small (which it will have to be, given all the measurement error involved here), any statistically-significant patterns in a small-sample study are likely to be noise. As has been demonstrated many times, if you start with a scientific hypothesis and then gather data, it is all too possible to find statistically-significant patterns that are consistent with your hypothesis, even when the underlying effect is essentially zero.
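Here is a toy illustration of that point (Python; the true effect, noise level, and sample size are made-up numbers meant to mimic a small, noisy study, not quantities estimated from this one). Conditional on reaching p < .05, the estimates exaggerate the true effect several-fold and occasionally even point in the wrong direction:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, sd, n, sims = 0.05, 0.5, 50, 20_000    # assumed: tiny effect, lots of noise
x = rng.normal(0.0, sd / np.sqrt(n), sims)          # control-group means
y = rng.normal(true_effect, sd / np.sqrt(n), sims)  # treatment-group means
est = y - x
se = sd * np.sqrt(2.0 / n)
p = 2 * stats.norm.sf(np.abs(est) / se)

sig = p < 0.05
print(f"proportion of studies reaching p < .05: {sig.mean():.2f}")
print(f"average |estimate| among those: {np.abs(est[sig]).mean():.2f} (true effect {true_effect})")
print(f"share of significant estimates with the wrong sign: {(est[sig] < 0).mean():.2f}")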

To conclude, let me repeat what I wrote in my earlier article:

I don’t mean to be singling out this particular research team for following what are, unfortunately, standard practices in experimental research. Indeed, that this article was published in a leading journal is evidence that its statistical methods were considered acceptable.

And I meant it. I hope that this discussion motivates these and other researchers to carefully read the work of Simonsohn et al. on p-values and researcher degrees of freedom, and the related work on the hopelessness of trying to learn about small effects from small sample size; see, for example, the paper discussed here and the “50 shades of gray” paper of Nosek, Spies, and Motyl.

As Tracy and Beall point out, their paper was accepted by subject-matter experts to appear in a top journal in psychology. This is what worries me (and others such as Simonsohn, Francis, Nosek, and Ioannidis, although I can’t comment on their reactions to this particular paper). My point in writing the Slate article was not to pick on this research on fertility and dress but to use it as an example to discuss a larger problem in social-science and public-health research. I do not want researchers to “become afraid to publish certain findings,” but I would like authors and journals to be more cautious about claims that patterns in small unrepresentative samples generalize to the larger population.

I stand by my conclusion that the system of scientific publication is set up to encourage publication of spurious findings, and at the same time I would like to thank Tracy and Beall for their gracious response to my article. I remain hopeful that open discussion of research methods will help move us forward.

135 thoughts on “Response by Jessica Tracy and Alec Beall to my critique of the methods in their paper, “Women Are More Likely to Wear Red or Pink at Peak Fertility””

  1. With regard to the apparent mismatch between the utilized and actual days of maximum fertility, it may not be unreasonable to infer that there is indeed an effect synced with a woman’s cycle, but that it does not necessarily peak during the days of maximum fertility. Due to real social issues, it may make sense for the “signaling” to occur a bit in advance of the peak. Alternatively, it may make sense not that there be a particular peak of signaling, but rather that there be a dip in signaling at times of definitive non-fertility. Pure speculation, but it’s always fun to speculate :)

    That being said, I’m still troubled by the sample selection and sample size issues. These would be rather less troubling if each subject were queried over time (probably at random intervals rather than daily), and preferably over multiple cycles.

    • Clark:

      I agree that the speculations are interesting; they also introduce new degrees of freedom into potential analyses. As discussed above, alternative data patterns could map onto plausible alternative models, all of which comport with the general story.

      In any case, I agree with you that a within-subject design would be a lot more informative.

  2. I think you overreach a bit in your criticism of multiple comparisons here. While I agree that the authors likely would have interpreted a significant result in any red/pink grouping as meaningful, I don’t think it’s fair to assume that they would have spun a significant result for white shirts only when they clearly set out to study red. It seems that the authors were trying to verify that indeed their hypothesis was the only one that was significant. They were attempting due diligence, not (even unwittingly) fishing, and it’s unfair to assume otherwise. If a researcher is interested in a particular hypothesis, are you suggesting that it’s good practice for him to willfully blind himself to other aspects of the data?

    I agree fully with every other criticism.

    • If the authors had a clear theory, measures, and tests, they should have pre-registered an analysis plan. That way we don’t have to take them at their word.

    • Put another way, here are 2 scenarios:

      1) The pre-specified comparison was the only one out of 10 examined that was significant.
      2) The pre-specified comparison was significant and was the only one examined.

      Do you consider scenario (1) weaker evidence for a real difference than scenario (2), all other things being equal? (I know that due to other considerations in this study, the evidence for a real difference is weak in either case.) And do you agree that your criticism implies a preference for scenario (2) over scenario (1)? That feels weird unless you assume researcher dishonesty about whether the comparison was actually pre-specified.

      • (Or at least assume a high level of researcher incompetence in determining whether the comparison was actually pre-specified.)

    • Z:

      I think that the comparisons they performed in their analysis made sense, given the data they saw. I also think that if the data they had seen had come out differently—for instance, if there were no big differences with the red and pink shirts but if white and gray shirts had been three times as likely to appear during their specified dates—it would have been completely reasonable to interpret this as evidence in favor of the hypothesis. It wouldn’t have felt like “fishing”; rather, it would have been a completely reasonable response to their data.

      In that sense, my use of the term “fishing” was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. I have no reason to think that Tracy and Beall did this, and I’ll take their word that they did not. I think the real story is that they did a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances. That is what I meant when I said that there are many roads to statistical significance.

      The term “p-hacking” may be unfortunate for similar reasons, as it invokes an image of a researcher hacking around to construct a statistically significant finding. I have no doubt that this sort of p-hacking occurs all the time, especially perhaps in high-pressure small-n medical research where a researcher can come up with excuses to exclude patients from their analysis. But, again, the problem of “researcher degrees of freedom” arises even if no “hacking” is done at all. With any given data set, a reasonable analysis can be done, but different reasonable choices can be made, conditional on different data.

      I regret the spread of the terms “fishing” and “p-hacking” for two reasons: first, because when an outsider such as myself uses such terms to describe a study, there is the misleading implication that the researchers were trying out many different analyses on a single data set; and, second, because it can lead researchers who know they did not try out many different analyses to mistakenly think they are not so strongly subject to problems of researcher degrees of freedom.

      This whole discussion has been very helpful to me in deepening my understanding of these problems.

      • Following this story with interest, still rather new to thinking about reproducible science, researcher df, etc.

        I’m a little confused, because I thought the problem with researcher df was if they were taken and not disclosed (i.e., choosing the best comparison, measure, etc.), and therefore resulting effects are untrustworthy.

        But here you seem to be criticizing even though researcher df were acknowledged and reported. I understand that you’re saying, with so many researcher df and comparisons, the p of *any* sig contrast is so high anyway that it’s nothing special if one contrast turns out sig, even if it accords with their a priori hypothesis. I may be naive, but this seems like a no-win situation to me, since a lot of research (other than pre-registered clinical-trial-type studies) has lots of df and many possible comparisons! What remedy do you propose?

        • Joel:

          This study had problems with measurement, representativeness, and sample size. If you are limited to the data available from this particular study, I don’t think much can be done except to publish the results (including all the raw data) openly as speculation. That would be ok with me. Speculation can be useful.

        • Right. Leaving aside the (very real) concerns with measurement and representativeness, increases in sample size can reduce the probability of Type S/M errors, but do they directly address concerns about researcher df?

        • Are you really saying that such small-N studies taken from the college student population have nothing to contribute beyond speculation? It seems like some significant amount of exploratory work is essential in any field of study where scientists do not know enough yet to justify larger expenditures of time and taxpayer dollars on a single question. Is it not important for such exploratory work to have some quantitative measure of how reliable the effects found might be? I’d agree that the methods commonly used in practice do not suffice, but saying that we can only speculate from such data seems a bit nihilistic.

        • David:

          I did not intend the word “speculation” to be nihilistic. What I meant (but perhaps did not express well) is what you say: weak data can suggest ideas without being conclusive.

  3. What I don’t understand is why they haven’t replicated this study on Mechanical Turk. Since they have already run the study there once, running it again should be as easy as clicking a button. And while they’re at it, why not run it with 1000 participants instead of 100?

    • Money? Dunno. But I agree that the last word on this will come from a larger study and hopefully also a non-internet, non-college-student-only study.

  4. If they had used the correct window for peak fertility, would that have increased the size of the effect found? They’re claiming that a priori, but I don’t know if a reanalysis has been done.

  5. Pingback: Spurious statistical significance – when science gets it all wrong | LARS P SYLL

  6. I am skeptical that many (maybe most) psychologists know how to make a good scientific argument with a statistical analysis.

    For example, in their reply, Tracy and Beall noted that their studies were designed to test hypotheses that were derived from earlier studies. In the original paper they noted, “Building on the evidence reviewed above suggesting that women may seek to increase their attractiveness by self-adorning in reddish colors, and should be particularly motivated to do so during peak fertility, we tested whether women are more likely to wear red- or pink-colored clothing during this period, compared to other phases of their menstrual cycle. Support for this prediction would provide the first evidence for a distinct and visually obvious behavioral display linked to female ovulation.” The experimental results then are reported to provide support for this prediction.

    This kind of validation of a novel prediction feels like a strong scientific argument, but I think it is misleading because the prediction was never real. At best, researchers can only predict the probability that a hypothesis test will reject the null (power), and such a prediction can only be made for some hypothesized effect size with knowledge about the experimental design and sample size. If the earlier studies had provided an estimated effect size, then there could have been a true (probabilistic) prediction. The paper does not discuss a predicted effect size or experimental power. It seems unlikely that Tracy and Beall performed the effect size and power analysis but left it out of their paper. A post hoc power estimate of their findings suggests power of 0.64 for Sample A and 0.50 for Sample B. It is difficult to imagine a scenario where a scientist deliberately sets out to run an experiment that is the equivalent of a coin flip.
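    For readers curious how a post hoc power figure of this kind might be computed, here is a rough sketch (Python, using Cohen’s arcsine approximation for a two-sided two-proportion test at alpha = .05; the cell counts are the ones reconstructed later in this thread and are used here as assumptions, since the commenter’s exact inputs and method are not stated, so it lands only in the same ballpark as the 0.64 and 0.50 quoted above):

    from math import asin, sqrt
    from scipy.stats import norm

    def two_prop_power(x1, n1, x2, n2, alpha=0.05):
        # approximate power of a two-sided two-proportion z-test (Cohen's effect size h)
        p1, p2 = x1 / n1, x2 / n2
        h = abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))
        z_crit = norm.ppf(1 - alpha / 2)
        return norm.cdf(h * sqrt(n1 * n2 / (n1 + n2)) - z_crit)

    print(two_prop_power(13, 51, 4, 49))   # Sample A: roughly 0.67
    print(two_prop_power(4, 10, 1, 14))    # Sample B: roughly 0.52

    Either way, the qualitative point stands: with samples this small, the studies were close to coin flips even if the effect were exactly as large as observed.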

    Their previous research could have motivated their investigations, but it is difficult to believe that it predicted the outcome of their hypothesis tests. In my view, there is nothing wrong with exploratory work, but I think it is improper to call it a prediction. If it really was a prediction, then I think Tracy and Beall ran a really lousy set of experiments.

    As a general commentary on the problems with current practice of statistical analysis, I think Andrew’s characterization is spot on.

  7. “At the same time, it would be unfortunate if one consequence of this self-reflection is that researchers become afraid to publish certain findings for fear of reputational damage.”

    I’d probably call that “fortunate”! Can people think of any other findings in social psychology that a researcher was afraid to publish, where the non-publication turned out to be a net loss to the world? I rather think less prolific publication in this particular branch of work would do the world a lot of good.

    The counterfactual would have meant that Tracy and Beall waited for a sample of 1000 “real” participants, hopefully not all college students.

    • How is one to learn what social psychology results have not been published, never mind determine the resultant net loss to the world?

      I loved Andrew’s Slate piece, and as an ex psychologist I am really glad to see these issues getting air time. But actually I think that what happens when researchers choose not to publish spurious findings is that eventually they choose (or are forced) to move on to other careers, and the field is driven by researchers who, presented with the same dilemma, made a different choice. I don’t have an answer, but I don’t think shelving crummy results is all win.

      • There is indeed no systematic way to know about unpublished results, yes. But what I had in mind is stories from other disciplines where an unsubmitted manuscript emerges in a person’s papers later, or anecdotal stories of someone doing a study and shelving publication because of self-doubt (we had a blog post on it by Andrew this week).

        Pretty much informal channels.

    • “At the same time, it would be unfortunate if one consequence of this self-reflection is that researchers become afraid to publish certain findings for fear of reputational damage.”

      If I am not mistaken, science has already offered a nice solution for this:

      http://en.wikipedia.org/wiki/Ad_hominem

      As long as we talk about facts, logic, reasoning, findings, arguments, etc., nobody has to even mention things like “reputational damage” (whatever that means).

  8. One thing that is missing from the discussion of ‘researcher degrees of freedom’ on both sides of the debate is the selection of sample size. How did the researchers choose to study 24 people in their lab study versus, say, 20 or 40 or 200? How did they choose 100 for their internet sample versus 24, 42, or 400?

    My sense is that, in psychology, most of the time the sample size is selected according to the following rule: continue testing people until you get the effect you’re looking for or one that you think will allow you to tell a compelling story. I don’t know whether that is what Beall did. But there is nothing in the article or the response to suggest otherwise: no discussion of expected effects, statistical power, or stopping rules.

    • “My sense is that, in psychology, most of the time the sample size is selected according to the following rule: continue testing people until you get the effect you’re looking for or one that you think will allow you to tell a compelling story.”

      I haven’t worked in Psych research in a long time but my impression is the sample size is often dictated by what you can get. One cannot just fill out an invoice and order up 200 human subjects from the local laboratory supply store.

      If you need people (subjects) in the lab or need to do elaborate testing, say in a school situation, it can get costly, labour-intensive, and very time-consuming, very quickly.

      In some cases time pressures may force a cut-off. How long can one afford to hunt for subjects?

      As an example: Just how easy is it to find 200 piano players with Royal Conservatory qualifications, over 60 years of age, who are willing to volunteer to spend a couple of hours in a psychology laboratory?

      It sometimes means smaller sample sizes and weaker results. Mind you, I suspect that until very recently power calculations etc were not usually done and so it was easier to decide enough was enough.

      • How hard and expensive is it to survey a tad more than the 24 subjects they used?

        These aren’t whizkid piano players, they are adult women. Period.

        Besides, if you cannot afford to get a decent sample size I’d rather they not attempt such studies at all.

        • / These aren’t whizkid piano players, they are adult women. Period./

          This sentence could be seen as really funny, when you take the original study into account.

          Regardless of that, I totally agree with your point that a larger sample size would have been a good thing; it would not seem that hard to arrange!

  9. I think that many of the comments here miss the most important issue. I can’t fault the paper’s authors for wanting to test an idea they have, and I can easily picture how their thought process may have gone:

    “Hey, if this is so then we ought to see more women wearing reddish clothing in the middle of their cycle. Let’s try to take a look and see if that’s actually happening.”

    Nothing wrong with that! The small and non-representative samples are a problem, but you have to start somewhere. Andrew argues that had the results turned out otherwise, the authors could have used them to support their ideas, but that’s just speculation and I haven’t seen any sign that they would have.

    No, my problem with this kind of work as published is that there are usually many other plausible reasons why the results might have turned out this way, but the authors didn’t happen to think of them. Or didn’t regard them as plausible. For example in this case, suppose the results turn out to be essentially correct. It might be that women’s skin becomes a little redder during the fertile part of their cycle, as the authors state. Now many women have a keen sense of color and want their clothes to coordinate well with their coloring. This alone might lead to a tendency for their clothes to become more reddish during those periods.

    Thus the result does not need to come from a desire to be more alluring during periods of peak fertility. The authors may in fact have come up with a correct sign and value of an effect, but the effect may not arise from what they thought.

    I think my example alternative explanation is plausible. But it doesn’t matter if you agree or not. It’s just an example of an alternative hypothesis – one that may be just as plausible as the published one, who knows? – that wasn’t examined and may not have occurred to the authors.

    Especially in study areas dealing with people, there are usually so many plausible alternative explanations that it is excessively simplistic to expect any one or two approaches to pin down what’s going on, regardless of their statistical power.

    So it behooves people to be humble in what they claim. But you do have to start somewhere when you have an idea. My own reaction to papers like this is just to add the humbleness about sweeping conclusions that may be lacking in the papers. We just have to hope other researchers remember to do the same when they try to make use of published work.

    • Yes, you have to start somewhere. But then again, you are not obliged to publish every tiny experiment you conduct. I say, be cautious: repeat using a larger, more representative group, and if your hypothesis still holds, fine, go ahead and publish. 24 college students and some suspect internet respondents is hardly a respectable sample.

      I wouldn’t call Andrew’s points “just speculation”; no, we don’t have proof beyond reasonable doubt, but I feel these are very logical conclusions that any reasonable person may make.

    • I think your alternative explanation is a key insight: it may have been women’s color perceptions that mattered, and nothing to do with wanting to (subconsciously) appear sexier.

      I’d also criticize their decision to eliminate women who had irregular periods or who were pregnant. They thus eliminated controls from their experiment. Heck, I’d have added several men to the study as controls. Especially for the on-campus part of the study, where one can imagine all kinds of common influences that might synchronize with a regular cycle.

  10. Theory K predicts all women in a population will wear pink/red when ovulating.

    Our prior on K is P(K)=0.2, P(not K)=0.8

    Likelihood a non-random sample of women wear red if K is true is P(R|K)=0.8, and P(R|not K)=0.5 say.

    Posterior is 0.29, so a test on a non-random sample is informative for a universal claim.

    Fishing messes up the likelihood s.t. P(R|.)=1 which fails to update the prior.

    I suspect it also alters the theory and the prior. A quagmire.
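    For concreteness, the update above works out as follows (a quick check in Python; the prior and the two likelihoods are the illustrative numbers given above, not estimates from the study):

    p_k, p_not_k = 0.2, 0.8
    p_r_given_k, p_r_given_not_k = 0.8, 0.5
    posterior = p_k * p_r_given_k / (p_k * p_r_given_k + p_not_k * p_r_given_not_k)
    print(round(posterior, 2))   # 0.29
    # if "fishing" pushes both likelihoods toward 1, the posterior collapses back
    # to the prior (0.2): the data no longer update anything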

    • PS I guess the point of an analysis plan is to lay out the operating characteristics of the test. This is easier with randomized experiments.

  11. Andrew,

    Like others here, I am not sure I understand part of your critique. You are saying the researchers found X (e.g. pink/red is worn more during fertile periods), but if they had found Y (e.g. white is worn more) they would have reported that as an effect, therefore you are less likely to believe that their reported effect X is true. Am I paraphrasing correctly?

    Do you mean that researcher degrees of freedom should be restricted to the point that only the prespecified hypotheses should be considered in a study (and researchers should be careful to point that out when they write it up)? I’m not saying I disagree with this, I’m just not clear on what you mean.

    • I cannot speak for Andrew, but the multiple-hypotheses problem is not limited to testing multiple outcomes (extensive margin). It also involves testing the same outcome under different research configurations multiple times (intensive margin).

      For example by shifting the time window of ovulation, by recruiting an extra subject, by changing the test statistic, or the computation of the disturbances, by dropping “outliers”, by adding and removing regressors, by changing the estimator, by imposing a different distribution on the disturbances, etc…

      Statistics courses waste a lot of time teaching how OLS is BLUE under certain assumptions, and how estimates are chosen to minimize the sum of squares (e.g., choose Beta s.t. min(RSS)). In practice that goes out the window when the real objective function is (often inadvertently) significance. That is, choose subsets of X, Y, test statistics, etc. to minimize the p-value using OLS or whatever (e.g., choose (research design) s.t. p-value<.05). My sense is we need to do better at teaching research practice.

    • Eric:

      I think that researchers should feel free to do whatever analysis they want, but when they have flexibility in their analysis they should not be so impressed with low p-values.

      • Or better yet, control the FWER so we know just how low a p-value needs to be before we should be impressed.
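        As a minimal sketch of what that could look like, here is the Holm step-down procedure, one standard way of controlling the FWER (Python; the list of p-values is illustrative, borrowing the paper’s two reported values plus two hypothetical others):

        def holm(pvals, alpha=0.05):
            # reject hypotheses while the k-th smallest p-value is <= alpha / (m - k + 1)
            order = sorted(range(len(pvals)), key=lambda i: pvals[i])
            m, reject = len(pvals), [False] * len(pvals)
            for rank, i in enumerate(order):
                if pvals[i] > alpha / (m - rank):
                    break
                reject[i] = True
            return reject

        print(holm([0.02, 0.051, 0.30, 0.74]))   # 0.02 > 0.05/4, so nothing survives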

  12. I understand the criticism, but couldn’t this logic be applied to any hypothesis-testing result? Suppose they had a larger sample size and higher power, etc. You could keep stretching out the space of related hypotheses until the results are no longer significant, since “related hypotheses” is quite subjective.

    In practice you’d probably reach a point where nobody would take the complaint seriously – e.g., once you start including hypotheses from totally different fields in the multiple-comparison adjustment. However, the logic of the argument would be the same.

    Maybe the underlying issue here is a broader problem with applying “testing” logic in non-experimental settings.

  13. The gist of Andrew’s criticism seems to be that researchers are saying that they are testing a theory X (ovulating women are more likely to wear red or pink) but, in Andrew’s opinion, they are really testing a much wider theory Y (ovulating women of some ages are more or less likely to wear red, pink, and/or white than non-ovulating women of these ages). I agree that the sample size is too small to confirm or disprove theory Y, and that p<0.05 for a single color and a single group (or even p<0.05 for one group and p=0.051 for the other group) tells you nothing. Therefore it is illegitimate to make a post hoc decision about the hypothesis on the basis of the data and then use p<0.05 as an argument for the validity of the hypothesis.

    However, they did not do it post hoc. They specified a hypothesis and scored a hole-in-one: p=0.02 and p=0.051 on the first attempt. (Or so they say.) Yes, if you don't get p<0.05 on the first try but you're allowed to keep changing the hypothesis till you get there, the p-value misleads you about the statistical significance of the result. But getting p<0.05 on the first attempt is an important fact.

    A low p-value basically says that it is unlikely for the finding to occur given the null hypothesis. In the event that you have multiple possible hypotheses, the loss of significance because of the increased chance of finding "evidence" for at least one is balanced by the increase of significance because of the low probability (given the null hypothesis) of finding p<0.05 on the first try.

    • Hole in one? We really need to stop this “p<.05 means you win” mentality. Research is supposed to be about understanding the truth. There are issues here that supersede the p-value in terms of interpretational importance in that respect. Selection of the null hypothesis, causal identification, effect estimation, etc.

    • Nameless:

      The researchers specified a hypothesis, but they did not specify a specific set of operations to be done on the data. As I wrote above, had they seen different patterns in the data, they could’ve done an equally reasonable analysis that would’ve been just as consistent with their substantive hypothesis. There are many different roads that would lead to statistical significance. To put it another way, there is actually much more than a 5% chance that they could get p<.05 on the first attempt, because what is done in that first attempt is conditional on the observed data. It was not a prespecified analysis. I am not actually a big fan of prespecified analyses; I like to decide on details of my analysis after seeing the data. But if this is done, you have to realize that p<.05 doesn't mean what you think it does. This is the key point of Simonsohn et al.

      • “had they seen different patterns in the data, they could’ve done an equally reasonable analysis”

        The article seems to imply that they were looking for the exact pattern that they found: an excess of red and pink, in the high fertility group according to their arbitrary definition, in both groups. That’s what I called “hole in one”. It is possible that they would have looked for red only, or for excess of white, etc. if they didn’t find an excess of red and pink. Unless they are not describing their methodology correctly, there’s in fact a 5% chance to get p<0.05 on the first attempt, because looking at the data and mining for asterisks would not count as "first attempt".

        All the talk about p-values is a bit misleading, because they did not report the p-value for their hypothesis in the collapsed set. It should be lower than their lowest reported 0.02 for the first group.

        There is a bigger problem though. There's no raw data, and the numbers they do report are inconsistent. I tried to reconstruct the original data and it seems to be impossible.

        "Women were classified as wearing a red/pink shirt (Sample A, n=17; Sample B, n=5) or an other-colored shirt (Sample A,
        n=83; Sample B, n=19)…. we found that 76% of women in Sample A and 80% of women in Sample B who were wearing red/pink were at peak fertility"

        So far this is fine. Total numbers are 17+83=100 in sample A and 5+19=24 in sample B. 76% of 17 red/pink in sample A and 80% of 5 red/pink in sample B are at peak fertility: that's 13 and 4 women respectively. (13/17=0.7647, 4/5=0.8).

        "Women at high-conception risk were substantially more likely to be wearing a red/pink-colored shirt compared to women at low-conception risk; 40% vs. 7%, and 26% vs. 8%, in Samples A and B respectively."

        In group A, we have 13 "red/high risk", 4 "red/low risk", X "non-red/high risk", 83-X "non-red/low risk". They are reporting 13/(13+X)=0.40 and 4/(4+83-X)=0.07. There's no value of X that results in these percentages. You can't even get 40% for 13/(13+X) with proper rounding, because 13/32=0.406 and 13/33=0.395. These correspond to 4/(87-X) of 0.0588 and 0.0597.


        • In group B it’s equally bad. We have exactly 4 women in “red/high risk” and 1 woman in “red/low risk”. If 26% of high-risk women and 8% of low-risk women wear red/pink, there are 11 women in “non-red/high risk” (4/(4+11)=0.267) and either 11 or 12 women in “non-red/low risk” (1/(1+11)=0.083, 1/(1+12)=0.0769). But that’s not possible, because there are only 19 women total wearing other-colored shirts in sample B.

        • We have shared the raw data from these studies with several researchers who have asked. Please contact me or Alec Beall if you would like to see our raw data. Our goal is to be completely transparent, and so we are very open to data-sharing.

        • “While we agree with several of Andrew Gelman’s broad concerns about current research practices in social psychology”

          “The field of psychology—and social psychology in particular—is currently experiencing an intense period of self-reflection.”

          Great that you are willing to share the raw data!

          Do you have any tips on what researchers can do themselves (aside from being willing to share raw data) to better things? For instance, could they use the Simmons et al. “21 word” option (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588 and http://pss.sagepub.com/content/22/11/1359.full.pdf+html) to optimally provide the reader with information about the analyses? Could that have helped with providing clarity about “researcher degrees of freedom”?

          Have you already tried to replicate the findings?

        • Title: “Do you have any possible tips what social psychologists, social psychology journals, or institutions where social psychology research is being performed, can do themselves to better things?”

          Introduction:

          Taking personal responsibility may be especially hard for (some) social psychologists (“theory of social psychology”, 2013). This is because they have been focused on groups and surroundings and such most of their lives. The “theory of social psychology” (2013) posits that social psychologists may have taken in what the “rest of the group” does to such an extreme extent that they simply cannot withstand such influences. This results in a situation where the current poor state of social psychology is basically everyone else’s fault, or the “system” is at fault, but in no way have they themselves had any influence on things or any responsibility for things. The “theory of social psychology” (2013) explains that this is also the case for social psychology journals, and preliminary results indicate that this may also hold for institutions where social psychology research is being performed.

          The current situation social psychologists are in is undoubtedly due to unconscious processes, or fear of rejection by their peers (or maybe because of one of the countless other social psychology theories/variations to which such a hypothesis and/or results could be tied). It is however important to note that it’s not social psychologists’ fault. It’s not even social psychology journals’ fault. None of these parties, especially not social psychologists themselves, have had any influence, or any responsibility, in the current situation whatsoever. Any criticism, or even critical thinking about this, is therefore totally uncalled for. Most importantly however, anything social psychology puts out there is totally true: you don’t have to really think about things but only have to listen to social psychologists and their findings!

          Hypothesis:

          Social psychologists, and social psychology journals, have a very hard time taking even the most simple measures to improve things. This is mediated by a tendency to not be able to look in the mirror (to be measured by the “Tendency To Not Be Able To Look In The Mirror”-scale; TTNBATLITM-scale, 2013) and by a tendency to behave like a little child when someone (wants to) replicate my findings (to be measured with the “Tendency To Behave Like A Little Child When Someone (Wants To) Replicate My Findings”-scale; TTBLALCWS(WT)RMF-scale, 2013).

          Results:

          Look at the present-day state of social psychology. Really, look at it. No further statistical analyses were needed (this is also why we did not pre-register our analysis plan).

          Conclusion:

          It may be especially hard to look in the mirror, and take responsibility for things, and think for yourself when you are a social psychologist, social psychology journal, or an institution where social psychology research is being performed. Further research is needed of course, also concerning possible other moderators and mediators.

          Regardless of that, a press release will be leaving our university/journal ASAP! (“Exploratory findings” are just too important for science to evolve, of course, and why not then also immediately share them with the entire world!!)

        • I noticed the same problems with the numbers and sent an email to the first author for clarification. Afterwards, I realized what happened. The text switches the percentages for Samples A and B. The 40% and 7% are actually for sample B and the 26% and 8% are for sample A. Now you can get frequencies that match those percentages.

        • Yes, that makes sense.

          Then the solution is:
          13+4 red/high
          4+1 red/low
          38+6 other/high
          45+13 other/low

          In the collapsed set, 49.2% are high-risk. In the red subset, 17 of 22 are high risk. Probability of 17+ of 22 being high risk given null hypothesis is 0.007.
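          (For reference, that tail probability can be checked directly; a short Python snippet, using the reconstructed counts above and treating each red/pink wearer as an independent draw with the pooled base rate:)

          from scipy.stats import binom

          p_high = 61 / 124                # 49.2% high-risk in the collapsed set
          print(binom.sf(16, 22, p_high))  # P(17 or more of the 22 red/pink wearers are high-risk), about 0.007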

  14. Their title alone is enough for increased scrutiny: “Women Are More Likely to Wear Red or Pink at Peak Fertility”. Though they protest, their title makes a strong inference to the general population of fertile women. When I looked at their supplemental data (http://ubc-emotionlab.ca/wp-content/uploads/2013/01/Beall-and-Tracy-PS-Online-Supplement-2013.pdf), I concluded that women, regardless of “risk”, just don’t like to wear red at all. How about this for a bizarro alternative title: “Women at high conception risk less likely to avoid wearing red”

    • This is not the issue with the paper. People use the phrase “more likely” for comparisons between groups even if the base rate is low in both groups.

      • I do think the title is part of the problem here. Sure, the use of “more likely” is consistent with describing results of comparisons as you mention. However, the use of the phrase here is a cue for the general reader to make strong inferences about wearing red and fertility, and ignore the low base rate. Will most readers ever venture beyond the abstract?

  15. Andrew

    >How many researcher degrees of freedom would it take for this to become a figure that would reasonably allow Gelman to suggest that our effect is most likely a false positive? This is a basic math problem, and one that Gelman could solve. Without such calculation, the conclusion that our findings provide no support for our hypothesis would never pass the standards of scientific peer review.

    This seems like a decent point. The answer might be generally useful as a fun fact, if not an actual first-cut heuristic. I was surprised your rejoinder didn’t include this calculation or an explanation of why the idea is off-track. Would you mind, please? Thanks in advance.

    • If I look at a scatter plot and have one degree of freedom to knock off a subset of data points, I am done in 1df.

      Silly question? Bad definition?

    • Brad:

      I addressed the issue of the two different p-values in the following paragraph (copied from above):

      Similarly, Beall and Tracy found a pattern in their internet sample and their college students, but a pattern in just one group and not the other could also have been notable and explainable under the larger theory, given the different ages of the two groups of participants. And it would have seemed reasonable to combine the results from the two samples, or even gather a third sample, if a striking but not-quite statistically significant pattern were observed. Again, their data-analysis choices seem clear, conditional on the data they saw, but other choices would have been just as reasonable given other data, allowing many different possible roads to statistical significance.

      • Andrew,

        Thanks for your reply & reference. I’m sorry, I’m only a little conversant with but not competent at all in mathematical statistics. I understood from their question that the answer would be a certain number ‘N’ of degrees of freedom. From your reply to me, it sounds like you reframed their question & answered it qualitatively.

        With respect, I don’t think it’s obvious to the layperson how your answer relates to, or is suggestive of, that hypothetical value N; or alternatively why you feel that knowing N wouldn’t really make a difference in the discussion?

        • Brad:

          It’s hard to answer the question quantitatively because the researchers do not precisely state the connection between their hypotheses and the data, hence it cannot be precisely determined what analyses would have been performed had other data been observed. We would have the same difficulty trying to understand the analysis of the arm-circumference paper and the ESP paper discussed in my Slate article. Recall that the ESP paper had something like 9 different statistically significant experimental results. (1/20)^9 is a tiny number, and we certainly cannot imagine a set of researchers performing anything like 20^9 different data analyses; nonetheless the problem is still there. The key is that, even if the researchers tried out only one single analysis given their data, had they seen other data they could well have chosen equally plausible analyses. (A small numerical sketch of this arithmetic appears at the end of this thread.)

        • “It’s hard to answer the question quantitatively because the researchers do not precisely state the connection between their hypotheses and the data.”

          This is an interesting theme of this discussion to me; I think it could be substantially ameliorated if genuinely formal models were more common in psych, so that the gap between ‘theory’ and ‘testable predictions’ is a deductive, unambiguous one, as opposed to theories stated entirely in prose that leave it unclear what would or would not count as disconfirmation.

          Also, broadly: this has been a fantastic discussion to read, and I’m glad you wrote the Slate article and that Beall and Tracy were willing to politely engage further on the issue as well.

        • It seems like this is not a problem of mathematical statistics (or basic math as the paper authors suggest). More like enumerating the number of places to look when you have lost your keys — the exact number is a function of how imaginative you are at searching and how much you really want those keys. And how you count — is the whole sofa worth 1, or 1 for every pillow I look under? I understand researcher df as a rhetorical device.

        • Nick,

          Well again, just speaking as a layperson here, I don’t read their description of this hypothetical calculation as simply rhetorical. It does sound to me as though they feel it’s a straightforward process to determine that ‘N’, the number of degrees of experimenter freedom that would immediately render their findings suspect. If you can parse their text differently, I’d sincerely appreciate hearing your interpretation.

        • If they think this, then they clearly *don’t* understand the issues involved. When applying, say, a t test, each “degree of freedom” is well defined; “researcher degrees of freedom” are not something like that.
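
    To put rough numbers on the exchange above, here is a minimal sketch of the standard multiple-comparisons arithmetic. It is illustrative only: it assumes k nominally independent analyses each run at the 5% level, which is exactly the assumption that is hard to pin down for researcher degrees of freedom, since the set of analyses that would have been run on other data is not well defined.

        # Chance that at least one of k independent analyses reaches p < .05
        # under the null; illustrative arithmetic, not a model of the study.
        alpha = 0.05
        for k in (1, 2, 3, 5, 10, 20):
            p_any = 1 - (1 - alpha) ** k
            print(f"{k:2d} potential analyses -> P(at least one p < .05) = {p_any:.2f}")

        # The flip side, from the ESP example: nine results each significant
        # at the .05 level would have joint null probability (1/20)**9.
        print(f"(1/20)^9 = {(1/20) ** 9:.1e}")

    Neither direction of this arithmetic settles the question, because the hypothetical “N” depends on which analyses would have been run had the data come out differently.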

  16. On the one hand, I’m still confused about the hypothesis-test criticism here. If we assume that the authors did not mine for asterisks, it seems highly unlikely that by chance alone they predicted which color would produce a ‘significant’ result. This goes out the window if we assume that the authors are tinkering with the hypothesis after observing the data — Oh hey! Shoe style, or skin exposure, or makeup application varies with ovulation! Let’s speculate. — but, short of that assumption, it just doesn’t seem likely that they’d nail the result a priori in two different, if weak, samples.

    On the other hand, the samples are weak and small, and the effects are difficult to estimate with any certainty. And despite how it may look (i.e., “mining”), I like seeing robustness checks. For the ovulation window, for example, having a pre-registered window is good. But if we shorten it, or (more likely) lengthen it to accord with other scholarly definitions, would the results change? Even though this can look like “mining”, I’m more apt to believe a result if it seems robust to small changes in how we measure debatable quantities (see the sketch after this comment).

    And, just as a side note, if pink is a shade of red, why is grey not a shade of black and/or white? And the comment that 76-80% of women who wore red were at peak fertility (p. 3) seems weird. It seems unreasonably high that this proportion of red-wearers would be at peak fertility. An important bit of information for undergraduate males at UBC, I guess.
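
    As a sketch of the robustness check described above: with the raw per-participant data, one could simply recompute the red/pink-by-fertility table under several window definitions. Everything below is invented for illustration; the window boundaries, base rate, and simulated records are hypothetical stand-ins, not the study’s data.

        import numpy as np
        from scipy.stats import fisher_exact

        # Hypothetical stand-in data; a real check would use the study's
        # per-participant cycle-day and shirt-color records.
        rng = np.random.default_rng(0)
        n = 124
        cycle_day = rng.integers(1, 29, size=n)    # reported day of cycle
        red_or_pink = rng.random(n) < 0.18         # wore a red/pink shirt

        windows = {"days 6-14": (6, 14), "days 7-14": (7, 14), "days 9-17": (9, 17)}
        for label, (lo, hi) in windows.items():
            high_risk = (cycle_day >= lo) & (cycle_day <= hi)
            table = [[int(np.sum(high_risk & red_or_pink)),
                      int(np.sum(high_risk & ~red_or_pink))],
                     [int(np.sum(~high_risk & red_or_pink)),
                      int(np.sum(~high_risk & ~red_or_pink))]]
            odds, p = fisher_exact(table)
            print(f"{label}: odds ratio {odds:.2f}, p = {p:.3f}")

    A result that survives all reasonable window definitions is more believable than one that depends on a particular choice.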

  17. Forget the statistical criticisms (well, actually, no don’t – see http://deevybee.blogspot.co.uk/2013/06/interpreting-unexpected-significant.html) but really!
    As someone who spent many years with no red/pink clothes, and whose principal means of selecting the day’s clothing is by sniffing the armpits, I find this study’s results totally unbelievable. It presupposes that most women have a substantial multicoloured wardrobe and our choice of clothing is dictated by factors other than what is not in the laundry.

    • I would, however, entirely believe results that showed that menstruating women are not as likely to wear white or other lightly colored pants as non-menstruating women. In fact, I think that bit of advice constitutes a large part of “sex education” in some eras and geographic regions of the US.

    • ‘It presupposes that most women have a substantial multicoloured wardrobe and our choice of clothing is dictated by factors other than what is not in the laundry.’

      Most importantly, it will turn out to be due to some strange unconscious processes. That’s really the important thing to put in your article, and most importantly to put in the subsequent press release to the general public.

      • Who is responsible for research findings like these being presented in a press release, anyway? Is that something the scientists have to initiate themselves, or is it something the journal initiates? How does this work?

  18. Interesting discussion! It strikes me that your argument is a bit like accusing your girlfriend of cheating, because you know that in general a lot of people cheat. Your girlfriend wouldn’t think this was very fair, and neither, I imagine, do Tracy and Beall.

    Essentially, you cannot draw conclusions about a single paper based on the fact that questionable research practices are common. Maybe this study was purely confirmatory and hypothesis-driven. Maybe it wasn’t. Do you have any inside information that tells you one way or the other? So all you can say is that you don’t know whether the results from this particular study are to be trusted or not. This is a big problem, and is the focus of the whole debate about pre-registration and such. But what you seem to be saying (do I misread?) is that you somehow know that these results are the result of a fishing expedition.

    Dorothy Bishop makes a more convincing case, I think!

    • Bad analogy, I think. My analogy: Lots of cashiers/tellers used to dip from the till. So now we put systems in place: a locked cash register, an end-of-shift tally of cash and receipts, a second auditor, video supervision, declaration of personal cash pre-shift, etc.

      If someone disables the camera, “forgets” to tally, does not wait for an auditor, etc., then yes, sure, we don’t have proof they are stealing. But it’s perfectly normal to be very suspicious that they are.

      Pre-registering your study, using large samples, using representative samples, etc. are all defenses that guard against such accusations. If you as a researcher choose to ignore them all, you do so at your own peril and can’t then complain about people picking on you.

    • Sebastiaan:

      No, I’m not accusing Tracy and Beall of cheating. I’m saying they followed standard practice, but standard practice has big problems. That’s why I wrote about this topic in the first place. I don’t really care about their particular claims. I’m bothered that such completely unconvincing work is being published in the top journal in the leading society in psychology.

      Regarding the bit about fishing, etc., see this comment above. The researchers may well have chosen a specific analysis given their data with no fishing or trying out of alternative hypotheses. But had they seen different data, they could have done other completely reasonable analyses. It would not feel like “fishing” because, in any case, only one data set would be seen, so the analysis would be uniquely chosen. Nonetheless, there would be many different roads to statistical significance.

      • I do appreciate your general point, don’t get me wrong. The only thing that I want to point out is that you are inferring the specific (these results are untrue because they result from p-value fishing) from the general (many results are untrue because p-value fishing is so common). Researcher degrees of freedom are very slippery things, and you just don’t know how many hypotheses and comparisons could potentially have been considered and reported. Maybe only one, if we are to believe the authors. Of course I’m skeptical as well, and I share your gut feeling that this study has all the traits of a fishing expedition. But it’s just a gut feeling.

        In the end, by picking out one specific study you a) create a lively debate that raises awareness of an important issue (look at the number of comments here!), but also b) place yourself in a weak position, because you cannot know how representative this particular study is of the general point that you’re making.

        • I just thought of a (hopefully) better way to say what I mean. If we know that there were many researcher degrees of freedom, then we know that we have to be skeptical about the results. Let’s call this level 1 uncertainty. However, if we do not know whether or not there were many researcher degrees of freedom, then we do not even know whether or not to be skeptical. Level 2 uncertainty, say. Even though we may be able to estimate, across a large number of studies, how skeptical we have to be on average.

          I guess I feel that you are describing a level 2 uncertainty problem, as though it is a — far simpler — level 1 uncertainty problem. Does that make sense? Or is this statistically esoteric nonsense?

  19. “Here, we take the opportunity to make these clarifications, and also to encourage those who read Gelman’s post to read our published article, available here, and Online Supplement available here.”

    I read this:

    http://ubc-emotionlab.ca/wp-content/files_mf/bealltracyinpresspsychsci.pdf

    “These findings support the expectation that displays of red and pink are a reliable fertility cue in women, and are the first to suggest a visually salient, publicly observable objective behavior that is associated with female ovulation.”

    What I subsequently don’t understand is the following. In the introduction it is stated:

    (p. 5) “(…) an increased desire to wear revealing clothing (Durante, Li, & Haselton, 2008), and a tendency to wear clothing that leads women to be judged as “trying to look more attractive” (Haselton et al., 2006). In addition, one study found that women at peak fertility wore more revealing clothing, but this effect emerged only among partnered women (whose partners were absent) attending Austrian discotheques where, presumably, dressing provocatively does not violate social norms (Grammer, Renninger, & Fischer, 2004)”.

    Can’t “wearing more revealing clothing” be seen as “publicly observable objective behavior”, or am I not reading/understanding things correctly?

    I subsequently wondered how these findings can be cited in the introduction first, while later on the results of the present research are described as “the first to suggest a visually salient, publicly observable objective behavior that is associated with female ovulation”? To me this does not make any sense, and seems totally incorrect, mmmkay.

    I wonder if perhaps the editors of the journal could have been at peak fertility at the time of accepting this manuscript, because this seems like a nice example of one of those “sexy” findings (which make good headlines in papers and tabloids) that get published in (some) social psychology journals…

    • Can one also contradict oneself in a single paragraph?:

      “Past research has suggested that women desire to dress sexier during ovulation; however, studies have largely failed to demonstrate any consistent behavioral change in the sexiness of women’s dress across periods of conception risk (Haselton et al., 2007; Durante, et al., 2008; Haselton & Gangestad, 2006; Grammer, et al., 2004). The current investigation offers a possible explanation for this discrepancy: Although women at peak fertility may largely refrain from dressing more provocatively out of social-normative concerns (Durante, et al., 2008), they may nonetheless seek to increase their apparent sexiness by adorning in the colors known to increase their attractiveness to men, which, at least in North American contexts, are not associated with any social stigma”

      provocative – exciting sexual desire; “her gestures and postures became more wanton and provocative”
      sexy – marked by or tending to arouse sexual desire or interest; “feeling sexy”; “sexy clothes”; “sexy poses”; “a sexy book”; “sexy jokes”

      Maybe the ‘discrepancy’ relates to these previous studies about ‘sexiness’ if they only studied ‘sexiness’ in the sense of ‘showing more skin’ or something like that? If this is true, then I wonder why they would not make that more explicit (it would just be a few words extra). To me, these sentences, as they are written now, make no sense.

  20. “Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case.”

    I have some trouble seeing how post-hoc hypothesizing is only a “reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work”, or how this could “not make sense in the present case”.

    Wouldn’t the mere possibility of post-hoc hypothesizing imply that there are hypotheses to pick from (which I assume are somehow, in some form, based on “prior work”)?

    I assume that post-hoc hypothesizing almost always involves referring to “prior work” (otherwise every article that ever published “post-hypothesized” findings would be easily recognizable, because it would have no introduction with references to “prior work”).

    If that makes sense, then post-hoc hypothesizing implies that apparently there are findings that would allow for an alternative hypothesis. And if that makes sense, then a “well formulated” hypothesis is not necessarily proof of not having engaged in post-hoc hypothesizing.

    article on HARKing (hypothesizing after results are known): http://www.sozialpsychologie.uni-frankfurt.de/wp-content/uploads/2010/09/kerr-1998-HARKing.pdf

    • “Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case.”

      “a) Gelman suggests that we might have benefited from researcher degrees of freedom by asking participants to report the color of each item of clothing they wore, then choosing to report results for shirt color only. In fact, we did no such thing; we asked participants about the color of their shirts because we assumed that shirts would be the clothing item most likely to vary in color.”

      —->’we asked participants about the color of their shirts because we assumed that shirts would be the clothing item most likely to vary in color.’

      —-> I wonder whether this assumption was based on prior work, and how this relates to ‘studies testing hypotheses that are/are not well-formulated’

    • “Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case.”

      “Research articles that follow good research practices should not become suspect simply because their findings are unexpected.”

      In general: I don’t understand how expected or unexpected results have anything to do with following good research practices.

      In this particular case: I don’t understand how the results are ‘unexpected’, because it is also stated that: ‘We came up with the hypothesis while working on that paper, and were in fact surprised that it hadn’t been tested previously, because it seemed to us like such an obvious possibility given the extant literature’

  21. Are the authors’ conclusions based on p=.02 and p=.051? Surely that is very weak statistical evidence. Berger and colleagues have shown several times that such p-values can *never* yield a compelling Bayes factor, no matter what prior one uses. Also, I am not convinced by the authors’ reply. A compelling reply would be: “Andrew, you raise important concerns. We will convince you by replicating our experiment, preregistering it on the Open Science Framework, and conducting exactly the same statistical analysis so that all our degrees of freedom are eliminated.” I very much encourage the authors to do just this — I see no good reason not to; the experiment is trivial to repeat, and testing on Amazon Mechanical Turk is effortless and inexpensive. If the authors do not wish to replicate their experiment to bolster their claim, it would be interesting to know why not.
    Cheers,
    E.J.
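
    A quick way to see the force of E.J.’s first point is the well-known upper bound of Sellke, Bayarri, and Berger (2001) on the Bayes factor against a point null implied by a p-value: at most -1/(e p ln p) for p < 1/e, over a broad class of priors. A minimal sketch:

        import math

        # Upper bound on the Bayes factor against the null implied by a
        # p-value (Sellke, Bayarri & Berger, 2001): -1 / (e * p * ln p).
        for p in (0.02, 0.051):
            bf_bound = -1.0 / (math.e * p * math.log(p))
            print(f"p = {p:.3f} -> Bayes factor against the null is at most {bf_bound:.1f}")

    Even in the most favorable case this works out to roughly 5 to 1 for p = .02 and about 2.5 to 1 for p = .051, which is why no choice of prior turns these p-values into compelling evidence.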

    • Agree, but a bit of the challenge is that authors should not be the ones to replicate their results – others (with design input from the authors) ideally should do it.

      Authors’ overly defensive responses regarding methodology, and especially statistics, are to be expected but also expected to mostly be a waste of time.

      • “Agree, but a bit of the challenge is that authors should not be the ones to replicate their results – others (with design input from the authors) ideally should do it.”

        I understand the importance of “independent” replications, perhaps even preferably done in close collaboration with the original authors, but I then also always think about the following scenario:

        What if you had a researcher who always publishes low-powered, “exploratory” studies and never pre-registers studies or uses larger sample sizes? What if it turned out that most of their studies do not replicate very well when replicated by other researchers? It could then perhaps be seen as them “littering” the scientific literature with less-than-optimal studies, which other researchers would then “have to clean up” by performing a replication study properly, for instance.

        What I wonder is whether original authors could also be expected to replicate their own studies, e.g., pre-register a “confirmation study” or something like that, following up a more “exploratory” study (which they apparently deemed important enough to publish). In my opinion, they may have a responsibility to do this, because they are the ones responsible for publishing it in the first place (thereby possibly causing other people to view this information as useful or true, and thereby also possibly causing other researchers to invest resources in building on the work and/or trying to replicate it, etc.).

        • C.F.:

          You write:

          What if you had a researcher who always publishes low-powered, “exploratory” studies and never pre-registers studies or uses larger sample sizes? What if it turned out that most of their studies do not replicate very well when replicated by other researchers? It could then perhaps be seen as them “littering” the scientific literature with less-than-optimal studies, which other researchers would then “have to clean up” by performing a replication study properly, for instance.

          In my ideal world, this researcher’s studies would be published in Arxiv or Plos-One or some other repository but not in Science, Nature, JPSP, Psychological Science, etc. Then others could follow such work if they want, but they would not feel any obligation to replicate this work because it would be generally understood to be speculative.

          My big problem with studies such as that of Tracy and Beall is not that they are performed and published, but that they are published in (and publicized by) high-profile, supposedly serious journals such as Psychological Science.

        • I don’t care where things are or are not published. It seems to me that that is of less importance, regarding scientific credibility, than the quality of the content/data/reasoning/etc.

          E.g., if I knew of a researcher who replicated the results of this “sexy” study using pre-registration, more people, different populations, etc., I would tend to view that information as more credible and informative. If this were subsequently posted on Arxiv (and not even published in a journal), I would still view that information as more credible than anything published in a “scientific journal” using less rigorous methods. What even makes a journal “scientific”?

        • “What even makes a journal “scientific”?”

          I think one characteristic of a “scientific” journal is “peer review” before accepting the article for publication. With “peer review” you have a few colleagues who read your article and then give tips on how to improve it before the article gets accepted or rejected for publication.

          Maybe the ‘women are more likely to…’ article forms a nice argument for why continuous post-publication “peer review” could be good for science. And maybe you could even just leave out the “peer” part…

        • From the above reply of the authors: “For one thing, it is important to bear in mind that our research went through the standard peer review process—a process that is by no means quick or easy, especially at a top-tier journal like Psychological Science. This means that our methods and results have been closely scrutinized and given a stamp of approval by at least three leading experts in the areas of research relevant to our findings (in this case, social and evolutionary psychology).”

          “This means that our methods and results have been closely scrutinized and given a stamp of approval by at least three leading experts in the areas of research relevant to our findings (in this case, social and evolutionary psychology).”

          “and given a stamp of approval by at least three leading experts ”

          Take the original article as an example to think about the benefits, from a scientific perspective, of having more than 3 “leading experts” give feedback on things!

          Continuous post-publication review/commenting seems such a great idea! I mean, just read this blog, and all the scientists (and maybe even non-scientists) presenting information, using logic and argumentation to try and optimally interpret findings, and maybe even find ways to improve science altogether. Awesome!

        • CF,

          I think you’re right in principle, especially when considering the opinion of scientists. Unfortunately, as Beall & Tracy point to in their rejoinder, research (and criticism) is often viewed through the media lens. Being published in a top-tier journal gives the study credibility, particularly with journalists and the public.

          Andrew is right that top journals should probably not publish this type of study. The issue goes further, though. Even reviewers for lower-tier journals should — if they deign to publish the piece at all — require a bit more modesty in presenting findings. More “Hey, this is speculative, we need somebody else to do this better!” and less “Here is a solid finding based on good data and solid methods.”

          Pre-registering hypotheses and designs, of course, would do even more to combat the issue, as would general redesign of publication procedures.

        • My big problem with studies such as that of Tracy and Beall is not that they are performed and published, but that they are published in (and publicized by) high-profile, supposedly serious journals such as Psychological Science.

          I’m skeptical that applying a more stringent filter at the top journals will shift science in a positive direction. I’d like to hear more about what kinds of articles you think these journals should accept, though. Preregistration (and operationalization) of all hypotheses, with prospective power calculations? Preference given to replication of previous work?

          Personally, the more I think about the problem, the more convinced I become that shifting to post-publication peer review is an essential part of the answer. I think using journal prestige to signal “important” and “true” creates broken incentives. Is anyone trying anything like a Reddit for science? That’s the kind of system I’d like to see tried. It would take a lot of guts to publish on such a new model pre-tenure, of course. The tenure process is a conservative force when it comes to problems in scholarly communication.

        • “Is anyone trying anything like a Reddit for science? That’s the kind of system I’d like to see tried”

          Maybe this right here can be seen as sort of a “Reddit for science”.

          There have been a lot of comments made about this study here, which the authors, or others, can take into account should they view them as valid and useful. The point is that it can all be seen as information that could be used to set up/design a (new) study. All of this has been achieved over the course of a few days, through the engagement of multiple people. With this in mind, compare the article itself, which took over a year to move through the peer review, acceptance, and subsequent publication process.

          Aside from commenting about possible improvements, a “Reddit for science” could also include ideas for follow-up studies, maybe tackling different aspects of methodology, hypotheses, etc. (e.g., having participants take a picture every day of what they are wearing for an entire month; that way the entire outfit could later be judged in terms of how “revealing” it is, as well as its color, not just shirt color). Many possibilities with a “Reddit for science” type system, and I think it would also speed up, and improve, scientific progress. That would be fun stuff!!

        • You’re right, blogs like this one do offer a certain kind of post-publication review, and it’s super valuable! I’ve assumed that something more centralized would be necessary to really reach one’s intended audience — certainly that has been among the needs filled by scientific journals: knowing where to look for the newest stuff of interest to you; knowing where the people you want to reach will be looking — but I guess it’s an open question whether self publishing + Google would be enough.

          However, the other thing I’d want to replicate is the function (currently served imperfectly by prepublication peer review) of getting a sense whether people in the field take a finding seriously. Some random wordpress post isn’t (and shouldn’t be) enough for people to believe a thing is true. If research reports were hosted in some central place with infrastructure supporting upvotes, downvotes, comments, etc, readers could get a sense of a paper’s centrality & the strength of its methods as adjudicated by peers — a better sense, I’d argue, than we get from the current system, where all we know in most cases is “three unpaid reviewers plus an editor thought this was okay.” Being able to revise usefully when people point out flaws, etc. would also be a great thing, because when it comes to communicating problems to people who have read and may be relying on a published paper, my hunch is that errata and retractions and such don’t work well at all.

          Taking off my utopia goggles for a moment, I do think a Reddit-like model could have a lot of pitfalls, culture chief among them. It’s hard to establish norms for appropriate behavior, and the original Reddit certainly isn’t a beacon in that regard! And figuring out the appropriate role for authority would be important, too… not impossible but it would require a lot of trial and error, I think. But it would be nice to see someone try it because the current model is just a mess, IMO.

        • Re: culture in Reddit-like systems, it is definitely hard, but it is an area of active research. The folks at Hypothes.is (http://hypothes.is/) have thought a LOT about reputation models and how to build trustworthiness into the system, building on lots of work on online communities. In any of these systems you’re inevitably going to have noise/loss, but the relevant question would be how much, relative to the benefits (and relative to the noise/loss in the current system).

        • Joel: fascinating link! Thanks for sharing. Peter Brantley’s name is super familiar but I can’t place why. I will be thinking more about this.

        • Erin:

          No filter will catch everything, but I do think we have a problem when Psychological Science–the leading journal published by the leading society of research psychologists–publishes several of these unbelievable papers in such a short period of time. Beyond the immediate damage done by the publicizing of these implausible and very weakly supported claims, consider the incentives this gives for psychology researchers everywhere to aim for these little “p less than .05” mini-studies. And if the current study had been published in Arxiv or Plos-One instead of Psychological Science, we wouldn’t be discussing it right now. Any discussion would have to wait on a serious replication, at the very least.

          As a start, I think that if a journal is willing to publish a paper on whatever topic, whether it be ESP or arm circumference or the color of clothing, that the publication decision should be based on the quality of the experiment, not on the p-value. That is, if the paper by Tracy and Beall was good enough to be published in Psychological Science as is, I think that an equivalent paper with no statistically-significant p-values should also be publishable.

          That said, I completely agree with you about post-publication peer review. I do think that’s the way to go. For now though we still have to deal with the large influence of high-ranked journals. That’s an issue that might eventually go away but hasn’t yet.

        • I definitely understand the role of Psychological Science within psychology — it comes to my house every month!

          What is your metric for the quality of an experiment? I would love to see a system where null results became publishable, but I don’t think null results from a paper like Tracy and Beall’s would tell us much. Maybe that’s your point? — that to be a high-quality experiment, your N must be large enough (i.e. your prospective power large enough) that a null result affords some degree of interpretability?

          One of the things many psychologists are missing, and it’s something I still struggle with occasionally because that’s where I began, is the idea that somebody might care how big an effect is. Like, APA style can mandate the reporting of Cohen’s d or whatever and that’s fine, but I don’t think it’s really crossed over into affecting how psychologists build their theories. Or at least it wasn’t when I trained, where I trained. The question was always simply, “Is this effect present [Y/N]?” And the goal was to find some effect that your enemy’s theory wouldn’t have predicted and thereby prove your enemy wrong. Trying to attach a size to the effect seemed beside the point. Sometimes even trying to attach a DIRECTION to the effect seemed beside the point!

  22. I’m happy to agree that their data provide weak evidence for the effect. I would like to see their confidence intervals rather than their p values because that would emphasize just how small an effect is consistent with their 0.02 p value. But I’m confused about how you think people should avoid “researcher degrees of freedom”.

    In particular, most of the work I do doesn’t involve such squishy subjects; I have recently been working on analyzing a biological assay for transcription activity. In building the model, my colleague and I had certain modeling choices. For example, we used a t distribution for the prior over certain types of effects because we saw certain outliers, realized that such outliers are not *that* difficult to generate by technical problems (i.e., there is a very simple explanation for how they could be generated, such as mislabeling tubes, contaminating the tubes with low levels of hormone, or something similar; fairly uncommon, maybe 1% or so of experiments, but capable of causing a big effect), and decided that a distribution with long tails makes sense in this context. A typical researcher degree of freedom, in our case intended to account for real, scientifically valid knowledge.

    Results so far indicate several very large effects, where the p value for a positive effect would be less than, say, 10^-4. But there are also a few effects that are smaller, and interesting. Suppose we generate a p value for one of the small interesting effects that is 0.02. (These are all Bayesian p values, i.e., the posterior probability that the effect size is greater than zero+epsilon, where epsilon is some “minimally biologically significant level”.) Sure, if we want to make this more convincing we should generate some more data on that contrast or whatever. But if critical statisticians are allowed to pull in all sorts of possible alternative results that could have been interpreted as consistent with our core hypothesis, then how small do we have to make this p value, by increasing the data size, before you’ll believe it? Are you relying, in essence, on the fact that if an effect is real, a little more data will narrow the posterior distribution enough that non-fraudulent researcher degrees of freedom would never overwhelm the decline in p value? I guess I’m puzzled by this emphasis on alternative results that might be consistent with the core hypotheses. It seems to me a little like frequentist p values being about unseen data that might have been more extreme than some reference level.
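
    As a side note on the modeling choice described above: the appeal of a long-tailed prior for absorbing rare technical errors can be seen by comparing tail probabilities of a normal and a Student-t distribution with the same scale. The 3 degrees of freedom below are an arbitrary illustrative choice, not the commenter’s actual model.

        from scipy.stats import norm, t

        # How much more probability a long-tailed distribution puts on
        # extreme values than a normal with the same scale; this is why a
        # t prior can absorb an occasional mislabeled or contaminated tube
        # without letting it dominate the fit.
        for cutoff in (3, 4, 6):
            p_norm = 2 * norm.sf(cutoff)
            p_t3 = 2 * t.sf(cutoff, df=3)
            print(f"P(|x| > {cutoff} scale units): normal {p_norm:.1e}, t(3) {p_t3:.1e}")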

  23. The crux of much of the debate here seems to revolve around how many comparisons the researchers could have or did perform. Gelman claims that they could have performed many, perhaps did perform many, and happened to report on one that “worked”. The authors claim that they were only interested in one comparison, chosen in advance and with a theoretical basis, and that had things not “worked”, they would not have used other comparisons.

    But here’s the peculiar thing: how could the researchers have insulated themselves against this criticism? Most simply, they could have simply collected less information. Instead of collecting information on 9 different colors, they could have just collected a binary red/pink or not red/pink measure and instead of collecting information on cycle days, they could have just collected a binary measure of in/not in the most fertile period (whatever that might be; I won’t weigh in on that debate). With just two binary observations on each woman, there’s really only one comparison to be made (though Gelman speculates in the Slate article, it seems unfairly, that the researchers might have been collecting additional data that they don’t report). Is this really what we want, though? To discourage researchers from collecting additional data? Surely, then the various large survey projects like ANES or GSS or the Pew Global surveys should be shut down tomorrow, and we should banish to the recycling bin every article ever to rely on these with scant apologies to essentially every scholar of public opinion in the last several decades.

    We could, however, fall back on the long-standing suggestion from philosophers of science to rely on theories to tell us which comparisons are relevant. It seems that this is what the authors of the maligned paper were doing and why the editors and referees chose to publish it. Journals cannot police people based on their hypothesized intentions for data comparisons or what they did with R in the dark of the night when no one was looking. They can police whether or not empirical work engages with theories important to the field. This seems to be the real problem – that many of the social sciences are increasingly atheoretical and focused on “sexy” findings that are disconnected from real theories and either based on questionable observational practices or on the most local of “local average treatment effects”.

    • Don:

      Just to be clear here, I certainly do not think the authors should’ve gathered less information. More information is better. And I also think it would be good for them to publish their raw data (anonymizing the participant names, of course). I just don’t think journals should be publishing things just because p is less than .05.

      Or, to put it another way, if it really was a good idea for Psychological Science to publish this paper, then I think it would have been a good idea for them to publish the paper even if nothing there at all were “statistically significant.” As Greg Francis discusses in his comment above, the trouble with this sort of study—even before the data are gathered—is that it can provide so little information that, realistically, the results can only be expected to be suggestive and speculative, not conclusive in the way that is implied by the “p less than .05” claim.

    • Don:

      Great points. One caveat. There is no need to lump together theory testing and measurement. After all, we can take the reductio ad absurdum in the other direction. Why not collect data on the whole universe?

      So yes, theory is critical, but so is economy. Given the magnitude of humanity’s problems and the strength of the theory the authors claimed to have, the two-variable test you propose sounds like really good value for money.

      The money saved could go into a replication, or to other experiments.

      PS: Surveys like the ANES or the economists’ PWT stand on their own. Science begins with observation, concepts, measurement, and measurement instruments.

  24. Maybe the point is what people could take as support of something versus what _measures_ the strength of it.

    If they considered n hypotheses and just took the largest, that’s clear.

    If they could have considered n hypotheses but took the first one they assessed as good enough – it’s less clear.

    If it was agreed upon with the FDA prior to even collecting data – that’s clear.

    But as JG Gardin once put it, “you can’t rule out an hypothesis by the way it was generated, but you should choose which ones to spend your time on”.

    So here the (Andrew’s?) concern might mainly be the efficient use of research resources: which claims should we try to replicate, and who should do that? If it’s real it will replicate; if people have found some evidence supporting it – it often is not real.

    There was some interesting stuff by Ed George on this, from a Bayesian view, at last year’s joint meetings.

  25. Andrew,

    This situation reminds me of the CHANCE article where you accused a researcher of being unethical for not sharing data with you for you to reanalyze.

    In both cases, you had broad, general statistical points to make:
    – Now: the problem of “many roads to significance”.
    – Then: the desirability of sharing data and reanalyzing it.

    In both cases, you’ve gotten yourself into trouble by choosing to use a specific example to highlight your general point without finding out what the truth is of those specific examples before going forward:
    – Now: Suggesting researcher df for different clothing items only to find out that wasn’t done. Or insisting that “other findings” would support their hypothesis despite the authors insisting that “other findings” would *not* support their hypothesis (no way to win that he-says/they-say battle).
    – Then: Suggesting a PhD student could analyze the data better than an experienced masters-level statistician only to find out (decades later) that there were elements of the problem you were unaware of that pointed to the analysis they did and away from the one you wanted. Or leveling a claim of “unethical” behavior decades ago despite the level of data openness you desire not being in effect even now (much less then).

    And in both cases, you’ve retreated behind your general points without acknowledging the mistakes you’ve made in the specific examples.

    Several people have commented above to the effect that your criticism of this paper (at least the researcher df part of it) depends on the assumption that the authors would have bent “other findings” to fit their supposed hypothesis. The authors themselves have denied they would have done so. I find it interesting that you so want the authors to dial back their certainty about their results, yet you seem unwilling to do so about your speculations.

    • Bjs:

      1. I do think it was unethical of the U.S. government researchers not to share their data with a citizen who asked for them, and in that case, despite what you write above, there were no elements of the problem that pointed to the analysis they did and away from the one I wanted. This has nothing to do with the fact that their statistician had a master’s degree. More eyes on the data are better, and I think it’s unethical not to share data, especially if it’s a government lab and the data are on chickens, so there are no confidentiality issues.

      2. Regarding the Beall and Tracy study, I never said the authors would have “bent” anything, nor did I say that the authors of the study on arm circumference would have bent anything, nor did I say that ESP researcher Daryl Bem would have bent anything. What I said, and continue to say, is that you can do a reasonable analysis, consistent with your substantive theory, given your data. But had the data been different, there would be a different reasonable analysis, just as consistent with the substantive theory. In any given case, the analysis makes sense, but when you put it together there are many more ways of obtaining statistical significance than you’d think based on the nominal p-value. That’s what researcher degrees of freedom are all about. As I commented above, I do think that expressions such as “fishing” and “p-hacking” are unfortunate in that they imply that a researcher has to be actively trying out different analyses. Actually, the researcher-degrees-of-freedom problem arises even if the researcher only tries out one analysis. The issue is that the chosen analysis is conditional on the data.

      • Andrew,

        Let me try and say back to you what I think your point is. The thinking seems to be something like “the results of this test may accord with the hypothesis under investigation, but lots of other results that could have come out of this test would also accord with the same hypothesis.” I think this is right, but I also think it is getting lost in the discussions of “researcher degrees of freedom”.

        I keep thinking to myself that there is some weird complementary but inverse (or something) relation to Deborah Mayo’s “severe test” here (warning, I’m new to her work, and just digesting it slowly, so forgive me if I formulate this wrong). That reasoning argues that a good test will produce evidence (with high probability) consistent with the hypothesis put forward and against alternative theories when those alternative theories are false. You seem worried about tests where, prior to being conducted, a number of different results would all support the hypothesis under examination.

        I feel like I’m not quite smart enough to figure out, or haven’t had enough time to think about, how these two problems are related. But I do think the vocabulary we are using to discuss this (as you have pointed out several times) is inadequate at getting at the philosophical/epistemological point.

        Maybe we want an empirical exercise to demonstrate two things: First, we want a test where only one result is consistent with the hypothesis under investigation. Second, we want a test such that equally plausible alternative hypotheses would not produce that result. This first point seems to be the one you’ve been making in this post and others (such as the muscle-politics paper). Is that right? Is there some vocabulary we could use to differentiate these epistemological problems from the more behavioral (or at least behavioral sounding) problems of “researcher degrees of freedom”, “p-hacking”, and “fishing”?

      • Andrew,

        Thanks for responding.

        1. In the CHANCE exchange, while there was a disagreement over *analysis*, I was (mis)remembering the disagreement over *experimental design*. From Dr. Blackman’s response: “I do not question Gelman’s conclusions that, on the average, the results of samples A2, B1, and B2 were indistinguishable, but, in my considered judgment, the audience I was addressing would not have accepted a revised experimental protocol as he suggests”.

        2. I never said that you said “bent”. What I said, and continue to say, is that it is an assumption on your part that had the data been different, the researchers would have either a) pursued a different analysis or b) re-imagined (my word, not yours — is it better than “bent”?) how the result still manages to fit their hypothesis. I think it is relevant, but not definitive, that the authors say they would not have done so (that neither their analysis nor hypothesis-interpretation would have been conditional on the data). And I think it is very relevant that a statistician should at least acknowledge the assumptions that he or she makes.

        I believe my larger point holds true. You have a history of using specific examples to illuminate general points (I have no problem with this). But you also have a history of missing important elements in the specific examples, being called-out by the authors of the specific examples, and then insulating yourself from their criticisms by retreating to the general point and avoiding the specifics.

        You’ve even done this again with your response to my comment. You reiterate your general stance that US gov’t researchers should share data with citizens who request it. You avoid my specific point that this is not currently standard practice despite your desire for it. And it certainly was not standard practice 20+ years ago. To label actions as “unethical” based on a standard which no reasonable person at the time would expect to be upheld is unfair.

        I know this may sound overly harsh or like I am out to “catch” you, but I am really writing to try to help you with your future blog posts. In your writing, you come across as quite confident, but as a statistician, I would think you would be sensitive to the uncertainty that surrounds the statements you make. My suggestion is to be more cautious in your conclusions about what researchers have done or why they did it or what they might have done in alternative circumstances. Perhaps in the future you could avoid some ire of other researchers while still making the general points you want to make.

        I hope this helps…. Thanks.

        • Bjs:

          I appreciate your feedback. In quick response:

          1. I consider not sharing data in response to a reasonable request to be unethical even if that’s not standard practice.

          2. Beall and Tracy did not preregister their analysis and they made various judgment calls (see, for example, Erik Loken’s comment on this post for some discussion of this point). I have little doubt that if the data had come out differently they would’ve done a different analysis. And that would’ve made sense. For example, had they seen no difference with red and a huge difference with pink, of course it would have made sense for them to focus on pink. Their own theory (as they described it here) singled out pink. This is not a criticism, to say that their analysis is contingent on the data. That’s just the way things are.

        • Andrew,

          1. We’ll have to agree to disagree. I think it is unreasonable to expect someone to behave in a certain way when it isn’t the way people typically behave. And I think it is unfair to then label that person (or their actions) as unethical.

          2. Fair enough. I understand the “many roads to significance” issue and the many ways one can end up on one of the many roads. And I’m in no position to claim that the authors are correct in their post hoc statement that they wouldn’t have changed their analysis. I guess I was hoping that you would at least acknowledge that this is an assumption (even if it is one you have “little doubt” about).

          In any case, I maintain my overall point about the difficulties you run into when you write about specific examples in the pursuit of general points.

          Thanks.

      • This “researcher degrees of freedom” terminology is where I think Ioannidis went wrong, or at least has been widely wrongly interpreted.

        The focus on “researcher degrees of freedom” has led people to think that the academic literature is broken because researchers are exploiting their “researcher degrees of freedom”. However, as you point out, the problem still arises from the collective action of researchers even if individual researchers voluntarily restrict themselves to almost no degrees of freedom going into the study.

        I think researcher degrees of freedom is the wrong term. What we’re actually talking about are poor study designs with respect to causal identifiability (and sometimes power as well) where the causal explanation is underdetermined given the measured outcomes. If a field (eg psychology, genetics, etc.) collectively utilizes such study designs, you will get lots of incorrect conclusions _even if_ every researcher picks one hypothesis going into the study. Everyone could even pre-register their hypothesis with some national research registry – it wouldn’t fix the problem.

  26. The study’s basic finding is that 77% of women who wear red are ovulating. Those are some pretty impressive odds, and I’m not sure what kind of prior I would have put on that finding. The logic of the argument depends crucially on the fact that sample A and sample B were both “significant”. But in collecting sample B – with the resources of the UBC psychology department at their disposal – the researchers settled for a sample size of 24. The Fisher exact test on sample B is p = .12. The chi-square is p = .05. Who would settle for such a weak replication on a survey that requires nothing more than single word answers to two questions?

    Furthermore, for sample B 9 of the 24 women didn’t meet the inclusion criteria of being more than 5 days away from menses onset, but they were included anyway. (22% of sample A also didn’t meet that criterion, but were included anyway. And even though sample A was supposed to be restricted to women younger than 40, the age range for women included in sample A was up to 47.) Out of all the women who participated across the two samples, 31% were excluded for not providing sufficient precision and confidence in their answers (but note that sufficient precision will vary with time since menses. It’s easier to be certain +/- 1 day for something that occurred 5 days ago as opposed to 22 days ago).

    Researcher degrees of freedom are in play when the inclusion and exclusion criteria are juggled like this. Researcher degrees of freedom are in play when the sample sizes for study A and study B are chosen in peculiar ways.

    Ah, but the researchers will say over and over again that “the results held” in the combined sample regardless of these choices. By “results held” I assume they mean the asterisk remained. And in their rejoinder they even say they could split the data on red, pink or both and it all still works.

    But notice that the logic of replication is lost. All of these robustness checks were done by pooling the two studies. There is no sample A and sample B as far as the robustness checks are concerned. The existence of sample B must have played a big role in convincing the reviewers to accept this paper. It looks good and gets its own separate name and everything. From a statistical point of view, however, sample B contributes hardly any validation evidence. It’s a tiny sample, very poorly powered given the results of sample A, and is only 20% of the pooled sample on which the various checks are made.

    Look, it’s possible the hypothesis has merit. But what would the expected effect size be? Women would need to have red and pink garments in their wardrobes, which would have to be available, and appropriate for the temperature and tasks of the day. At some conscious or subconscious level the women would have to consider the redness and the garment as a good choice out of their entire wardrobe for their purposes (with the purpose itself being conscious or unconscious). The central hypothesis may well be true, but the effect size can’t be expected to be substantial. I understand what’s being sold to me, I’m just not convinced by the evidence. There are far too many red flags to ignore.

    • Eric:

      Well put. The difficulty, I think, is that many researchers (including, I think, the authors of this paper and the referees and editor for Psychological Science) do not understand the difficulties of inference for small effects. (In your last paragraph listing the reasons why any effect size would have to be small, you could also add that the days of peak fertility vary from woman to woman and were mislabeled in the study.)

      The usual statistical training says that “statistical significance” is the goal, and if you reach significance you’ve won. It’s sort of like, if you win the game, then your team can go on to the playoffs; once you’ve won, the score of that previous game doesn’t matter.

      Actually, though, statistical significance tells you just about nothing if you’re estimating a small effect with a small sample. But this piece of information isn’t in typical statistics books (I don’t think it’s in any of mine either, actually!). So it’s hard to blame researchers for not knowing it. But I would like to spread the news to help out future researchers—including Beall and Tracy in their future work! (A small simulation at the end of this thread illustrates the point.)

        • CI: Looking very briefly, the comment on meta-analysis (“Meta-analyses integrate evidence from a number of studies”) seems very naïve – it presupposes successful replication, which Eric nicely raised in his comment.

          From http://en.wikipedia.org/wiki/Meta-analysis “a meta-analysis refers to methods focused on contrasting and combining results from different studies, in the hope of identifying patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies”

      • Interestingly, this race for significance, notwithstanding ridiculously small samples (even when the marginal cost of larger sampling is small, as in this case), is a uniquely academic phenomenon.

        In all the practical contexts where I’ve seen statistics being used in industry, no one dares to make a point with a tiny sample.

        In other words, intuitively most people seem to be very wary of small-sample inference (significance test be damned). Somehow, training (or wrong incentives) causes academics to shut off this intuitive distrust of small samples.
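
    To illustrate the point in Andrew’s reply above about significance with small effects and small samples, here is a minimal simulation of what Gelman elsewhere calls a Type M (magnitude) error. The true effect, standard error, and number of simulations below are invented for illustration and are not estimates from the study.

        import numpy as np

        # With a small true effect and a noisy estimate, the estimates that
        # happen to reach p < .05 are badly exaggerated on average.
        rng = np.random.default_rng(1)
        true_effect = 0.05       # hypothetical small true difference
        standard_error = 0.10    # hypothetical noise level of a small sample
        n_sims = 100_000

        estimates = rng.normal(true_effect, standard_error, size=n_sims)
        significant = np.abs(estimates) > 1.96 * standard_error

        print(f"share of simulations reaching p < .05: {significant.mean():.3f}")
        print(f"mean |estimate| among significant results: "
              f"{np.abs(estimates[significant]).mean():.2f} (true effect = {true_effect})")

    Conditional on crossing the significance threshold, the typical estimate is several times the true effect, which is why a “p less than .05” result from a small, noisy study says more about the noise than about the effect.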

  27. The journals (not just this one) are complicit in squeezing down word counts. The methods section is the first place to get the chop because, however much the statistician argues for it, the subject-expert co-authors all think it’s “dry”. With a full methods section (web supplement?) we would know exactly what they set out to do in terms of operationalizing the hypothesis test(s).

  28. Pingback: “Women Are More Likely to Wear Red or Pink at Peak Fertility”: multiplicity, subjectivity and other statistical wardrobe malfunctions | Robert Grant's stats blog

  29. Am I the only woman who is upset about being compared to an ‘ovulating chimpanzee’? This is another example showing that the problem with current social psychology is not just the methods but what is considered a worthwhile scientific question.

  30. Why the upset? After all, chimps are our closest living relatives, with whom we shared a last common ancestor 5-6 MYA. The comparative study of mating strategies has been a worthwhile scientific question since 1859. For some recent exemplary work, google “Sara Hrdy”.

  31. After reading this I am inclined to be slightly disappointed in the fashion sense of the women in this study: did they all wear t-shirts? Nobody had on a dress, or a tank top, or a vest, or some combination of these, or whatever? That’s kind of boring.

    • Cat:

      I brought this up in my Slate article and I still wonder what they did with respondents who were wearing dresses or sweaters. My guess is that the survey question asked only about shirts and the researchers took the respondents as is. But maybe there was some pre-screening, where women who were wearing dresses or sweaters were excluded from the study. I didn’t see any mention of this in the paper or the supplementary material, but maybe it’s somewhere I didn’t notice.

      • ‘But maybe there was some pre-screening, where women who were wearing dresses or sweaters were excluded from the study’

        Seems to me that those clothing items also have colors, and it would be strange to exclude them given the hypothesis. It also seems highly unlikely that all 124 women wore a t-shirt and nobody wore anything else. Options, then:

        1. Possibly they described the situation poorly (i.e., there WERE women who wore a dress or something else, but they had to be “concise” and just lumped every clothing option together under the term “shirt”).

        2. An additional conclusion of the study could have been something like ‘women who participated in this study all have a significantly similar taste in attire, more specifically wearing a t-shirt’ (p < .0001).

        • ‘Across two samples (total N =124), women at high-conception risk were over three times more likely to wear a red or pink shirt than women at low-conception risk, and 77% of women who wore red or pink were found to be at high, rather than low, risk.’

          “Shirt” it is indeed. Thanks for the correction.

          https://en.wikipedia.org/wiki/Shirt

          ‘A shirt is a cloth garment for the upper body. Originally an undergarment worn exclusively by men, it has become, in American English, a catch-all term for almost any garment other than outerwear such as sweaters, coats, jackets, or undergarments such as bras, vests or base layers.’

        • Ah, if I am understanding it correctly, “shirt” is an appropriate catch-all term for clothing worn on the upper body.

          Maybe that also includes dresses (the upper part of a dress?), or nobody wore a dress. Option 2 is out of the window then, and option 1 is probably not stated correctly. “Shirt” is probably used to describe upper-body clothing in general (and may possibly also include the upper part of dresses). That would be my guess.

  32. Two things that would help this discussion:

    1. Where are the raw data in this discussion? In connection with his CHANCE critique, Andrew had previously analyzed the data according to his understanding of the problem (IIRC the analysis appeared in Gelman and Hill 2007 as well). Shouldn’t one reanalyze these data to illustrate an ideal analysis?

    2. Why the focus on such a topic? The kind of abuse Andrew is addressing is happening in areas where people will actually die, e.g., due to policies implemented based on flawed analyses. One should take on work where it’s really going to matter.

    • Shravan:

      1. Sure, but in this case N is so small that I don’t think much can be done at all (even setting aside the measurement and sampling problems); see the quick sketch below.

      2. I agree that it’s best to work on important problems. But methods are methods, and we can get insight from focusing on any specific example with care. As we say in statistics, God is in every leaf of every tree.
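
      To give a sense of what I mean by “not much can be done,” here is a quick sketch of how uncertain a headline proportion such as the reported 77% is when it rests on only a couple dozen observations. The counts below (17 of 22 red-or-pink wearers at high risk) are hypothetical, since the abstract quoted above reports only the percentage:

        # A quick sketch of the uncertainty in a proportion like "77% of the women
        # who wore red or pink were at high conception risk" when it is based on
        # only a couple dozen observations. The counts (17 of 22) are hypothetical.
        import math

        def wilson_ci(successes, n, z=1.96):
            """Approximate 95% Wilson score interval for a binomial proportion."""
            phat = successes / n
            denom = 1 + z**2 / n
            center = (phat + z**2 / (2 * n)) / denom
            half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
            return center - half, center + half

        lo, hi = wilson_ci(17, 22)
        print(f"estimate: {17 / 22:.0%}, 95% interval: [{lo:.0%}, {hi:.0%}]")

      Under these made-up counts the interval stretches from under 60% to roughly 90%. The point is simply that a proportion estimated from about twenty women is extremely noisy, and that is before we even get to the measurement and sampling problems.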

    • The second point should not be raised in response to criticism of a study, but could perhaps be raised in response to publication of the study. The idea that it’s “fine” (in the complacent sense) to publish on a subject, but a waste of resources to criticize flaws in studies of that subject, is really weird to my mind. If someone publishes (in Slate) a critique of someone else’s essay (published in a scholarly journal) on a John Ashbery poem, in an attempt to raise questions about the state of literary theory, do we say, “This is just poetry! People are actually dying somewhere”?

      What we have to remember is that the study being discussed is providing support for a general psychological theory. That theory, if not perhaps the specific result we’re talking about here, will then (eventually) have an effect on social policy and/or psychotherapy.

      Moreover, as Andrew notes, this is a methodological critique. If left uncriticized, these methods will continue to set the standard, likely introducing ever more erroneous beliefs about how the mind works.

      We have to remember that psychology is a key discipline in shaping our sense of what the human mind is, how it works, what it does. It’s important to keep an eye on the basis on which this so-called knowledge is produced.

      • Hi Thomas. My own research also does not result in any deaths, so I should be careful about what I say ;). You wrote:

        “If someone publishes (in Slate) a critique of someone else’s essay (published in a scholarly journal) on a John Ashbery poem, in an attempt to raise questions about the state of literary theory, do we say, “This is just poetry! People are actually dying somewhere”?”

        Actually, that’s exactly what I’d say; and at least Philip Larkin would agree :). But I take your and Andrew’s point.

        • Yes, though Larkin, I presume, would say that about the original essay, not merely the critique in Slate. My view is that if there’s room for scholarship about poetry, there’s got to be room for concerned criticism of that scholarship. We can’t say that poetry is perfectly harmless and misreadings are therefore perfectly harmless too. (Interestingly, my own criticisms of the plagiarism of poetry in organization studies are sometimes dismissed with an “it’s only poetry” gesture.) I’m not even sure that poetry (i.e., bad poetry) is as harmless as all that, but that’s a longer discussion.

          In any case, I’m glad you see the main point. It would be a terrible situation if, in order to say anything critical of anyone’s research, you had to always indicate how many lives are at stake.

        • I just see really dangerous stuff coming out in medicine, and I am surprised nobody takes these irresponsible people on and destroys their non-story. For example, there has been a series of articles claiming that daily nocturnal dialysis has no discernible benefit over the minimal treatment of about four hours three times a week, or may even be harmful. There are many serious problems with the published analyses, but nobody took the authors on. Their null results even got reported in the popular press as a conclusive, positive finding, the takeaway being that there is no advantage to more dialysis. One consequence of such peer-reviewed research is policy decisions about how much dialysis to fund.

          It’s because of examples like this that I was wondering: the same arguments invoked against the women-wear-red study apply to these other studies, yet only one of them is getting a lot of attention here. The other guys got away with skewing the literature and messing up people’s lives.

          But of course your general point remains valid; as I mentioned, it is easy to argue that what I myself do for a living is utterly useless and pointless :). I still do the statistics and still try to work out the right methodology!

          Some of the articles:
          http://www.nature.com/ki/journal/v83/n2/full/ki2012329a.html
          http://jasn.asnjournals.org/content/early/2013/01/30/ASN.2012060595.abstract

  33. 1) In nature, red is also a sign of poison or danger (e.g., the red-backed poison frog, the black widow spider, monarch butterflies), and red road signs mean “stop” and “yield”, so an alternative explanation is that the women are signalling to men (and to themselves) the danger of having sex at that time. Doesn’t it seem more likely that women, especially college students, wouldn’t want to get pregnant?

    2) “These aren’t whizkid piano players, they are adult women. Period.”

    But wouldn’t most young women, especially college women, be on the pill and so have no fertility window? How many women did they have to throw away to get their 24 potentially fertile women? And how does that affect generalisability?

    3) Women in close contact can sync up their periods, and since the college students are all from the same school, and the school is in Canada, where the national colour is red, there is potentially a problem: students from the same dorm sync up and get dressed in red to support a national team, and that just happens to fall in their ovulation window. (And being in Canada is probably why wearing red occurs with such high probability in the first place.)

  34. Pingback: In praise of exploratory statistics | Dynamic Ecology

Comments are closed.