Jessica Tracy and Alec Beall, authors of that paper claiming that women at peak fertility were more likely to wear red or pink shirts (see further discussion here and here), and then a later paper claiming that this happens in some weather but not others, just informed me that they have posted a note in disagreement with a paper by Eric Loken and myself.
Our paper is unpublished, but I do have the megaphone of this blog, and Tracy and Beall do not, so I think it’s only fair to link to their note right away. I’ll quote from their note (but if you’re interested, please follow the link and read the whole thing) and then give some background and my own reaction.
Tracy and Beall write:
Although Gelman and Loken are using our work as an example of a broader problem that pervades the field–a problem we generally agree about–we are concerned that readers will take their speculations about our methods and analyses as factual claims about our scientific integrity. Furthermore, we are concerned that their paper will misrepresent aspects of our research, because Gelman previously wrote a blog post on our research, published in Slate, which contained a number of mischaracterizations [see the three links in the first paragraph above; you won’t be surprised to hear that I don’t think I mischaracterized Tracy and Beall’s work, but clearly there has been some failure of communication.—AG] . . . we are posting here new information that we have also directly provided to Gelman and Loken . . .
Following the publication of our paper . . . we conducted a new study seeking to replicate our findings. This study produced a null result, but led us to formulate new hypotheses about a potential moderator of our previously documented effect (see here for a detailed description of this failure to replicate and our subsequent hypotheses). We found preliminary support for these new hypotheses in re-analyses of our previously published data, and so moved on to conduct a new study (N = 209) to directly test our new theory. This study proved fruitful; a predicted interaction emerged in direct support of our hypotheses. All of these results can be found in “The impact of weather on women’s tendency to wear red or pink when at high risk for conception” . . . Of note, this paper and the Psych Science paper together report ALL data we have collected on this issue . . .
Regarding the robustness of our main effect, we have now run new analyses testing for this effect across all these collected samples—the two samples we originally reported in our Psych Science paper, and the two new samples that comprise the two new studies reported in the PLoS ONE paper. Together these comprise a sample of N = 779. Although we expected the main effect to be considerably weaker across these samples than it was in our initial studies, due to major variance in the moderator variable that we have now found to influence this effect, we nonetheless found consistent support for that main effect. . . .
They follow up with many details of their statistical analysis, and again I encourage readers to go to their note. I have linked to it and quoted from it here to give them the same level of exposure that I have when posting on this blog.
Now for some discussion, which I thought it best to post right away. As the saying goes, I apologize for the length of this post; I did not have the time to make it shorter.
You can have a multiple comparisons problem, even if you only performed a single analysis of your data
This all started a couple weeks ago when Tracy and Beall informed me that they’d come across a preprint of my recent (unpublished) paper with Eric Loken, The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. This paper of ours is no secret; it’s openly posted on my website and I’ve referred to it a few times on the blog.
In section 3 of our paper, Eric and I discuss problems of multiple comparisons that, as we see it, destroy the interpretation of the p-values in Beall and Tracy’s paper. We did not feel that this discussion was controversial—after all, we had already discussed these issues in various places. Rather, we used their example in our paper in part because Beall and Tracy had assured us that they did not do any selection of different hypotheses to test. As we wrote:
Even though Beall and Tracy did an analysis that was consistent with their general research hypothesis—and we take them at their word that they were not conducting a fishing expedition—many degrees of freedom remain in their specific decisions. . . . all the [analysis contingent on data] could well have occurred without it looking like “p-hacking” or “fishing.” It’s not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant. Rather, they start with a somewhat-formed idea in their mind of what comparison to perform, and they refine that idea in light of the data. . . . the data analysis would not feel like “fishing” because it would all seem so reasonable. Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the way. . . . In this garden of forking paths, whatever route you take seems predetermined, but that’s because the choices are done implicitly. The researchers are not trying multiple tests to see which has the best p-value; rather, they are using their scientific common sense to formulate their hypotheses in reasonable way, given the data they have. The mistake is in thinking that, if the particular path that was chosen yields statistical significance, that this is strong evidence in favor of the hypothesis.
The above quote from our paper is notable because, in their recent note, Tracy and Beall write:
Gelman and Loken’s central concern is that our analyses could have been done differently – including or excluding different subsets of women, or using a different window of high conception risk. They imply that we likely analyzed our results in all kinds of different ways before selecting the one analysis that confirmed our hypothesis.
No. We did not imply this. As we wrote:
We take them at their word that they were not conducting a fishing expedition . . . In each of these cases [of multiple comparisons], the data analysis would not feel like “fishing” because it would all seem so reasonable. Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses.
I guess we should work on making this clearer in the revision. In particular, in the sentence, “It’s not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant,” we could change “hundreds of” to “many.”
In any case, Eric and I do not imply that Tracy and Beall “likely analyzed [their] results in all kinds of different ways before selecting the one analysis that confirmed our hypothesis.” Rather, the whole point of our paper was the opposite: that even if, given their data, they only did a single analysis, they could’ve done other analyses had their data looked different. (And, indeed, in their second study, they got different data and they did different analyses.) Such behavior is not necessarily a bad thing—as a practicing statistician, my analyses are almost always contingent on the actual data that appear—but such data-contingent choices invalidate p-values.
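To make this concrete, here’s a little simulation sketch, not of Beall and Tracy’s actual study but of a stylized null world, in which each researcher runs exactly one test per dataset, with the choice of fertility window made after seeing the data. The decision rule and all the numbers here are invented for illustration; in the worst case, the single reported p-value behaves like the smaller of the two candidate analyses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n=200):
    # Null world: shirt color has nothing to do with cycle day.
    cycle_day = rng.integers(1, 29, size=n)   # day 1-28 of the cycle
    red = rng.binomial(1, 0.2, size=n)        # wore red/pink (pure noise)

    def pval(window):
        hi = np.isin(cycle_day, list(window))
        table = [[red[hi].sum(),  hi.sum()    - red[hi].sum()],
                 [red[~hi].sum(), (~hi).sum() - red[~hi].sum()]]
        _, p = stats.fisher_exact(table)
        return p

    p_a = pval(range(6, 15))    # the "days 6-14" path
    p_b = pval(range(10, 18))   # the "days 10-17" path
    # Each simulated researcher reports a single test, but which path
    # gets taken depends on the data they saw.
    return min(p_a, p_b)

pvals = np.array([one_study() for _ in range(2000)])
print("Rejection rate at nominal 0.05:", (pvals < 0.05).mean())
# Comes out above 0.05, even though no single dataset was "fished."
```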
To say this one more time, let me quote from the title of our paper:
Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time
Our whole point was that Tracy and Beall’s p-values should not be taken as strong evidence of their research hypotheses, even if, conditional on their data, they only did a single analysis, and indeed we are taking their word that they did no analyses that were unreported in their paper. I’m not sure how much clearer we can make this point, given that it’s in the title of our paper, but given the confusion here, we will try.
In their email, Tracy and Beall asked us to put them in contact with the editor of the journal who is handling our paper, and they told us that if we were to amend our paper to take their new analyses into account, they would share their raw data with us if we liked. We replied that we would bear their points in mind when making revisions of our article, and in particular we would try to make clear that the multiplicities we discuss represent potential analyses that could have been done had the data been different, which indeed is a separate question from their discussion of alternative analyses of the existing data. We also assured them that we would inform the editor of their concerns. But we told them that we did not feel that it makes sense in the editorial process to give any special role to the authors of papers that we discuss. It is common practice for papers in statistics and research methods to cite and discuss, often critically, the work of others, and in general the authors of work being discussed do not necessarily get any privileged role in the editorial process. We also said that we think the idea of public discussion of these examples is a good one, perhaps involving Petersen et al., Durante et al., Bem, and other authors of controversial studies that have been identified as having multiple-comparisons problems.
Finally, we recommended that, rather than sending their data to us, Tracy and Beall post their data on the web for all to see. I don’t think they have any obligation to do so, nor do I think there is even an expectation that this be done—my impression is that it is uncommon for psychology researchers to post the (anonymized) raw data from their published papers, and I certainly don’t think that Tracy and Beall have any additional requirement to do so, just because their paper is controversial. If they would like to post their data, I encourage them to do so; if not, it’s their call.
In any case, I think the question of whether we should invite Tracy, Beall, Bem, and others into our editorial process, or the question of how widely these researchers should share their data, is separate from the statistical and scientific question, What do their studies tell us? I’m bringing up these “process” issues because they arose in our email interactions with Tracy and Beall, but they are a separate matter from the questions of scientific inference which I would like to focus on here.
When is peak fertility?
Beall and Tracy define peak fertility as days 6-14 of the menstrual cycle. When I first saw this it seemed odd to me because, when we were trying to get pregnant a few years ago, my wife’s gynecologist had told us to focus on days 10-17, or so I recalled.
So I decided to look things up.
As you probably know, peak fertility varies. There are no sure things when it comes to the fertility calendar. However, we can get some basic guidelines from the usual public health authorities. For example, here’s what the U.S. Department of Health and Human Services has to say about Natural Family Planning. Under the Calendar Method, they recommend you compute the first day when you’re likely to be fertile as “Subtract 18 days from the total days of your shortest cycle,” and for the last day they say, “Subtract 11 days from the total days of your longest cycle.” Beall and Tracy assume a 28-day cycle (this is somewhat ambiguous because they also number the days as going from 0 to 28), so the rule gives 28 - 18 = 10 for the first fertile day and 28 - 11 = 17 for the last: days 10-17. Or, if you want to say that the shortest cycle is 27 days and the longest is 29, that will give you the range 9-18.
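For concreteness, here’s that Calendar Method arithmetic as a few lines of code (a minimal sketch of the HHS rule as I read it):

```python
def calendar_method(shortest_cycle, longest_cycle):
    """HHS Calendar Method: first and last likely-fertile day of the cycle."""
    first_fertile_day = shortest_cycle - 18
    last_fertile_day = longest_cycle - 11
    return first_fertile_day, last_fertile_day

print(calendar_method(28, 28))  # (10, 17): the days 10-17 window
print(calendar_method(27, 29))  # (9, 18)
```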
Or we could try babycenter.com. Their ovulation calculator says the most fertile days are 12-17. I’m pretty sure that other sources will give other dates. But I don’t think anyone out there is including days 6 and 7 in the peak times for fertility.
So where did Beall and Tracy’s fertility days come from? They write that the “specified window is based on prior published work” and they give tons of references, but if you try to track these down, there’s not much there. My best guess is that Beall and Tracy followed a 2000 paper by Penton-Voak and Perrett, which points to a 1996 paper by Regan, which points to the 14th day as the best estimate of ovulation. Regan claims that “the greatest amount of sexual desire was experienced” on Day 8. So my best guess (but it’s just a guess) is that Penton-Voak and Perrett misread Regan, and then Beall and Tracy just followed Penton-Voak and Perrett.
But it doesn’t really matter. We’re doing science here, not literary scholarship, and if we’re studying the effects of fertility, we should use actual fertility, or the closest we can measure. I’m going with the Department of Health and Human Services until you can point me to something better.
Beall and Tracy write, “other researchers have used a slightly different window,” but 10-17 is a lot different than 6-14! Nearly half of Beall and Tracy’s interval is outside the HHS interval, and on the other end they’re missing three of HHS’s peak-fertile days.
Why does this matter?
Why do I harp on the days of peak fertility? In some sense it’s not a huge deal; indeed, had Beall and Tracy used the more standard 10-17 day window, maybe the comparison reported in their paper would not have come up statistically significant, but I think it’s possible they would’ve seen something else interesting in their data (for example, a comparison between partnered and unpartnered women, or between young and old women) that was statistically significant, and then we’d still be having this discussion in a different form.
But I think the days-of-peak-fertility story is important because it indicates something about Tracy and Beall’s approach to research—and, I fear, a lot of other research in this field. The problem, as I see it, is that their work has lots of science-based theorizing (arguments involving evolution, fertility, etc.) and lots of p-values (statistical significance is what gets your work published), but not a strong connection between the two.
Consider the following statement of Tracy and Beall:
It doesn’t particularly matter which window researchers use, as long as they make an a-priori decision about which to use and then run analyses for that window only.
This represents what one might call a “legalistic” interpretation of the p-value. It doesn’t matter what we are measuring, as long as we get statistical significance. Now, as I’ve pointed out in various places, I don’t think this legalistic reasoning holds here, because there’s no evidence that the authors made an a priori decision to analyze the data the way they did. In none of their papers is there preregistration of the decisions for data inclusion, data coding, or data analysis. As Eric and I emphasized in our paper, even if a research team only does a single analysis of their data, this does not at all imply they would’ve done that same analysis had the data been different. Beall and Tracy may have made an a priori decision to use this window, but they would’ve been free to change that decision had their data been different. That sort of thing is why people preregister.
But here I want to take a different tack, and just note the absurdity (to me) of Tracy and Beall saying that it doesn’t matter if they get the dates right! Just think about this for a moment. Here are the authors of two published papers on fertility, and they didn’t even bother to talk with a gynecologist or look things up on the internet. [Correction: it is possible they talked with an expert or looked things up on the internet, or both. I have no idea who Tracy and Beall talked with or what they checked. It might be that the expert told them that the days of peak fertility were days 10-17 or that they looked at the U.S. Department of Health and Human Services website, but they decided to go with days 6-14 anyway. Or it’s possible they talked with a gynecological expert who told them that days 6-14 were peak fertility or that they happened to encounter a website that gave those dates.] I’m not an expert in the field, but I just happen to be a middle-aged guy with a middle-aged wife, so I noticed something funny.
Don’t Beall and Tracy care [just to be clear: I have no doubt that they care about the degree of truth of their research hypotheses; the thing I’m asking is if they care that they might have gotten the dates of peak fertility wrong]? I agree with the commenter who wrote:
The problem is simple, the researchers are disproving always false null hypotheses and taking this disproof as near proof that their theory is correct.
This isn’t how science should go. Look at it this way: suppose that the Beall and Tracy paper had no multiple comparisons issues, suppose they’d pre-registered their analysis, suppose even that they’d managed to replicate their results without having to add a new interaction mid-stream. Even so, under that best of all possible worlds (a world which did not exist), they wouldn’t have a finding about peak fertility, they’d have a finding about days 6-14. That should be interesting, no? Upon learning they got the dates of peak fertility wrong, Tracy and Beall’s response should not be: Hey, it doesn’t matter. It should be: Hey, this is a big deal! Our experiment did not tell us what we thought it told us. But they didn’t react this way. Why? I don’t think Tracy and Beall are bad people or that they are unconscientious scientists. I have every reason to believe they are doing their best. But I think they’re missing the whole point of statistical measurement, if they think “it doesn’t particularly matter” what they are measuring, as long as they decided to measure it ahead of time. It does matter, if you’re ever planning to link your statistical findings to your scientific hypotheses.
And this all affects Beall and Tracy’s statement. In particular, they report in their recent note that repeating their analysis using the days 7-14 window (recall that they originally used days 6-14) yields an odds ratio of 1.76 and a p-value of 0.046. Their original paper reported odds ratios of 3.85 and 8.67, so I guess a lot must have been happening on day 6 of the menstrual cycle. [Correction: Tracy just emailed me and pointed out that, in addition to shifting their window by one day, their new analysis includes additional data, so that’s why the estimate shifted so much.]
In any case, the point is that if you want to study peak fertility, you should study peak fertility, which, according to the most authoritative source I can find, goes from days 10-17. Shifting from days 6-14 to days 7-14 isn’t enough. It is striking that their estimate changes so much from a 1-day shift [correction: not so striking given that new data were pooled in], but perhaps this is not such a surprise given the small sample size and high variation.
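To get a feel for how jumpy an odds ratio can be at these sample sizes, here’s a toy calculation with made-up counts (emphatically not Beall and Tracy’s data): moving a handful of red-wearing women across the window boundary is enough to move the estimate a lot.

```python
def odds_ratio(a, b, c, d):
    # 2x2 table: [high-risk: a wore red, b did not]
    #            [low-risk:  c wore red, d did not]
    return (a * d) / (b * c)

# Invented counts for a sample of 100 women, window A:
print(odds_ratio(10, 15, 8, 67))   # about 5.6

# Shift the window boundary so 4 women (3 of them in red) move
# from the high-risk group to the low-risk group:
print(odds_ratio(7, 14, 11, 68))   # about 3.1
```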
Again, let me emphasize that this particular analysis is not central to our point; Eric and I in our article list many dimensions of multiple comparisons, and I’d like again to point all of you to the excellent 50 Shades of Gray paper by Nosek, Spies, and Motyl, where they demonstrate the garden of forking paths in action. I’m going into detail about this peak fertility thing because I want to emphasize my statistician’s perspective that, if you care about the science, you should care about the measurement.
I don’t criticize Beall and Tracy for getting the dates of peak fertility wrong—it’s natural to trust the literature in your subfield, and we all make mistakes. (Indeed, I had to retract the entire empirical section of a paper after I learned that I’d miscoded some survey responses.) But I am bothered that, all these months after I pointed out the dates-of-peak-fertility thing, they haven’t even checked. They should check: After all, maybe I’m wrong! I’ve been wrong before. But if I’m not wrong, and the U.S. Department of Health and Human Services really does say the dates of peak fertility are days 10-17, then it’s time for Beall and Tracy to take a deep think. After all, the phrase “peak fertility” is in the title of their paper.
It sort of bothers me to keep writing this, not out of fear of redundancy—we’re already long past that point—but because indeed I might be missing something obvious and come out looking like a fool. But fair’s fair: if I really am being foolish, then you might as well get the opportunity to find out. If my only interest were to make an airtight case against the Tracy and Beall claims, I might not even bother with the peak fertility thing, as we have enough other strong points. But I do think it’s worth mentioning, even at some risk of embarrassment, because to me this issue is symptomatic of a big problem in this sort of research, which is that people often aren’t measuring what they say they’re measuring, and they don’t even seem so bothered when people point this out.
What should Beall and Tracy do? I don’t know that they’ll follow my recommendations, as they probably feel that I’m picking on them. I guess I’d feel that way if the situation were reversed, and all I can say to them is that Eric and I are using their work as one of several case studies in an article that’s all about how statistics can be misused, even by researchers who are trying their best and are not “cheating.” To put it in Clintonesque terms,
I feel your pain:
We’ll think of the faith of our advisors that was instilled in us here in psychology, the idea that if you work hard and play by the rules, you’ll be rewarded with a good life for yourself and a better chance for your research hypotheses.
I think Beall and Tracy do work hard and play by the rules, and unfortunately there’s the expectation that if you start with a scientific hypothesis and do a randomized experiment, there should be a high probability of learning an enduring truth. And if the subject area is exciting, there should consequently be a high probability of publication in a top journal, along with the career rewards that come with this. I’m not morally outraged by this: it seems fair enough that if you do good work, you get recognition. I certainly don’t complain if, after publishing some influential papers, I get grant funding and a raise in my salary, and so when I say that researchers expect some level of career success and recognition, I don’t mean this in a negative sense at all.
I do think, though, that this attitude is mistaken from a statistical perspective. If you study small effects and use noisy measurements, anything you happen to see is likely to be noise, as is explained in this now-classic article by Katherine Button et al. On statistical grounds, you can, and should, expect lots of strikeouts for every home run—call it the Dave Kingman model of science—or maybe no home runs at all. But the training implies otherwise, and people are just expecting the success rate you might see if Rod Carew were to get up to bat in your kid’s Little League game. (Sorry for the dated references; remember, I said I’m middle-aged.)
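Here’s what that looks like in a quick simulation (all the numbers invented for illustration): a small true effect measured noisily reaches significance only rarely, and when it does, the estimate wildly exaggerates the truth and sometimes has the wrong sign.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 0.1, 0.5        # small effect, noisy measurement

est = rng.normal(true_effect, se, size=100_000)  # sampling distribution
signif = np.abs(est) > 1.96 * se                 # "statistically significant"

print("Power:", signif.mean())                   # about 0.055
print("Mean |estimate| when significant:",
      np.abs(est[signif]).mean())                # about 10x the true effect
print("Significant results with the wrong sign:",
      (est[signif] < 0).mean())                  # roughly a quarter
```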
To put it another way, the answer to the question, “I mean, what exact buttons do I have to hit?” is that there is no such button. But I suspect that Tracy and Beall have been trained (implicitly, presumably not explicitly) to expect an unrealistically high rate of success. They really seem to believe that they can discover enduring truths through short questionnaires of Mechanical Turk participants, and once they think they’ve had such success, they understandably are not happy when anyone tries to take this away from them.
Tracy and Beall conclude the main body of their recent note with:
Indeed, we hope other researchers will join us in seeking a more precise estimate of the main effect and of the weather moderator, and in discovering other variables that no doubt also moderate this effect.
But I’m sorry to say that all the evidence I’ve seen suggests that they are chasing noise. Their results and their papers look just like what I’d expect them to look like, if they were studying effects so small as to be undetectable with the tools they are using.
It makes me sad when people chase noise, so I concluded my email message to Tracy and Beall with some advice:
We wish you the best of luck with your research and we encourage you to consider within-subject designs with repeated measurements, which should give you more of a chance to discover stable patterns in your experimental data.
And I meant it. Actually, I think Eric means it even more than I do, as he’s been the one who keeps banging on about the importance of within-person studies for estimating within-person effects. Yes, if you’re careful you can estimate within-person effects with between-person studies, and yes, I know about the concerns with poisoning the well, but in a case like this, where you’re interested in how individual behavior is changing over time, and there is so much individual-level variation (including, in this case, factors such as what clothing a woman has in her closet), I really think that a within-subject design is the only way to go. Such a study takes more effort—but, again, if we accept that important breakthroughs can require lots of work, that’s fine. A Ph.D. thesis can take 2 years or more, right?
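To sketch Eric’s point in code (a stylized simulation, with every number invented): when stable person-to-person differences dwarf the effect of interest, measuring each person in both conditions cancels those differences and cuts the standard error way down.

```python
import numpy as np

rng = np.random.default_rng(2)
n, effect, sd_person, sd_noise = 50, 0.2, 1.0, 0.5

def simulate():
    # Between-subject: different people in each condition.
    a = effect + rng.normal(0, sd_person, n) + rng.normal(0, sd_noise, n)
    b = rng.normal(0, sd_person, n) + rng.normal(0, sd_noise, n)
    # Within-subject: the same people measured in both conditions,
    # so the stable person effect cancels in each person's difference.
    person = rng.normal(0, sd_person, n)
    diff = (effect + person + rng.normal(0, sd_noise, n)) \
         - (person + rng.normal(0, sd_noise, n))
    return a.mean() - b.mean(), diff.mean()

sims = np.array([simulate() for _ in range(5000)])
print("sd of between-subject estimate:", sims[:, 0].std().round(3))  # ~0.22
print("sd of within-subject estimate: ", sims[:, 1].std().round(3))  # ~0.10
```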
Preregistered replication could help too but that’s not really the most important thing. I think any researchers in this area really need to focus first on studying large effects (where possible) and getting good measurements. There’s no point in trying to get ironclad proof that you’ve discovered something, until you first set things up so that you’re in the position to make real discoveries.
I don’t particularly care about fertility and choices of clothing, and I do feel that these researchers are shooting in the dark. I just spent three hours writing this (instead of working on Stan; sorry, everyone!) because (a) I do feel this example illustrates some important and general statistical points, and (b) Tracy and Beall are saying that Loken and I were wrong, so of course I want to clarify our points. Beall, Tracy, Loken, and anyone else can further clarify or dispute in the comments.
Or of course Eric and I could’ve lain low and just waited till our paper appeared, but as I wrote in the very first paragraph above, that didn’t seem fair. I wanted to link to Tracy and Beall’s note right away, so that they can get as large an audience for their statements as we have for ours. I can’t promise this treatment for everyone whose work I criticize, but when people go to the trouble of writing something and alerting me to it, I like to give them the chance to air their views. I’ve done this several times before and I’m sure I will be doing it many times on this blog in the future.
Meanwhile, if you’re interested in scientific misunderstandings but you don’t want to read more about fertility and clothing choice, I recommend you go to Dan Kahan’s cultural cognition blog.
P.S. I have more to say about the role of statistics in scientific practice and the role of statistics in science criticism (the so-called “replication police”), but that really deserves its own post so I’ll stop here.
P.P.S. Just one more time: I have no desire to hurt Tracy and Beall in any way. In all sincerity, I applaud their openness, both in contacting me and in posting their views. For the reasons described in (exhausting and repetitive) detail above, I think they’re basically wrong, but I also think I can see where they are coming from. They followed standard practice and achieved the great success of publication in a top journal. When their work was criticized, they chose to defend rather than reflect, but, again, that’s a perfectly normal choice. And at this point it’s hard for them to go back. Especially given the traditional norms of statistical work in their subfield, it seems most natural for them to continue to think they did things right. Conditional on all that, Beall and Tracy seem to me to be behaving in a cordial, professional, and scholarly manner. (Not that I have any special status to judge this; I’m just stating my impression here.) Even their decision not to share their data seems reasonable enough from their perspective: Researchers don’t usually post raw data, and in this case they could well feel that Loken and I just want their data to do a hatchet job on them. That’s not the case—there were just a couple of things that Eric wanted to look into—but I can see how Tracy and Beall could think that, conditional on their continuing to believe in the correctness of the conclusions they drew from their published analyses. I don’t think it’s too late for them to sit, think, and see the problems—maybe a couple of preregistered replications on their part could help out, although I don’t think that should be truly necessary—but in any case I appreciate the cordial spirit that these discussions have taken.
P.P.P.S. In response to an email Tracy just sent me, I’ve added several clarifications in [brackets] above. Let me also emphasize that I had no intention of being offensive or defamatory of either Beall or Tracy in any way, nor did I intend in any way to question their integrity. I think they are following common research practices that are, in my view, mistaken, but this is not a comment on their integrity in any way.
In particular, I regret implying that Beall and Tracy “don’t care.” Their take on peak fertility is different from mine, but I can see how it makes sense for them to follow the literature in their subfield, even if from my perspective these are not the most authoritative sources on the topic. “Don’t care” isn’t a fair summary of “disagree on what is the most authoritative source on peak fertility.”