One more rep.
The new thing you just have to read, if you’re following the recent back-and-forth on replication in psychology, is this post at Retraction Watch in which Nosek et al. respond to criticisms from Gilbert et al. regarding the famous replication project.
Gilbert et al. claimed that many of the replications in the replication project were not very good replications at all. Nosek et al. dispute that claim.
And, as I said, you’ll really want to read the details here. They’re fascinating, and they demonstrate how careful the replication team really was.
When reading all this debate, it could be natural as an outsider to want to wash your hands of the whole thing, to say that it’s all a “food fight,” why can’t scientists be more civil, etc. But . . . the topic is important. These people all care deeply about the methods and the substance of psychology research. It makes sense for them to argue and to get annoyed if they feel that important points are being missed. In that sense I have sympathy for all sides in this discussion, and I don’t begrudge anyone their emotions. It’s also good for observers such as Uri Simonsohn, Sanjay Srivastava, Dorothy Bishop, and myself to give our perspectives. Again, there are real issues at stake here, and there’s nothing wrong—nothing wrong at all—with people arguing about the details while at the same time being aware of the big picture.
Before sharing Nosek et al.’s amazing, amazing story, I’ll review where we are so far.
Background and overview
As most of you are aware (see here and here), there is a statistical crisis in science, most notably in social psychology research but also in other fields. For the past several years, top journals such as JPSP, Psych Science, and PPNAS have published lots of papers that make strong claims based on weak evidence. Standard statistical practice is to take your data and work with them until you get a p-value of less than .05. Run a few experiments like that, attach them to a vaguely plausible (or even, in many cases, implausible) theory, and you’ve got yourself a publication. Give it a bit more of a story and you might get yourself on TED, NPR, Gladwell, and so forth.
The claims in all those wacky papers have been disputed in three mutually supporting ways:
1. Statistical analysis shows how it is possible—indeed, easy—to get statistical significance in an uncontrolled study in which rules for data inclusion, data coding, and data analysis are determined after the data have been seen. Simmons, Nelson, and Simonsohn called this “researcher degrees of freedom” and Eric Loken and I called it “the garden of forking paths.” It’s sometimes called “fishing” or “p-hacking,” but I don’t like those terms, as they can be taken to imply that researchers are actively cheating. (The first simulation sketch after this list makes the point concrete.)
Researchers do cheat, but we don’t have to get into that here. If someone reports a wrong p-value that just happens to be below .05 when the correct calculation would give a result above .05, or claims that a p-value of .08 corresponds to a weak effect, or treats the difference between a significant and a non-significant result as if it were itself significant, I don’t really care whether it’s cheating or just a pattern of sloppy work. (The second sketch after this list works through that last error in numbers.)
2. People try to replicate these studies and the replications don’t show the expected results. Sometimes these failed replications are declared to be successes (as in John Bargh’s notorious quote, “There are already at least two successful replications of that particular study . . . Both articles found the effect but with moderation by a second factor” [actually a different factor in each experiment]), other times they are declared to be failures (as in Bargh’s denial of the relevance of another failed replication which, unlike the others, was preregistered). The silliest of all these was Daryl Bem counting as successful replications several non-preregistered studies that were performed before his original experiment (anything’s legal in ESP research, I guess), and the saddest, from my perspective, came from the ovulation-and-clothing researchers who replicated their own experiment, failed to find the effect they were looking for, and then declared victory because they found a statistically significant interaction with outdoor temperature. That last one saddened me because Eric Loken and I repeatedly advised them to rethink their paradigm, but they just fought fought fought and wouldn’t listen. Bargh, I guess, is beyond redemption, with so much of his career at stake, but I was really hoping those younger researchers would be able to break free of their statistical training. I feel so bad partly because this statistical significance stuff is how we all teach introductory statistics, so I, as a representative of the statistics profession, bear much of the blame for these researchers’ misconceptions.
Anyway, back to the main thread, which concerns the three reasons why it’s ok not to believe in power pose or so many of the other things you used to read about in Psychological Science.
Here’s the final reason:
3. In many cases there is prior knowledge or substantive theory that the purported large effects are highly implausible. This is most obvious in the case of that ESP study or when there are measurable implications in the real world, for example in that paper that claimed that single women were 20 percentage points more likely to support Obama for president during certain times of the month, or in areas of education research where there is “the familiar, discouraging pattern . . . small-scale experimental efforts staffed by highly motivated people show effects. When they are subject to well-designed large-scale replications, those promising signs attenuate and often evaporate altogether.”
Item 3 rarely stands on its own—researchers can come up with theoretical justifications for just about anything, and indeed research is typically motivated by some theory. Even if I and others might be skeptical of a theory such as embodied cognition or himmicanes, that skepticism is in the eye of the beholder, and even a prior history of null findings (as with ESP) is no guarantee of future failure: again, the researchers studying these things have new ideas all the time. Just cos it wasn’t possible to detect a phenomenon or solve a problem in the past, that doesn’t mean we can’t make progress: scientists do, after all, discover new planets in the sky, cures for certain cancers, cold fusion, etc.
So if my only goal here were to make an ironclad case against certain psychology studies, I might very well omit item 3, as it could distract from my more incontestable arguments. My goal here, though, is scientific, not rhetorical, and I do think that theory and prior information should and do inform our understanding of new claims. It’s certainly relevant that in none of these disputed cases is the theory strong enough on its own to hold up a claim. We’re disputing power pose and fat-arms-and-political-attitudes, not gravity, electromagnetism, or evolution.
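To make item 1 concrete, here is a minimal simulation sketch of researcher degrees of freedom. It is my own illustration, not taken from any of the papers under discussion, and the particular forking paths (a couple of subgroup splits and an outcome recoding), sample sizes, and number of simulations are all invented for the example:

```python
# Pure-noise data, but the analyst gets to choose among a few post-hoc
# analysis paths after seeing the data, reporting a result if ANY path
# gives p < .05. All specifics here are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 50
hits = 0

for _ in range(n_sims):
    treat = rng.normal(size=n_per_group)        # "treatment" outcomes: pure noise
    ctrl = rng.normal(size=n_per_group)         # "control" outcomes: pure noise
    sex = rng.integers(0, 2, size=n_per_group)  # arbitrary covariate for subgrouping

    pvals = [
        stats.ttest_ind(treat, ctrl).pvalue,                      # main comparison
        stats.ttest_ind(treat[sex == 0], ctrl[sex == 0]).pvalue,  # "effect in men only"
        stats.ttest_ind(treat[sex == 1], ctrl[sex == 1]).pvalue,  # "effect in women only"
        stats.ttest_ind(np.abs(treat), np.abs(ctrl)).pvalue,      # recoded outcome
    ]
    if min(pvals) < 0.05:  # report whichever path "worked"
        hits += 1

print(f"Rate of at least one p < .05 under the null: {hits / n_sims:.2f}")
# With just four forking paths this comes out well above the nominal 0.05,
# even though no single analysis was run more than once per dataset.
```

Note that nobody in this simulation is “cheating”: each individual test is computed correctly. The inflation comes entirely from the freedom to pick the analysis after seeing the data.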
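And here is the arithmetic behind the “difference between significant and non-significant” error mentioned in item 1, in the spirit of the Gelman and Stern (2006) example; the particular estimates and standard errors below are made up:

```python
# Two studies of the same effect: A is "significant," B is "not
# significant," yet A and B are not distinguishable from each other.
from scipy import stats

est_a, se_a = 25.0, 10.0  # study A: z = 2.5, p ~ .01 -- "significant!"
est_b, se_b = 10.0, 10.0  # study B: z = 1.0, p ~ .32 -- "no effect!"

p_a = 2 * stats.norm.sf(abs(est_a / se_a))
p_b = 2 * stats.norm.sf(abs(est_b / se_b))

# The comparison people implicitly make: A "worked," B "didn't."
# The comparison they should make: is A minus B distinguishable from zero?
diff = est_a - est_b
se_diff = (se_a**2 + se_b**2) ** 0.5  # assuming independent estimates
p_diff = 2 * stats.norm.sf(abs(diff / se_diff))

print(f"study A: p = {p_a:.3f}; study B: p = {p_b:.3f}")
print(f"A - B:   p = {p_diff:.2f}")  # about 0.29: no evidence the studies differ
```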
Putting the evidence together
For many of these disputed research claims, statistical reasoning (item 1 above) is enough for me to declare Not Convinced and move on, but empirical replication (item 2) is also helpful in convincing people. For example, Brian Nosek was convinced by his own 50 Shades of Gray experiment. There’s nothing like having something happen to you to really make it real. And theory and prior experience (item 3) tell us that we should at least consider the possibility that these claimed effects are spurious.
OK, so here we are. 2016. We know the score. A bunch of statistics papers on why “p less than .05” implies so much less than we used to think, a bunch of failed replications of famous papers, a bunch of re-evaluations of famous papers revealing problems with the analysis, researcher degrees of freedom up the wazoo, miscalculated p-values, and claimed replications which, when looked at carefully, did not replicate the original claims at all.
This is not to say that all or even most of the social psychology papers in Psychological Science are unreplicable. Just that many of them are, as (probabilistically) shown either directly via failed replications or statistically through a careful inspection of the evidence.
Given everything written above, I think it’s unremarkable to claim that Psychological Science, PPNAS, etc., have been publishing a lot of papers with fatal statistical weaknesses. It’s sometimes framed as a problem of multiple comparisons but I think the deeper problem is that people are studying highly variable and context-dependent effects with noisy research designs and often with treatments that seem almost deliberately designed to be ineffective (for example, burying key cues inside of a word game; see here for a quick description).
So, I was somewhat distressed to read this from a recent note by Gilbert et al., taking no position on whether “some of the surprising results in psychology are theoretical nonsense, knife-edged, p-hacked, ungeneralizable, subject to publication bias, and otherwise unlikely to be replicable or true” (see P.S. here).
I could see the virtue of taking an agnostic position on any one of these disputed public claims: Maybe women really are three times more likely to wear red during days 6-14 of their cycle. Maybe elderly-related words really do make people walk more slowly. Maybe Cornell students really do have ESP. Maybe obesity really is contagious. Maybe himmicanes really are less dangerous than hurricanes. Maybe power pose really does help you. Any one of these claims might well be true: even if you study something in such a noisy way that your data are close to useless, even if your p-values mean nothing at all, you could still have a solid underlying theory and have gotten lucky with your data. So it might seem like a safe position to keep an open mind on any of these claims.
But to take no position on whether some of these “surprising results” have problems? That’s agnosticism taken to a bit of an extreme.
If they do take this view, I hope they’ll also take no position on the following claims, which are supported just about as well by the available data: that women are less likely to wear red during days 6-14 of their cycle, that elderly-related words make people walk faster, that Cornell students have an anti-ESP which makes them consistently give bad forecasts (thus explaining that old hot-hand experiment), that obesity is anti-contagious, so that when one of your friends gets fat, you go on a diet, and so on.
Let’s keep an open mind about all these things. I, for one, am looking forward to the TED talks on the “coiled snake” pose and on the anti-contagion of obesity.
The new story
OK, now you should go here and read the story from Brian Nosek and Elizabeth Gilbert (no relation to the Daniel Gilbert of “Gilbert et al.” discussed above). They take one of the criticisms from Gilbert et al., which purported to show how unfaithful one of the replications was, and carefully and systematically describe the original study, the replication, and why the criticism was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric. As I said, follow the link and read the story. It’s stunning.
In a way it doesn’t really matter, but given headlines such as “Researchers overturn landmark study on the replicability of psychological science” (that was from Harvard’s press release; I was going to say I’d expect better from that institution, where I’ve studied and taught, but it’s not fair to blame the journalist who wrote the release; he was just doing his job), I’m glad Nosek and E. Gilbert went to the trouble to explain this to all of us.
P.S. I’m about as tired of writing about all this as you are of reading about it. But in this case I thought the overview (in particular, separating items 1, 2, and 3 above) would help. The statistical analysis and the empirical replication studies reinforce each other: the statistics explain how those gaudy p-values can be obtained even in the absence of any real and persistent effect, and the empirical replications are convincing to people who might not understand the statistics.
P.P.S. I just noticed that the Harvard press release featuring Gilbert et al. also says that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”
100%, huh? Maybe just to be on the safe side you should call it 99.9% so you don’t have to believe that the Cornell ESP study replicates.
What a joke. Surely you can’t be serious. Why didn’t you just say “Statistically indistinguishable from 200%”—that would sound even better!