A couple of days ago we again discussed the replication crisis in psychology: the problem that all sorts of ridiculous studies on topics such as political moderation and shades of gray, or power pose, or fat arms and political attitudes, or ovulation and vote preference, or ovulation and clothing, or beauty and sex ratios, or elderly-related words and walking speed, or subliminal smiley faces and attitudes toward immigration, or ESP in college students, or baseball players with K in their names being more likely to strike out, or brain scans and political orientation, or the Bible Code, are getting published in top journals and getting lots of publicity. Indeed, respected organizations such as the Association for Psychological Science and the British Psychological Society have promoted what I (and many others) would consider junk science.
I should emphasize that, if all that was wrong with these studies was that they were ridiculous, you could say that ridiculous is in the eye of the beholder and that sometimes ridiculous-seeming claims turn out to be true. But it’s not just that. The real problem is that the evidence people take as strong support of these theories—the evidence that their supporters take as so strong that they still hold on to these theories even after attempted replications fail and fail and fail—is that there are some comparisons with p-values less than .05. What is not well understood is that, in the presence of what Simmons, Nelson, and Simonsohn have called “researcher degrees of freedom” and what Loken and I call “the garden of forking paths,” such “statistically significant” p-values provide essentially zero information.
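To make the point about forking paths concrete, here is a minimal simulation sketch. It is a deliberately crude stand-in: real forking paths usually involve analysis choices made after seeing the data, not an explicit search over many tests, but the arithmetic is similar. The setup is an assumption for illustration only: there is no true effect anywhere, each hypothetical study has 20 people per group, and the researcher can report whichever of 10 independent comparisons looks best. The function names and all the numbers are made up for this example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_study(rng, n_per_group=20, n_comparisons=10):
    """Simulate one study with NO true effect.

    Returns the smallest p-value among the comparisons the researcher
    could have reported (hypothetical outcomes, subgroups, and so on).
    """
    # Standard error of a difference in means when each observation has sd 1.
    se = math.sqrt(2.0 / n_per_group)
    smallest = 1.0
    for _ in range(n_comparisons):
        treated_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        control_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        p = two_sided_p((treated_mean - control_mean) / se)
        smallest = min(smallest, p)
    return smallest


def share_significant(n_studies=2000, n_comparisons=10, seed=1):
    """Fraction of null studies that can report at least one p < .05."""
    rng = random.Random(seed)
    hits = sum(
        one_study(rng, n_comparisons=n_comparisons) < 0.05
        for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    print("one pre-specified comparison:", share_significant(n_comparisons=1))
    print("best of ten comparisons:     ", share_significant(n_comparisons=10))
```

With a single pre-specified comparison, about 5% of these null studies reach p < .05. With ten independent comparisons to choose from, the theoretical rate is 1 - 0.95^10, or about 40%. So seeing "p less than .05" in such a setting tells you very little about whether anything is going on.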
In addition to all of this, a group of researchers coordinated replications of a bunch of experiments published in psychology journals and reported their results in a paper by Nosek et al. that appeared last year, along with headlines which (correctly, in my opinion) declared a replication crisis in science.
My recent post on this topic was triggered by two recent papers, one by Gilbert et al. disputing the claims made from the replication studies, and a response by Nosek et al. defending their work. As I wrote, I pretty much agree with Nosek et al., and I pointed the reader to this more thorough discussion from Sanjay Srivastava.
What’s new today?
In the discussion of my post from the other day, a commenter pointed to a further reply by Gilbert et al., who continue to hold that the replication project “provides no grounds for drawing ‘pessimistic conclusions about reproducibility’ in psychological science.” I replied in comments, explaining why I think they still miss the point:
Gilbert et al. write, “Why does the fidelity of a replication study matter? Even a layperson understands that the less fidelity a replication study has, the less it can teach us about the original study.” Again, they’re placing the original study in a privileged position. There’s nothing special about the original study, relative to the replication. The original study came first, that’s all. What we should really care about is what is happening in the general population.
I think the time-reversal heuristic is a helpful tool here. The time order of the studies should not matter. Imagine the replication came first. Then what we have is a preregistered study that finds no evidence of any effect, followed by a non-preregistered study on a similar topic (as Nosek et al. correctly point out, there is no such thing as an exact replication anyway, as the populations and scenarios will always differ) that obtains p less than .05 in a garden-of-forking-paths setting. No, I don’t find this uncontrolled study to provide much evidence.
Gilbert et al. are starting with the published papers—despite everything we know about their problems—and treating the reported statistical significance in those papers as solid evidence. That’s their problem. The beauty of having replications is that we can apply the time-reversal heuristic.
On the specific point of representativeness, I do agree with you (and Gilbert et al.) that the results of the Nosek et al. study cannot be taken to represent a general rate of reproducibility in psychological science. As Gilbert et al. correctly point out, to estimate such a rate, one would first want to define a population that represents “psychological science” and then try to study a representative sample of such studies. Sampling might not be so hard, but nobody has even really defined a population here, so, yes, it’s not clear what population is being represented.
Regarding the issue of representativeness, I disagree with Gilbert et al.’s statement that “it is difficult to see what value their [Nosek et al.’s] findings have.” Revealing replication problems in a bunch of studies does seem to me to be valuable, especially given the attitudes of people like Cuddy, Bargh, etc., who refuse to give an inch when their own studies are not replicated.
Gilbert et al. are coming up with lots of arguments, and that’s fine. As Uri Simonsohn says in his thoughtful commentary, the replication project is important and it’s good to have open discussion and criticism. So, while I think Gilbert et al. are missing the big picture and I think it’s too bad they made some mistakes in their article, I think it’s good that they’re continuing to focus attention on the replication crisis.
But I see two problems in psychological science right now:
1. Lots of bad stuff is being published in top journals and being promoted in the media.
2. Even after all this, Gilbert et al., Bargh, Cuddy, etc., don’t want to face the problems in the scientific publication process.
1 and 2 work together. Part of my “pessimistic conclusions about reproducibility” comes from the fact that, when problems are revealed, it’s a rare researcher who will consider that their original published claim may be mistaken.
If the Barghs, Cuddys, etc. would recognize their problems with their past work, then item 1 above would not be such a problem. Sure, lots of low-quality research might still be published (I mean “low quality” not just retrospectively in the sense of not getting replicated, but prospectively in the sense of being too noisy to have a chance of revealing anything useful as John Carlin and I discussed in our recent paper), but there’d be churn: Researchers would openly publish work as exploratory speculation—no more null-hypothesis-significance-testing and taking statistical significance to represent truth—and they’d recognize when their research had reached a dead end.
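The idea of a study being “too noisy to have a chance of revealing anything useful” can be illustrated with a short simulation. This is only a sketch under assumed numbers, not a reproduction of any published analysis: suppose the true effect is 0.1 and the standard error of the estimate is 0.5 (both hypothetical), and ask what the statistically significant estimates look like. The function name and its defaults are invented for this example, and it assumes the true effect is nonzero.

```python
import math
import random


def noisy_study_summary(true_effect, std_error, n_sims=20000, seed=2):
    """Simulate many estimates of a small effect measured with a lot of noise.

    Returns (power, wrong_sign_rate, exaggeration_ratio), where the last two
    are computed only among estimates that reach p < .05 (two-sided).
    Assumes true_effect is not zero.
    """
    rng = random.Random(seed)
    threshold = 1.96 * std_error  # |estimate| above this is "significant"
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > threshold:
            significant.append(estimate)

    power = len(significant) / n_sims
    if not significant:
        return power, math.nan, math.nan

    # Share of significant estimates that point in the wrong direction.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Average size of a significant estimate relative to the true effect.
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    exaggeration = mean_abs / abs(true_effect)
    return power, wrong_sign, exaggeration


if __name__ == "__main__":
    power, wrong_sign, exaggeration = noisy_study_summary(0.1, 0.5)
    print("power:", power)
    print("wrong-sign rate among significant results:", wrong_sign)
    print("exaggeration ratio among significant results:", exaggeration)
```

Under these assumed numbers, a significant estimate must exceed 1.96 × 0.5 = 0.98 in absolute value, so every significant result overstates the true effect of 0.1 by at least a factor of about ten, and a substantial share of them have the wrong sign. That is the prospective sense in which such a design cannot reveal anything useful, whatever p-value it happens to produce.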
But, as long as these researchers will not admit their mistakes, as long as they continue to hold on to every “p less than .05” claim they’ve ever published—and as long as they’re encouraged in that view by Gilbert et al.’s move-along-no-problem-here attitude—then, yes, we have a serious problem.
Daryl Bem, the himmicanes guys, Marc Hauser, Kanazawa, . . . none of these guys ever, as far as I know, acknowledged that they might be wrong. It’s taken a lot of effort to explain to people why “statistical significance” doesn’t mean what they think it means, and why publication in a top journal is not a badge of quality. The reproducibility project of Nosek et al. provides another angle on this serious problem. As Gilbert et al. correctly point out, that project is itself imperfect: like any empirical study, it is only as good as its data, and it makes sense for interested parties to examine its replications one by one. And, as Gilbert et al. also correctly point out, the studies in this project are not a representative sample of any clearly defined population and so we should be careful in our interpretation of any replication percentages.
What continues to concern me is the toxic combination of items 1 and 2 above. You guys who feel that Nosek et al. are giving science a bad name: You should be more bothered than anybody about the behavior of researchers such as Bem, Bargh, etc., who refuse to let go of their published but fatally flawed and unreplicable results.
P.S. Someone pointed me to a new note by Gilbert et al. which takes no position on whether “some of the surprising results in psychology are theoretical nonsense, knife-edged, p-hacked, ungeneralizable, subject to publication bias, and otherwise unlikely to be replicable or true.” So I don’t know if they really believe that women are three times more likely to wear red during days 6-14 of their cycle? I wonder if they really believe that elderly-related words make people walk more slowly? Or that Cornell students have ESP? Or that obesity is contagious? Etc. I’m sure there will always be people who will believe some or even all of these things; they all got published in top peer-reviewed journals, after all, and some of them even appeared in the New York Times and in TED talks! And that’s fine, there should be a diversity of beliefs. I hope there are also people out there believing the opposite statements, which I think are also just about as well supported by the data: that women are less likely to wear red during days 6-14 of their cycle, that elderly-related words make people walk faster, that Cornell students have an anti-ESP which makes them consistently give bad forecasts (hey, that would explain those hot-hand findings!), that obesity is anti-contagious and when one of your friends gets fat, you go on a diet. All these things are possible. For now, I’d just like it if people would stop saying or acting as if the statistical evidence for these claims is “overwhelming” or that “you have no choice but to accept that the major conclusions of these studies are true.” By now it should be clear on statistical grounds alone that the evidence in favor of these various claims is much weaker than has been claimed, but I think the replication project of Nosek et al. has been valuable in showing this in another way.
P.P.S. Also this update from Nosek et al., going into some of the details on one of the replications that had been criticized by Gilbert et al. Very helpful to see this example.