1. The pizzagate story (of Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years”) keeps developing.
Last week someone forwarded me an email from the deputy dean of the Cornell business school regarding concerns about some of Wansink’s work. This person asked me to post the letter (which he assured me “was written with the full expectation that it would end up being shared”), but I wasn’t so interested in this institutional angle, so I passed it along to Retraction Watch, along with links to Wansink’s contrite note and a new post by Jordan Anaya listing some newly discovered errors in yet another paper by Wansink.
Since then, Retraction Watch ran an interview with Wansink, in which the world-renowned eating behavior expert continued with a mixture of contrition and evasion, along with insights into his workflow, for example this:
Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.
This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this? Second, who eats 12 pieces of pizza? I guess they must be really small pieces!
Wansink also pulls one out of the Bargh/Baumeister/Cuddy playbook:
Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results.
Ummm . . . I’ll believe it when I see the evidence. And not before.
In our struggle to understand Wansink’s mode of operation, I think we should start from the position that he’s not trying to cheat; rather, he just doesn’t know what he’s doing. Think of it this way: it’s possible that he doesn’t write the papers that get published, he doesn’t produce the tables with all the errors, he doesn’t analyze the data, maybe he doesn’t even collect the data. I have no idea who was out there passing out survey forms in the pizza restaurant—maybe some research assistants? He doesn’t design the survey forms—that’s how it is that he just realized that they asked that bizarre 0-to-12-pieces-of-pizza question. Also he’s completely out of the loop on statistics. When it comes to stats, this guy makes Satoshi Kanazawa look like Uri Simonsohn. That explains why his response to questions about p-hacking or harking was, “Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it.”
What Wansink has been doing for several years is organizing studies, making sure they get published, and doing massive publicity. For years and years and years, he’s been receiving almost nothing but positive feedback. (Yes, five years ago someone informed his lab of serious, embarrassing flaws in one of his papers, but apparently that inquiry was handled by one of his postdocs. So maybe the postdoc never informed Wansink of the problem, or maybe Wansink just thought this was a one-off in his lab, somebody else’s problem, and ignored it.)
When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes clearer. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.
Here’s an example of how this framing can help our understanding:
Someone emailed this question to me regarding that original “failed study” that got the whole ball rolling:
I’m still sort of surprised that they weren’t able to p-hack the original hypothesis, which was presumably some correlate with the price paid (either perceived quality, or amount eaten, or time spent eating, or # trips to the bathroom, or …).
I suspect the answer is that Wansink was not “p-hacking” or trying to game the system. My guess is that he’s legitimately using these studies to inform his thinking–that is, he forms many of his hypotheses and conclusions based on his data. So when he was expecting to see X, but he didn’t see X, he learned something! (Or thought he learned something; given the noise level in his experiments, it might be that his original hypothesis happened to be true, irony of ironies.) Sure, if he’d seen X at p=0.06, I expect he would’ve been able to find a way to get statistical significance, but when X didn’t show up at all, he saw it as a failed study. So, from Wansink’s point of view, the later work by the student really did have value in that they learned something new from their data.
I really don’t like the “p-hacking” frame because it “gamifies” the process in a way that I don’t think is always appropriate. I prefer the “forking paths” analogy: Wansink and his students went down one path that led nowhere, then they tried other paths.
2. People keep pointing me to a recent statement by Daniel Kahneman in a comment on a blog by Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, who wrote that the “priming research” of Bargh and others that was featured in Kahneman’s book “is a train wreck” and should not be considered “as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.” Here’s Kahneman:
I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited.
What the blog gets absolutely right is that I placed too much faith in underpowered studies. As pointed out in the blog, and earlier by Andrew Gelman, there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples. We also cited Overall (1969) for showing “that the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.” Our article was written in 1969 and published in 1971, but I failed to internalize its message.
My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.
I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.
I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.
I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.
Following up on Kahneman’s remarks, neuroscientist Jeff Bowers added:
There is another reason to be sceptical of many of the social priming studies. You [Kahneman] wrote:
I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions.
However, there is an important constraint on subliminal priming that needs to be taken into account. That is, they are very short lived, on the order of seconds. So any claims that a masked prime affects behavior for an extend period of time seems at odd with these more basic findings. Perhaps social priming is more powerful than basic cognitive findings, but it does raise questions. Here is a link to an old paper showing that masked *repetition* priming is short-lived. Presumably semantic effects will be even more transient.
And psychologist Hal Pashler followed up:
One might ask if this is something about repetition priming, but associative semantic priming is also fleeting. In our JEP:G paper failing to replicate money priming we noted:
For example, Becker, Moscovitch, Behrmann, and Joordens (1997) found that lexical decision priming effects disappeared if the prime and target were separated by more than 15 seconds, and similar findings were reported by Meyer, Schvaneveldt, and Ruddy (1972). In brief, classic priming effects are small and transient even if the prime and measure are strongly associated (e.g., NURSE-DOCTOR), whereas money priming effects are [purportedly] large and relatively long-lasting even when the prime and measure are seemingly unrelated (e.g., a sentence related to money and the desire to be alone).
Kahneman’s statement is stunning because it seems so difficult for people to admit their mistakes, and in this case he’s not just saying he got the specifics wrong, he’s pointing to a systematic error in his ways of thinking.
You don’t have to be Thomas S. Kuhn to know that you can learn more from failure than success, and that a key way forward is to push push push to understand anomalies. Not to sweep them under the rug but to face them head-on.
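To put rough numbers on Kahneman’s point that unanimity among underpowered studies is itself a warning sign, here is a minimal sketch. The power values and the count of ten studies are assumptions chosen purely for illustration, not estimates from the priming literature, and the calculation assumes the studies are independent and that the effect is real.

```python
# Illustrative only: the power values and the study count are assumptions,
# not estimates from any actual body of research.

def prob_all_significant(power: float, n_studies: int) -> float:
    """Chance that n independent studies, each with the given power,
    all reach statistical significance when the effect is real."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    return power ** n_studies


if __name__ == "__main__":
    for power in (0.3, 0.5, 0.8):
        p = prob_all_significant(power, n_studies=10)
        print(f"power={power:.1f}: P(all 10 significant) = {p:.6f}")
```

Under these assumptions, even a true effect studied with a respectable 80% power yields ten-for-ten only about one time in nine, and at 50% power it is about one in a thousand. So a published record with no failures at all is more plausibly explained by selection than by robustness.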
3. Now return to Wansink. He’s in a tough situation. His career is based on publicity, and now he has bad publicity. And there’s no easy solution for him, as once he starts to recognize problems with his research methods, the whole edifice collapses. Similarly for Baumeister, Bargh, Cuddy, etc. The cost of admitting error is so high that they’ll go to great lengths to avoid facing the problems in their research.
It’s easier for Kahneman to admit his errors because, yes, this does suggest that some of the ideas behind “heuristics and biases” or “behavioral economics” have been overextended (yes, I’m looking at you, claims of voting and political attitudes being swayed by shark attacks, college football, and subliminal smiley faces), but his core work with Tversky is not threatened. Similarly, I can make no-excuses corrections of my paper that was wrong because of our data coding error, and my other paper with the false theorem.
P.S. Hey! I just realized that the above examples illustrate two of Clarke’s three laws.