I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

Someone pointed me to this news article by Tim Schwab, “Brian Wansink: Data Masseur, Media Villain, Emblem of a Thornier Problem.” Schwab writes:

If you look into the archives of your favorite journalism outlet, there’s a good chance you’ll find stories about Cornell’s “Food Psychology and Consumer Behavior” lab, led by marketing researcher Brian Wansink. For years, his click-bait findings on food consumption have received sensational media attention . . .

In the last year, however, Wansink has gone from media darling to media villain. Some of the same news outlets that, for years, uncritically reported his research findings are now breathlessly reporting on Wansink’s sensational scientific misdeeds. . . .

So far, that’s an accurate description.

Wansink’s work was taken at face value by major media. Concerns about Brian Wansink’s claims and research methods had been known for years, but these concerns had been drowned out by the positive publicity—much of it coming directly from Wansink’s lab, which had its own publicity machine.

Then, a couple years ago, word got out that Wansink’s research wasn’t what it had been claimed to be. It started with some close looks at Wansink’s papers, which revealed lots of examples of iffy data manipulation: you couldn’t really believe what was written in the published papers, and it was not clear what had actually been done in the research. The story continued when outsiders Tim van der Zee, Jordan Anaya, and Nicholas Brown found over 150 errors in four of Wansink’s published papers, and Wansink followed up by acting as if there was no problem at all. After that, people found lots more inconsistencies in lots more of Wansink’s papers.

This all happened as of spring, 2017.

News moves slowly.

It took almost another year for all these problems to hit the news, via some investigative reporting by Stephanie Lee of Buzzfeed.

The investigative reporting was excellent, but really it shouldn’t’ve been needed. Errors had been found in dozens of Wansink’s papers, and he and his lab had demonstrated a consistent pattern of bobbing and weaving, not facing these problems but trying to drown them in happy talk.

So, again, Schwab’s summary above is accurate: Wansink was a big shot, loved by the news media, and then they finally caught on to what was happening, and he indeed “has gone from media darling to media villain.”

But then Schwab goes off the rails. It starts with a misunderstanding of what went wrong with Wansink’s research.

Here’s Schwab:

His misdeeds include self-plagiarism — publishing papers that contain passages he previously published — and very sloppy data reporting. His chief misdeed, however, concerns his apparent mining and massaging of data — essentially squeezing his studies until they showed results that were “statistically significant,” the almighty threshold for publication of scientific research.

No. As I wrote a couple weeks ago, I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

To speak of “apparent mining and massaging of data” is to understate the problem and to miss the point. Remember those 150 errors in those four papers, and how that was just the tip of the iceberg? The problem is not that data were “mined” or “massaged,” the problem is that the published articles are full of statements that are simply not true. In several of the cases, it’s not clear where the data are, or what the data ever were. There’s the study of elementary school children who were really preschoolers, the pizza data that don’t add up, the carrot data that don’t add up, the impossible age distribution of World War II veterans, the impossible distribution of comfort ratings, the suspicious distribution of last digits (see here for several of these examples).

Schwab continues:

And yet, not all scientists are sure his misdeeds are so unique. Some degree of data massaging is thought to be highly prevalent in science, and understandably so; it has long been tacitly encouraged by research institutions and academic journals.

No. Research institutions and academic journals do not, tacitly or otherwise, encourage people to report data that never happened. What is true is that research institutions and academic journals rarely check to see if data are reasonable or consistent. That’s why it is so helpful that van der Zee, Anaya, and Brown were able to run thousands of published papers through a computer program that uses statistical tools to check for certain obvious data errors, of which Wansink’s papers had many.
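To give a sense of what that kind of automated checking involves, here is a minimal sketch (in Python, with made-up numbers) of a GRIM-style granularity check of the sort mentioned in the comments below: when a paper reports the mean of integer-valued responses, that mean times the sample size must land on a whole number, so many reported means can be flagged as impossible without ever seeing the raw data.

```python
# Minimal sketch of a GRIM-style consistency check. If n people answer on an
# integer scale (Likert items, counts, etc.), the sum of their responses must be
# a whole number, so only certain means are possible. The function name and the
# example values below are illustrative, not taken from any actual paper.

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """True if reported_mean (at the given rounding) is reachable from n integer responses."""
    target = round(reported_mean, decimals)
    nearest_total = int(reported_mean * n)
    # The true total must be an integer near mean * n; check the two candidates.
    return any(round(total / n, decimals) == target
               for total in (nearest_total, nearest_total + 1))

# Hypothetical example: a reported mean of 3.44 from 17 integer responses is
# impossible (58/17 rounds to 3.41 and 59/17 rounds to 3.47), while 3.47 is fine.
print(grim_consistent(3.44, 17))  # False
print(grim_consistent(3.47, 17))  # True
```

Checks like this don’t prove misconduct on their own; they just flag numbers that cannot be right as reported, which is exactly the kind of scrutiny that journals rarely apply.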

Schwab writes:

I wonder if we’d all be a little less scandalized by Wansink’s story if we always approached science as something other than sacrosanct, if we subjected science to scrutiny at all times, not simply when prevailing opinion makes it fashionable.

That’s a good point. I think Schwab is going too easy on Wansink—I really do think it’s scandalous when a prominent researcher publishes dozens of papers that are purportedly empirical but are consistent with no possible data. But I agree with him that we should be subjecting science to scrutiny at all times.

P.S. In his article Schwab also mentions power-pose researcher Amy Cuddy. I won’t get into this except to say that I think he should also mention Dana Carney—she’s the person who actually led the power-pose study and she’s also the person who bravely subjected her own work to criticism—and Eva Ranehill, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, Roberto Weber, and Anna Dreber, who did the careful replication study that led to the current skeptical view of the original power pose claims. I think that one of the big problems with science journalism is that researchers who make splashy claims get tons of publicity, while researchers who are more careful don’t get mentioned.

I think Schwab’s right that the whole Wansink story is unfortunate: First he got too much positive publicity, now he’s getting too much negative publicity. The negative publicity is deserved—at almost any time during the past several years, Wansink could’ve defused much of this story by simply sharing his data and being open about his research methods, but instead he repeatedly attempted to paper over the cracks—but it personalizes the story of scientific misconduct in a way that can be a distraction from larger issues of scientists being sloppy at best and dishonest at worst with their data.

I don’t know the solution here. On one hand, Schwab and I are part of the problem—we’re both using the Wansink story to say that Wansink is a distraction from the larger issues. On the other hand, if we don’t write about Wansink, we’re ceding the ground to him, and people like him, who unscrupulously seek and obtain publicity for what is, ultimately, pseudoscience. It would’ve been better if some quiet criticisms had been enough to get Brian Wansink and his employers to clean up their act, but it didn’t work that way. Schwab questions Stephanie Lee’s journalistic efforts that led to smoking-gun-style emails—but it seems like that’s what it took to get the larger world to listen.

Let’s follow Schwab’s goal of “subjecting science to scrutiny at all times”—and let’s celebrate the work of van der Zee, Anaya, Brown, and others who apply that scrutiny. And if it turns out that a professor at a prestigious university, who’s received millions of dollars from government and industry, has received massive publicity for purportedly empirical results that are not consistent with any possible data, then, yes, that’s worth reporting.

18 thoughts on “I fear that many people are drawing the wrong lessons from the Wansink saga…”

  1. “If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.”

    Well, it will save *us* a lot of time, because in ~95% of cases you won’t have any bogus p < .05 results to publish. Hypothetically speaking (ahem), a willingness to invent results will overcome even preregistration, but I doubt that’s very common, so p-hacking is probably the bigger problem for science. The bigger problem for the scientist comes when he is willing to invent results but can’t even be bothered to make up the raw data that would generate the results.

    • David:

      Here’s what I wrote a couple weeks ago regarding that point:

      Forking paths and p-hacking do play a role in this story: forking paths (multiple potential analyses on a given experiment) allow researchers to find apparent “statistical significance” in the presence of junk theory and junk data, and p-hacking (selection on multiple analyses on the same dataset) allows ambitious researchers to do this more effectively. P-hacking is turbocharged forking paths.

      So, in the absence of forking paths and p-hacking, there’d be much more of an incentive to use stronger theories and better data. But I think the fundamental problem in work such as Wansink’s is that his noise is so much larger than his signal that essentially nothing can be learned from his data. And that’s a problem with lots and lots of studies in the human sciences.

      The forking paths and p-hacking are relevant only in the indirect way that they explain how the food behavior researchers (like the beauty-and-sex-ratio researchers, the ovulation-and-clothing researchers, the embodied-cognition researchers, the fat-arms-and-voting researchers, etc etc etc) managed to get apparent statistical success out of hopelessly noisy data.

      The indirect role of preregistration etc. is potentially important, but I don’t want people to think that if they just preregister (or, more generally, if they just act virtuously), research results will just stream in.
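      To see what forking paths do to pure noise in practice, here is a small simulation sketch (a made-up setup, not taken from any real study): the data contain no true effects, but an analyst who can choose among a few outcomes and a post-hoc subgroup after seeing the data will find a nominally “significant” comparison far more often than the advertised 5% of the time.

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)

      def one_noise_study(n_per_group=30, n_outcomes=5):
          """One fake study with no true effects, analyzed with a few 'forks'."""
          group = np.repeat([0, 1], n_per_group)               # treatment labels
          y = rng.normal(size=(2 * n_per_group, n_outcomes))   # all outcomes are pure noise
          covariate = rng.normal(size=2 * n_per_group)         # used for a post-hoc split
          pvals = []
          for j in range(n_outcomes):
              # Fork 1: compare groups on the full sample
              pvals.append(stats.ttest_ind(y[group == 1, j], y[group == 0, j]).pvalue)
              # Fork 2: "the effect only shows up for high-covariate subjects"
              sub = covariate > np.median(covariate)
              pvals.append(stats.ttest_ind(y[sub & (group == 1), j],
                                           y[sub & (group == 0), j]).pvalue)
          return min(pvals) < 0.05

      rate = np.mean([one_noise_study() for _ in range(2000)])
      print(f"Pure-noise studies with at least one p < .05 result: {rate:.2f}")
      # Roughly 0.3-0.4 with these settings, versus 0.05 for a single prespecified test.
      ```

      Preregistration takes away forks like the second one, which is why it blocks this particular route to “success,” but it does nothing to improve the signal-to-noise ratio of the underlying measurements.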

      • If “research results” don’t “just stream in,” what happens to the grant-funded labs, the research institutes they support, and the prestige they impart to the universities that house them? Researchers are under tremendous pressure to produce, and it is easier to claim p-hacking was “unintentional” than other forms of unethical behavior. Many of my former grad students (now current faculty) refuse to recognize mining data as at all problematic. So grind through the data until you get some p<.05 results and make everyone happy. What preregistration can do is afford others the opportunity to replicate. But there’s little payoff for this, so it’s all too infrequent.

        • The point of preregistration is not to make research results stream in (why would anyone think that?); it is to block the stream of bad research results. If those labs find that they rarely get “significant” results, they may wise up and start doing better experiments.

  2. I guess it was inevitable that journalists would start throwing out their own hot takes on this story, and this take is low hanging fruit–they see a researcher getting dragged for p-hacking, then see other researchers talking about how p-hacking is common, so why the focus on this one researcher?

    If you’re a detective and you have a few cases on your desk (a serial killer, an arsonist, and a robber), which one are you going to prioritize?

    I guess that example doesn’t really work here because none of us are actually professional science detectives. A better analogy is if you see someone jaywalking you probably aren’t going to call the police, but if you see larger crimes you probably are.

    Wansink’s blog post was basically admitting a serious crime. Yes, other scientists run their labs similarly, and in fact I was bothered by the blog post because it reminded me of previous labs I had worked in, but these other labs typically don’t announce what they are doing.

    Even with the blog post though, I had no intention of dropping my current projects to read food psychology articles. But I did take a peek at the pizza articles, and immediately noticed problems. Then we looked at more articles, and noticed similar problems. And during all this time Wansink either ignored us or downplayed the problems. So yes, we did contact some journalists, and when journalists contacted us we happily gave them all the information they could want.

    I do think this is an interesting story, so if more news organizations want to continue the investigation I don’t see any problem with that. I recently listened to Slow Burn, which is about Watergate. I didn’t realize how long that investigation took. Journalists who initially covered the story weren’t able to get any traction. You can only capture the attention of the public when there’s already a smoking gun, and while we may view Wansink’s blog post as the smoking gun, or the errors in his pizza papers as the smoking gun, or the data and text recycling as the smoking gun, or his misrepresentation of the ages of the children for a multi-million-dollar governmental program as the smoking gun, if the public wants to view Wansink’s emails as the smoking gun and the media can only get traction for their stories now, I don’t mind. Heck, there might be even more stuff they can find, so let them look.

  3. In my world this is the stage when the new narrative is created. I’m way out in the left tail of comprehending the things you good people discuss here, but it seems to me that your message should be “methods matter; and here’s why”. That way you win on ethos (as you, Andrew, and the rest can readily demonstrate your competence), pathos (you’re motivated by the greatest quest of all – to find and stare at the truth unblinking), and logos (deduction FTW, as the kids say). And thus you persuade. Making it personal in any way is just consenting to play in the mud pit by mud-pit rules – which is what you’re trying to crawl out of.

    • Thanatos:

      I agree. All this is due to the actions of individual people, but it’s not personal, it’s about the research. All this work could’ve been done by teams of completely anonymous people wearing masks, and the problems would all still be there.

    • Anon:

      Stephanie Lee of Buzzfeed has been on the story for a while.

      One difficulty has been that Wansink was refusing to take the criticism seriously, and Cornell University wasn’t doing anything either (or, at least, they weren’t doing anything that anyone outside of Cornell was hearing about).

      What happened was that some persistent people were contacting the journals that had published the questionable papers, and then, one by one, these journals were issuing corrections or retractions. Each of these was reported in Retraction Watch, and I think each of these corrections or retractions gave courage to other journal editors to correct or retract other papers from Wansink’s lab. The process snowballed, and then in the meantime Lee did some reporting, and at some point it was all considered to be news. I have no idea what the rule is for what counts as news.

      In addition, Wansink, for all his public relations skills, seems to have had no interest in a counter-offensive. There have been no public attacks on the people who found problems in Wansink’s work, no news articles presenting Wansink as a victim, etc.

      Ultimately, I don’t think there’s much conflict here, and I’m glad that the story is not being presented as an interpersonal conflict. There was sloppy research going on for a long time, it got discovered a while ago, and eventually it exceeded whatever threshold was necessary for the news media, and now maybe Cornell University, to take it seriously. There’s lots of sloppy research going on all over, with problems not quite as blatant as what was happening at the Cornell Food and Brand Lab, and that hasn’t reached the threshold where anyone feels the need to do anything about it.

  4. Let’s be fair: it doesn’t have to be an either/or issue. Wansink (and others) has issues with weak theory AND noisy data AND data falsification AND p-hacking. Maybe you’re right that the biggest problem is weak theory and noisy data, but there are plenty of researchers with different or overlapping subsets of those issues.

    • I’m discovering that it is very often the case that one kind of problem (e.g., GRIM inconsistencies when means have been “manually adjusted”) goes along with others (e.g., outrageous removal of “outliers”), even though there is no necessary relation between the two, and either technique on its own would often be sufficient to get the result you want. My provisional explanation is that someone who is not very competent at science probably isn’t very competent at cheating either. In some cases I suspect this may go back many years, with some of these people learning at quite a young age that they can compensate for their lack of competence by developing great presentation skills or becoming expert at misdirection.

      • I’m not sure I agree with “learning at quite a young age that they can compensate for their lack of competence by developing great presentation skills or becoming expert at misdirection.” I’m more inclined to believe that they see science more as a matter of persuasion and do not realize that real science doesn’t allow all types of persuasion, but just those within certain constraints — and that their inability to see this distinction (between scientifically “allowable” arguments and “any type of persuasion”) is what makes them not very competent at science.

  5. Right now, the push in applied economics (e.g., Finance, Accounting, and to a lesser extent, Management) is not to rely solely on statistical significance in assessing the validity of a study, but also to look for “large” economic significance or effect size. The issue I’m having with this is that the people I know in the field who *do* engage in p-hacking, data mining, etc. have already proved they have some ability to shirk the ethics of science in order to get published. Thus, this only makes it slightly harder for them to find a publishable result, and now, more than ever, the effect sizes far exceed what one would expect given the weak theory of many published studies (e.g., CEOs with above-median facial “masculinity” are 3-4 times more likely to engage in fraud relative to their feminine counterparts). Also, I’m becoming increasingly concerned that this new approach will end up incentivizing outright fraud, because how much of a moral jump is it to go from changing research-design decisions until you find a result that is almost certainly non-replicable to simply typing somewhat reasonable numbers into a spreadsheet with stars next to them?

    Love the blog, and keep up the good fight!
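    On the effect-size point, there is also a purely mechanical contributor that requires no cheating at all: in a low-power, noisy-data setting, the estimates that happen to clear a significance filter are systematically exaggerated. Here is a quick simulation sketch (made-up numbers, not from any real study):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    true_effect = 0.1   # small true group difference, in standard-deviation units
    n = 50              # per-group sample size -> low power for an effect this small
    kept = []           # estimates that survive the significance filter

    for _ in range(20000):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        estimate = treated.mean() - control.mean()
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            kept.append(estimate)

    print(f"power ~ {len(kept) / 20000:.2f}")
    print(f"true effect = {true_effect}, mean 'significant' estimate = {np.mean(kept):.2f}")
    # With these numbers the significant estimates average roughly four times the true
    # effect, and some even have the wrong sign -- the usual "winner's curse" pattern.
    ```

    So a literature filtered on statistical significance will tend to report effect sizes that look too large for the theory behind them, even before anyone massages anything.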

    • Oh, I didn’t say it in the original post, so a quick clarification: I was wondering if you could give your thoughts on this! I’m a PhD student right now and would love to hear your thoughts on people going from “gray-area” fraud to outright fraud.

        • I really like that take on Clarke’s law; I will definitely be using it in the future (with appropriate attribution, ofc). One unfortunate consequence of the fact that crappy research and fraud are indistinguishable is that it allows researchers who commit academic fraud and are caught to use the defense of negligence. For example, two accounting academics recently published a paper in a top journal with seemingly egregious errors (see: https://econjwatch.org/articles/will-the-real-specification-please-stand-up-a-comment-on-andrew-bird-and-stephen-karolyi), but it will be nearly impossible to prove scienter if it does exist. Unfortunately, it looks like the only solution (I believe it is one that you have alluded to in the past) is to punish crappy research and fraud the same; however, the cynic in me does not see this happening anytime in the near future.

          Thanks again for your response and for posting your thoughts on this blog!

  6. I’m coming in very late to this, but could you direct me to anything that will clearly explain the difference between data mining (good) and p-hacking (bad)? To me, it looks as if the researchers engaged in these two activities are performing the exact same analyses, and for the life of me I can’t see why it is wrong in one case and right in the other.
