
Hark, hark! the p-value at heaven’s gate sings


Three different people pointed me to this post, in which food researcher and business school professor Brian Wansink advises Ph.D. students to “never say no”: When a research idea comes up, check it out, put some time into it and you might get some success.

I like that advice and I agree with it. Or, at least, this approach worked for me when I was a student and it continues to work for me now, and my favorite students are those who follow this approach. That said, there could be some selection bias here, in that the students who say Yes to new projects are the ones who are more likely to be able to make use of such opportunities. Maybe the students who say No would just end up getting distracted and making no progress, were they to follow this advice. I’m not sure. As an advisor myself, I recommend saying Yes to everything, but in part I’m using this advice to take advantage of the selection process, in that students who don’t like this advice might decide not to work with me.

Wansink’s post is dated 21 Nov but it’s only today, 15 Dec, that three people told me about it, so it must just have hit social media in some way.

The controversial and share-worthy aspect of the post is not the advice for students to be open to new research projects, but rather some of the specifics. Here’s Wansink:

A PhD student from a Turkish university called to interview to be a visiting scholar for 6 months. . . . When she arrived, I gave her a data set of a self-funded, failed study which had null results (it was a one month study in an all-you-can-eat Italian restaurant buffet where we had charged some people ½ as much as others). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). . . .

Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. . . .

There seems to be some selection bias here, as Wansink shares four papers from this study—these must be the results of Plans B, C, D, and E, or something like that—but we never hear about failed Plan A.

That’s important, right? Sure, the published results might be fine, but when Plan A fails—and, remember, this was an idea from a “world-renowned eating behavior expert”—that’s news, no? I’d think there’d be room for just one more paper, and at least one more press release and media appearance, about the idea that didn’t work.

OK, so what was happening at that all-you-can-eat buffet?

I googled to one of the listed articles, “Lower Buffet Prices Lead to Less Taste Satisfaction,” by David Just, Özge Sığırcı, and Brian Wansink. From the abstract:

Diners at an AYCE restaurant were either charged $4 or $8 for an Italian lunch buffet. Their taste evaluation of each piece of pizza consumed was taken along with other measures of behavior and self-perceptions. . . . Diners who paid $4 for their buffet rated their initial piece of pizza as less tasty, less satisfactory and less enjoyable. A downward trend was exhibited for each of these measures with each additional piece (P = 0.02). Those who paid $8 did not experience the same decrement in taste, satisfaction and enjoyment.

This should not be confused with the paper, “Peak-end pizza: prices delay evaluations of quality,” which reports:

For the diners who paid $4 for their buffet, overall taste, satisfaction and enjoyment evaluation depend on the taste of the last piece of the pizza and the peak taste consistent with prior findings. For those paying $8 for the buffet, the first piece of pizza is more important in predicting the overall taste, satisfaction and enjoyment ratings.

Or the unforgettable “Low prices and high regret: how pricing influences regret at all-you-can-eat buffets,” from which we learn:

139 total individuals who came to the restaurant alone (n = 8), in groups of two (n = 52) and in groups of three or four (n = 43) and five and over (n = 30) are participated to the study. Out of participants who ate at least one piece of pizza and were included to our analysis (n = 95), 49 of them were male and 46 of them were female, the mean age was 44.11, the mean height was 67.58 in., and the mean weight was 181.61 lb. The results were analyzed using a 2×3 between groups ANOVA. Diners who paid $4 for their buffet rated themselves as physically more uncomfortable and had eaten more than they should have compared to the diners who paid $8 for the buffet (p < 0.05). However, diners who paid $4 for their buffet gave higher ratings to overeating, feelings of guilt and physical discomfort than the diners who paid $8 for the buffet, even if they ate the exact same number of pieces.

That “n = 95” looked odd to me, because the first paper above reported:

Of the 139 participants (72 groups), 6 people who were younger than 18 years old were eliminated. Eleven other participants did not complete the relevant questions on the survey. Thus, usable and complete data were collected from 122 people.

So I don’t know why this other study only included 95 people. Maybe that’s what it took to get p less than 0.05. In any case, it’s good to know that “the mean height was 67.58 in.”

I don’t know how many people were included in the analysis for “Low prices and high regret: how pricing influences regret at all-you-can-eat buffets.” I’m sure this information is in the published article, but it’s paywalled:

[Screenshot of the paywall notice for the article]

I couldn’t bring myself to pay $16 just for the privilege of reading this paper. I guess if they’d charged $32 I’d value it more, ha ha ha.

I googled the title to see if I could find an un-paywalled preprint, but I found no full paper: all that turned up was the abstract and about a zillion press releases, including a twitter post by Brian Wansink. I followed the link, and this guy tweets every day. He has 1687 tweets! That’s fine, I’m hardly one to criticize given that I’ve published 7000 posts on this blog, but it is kinda funny coming from someone who wrote, “Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday.” I have a feeling that a lot more people will read Wansink’s blog post than will ever read “Peak-end Pizza: Prices Delay Evaluations of Quality” all the way to the end.

But wait, there’s one more! I checked out “Eating Heavily: Men Eat More in the Company of Women,” which was also based on that pizza experiment—Bruno Frey would be proud!—and says:

One hundred and thirty three adults (74 males and 59 females) were recruited to participate in a study of eating at an Italian restaurant in Northeastern USA where customers paid a fixed price for “all you can eat” pizza, salad, and side dishes. Our analyses are based on a sample of 105 respondents because we discarded responses from eight recruits who were eating alone, and 20 recruits provided incomplete survey responses.

OK, the 133 adults are the 139 participants minus the 6 kids. So far, so good. And I can see that for the purposes of this study they removed the solo eaters, although this concerns me a bit—they compared same-sex to mixed-sex groups, but they also could’ve thrown the singles into the comparison groups, and also they studied both sexes so it’s kind of iffy that this article is only about men. All these papers are full of the “difference between significant and non-significant” thing. But then they also excluded 20 people who “provided incomplete survey responses.” The last time they did this, they only excluded 11 people! I guess it depends on which questions they study, what gets excluded. But then this raises some concerns about all the “digging through the data.”
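The sample-size bookkeeping is easier to follow when it's laid out in one place. Here's a minimal tally in Python, using only the counts quoted from the papers above; the variable names are mine, and the last line just reports how many of the 139 are not itemized in the quoted “n = 95” passage, not an explanation of why they were dropped.

```python
# Tally of the sample sizes quoted from the papers above.
# Variable names are illustrative; the counts come from the quoted passages.
total_participants = 139
under_18 = 6

adults = total_participants - under_18  # the "133 adults" in "Eating Heavily"

# "Lower Buffet Prices Lead to Less Taste Satisfaction"
taste_incomplete = 11
taste_n = adults - taste_incomplete

# "Eating Heavily: Men Eat More in the Company of Women"
solo_diners = 8
eating_incomplete = 20
eating_n = adults - solo_diners - eating_incomplete

# "Low prices and high regret" reports n = 95 without itemizing the exclusions
regret_n = 95
regret_not_itemized = total_participants - regret_n

print(adults, taste_n, eating_n, regret_not_itemized)  # 133 122 105 44
```

Same 139 diners, three different analysis samples: 122, 105, and 95.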

Here’s how Wansink concludes his post:

Facebook, Twitter, Game of Thrones, Starbucks, spinning class . . . time management is tough when there’s so many other shiny alternatives that are more inviting than writing the background section or doing the analyses for a paper.

Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday. In the meantime, this Turkish woman’s resume will always have the five papers below.

I have two objections to this attitude.

First, enjoyment is a worthy goal in itself, no? In all seriousness, I think that watching a season’s worth of episodes of Game of Thrones is more valuable than writing a paper such as “Eating Heavily: Men Eat More in the Company of Women.” After all, I read that psychologists have found that it is experiences, not possessions, that make people happy. So why not recommend that your grad students spend more time going to bullfights?

Second, I’m bothered by that last sentence that the resume “will always have the five papers.” The end state of research is not the resume. Nor is it the tenured job, the press release, the Ted talk, or the appearances on Oprah and Dr. Oz. Just ask Roy Baumeister or John Bargh.

I really don’t like the message that Wansink is sending to his students, that a paper on your resume lasts forever. It lasts forever if it’s a real finding, or if it leads to progress. But it doesn’t last forever if it can’t replicate (except in the indirect way that certain papers on ESP, sex ratio, himmicanes, power pose, cold fusion, etc., will last forever as warnings of scientific overconfidence).

Reactions

To his credit, Wansink has a comment section on his blog. And most of the comments are pretty harsh; for example:

You pushing an unpaid PhD-student into salami slicing null-results into 5 p-hacked papers and you shame a paid postdoc for saying ‘no’ to doing the same.

Because more worthless, p-hacked publications = obviously better….? The quantity of publications is the key indicator of an academic’s value to you?

I sincerely hope this is satire because otherwise it is disturbing.

This is a great piece that perfectly sums up the perverse incentives that create bad science. I’d eat my hat if any of those findings could be reproduced in preregistered replication studies. The quality of the literature takes another hit, but at least your lab got 5 papers out.

What you describe Brian does sound like p-hacking and HARKing. The problem is that you probably would not have done all these sub-group analyses and deep data dives if your original hypothesis had p < .05. . . . it is a bit difficult to end on a positive note. I have always been a big fan of your research and reading this blog post was like a major punch in the gut.

But I strongly disagreed with this comment:

If a hypothesis is sound, you should be able to predict the result of an experiment. Predict as in beforehand.

That sounds good, but in most of my applied work I learn so much from the data analysis and I can almost never predict beforehand what I’ll find. It’s important to get good data, though, and I have doubts about the quality of the data in that all-you-can-eat-restaurant experiment. To me, the key problem here is that any theory is weak to nonexistent, and there are so many different ways that you can look at this small dataset. I’m not surprised that with intense effort they were able to find many different statistically significant comparisons—who knows, maybe a few more papers are forthcoming on the speed at which different people ate their salads, the positioning of men, women, and children at the restaurant tables, the relationship between how much they ate and how far away their cars were parked, etc. The possibilities are endless.
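To get a feel for how easily a small, noisy dataset yields “significant” comparisons, here's a minimal simulation sketch in Python. Everything in it is an assumption for illustration, not the design of the buffet study: two price groups of 61 diners each, 20 unrelated outcome measures that are pure noise, and a normal-approximation test of the difference in means for each outcome.

```python
import random
from statistics import NormalDist, mean, variance


def two_sided_p(a, b):
    """Approximate two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def any_significant(rng, n_per_group=61, n_outcomes=20, alpha=0.05):
    """Simulate one study in which every outcome is pure noise.

    Returns True if at least one of the comparisons reaches p < alpha.
    Group sizes and number of outcomes are hypothetical choices.
    """
    for _ in range(n_outcomes):
        cheap = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pricey = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(cheap, pricey) < alpha:
            return True
    return False


rng = random.Random(2016)  # fixed seed so the run is repeatable
n_sims = 1000
hits = sum(any_significant(rng) for _ in range(n_sims))
share = hits / n_sims

print(f"Share of pure-noise studies with at least one p < 0.05: {share:.2f}")
print(f"Theoretical value for 20 independent tests: {1 - 0.95 ** 20:.2f}")
```

With 20 independent tests and nothing going on, the theoretical chance of at least one p < 0.05 is about 0.64, and the simulated share should land near that. Real forking paths are messier than this (the comparisons are correlated, and exclusion and coding choices multiply the options), so treat the sketch as a lower bound on the flexibility available, not as a model of what these researchers did.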

P.S. I just wasted an hour writing this. Ugh. I wish I’d watched an episode of Game of Thrones instead. My CV is ephemeral; the High Sparrow is eternal.

P.P.S. Wansink added an addendum to the beginning of his post. My take on the addendum is that Wansink is an open person who read all the comments but, unfortunately, doesn’t seem to understand the key statistical or methodological point, which is that he and his student could well have been sifting through noise, and that there’s no real reason to believe most of the claims published in those papers. But openness is a good start; I’m hoping that Wansink and others like him will continue reading the relevant literature and at some point will realize that failure is an acceptable option with noisy, poorly-motivated studies.

Here’s what Wansink wrote in his addendum:

With field studies, hypotheses usually don’t “come out” on the first data run. But instead of dropping the study, a person contributes more to science by figuring out when the hypo worked and when it didn’t. This is Plan B. Perhaps your hypo worked during lunches but not dinners, or with small groups but not large groups. You don’t change your hypothesis, but you figure out where it worked and where it didn’t. Cool data contains cool discoveries. If a pilot study didn’t precede the field study, a lab study can follow — either we do it or someone else does.

The problem is that these “deep data dives” can often turn up nothing more than meaningless patterns, idiosyncratic to the data at hand. The statement “cool data contains cool discoveries” can be flat-out wrong. It doesn’t matter how “cool” your data are: if the noise is much higher than the signal, forget about it.

Brian Nosek learned this himself when he and his colleagues tried, and failed, to replicate their “50 shades of gray” study.

Brian Wansink refuses to let failure be an option. If he has cool data, he keeps going at it until he finds something, then he publishes, publishes, publishes. Brian Nosek recognizes that his research can fail. He realizes that when Plan A fails, the best Plan B may be to simply write the paper explaining that Plan A failed, to accept the limitations of his data. I hope that, someday, Brian Wansink learns this lesson too.

P.P.P.S. I did some more searching and found someone pointing out this from the “Low prices and high regrets” article:

[Screenshot of a note from the article crediting “OS” with collecting the data]

“OS” is Ozge Sigirci, the Turkish graduate student mentioned above. But then something’s really wrong here. Wansink clearly stated in his blog post that the study had been designed and the data collected before Sigirci arrived. So how could it possibly be that she collected the data? I also continue to wonder about the original “failed study which had null results” from which all the rest of this flowed. How can you possibly think you’re making research progress if you publish your noise-mined successes but keep your failures hidden? Sure, I understand the bias: I’ve had a lot of failed ideas myself and I don’t usually get around to writing them up and publishing them; it takes work that always seems like it could be better spent elsewhere. But if you’re going to publish four separate papers on different aspects of a “failed study which had null results,” wouldn’t it be a good idea to also make clear what were your original hypotheses that weren’t borne out? Cos it might be that these hypotheses are actually fine, and that your data was just too noisy for them to show up in your sample.

P.P.P.P.S. See here for a similar reaction from Ana Todorović.

P.P.P.P.P.S. More here.

33 Comments

  1. Jonathan says:

    I enjoy Wansink’s work. He does field work experiments that seem to me to illuminate how people think about food and how that is easily manipulated by tweaking expectations with lighting, nice labels, etc. and how the availability of food, even really bad food, together with simple things like portion sizes or removing plates more often also manipulate how much a person eats. But fieldwork like this pretty much always has a limited n and, frankly, I see little benefit in his papers in statistical analysis beyond things like “the average amount eaten was x in this case versus y in this case”. I was disappointed to see this field work mined for so much dross. Not even sure why that was done. BTW, much of the work he does is really for industry and I suppose he’s best known for stuff like 100 calorie packs. His latest book is about half aimed at restaurants and institutions, giving advice on how to increase diner satisfaction while saving money. He tweets so much because his popular books have a “community” around them, so he feeds that the same way other pop-sci authors do. I categorize his work as better than much of the writing about food because his experiments actually demonstrate behaviors you might see in yourself (or not), as opposed to a highly curated version of science that isn’t as well understood as presented but which is presented as truth. As in, we switched to smaller plates and different shaped glasses, relocated our snacks and rearranged other items to make them more prominent. I contrast that kind of specific behavioral observation backed by field work – but with limited statistical analysis because small experiments are small – with dubious statements about fructose or carbs or certain kinds of proteins and how they have changed our bodies, our minds, etc.

  2. Marcel van Assen says:

    Should we want to know ALL results, also of those analyses that did NOT detect an association?
    Surprisingly, my answer is NO.
    Here is an example to explain my point.
    Imagine someone has a huge (HUGE) dataset with zillions of variables, most of them seemingly unrelated. Sample size is quite high, so that power approaches 1 even when the true effect size is small.
    Then he decides to compute correlations between all pairs of variables and test them.
    One of the results is that eating peanut butter is unrelated to the quality of human semen. The researcher had no expectation on this null result, and knows nobody who had/has. Should he report on this null-result? And on all the other 1E+10 null-results on which he had no expectations?
    Imo, a report should let people know what is tested (in the example above, by linking to a description of the dataset and syntax), but report only on those null-findings that were hypothesized to be true findings, and on all positive findings (hypothesized or not).

    • random-internet guy says:

      I think you are missing the point. Finding a correlation is at most half the job. Just having a correlation can still be meaningless because you still don’t know all the variables involved (or the relevance). It is trivial to find random correlations or massage data so that it fits your hypothesis.

      This is not science. These studies miss half the work. Where are the attempts to disprove the conclusion? Science is about attempting to DISPROVE a result. It is one thing to have a hypothesis, have a study, and jump to a conclusion. It is worse to have it fail, find another hypothesis based on the study, and jump to a conclusion.

      • Anoneuoid says:

        >”Where are the attempts to disprove the conclusion? Science is about the attempting to DISPROVE a result.”

        Yep, when I was becoming aware this is one of the first conversations I had in the hallowed academic halls.

        me: “Shouldn’t we be trying to *disprove* our hypothesis, rather than prove it? The null hypothesis should be what we predict will happen.”
        prof: *blinks* and changes subject

    • Andrew says:

      Marcel:

      We don’t need to hear about all null results, but when a “world-renowned eating behavior expert” (as Wansink calls himself) does a study that he thinks is such a good idea that he funds it himself, and then his “Plan A” fails, he should share this failure with the world, no?

      This was not a random thing being tested, it was the carefully-thought-out hypothesis of a world-renowned expert. When a world-renowned expert gets it wrong, that’s news, and it would help the advancement of science for all of us to know right away.

  3. Ruben says:

    > I couldn’t bring myself to pay $16 just for the privilege of reading this paper.
    You don’t know Sci-Hub.cc or you don’t (dare) use it?

  4. Robert Grant says:

    The divide between pre-specified stat hypothesis and explanatory hypothesis keeps bugging me. This is a nice example: they do the comparison of same- and mixed-sex groups and how much they eat (fine, stat hypothesis is pretty clear-cut) and then report only the men. So, either there is some sub-group cherry-picking or they actually suspected the men all along. I don’t know this fellow’s work or anything about food research but it seems a common problem. With some familiar, off-the-shelf tests to reach for, people tend to think less about the explanation or data generating process and that gets them in tight spots. If you really wanted to study men eating more pizza around other men, you’d do the study differently, not salami slice (ha!) this one. But standard psych-influenced stats textbooks will gleefully point you to some kind of anova or whatever for this situation and let you p yourself (joke stolen from stats-textbook-writing psychologist Andy Field) when you find a significant result for your stat hypothesis that you think maps one-to-one to your tacit explanatory hypothesis (if you thought about it much at all).

  5. Robert Grant says:

    So I think the problem is the education with a bunch of ready-to-go procedures. Daniël Lakens says: “Young scholars: Theories and paradigms change. But knowing philosophy of science, computer programming, and statistics are skills for life.” (https://twitter.com/lakens/status/707099764056334336) I like this quote a lot. With philosci you wouldn’t stumble into testing dumb hypotheses, and with programming you would think about simulation, randomisation tests, cross-validation, and other more bespoke ways of looking at what you did and whether you can trust it. I think they link together.

    • Anonymous says:

      “With philosci you wouldn’t stumble into testing dumb hypotheses, and with programming you would think about simulation, randomisation tests, cross-validation, and other more bespoke ways of looking at what you did and whether you can trust it”

      “Skills for life” somehow related to only science then?

      I vote for logical reasoning to be a part of the curriculum and as an example of a topic that can be considered as developing “skills for life”. I still find it strange that that wasn’t a required part of the curriculum at my university.

  6. Dale Lehman says:

    I can’t believe nobody has contributed this yet: the advice seems fit for a garden of forked paths. If ever there was an example of how/why forked paths are a problem, this sounds like a recipe for them.

    More seriously, I disagree with Marcel’s point above. Sure, reporting all results (even negative findings) is not always desirable – his example is one case of that. But we are dealing with a very different case here. Here we have a researcher with a hypothesis, A, which sounds like prior information here. Even if a Bayesian approach is not used, I think hearing what the data says about A is of primary interest. This is not a case where we have zillions of data points and hundreds of variables but a somewhat carefully designed study to explore a behavior about which we have some prior beliefs. In this case, I think reporting all the results is important. Even more important in my mind is that the data be released publicly.

    • Anonymous says:

      I think the real problem with psychology is that you can find “evidence” or references for just about anything. There is always “prior information” i reason, at least to some extent. To me, this also makes “prior information” pointless, at least in the way i comprehend it is being used currently.

      To me, this could perhaps also help explain why studies which have later failed to replicate received hundreds or thousands of citations. You can “build” on just about anything, and combined with low standards of publication of your “new” findings, you end up with an endless cycle of nonsense (but all with some “prior information” and what seem to be “valid” hypotheses).

      I toyed with the idea of trying to see just how much of my above expressed rambling is possibly accurate, by trying to take an introduction section of a random psychology paper and create a new introduction which states the exact opposite of the reasoning and conclusions of the original paper, and then find references to fit these new sentences/conclusions. To me, that would be a nice validation of my thoughts. I however don’t have the energy for it…..

  7. G says:

    Andrew:

    This is a really good post. I emailed a month or so ago about the same problem. I was invited to analyze a “really cool” dataset which cost the original researchers a lot of money to generate. When I consistently found null effects I was continually pushed to produce significant findings. I recall an email from one of the researchers imploring me to explore interaction terms, quadratic and cubic terms, transforming the dependent variable, and omitting cases. As a budding PhD these situations make me extremely uncomfortable. I’ve trained myself over the past 4 years to use the generally accepted best practices in statistics and data analysis. I’ve learned Bayesian statistics and used Stan to generate meaningful models (even if they do find null results). Despite all this work, it seems that many people I work with would rather I just perform an underpowered ANOVA and report something significant. It leaves me feeling very discouraged sometimes.

    • Martha (Smith) says:

      G:

      Hang in there; you’re doing the right thing.

    • +1 to what Martha said, also, it’s easy to ask potential employers what they think of the garden of forking paths.

    • Robert Grant says:

      If you’re an isolated statistician, the constant requests to mess it up and score cheap points will be your lot. I get them too. But you clearly are destined for better things than that, as long as you don’t get discouraged.

      Never agree to work on Really Cool Datasets. They are invariably a pile of poop.

    • Rahul says:

      I’m skeptical of the utility of situations where someone dumps a dataset on you and asks you to discover a relationship.

      Almost always the productive interactions are where they come with a certain problem (e.g. Is A better than B; can you predict X etc.)

      • Anoneuoid says:

        >”Almost always the productive interactions are where they come with a certain problem (e.g. Is A better than B; can you predict X etc.)”

        It usually starts like this (“Is A related to B?”). But then when the answer is “doesn’t look like there is much going on here” the frantic flailing about begins. BTW, that flailing is what a lot of researchers think constitutes “exploring the data”.

  8. Jason Chin says:

    Shouldn’t peer-review have picked up on this?

  9. Jordan Anaya says:

    Hi Andrew,
    I have been carefully reading the pizza publications in question. In case the sample size of 95 is still bothering you I believe some diners were dropped because they ate more than 3 pieces of pizza. The study specifically reports diners who ate 1, 2, or 3 pieces, while the other studies report on diners who potentially ate more than 3 pieces. But of course we can’t rule out more nefarious reasons for the sample size differences.

    • Andrew says:

      Jordan:

      In the paper it says, “Out of participants who ate at least one piece of pizza and were included to our analysis (n = 95).” The phrase “and were included in our analysis” can cover anything, I guess.

      But my concern is not that they were doing anything nefarious. The problem is the usual forking-paths story: the researchers had many ways they could exclude data and many ways they could code data and many ways they could analyze data, and indeed Wansink states himself that they tried lots and lots of things until they found some results. The variation in sample size from study to study is just one signal of a particularly obvious set of researcher degrees of freedom.

      I’d really like to avoid talking about “nefariousness” in such situations, for two reasons:

      1. I have no reason to think the authors did anything knowingly unethical in their research.

      2. By focusing on “nefariousness,” you give these sorts of researchers an “out”: it’s natural for them to say, “We were not nefarious, we did not p-hack, we did not cheat” and then think they’re off the hook. One reason I keep talking about the garden of forking paths is to emphasize that these problems can happen even if you’re not trying to cheat in any way.

      • Jordan Anaya says:

        Sorry for giving the researchers an out. Yes, even though it is convenient to compare diners who ate 1, 2, or 3 pieces of pizza, there is no reason they couldn’t have used the groups 1, 2, or 3+ pieces and included more diners. Perhaps they initially did this (and many other things) and didn’t get as good of results, which is the main concern here.

  10. Kaiser says:

    I forgot to sign my comment so apologies if this shows up as a duplicate.

    Absolutely agree that this is a case of forking paths. For this reason, I want to know that the researcher ran one million tests on the Big Data set and reported the four tests reaching p < 0.05. Better to mention that there were 999,996 negative tests in the abstract to save my time!

  11. Hey, thanks for referencing my post! That explains why people still keep coming to read it. I replied to your comment over there.

    The sad thing is, he seems to be acting from a kind impulse – he’s sharing the things that worked for him on the road to tenure.

  12. Anonymous says:

    > Second, I’m bothered by that last sentence that the resume “will always have the five papers.” The end state of research is not the resume. Nor is it the tenured job, the press release, the Ted talk, or the appearances on Oprah and Dr. Oz. Just ask Roy Baumeister or John Bargh.

    This is an indulgence for the tenured, not the Ph.D student. At least in the past, the best way to get hired has been a lengthy resume.

    • Andrew says:

      Anon:

      Sure, but in that case Wansink shouldn’t say, “this Turkish woman’s resume will always have the five papers.” He should say, “the five papers should be enough to get this Turkish woman a job, at which point it won’t matter how good the work was.” Saying “always” implies a permanence that isn’t there.

  13. Random guy says:

    The problem lies also in the journals themselves and their acceptance practices. Nobody wants to hear about non-significant results. So even if you had a nice idea and a good data-set, if the results are not significant then it is really difficult to persuade the editors and the reviewers that it is a good result on its own.

    • Andrew says:

      Random:

      Sure, but there was nothing stopping Wansink from telling us on his blog what was this failed hypothesis. Why keep it a secret?

      Also I think he could’ve mentioned the failed hypothesis in the papers he published. Then again, it was kinda Bruno-Frey-like of him to publish 4 papers from the same study that didn’t cite each other. It’s almost like he went to extra effort to extract all context out of these presentations.

  14. Nick Brown and colleagues wrote up an interesting re-analysis of the papers in question, it’s on PeerJ:

    https://peerj.com/preprints/2748/
