Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.”

Background

Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many errors in just 4 papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

The new paper

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wansink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion Article 2 states:
“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?
Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states:

“Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

Not the worst of it

And, remember, those 4 pizzagate papers are not the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data, but they do seem consistent with someone making up numbers, not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
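
As a rough sketch of this kind of check (the values below are placeholders, not the actual numbers from the tables), one could tally the trailing digits like this, keeping the values as strings so trailing zeros survive exactly as printed; under honest rounding of noisy quantities, each digit should turn up roughly 10% of the time:

```python
# Placeholder values; a real check would use the means and F statistics
# from the three published tables.
from collections import Counter

reported_values = ["1.23", "4.57", "10.41", "3.96", "2.18"]

counts = Counter(value[-1] for value in reported_values)
for digit in "0123456789":
    print(digit, counts.get(digit, 0))
```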

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.

And:

    Sığırcı, Ö., Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7.

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

    The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Comments

  1. Isn’t it a little unfair to dump on Wansink without at least referring to his response here: http://foodpsychology.cornell.edu/sites/default/files/unmanaged_files/Response%20to%20%27Statistical%20Heartburn%27%20Table%20of%20150%20Inconsistencies%20FINAL.pdf
    No one has yet responded to his response.
    Not saying that Wansink is any good (he, like many in the soft sciences, seems to be pretty lousy with statistics), but rather than just a screed, it would be nice to see what he says also and why it is wrong.

    • Folow:

      There are no screeds here. Follow the link from my post above: this takes you to the new article by Anaya, van der Zee, and Brown. (Here’s the link again, for convenience.) They discuss Wansink’s response, go through it in detail, and explain how he was wrong.

      Also, yes, Wansink “seems to be pretty lousy with statistics,” but what really makes him stand out is that he doesn’t report what he’s actually done. His various descriptions of his data collection contradict each other and also are not consistent with the numbers in his published papers. This is far worse than mere bad statistics.

      • I guess the purpose of posting our reanalysis was to provide people with an independent analysis of the data, although I guess you could argue we are biased since we claimed 150 errors in the original papers so in theory we are incentivized to push the narrative that the papers contain serious issues.

        Although I was curious to look at the data to try and understand where all the originally reported numbers came from, I didn’t really want to spend more time disparaging the Cornell Food and Brand Lab as it appears their brand is beyond salvaging at this point. However, I didn’t find their report to be very transparent, or accurate. The report made it seem as if most of the granularity problems were due to misreported sample sizes which resulted from missing responses. Technically there are misreported sample sizes, and there is missing data, but many sample sizes were larger than originally reported, indicating further issues than missing data. In fact, there are quite a few statistics that I don’t know how to reproduce given the data.

        Basically, whenever future journalists report on this story I didn’t want them to just trust Cornell’s report and assume that we overstated the problems in the publications. If anything, I would say our concerns were validated and then some. But if our reanalysis is biased or wrong of course Cornell or others are free to post a rebuttal, which I would be very interested in reading.

        • Jordan,
          You guys have done fantastic work exposing researchers like Wansink, but you are overstating your case that his reputation is beyond salvaging. I suspect that the problems you found with his papers are rampant in the literature in his field, and this will lead to your criticisms of him being largely ignored, since such problems are prevalent in the field and his peers will perceive them as no big deal. I think a lot more work needs to be done, not about individual researchers but about whole fields (obviously no one person can do all of this). Personally, I suspect that the food science field is full of poorly trained scientists and statisticians, that most papers display these errors, and that they will just circle the wagons.
          We can test this hypothesis by simply waiting a few years and seeing if Wansink’s h-index has decreased. My bet is no.

          On a different note, I think all the emphasis on granularity errors and other types of errors masks a larger problem. Based on Wansink’s original post, a huge swath of his work is clearly p-hacked. In my mind, once you are p-hacking, I don’t care if all of your statistics are error-free, the whole approach is just totally non-scientific and the work should be dismissed, even if each individual calculation adds up. This was the original criticism of Wansink, and it, in my mind, is the most egregious.
          To put this another way: imagine the pizza papers lacked any of the errors that you detected in your analysis. Would Wansink’s work be any more correct than it is now? My answer is an emphatic no.

          The real breakthrough of yours, Andrew’s, and others’ work on the Wansink case (in my opinion) is that it opens up the opportunity for more general inquiry into whole sub-fields that may be suspect, such as food science. Wansink is one example; how many others are out there who are engaging in p-hacking?

        • FolowUp:

          I agree that p-hacking is more serious than misreported sample sizes, typos, or whatever the case may be with granularity problems. However, consider this: if we didn’t point out that the numbers in Wansink’s papers didn’t add up, would the general public (i.e., the media) have cared about Wansink’s blog post? I think it’s difficult to get people riled up about flexible analyses, small sample sizes, the file drawer effect, etc., but it’s pretty easy to get people to understand that numbers in papers should add up.

          This is simply a hypothesis, but I believe that people engaging in rampant p-hacking and poor study design are much more likely to inaccurately report their results, and be exposed by granularity testing. As a result, I do think granularity testing can be used as a sort of proxy for catching p-hacking.

          Also, granularity testing has the ability to catch people who are just completely fabricating results, which is obviously far less prevalent than p-hacking, but is still worth trying to detect and expose.
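
          As a rough sketch of what a granularity check does (a bare-bones GRIM-style test; the function name and the example numbers are invented for illustration, not taken from any of the papers discussed here): with integer-valued responses and a known sample size, only certain means are possible at the reported precision.

          ```python
          import math

          def mean_is_possible(reported_mean, n, decimals=2):
              """Could a mean reported to `decimals` places arise from n integer
              responses (e.g., a 1-9 Likert item)?"""
              tol = 0.5 * 10 ** (-decimals)              # half-width of the rounding interval
              lo = math.ceil((reported_mean - tol) * n)  # smallest integer total in that interval
              hi = math.floor((reported_mean + tol) * n) # largest integer total in that interval
              return hi >= lo                            # some integer total is consistent

          print(mean_is_possible(3.44, 25))   # True:  86 / 25 = 3.44
          print(mean_is_possible(3.45, 25))   # False: no integer total / 25 rounds to 3.45
          ```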

  2. Jordan,
    You are right of course about getting the media fired up, but I am a little wary about this type of media attention seeking, since (1) the general public focuses on the wrong issues, which will lead to (2) the scientific community ignoring the outcry, since it is focused on the wrong issue. If the public mis-learns the issues, it just makes reform harder. Most of the posts in the mainstream media about BW have focused on the errors in his papers, and have given little or no attention to the p-hacking.

    Some of this might be because it is easy to explain to people that someone made errors, but why p-hacking is bad is hard to explain to a non-statistician. Journalists will then default to the easy, but less fundamental, issue.
    I think a better approach is to explain the issues as clearly as possible (for example, the fivethirtyeight.com blog does a nice job of explaining p-hacking: https://fivethirtyeight.com/features/science-isnt-broken/). A lot of press has already come out of the p-hacking issue, and even the BW issue started out of p-hacking.

    We’ll wait and see, I still bet that Wansink comes out of this relatively unscathed within his field.

    • FolowUp,

      It isn’t just the media and journalists, of course. Journals behave the same way – errors of the nature that Jordan and Tim and Nick and James have found can lead to corrections and retractions. When was the last time a paper was corrected or retracted after it came out that the results were p-hacked?

      • I grant p-hacking is nearly impossible to prove absolutely and BW only got in trouble because he openly admitted it. However, there is an understanding that in many fields of the social sciences it is happening frequently. I think the retraction/correction procedure is not a good metric, and overemphasis on it is not a good idea. This is especially true because this type of detective work tends to focus on individual researchers to show they are bad (e.g. BW) without properly condemning whole fields. I, for one, basically don’t trust anything that comes from the social sciences (other than from certain sources) because of this problem, and I certainly don’t trust the field of food science. I think that this emphasis on bad researchers versus bad fields/bad practices misses the forest for the trees.

        • Folow:

          I agree that retraction/correction isn’t the best solution here. The problem is that retractions and corrections aren’t scalable: each one is taken as such a big deal. Even in extreme cases it’s hard to get a retraction or a meaningful correction if the author doesn’t want to do it. Remember that “gremlins” paper by Richard Tol that had almost as many errors as data points? Even that never got retracted or fully corrected, I think.

          It’s similar to the problem of using the death penalty to control crime. It’s just too much of a big deal. Each death sentence gets litigated forever, and very few people get executed because of (legitimate) concerns of executing an innocent person.

          Scientific journals’ procedures for addressing published errors are, like capital punishment in the U.S., a broken system.

          I don’t have any great solutions to crime and punishment. Flogging, maybe.

          But for the journal system, I recommend post-publication review. In a post-publication comment you can point out problems with forking paths and other statistical problems, and you can also point out problems such as in Wansink’s work where a research team has, in the words of the NIH, been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

        • Andrew,
          I agree with your assessment of the problem, and I think post-publication peer review is a good idea for this. Apropos of this, one way PPPR I think needs to go is not just searching for fraud (ala Pubpeer) but genuine commenting on papers, discussing strengths and weaknesses, alternative approaches etc. This type of discussion allows an assessment of who is reliable and who, while not committing fraud, is pushing a story that their data might not fully justify. I have right in front of me a paper by an eminent physicist that is widely known in the tiny subfield to not be their best work, and is probably an artifact of the experimental system used. This is not a fraudulent paper, but (like BW, I suspect) is a reflection of a non-ideal experimental system and sloppy work, with resulting interpretations not fully justified. I am not even really against its having been published, but a formal post-publication review would allow readers to understand that, in the end, it is a crummy paper.
          One frustrating thing for me about sites like Pubpeer is that they are now mostly used to search for fraud and not for discussion, but that is a topic for a different time….

        • FolowUP: “one way PPPR I think needs to go is not just searching for fraud (ala Pubpeer) but genuine commenting on papers, discussing strengths and weaknesses, alternative approaches etc.”

          Yes!

          Also: The 538 article you linked to above does seem to explain the issues unusually well for a non-statistician audience. It ought to be required reading for Research Methods courses in lots of fields. And such courses also ought to include reading papers and then PPPR’s of them of the sort you recommend.

        • Anoneuoid: You’re right, that particular point is not good — but most of the rest of it is (for the intended audience).

        • Anoneuoid: You’re right, that particular point is not good — but most of the rest of it is (for the intended audience).

          My position is that it’s impossible to explain the logic of NHST to the layperson because it makes no sense. People intuitively know that scientists are just supposed to be evaluating their hypothesis, not using some kind of convoluted logic like:

          “We set a null hypothesis the opposite of our hypothesis (also called the alternative hypothesis) and try to disprove that, then if we reject the null hypothesis it means our hypothesis must be true… To decide whether to reject the null hypothesis we calculate a p-value which assumes the null hypothesis is true, therefore assuming our hypothesis must be false. Thus, the p-value is a “measure”, or “index”, of how surprised we would be to get our results if our hypothesis was false.

          If we reject the null hypothesis, then we must accept our hypothesis since the results would be unsurprising if our hypothesis were true. If that is difficult to understand, think about it just like court, where the null hypothesis is assumed innocent until proven guilty. In other words, the burden of evidence is on the scientist to prove the null hypothesis is false, which would mean that their hypothesis is true.”

          The only reason we still have this going on is that people who realize how much has been claimed based on NHST (ie not laypeople) are scared of the scope of damage that may have been done.

        • I’m not sure what makes the quote about the p-value as an index of surprise so bad. At least compared to the following remark in the Nature article she mentions:

          > Last year, for example, a study of more than 19,000 people showed that those who meet their spouses online are less likely to divorce (p < 0.002) [….] That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%

          If a 22% reduction in divorce rates is "tiny", I wonder what kind of effect she would find worth reporting.

        • Anoneuoid said:

          “People intuitively know that scientists are just supposed to be evaluating their hypothesis, not using some kind of convoluted logic like:

          “We set a null hypothesis the opposite of our hypothesis (also called the alternative hypothesis) and try to disprove that, then if we reject the null hypothesis it means our hypothesis must be true… “”

          I would call “then if we reject the null hypothesis it means our hypothesis must be true” invalid logic, rather than “convoluted”.

          I realize that many people use this invalid logic, so when teaching frequentist statistics, I emphasize that rejecting the null hypothesis does not mean the null hypothesis is false — that rejecting the null hypothesis on the basis of a p-value is an *individual choice*, not a logical consequence, so must be accompanied by the acceptance that our choice might be incorrect.

        • Martha wrote:

          I would call “then if we reject the null hypothesis it means our hypothesis must be true” invalid logic, rather than “convoluted”.

          Sure, but the convoluted part is that we are dealing with the “null hypothesis” and somehow must make it seem that we are evaluating the “research hypothesis”. If one is true the other must be false, but for some reason we need to check the former to draw a conclusion about the latter, etc.

          I put in the stuff like confusing “rejecting” with “disproving” as the usual red herrings to obscure the real problem with the logic (that simply rejecting the null hypothesis in no way indicates the research hypothesis is correct). It is convoluted to begin with, and then becomes even more so upon disentangling, due to the comedy of errors stacked on top of each other. I really don’t think that parody of an explanation I offered is a strawman.

        • “I certainly don’t trust the field of food science.”

          GS: Just out of curiosity…what is “food science”? It doesn’t sound like a “field” to me. And it would be misnamed if it were, since “it” surely is concerned with behavior in some way (as in rates of consumption of food, etc.). Now, if that is the case, then the field is something like “regulation of food intake” and would thus be part of the natural science of behavior, as well as the part of physiology specifically concerned with the physiological mediation of behavior. In that case, I don’t know exactly why you would trust it any less than a great many other fields.

        • Food science is close to nutrition, but it deals with how food is prepared, how to improve it, etc. Wansink’s research covers food science, nutrition, marketing, and psychology related to food and nutrition. All interesting subjects, if the research is well done…

        • I don’t know much about Wansink – just what I hear about him here – but aren’t his dependent variables directly or indirectly related to behavior? At least in all of the studies being ridiculed here?

        • A lot of “food science” has nothing to do with behavior. We have food scientists to thank for the fact that botulism is rare, that commercial bread doesn’t mold within 1 day of preparation, that ground meat doesn’t kill you a day or three after you eat it, that we can buy shelf-stable soup broth, that you can buy a spray can that makes your baked goods not stick to the loaf pan… etc etc

  3. Re: “Number of occurrences”, to be fair, one practice is to never end with a 0, so the small number of 0s may not be suspicious at all. E.g., you could report 1.563, but you would not report 1.560, you’d report 1.56. So the only 0s reported would be pure 0s.

    • I suspected this too, but if you look at the tables every mean is reported to exactly 1 decimal place. There are a couple instances of “X.0”, but no instances of just “X”.

  4. This is a reply to Carlos Ungil above:

    I’m not sure what makes the quote about the p-value as an index of surprise so bad.

    This is the quote in question:

    “Instead, you can think of the p-value as an index of surprise. How surprising would these results be if you assumed your hypothesis was false?”

    https://fivethirtyeight.com/features/science-isnt-broken/

    I think everything else there is a red herring that distracts from the reasoning behind this quote.

    1) Calling the p-value an “index” of surprise (I have also seen “measure”). It is never quite defined how this index/measure works. Eg, what type of measurement is it (categorical, ordinal, interval, ratio)? How exactly does it map to the amount of evidence for “your hypothesis”? Why not just look at the amount of evidence for/against your hypothesis instead of this index?

    2) What is “surprise”? How is it defined? Isn’t the amount of surprise going to depend on the person, ie isn’t this “subjective”?

    3) Then there is “if you assumed your hypothesis was false”. Well, the p-value is calculated using equations that assume the *null hypothesis* is true, so “your hypothesis” must amount to “not the null hypothesis”.
    — a. First, we are leaving out the possibility of correctly using a p-value to assess what “your hypothesis” predicted.
    — b. Second, is “your hypothesis” really amounting to “anything possible except the null hypothesis I tested”? Isn’t this a kind of unfair competition between the null hypothesis and “your hypothesis” since the former is a single value and the latter is every other possible outcome?

      • The usual definition I would give (even if I was not trying to be concise) is, "The probability of observing data at least as extreme as what you found, conditional on the null hypothesis," which comes in at 18 words. I think this has less potential to be misleading than 538's version, since it correctly defines things in terms of the null, without reference to whatever particular alternative you were thinking of. I also dislike the "index of surprise" because it presupposes that the falsity of the null hypothesis would be surprising, but in many real cases the null hypothesis being tested ('effect size of exactly zero') is a priori unlikely.
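
        As a concrete (if simulated) illustration of that definition, here is a minimal Python sketch with made-up data that estimates "the probability of observing data at least as extreme as what you found, conditional on the null hypothesis" by permutation rather than from the t distribution:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        group_a = rng.normal(0.0, 1.0, size=30)   # made-up measurements
        group_b = rng.normal(0.4, 1.0, size=30)
        observed = abs(group_a.mean() - group_b.mean())

        # Under the null hypothesis the group labels are exchangeable, so reshuffle
        # them and count how often a difference at least as extreme shows up.
        pooled = np.concatenate([group_a, group_b])
        n_sims, count = 10_000, 0
        for _ in range(n_sims):
            rng.shuffle(pooled)
            count += abs(pooled[:30].mean() - pooled[30:].mean()) >= observed
        print("permutation p-value:", count / n_sims)
        ```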

        • > it correctly defines things in terms of the null

          However, it doesn’t define the null. The word “conditional” is also a bit technical for the layman (or maybe I’m supposed to say layperson?). But I agree that’s also a good definition.

          > the “index of surprise” presupposes that the falsity of the null hypothesis would be surprising

          I don’t think so. It works whether you think the null hypothesis is likely to be true (how surprising would it be for someone to toss five heads in a row, assuming the coin is fair) or false (how surprising would it be for Nadal to beat you five sets in a row, assuming both of you played equally well).
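
          The coin version of that, worked out (assuming five tosses of a fair coin):

          ```python
          # Probability of five heads in a row under the null hypothesis of a fair coin.
          print(0.5 ** 5)   # 0.03125
          ```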

        • how unusual ~ how surprising

          for a specifically chosen random number generator ~ if you assumed the null hypothesis is true

          Doesn’t look much better to me.

        • I agree, but it is hard to include the concept of an exhaustive alternative in the 20-word description. Your “specifically chosen random number generator” avoids the difficulty by not explaining anything at all about how that random number generator was specifically chosen. I don’t think that makes it a better description.

        • Also, I think the use of the term “null hypothesis” gives an incorrect impression to the layman. A null hypothesis is just a specific random number generator, but it sounds more like “a hypothesis”. I think of a hypothesis as something like “giving kids free toothbrushes improves their oral health,” so a null hypothesis is something like “giving kids free toothbrushes doesn’t improve their oral health,” which doesn’t sound much like a random number generator. But in the end a normal(0, sigma) RNG *is* your null hypothesis, so “giving kids free toothbrushes causes their oral health to change by a random amount with mean zero, normal distribution shape, and estimated sigma” is your real null hypothesis. For laymen, and maybe even for teaching statisticians, I think being explicit about the fact that we’re talking about an RNG is helpful.
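
          To make that concrete, here is a minimal sketch of treating the null hypothesis as literally a normal(0, sigma) random number generator; the toothbrush numbers below are invented, so this is purely illustrative:

          ```python
          import numpy as np

          rng = np.random.default_rng(1)
          n_kids, sigma = 50, 2.0        # assumed sample size and noise level
          observed_change = 0.6          # hypothetical observed mean improvement

          # The null hypothesis, written as an RNG: changes in oral health are just
          # normal(0, sigma) noise, with no toothbrush effect at all.
          null_means = rng.normal(0.0, sigma, size=(100_000, n_kids)).mean(axis=1)

          # How often does pure noise produce a mean change at least this large?
          print((null_means >= observed_change).mean())
          ```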

      • This is the best I’ve seen (28 words):

        The P-value and sample size together correspond to a unique likelihood function, and thus act as a summary of that function and the evidence quantified by that function.

        https://arxiv.org/pdf/1311.0081

        We’d need to explain that a likelihood function is a way of showing the value of, and uncertainty about, the model parameter being estimated. I think that is it, though.

        • The problem then becomes that a very small % of the people who read that fivethirtyeight article would have any idea what this means.

          If you were writing an article for a popular audience in which you were explaining to them why relying on p < 0.05 (or p-values at all) is a problem, how would you explain it? Assume that the readers are generally educated and curious and possibly involved in research of some kind, but don't understand what a p-value is. I know you said above that you can't explain the logic of NHST to laypeople on account of it not making sense, but there are lots of people out there who think it makes sense and who explain it all the time (typically poorly). Assuming that there is some value in communicating the problem of relying on p < 0.05 to a general audience, this logic needs to be explained somehow, even if only for the purpose of eventually attacking it.

          I run into this when talking to friends of mine who are using p-values in their own work. I'm not going to dissuade them from relying on p < 0.05 simply by asserting that they shouldn't do it; after all they live in a world where statistical significance is rewarded. I've got to explain the logic behind it in as clear a manner as possible first, and that's going to mean skipping over some considerations like the ones you listed above.

        • The flaw with NHST has been explained clearly on this blog many times. It is rejecting strawman null hypothesis A then accepting substantive research hypothesis B.

          The p-value is just an intermediate calculation in this process that is compared to an arbitrary cutoff point. NHST does not require the p-value, any summary statistic can be used to perform this procedure.

        • But also you like to pick the low-hanging fruit (i.e. social psychology). There are many people using NHST who aren’t “accepting the alternative” when they get low p-values. For a lot of people in the social sciences the point estimate, standard error (and p-value) are just standard ways we describe our parameter estimates. In the same way Bayesians report the posterior mean and sd, etc. of parameter estimates.

        • But also you like to pick the low-hanging fruit (i.e. social psychology).

          Actually I most like to “pick on” medical research (both preclinical and clinical) because one day I hope to be able to work in that field again (when it becomes standard to take your job seriously), and could for the most part care less about social psych. Eg, from the various replication project results preclinical cancer research looks much worse than social psych. I also “picked on” the LIGO analysis since I do not see why there was so much focus on the null model while the other various alternative explanations got a couple sentences in the main paper. Basically anywhere you find NHST I will see the same issues.

          There are many people using NHST who aren’t “accepting the alternative” when they get low p-values.

          In the case where the null hypothesis is the default “no difference between groups” not predicted by any theory what do you learn from the p-value?

          For a lot of people in the social sciences the point estimate, standard error (and p-value) are just standard ways we describe our parameter estimates. In the same way Bayesians report the posterior mean and sd, etc. of parameter estimates.

          There is nothing wrong with parameter estimation, but what does the p-value add? For the usual t-test at least, it is just a non-linear transform of the info contained in the point estimate and standard error. It contains less information and is more abstracted from the data.
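
          For what it’s worth, here is that transform written out for a made-up estimate, standard error, and degrees of freedom:

          ```python
          # The two-sided t-test p-value as a deterministic function of
          # (estimate, SE, df); nothing beyond those three ingredients goes in.
          from scipy import stats

          estimate, se, df = 0.8, 0.35, 58   # hypothetical values
          t_stat = estimate / se
          p_value = 2 * stats.t.sf(abs(t_stat), df)
          print(t_stat, p_value)
          ```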

        • Yes, I agree. Again, I’m asking about how this ought to be communicated to certain audiences who don’t already understand it, in particular: a) members of the general public who read articles on fivethirtyeight, and b) those who currently use p-values in their research due to the fact that everyone around them also uses p-values, they were taught to use p-values, and they know that if they can get a p-value to fall below 0.05, they get a reward.

          This is the audience for the article that gave the sloppy but easy to understand explanation of what a p-value is.

        • That description is unintelligible for someone who doesn’t know statistics, and I don’t think it’s very interesting for someone who knows statistics either.

          The p-value only corresponds to a unique likelihood function when there is only one parameter and we’re doing a one-sided test. In general, multiple likelihood functions (i.e. multiple values of the parameter vector) can correspond to the same p-value.

          Even in the case where the bijective relationship holds, there are infinitely many alternative ways to define a one-to-one correspondence that could also act as a summary of the likelihood function (in the indexing sense) but lack the characteristics that make the p-value interesting: there is a meaning in the ordering of p-values and there is a meaning in the magnitude of p-values.

          (Apart from that, I find the inclusion of sample size in that definition somewhat artificial. The sample size is part of the model. The likelihood functions for models with different numbers of observations live in different spaces. Whatever indexing capability into likelihood functions is provided by p-values, it’s unrelated to sample size.)

        • The p-value only corresponds to a unique likelihood function when there is only one parameter and we’re doing a one-sided test. In general, multiple likelihood functions (i.e. multiple values of the parameter vector) can correspond to the same p-value.

          I am not sure how well this works for other situations, but I would agree that adding “for the most common use-case of comparing two groups” would be better at this time. I also think that use-case is sufficient for a quick lay explanation. Explaining the likelihood function doesn’t seem so hard for this case either. It tells you the most likely parameter value (eg effect size) as well as how uncertain you are about it. It shows which values are more or less likely than others.

          there is a meaning in the ordering of p-values and there is a meaning in the magnitude of p-values
          […]
          (Apart from that, I find the including of sample size in that definition somewhat artificial. The sample size is part of the model. The likelihood functions for models with different numbers of observations live in different spaces. Whatever indexing capability into likelihood functions is provided by p-values, it’s unrelated to sample size.)

          The indexing requires the sample size. If you know the sample size and p-value you can then use a lookup table to get the pre-computed likelihood function; imagine this as a bunch of charts at the end of a textbook. If you know only the p-value, you don’t know whether the location is far from zero and the likelihood is wide, or close to zero and the likelihood is narrow (this is the classic “statistical significance does not mean practical significance” issue that people discovered empirically).

          Note I am really just trying to explain what a p-value actually is here, not how people are trying (incorrectly) to use it. How about:

          “For the most common use-case (comparing two groups), the p-value and sample size index (are like the street name and house number for) a unique likelihood function. These likelihood functions are a way of seeing how relatively well various effect sizes would fit the data.”
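
          Here is a rough sketch of that “street name and house number” idea for the equal-n two-sample t-test, using a normal approximation to the likelihood for the standardized effect size; the function name and the numbers are my own, for illustration only:

          ```python
          import numpy as np
          from scipy import stats

          def effect_size_likelihood(p_value, n_per_group, grid=np.linspace(-2, 2, 401)):
              """Approximate likelihood curve for the standardized effect size implied
              by a two-sided p-value and the per-group n of a two-sample t-test."""
              df = 2 * n_per_group - 2
              t_obs = stats.t.ppf(1 - p_value / 2, df)   # |t| implied by the p-value
              se_d = np.sqrt(2 / n_per_group)            # approximate SE of the effect size
              d_hat = t_obs * se_d                       # implied point estimate
              lik = stats.norm.pdf(grid, loc=d_hat, scale=se_d)
              return d_hat, grid, lik / lik.max()

          # Same p-value, two sample sizes: the implied estimate (and the whole curve)
          # changes with n, which is why the p-value alone is not enough.
          for n in (20, 200):
              d_hat, grid, lik = effect_size_likelihood(0.04, n)
              print(f"n per group = {n:3d}: implied effect size = {d_hat:.2f}")
          ```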

        • You don’t need just the p-value and the sample size. You need also the model. I would say that the sample size can be considered part of the model. But I got confused with the functions living in different spaces for different sample sizes: the number of parameters doesn’t change. Anyway, my main point stands.

          You could also index the likelihood functions using a number constructed from the data by some ridiculous mechanism like intercalating digits. Let’s say you have three measurements x1=1.234, x2=56.7 and x3=89 and you produce the magic number 58169.2703004. You can produce such a magic number for any dataset you’ve got, and these magic numbers are an index for a unique (really unique in this case) likelihood function (the one corresponding to x1, x2, and x3). The values of x1, x2 and x3 can be trivially recovered from the “index” and the number of measurements used to create it; you don’t even need a lookup table!

          Do you really think that a p-value is not a bit more informative than this index of mine?

        • You don’t need just the p-value and the sample size. You need also the model.

          Yes, but for the common use case being covered the model will be the t-test, so it is “always” the same model (ignoring the equal sample size, etc variations).

          Do you really think that a p-value is not a bit more informative than this index of mine?

          I agree it is more informative. For a given p-value a larger sample size will narrow the likelihood, while a smaller one will widen it (we can also put this as decreasing/increasing uncertainty). For a given sample size a larger p-value will move it closer to zero, while a smaller one will move it farther away. What else is there beyond that?

        • For a given p-value, a larger sample size will narrow the likelihood *and* move it closer to zero. If one remembers the relation between tail area and p-value, it is easy to see why; it’s unfortunate that you don’t want to include that in the definition.

          I think we agree that the p-value is not just an index, it has some interesting properties of its own. But it is of course true that any sufficient statistic can be interpreted as an index into likelihood functions if one wishes to do so.

          For a given p-value, a larger sample size will narrow the likelihood *and* move it closer to zero. If one remembers the relation between tail area and p-value, it is easy to see why; it’s unfortunate that you don’t want to include that in the definition.

          Yes, sorry. I was thinking for a given effect size a larger sample will move the p-value closer to zero.

          I think we agree that the p-value is not just an index, it has some interesting properties of its own

          So then perhaps there should be a list that distinguishes a p-value from other sufficient statistics?

        • I’m not sure if that is a question, because I think you already know the answer. The thing distinguishing the p-value, the reason it has been used (and misused) for more than a century, is the relation with the tail area of the probability distribution / likelihood function mentioned before.

          Which also makes the sampling distribution uniform in [0 1] under the null hypothesis (at least in simple cases, leaving aside discrete distributions, composite hypothesis, etc).

          And while this is not unique to p-values it also has the continuity and monotonicity properties that you seem to appreciate, not every sufficient statistic has that (notice that the magic number I proposed above is a sufficient statistic).
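
          A quick simulation of that uniformity property, with the null true by construction (two-sample t-test; the sample sizes and seed are arbitrary):

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)
          p_values = [stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue
                      for _ in range(20_000)]

          # Each decile of [0, 1] should catch roughly 10% of the p-values.
          hist, _ = np.histogram(p_values, bins=10, range=(0, 1))
          print(hist / len(p_values))
          ```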

        • I’m not sure if that is a question, because I think you already know the answer. The thing distinguishing the p-value, the reason it has been used (and misused) for more than a century, is the relation with the tail area of the probability distribution / likelihood function mentioned before.

          Which also makes the sampling distribution uniform in [0 1] under the null hypothesis (at least in simple cases, leaving aside discrete distributions, composite hypothesis, etc).

          And while this is not unique to p-values it also has the continuity and monotonicity properties that you seem to appreciate, not every sufficient statistic has that (notice that the magic number I proposed above is a sufficient statistic).

          I meant as something to point the layperson to that explains why scientists care about it. This list seems to not be a solution, since even though I know what you are talking about I still don’t see it. Maybe if the “null hypothesis” == “research hypothesis”, but even then you will always have simplifications that render the tested hypothesis false…

        • Or how about getting rid of mentioning the likelihood function:
          “For the most common use-case (comparing two groups), the p-value and sample size are like the house number and street name for a curve (similar to the “bell curve”) that shows how relatively well various effect sizes would fit the data.”

      • I don’t think less than 20 words is sufficient. Here’s a 63 word attempt:

        “The p-value is sometimes described as “an index of surprise”: How surprising would these results be if you assumed your hypothesis were false? However, the p-value is calculated using various assumptions (called model assumptions) that are difficult (or often impossible) to verify in any given case. Thus the p-value is usually a very iffy thing to use to draw any convincing conclusion.”

        • My 63 word attempt would be better if punctuated as follows:

          The p-value is sometimes described as, “An index of surprise: How surprising would these results be if you assumed your hypothesis were false?”

          However, the p-value is calculated using various assumptions (called model assumptions) that are difficult (or often impossible) to verify in any given case. Thus the p-value is usually a very iffy thing to use to draw any convincing conclusion.

  5. Somewhat related to incredible science.

    Love this article’s logic ( https://www.hsph.harvard.edu/news/press-releases/recent-presidential-election-could-have-negative-impact-on-health/ ):

    1. Trump elected;
    2. Creates distress amongst some groups that *could* lead to “increased risk for disease, babies born too early, and premature death”
    3. Hence, clinicians should suggest “psychotherapy or medication”.

    Is all above possible? Yes

    Likely? Not in my opinion.

    Is there scientific evidence? I’d say published “evidence” consists mostly of uncontrolled studies serving as rhetorical devices for motivated reasoning.
