Pizzagate: The problem’s not with the multiple analyses, it’s with the selective reporting of results (and with low-quality measurements and lack of quality control all over, but that’s not the key part of the story)

“I don’t think I’ve ever done an interesting study where the data ‘came out’ the first time I looked at it.” — Brian Wansink

The funny thing is, I don’t think this quote is so bad. Nothing comes out right the first time for me either! World-renowned eating behavior expert Brian Wansink’s research has a lot more than 99 problems, but data exploration ain’t one.

Good research often involves the unexpected; indeed, that’s kinda why we do most of our research in the first place, because we don’t already know the answers. Like Brian Wansink, I gather and analyze data because I want to learn, not because I’m trying to prove something I already know.

I’m not here to praise Wansink’s research or his research methods: the data collection, processing, and analysis that he and his colleagues have done are a mess. To paraphrase a local airport, I think this is the most extraordinary collection of mishaps, of confusion, that has ever been gathered in the scientific literature – with the possible exception of when Richard Tol wrote alone.

But we should be careful not to criticize Wansink and his colleagues for the wrong reasons. Their problem was not that they wanted to use data to explore; their problem—beyond having no control over their data in the first place—was in not actually reporting what they had found. Their problem was not in running 400 mediation analyses to find an effect; their problem was in running 400 analyses and only reporting one of them.

No, researchers should not “set out to prove a specific hypothesis before a study begins”

The quote that leads off this post comes from a recent news article by Stephanie Lee. As an outsider in this saga, who has been appalled by Wansink’s practices, both in science and in responding (or not responding) to criticism, I very much appreciate the effort Lee put into documenting exactly what’s been going on in the Food and Brand Lab, interviewing researchers who were pressured to do unethical work, and so forth.

And it’s Lee’s excellent reporting that brings us to the topic of today’s post, which is my concern about criticizing bad work for the wrong reason. In her article, Lee wrote:

Ideally, statisticians say, researchers should set out to prove a specific hypothesis before a study begins.

I have no idea who these statisticians are.

I’m a statistician, and I disagree with the above quoted statement for two reasons:

1. I don’t think it’s generally good practice for a researcher to “set out to prove” anything. Once you start a project with the goal of proving something, you’ve already put a direction on your goals, and there’s a risk of closing your mind. So I’d rather say that researchers can set out to investigate a hypothesis, rather than saying they’re setting out to prove it.

2. Some of the best and most important research is done in a spirit of exploration. Statisticians are very much in favor of exploration; indeed, a classic book in our field is Exploratory Data Analysis by John Tukey, and exploration is a continuing subject of statistical thinking (for example, here’s a paper of mine from 2004, which I include not to say that it is definitive but just to indicate some of my own thinking on the topic).

There is an issue with work such as Wansink’s in which many different comparisons are considered and only the best results are published; but here I think the problem is not in the data exploration (that is, in considering many comparisons) but rather in the reporting. I’d have no problem if Wansink et al. were to perform a thousand analyses on their data, if they’d just report everything they’d done. Indeed, many, if not most, of the problems with Wansink’s work would have been immediately resolved had the raw data simply been archived and published from the start.
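To make this concrete, here is a minimal simulation sketch (in Python, with made-up numbers that have nothing to do with Wansink’s actual data): run a few hundred comparisons on pure noise, keep only the ones that reach p < 0.05, and you will always end up with a handful of “findings” to write up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_comparisons = 400   # e.g., 400 candidate analyses run on one dataset
n_per_group = 50      # observations per group in each comparison

significant = []
for i in range(n_comparisons):
    # Two groups drawn from the same distribution: every true effect is zero.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    t_stat, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        significant.append((i, a.mean() - b.mean(), p_value))

print(f"{len(significant)} of {n_comparisons} comparisons reached p < 0.05")
# Reporting only these roughly 5% of "discoveries" and hiding the other
# analyses makes pure noise look like a set of real effects; reporting all
# 400 comparisons (or, better, the raw data) makes the pattern obvious.
```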

I can understand how, after hearing about all the ways in which Wansink and his colleagues misrepresented their data collection and analysis procedures, it would be appealing to think that there is some sort of rigorous approach recommended by statisticians, and it would be appealing to be able to say that the problems with the work of Wansink et al. were caused by “p-hacking.”

In some ways, though, my message here is more encouraging, in that I’m not saying that “researchers should set out to prove a specific hypothesis before a study begins”; indeed I think that’s typically a terrible idea. Exploration is great. Now consider someone like Wansink whose strengths as a researcher are: (a) a focus on topics that people really care about, (b) clever research designs. His weaknesses come in the areas of: (c) quality of measurements, and (d) reporting of results. I think he should stick with (a) and (b), work on (c), and resolve the problems with (d) by just reporting all his data. No need for him to set out to prove any statistical hypothesis, no need for him to learn a bunch of new statistical methods, he can just do what he does best. And I think this holds more generally. Yes, statistical methods can be important, but in large part that’s by motivating the careful measurement that’s typically necessary in order to get data worth analyzing in the first place.

Connection to the larger replication crisis

Here I’m setting aside the ethical lapses and disorganization that have plagued the work of Wansink and his associates. From a statistical point of view, the focus on data and ethical problems has been a distraction from the larger issue of published papers that make strong claims not supported by the data, and the related issue of published findings not reappearing in outside replication studies.

What, then, is the connection between Wansink—who published papers based on data summaries that could not possibly have been correct, who repeatedly mischaracterized his data collection processes, misreported who did what on his papers, etc.—and the more run-of-the-mill examples of the replication crisis: papers where the data were accurately represented, and standard statistical procedures appeared to show strong results, which then did not, or could not be expected to, replicate?

The connection is that, in all these examples—the slop-fests coming out of the Cornell Food and Brand Lab, and the more careful papers reporting experiments or surveys with clear data trails—strong conclusions were extracted from subsets of the data. In each of these studies, there was a multiverse of possible analyses, but the reader of the published paper doesn’t get to see that multiverse, or anything like it. Instead, what the reader sees is the result of a statistically unsound procedure in which results are sifted based on their statistical significance. That’s a problem, and that’s what the pizzagate studies had in common with lots of studies in the medical and social sciences that were a lot more careful but still, ultimately, doomed.

So, if we want to do a “statisticians say” thing, I’d like to say:

Ideally, statisticians say, researchers should report all their comparisons of interest, as well as as much of their raw data as possible, rather than set out to prove a specific hypothesis before a study begins.

And none of this will help you if your measurements are too noisy. Wansink, like many other researchers, was able to avoid thinking about data-quality problems because he and his coauthors kept being able to extract statistically significant comparisons. Once you shut off that spigot, you have more of a motivation to get good measurements.
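Here is another small sketch of that spigot (again, invented numbers, not anyone’s real study): when the true effect is small and the measurements are noisy, the comparisons that do clear the significance bar will, on average, exaggerate the effect, and will sometimes get its sign wrong. The exaggeration is purely a selection effect; no individual analysis is miscomputed, it is the filtering that does the damage.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_effect = 0.1   # small real effect, in hypothetical units
sigma = 1.0         # noisy measurements
n = 25              # observations per group
n_sims = 10_000

significant_estimates = []
for _ in range(n_sims):
    treated = rng.normal(true_effect, sigma, n)
    control = rng.normal(0.0, sigma, n)
    t_stat, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        significant_estimates.append(treated.mean() - control.mean())

estimates = np.array(significant_estimates)
print(f"share of studies reaching p < 0.05: {len(estimates) / n_sims:.2f}")
print(f"mean estimate among those studies: {estimates.mean():.2f} "
      f"(true effect is {true_effect})")
print(f"share of those estimates with the wrong sign: {(estimates < 0).mean():.2f}")
```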

Ironically, “getting good measurements” seems like something that Brian Wansink might be good at. It’s just, as with many researchers, he was never told this was important. His teachers told him that what was important was getting p less than .05, and he dutifully did that. And then once he became the boss, others dutifully did it for him.

56 thoughts on “Pizzagate: The problem’s not with the multiple analyses, it’s with the selective reporting of results (and with low-quality measurements and lack of quality control all over, but that’s not the key part of the story)”

  1. “1. I don’t think it’s generally good practice for a researcher to ‘set out to prove’ anything. Once you start a project with the goal of proving something, you’ve already put a direction on your goals, and there’s a risk of closing your mind. So I’d rather say that researchers can set out to investigate a hypothesis, rather than saying they’re setting out to prove it.” I have a question if you please, but if you want you can skip to the end to read it…

    Could it be that Lee was using “prove” in the sense of surviving falsification? It is customary for people to disavow Karl Popper when pressed (because Popper was wrong,) but it seems to me in practice most people adhere religiously to the crudest forms of Popperism…as when a hypothesis seems to survive *the* null hypothesis, then the hypothesis has passed the first proof. “Proof” is an old synonym of “test,” after all. Since nothing is ever really proven in Popperism (the notorious problem of induction, and all that,) then it is not clear when, how or if propositions with the right p-values should be distinguished from propositions non-scientific thinkers naively call “facts.” There are huge numbers of scientists who will resolutely deny any relevance in science to the idea of certainty. And to be honest I’m not real clear on how, or whether, they incorporate statistical notions of confidence into their general views.

    Personally, I tend to think of science as the systematic investigation of the way things are, using the various tools and procedures that people’s experience has shown to be relatively reliable, not least by multiple perspectives over time. Part of finding those tools and procedures seems to me to involve proving something is so, but maybe I misunderstand. But to me, the idea that you’re trying to prove how things are, as opposed to disproving challenges to your hypothesis in pursuit of an undefined goal (when uncertainty becomes negligible, if that’s definable at all), draws attention to the bigger picture. The thing that stands out the most for me is that hypothesis testing aims at proving which of alternative hypotheses about the way things are is correct. In this view, “the” null hypothesis is the probability of the alternative hypotheses, rather than random outcomes, even if Fisher spins in his grave. (I had thought Bayes’ theorem might be relevant here, but apparently I don’t understand Bayesian inference one little bit.)

    So, although it’s clear you don’t think one should try to prove a hypothesis, do you think one should try to disprove a hypothesis?

    • Steven:

      Yes, what you say makes a lot of sense to me, indeed I’ll dub your comment the Comment of the Year (so far) in that it directly expands upon the key point of my post.

      Setting out to refute a hypothesis does seem like good scientific practice. Even when the goal is pure exploration, I think that exploring data and models is most effective when done from a starting point. And the better the model we can reject, the more we can learn.

      Still, though, I want to emphasize that the goal of data collection should typically be to learn new things. I’ll assume that, by “prove,” Lee meant “explore the implications of” or “put to the test,” not “confirm.”

      P.S. I’m a big Popper fan, and by “Popper,” I mean “Lakatos”; see here.

      • “All scientific research programmes may be characterized by their ‘hard core’. The negative heuristic of the programme forbids us to direct the modus tollens at this ‘hard core’. Instead, we must use our ingenuity to articulate or even invent ‘auxiliary hypotheses’, which form a protective belt around this core, and we must redirect the modus tollens to these. It is this protective belt of auxiliary hypotheses which has to bear the brunt of tests and get adjusted and re-adjusted, or even completely replaced, to defend the thus-hardened core. A research programme is successful if all this leads to a progressive problemshift; unsuccessful if it leads to a degenerating problemshift.” Imre Lakatos, The Methodology of Scientific Research Programmes (at pg 55 of this pdf: http://strangebeautiful.com/other-texts/lakatos-meth-sci-research-phil-papers-1.pdf)

        He continues a couple of pages later with this:

        “Few theoretical scientists engaged in a research programme pay undue attention to ‘refutations’. They have a long-term research policy which anticipates these refutations. This research policy, or order of research, is set out – in more or less detail – in the positive heuristic of the research programme. The negative heuristic specifies the ‘hard core’ of the programme which is ‘irrefutable’ by the methodological decision of its proponents; the positive heuristic consists of a partially articulated set of suggestions or hints on how to change, develop the ‘refutable variants’ of the research-programme, how to modify, sophisticate, the ‘refutable’ protective belt.

        The positive heuristic of the programme saves the scientist from becoming confused by the ocean of anomalies. The positive heuristic sets out a programme which lists a chain of ever more complicated models simulating reality: the scientist’s attention is riveted on building his models following instructions which are laid down in the positive part of his programme. He ignores the actual counterexamples, the available data… [the rest of the paragraph is a debatable discussion of Newton’s research programme].

        Our considerations show that the positive heuristic forges ahead with almost complete disregard of ‘refutations’: it may seem that it is the verifications, rather than the refutations which provide the contact points with reality … it is the ‘verifications’ which keep the programme going, recalcitrant instances notwithstanding.”

        So again my question: in what way did Wansink’s research programme differ from that described by Lakatos? It can hardly be disputed that Wansink spent years creating a protective belt to defend his “hard core” hypothesis that food consumption is readily modified by environmental cues, that he never submitted any strong form of that core hypothesis to the modus tollens, that he kept his focus riveted on finding ever stronger belts, and that he somehow avoided being confused by the ocean of non-significant associations that Stata spit out after each slice-and-dice pizza party. You never know, sometimes testing a core hypothesis such as immunotherapy effectiveness being a function of mutation load can lead to really good things: https://www.nytimes.com/2018/02/19/health/ovarian-cancer-immunotherapy.html

        • Thanatos:

          I don’t think the problems with Wansink’s research programme were philosophical. I think the problems were: (a) his theories were weak, (b) his data were of low quality, (c) he had no control over his data (over and over, he seemed to be surprised by very basic features of what was being measured and who was in his experiments), and (d) his writeups presented incomplete and misleading information.

          Details matter. It’s hard to do science with weak theories and poor data, especially when you don’t even accurately report the data you have.

        • Unless of course your philosophy is: “Science is the process of guessing what might be true, predicting what the outcomes of experiments should be based on those guesses, and then subjecting those predictions to verification against actual carefully collected data about the world”.

          In which case Wansink’s program had a 100% philosophical problem, in that it simply wasn’t science from the beginning, along with a vast sea of other similar research throughout academia today.

          The problem in Academia today is that the philosophy of science seems to be: “Science is whatever stuff people who call themselves scientists do these days” which is not particularly philosophically sound (in the sense of being based on any useful basic principle in any way).

    • It’s a different situation if you are doing totally exploratory work, as opposed to hoping to answer a particular question. For example, a particular thing you’d like to know might be “Does this measles vaccine prevent measles?” By contrast, you might ask “Does eating pizza late at night have some kind of good or bad health outcomes?”

      For a specific question, what you can do is see whether the data – all the data you have – could plausibly have come about if your hypothesis is correct. And, as Mayo keeps writing, it’s important to know if the experiment has the capability for testing this strongly, as opposed to weakly. It’s weak if many other, substantially different, hypotheses could also have plausibly led to the results.

      • For example, a particular thing you’d like to know might be “Does this measles vaccine prevent measles?” By contrast, you might ask “Does eating pizza late at night have some kind of good or bad health outcomes?”

        I don’t think anyone should ever care about the answers to those questions (I can answer for much cheaper too, both are “sometimes”). Instead:

        1) “To what extent does this vaccine prevent measles under circumstances x, y, z?”
        2) “What is the relationship between eating pizza late at night and health outcomes a, b, c under circumstances x, y, z?”

        • Actually, I think people *do* care about exactly this kind of question. You have rephrased them because you have learned that they can’t be answered without qualification – at least, by any one or even a series of measurements. So you want to be cautious. But, to give a personal example, as a parent whose child was given polio vaccine, I didn’t think about qualifications, but only whether my child was going to be protected.

          So although I phrased my questions in an over-simplified way to make for an easily-readable post, I do think that they actually represent what most people want to know. And yes, I realize that those people won’t be able to get everything they might want.

        • But, to give a personal example, as a parent whose child was given polio vaccine, I didn’t think about qualifications, but only whether my child was going to be protected.

          This is a totally different question than “does this vaccine prevent polio?” Also, you should be concerned about the risks/costs of the vaccine, chance of exposure to polio, etc. In the case of polio I know it has been proposed that cases of “vaccine-induced, and non-polio acute flaccid paralysis” rise in a vaccinated population.

          http://pediatrics.aappublications.org/content/135/Supplement_1/S16.2
          https://www.ncbi.nlm.nih.gov/pubmed/14768785

    • Other issues aside, I am bothered by the use of the word “proof” in science. We never “prove” anything in science, instead we may “show” or “demonstrate” or other equivalents to some degree of confidence. “Proof” applies to mathematical certainty, which we never have when the things we’re looking at are measured with some degree of uncertainty.

      • I used to think like this. But now I think we math types may be hijacking the word “proof”. e.g. In law there’s been the idea of “proof beyond reasonable doubt” for ages and ages.

        • Euclid’s Elements dates from ca. 300 BC and had a fully developed concept of proof, whereas the Twelve Tables of Roman law were ca. 449 BC (https://en.wikipedia.org/wiki/Twelve_Tables). However, the Twelve Tables don’t seem to give any meaningful guidance on proof of guilt etc. They do let you call witnesses by “loud calls before the doorway” of the witness, however only as frequently as every 3 days… Also “If any person has sung or composed against another person a SONG (carmen) such as was causing slander or insult…. he shall be clubbed to death.”

          so there’s that.

          I suspect the “proof beyond a reasonable doubt” type stuff is *much much* more recent. For example, perhaps the English Common Law of the middle ages: https://en.wikipedia.org/wiki/English_law#Common_law

          So, I conclude on this weak basis that mathematical proof most likely has priority.

        • Okay, Daniel, but I think the issue here is semantics — as Clark says, ‘the use of the word “proof” ‘ — not the concept. If I had university access to the Oxford Dictionary online, I’d check the etymology and earliest usage of the English word “proof”. I’d also be interested to know what the actual word(s) used for proof are in the Elements and the Twelve Tables and other (non-English) texts.

        • OED seems to think that Proof came ultimately from Latin through French: https://en.oxforddictionaries.com/definition/proof

          and that current concepts of the word proof are late 19th century.

          I don’t disagree that “proof” can have meanings that are less absolute and mathematical, but I honestly think that the appeal of using the word is the pun: it provides an air of sophistication. It’s similar to “statistically significant” in that people like it because it automatically gives the impression that you’ve found something “significant.”

        • From translations as presented online, it looks like Euclid didn’t use a word for “proof”. It looks like he had a word for “proposition”, but then just went straight into the proof without a break for a new heading.

          I wonder what the earliest mathematical work to use words equivalent to “theorem”, “proof”, etc. was.

          I’m more comfortable with legal “proof”, where it’s clear this is not scientific, than with “clinically proven”, which I don’t like at all.

  2. “Ideally, statisticians say, researchers should set out to prove a specific hypothesis before a study begins.”

    I wonder if this statement comes from a crude (mis)understanding of power analysis and study design?

  3. “Ideally, statisticians say, researchers should set out to prove a specific hypothesis before a study begins.”

    I don’t understand your criticism of this statement. I think what Lee is referring to is the commonly advised practice of research preregistration, namely releasing a data analysis plan prior to collecting data. The Association for Psychological Science claims “a thorough preregistration promotes transparency and openness and protects researchers from suspicions of p-hacking”.
    https://www.psychologicalscience.org/publications/psychological_science/preregistration

    • Al:

      I’ve done a few hundred research projects over the years. Not once have I set out to prove a hypothesis. My studies are informed by hypotheses—it’s not like I’m collecting data on just whatever—but the purpose of gathering the data is to learn something new, not to prove some pre-existing idea.

      • “I’ve done a few hundred research projects over the years. Not once have I set out to prove a hypothesis. My studies are informed by hypotheses—it’s not like I’m collecting data on just whatever—but the purpose of gathering the data is to learn something new, not to prove some pre-existing idea.”

        If grasshopper does not seek, so may he find
        If grasshopper knows nothing, so may he be wise

        You have only ever done exploratory analyses? You have never been in a situation like this?:

        Whales sometimes strand themselves. Some folks have noticed that strandings seem to be more common after the Navy has big sonar tests.

        You have never studied something with a specific cause and effect? If you don’t start with the hypothesis “sonar causes strandings” your data collection will not be focused. How can you apply your statement that “the purpose of gathering the data is to learn something new, not to prove some pre-existing idea” to this situation? You can’t just go out and look at stranded whales. You have to focus on the ones near sonar, and compare the condition of their ears to other stranded whales. That is not “being informed” by an hypothesis, that is trying to prove or disprove a hypothesis.

        My specific point is this: a lot of science involves both a cause and an effect of interest, not just an effect of interest. So your statement that I quoted at the top is only reasonable in certain situations, despite your implications to the contrary. Or what am I missing?

        • Matt:

          1. My published papers are here. If you see any in which it seems that I set out to prove a hypothesis, please let me know. It could well be that I’m forgetting something.

          2. I agree that causal inference is important, and I’ve done various research projects involving causal inference. In none of these did I set out to prove a hypothesis.

          3. Your sonar example seems consistent with what I wrote earlier. In that example, the goal is not to prove that sonar causes stranding, it’s to estimate the effects of sonar on strandings. I agree completely that a lot of science involves both a cause and an effect of interest; see this paper. Causal inference is super important and in my experience it’s best done in a spirit of exploration, not in the spirit of trying to prove a hypothesis.

        • I agree, Anon. The widespread confusion between scientific hypotheses or questions and statistical hypotheses has badly affected the literature, statistical or otherwise. As has the myth/bad advice that one must fix alpha and then report results with the dichotomized “significant/non-significant” terminology.

          Whether research is “exploratory” or “confirmatory” (another popular but confusion-generating and unnecessary dichotomy), almost always the primary interest is in assessing the sign and magnitude of effects or correlations that have disciplinary import.

        • I read the paper you linked to in 3. but did not inventory the list of papers in 1. The following refers to the Reverse Causal paper:

          You wrote:

          “A reverse causal question does not in general have a well-defined answer, even in a setting where all possible data are made available.”

          This is not true in a general sense. In my experience, many reverse causal questions do indeed have a well-defined answer.

          It is true that every question you consider in the paper has no well-defined answer. Cancer clusters in particular are a good example because the root cause of any single incidence is unknowable. All of the other questions you consider, political motivations and such, have at least some element that is not fully resolvable by data collection, so maybe you really have never encountered a question that can be resolved.

          Plenty of reverse causal questions can be satisfactorily resolved with focused data collection. The question about the whales can be resolved. An autopsy can clearly show whether or not rupturing has occurred in the ear. In many fields of enquiry, if the basic reverse-causal resolution process is followed, the result is often conclusive. The process is as follows:

          1. Brainstorm a hierarchical cause map (I think you would call this a hierarchical model) that shows every plausible cause that anyone in the field can think of. The lowest/highest level of the hierarchy consists of hypotheses that are specific enough to be testable.
          2. Create and execute a test plan (what you call data collection) to show supporting or refuting evidence for each plausible cause.

          If one potential cause is strongly supported by evidence and all others are strongly refuted by evidence, there is a robust conclusion. It is easy to envision this happening with the whales and in fact it did happen with whales and sonar.

          This point aside, I did find the paper quite interesting. You have clearly thought a lot about this – and in a more formal way than I have – and for what it is worth, I agree with everything else you wrote.

    • I think the point is there isn’t a binary opposition of “ideal” way to conduct research vs. other ways that are flawed. Pre-registration is useful for some studies and less useful for others.

  4. Having been trained as a scientist first, I don’t think Wansink’s problem was really statistical*; he’s also a terrible scientist. I don’t think I’ve ever directly worked with someone who would fool themselves like this in the pursuit of a hypothesis _and_ fail to make an alternative test…

    *I mean sure, statisticians have thoroughly demonstrated how his type misuse the tools, but they misuse lots of other tools too, like logic, spreadsheets, and data.

    • After reading some of Wansink’s work and that of his students and watching some of their videos, what I saw was something I have observed quite commonly and may have even believed at one time and that is:

      Any set of events submitted to treatment and control that produces statistically significant results should be taken seriously and should be properly placed within the body of literature about similar events and theories about them.

      One can rightly call this naive, but until more recently it was a fairly widespread shared belief within certain domains of the social sciences.

  5. “Ideally, statisticians say, researchers should set out to prove a specific hypothesis before a study begins.”

    Others have already guessed where they think this comes from.

    I think it may come from statisticians insisting that there must be a null hypothesis.

    • A statistician who insisted on this is terribly blameworthy, but not as much as you suggest.
      R: “So, I need a clear hypothesis before the study?”
      S: “No, no, no, your hypothesis can be fuzzy as heck. You need a very precise and SPECIFIC ‘anti-hypothesis’ – something that would probably be inconsistent with your vague theorizing, but which is precisely and exactingly specified in all its details, and we will try to disprove that. In doing so, we prove your hypothesis – whenever you actually settle on one.”
      R: “That’s terribly confusing; I think you are asking me to find a specific hypothesis.”
      S: “Groan. You can’t be listening. No. If anything, I’m urging the opposite. But go on, if that’s what you think I said, just do it and don’t bother me again.”

      • Nice, but you forgot some mention of “null hypothesis is innocent until proven guilty, like in a court”. Also, there is usually bizarre nitpicking about “proving” vs “failing to reject” vs “accepting” a hypothesis that sounds like Popper0.

        Like I was saying in the other thread, this gibberish is what has filled the gap left by the removal of philosophy of science from the modern research curriculum.

        • Anoneuoid, what’s your criticism of the distinction between “failing to reject” and “accepting” a null hypothesis?

          I ask because I can’t count the number of times I’ve read or heard p > 0.05 interpreted as “there was no effect / difference / correlation / etc”. This is a serious mistake that has serious consequences. We can argue that NHST shouldn’t be used, or that its use should be severely narrowed. But so long as it remains common practice for researchers to crank out p-values every time they want to present “statistical evidence”, I don’t think it’s nitpicking to tell them that they don’t get to claim evidence for “no effect” simply on account of a p-value being bigger than the magic number.
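          Here is a quick simulation of the kind of thing I mean (all numbers invented, just for illustration): give a small study a real, nonzero effect and it will fail to reach p < 0.05 most of the time, so reading each of those results as “no effect” would be wrong every single time.

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)

          # A real (nonzero) effect studied with a small sample; all numbers hypothetical.
          true_effect, sigma, n, n_sims = 0.3, 1.0, 20, 10_000

          nonsignificant = 0
          for _ in range(n_sims):
              treated = rng.normal(true_effect, sigma, n)
              control = rng.normal(0.0, sigma, n)
              _, p_value = stats.ttest_ind(treated, control)
              if p_value > 0.05:
                  nonsignificant += 1

          print(f"true effect = {true_effect}, yet p > 0.05 in "
                f"{nonsignificant / n_sims:.0%} of the simulated studies")
          # "Failing to reject" the null here is routine even though the null is
          # false; it is not the same thing as accepting "no effect".
          ```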

        • I can’t count the number of times I’ve read or heard p > 0.05 interpreted as “there was no effect / difference / correlation / etc”. This is a serious mistake that has serious consequences.

          I propose the principle that “everything is related to everything else”, ie in the absence of some theory predicting otherwise (eg two events are in different lightcones), always assume there is a correlation/effect/difference.

          Therefore when they designed a study to check for a correlation/effect/difference, the mistake has already been made. Instead the study should have been designed to compare data to some prediction or to simply measure/estimate/describe something carefully.

          We can argue that NHST shouldn’t be used, or that its use should be severely narrowed.

          Some people may also use “NHST” to refer to cases when the research hypothesis corresponds to the null hypothesis. I consider that a completely different scenario that is probably ok. Otherwise I see no use case other than to subsequently commit any number of logical fallacies.

          But so long as it remains common practice for researchers to crank out p-values every time they want to present “statistical evidence”, I don’t think it’s nitpicking to tell them that they don’t get to claim evidence for “no effect” simply on account of a p-value being bigger than the magic number.

          It is like hiking with someone trying to head East but instead they lead you North all day, then they trip over a root and you criticize them for slowing you down. That is just a distraction from the main problem.

      • It would seem that S. is thinking about statistical hypotheses and R. is thinking about scientific hypotheses. The literature is full of people conflating these two quite disparate concepts.

  6. On-topic (I think) and too delicious [sic] to pass up: yesterday Malte Elson on twitter (@maltoesermalte) posted a brief timeline of the replication crisis in psychology, culminating with

    “2018: The Twitter account of a cookbook lectures one of the most prolific social scientists on research methods.”

    and then a link to a tweetstorm from “The Joy of Cooking” dismantling a 2009 Wansink paper in Annals of Internal Medicine, “The Joy of Cooking Too Much: 70 Years of Calorie Increases in Classic Recipes” :

    https://twitter.com/TheJoyofCooking/status/968560552438988800

    • Mark:

      Have you ever seen the recipe for bagels in the Joy of Cooking? I haven’t checked it recently, but last time I looked it was hilarious. You’re thinking it’s gonna say: 4 cups flour, 2 cups water, 1 teaspoon yeast, etc. . . . No! The recipe’s something like this: Start with 6 bagels. Slice them in half . . . Wha…?

      • Lol! That’s good. I looked for it in the edition we have – copyright 1964 – but bagels don’t appear in the index or TOC. I wonder when it got inserted?

        A pity bagels didn’t appear in all the editions, or we could ask whether the revised recipe (=joke) that halved the calories per serving should have been in his otherwise cherry-picked subset of recipes.

        • Mark:

          Yes, the bagel recipe was added in the new edition (from the 80s, maybe?). They were updating by adding bagels, which I guess had become generally popular in the U.S., but they must have decided that their readership was not so interested in actually baking them.

          Speaking of cherry-picking, I tried the cherry pie recipe in Joy of Cooking but it didn’t come out that well. That said, I’ve never been able to make a successful cherry pie with just the right level of consistency of the cherries. The crust is no problem; it’s the filling that’s a challenge.

          To get slightly more serious for a moment: there have been trends in recipes, and I’m sure people have studied that. Just to speak anecdotally, I remember reading an old recipe book from the 60s, and everything had added ingredients. Even rice! The recipe was not just to put rice and water on the stove and cook till done. I think they had salt and butter as part of the recipe. And I have memories of frustration from long ago, of trying to use a fork to chase down slippery grains of rice on my plate. For some reason, nobody thought of just cooking the damn stuff and serving it straight.

        • “I remember reading an old recipe book from the 60s”

          Oh, my, I guess I really am old.

          To me, “old recipe books” are my mother’s from the early 40’s (where they talk about substitutions for hard-to-get ingredients like sugar), or some handed down from my grandmother, that say things like, “A lump of butter the size of an egg,” or “a handful of flour”, or (from the beginning of the instructions for fruitcake) “chop the salt pork finely, then pour the hot coffee over it.”

  7. P-hacking is like sports doping. Wansink was caught, but he’s not that special. I’m sure it’s widespread. More importantly, the problem of multiple comparisons, garden of forking paths, specification searches (Ed Leamer 1978), is just not sinking in. We’ve known this for at least 40 years! The problem is incentives. Publish or perish.

    • Jack:

      I think the emphasis on p-hacking misses the point. Wansink’s big problem was not p-hacking, his big problem was weak theory and noisy data.

      I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

    • Jack:

      One more way p-hacking is like sports doping: it’s really hard to be competitive in a lot of fields without it, even if it’s tarnishing the field! I recall Andrew disagreed with some description of p-hacking being like research on steroids, because usually people on steroids get really good results. I think whoever made the comment did mean that the researchers get good results…like tenure, but not like positive advances to their field.

      Andrew, I also agree that it’s a bit dangerous that a lot of people are taking the lesson that you should never play around with your data. The lesson should be that we’re sitting in a bad place if we get a dataset, and if our first question of it doesn’t play out, we need to throw the full data set away. What if Wansink’s dataset had cost $10 million? When the standard set of tools we have (p-values) tells us “you get one shot per dataset collected and if you do it wrong, get a new dataset,” of course people are going to start lying about what their first question was. They can’t afford not to!

      • @Anonymous: yes, the analogy to winning is publications, grants, tenure, TED talks, going on talk shows… NOT making *true* discoveries. (In the same way, an Olympic record set while doped is not a true record.)

        Some have suggested that grants should only be awarded to obtain (generate) datasets, which then automatically become public. The grantwriters would have a head start (incentive), but everyone can use the data.

  8. I recently came across this article and it reminded me of this post, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w

    Maybe it’s a matter of semantics, but I can’t say I completely agree with the article or your post. I don’t think there’s necessarily anything wrong with having hypotheses; the problem lies in either not knowing when to abandon the hypothesis or falling into the trap of doing cookie-cutter science (this latter point seems to be what you’re saying). For example, in my experience in the life sciences the PI usually knows what result they want, and when an experiment doesn’t “work” it’s the fault of the grad student / postdoc. This results in biologists performing experiments until they get the results their boss wanted (or sometimes fabricating the results their boss wanted), and then when they report their results they don’t mention all the times the experiments didn’t work out. I’ve always considered this the main source of unreproducibility in biology.

    To me the mark of a good scientist is knowing when it’s time to abandon the project and work on something else. At least in biology this is not an easy decision to make. For example, let’s say you have a drug you think selectively kills cancer cells, but in your first experiment all the cells died (even the normal ones). Obviously you should repeat the experiment with a lower dose, but if you gave up after the first experiment you would have potentially missed out on an important result.

    Another characteristic I look for in good scientists is being able to notice something unusual, or question the cookie-cutter procedures. Here’s an example from my field. In small RNA-Seq data it was common practice to just throw out reads below a certain length. If you were a grad student at the time and questioned what happens if you don’t do that you could have identified some non-canonical small RNAs.

    I don’t think it’s accurate to say that Wansink didn’t have hypotheses and just did data exploration. He had an entire “Mindless Eating” theory he continually tried to support with his experiments. Remember, in the pizza experiment he had a “Plan A”. The problem was once the data contradicted his hypothesis (and previous published papers) he still wanted to get some papers out of the data so he then turned to HARKing (and didn’t publish his data so people couldn’t notice the data contradicted his previous research).

    Anyways, going back to whether it’s good to have a hypothesis or not. I think it all comes down to balancing 1. trying to get the data to fit your hypothesis, 2. when to alter/abandon your hypothesis, and 3. exploring new lines of research based on things you noticed during your analyses. If you have a strong prior that your hypothesis is correct I think it makes sense to spend a lot of time on 1. It also depends on how important the discovery would be. If it will get you a Nobel Prize then that could be another reason to spend more time on 1. If you don’t really have a good reason for your hypothesis or it isn’t even that interesting then that’s when you should lean more readily to 2 and 3.

      • Maybe we are thinking about different situations. For the typical researcher I’d agree with a lot of what you say about not setting out to prove an hypothesis, but there are a lot of famous cases where researchers were so confident in their hypothesis that they set out upon massive multi-year experiments to prove it, for example showing that Huntington’s disease was caused by a single gene. I’m all for data exploration, but at some point once there are strong theories then there’s value in setting out to prove them.
