I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

This came up in the discussion of yesterday’s post.

We’ve discussed theory and measurement in this space before. And here’s a discussion of how the problems of selection bias are magnified when measurements are noisy.

Forking paths and p-hacking do play a role in this story: forking paths (multiple potential analyses on a given experiment) allow researchers to find apparent “statistical significance” in the presence of junk theory and junk data, and p-hacking (selection on multiple analyses on the same dataset) allows ambitious researchers to do this more effectively. P-hacking is turbocharged forking paths.
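
To make the forking-paths point concrete, here is a quick simulation sketch (the counts of outcomes, subgroup splits, and sample sizes are invented for illustration, and none of this is anyone’s actual data): even when the true effect is exactly zero, a dozen reasonable-looking analyses of the same experiment will turn up something “significant” much of the time.

    # Hypothetical sketch: forking paths applied to pure noise.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, n_outcomes, n_splits, n_sims = 50, 4, 3, 2000
    runs_with_a_hit = 0

    for _ in range(n_sims):
        found = False
        for _ in range(n_outcomes):
            treat = rng.normal(0, 1, n)   # no real treatment effect anywhere
            ctrl = rng.normal(0, 1, n)
            for _ in range(n_splits):     # each split stands in for one more way to slice the data
                keep = rng.random(n) < 0.5
                if stats.ttest_ind(treat[keep], ctrl[keep]).pvalue < 0.05:
                    found = True
        runs_with_a_hit += found

    print(f"experiments with at least one p < 0.05: {runs_with_a_hit / n_sims:.0%}")
    # With about a dozen looks per experiment, a large share of these
    # pure-noise experiments yield something "significant" to report.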

So, in the absence of forking paths and p-hacking, there’d be much more of an incentive to use stronger theories and better data. But I think the fundamental problem in work such as Wansink’s is that his noise is so much larger than his signal that essentially nothing can be learned from his data. And that’s a problem with lots and lots of studies in the human sciences.
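
And to see what noise much larger than signal does even to an honestly reported analysis, here is another small sketch (again with invented numbers: a tiny true effect, big person-to-person variation, thirty people per arm): the estimates that clear the significance bar are wildly exaggerated and sometimes have the wrong sign.

    # Hypothetical sketch: what "significant" estimates look like when noise swamps signal.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    true_effect, sd, n = 0.05, 1.0, 30    # assumed values, purely illustrative
    sig_estimates = []

    for _ in range(20000):
        treat = rng.normal(true_effect, sd, n)
        ctrl = rng.normal(0.0, sd, n)
        if stats.ttest_ind(treat, ctrl).pvalue < 0.05:
            sig_estimates.append(treat.mean() - ctrl.mean())

    sig = np.array(sig_estimates)
    print("share of studies reaching p < 0.05:", round(len(sig) / 20000, 3))
    print("average |estimate| among those:", round(float(np.abs(sig).mean()), 2))
    print("share of those with the wrong sign:", round(float((sig < 0).mean()), 2))
    # In this toy setup the "significant" estimates run more than ten times the
    # true effect, and over a quarter of them point the wrong way (type M and
    # type S errors).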

The forking paths and p-hacking are relevant only in the indirect sense that they explain how the food-behavior researchers (like the beauty-and-sex-ratio researchers, the ovulation-and-clothing researchers, the embodied-cognition researchers, the fat-arms-and-voting researchers, etc etc etc) managed to get apparent statistical success out of hopelessly noisy data.

So I hope the lesson that researchers will draw from Pizzagate is not “I should not p-hack,” but, rather, “I should think more seriously about theory, I should work hard to take better measurements and use within-person designs to study within-person effects, and, when forking paths are no longer available to me, I’ll need to get better measurements anyway, so I might as well start now.”

  1. What are the lessons that journal reviewers or editorial boards should draw? Do reviewers in the affected literatures share some culpability in this scandal? While I cannot know for sure, I have reason to believe that I have refereed at least one Wansink study submitted after the pizzagate scandal broke. These studies suffered from all of the problems described in this and related posts (e.g., clear errors in reporting statistics, lack of theory, studies not designed to precisely measure the outcome of interest, noisy data, multiple comparisons). Furthermore, it was completely obvious in each of the manuscripts that all of this was going on.

    Whether or not it was Wansink’s work is immaterial, and I have no desire to further impugn his reputation. The point is that despite the visibility of the (very valid) criticism of his work, none of my fellow reviewers caught the errors or asked the hard questions about methodology that they should have. In my (admittedly brief) experience, this omission of scrutiny in the review process happens in a lot of this nutrition/social science type work. What can be done to correct that? I am very early career so I really do not know.

      • Anon:

        Sure, but just about every paper gets published . . . somewhere. I assume the Cornell Food and Brand Lab had a system where they’d routinely send a paper to Journal A, then Journal B, etc. And, wherever it appeared, it was considered good enough to get government grants, corporate funding, and uncritical coverage in the news media.

      • Andrew:

        This raises an important question: how to react when you see dubious work get huge media attention and get published in a very high-profile tabloid journal, e.g. N**.

        Some journals (I don’t know how many, or in which fields) allow comments or letters in which you can point out the flaws in one of their published papers. But it’s said to be hard to get these accepted. You’re up against both the editor and the flawed paper’s authors, who are sent your critique, and the criteria are strict, e.g. that the flaw must be central to the paper’s main conclusions. The editor is not on your side, because he and the reviewers were the ones who accepted the flawed work, and he won’t want to admit making a mistake.

        Another possibility is to write up such a critique as a paper of its own and send it to a different journal. The question is how useful that is for the field and for the author of such a rebuttal. Will people see it? Will you get credit among your peers for it, or will you just make new enemies, i.e. the authors whose work you’re discrediting?

    • Okay, I was very confused…
      But now I think I know the source of my confusion.
      There are two posts with exactly the same very long title.
      Was that deliberate? It just confused me.

  2. “But I think the fundamental problem in work such as Wansink’s is that his noise is so much larger than his signal that essentially nothing can be learned from his data. And that’s a problem with lots and lots of studies in the human sciences.”

    You’re correct, but (unfortunately) people are using bad methods to figure out whether they have a signal above their noise. In other words, the vast majority of people doing this sort of bad science would agree that if noise swamps the thing they’re trying to measure, nothing can be learned from it. However, being uncomfortable with quantitative assessments, and lacking the intuition that can be built up by, for example, simulating data, they don’t have a sense of whether data are “too noisy.”

    So p-values get used, unfortunately, as the measure of whether there’s a “signal” there or not, i.e. as the way of informing oneself about whether one’s measurements are meaningful. If one couples this with a lack of understanding of what a p-value is, disaster ensues. (E.g.: if I split my data into twenty chunks and one of them has p<0.05, it tells me that there’s a signal in the data that my measurement has revealed!) Many times, I’ve seen people use the outcomes of statistical tests as the thing that convinces them that their signal is stronger than the noise, so asking that they first realize their system is noisy is simply not a possibility.
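
    To make the twenty-chunks disaster concrete, here is a tiny simulation sketch (toy numbers, not anyone’s actual data): split pure noise into twenty chunks and, more often than not, at least one chunk clears p<0.05 on its own.

      # Hypothetical sketch: the "one of my twenty chunks is significant" disaster.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)
      hits = 0
      for _ in range(2000):
          x = rng.normal(0, 1, 400)      # no signal at all
          chunks = np.split(x, 20)       # twenty chunks of 20 observations each
          pvals = [stats.ttest_1samp(c, 0.0).pvalue for c in chunks]
          hits += min(pvals) < 0.05
      print(f"noise-only datasets with a 'significant' chunk: {hits / 2000:.0%}")
      # Roughly 1 - 0.95**20, i.e. about 64%: the small p-value reveals the
      # analyst's search, not a signal the measurement has uncovered.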

    • Raghu:

      Yes, definitely. People take statistical significance as a signal that their standard errors are low enough not to matter, not fully realizing that what they’re implicitly doing is a really bad power analysis that uses a noisy and biased point estimate of effect size.
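
      Here is a rough sketch of that implicit power analysis, with made-up numbers (a tiny true standardized effect of 0.1 and thirty per group): plugging the selected-for-significance estimate into a standard power formula makes the design look many times stronger than it really is.

        # Hypothetical sketch: post-hoc "power" computed from noisy, significant estimates.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(4)
        true_d, n = 0.1, 30              # assumed true standardized effect and group size

        def approx_power(d, n, alpha=0.05):
            # normal-approximation power for a two-sample comparison, n per group
            z_crit = stats.norm.ppf(1 - alpha / 2)
            return 1 - stats.norm.cdf(z_crit - d * np.sqrt(n / 2))

        implied = []
        for _ in range(20000):
            a = rng.normal(true_d, 1, n)
            b = rng.normal(0, 1, n)
            if stats.ttest_ind(a, b).pvalue < 0.05:
                d_hat = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
                implied.append(approx_power(abs(d_hat), n))

        print("true power:", round(approx_power(true_d, n), 2))
        print("average 'power' implied by the significant estimates:", round(float(np.mean(implied)), 2))
        # The true power is barely above the 5% significance level, while the
        # significant estimates imply power roughly ten times that.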

  3. Yes, I really think more people should take a look at the data the lab has released.

    I understand field studies are “messy”, but when there are that many missing responses and impossible values you need to rethink how you are doing your field studies. In addition, it would be nice if the data didn’t contradict your description of your study.

    The Wansink story is really a holy trinity of issues. No theory, terrible data, and imaginary statistics.

    • I’d add a fourth issue. It’s this bad idea: http://www.butlerscientifics.com/seehowitworks – that scientific discovery can be automated. They took down the “10,000 correlations per minute!” ad, but the process they sell is still the same. When Andrew first posted on it (see especially Butler’s comments: http://statmodeling.stat.columbia.edu/2014/01/27/disappointed-results-boost-scientific-paper/ ) I was taking a MOOC on epidemiology and, with the free (unnamed stats software) license, playing with data from the Framingham Heart Study.

      Things were going fine until the lecturer said two things. The first was something to the effect of “this is what you dream of finding: a great big pile of data that you can build a whole career on.” The second, later on, was about how to turn the record of your data exploring on and off. Now, while he absolutely did not suggest that anyone ought to dredge until they find something, HARK, and then start recording, if I could put 2 and 2 together, surely every one of those bright young kids who were actually in the classroom understood exactly what was going on: turning scientific discovery into a point-and-click assembly-line process in which you never have to leave the comfort of your office. For me it was one of those epiphanies that permanently changed how I approach scientific claims, which is sad, actually.
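
      Here is a back-of-the-envelope sketch of why bulk correlation hunting is a bug rather than a feature (sizes invented for illustration): feed pure noise to an all-pairs correlation scan and roughly 5% of the pairs come out "significant" by construction.

        # Hypothetical sketch: automated correlation dredging on pure noise.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(5)
        n_subjects, n_vars = 100, 150                  # invented sizes
        data = rng.normal(size=(n_subjects, n_vars))   # no true relationships at all

        tests = hits = 0
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                r, p = stats.pearsonr(data[:, i], data[:, j])
                tests += 1
                hits += p < 0.05
        print(f"{tests} correlations tested, {hits} 'discoveries' ({hits / tests:.1%})")
        # About 11,000 pairwise correlations and around 550 spurious "findings",
        # each one a candidate point-and-click paper.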

  4. Okay, here’s the thing. Wansink’s studies DO sometimes start with a hypothesis. There was a Plan A, which failed in the pizza studies. The problem is that he then generated other hypotheses and acted as if those had been Plan A all along. But obviously you can’t have more than one Plan A.

    This happens all the time, but I’m not sure what the terminology for it is. It’s HARKing, but of a variety where you do start with a hypothesis, which you then readily abandon and pretend never existed.

  5. Biasing selection effects, researcher degrees of freedom, data dredging: all of these have been known for donkey’s years, and yet many of the new reformers and their opponents act as if these were brand-new norms forcing a revolution over a field like psych. I agree with Andrew’s pessimism. I’m afraid the reformers and replicationists in psych are failing to address the problems with their experiments and measurements (and all the presuppositions in thinking they are learning about the effects of interest). They could easily set out to falsify some of these presuppositions (with modified experiments, not more statistical tricks). Current ministrations are largely superficial; it’s as if the field is merely rechannelling the same perverse incentives.

    Registered reports are rightly aiming to prevent flexibility and dredging, but I don’t get the impression they are being evaluated stringently. Nor are authors required to say how they will use negative results to raise a deeper criticism of the very “treatments”, the measurements of effects, the validity of Likert scales, and the rest. Significance tests do just what they should: when an effect could only be found by engaging in QRPs, they give you a hard time when you try to replicate. But a bunch of published negative results does not a science make.

    • I’d guess the “strength” of a theory depends on the predictions you can derive from it.

      Worthless prediction: x is correlated with y
      Weak prediction: x is positively correlated with y
      Strong prediction: x is related to y by function f

      There is a whole range of possibilities between weak and strong. A prediction gets stronger as it becomes more precise, maxing out when it is an exact numerical value.

      For each theory you can have a number of predictions of various precisions. The overall “strength” of the theory would depend on the number and precision of these predictions. I’m sure the details of how to determine “strength” could be discussed quite a bit.

      Also, I’d treat how well the theory is supported by data as an entirely different issue. That will depend upon the accuracy of its predictions, the accuracy of competing theories’ predictions, and the quality of the data. This is an obvious application of Bayes’ rule.
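
      Here is a toy numerical sketch of that Bayes-rule point (all numbers invented): when the data land where a sharp prediction said they would, the sharp theory gains far more support than a vague "x is positively related to y" theory that spreads its probability over many possible slopes.

        # Hypothetical sketch: a precise prediction vs. a vague one, via marginal likelihoods.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(6)
        x = np.linspace(0, 1, 20)
        y = 2.0 * x + rng.normal(0, 0.3, x.size)   # data consistent with both theories

        def marginal_likelihood(slopes, sigma=0.3):
            # average the likelihood of y over the slopes a theory allows
            return np.mean([np.prod(stats.norm.pdf(y, b * x, sigma)) for b in slopes])

        sharp = marginal_likelihood([2.0])                     # "y = 2x", a precise prediction
        vague = marginal_likelihood(np.linspace(0.1, 10, 200)) # "some positive slope", imprecise
        print("Bayes factor, sharp theory vs vague theory:", round(sharp / vague, 1))
        # The sharp theory concentrates its probability where the data actually fell,
        # so it comes out much better supported; vagueness has a price.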

  6. Andrew G’s passage sums up what I have seen over the last 30 years in much empirical research. Specifically, in their PhDs and later work, social science researchers master some statistical software, learn how to enter data and get results, and then automatically presume that this approach, i.e. being “empirical” and running technical-looking statistical methods using highly developed software on fast computers, will virtually by itself produce reliable conclusions in whatever context, no matter how complicated. All the researcher needs to do (not that this is without work and lots of disappointments) is to get the output that “works” and then reproduce the tables of results and their p-levels in a paper that follows the usual script; that will itself ensure that new scientific knowledge has arrived and is reliable. The “empirical research process” is regarded as both necessary and sufficient, no matter how complicated the underlying causal world and how changeable or “human” its dynamics.
    Such automation and mechanisation of inference would not be assumed by the same researchers if they found themselves on jury duty. There they would concern themselves directly with what the evidence means or says, how strong it is, and whether there is enough of it to ever come to a clear conclusion in what might be a very complicated set of circumstances. Yet in “empirical research” no such discussion is entered into; the p-level and the underlying regression model, whatever its lack of theoretical derivation, are taken to excuse the researcher from ever wondering out loud whether the evidence, in the context, means much.
    I am not suggesting that empirical research in these fields is impossible. Rather, I am suggesting that some of the problems studied (e.g. whether a firm’s better financial disclosure methods impress the stock market so much that investors will offer the firm a lower cost of capital) are so complicated theoretically and so hard to test empirically that the philosophical issue has to be raised of whether any result in such a study can be given much evidential credence. I raise this particular research question (information and the cost of capital) because it is often described in the related literature as one of its most interesting and important questions, and also because, in my own survey of work on it and work built on it, I find that authors repeatedly describe the results of past empirical research as “mixed” (that word being the accepted euphemism for having reached no reliable conclusion).
    Put simply, when embarking on an empirical research study, there is no guarantee, no matter what empirical research protocols are in place and no matter what Bayesian or non-Bayesian statistical methods are applied, that a rigorous, or in any way reliable, outcome will ever be possible. That point is not usually made, at least not so directly, in PhD methods training in the social sciences. Rather, one empirical question is seen to be the same as another, each to be answered with the same satisfaction and reliability by the same methods and write-up.
