Here’s a quote:
Instead of focusing on theory, the focus is on asking and answering practical research questions.
It sounds eminently reasonable, yet in context I think it’s completely wrong.
I will explain. But first some background.
Junk science and statistics
They say that hard cases make bad law. But bad research can make good statistics. Or, to be more precise, discussion of bad research can lead to good statistical insights. During the past decade, examples of bad science such as beauty-and-sex-ratio, ESP, ovulation-and-clothing, etc., have made us more aware of the importance of type M and type S errors in understanding statistical claims, the importance of the garden of forking paths in understanding where statistically significant results are coming from, and the role of prior information in data analysis. (Yes, I’d written a whole book on Bayesian data analysis but I’d not realized the useful role that direct prior information can play in practical inference.) Or, to consider another theme of this blog: years of discussion of bad graphs made us aware of the different goals of statistical communication.
The general idea is that, when we see problems in statistical analysis and communication, we use the disconnect between observed practice and our ideals to gain insight into research goals. Theoretical statistics is the theory of applied statistics, and we can make progress by observing how statistics is actually applied.
We had an example recently, with two long discussions of the work of Brian Wansink, a Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years.”
It started with an experiment done by Wansink that he himself characterized as a “failed study which had null results”—but which he then published four different papers on, with each paper presenting the experiment not as a failure but as a success. Perhaps no one outside the world of food science would’ve heard about these papers had Wansink not boldly written about them on his blog, in a post where he openly advertises his p-hacking:
I [Wansink] had three ideas for potential Plan B, C, & D directions (since Plan A had failed). . . . Every day she [Wansink’s colleague] came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them.
I was curious and looked up the papers in question, and, indeed, they sliced and diced their data in different ways to come up with statistical significance. The data were all from the same experiment but different analyses used different data-exclusion rules and controlled for different variables.
Following up, Tim van der Zee, Jordan Anaya, and Nicholas Brown looked into those four papers in even more detail and found over 100 errors in there. Basically, just about none of the numbers made sense. Also, a blog commenter pointed out that Wansink had written two contradictory things about how his collaborator got involved in the project (see P.P.S. at my above-linked post).
So far, so typical. Low-quality research, noise mining, sloppiness, it’s Psych Science minus the psychology, or PPNAS minus the himmicanes. Run-of-the-mill, everyday, bread-and-butter junk science. PhD’s on the hamster wheel, going in circles, releasing publications and press releases and going on NPR and Ted, 9 to 5, Monday through Friday, until retirement. With all the errors and contradictory stories, this is maybe a bit worse than normal and it raises the question of whether there is any deliberate dishonesty going on, but the overall picture is of data being put into a meat grinder and being published as mass-produced hamburgers. Nothing interesting to report.
So far, the only really notable thing is Wansink’s openness about all of this. In the psychology department they know enough to realize that you’re not supposed to p-hack, that there is such a thing as research protocol, and that churning out papers is not supposed to be a goal in itself. Wansink’s overt description of his research process indicates that this understanding has not yet made it all the way to Cornell business school.
It’s a paradox. On one hand, Wansink would’ve been better off keeping his head down and not telling the world about his workflow; on the other hand, publicity is one of his legitimate goals. After all, if you’re doing food research and you think your research is high quality—if you think you actually are making discoveries—then you do want to publicize your findings, as they can make a difference in the world.
As the saying goes: You may not be interested in bad research, but bad research is interested in you.
Three statistical issues came out in our blog discussions. The first was that Wansink and his colleagues engaged in what are known as “questionable research practices,” which invalidate the statistical conclusions that got those articles published in peer-reviewed journals. The second was that they were, at best, extremely sloppy in publishing work with so many errors and contradictions. The third was that Wansink explicitly works with no substantive theory.
The first two problems are nothing new; they’re part of the standard playbook of Psychological Science or PPNAS-type research: hyped claims based on noisy data, messy data manipulation, and a general attitude that once a paper is published, it should be immune from criticism. The usual Ted-talk attitude.
The third item is interesting, though. Let’s again pass the mic over to Wansink:
Instead of focusing on theory, the focus is on asking and answering practical research questions.
That sounds good—who among us does not prefer empirics to theory?—but is missing a key step. You don’t just want to ask and answer questions, you also want those answers to be correct, to give insight, ultimately to give good predictions.
If you have no theory and the ability to produce noisy data, you can ask all the questions you want (ok, actually I have some doubts about the quality of the questions that will get asked in the absence of theory), and if you’re willing to sift through your data enough you can get “p less than .05” answers, but there’s no reason to expect these answers will be any more useful than what you’d get just by flipping coins.
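To see how easy this is, here’s a minimal simulation sketch. The particular analysis choices below (random data-exclusion rules applied over and over to the same null data) are invented for illustration; they stand in for the varying exclusions and control variables that let a single experiment support many analyses:

```python
# A minimal sketch of the garden of forking paths: how often does pure
# noise yield at least one "p < .05"? The twenty data-exclusion rules
# below are invented for illustration; they stand in for the many
# analysis choices available when the protocol isn't fixed in advance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, n_analyses = 2_000, 100, 20
hits = 0

for _ in range(n_sims):
    treated = rng.normal(size=n)  # no true effect: both groups are pure noise
    control = rng.normal(size=n)
    for _ in range(n_analyses):
        # Each "analysis" applies a different arbitrary exclusion rule
        # to the same experiment, then tests for a group difference.
        t_keep = treated[rng.random(n) < 0.7]
        c_keep = control[rng.random(n) < 0.7]
        if stats.ttest_ind(t_keep, c_keep).pvalue < 0.05:
            hits += 1
            break

print(f"Share of null experiments with at least one p < .05: {hits / n_sims:.0%}")
```

The nominal error rate is 5%; with twenty looks at the same null data, the chance of finding something “significant” somewhere is several times that, even though the looks are correlated.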
With noisy data, in the absence of theory, effect sizes will be low, and anything statistically significant is likely to be a huge overestimate of any effect and also likely to be in the wrong direction (that’s type M and type S errors).
Why am I so sure that effect sizes will be low in the absence of theory? Because there are just too many things to look at. Without theory (or effective intuition or heuristics, which are just informal versions of theory), you’re basically picking potential effects at random, and most potential effects are small.
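Here’s a small simulation of that point, in the spirit of the type M and type S calculations John Carlin and I have written about. The numbers (true effect 0.1, standard error 1) are made up to represent a small effect measured noisily:

```python
# Type M (magnitude) and type S (sign) errors when a small true effect
# is estimated with lots of noise. The effect size and standard error
# are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 0.1, 1.0

est = rng.normal(true_effect, se, size=1_000_000)  # unbiased but noisy estimates
sig = np.abs(est) > 1.96 * se                      # "statistically significant"

print(f"Power (share reaching p < .05):       {sig.mean():.1%}")
print(f"Type S error (wrong sign | signif.):  {(est[sig] < 0).mean():.1%}")
print(f"Type M (exaggeration factor):         {np.abs(est[sig]).mean() / true_effect:.0f}x")
```

In this regime the rare “significant” estimate exaggerates the true effect by a factor of twenty or so and points in the wrong direction more than a third of the time.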
Kurt Lewin wasn’t kidding when he said, “There’s nothing so practical as a good theory.”
OK, maybe he was kidding. I have no idea. I know nothing about Kurt Lewin. Perhaps I could ask Karl Weick to tell me some stories about Lewin, next time I’m in Ann Arbor.
As Emilio Estevez never said, I blame society. More specifically, I blame the statistics profession for contributing to the mistaken attitudes of people such as Wansink. For decades we’ve been telling people that statistics can reject the null hypothesis in the absence of substantive theory. So it makes sense that these dudes will believe us!
I remember in grad school our professor patiently explaining to us the magic of random assignment, that you can demonstrate the existence of a treatment effect, and accurately estimate its magnitude, without any substantive theory at all. What he didn’t tell us was that these methods fall apart when effect sizes are small and noise is high.
And, hey, what happens if you have no theory?
1. Effect sizes tend to be small. With no theory, the plan is to stumble onto effects, not to search them out.
2. Noise tends to be high. With no theory of measurement, it can be a challenge to measure well.
It’s worse than that, actually: in the presence of uncontrolled researcher degrees of freedom, where “p less than .05” is so easy to attain that a research team can produce four published papers from a single failed experiment, there’s not really any motivation to measure anything accurately. Indeed, in many ways, noisy measurements are a plus for an ambitious researcher: when standard errors are high, statistically significant results will automatically be large, and thus more dramatic: better headlines, more impressive graphs for your PPNAS papers and Ted talks, and so on.
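The arithmetic behind “automatically large” is simple: a two-sided 5% test only declares an estimate significant if it exceeds roughly 1.96 standard errors, so the noisier the study, the larger any significant estimate has to be. A toy calculation (the standard-error values are invented):

```python
# The smallest estimate that can clear "p < .05" (two-sided) is about
# 1.96 standard errors, so noisier studies report larger effects.
# Standard-error values are invented for illustration.
for se in (0.05, 0.5, 5.0):
    print(f"SE = {se:4}: any significant estimate exceeds {1.96 * se:.2f}")
```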
I’m not saying that anyone’s making their measurements extra-noisy on purpose; it’s just that the incentives favor noisy measurements.
As a wise economist once said, people don’t always respond to incentives, but responding to incentives is usually a lot easier than not responding to incentives.
In the garden
In comments, Thomas Basbøll discussed Wansink’s anti-theoretical attitude in the context of a beautiful Van Morrison song. (As an aside, Basbøll’s invocation of Morrison was a great move, because now when I write on this topic, I have that song running pleasantly in the background in my head.) Basbøll writes:
In my adaptation of Morrison’s slogan, I’ve only replaced the “guru” with “theory”. It reminds me of Bertrand Russell’s observation that sometimes a system of logical notation can bring insights as good as a live teacher. . . .
Academic knowledge is the sort of thing we can learn from others. That’s what makes an education something quite different than a spiritual journey. We’re not just supposed to find the answers within ourselves (though we may find many of them there while attending a university); we’re supposed to be brought up to speed about what the culture already knows.
A “scientific” discovery, likewise, is one we can teach to others; it is a “contribution” to others, especially other researchers. That’s why theory is so important. It’s what you are contributing a particular result to. In science, you can’t really claim to answer “important questions” instead of extending or testing a theory. It’s the theory that gives the question its importance.
For Wansink to present an “experimental” approach to economic behavior with no theory is as odd as if he proposed to conduct his experiments with no method.
Before concluding this post, I should emphasize that theory isn’t perfect. Some theories or frameworks are flat-out wrong; others are useless; others were once useful but are now played out. To get a sense of how theory can lead one astray, look at the career of sociologist Satoshi Kanazawa, famous for his indefatigable attempts to squeeze statistical blood out of the dry stone which is N=3000 sex-ratio data. His (and others’) misunderstanding of statistics led him to publish claims which were essentially pure noise, but his attachment to a particular theory has, I fear, kept him going, nourishing his confidence in settings where a better response would’ve been to quit.
So, sure, I’m aware that theory can only go so far, and we need to be open to the unexpected, to learning new things from data.
But, remember, we can best learn from the unexpected when we carefully specify what is the “expected” that the world deviates from.
In some way, this comes down to technical issues in statistical modeling. Here’s Wansink again:
With field studies, hypotheses usually don’t “come out” on the first data run. But instead of dropping the study, a person contributes more to science by figuring out when the hypo worked and when it didn’t. This is Plan B. Perhaps your hypo worked during lunches but not dinners, or with small groups but not large groups. . . .
I don’t actually object to any of this. But the way to study these interactions is not to sift through looking for “p less than .05”: that’s a procedure with poor frequency properties, as we say in statistics: it’s a way to produce overconfident overestimates (high type M and type S errors). Instead, I recommend multilevel modeling, partially pooling interactions toward zero. When effects are small and measurement error is high, as in Wansink’s experiment, just about everything will be pooled toward zero.
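To make “partially pooling interactions toward zero” concrete, here is a minimal sketch of the normal-normal shrinkage that multilevel modeling performs. The subgroup estimates, standard errors, and group-level scale are all invented; in a real analysis that scale would itself be estimated from the data by fitting the multilevel model (e.g., in Stan or lme4):

```python
# Minimal sketch of partial pooling: shrink noisy subgroup interaction
# estimates toward zero, more strongly the noisier they are. All numbers
# are invented for illustration.
import numpy as np

# Raw interaction estimates from four subgroup analyses
# (say, lunch vs. dinner, small vs. large groups) and their standard errors.
raw = np.array([0.8, -1.1, 0.3, 1.4])
se = np.array([0.5, 0.6, 0.4, 0.7])

tau = 0.2  # assumed group-level scale: interactions expected to be small

# Posterior mean under a normal(0, tau^2) prior: a precision-weighted
# compromise between the raw estimate and zero.
pooled = raw * tau**2 / (tau**2 + se**2)

for r, p in zip(raw, pooled):
    print(f"raw estimate {r:+.2f}  ->  partially pooled {p:+.2f}")
```

When the standard errors dwarf the plausible scale of the interactions, the shrinkage factor is close to zero and, as I said above, just about everything gets pooled toward zero.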
But that’s ok. At least, it’s ok if your goal is to learn about the world. It’s not so great if your goal is to produce a stream of publications claiming statistically significant discoveries.
P.S. Just to be clear: I’m not saying that all bad research points us to statistical insights. I don’t think we got anything useful at all from discussing that himmicanes paper, for example.
P.P.S. In my criticism of Wansink’s research, I’m not saying that he’s doing more harm than good from this work. I can see strong arguments in both directions, and this will be the subject of tomorrow’s post.