John Bohannon’s chocolate-and-weight-loss hoax study actually understates the problems with standard p-value scientific practice

Several people pointed me to this awesome story by John Bohannon:

“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. . . .

My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

How did the study go?

5 men and 11 women showed up, aged 19 to 67. . . . After a round of questionnaires and blood tests to ensure that no one had eating disorders, diabetes, or other illnesses that might endanger them, Frank randomly assigned the subjects to one of three diet groups. One group followed a low-carbohydrate diet. Another followed the same low-carb diet plus a daily 1.5 oz. bar of dark chocolate. And the rest, a control group, were instructed to make no changes to their current diet. They weighed themselves each morning for 21 days, and the study finished with a final round of questionnaires and blood tests.

A sample size of 16 might seem pretty low to you, but remember this, from a couple of years ago in Psychological Science:

[Image: three screenshots from Psychological Science]

So, yeah, these small-N studies are a thing. Bohannon writes, “And almost no one takes studies with fewer than 30 subjects seriously anymore. Editors of reputable journals reject them out of hand before sending them to peer reviewers.” Tell that to Psychological Science!

Bohannon continues:

Onneken then turned to his friend Alex Droste-Haars, a financial analyst, to crunch the numbers. One beer-fueled weekend later and… jackpot! Both of the treatment groups lost about 5 pounds over the course of the study, while the control group’s average body weight fluctuated up and down around zero. But the people on the low-carb diet plus chocolate? They lost weight 10 percent faster. Not only was that difference statistically significant, but the chocolate group had better cholesterol readings and higher scores on the well-being survey.

To me, the conclusion is obvious: Beer has a positive effect on scientific progress! They just need to run an experiment with a no-beer control group, and . . .

Ok, you get the point. But a crappy study is not enough. All sorts of crappy work is done all the time but doesn’t make it into the news. So Bohannon did more:

I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. . . .

The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.

Take a look at the press release I cooked up. It has everything. In reporter lingo: a sexy lede, a clear nut graf, some punchy quotes, and a kicker. And there’s no need to even read the scientific paper because the key details are already boiled down. I took special care to keep it accurate. Rather than tricking journalists, the goal was to lure them with a completely typical press release about a research paper.

It’s even worse than Bohannon says!

I think Bohannon’s stunt is just great and is a wonderful jab at the Ted-talkin, tabloid-runnin statistical significance culture that is associated so much with science today.

My only statistical comment is that Bohannon actually understates the way in which statistical significance can be found via the garden of forking paths.

Bohannon’s understatement comes in a few ways:

1. He writes:

If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. . . .

P(winning) = 1 – (1-p)^n [or, as Ed Wegman would say, 1 – (1-p)*n — ed.]

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05.

That’s all fine, but actually it’s much worse than that, because researchers can, and do, also look at subgroups and interactions. 18 measurements corresponds to a lot more than 18 possible tests! I say this because I can already see a researcher saying, “No, we only looked at one outcome variable so this couldn’t happen to us.” But that would be mistaken. As Daryl Bem demonstrated oh-so-eloquently, many, many possible comparisons can come from a single outcome.
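Here is a quick check of the formula quoted above, as a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true. The "four looks per measurement" scenario is a made-up illustration of subgroups and interactions, not something taken from Bohannon's study.

```python
def prob_at_least_one_significant(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The 18 measurements from the quoted passage, each tested once.
print(round(prob_at_least_one_significant(18), 3))  # 0.603

# Hypothetical: each measurement is also examined in a few subgroups,
# say 4 looks per measurement (an assumption for illustration only).
print(prob_at_least_one_significant(18 * 4) > 0.97)  # True
```

Real comparisons on the same subjects are correlated, so the independence assumption is only a rough guide, but the direction of the effect is the same: more possible comparisons, more chances to find something.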

2. Bohannon then writes:

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works”.

Sure, but it’s not just that. As Eric Loken and I discussed in our recent article, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Even if a researcher only performs a single comparison on his or her data and thus did not do any “fishing” or “fiddling” at all, the garden of forking paths is still a problem, because the particular data analysis that was chosen is typically informed by the data. That is, a researcher will, after looking at the data, choose data-exclusion rules and a data analysis. A unique analysis is done for these data, but the analysis depends on those data. Mathematically this of course is very similar to performing a lot of tests and selecting the ones with good p-values, but it can feel very different.

I always worry when people write about p-hacking that they mislead by giving the wrong impression that, if a researcher performs only one analysis on his or her data, all is ok.
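To see how a single, data-dependent analysis can mislead, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the outcome is pure noise, there are two hypothetical subgroups, the variance is treated as known (so a z-test is used), and the "researcher" reports only the one comparison that the data make look most interesting.

```python
import random
from statistics import NormalDist


def z_stat(a, b):
    """Two-sample z statistic, assuming known unit variance (a simplification)."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    return diff / (1 / len(a) + 1 / len(b)) ** 0.5


def one_study(rng, n_per_cell=10):
    # Pure noise: the treatment has no effect in any subgroup.
    cells = {(g, s): [rng.gauss(0, 1) for _ in range(n_per_cell)]
             for g in ("treat", "ctrl") for s in ("m", "f")}
    z_men = z_stat(cells["treat", "m"], cells["ctrl", "m"])
    z_women = z_stat(cells["treat", "f"], cells["ctrl", "f"])
    z_all = z_stat(cells["treat", "m"] + cells["treat", "f"],
                   cells["ctrl", "m"] + cells["ctrl", "f"])
    # Only one test is reported, but which one depends on the data.
    z_reported = max((z_men, z_women, z_all), key=abs)
    return abs(z_all), abs(z_reported)


def false_positive_rates(n_sims=20000, seed=1):
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)
    results = [one_study(rng) for _ in range(n_sims)]
    prespecified = sum(z > crit for z, _ in results) / n_sims
    data_dependent = sum(z > crit for _, z in results) / n_sims
    return prespecified, data_dependent


prespecified, data_dependent = false_positive_rates()
print(prespecified, data_dependent)
```

Under these assumptions the prespecified overall comparison rejects at about the nominal 5% rate, while the data-dependent "single" comparison rejects noticeably more often, even though only one p-value ever appears in the write-up.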

3. Bohannon notes in passing that he excluded one person from his study, and elsewhere he notes that researchers “drop ‘outlier’ data points” in their quest for scientific discovery. But I think he could’ve emphasized this a bit more, that researcher-degrees-of-freedom is not just about running lots of tests on your data, it’s also about the flexibility in rules for what data to exclude and how to code your responses. (Marc Hauser is an extreme case here but even with simple survey responses there are coding issues in the very very common setting that a numerical outcome is dichotomized.)

4. Finally, Bohannon is, I think, a bit too optimistic when he writes:

Luckily, scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.

I agree that p-values are generally a bad idea. But I think the real problem is with null hypothesis significance testing more generally, the idea that the goal of science is to find “true positives.”

In the real world, effects of interest are generally not true or false, it’s not so simple. Chocolate does have effects, and of course chocolate in our diet is paired with sugar and can also be a substitute for other desserts, etc etc etc. So, yes, I do think chocolate will have effects on weight. The effects will be positive for some people and negative for others, they’ll vary in their magnitude and they’ll vary situationally. If you try to nail this down as a “true” or “false” claim, you’re already going down the wrong road, and I don’t see it as a solution to replace p-values by confidence intervals or Bayes factors or whatever. I think we just have to get off this particular bus entirely. We need to embrace variation and accept uncertainty.

Again, just to be clear, I think Bohannon’s story is great, and I’m not trying to be picky here. Rather, I want to support what he did by putting it in a larger statistical perspective.

> Huffington Post, Wall Street Journal, New York Times

David Christopher Bell goes to the trouble (link from Palko) to explain why “Every Map of ‘The Most Popular _________ by State’ Is Bullshit.”

As long as enterprising P.R. firms are willing to supply unsourced data, lazy journalists (or whatever you call these people) will promote it.

We saw this a few years ago in a Wall Street Journal article by Robert Frank (not the academic economist of the same name) that purported to give news on the political attitudes of the super-rich but really was just credulously reporting unsubstantiated statements from some consulting company.

And of course we saw this a couple years ago when New York Times columnist David Brooks promoted some fake statistics on ethnicity and high school achievement.

I get it: journalism is hard work, and sometimes a reporter or columnist will take a little break and just report a press release or promote the claims of some political ideologue. It happens. But I don’t have to like it.

What’s the worst joke you’ve ever heard?

When I say worst, I mean worst. A joke with no redeeming qualities.

Here’s my contender, from the book “1000 Knock-Knock Jokes for Kids”:

– Knock Knock.
– Who’s there?
– Ann
– Ann who?
– An apple fell on my head.

There’s something beautiful about this one. It’s the clerihew of jokes. Zero cleverness. It lacks any sense of inevitability, in that any sentence whatsoever could work here, as long as it begins with the word “An.”

Stock, flow, and two smoking regressions


In a comment on our recent discussion of stock and flow, Tom Fiddaman writes:

Here’s an egregious example of statistical stock-flow confusion that got published.

Fiddaman is pointing to a post of his from 2011 discussing a paper that “examines the relationship between CO2 concentration and flooding in the US, and finds no significant impact.”

Here’s the title and abstract of the paper in question:

Has the magnitude of floods across the USA changed with global CO2 levels?

R. M. Hirsch & K. R. Ryberg


Statistical relationships between annual floods at 200 long-term (85–127 years of record) streamgauges in the coterminous United States and the global mean carbon dioxide concentration (GMCO2) record are explored. The streamgauge locations are limited to those with little or no regulation or urban development. The coterminous US is divided into four large regions and stationary bootstrapping is used to evaluate if the patterns of these statistical associations are significantly different from what would be expected under the null hypothesis that flood magnitudes are independent of GMCO2. In none of the four regions defined in this study is there strong statistical evidence for flood magnitudes increasing with increasing GMCO2. One region, the southwest, showed a statistically significant negative relationship between GMCO2 and flood magnitudes. The statistical methods applied compensate both for the inter-site correlation of flood magnitudes and the shorter-term (up to a few decades) serial correlation of floods.

And here’s Fiddaman’s takedown:

There are several serious problems here.

First, it ignores bathtub dynamics. The authors describe causality from CO2 -> energy balance -> temperature & precipitation -> flooding. But they regress:

ln(peak streamflow) = beta0 + beta1 × global mean CO2 + error

That alone is a fatal gaffe, because temperature and precipitation depend on the integration of the global energy balance. Integration renders simple pattern matching of cause and effect invalid. For example, if A influences B, with B as the integral of A, and A grows linearly with time, B will grow quadratically with time.
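The bathtub point in the quoted passage can be checked numerically. This is a minimal sketch with made-up numbers: A is a hypothetical flow that grows linearly in time, and B is its running sum (a discrete stand-in for the integral).

```python
def integrate(flow, dt=1.0):
    """Accumulate a flow into a stock with a simple running sum."""
    stock, total = [], 0.0
    for f in flow:
        total += f * dt
        stock.append(total)
    return stock


years = list(range(1, 101))
A = [0.5 * t for t in years]   # hypothetical flow, linear in time
B = integrate(A)               # stock: running integral of A

# Constant second differences are the signature of quadratic growth.
second_diffs = [B[i + 1] - 2 * B[i] + B[i - 1] for i in range(1, len(B) - 1)]
print(set(second_diffs))       # {0.5}

# The ratio B/A is not constant, so no fixed slope links the two levels.
print(B[0] / A[0], B[-1] / A[-1])  # 1.0 50.5
```

So a regression of the level of B on the level of A is fitting a straight line to a relationship that is not a straight line, even in this noiseless toy case.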

This sort of thing comes up a lot in political science, where the right thing to do is not so clear. For example, suppose we’re comparing economic outcomes under Democratic and Republican presidents. The standard thing to look at is economic growth. But maybe it is changes in growth that should matter? As Jim Campbell points out, if you run a regression using economic growth as an outcome, you’re implicitly assuming that these effects on growth persist indefinitely, and that’s a strong assumption.

Anyway, back to Fiddaman’s critique of that climate-change regression:

The situation is actually worse than that for climate, because the system is not first order; you need at least a second-order model to do a decent job of approximating the global dynamics, and much higher order models to even think about simulating regional effects. At the very least, the authors might have explored the usual approach of taking first differences to undo the integration, though it seems likely that the data are too noisy for this to reveal much.
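Here is a small sketch of the first-differences idea mentioned in the quoted passage, and of why noise can swamp it. All the numbers are hypothetical: a linearly growing flow, a stock that accumulates it, and independent measurement noise added to the observed stock.

```python
import random


def first_differences(series):
    """Differences between consecutive values; undoes a running sum."""
    return [b - a for a, b in zip(series, series[1:])]


def correlation(x, y):
    """Pearson correlation, written out to keep the sketch self-contained."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
flow = [0.1 * t for t in range(1, 201)]   # hypothetical driver
stock, total = [], 0.0
for f in flow:
    total += f
    stock.append(total)

noise_sd = 50.0                            # hypothetical measurement noise
observed = [s + rng.gauss(0, noise_sd) for s in stock]

clean = first_differences(stock)           # recovers the flow
noisy = first_differences(observed)        # flow plus differenced noise

print(round(correlation(clean, flow[1:]), 3))  # 1.0
print(round(correlation(noisy, flow[1:]), 3))
```

Differencing recovers the flow exactly when the stock is measured without error. With noise, each difference carries two noise terms, so in this made-up setting the recovered series is only weakly related to the true flow.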

Second, it ignores a lot of other influences. The global energy balance, temperature and precipitation are influenced by a lot of natural and anthropogenic forcings in addition to CO2. Aerosols are particularly problematic since they offset the warming effect of CO2 and influence cloud formation directly. Since data for total GHG loads (CO2eq), total forcing and temperature, which are more proximate in the causal chain to precipitation, are readily available, using CO2 alone seems like willful ignorance. The authors also discuss issues “downstream” in the causal chain, with difficult-to-assess changes due to human disturbance of watersheds; while these seem plausible (not my area), they are not a good argument for the use of CO2. The authors also test other factors by including oscillatory climate indices, the AMO, PDO and ENSO, but these don’t address the problem either. . . .

I’ll skip a bit, but there’s one more point I wanted to pick up on:

Fourth, the treatment of nonlinearity and distributions is a bit fishy. The relationship between CO2 and forcing is logarithmic, which is captured in the regression equation, but I’m surprised that there aren’t other important nonlinearities or nonnormalities. Isn’t flooding heavy-tailed, for example? I’d like to see just a bit more physics in the model to handle such issues.

If there’s a monotonic pattern, it should show up even if the functional form is wrong. But in this case Fiddaman has a point, in that the paper he’s criticizing makes a big deal about not finding a pattern, in which case, yes, using a less efficient model could be a problem.

Similarly with this point:

Fifth, I question the approach of estimating each watershed individually, then examining the distribution of results. The signal to noise ratio on any individual watershed is probably pretty horrible, so one ought to be able to do a lot better with some spatial pooling of the betas (which would also help with issue three above).
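As a rough illustration of why pooling helps, here is a sketch of the simplest possible version: an inverse-variance weighted average of per-site slopes. This is not the paper's method or Fiddaman's proposal. It assumes a single common slope and independent sites, and the numbers are invented. The abstract quoted above says inter-site correlation matters, so the standard error here is optimistic.

```python
import random


def pooled_estimate(betas, ses):
    """Inverse-variance weighted mean of per-site slopes and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    estimate = sum(w * b for w, b in zip(weights, betas)) / total
    return estimate, total ** -0.5


rng = random.Random(0)
true_beta, site_se, n_sites = 0.2, 1.0, 200   # all hypothetical values

# Each site's estimate is dominated by noise (se is five times the signal).
betas = [rng.gauss(true_beta, site_se) for _ in range(n_sites)]

estimate, se = pooled_estimate(betas, [site_se] * n_sites)
print(round(se, 4))        # 0.0707
print(round(estimate, 3))
```

With 200 sites the pooled standard error is about 14 times smaller than any single site's, which is the sense in which one "ought to be able to do a lot better." A multilevel model with spatial structure would be the more careful version of this idea.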

Fiddaman concludes:

I think that it’s actually interesting to hold your nose and use linear regression as a simple screening tool, in spite of violated assumptions. If a relationship is strong, you may still find it. If you don’t find it, that may not tell you much, other than that you need better methods. The authors seem to hold to this philosophy in the conclusion, though it doesn’t come across that way in the abstract.

An inundation of significance tests

Jan Vanhove writes:

The last three research papers I’ve read contained 51, 49 and 70 significance tests (counting conservatively), and to the extent that I’m able to see the forest for the trees, mostly poorly motivated ones.

I wonder what the motivation behind this deluge of tests is.
Is it wanton obfuscation (seems unlikely), a legalistic conception of what research papers are (i.e. ‘don’t blame us, we’ve run that test, too!’) or something else?

Perhaps you know of some interesting paper that discusses this phenomenon? Or whether it has an established name?
It’s not primarily the multiple comparisons problem but more the inundation aspect I’m interested in here.

He also links to this post of his on the topic. Just a quick comment on his post: he is trying to estimate a treatment effect via a before-after comparison; he’s plotting y-x vs. x and running into a big regression-to-the-mean pattern:

[Figure: Vanhove’s plot of the before-after change against x, showing regression to the mean]

Actually he’s plotting y/x not y-x but that’s irrelevant for the present discussion.

Anyway, I think he should have a treatment and a control group and plot y vs. x (or, in this case, log y vs. log x) with separate lines for the two groups: the difference between the lines represents the treatment effect.

I don’t have an example with his data but here’s the general idea:

[Figure: sketch of y vs. x with separate lines for the treatment and control groups]
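The same idea can be sketched in simulation. This is not Vanhove's data. It is a made-up setup in which each person has a stable underlying level, the before and after measurements each add independent noise, and the treatment shifts the after measurement by a fixed hypothetical amount.

```python
import random


def ols(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def simulate(n=2000, effect=0.5, seed=0):
    rng = random.Random(seed)
    data = {"control": ([], []), "treated": ([], [])}
    for i in range(n):
        group = "treated" if i % 2 else "control"   # alternate for balance
        level = rng.gauss(0, 1)                      # stable underlying level
        x = level + rng.gauss(0, 1)                  # before measurement
        y = level + rng.gauss(0, 1) + (effect if group == "treated" else 0.0)
        data[group][0].append(x)
        data[group][1].append(y)
    return data


data = simulate()
xc, yc = data["control"]

# Change score vs. baseline: negative slope even with no treatment at all.
_, slope_change = ols(xc, [y - x for x, y in zip(xc, yc)])

# y vs. x with a separate line per group: the gap between lines is the effect.
(a_c, b_c), (a_t, b_t) = ols(xc, yc), ols(*data["treated"])
gap_at_zero = a_t - a_c

print(round(slope_change, 2), round(gap_at_zero, 2))
```

In the control group the change score slopes downward against the baseline purely because of measurement noise, which is the regression-to-the-mean pattern. Fitting y on x separately by group, the vertical gap between the two lines comes out close to the hypothetical effect that was built in.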

Back to the original question: I think it’s good to display more rather than less but I agree with Vanhove that if you want to display more, just display raw data. Or, if you want to show a bunch of comparisons, please structure them in a reasonable way and display as a readable grid. All these p-values in the text, they’re just a mess.

Thinking about this from a historical perspective, I feel (or, at least, hope) that null hypothesis significance tests (whether expressed using p-values, Bayes factors, or any other approach) are on their way out. But, until they go away, we may be seeing more and more of them leading to the final flame-out.

In the immortal words of Jim Thompson, it’s always lightest just before the dark.

On deck this week

Mon: An inundation of significance tests

Tues: Stock, flow, and two smoking regressions

Wed: What’s the worst joke you’ve ever heard?

Thurs: > Huffington Post, Wall Street Journal, New York Times

Fri: Measurement is part of design

Sat: “17 Baby Names You Didn’t Know Were Totally Made Up”

Sun: What to do to train to apply statistical models to political science and public policy issues

Chess + statistics + plagiarism, again!


In response to this post (in which I noted that the Elo chess rating system is a static model which, paradoxically, is used for the purpose of studying changes), Keith Knight writes:

It’s notable that Glickman’s work is related to some research by Harry Joe at UBC, which in turn was inspired by data provided by Nathan Divinsky who was (wait for it) a co-author of one of your favourite plagiarists, Raymond Keene.

In the 1980s, Keene and Divinsky wrote a book, Warriors of the Mind, which included an all-time ranking of the greatest chess players – it was actually Harry Joe who did the modeling and analysis although Keene and Divinsky didn’t really give him credit for it. (Divinsky was a very colourful character – he owned a very popular restaurant in Vancouver and was briefly married to future Canadian Prime Minister Kim Campbell. Certainly not your typical Math professor!)

I wonder what Chrissy would think of this?

Knight continues:

And speaking of plagiarism, check out the two attached papers. Somewhat amusingly (and to their credit), the plagiarized version actually cites the original paper!

[Two screenshots: the original paper and the plagiarized version]

“Double Blind,” indeed!

Kaiser’s beef


The Numbersense guy writes in:

Have you seen this?

It has one of your pet peeves… let’s draw some data-driven line in the categorical variable and show significance.

To make it worse, he adds a final paragraph saying essentially this is just a silly exercise that I hastily put together and don’t take it seriously!

Kaiser was pointing me to a news article by economist Justin Wolfers, entitled “Fewer Women Run Big Companies Than Men Named John.”

Here’s what I wrote back to Kaiser:

I took a look and it doesn’t seem so bad. Basically the sex difference is so huge that it can be dramatized in this clever way. So I’m not quite sure what you dislike about it.

Kaiser explained:

Here’s my beef with it…

Just to make up some numbers. Let’s say there are 500 male CEOs and 25 female CEOs so the aggregate index is 20.

Instead of reporting that number, they reduce the count of male CEOs while keeping the females fixed. So let’s say 200 of those male CEOs are named Richard, William, John, and whatever the 4th name is. So they now report an index of 200/25 = 8.

Problem 1 is that this only “works” if they cherry pick the top male names, probably the 4 most common names from the period where most CEOs are born. As he admitted at the end, this index is not robust as names change in popularity over time. Kind of like that economist who said that anyone whose surname begins with A-N has a better chance of winning the Nobel Prize (or some such thing).

Problem 2: we may need an experiment to discover which of the following two statements is more effective/persuasive:

a) there are 20 male CEOs for every female CEO in America
b) there are 8 male CEOs named Richard, William, John and David for every female CEO in America

For me, I think b) is more complex to understand and in fact the magnitude of the issue has been artificially reduced by restricting to 4 names!

How about that?

I replied that I agree that the picking-names approach destroys much of the quantitative comparison. Still, I think the point here is that the differences are so huge that this doesn’t matter. It’s a dramatic comparison. The relevant point, perhaps, is that these ratios shouldn’t be used as any sort of “index” for comparisons between scenarios. If Wolfers just wants to present the story as a way of dramatizing the underrepresentation of women, that works. But it would not be correct to use this to compare representation of women in different fields or in different eras.

I wonder if the problem is that econ has these gimmicky measures, for example the cost-of-living index constructed using the price of the Big Mac, etc. I don’t know why, but these sorts of gimmicks seem to have some sort of appeal.

John Lott as possible template for future career of “Bruno” LaCour

The recent story about the retracted paper on political persuasion reminded me of the last time that a politically loaded survey was discredited because the researcher couldn’t come up with the data.

I’m referring to John Lott, the “economist, political commentator, and gun rights advocate” (in the words of Wikipedia) who is perhaps better known on the internet by the name of Mary Rosh, an alter ego he created to respond to negative comments (among other things, Lott used the Rosh handle to refer to himself as “the best professor I ever had”).

Again from Wikipedia:

Lott claimed to have undertaken a national survey of 2,424 respondents in 1997, the results of which were the source for claims he had made beginning in 1997. However, in 2000 Lott was unable to produce the data, or any records showing that the survey had been undertaken. He said the 1997 hard drive crash that had affected several projects with co-authors had destroyed his survey data set, the original tally sheets had been abandoned with other personal property in his move from Chicago to Yale, and he could not recall the names of any of the students who he said had worked on it. . . .

On the other hand, Lott has continued to insist that the survey actually happened. So he shares that with Michael LaCour, the coauthor of the recently retracted political science paper.

I have nothing particularly new to say about either case, but I was thinking that some enterprising reporter might call up Lott and see what he thinks about all this.

Also, Lott’s career offers some clues as to what might happen next to LaCour. Lott’s academic career dissipated and now he seems to spend his time running an organization called the Crime Prevention Research Center which is staffed by conservative scholars, so I guess he pays the bills by raising funds for this group.

One could imagine LaCour doing something similar—but he got caught with data problems before receiving his UCLA social science PhD, so his academic credentials aren’t so strong. But, speaking more generally, given that it appears that respected scholars (and, I suppose, funders, but I can’t be so sure of that as I don’t see a list of funders on the website) are willing to work with Lott, despite the credibility questions surrounding his research, I suppose that the same could occur with LaCour. Perhaps, like Lott, he has the right mixture of ability, brazenness, and political commitment to have a successful career in advocacy.

The above might all seem like unseemly speculation—and maybe it is—but this sort of thing is important. Social science isn’t just about the research (or, in this case, the false claims masquerading as research); it’s also about the social and political networks that promote the work.

Creativity is the ability to see relationships where none exist

Brent Goldfarb and Andrew King, in a paper to appear in Strategic Management Journal, write:

In a recent issue of this journal, Bettis (2012) reports a conversation with a graduate student who forthrightly announced that he had been trained by faculty to “search for asterisks”. The student explained that he sifted through large databases for statistically significant results and “[w]hen such models were found, he helped his mentors propose theories and hypotheses on the basis of which the ‘asterisks’ might be explained” (p. 109). Such an approach, Bettis notes, is an excellent way to find seemingly meaningful patterns in random data. He expresses concern that these practices are common, but notes that unfortunately “we simply do not have any baseline data on how big or small are the problems” (Bettis, 2012: p. 112).

In this article, we [Goldfarb and King] address the need for empirical evidence . . . in research on strategic management. . . .

Bettis (2012) reports that computer power now allows researchers to sift repeatedly through data in search of patterns. Such specification searches can greatly increase the probability of finding an apparently meaningful relationship in random data. . . . just by trying four functional forms for X, a researcher can increase the chance of a false positive from one in twenty to about one in six. . . .

Simmons et al. (2011) contend that some authors also push almost significant results over thresholds by removing or gathering more data, by dropping experimental conditions, by adding covariates to specified models, and so on.

And, beyond this, there’s the garden of forking paths: even if a researcher performs only one analysis of a given dataset, the multiplicity of choices involved in data coding and analysis is such that we can typically assume that different comparisons would have been studied had the data been different. That is, you can have misleading p-values without any cheating or “fishing” or “hacking” going on.

Goldfarb and King continue:

When evidence is uncertain, a single example is often considered representative of the whole (Tversky & Kahneman, 1973). Such inference is incorrect, however, if selection occurs on significant results. In fact, if “significant” results are more likely to be published, coefficient estimates will inflate the true magnitude of the studied effect — particularly if a low powered test has been used (Stanley, 2005).

They conducted a study of “estimates reported in 300 published articles in a random stratified sample from five top outlets for research on strategic management . . . [and] 60 additional proposals submitted to three prestigious strategy conferences.”

And here’s what they find:

We estimate that between 24% and 40% of published findings based on “statistically significant” (i.e. p<0.05) coefficients could not be distinguished from the Null if the tests were repeated once. Our best guess is that for about 70% of non-confirmed results, the coefficient should be interpreted to be zero. For the remaining 30%, the true B is not zero, but insufficient test power prevents an immediate replication of a significant finding. We also calculate that the magnitude of coefficient estimates of most true effects are inflated by 13%.

I’m surprised their estimated exaggeration factor is only 13%; I’d have expected much higher, even if only conditioning on “true” effects (however that is defined).
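To give a feel for how the exaggeration factor depends on power, here is a minimal simulation sketch. It is not Goldfarb and King's calculation. It assumes a normally distributed estimate with known standard error, keeps only the "statistically significant" draws, and compares their average magnitude to the true effect. The effect sizes below are hypothetical.

```python
import random
from statistics import NormalDist


def exaggeration_ratio(true_effect, se, n_sims=100000, alpha=0.05, seed=0):
    """Return (power, mean |estimate| among significant results / true effect)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2) * se
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(est) for est in estimates if abs(est) > crit]
    power = len(significant) / n_sims
    return power, (sum(significant) / len(significant)) / true_effect


# Hypothetical well-powered study: true effect 2.8 standard errors from zero.
power_high, ratio_high = exaggeration_ratio(true_effect=2.8, se=1.0)

# Hypothetical low-powered study: true effect only 1 standard error from zero.
power_low, ratio_low = exaggeration_ratio(true_effect=1.0, se=1.0)

print(round(power_high, 2), round(ratio_high, 2))
print(round(power_low, 2), round(ratio_low, 2))
```

Under these assumptions the well-powered case shows only modest exaggeration, while the low-powered case overstates the true effect by a factor of two or more. So an average exaggeration as small as 13% would seem to require that most of the "true" effects were studied with high power.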

I have not tried to follow the details of the authors’ data collection and analysis process and thus can neither criticize nor endorse their specific findings. But I’m sympathetic to their general goals and perspective.

As a commenter wrote in an earlier discussion, it is the combination of a strawman with the concept of “statistical significance” (i.e., the filtering step) that seems to be a problem, not the p-value per se.