
Chocolate milk! Another stunning discovery from an experiment on 24 people!

Mike Hull writes:

I was reading over this JAMA Brief Report and could not figure out what they were doing with the composite score. Here are the cliff notes:

Study tested milk vs dark chocolate consumption on three eyesight performance parameters:

(1) High-contrast visual acuity
(2) Small-letter contrast sensitivity
(3) Large-letter contrast sensitivity

Only small-letter contrast sensitivity was significant, but then the authors do this:

Visual acuity was expressed as log of the minimum angle of resolution (logMAR) and contrast sensitivity as the log of the inverse of the minimum detectable contrast (logCS). We scored logMAR and logCS as the number of letters read correctly (0.02 logMAR per visual acuity letter and 0.05 logCS per letter).

Because all 3 measures of spatial vision showed improvement after consumption of dark chocolate, we sought to combine these data in a unique and meaningful way that encompassed different contrasts and letter sizes (spatial frequencies). To quantify overall improvement in spatial vision, we computed the sum of logMAR (corrected for sign) and logCS values from each participant to achieve a composite score that spans letter size and contrast. Composite score results were analyzed using Bland-Altman analysis, with P < .05 indicating significance.

There are more details in the short Results section, but the conclusion was that “Twenty-four participants (80%) showed some improvement with dark chocolate vs milk chocolate (Wilcoxon signed-rank test, P < .001).” Any idea what’s going on here? Trial pre-registration here.

I replied that I don’t have much to say on this one. They seemed to have scoured through their data so I’m not surprised they found low p-values. Too bad to see this sort of thing appearing in Jama. I guess chocolate’s such a fun topic, there’s always room for another claim to be made for it.
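
To make the “scoured through their data” point concrete, here is a minimal simulation sketch, not the authors’ analysis. It assumes 30 participants (the count implied by “Twenty-four participants (80%)”), three pure-noise outcomes on a standardized scale, a composite formed as their per-person sum, and paired t-tests in place of the paper’s Wilcoxon and Bland-Altman analyses. All names in it are hypothetical.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 29 degrees of freedom.
# It is only valid for n_people = 30, the sample size assumed here.
T_CRIT_DF29 = 2.045


def paired_t(diffs):
    """One-sample t statistic for paired differences against zero."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def any_significant_rate(n_sims=2000, n_people=30, seed=1):
    """Share of null studies in which at least one of four tests rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Three outcomes per person, all pure noise (no true effect).
        outcomes = [
            [rng.gauss(0.0, 1.0) for _ in range(n_people)] for _ in range(3)
        ]
        # Composite score: per-person sum across the three outcomes.
        composite = [sum(vals) for vals in zip(*outcomes)]
        tests = outcomes + [composite]
        if any(abs(paired_t(d)) > T_CRIT_DF29 for d in tests):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print(f"P(at least one p < .05 under the null): {any_significant_rate():.3f}")
```

Each test on its own rejects about 5% of the time under the null, but the chance that at least one of the four rejects is well above that. A composite chosen after seeing the individual results adds one more ticket to the lottery.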


  1. Dieter Menne says:

    This is a case of the misunderstood “primary endpoint”. The standard study planning template asks for ONE and only ONE primary variable/endpoint. It is mainly used to assess the power of the study, whatever that means (and in the 30 years I have worked in the field, that power has been created by cheating).

    The idea of the regulations is to avoid 75 p-values per study – it does not help in practice since reviewers count the number of p-values on the positive side. A “composite endpoint” makes some sense when one has to weigh efficacy against adverse events (reduction in atrial fibrillation against fatigue), but the rule must be laid out in the study plan a priori.

    So the real cause of concern is the “because” here: “Because all 3 measures of spatial vision showed improvement after consumption of dark chocolate, we sought to combine these data in a …”

  2. Terry says:

    Is it possible this paper is a joke? If I had to make up a joke paper it would sound like this one. “I know, I know, let’s say chocolate makes you see better. No, no that’s not wacky enough, let’s say DARK chocolate makes you see better than MILK chocolate. Now we’re talking.”

    But it’s probably not a joke. The authors put a lot of effort into this new measure of visual acuity.

    On the other hand, the authors are at the University of the Incarnate Word, Rosenberg School of Optometry in San Antonio, Texas.

  3. Mike Hull says:

    Thanks for sharing this Dr. Gelman. Another BIG issue greatly limiting the conclusions of this study is that the flavanol content of the chocolates used was not directly measured. The dark chocolate’s flavanol content was inferred from a 2014 analysis performed by ConsumerLab. No values were reported for the milk chocolate control. Thus, actual flavanol dose is unknown.

    I actually did a quick write-up on this study at the link below. From the article:

    “Quantifying the flavanol content is very important because not all chocolate is processed in a manner that preserves flavanols. Cocoa treated with alkali (aka dutching), roasted at high temperatures, or fermented can have greatly reduced flavanol concentrations.[4][5][6][7] Shorter roasting and fermenting durations can help offset this loss, though. Simply buying dark chocolate does not guarantee it will have appreciable amounts of flavanols.[8] The cacao percentage is also not a reliable indicator of flavanol content.[8] Additionally, cocoa flavanol concentrations can vary from batch to batch depending on growing conditions. Thus, the actual flavanol intake of the study participants is uncertain.”

    • Terry says:

      You seem like you know what you’re talking about.

      Is it reasonable to think flavanols might have significant physical effects? Are they really that strong?

      Are they comparable in power to caffeine?

      • Mike Hull says:

        The flavonoids in cocoa can improve blood flow (to a certain degree at a sufficient dose), so there is a plausible mechanism by which it *might* improve some health endpoints. As for being comparable to caffeine it depends on which endpoint you’re looking at. I will say that, on the whole, caffeine has been studied much more in depth.

  4. Zad Chow says:

    We (Mike, Andrew, and I) had a bit of discussion on this, and Mike and I ended up writing a letter to the editor that apparently got accepted but is still taking forever to publish. The basic premise of the LTE was that they (the authors) went through all that trouble of doing a power analysis (look at the trial preregistration) and stuff but then still ended up saying that there was improvement in outcomes where there was no statistically significant difference.

    If they really didn’t care about statistical significance and cared about estimation all along, they could’ve just said that.

    “Our CIs give this plausible range of effect sizes that are compatible with the test model and its assumptions, and even though there is no statistically significant effect, the CIs lean towards an effect”

    or, “while there were differences between the samples, the effect was not statistically significant.”

    However, they just spun a null finding into improvement without any sort of context.

  5. Michael Bailey says:

    Did I misunderstand or did they preregister their analyses? If so, I don’t understand how it could be cheating, and if they found evidence for their hypotheses, I guess I don’t know enough about the topic to declare it silly.

    • Andrew says:


      I don’t think anyone’s talking about cheating. The point is that the data are not as strong as the researchers claim they are. Preregistration doesn’t turn noisy data into clean data, and even if no cheating is involved, it’s easy for people to overstate their evidence. Honesty and transparency are not enough.

    • Terry says:

      I’m confused.

      A preregistered paper is publishable even if there are no significant results? That can’t be true.

      So the authors must have misrepresented their results. Isn’t that dishonest? At least a little?

      • Andrew says:


        A preregistered paper is definitely publishable even if the preregistered claim does not work out. That’s part of the point of preregistration! But then researchers will spin their results. I wouldn’t label this spin as dishonest, but I would say that sometimes it leads people into pointless noise chasing.

        • Terry says:


          So how are wacky papers weeded out? Could I pre-register a hundred silly papers tomorrow and get them all published?

          Presumably not. And presumably the pre-registration process is not much of a filter, if any.

          So, presumably a journal will only publish a pre-registered paper if the negative results are in some way interesting. So no journal will publish a paper testing whether I can make a tasty omelette.

          So, it sounds like someone out there actually cares whether dark chocolate improves visual acuity. I would not have guessed that.

          • Andrew says:


            There are different ways to do it, but one model is for the journal to accept or reject the paper based on the design, so that if the paper is accepted the journal has committed to publishing it, whatever the findings.

  6. Anonymous says:

    “They seemed to have scoured through their data so I’m not surprised they found low p-values. “

    Imagine what they could “find” with N = 5000 (or perhaps even more). Especially without (the availability to the reader of) any pre-registration! The (possibly “p-hacked” and/or “selectively reported”) “findings” coming from such a large-scale (possibly “collaborative”) study would in turn seem to be almost unfalsifiable!

    It’s easy for skeptics to replicate this N = 24, not so much when N = 10000 (possibly using many labs, and combining the results in an “internal meta-analysis”).

    Also see the recent “Data Colada” post:

    “Make sure you spend time to breathe this in: If researchers barely p-hack in a way that increases their single-study false-positive rate from 2.5% to a measly 8%, the probability that their 10-study meta-analysis will yield a false-positive finding is 83%!”
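
The arithmetic behind that quote can be sketched with a small simulation. This is a hypothetical model and not Data Colada’s own procedure: each null study is assumed to report the most favorable of three independent z-scores (a mild form of p-hacking), and the ten reported z-scores are combined with Stouffer’s method. The function and parameter names are made up for illustration.

```python
import math
import random

Z_CRIT = 1.96  # one-sided 2.5% cutoff for a standard normal


def simulate(n_sims=20000, n_studies=10, n_outcomes=3, seed=1):
    """Return (single-study, meta-analysis) false-positive rates under the null."""
    rng = random.Random(seed)
    single_hits = 0
    meta_hits = 0
    for _ in range(n_sims):
        # Assumed p-hack: each study reports the largest of several null z-scores.
        reported = [
            max(rng.gauss(0.0, 1.0) for _ in range(n_outcomes))
            for _ in range(n_studies)
        ]
        # False-positive check for one study taken on its own.
        if reported[0] > Z_CRIT:
            single_hits += 1
        # Stouffer's method: sum of the z-scores divided by sqrt(number of studies).
        if sum(reported) / math.sqrt(n_studies) > Z_CRIT:
            meta_hits += 1
    return single_hits / n_sims, meta_hits / n_sims


if __name__ == "__main__":
    single, meta = simulate()
    print(f"single-study false-positive rate: {single:.3f}")
    print(f"10-study meta-analysis false-positive rate: {meta:.3f}")
```

The small upward bias in each study is nearly invisible on its own, but it is shared by all ten studies, so pooling accumulates the bias while averaging away the noise.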

    • Keith O’Rourke says:

      > partially emptying the file-drawer almost surely makes things much worse

      More information, if un-selected, can’t be less information unless it is incorrectly processed. I could not discern from what was freely available but was the meta-analysis crudely based on combining p values or effect estimates rather than raw data or sufficient statistics?

      If you can’t profitably analyse all the studies conducted, you can’t profitably do science at all.

      • Anonymous says:

        From the Data Colada post:

        “The problem is that researchers do not select studies at random, but are more likely to choose to include studies that are supportive of their claims.”

        I simply understand this to refer to “publication bias”/”the file drawer” problem.

        For instance, if i were a chocolate conglomerate and wanted to get everyone to buy my brand of chocolate, i could use dozens of labs around the world to try and find support for all kinds of “beneficial” stuff that folks could enjoy from eating my chocolate.

        Without (the availability to the reader of) a detailed pre-registration that includes not only the data-analyses i will use, but also the exact (no. of) studies and/or exact (no. of) labs, i could 1) p-hack all i want, and 2) selectively report the labs that found what i want/like.

        For me, the bottom line of the Data Colada post (if i understood things correctly) is that the availability of pre-registration concerning studies that use “internal meta-analyses” is crucial. If i understood things correctly, this also holds for “collaborative” projects that use many labs to perform a single study, like “Registered Replication Reports”.

        In light of this latter point, i tried to find pre-registration information from two of the “Registered Replication Reports” in order to check if they pre-registered the exact (number of) labs that would contribute to the project (so the reader is able to verify that there was no “selective reporting”), but i could not find anything about this!?!?! Am i missing something here?

        1) “Professor priming” (Dijksterhuis et al.) Registered Replication Report:
        The project page seems to have no “registration”, and in the “protocol” i can’t find anything about the exact (number of) labs that will participate.

        2) “Pen in mouth” (Strack et al.) Registered Replication Report:
        The project page seems to have a “registration”, but again nothing is mentioned concerning the exact (number of) labs that will be participating.

        So if i understood things correctly, and am not missing anything, doesn’t this mean that we can not even conclude anything from these “Registered Replication Reports” using thousands of participants, because we can’t rule out “selective reporting” of findings by means of verifying things via pre-registration information (i.c. certain results of certain labs may have been left out which we can’t check)?????

        • Keith O’Rourke says:

          OK but “the absolute worst thing the field can do” is to suggest that others “Don’t do internal meta-analysis”.

          Rather, they should persist in doing them until they can be done profitably (properly). Believe it or not, similar arguments were raised to convince folks not to try to look at more than one study at all.

          Unless by “internal” meta-analysis they mean a black box hidden activity that would be no different than selectively deleting annoying individual observations in a single study. Stop the presses – faked data summaries can make studies misleading!

          One quick point here regarding “we can not even conclude anything from these “Registered Replication Reports” using thousands of participants, because we can’t rule out “selective reporting” of findings by means of verifying things”. By persisting in doing meta-analysis more thoughtfully, some sense may be gained about the selection that is going on and how to lessen it.

          • Anonymous says:

            “Unless by “internal” meta-analysis they mean a black box hidden activity that would be no different than selectively deleting annoying individual observations in a single study. Stop the presses – faked data summaries can make studies misleading! “

            Imagine the big chocolate conglomerate, wanting to show all the benefits of eating their chocolate, making labs all over the world perform their research. However, they do not make a pre-registration (which should include things like planned analyses, but also crucially should include all the labs that are supposed to be involved in data collection if i understood things correctly) available to the reader.

            This would enhance the possibilities for the big chocolate conglomerate to be able to manipulate matters. They could simply leave out the findings of (some of) the individual labs that didn’t find what they wanted to. I would view this the same as your example of “deleting annoying individual observations in a single study”.

            Who knows whether this happened with the 2 “Registered Replication Reports” i used as examples. If i haven’t missed anything, it just seems to me that the pre-registration has been performed sub-optimally so the reader can’t verify these matters like they can, for instance, concerning the data analyses. It just seems to me that if pre-registration is used for things like data exclusion, and planned analyses, it should also be used for indicating the exact (number of) labs that will participate in data collection…

            It seems very important to me, especially with large-scale (“collaborative”) projects, that these possible ways to manipulate things are tackled in a sensible manner. If i understood things correctly, i hope you would agree with me that the (availability to the reader of) pre-registration of the exact (number of) participating labs seems to be as crucial as that of, for instance, the planned analyses, and data exclusion criteria.

            • Keith O’Rourke says:

              Yes, the number of participating labs needs to be known, but the only real way of dealing with these issues is random audits of scientific claims. Currently pharmaceutical companies are subject to such audits, as are other industries.

              So just the same as anyone claiming a tax deduction is subject to audit, anyone making a scientific claim should be subject to an audit of the materials they used in that claim.

              At at least one university, this is already the case for any faculty member making a claim of following ethical requirements in research. Apparently objections to being audited to ensure one is acting ethically were easy enough for the university administration to set aside. It will be harder to set aside objections to being audited to ensure one is acting diligently in making scientific claims.

              Folks in revenue agencies likely have a sense of what the sweet spot is for percent of claims audited – my guess is around 5%. As a former dean of mine once said, nothing sanitises like sunshine. That will be the case if the audits are random, because the sunshine could fall on anyone at any time.
