The Mannequin

[cat picture]

Jonathan Falk points to this article, “Examining the impact of grape consumption on brain metabolism and cognitive function in patients with mild decline in cognition: A double-blinded placebo controlled pilot study,” and writes:

Drink up! N=10, no effect on thing you’re aiming at, p value result on a few brain measurements (out of?), eminently pr-able topic…Seems like a TED talk is nigh…

In all seriousness, don’t these people know that the purpose of a pilot study is to test out the methods, not to draw statistical conclusions? Pilot studies are fine. It’s a great idea to publish pilot studies, letting people know what worked and what didn’t, giving all your raw data, letting it all hang out. But leave the statistical significance calculator at home.

Also this:

1.112 ± 0.005262

Noooooooooooooooooooooooooooooooooooooooooooooooooooooooo!!!!!

P.S. I’m not saying these researchers are bad guys; I assume they’re just following standard practice which is to try to get at least one statistically significant result and publication out of any experiment. We just need a better scientific communication system that’s focused more on methods, data, and understanding, and less on the promotion of spurious breakthroughs.

36 thoughts on “The Mannequin”

  1. One idea that I haven’t seen discussed much is that research grants should be awarded to collect and publish *data*, not findings. This would increase the rewards to obtaining good data and making it publicly available for all to study and analyze. I’m an economist, so I always have incentives in mind. This would also force researchers to share their data. If one wants to nitpick, we could allow a one-year embargo to give the original researchers a chance to get the scoop, but it wouldn’t be long before anyone else could use the data.

    • I understood the general problem with the junk science that is regularly criticised here to be that the data themselves are noise. What is the point of publishing it if there’s no reason to think it contains anything of interest? Findings accompanied by data allow the findings to be checked. Large systematically collected data (e.g. the census) is useful in itself. But N = 20?

      • Richard:

        There’s a logic to publishing a pilot study, but then the point is to share qualitative information about challenges in data collection which will be relevant to other people doing similar studies. And it’s ok to present simple summaries; no reason not to share the data.

        • Tough to get pilot studies published. Just helped a researcher complete a 24-subject, placebo-controlled crossover study, the first in children with a rare disease. Even though we indicated in the protocol that it was not for demonstrating efficacy, we had a hard time convincing reviewers that simply showing the treatment estimates and 95% CIs for the various mental health scales was sufficient. We were criticized for running such a small study (no power) and not showing p-values. It was finally accepted by the third journal we submitted to.

    • Carol:

      “Not N = 20. N = 10! 5 in the control group and 5 in the grape group.”

      At first, I thought your “N = 10!” meant ten factorial. After all, they must be kidding to base everything on 5 in the control group and 5 in the grape group. You deserve applause for searching this out. In fact, you deserve even more praise because the abstract fails to mention what you say in your next comment:

      “Funded, I see, by the California Table Grape Commission ….”

        • Jack:

          Trolls are boring. But, just in case you’re not trolling: n=2 or even n=1 is fine if you’re studying a phenomenon with very little variation, or if you have enough intermediate measurements and understanding. But n=10 is not going to work for a between-person black-box-style study with minimal theory, lots of variation, and noisy measurements. n=10 can be ok for a pilot study, as I wrote in my post above.
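
          Here’s a quick simulation sketch of that point, with made-up numbers (not from the paper): with n = 5 per group and between-person variation comparable to or larger than the effect, the estimated difference bounces all over the place, while a tightly controlled measurement pins the effect down even with n = 2.

            import numpy as np

            rng = np.random.default_rng(0)

            def estimated_diffs(effect, sd, n_per_group, reps=10_000):
                # Sampling distribution of the estimated group difference
                a = rng.normal(0.0, sd, size=(reps, n_per_group))
                b = rng.normal(effect, sd, size=(reps, n_per_group))
                return b.mean(axis=1) - a.mean(axis=1)

            # Noisy between-person setting: true effect 0.2, between-person SD 1.0
            noisy = estimated_diffs(effect=0.2, sd=1.0, n_per_group=5)
            # Tightly controlled setting: same effect, measurement SD 0.02
            tight = estimated_diffs(effect=0.2, sd=0.02, n_per_group=2)

            print(noisy.std())  # ~0.63, about three times the effect itself
            print(tight.std())  # ~0.02, an order of magnitude below the effect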

        • I know you don’t find me boring! But as you say, the problem here is not sample size alone; it’s the experimental conditions and the signal-to-noise ratio. I know you know that, but some young people don’t; they think small samples invalidate conclusions by themselves. I think this is an important distinction that many newly trained scientists/statisticians miss. Once I was hired to help some researchers with a horse study where the sample size was 4. Their statistician was saying they needed more samples (which was expensive), but the effect size was so huge and the experiment conditions so precisely controlled that it was nonsense to waste more money on another experiment. And thankfully they didn’t.

        • >”the effect size was so huge and the experiment conditions so precisely controlled that it was nonsense to waste more money on another experiment”

          You may have something reversed here. The point of replication is to:
          1) Check “whether the experiment conditions [are] so precisely controlled”
          2) Ensure you understand the situation well enough to repeat it at will and communicate it to others
          3) Improve your estimate of whatever model parameter(s) are of interest

          I also don’t see how the size of the effect can determine whether it is a waste of money.

        • So go ahead and replicate it, if you think a difference of 0.23 vs 10.583 is not enough evidence for you, I think you should do it — just have $200-$400K and one year of your life to spare (this $ is for a sample size of 4).

        • >”So go ahead and replicate it, if you think a difference of 0.23 vs 10.583 is not enough evidence for you, I think you should do it — just have $200-$400K and one year of your life to spare (this $ is for a sample size of 4).”

          Once again, I think you have something reversed in your thinking. I would not run any study at all to determine simply whether a difference existed; rather, I would assume that to be the case to begin with.

          I bet that if it really was a waste of money to do the replications, then after proper consideration we would find that it was a waste of money to do the original study too.

        • If you’re serious, I just hope you’re not a statistician. Yes, let’s eat the whole soup to make sure it’s properly seasoned.

        • Jack:

          If we conducted a leave-one-out cross-validation on your horse study, would the results change?

          What if you randomly selected 4 horses from a stable of 100? What percentage of replications would you expect to produce the same results?
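
          To make the second question concrete, here is a toy simulation with entirely invented numbers, loosely patterned on the “0.23 vs 10.583” figures above (paired measurements on 100 hypothetical horses, horse-to-horse SD of 0.5, and an arbitrary cutoff for calling two results “the same”): when the effect dwarfs the variation, essentially every random draw of 4 horses reproduces the result; shrink the effect or inflate the variation and the rate drops fast.

            import numpy as np

            rng = np.random.default_rng(1)

            # Hypothetical stable of 100 horses, each measured under both conditions
            control = rng.normal(0.23, 0.5, size=100)
            treated = rng.normal(10.583, 0.5, size=100)

            reps = 10_000
            reproduced = 0
            for _ in range(reps):
                idx = rng.choice(100, size=4, replace=False)  # pick 4 horses at random
                # Crude "same result" criterion: an estimated difference above 5
                if treated[idx].mean() - control[idx].mean() > 5:
                    reproduced += 1

            print(reproduced / reps)  # ~1.0 under these assumptions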

        • Anoneuoid:

          True, you can’t rule out a hypothesis by the way it was generated – even if that is mindlessly dredging through a noisy, biased study, perhaps even with made-up data.

          But everyone needs to decide how to spend their time, and how to make and keep an area of science profitable (in the sense of getting less wrong about reality, so that the actions taken are less frustrated by it) – and it is that which argues for focusing on a select subset of hypotheses and/or past work to best further it.

          Hypotheses generated in the garden of forking paths are _likely_ the worst choice of anything to focus on, after ESP, angel therapy, etc.

        • Anoneuoid thinks that the researchers wanted to simply state whether there’s an effect or not; he’s projecting himself onto other people. These researchers wanted to move on and continue manipulating the drug in the right direction. They did. They succeeded, and are now implementing it on a larger scale. And they didn’t have to waste their money replicating the first experiment, which showed an obvious huge effect where only a wrongly “trained” statistician would see a problem because of his own narrow mind. They used that time and money more wisely.

        • > Anoneuoid thinks that the researchers wanted to simply state whether there’s an effect or not; he’s projecting himself onto other people.

          No. What did you mean by this: “a difference of 0.23 vs 10.583 is not enough evidence for you”? Evidence regarding what, exactly? (I am pretty sure you mean evidence that there is an effect.)

          > These researchers wanted to move on and continue manipulating the drug in the right direction.

          Hard to say without details, but I bet they should have done that to begin with and saved the $200k-$400k on seeing “if there is an effect”.

          > They succeeded, and are now implementing it on a larger scale.

          How was success assessed? Most people use NHST, which gives a false sense of success much more often than not.

          > only a wrongly “trained” statistician would see a problem because of his own narrow mind.

          The root problem (it seems, from your description) is that they designed a study to merely check for an effect, when that is not what they wanted to know. Also, I was not at all trained as a statistician. I was trained to be a pseudoscientist.

          Still, none of this really addresses my original points:
          1) The size of an effect has nothing to do with whether a replication study is warranted
          2) You cannot know whether you have precisely controlled the conditions without a replication study

        • “The root problem (it seems, from your description) is that they designed a study to merely check for an effect”

          Oh god, of course not. There’s no root problem; their first experiment was a success, and they only doubted themselves because their statistician told them so. They didn’t want to “check if there’s an effect”; they already knew there had to be one; they wanted to check how big it was and what was mediating this effect in a controlled environment, to rule out possible alternative explanations and narrow down where to proceed with their research. Their research is the typical case where statistics is almost not needed, because uncertainty and variation do not play a huge role. Their main problem is measurement error, which has known variance and is super small relative to the size of the effects they are measuring.

          Regarding effect size and replication, I won’t even bother to reply; this clearly shows you have no formal training in statistics.

        • Anon, Jack:

          This is so much more fun than the comments section of Marginal Revolution, where they have flame wars about Israel and Obama and whether it’s cool to be a racist and stuff like that. Our flame wars are so much more intellectual!

        • > Regarding effect size and replication, I won’t even bother to reply; this clearly shows you have no formal training in statistics.

          Statisticians aren’t the ones who came up with the replication idea… whatever justifications they have apparently differ from those of the scientists. I suspect those justifications are probably ad hoc and “miss the point”, but I can’t tell from this discussion.

          > they wanted to check how big it was

          Remember the offending quote: “a difference of 0.23 vs 10.583 is not enough evidence for you”.

          If they wanted to check how big it was, then what is the difference between ‘.23 vs 10.583’, or ‘.23 vs 1.583’? Ceteris paribus, the estimates will be equally precise; why would the smaller effect warrant a replication more so than the larger?
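
          To spell that out with a toy calculation (noise SD of 1 and n = 4 per group, both invented for illustration): the standard error of the estimated difference in means is sigma * sqrt(1/n1 + 1/n2), which contains no effect size at all, so the two estimates are equally precise.

            import numpy as np

            rng = np.random.default_rng(2)
            sigma, n = 1.0, 4  # assumed noise SD and per-group sample size (illustrative only)

            def se_of_estimated_difference(effect, reps=100_000):
                # Monte Carlo standard error of the estimated difference in means
                a = rng.normal(0.0, sigma, size=(reps, n))
                b = rng.normal(effect, sigma, size=(reps, n))
                return (b.mean(axis=1) - a.mean(axis=1)).std()

            print(se_of_estimated_difference(10.583 - 0.23))  # ~0.71
            print(se_of_estimated_difference(1.583 - 0.23))   # ~0.71, same precision, different effect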

          Andrew wrote:

          Our flame wars are so much more intellectual!

          It would be much more productive if we could talk about actual methods and data. If this study is not suitable, maybe there is a publicly available one that can be used instead?

        • Reminds me of the lung transplant researcher who came to see me after the Head of the Statistics Dept’s Statistical Consulting Center refused to discuss their two-group, 6-dogs-per-group study, shouting at him to never come back with such a ridiculously small study ever again.

          The highest score in the placebo group was 20 standard deviations below the lowest score in the treatment group.

          I suggested he move on to the next step in his research program as I thought that was reasonable.

        • Sometimes I think researchers have the correct Bayesian intuition, and frequentism just messes it all up. Was the Head of the Statistics Dept’s Statistical Consulting Center a frequentist? Just asking.

        • > I think researchers have the correct Bayesian intuition
          Really, I would like to know what that is ;-)
          http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

          > Was the Head of the Statistics Dept’s Statistical Consulting Center a frequentist?
          I think they would have been open to Bayes but would have defaulted to frequentist.

          When I mentioned the outcome to them later they said – “well my advice would have been right in most cases – so no regrets.”

    • Carol:

      Good catch! I thought you were kidding but then I scrolled down to the end of the paper and there it was! I guess this is still a step up from studies funded by the cigarette companies.

  2. This is funny: you guys think n=10 is too little, as if research and decisions have to wait for people who don’t know any better to become comfortable with the sample size. Gosset used n = 3, n = 4. The biggest problem with some areas is not sample size; it’s the questions they are asking, which can’t be answered even with infinite data.

  3. Uh oh, I’m not sure I make the connection with the “No…o!” re: 1.112 ± 0.005262. Is the point just that an SEM has little meaning for small N?
    Thanks in advance for clarification.

    • What’s bothersome about this sort of thing is that it fails the common sense smell test. It should be obvious that so much precision is nonsense. That it isn’t obvious, and that someone supposedly professional writes it, calls into question that person’s competence at a basic level. It’s an error similar in spirit to a first-year calculus student computing an integral to find the volume or mass of something and getting a negative answer. The problem is not that somewhere in the calculations he made a sign error, rather that he didn’t notice that he must have made a sign error, because the result he obtained is clearly silly.
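
      To put a rough number on it (assuming, purely for illustration, that the ± figure is an SEM computed from n = 10 observations): the standard error is itself only an estimate, with a relative error of roughly 1/sqrt(2(n-1)), about 24% at n = 10, so everything past the first digit or so of 0.005262 is noise.

        import numpy as np

        rng = np.random.default_rng(3)

        n = 10
        true_sd = 0.0166                 # invented so that the true SEM is about 0.00525
        true_sem = true_sd / np.sqrt(n)

        # How much does an estimated SEM vary across hypothetical replications of the study?
        samples = rng.normal(1.112, true_sd, size=(100_000, n))
        est_sem = samples.std(axis=1, ddof=1) / np.sqrt(n)

        print(true_sem)                  # ~0.00525
        print(est_sem.std() / true_sem)  # ~0.24: about 24% relative error in the SEM itself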

      • There’s a story that someone once gave an exam in which students were to estimate the distance from the Sun to the Earth from given information. Quite a few gave the answer 3 meters — which, when the seating of the students was also considered, was good evidence of copying.

  4. Recently, Andrew blogged about one of the end products of red grapes, red wine:

    http://statmodeling.stat.columbia.edu/2017/01/27/absence-of-evidence-is-evidence-of-alcohol/

    “I drank about one glass of red wine every evening with dinner, on the advice of my cardiologist.”

    He is also legendary for his celery consumption: http://statmodeling.stat.columbia.edu/2007/04/24/the_psychology/

    “as a person who regularly eats celery in meetings,”

    Celery and red wine may be good for one merely because the former takes the place of junk food and the latter implies no consumption of diet soda.

  5. “We just need a better scientific communication system that’s focused more on methods, data, and understanding, and less on the promotion of spurious breakthroughs.” But what is your proposed solution? Do you believe better training with a new textbook is sufficient? How does one enforce that in application after the training is complete? With shaming, failed replications, and occasional retractions?
    I won’t bore you again with my proposal in my two comments here:
    http://statmodeling.stat.columbia.edu/2017/01/30/no-guru-no-method-no-teacher-just-nature-garden-forking-paths/

    Both statisticians and journalists focus on ethical standards but avoid formal written ones, relying instead on a shared notion of “professionalism.” When it comes to communicating technical topics, why not formalize those standards?
