
The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect

David Allison sent this along:

Press release from original paper: “The dramatic decrease in BMI, although unexpected in this short time frame, demonstrated that the [Shaping Healthy Choices Program] SHCP was effective . . .”

Comment on paper and call for correction or retraction: “. . . these facts show that the analyses . . . are unable to assess the effect of the SHCP, and so conclusions stating that the data demonstrated effects of the SHCP on BMI are unsubstantiated.”

Authors’ response to “A Comment on Scherr et al ‘A Multicomponent, School-Based Intervention, the Shaping Healthy Choices Program, Improves Nutrition-Related Outcomes’.”

From the authors’ response:

We appreciate that Dr David B. Allison, the current Dean and Provost Professor at the Indiana University School of Public Health, and his colleagues [the comment is by Wood, Brown, Li, Oakes, Pavela, Thomas, and Allison] have shown interest in our pilot study. Although we appreciate their expertise, we respectfully submit that they may not be fully familiar with the challenges of designing and implementing community nutrition education interventions in kindergarten through sixth grade. . . .

It is evident that researchers conducting community-based programs are typically faced with limitations of sample size and study design when working with schools. We fully agree that the work we conducted is at a pilot scale. . . .

Given the limitations we had in sample size, we agree it should be viewed as a pilot study. Although this work can be viewed as a pilot study, we submit that it generates hypotheses for future larger-scale multicomponent studies. . . .

Huh? So you’re clear that it’s a pilot study, but you still released a statement saying that your data “demonstrated that [the treatment] was effective”???


There were also specific problems with the analysis in the published paper (see the above-linked comment by Wood et al.) but, really, it’s easier than that. You did a pilot study so don’t go around claiming you’ve demonstrated the treatment was effective.

Also a slightly more subtle point: I think the authors are also wrong when they write that the patterns of statistical significance in their pilot study “generates hypotheses for future larger-scale multicomponent studies.” Lots of people think this sort of thing but they’re mistaken. The problem is that patterns in a pilot study are just too noisy. You might as well be just rolling dice. To take a bunch of data and root around for statistically significant differences—this’ll just lead you in circles.
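To make this concrete, here is a quick simulation sketch (my own toy numbers, not from the paper): a true effect of 0.2 standard deviations and 15 subjects per arm, which is not an unusual pilot size.

```python
import random
import statistics

random.seed(1)

def pilot_estimate(true_effect=0.2, n=15, sd=1.0):
    """Simulate one small two-arm pilot and return the estimated effect."""
    treated = [random.gauss(true_effect, sd) for _ in range(n)]
    control = [random.gauss(0.0, sd) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)

estimates = [pilot_estimate() for _ in range(10_000)]

# With n = 15 per arm the standard error is sd * sqrt(2/15), about 0.37:
# nearly twice the true effect, so individual pilot estimates swing wildly.
wrong_sign = sum(e < 0 for e in estimates) / len(estimates)
doubled = sum(e > 0.4 for e in estimates) / len(estimates)
print(f"fraction of pilots with the wrong sign: {wrong_sign:.2f}")
print(f"fraction showing at least double the true effect: {doubled:.2f}")
```

Roughly 30% of such pilots get the sign of the effect wrong, and about as many at least double it, which is why "hypothesis generation" from these patterns is close to rolling dice.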

As the authors write, it is challenging to design and implement community nutrition education interventions in kindergarten through sixth grade. It’s tough that things are so difficult, but that doesn’t mean that you get a break and can make strong claims from noisy data. Science doesn’t work that way.

Why bother?

Why bother to bang on about a statistical error published in an obscure journal? Two reasons. First, I assume that these sorts of research claims really can affect policy. Wood et al. are doing a service by writing that letter to the journal, and I wish the authors of the original study would recognize their mistake. Second, I find these sorts of examples to be instructive, both in illustrating statistical misconceptions that people have, and helping us clarify our thinking. For example, it seems so innocuous and moderate to say, “Although this work can be viewed as a pilot study, we submit that it generates hypotheses for future larger-scale multicomponent studies.” But, no, that’s a mistake.


  1. Anoneuoid says:

    Emphasis mine:

    The article is “A Multicomponent, School-Based Intervention, the Shaping Healthy Choices Program, Improves Nutrition-Related Outcomes,” by Rachel E. Scherr, PhD, Jessica D. Linnell, PhD, Madan Dharmar, MBBS, PhD, Lori M. Beccarelli, PhD, Jacqueline J. Bergman, PhD, Marilyn Briggs, PhD, RD, Kelley M. Brian, MPH, Gail Feenstra, EdD, RD, J. Carol Hillhouse, MS, Carl L. Keen, PhD, Lenna L. Ontai, PhD, Sara E. Schaefer, PhD, Martin H. Smith, EdD, Theresa Spezzano, MS, Francene M. Steinberg, PhD, RD, Carolyn Sutter, MS, Heather M. Young, PhD, RN, FAAN, and Sheri Zidenberg-Cherr, PhD. It is published in Journal of Nutrition Education and Behavior, volume 49, issue 5 (May 2017) by Elsevier and is openly available at


  2. Liz Stuart says:

    Totally agree with the title! People can actually play around with a shiny app a former student of mine wrote, which shows the dangers you can run into by relying on point estimates from pilot studies to inform later trials:

    It’s all based on basic power analysis calculations but I think helpful to step things through. This paper also talks through the issues (which was motivated by too many grant review panels where people over-interpreted effect sizes from tiny pilot studies):

    Westlund and Stuart. The Nonuse, Misuse, and Proper Use of Pilot Studies in Experimental Evaluation Research. American Journal of Evaluation.
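    A rough numerical sketch of that danger (my own made-up numbers, not from the paper): the standard sample-size formula, fed three different pilot point estimates that could all easily arise from one noisy pilot of a true 0.2-SD effect.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def n_per_arm(effect, sd=1.0, alpha=0.05, power=0.80):
    """Standard two-arm sample size for a difference in means."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return math.ceil(2 * (z * sd / effect) ** 2)

# A pilot with 15 per arm has a standard error of about 0.37, so point
# estimates of 0.1, 0.2, or 0.4 are all plausible readings of a true 0.2.
for pilot_est in (0.1, 0.2, 0.4):
    print(f"pilot estimate {pilot_est:.1f} -> plan {n_per_arm(pilot_est)} per arm")
```

    If the pilot happens to overestimate the effect at 0.4, the planned trial ends up around a quarter of the size actually needed to detect the true 0.2 effect with 80% power.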

  3. Thanatos Savehn says:

    Speaking of causal claims, Andrew, have you had a chance to read Judea Pearl’s post on Deborah Mayo’s blog and the paper “Causal inference and the data-fusion problem”? You make a couple of appearances.

    As usual he argues his point(s) well and I was left wondering two things. First, is all this statistical learning/rethinking I’ve been doing about to be supplanted by a causal approach that comes with a whole new language? After all, “the vocabulary of statistical analysis, since it is built entirely on properties of distribution functions, is inadequate for expressing those substantive assumptions that are needed for getting causal conclusions”. Second, because I am grateful for your blog I’m very much hoping that you’ll be able to “cope with the painful transition from statistical to causal thinking.” (That’s from Judea’s blog and not Mayo’s).

    • Andrew says:


      I think Pearl’s approach to thinking about causal inference is just fine. But in any case some statistics will still be needed. It’s fine with me if the statistics is done within Pearl’s framework.

      To put it another way, I don’t think it’s statistical or causal thinking; I think it’s statistical and causal thinking.

      At the linked post, Pearl writes:

      “Once we recognize the importance of diverse sources of data, statistics can be helpful in making decisions and quantifying uncertainty.” [Quoted from Andrew Gelman’s blog]. The reason I question the sufficiency of statistics to manage the integration of diverse sources of data is that statistics lacks the vocabulary needed for the job.

      I think Pearl and I are in agreement. I don’t think that statistical data analysis is sufficient to manage the integration of diverse sources of data. Substantive knowledge is needed too. What I wrote above (and what Pearl quoted) is that statistics can be helpful, not that statistics is all that is needed.

      • I’m glad you clarified that b/c I would have been surprised if you had claimed that statistics was all that was needed. I think the qualitative end needs way more development than I have observed, requiring an intellectual effort that may be forestalled by conflicts of interests.

    • Andrew says:

      P.S. I looked at the other link and I saw a discussion of “transportability.” We discussed that topic here a few years ago. I think it makes sense to use hierarchical modeling and partial pooling to combine data from different sources. This is an old idea in statistics but it continues to be refined; see for example here.

      • Keith O'Rourke says:

        I do still have a hard time grasping what is really distinctive about “transportability” versus thoughtful meta-analysis, which extends back to the great mathematician/astronomers, as Andrew and I outlined here

        “Awareness of commonness can lead to an increase in evidence regarding the target; disregarding commonness wastes evidence; and mistaken acceptance of commonness destroys otherwise available evidence … A concrete but simple example that demonstrates practical controversies nicely would be the situation depicted in the wiki entry on Simpson’s paradox … an underlying reality of exactly the same positive trend for two groups (both slopes equal to one) but that happen to have different intercepts,”

        I do find Pearl and his colleagues’ work informative and worth reading but the extensive discussions on “everything” being supplanted by a “revolutionary” causal approach – I find distracting.

        Perhaps I have this sort of issue in mind: “Astronomers and others would often reflect on how to determine which dataset [approach] was the best (thus implicitly assigning weights of 0 to all the remaining data [insights]), anticipating that was the obvious solution, but they had yet to learn that, as Stigler (2016) put it, “the details of individual observations [methods] had to be, in effect, erased to reveal a better indication than any single observation [approach] could on its own.”

        • Keith you mean the Pearl reference to Data Fusion approach? Looks extremely detailed as described.

          • Keith O'Rourke says:

            I have just had a quick look at that, which seemed not that different from the half dozen to a dozen earlier papers by Elias Bareinboim and Judea Pearl going back to 2012, along with several of Andrew’s blog posts on this work, numerous comments, emails, and a face-to-face meeting with Judea.

            The Data Fusion might be a nice summary/overview of that work.

        • Jack says:

          They derived complete conditions for identification as well as an algorithm that gives you an estimand when it is identifiable. How is that in any way similar to “thoughtful” ad-hoc meta-analysis? One gives you a solution to *all* non-parametric data-fusion problems; the other is about solving a specific ad-hoc problem someone invented.

      • Thanatos Savehn says:

        Two things. First, I read your response at Mayo’s blog and as a result hereby make perhaps a category error. I (just a slack-jawed observer of all this methodological to-do) had put your thoughts on these matters alongside those of Philip Dawid in a category that emphasizes risk and uncertainty rather than Hume-ian (i.e. counterfactual) causation. Pearl on the other hand goes into the “yes, the ‘true’ p(H|E) can now be determined and the actual killer of the desert traveler sent off to prison” category. Now I’m not trying (I swear) to draw you into their bitter debate but as they’re arguing over the ultimate issue in the law where I labor (e.g. – how would ye be judged|data were it yer backside in the dock?), I would be grateful if at some distant time in the future you’d weigh in on this question (as I think the law’s sensible proximate cause inquiry is actually one of risk rather than cause in fact and so seek to have my bias confirmed). Second, Ioannidis has just weighed in on the p<.005? question and wrote: "These solutions may involve abandoning statistical significance thresholds or P values entirely." No cItz 4u but thought you'd be interested nonetheless:

        • I had a long public discussion with Pearl on this blog a few years ago… in the end my conclusion was that he did not understand the concept of Bayesian probability as a generalization of Boolean logic. He kept asking me to show him how to calculate some example probability quantity without specifying the state of information from which one would assign probabilities. I kept asking him for details about the mechanisms involved and other information which I would require to assign probabilities; he kept wanting me to plug in frequencies.

          In short, he confused frequency and probability, as so many have done, and the result is that we couldn’t make progress. It was an enlightening conversation though. Might be hard to find by searching as it was a while back, and I don’t think users of the blog can search through comments.

  4. Eric says:

    Andrew, this is a tangent, but what is the difference between a pilot trial of the procedures that generates a preliminary effect size (let’s assume without the flaws of the Scherr et al. article, e.g., different analysis) and a study “early in clinical development” that generates “preliminary clinical evidence”, the FDA’s term used in defining a “breakthrough” therapy that can enter expedited development and review?

    Here the full definition of breakthrough therapy from S.3187, Section 902(a)(1): “—The Secretary shall, at the request of the sponsor of a drug, expedite the development and review of such drug if the drug is intended, alone or in combination with 1 or more other drugs, to treat a serious or life-threatening disease or condition and preliminary clinical evidence indicates that the drug may demonstrate substantial improvement over existing therapies on 1 or more clinically significant endpoints, such as substantial treatment effects observed early in clinical development. (In this section, such a drug is referred to as a ‘breakthrough therapy’.) “

    It seems like some “pilots” could lead to the use of data on clinical endpoints, and not just data on the procedures. I’m not thinking about stopping rules for full-scale randomized trials, but rather earlier stage non-randomized trials like the Duke poliovirus Phase 1 clinical trial that was granted breakthrough status. A quote in the article gives a clear explanation of what the designation means: “Breakthrough status means that we can work with the highest levels in the FDA to develop the most efficient clinical trial and pathway to fully evaluate the safety and efficacy of the genetically modified poliovirus for treating recurrent glioblastoma…Ultimately, we hope the therapy will one day obtain FDA approval.”

    So it’s clear that the conclusion is not “this treatment is efficacious and deserves FDA approval”, but it also seems to be a use of what you might call pilot data—from a Phase 1 trial—to infer that the drug “may demonstrate substantial improvement” and warrant further review. In other words, they are not saying the treatment is effective, but that it might be effective. With “pilot” data.

    • Andrew says:


      I don’t think it’s generally appropriate to use a pilot study to generate a preliminary effect size.

      • Eric says:

        Andrew, is this the same as saying you would not use a pilot study to determine the direction of an effect? That seems to be what is implied by the regulations governing the determination of breakthrough therapies: “preliminary clinical evidence indicates that the drug may demonstrate substantial improvement over existing therapies”.

        • Andrew says:


          I do not think it’s generally appropriate to use a pilot study to estimate the direction of an effect.

          To be fair, though, the quote you give does not refer to pilot studies; it refers to “preliminary clinical evidence,” which could refer to a large observational study, for example, or careful experimental data.

          • Eric says:

            Thanks, Andrew. I think that is the core of my question: what differentiates a “pilot study” from “preliminary clinical evidence”? The example I cited was a Phase I trial of 61 patients (single group, non-randomized) where the primary outcome was “Maximum Tolerated Dose or Optimal Dose”. We don’t call Phase I trials “pilot trials”, but what is the difference?

            • Andrew says:


              I like the wikipedia definition: “A pilot study, pilot project or pilot experiment is a small scale preliminary study conducted in order to evaluate feasibility, time, cost, adverse events, and improve upon the study design prior to performance of a full-scale research project.”

              • Eric says:

                I think that’s a good definition of a pilot study. A Phase I clinical trial designed to study adverse events related to dosing seems to me like it would qualify as a “pilot” to inform drug development and later stage trials, but had you asked me yesterday I would have said no, it’s not. Maybe I should revert to this view.

                I guess where I land on this example is as follows:

                This Phase I trial of 61 patients (single group, non-randomized) designed to evaluate the maximum tolerated dose has some overlap with what we would define as a pilot study and, given the positive response observed in the study, equated to “preliminary clinical evidence [that] indicates that the drug may demonstrate substantial improvement over existing therapies”.

                I don’t think you agree that this example shares enough in common with a pilot study to label it as such. I hesitate to use the pilot label, but at the same time I can’t explain why the data generated from this study would be different from the data generated by a “pilot” study.

                Thanks for the back and forth, Andrew. Your post certainly got me to think more critically about the meaning of pilot.

    • Keith O'Rourke says:

      > we can work with the highest levels in the FDA to develop the most efficient clinical trial and pathway to fully evaluate the safety and efficacy

      Hopefully those highest levels in the FDA will include a few who understand the uncertainties in a pilot study and can offer good advice on the risks. I am sure the shareholders/stakeholders would – if they realized the risks. Liz Stuart’s comments above underline the need for more awareness of the issues. I don’t think they are widely appreciated.

  5. Zad Chow says:

    As many times as this is repeated, very little will change. People will continue to abuse pilot studies and use effect sizes and standard deviations from them in order to determine how close they are to achieving their 80% statistical power claim. This is one of the rarer examples where the authors actually tried to pass off the results as suggesting that the intervention is effective.

  6. Mark says:

    If pilot studies shouldn’t be used to estimate effect sizes for subsequent power studies, and they shouldn’t be used to generate hypotheses for future larger-scale multicomponent studies, what are they for?

  7. Stacey says:

    Hi Andrew,
    Totally agree!! I did a pilot study to test a methodology in a slightly different field than the one in which it had previously been applied.

    I am interested in hearing your view/definition of ‘feasibility’. What should researchers pay attention to in terms of feasibility-testing instead of the absolute results?

    Thank you!

  8. Andrew, I wonder if everyone has the same definition of “pilot” (or rather, I’m pretty sure that this isn’t true). A pilot study in rats of a surgical treatment for tumor removal at institution X is probably a pretty different thing than a pilot study in humans of a diet modification for weight loss at institution Y. Yes, we should check whether the recruitment and intervention and measurement techniques are all feasible, but we should also then do a Bayesian Decision Theory analysis on the results and determine whether a larger study is justified because it results in higher expected value than “do nothing”.

    But of course, this is rarely done…
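
    For concreteness, such a decision analysis might look like the following toy sketch (the prior, posterior update, costs, and benefits are all made-up numbers, just to show the shape of the calculation):

```python
from statistics import NormalDist

# Toy inputs (all hypothetical): a pilot gives a noisy effect estimate,
# which we fold into a skeptical prior before deciding on a full trial.
prior_mean, prior_sd = 0.0, 0.15   # skeptical prior on the true effect
pilot_est, pilot_se = 0.30, 0.35   # pilot point estimate and its standard error

# Standard normal-normal posterior update (precision-weighted average).
post_prec = 1 / prior_sd**2 + 1 / pilot_se**2
post_mean = (prior_mean / prior_sd**2 + pilot_est / pilot_se**2) / post_prec
post_sd = post_prec ** -0.5

# Decision rule: run the full trial only if the benefit of a truly positive
# effect, weighted by its posterior probability, beats the trial's cost.
p_positive = 1 - NormalDist(post_mean, post_sd).cdf(0)
benefit_if_real, trial_cost = 100.0, 40.0  # arbitrary units
expected_value = p_positive * benefit_if_real - trial_cost
decision = "run the larger study" if expected_value > 0 else "do nothing"
print(f"posterior mean {post_mean:.2f}, P(effect > 0) = {p_positive:.2f}")
print(f"expected value {expected_value:.1f} -> {decision}")
```

    Note how little the skeptical posterior moves from zero even with a large pilot point estimate; the decision hinges mostly on the prior and the cost-benefit numbers, which is exactly the part that rarely gets written down.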

    • Hence says:

      Daniel, the higher expected value of doing nothing is often overlooked…

      • Nice analogy. Yes I really wish lots of “scientists” would just go “do nothing” right away.

        I will say though that I disagree with the idea that there’s no universal method for inference. Inference comes from logic, and logic can be generalized to the mathematics of probability theory, and this is the universal method for doing inference… The thing is it’s like saying “use math” which is basically the same as the advice in your link to “use your brain”:

        think about the problem, describe your understanding of it using logic and mathematics, and then infer from that model and some data what you should think about the quantities whose quantitative values you weren’t sure of in your model.

        oh and by the way, don’t get too attached to your model either, think up some other alternative models and see what you’d infer for them as well.

        What there isn’t is a universal *push button* method for modeling. And unfortunately the history of statistics is a lot of selling of such snake oil.

        • Hence says:


          Thank you. So, yeah, brain and math are overly broad categories, and I’m afraid the word “method” ends up no longer applying. The *push button* method doesn’t exist; and the broad thing that infers (and is in some sense “universal”) isn’t a method.

          But that is just me using logic to make inferences about your inference about there being a universal method for inference…

  9. Hence says:

    Thank you for that, Andrew. Especially the minor modification of the interjection “No!”, which might have succeeded in leaving little room for ambiguity in your declaration of disagreement…

  10. Hence says:

    Since we’re on the topic of letters to authors of nutrition studies, I have recently written one. Remarkably, a randomized controlled trial in a field dominated by observational studies. Which is good.

    What I didn’t find so good is their use of “significant” – not just the testing itself, but the trail of semantic implications that the word seems to leave throughout the paper. I felt it should be addressed.

    Does my analysis seem fair?

    Readers here might want to complement with comments on other aspects of that paper (methodology and reporting), which just wasn’t my focus.
