I’m all confused!

Are our experiments too large or are they too small?

Seth Roberts says . . .

Seth makes a strong case for small-n studies–basically, a case for never doing large-n studies. If you understand the effect you’re studying, the large-n study isn’t necessary, and if you don’t understand the effect, the large-n study probably won’t work. This seems like a frontal attack on the “NIH paradigm” of performing formal controlled studies with sufficient power to have a high probability of detecting an effect of specified magnitude.

(See here for my related comments on statistical power.)

On the other hand, I wrote a paper myself once recommending that, for reasons of statistical power, one should generally not take measurements at an intermediate design point.

Esther Duflo says . . .

More recently, I was impressed by a talk by Esther Duflo, who argued strongly that, in studying interventions, it’s worth it to put in the extra effort to perform controlled experiments, in order to make a convincing case that can be recommended to policymakers. For a couple of her papers making this point, see here and here.

Moving in opposite directions

Seth Roberts and Esther Duflo seem to be moving in opposite directions: Seth, in psychology and health studies, moving away from formal experimentation, toward n=1 studies on himself and, gradually, others; and Duflo, in economic development, moving away from case studies toward randomized experiments.

Meanwhile, I remember John Carlin, another person I respect a lot, telling me that in the medical research community there is a movement toward larger studies, because the typical “NIH-type” study with 100 or 200 patients just doesn’t give enough information. Studies are inconclusive, requiring new studies and, eventually, extensive literature reviews to assess the data on a treatment that might better have been studied with one big experiment.
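Carlin's point can be made concrete with the standard sample-size arithmetic. A minimal sketch (my own illustration, not from the post), using the usual normal-approximation formula for comparing two arms:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sample z-test to
    detect a mean difference `delta` (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# A "small" standardized effect of 0.3 already needs about 175
# patients per arm -- more than the typical 100-200 total.
print(n_per_arm(0.3))  # 175
print(n_per_arm(0.5))  # 63
```

So a trial of 100-200 patients total is adequately powered only for fairly large effects, which is exactly why so many such studies come out inconclusive.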

What to think?

I don’t really know what to think about this, and I don’t really have a good framework for thinking about these issues, either. These questions seem important–huge, even–considering how closely they tie into the fundamental assumptions of the teaching and practice of applied statistics.


  1. Doug Yu says:

    I've recently started reading this blog, which I'm very much enjoying.

    If I might give a simplified example from ecology…

    My impression is that our sample sizes are dictated not only by funding (always the best predictor) but also by how close we are to the mechanism of interest, as opposed to the phenomenon created by that mechanism.

    For example, many ecologists are interested in tropical tree coexistence. For a long time, we knew very little about the mechanisms of coexistence, so some ecologists observed tree dynamics using plots of more than 100,000 trees. Now, we know more about the mechanisms (generated by observations of those large tree plots), and our experiments testing those mechanisms can have smaller sample sizes, because we are asking more precise questions.

    If we could truly see the detailed workings of, say, a trial medicine on a disease, then is it reasonable to suggest that n=1 might be sufficient?

  2. MDM says:

    I agree with Doug. This is from an article I wrote in 1994 (Deviating from the Mean): "Mosteller describes an experiment conducted in 1747, with sailors afflicted with scurvy. A physician administered one of six treatments (vinegar, sea water, cider, vitriol elixir, citrus fruit, and nutmeg) to two sailors each, for a total N of 12. Only the two who ate the citrus fruit were cured, in just a few days, leading the physician to believe that he had found the cure, which is a logical conclusion. But this result is doubtless not significant at the .05 level. [What would be the fate of a crime analyst who told the police chief, 'We only have two cases in which the victim was dismembered; this is too small an N to infer a pattern'?]"
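    As an aside, the significance claim in that quoted passage can actually be checked exactly. A sketch (my own calculation, assuming the 2-of-12 layout described): under the null hypothesis that the two cures fell at random among the twelve sailors, the chance that both land on the citrus pair is hypergeometric:

    ```python
    from math import comb

    # Margins of the 1747 trial as described: 12 sailors, 2 per
    # treatment, and exactly the 2 citrus-eaters were cured.
    # P(both cures land in the citrus pair), given 2 cures among 12:
    p = comb(2, 2) * comb(10, 0) / comb(12, 2)
    print(p)  # 1/66, about 0.015
    ```

    That comes to 1/66, roughly 0.015, so by Fisher's exact test this tiny trial would in fact clear the .05 bar; if anything, that strengthens the comment's broader point that a clear pattern in a small N can be decisive.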

    In other words, it depends on how much we know about the distribution of outcomes — if everyone who drinks cyanide ends up dead, and one person who ingests an antidote doesn't die, we know for certain that the antidote works (at least, for some).

  3. needleman says:

    I have been working recently on quality improvement in health care, where the two frames clash. The clinical trial model (randomization, n large enough to detect an effect at a selected effect size) clashes with the model of using PDSA (Plan-Do-Study-Act) cycles, small rapid tests of alternative approaches to delivering care. If the first cycles seem promising, those testing will run cycles with more patients, but rarely run tests with enough cases to find statistically significant differences before the decision is made to go to scale.

    The difference between these two approaches is in how they use theory and prior knowledge in deciding whether there is enough information on which to act.

    The clinical trial is fundamentally atheoretical, except for statistical theory. Allowing for differences in design based on when and what information is collected, the designs do not vary from test to test. No information on the phenomenon being studied is brought into the assessment of data in the pure randomized clinical trial once the outcome measure is chosen. Run the trial, measure the outcome, compute the confidence interval on the difference. If the CI doesn't include zero, the hypothesis of no association is rejected as unlikely.
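    That trial logic (compute the interval on the difference, reject "no association" if it excludes zero) can be sketched for two response proportions. A minimal sketch using a Wald normal-approximation interval; the counts below are hypothetical, chosen to show how a 100-per-arm trial can be inconclusive:

    ```python
    from math import sqrt
    from statistics import NormalDist

    def diff_ci(x1, n1, x2, n2, alpha=0.05):
        """Wald confidence interval for p1 - p2 (normal approximation)."""
        p1, p2 = x1 / n1, x2 / n2
        z = NormalDist().inv_cdf(1 - alpha / 2)
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        d = p1 - p2
        return d - z * se, d + z * se

    # Hypothetical trial: 30% response in treatment, 20% in control,
    # 100 patients per arm.
    lo, hi = diff_ci(30, 100, 20, 100)
    print(lo < 0 < hi)  # True: interval includes zero, so inconclusive
    ```

    Even a 10-point observed difference fails to exclude zero at this sample size, which is the mechanical core of the "inconclusive NIH-type study" problem discussed above.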

    By contrast, the PDSA cycle draws heavily on the theoretical and practical knowledge of those running the cycles. The cycle is run. Those running the cycle make a judgment of how what they observed differed from what would have been expected if the cycle were not run, based on their understanding of the process and circumstances (in the case of health care, the patient, their condition, and care delivery). In a colloquial, although not formally statistical, sense the process is Bayesian, drawing heavily on prior beliefs to influence the interpretation of the trials.

    The randomized clinical trial and PDSA cycle are at the extremes. Is there a way, by thinking about the role of strong theory or understanding of the processes being studied, that the two can be reconciled, or decisions made about how much testing is actually necessary before we draw a judgment about what we now know? (MDM provides an example in the scurvy trial.) This question is not an idle one. The clinical trial and rapid cycle quality improvement perspectives confront one another in many clinical settings. The issues also arise in assessing quasi-experimental studies, and in conducting evaluations in which the interventions are being modified based on intermediate observations of how the project is going.

  4. Seth Roberts says:

    I think Richard Doll has said that if an effect is large enough to be clinically useful you won't need a large study to show it works. He was speaking from experience. With small studies you can more easily discover effects and develop them — learn how to get the most bang for your buck. I think of ants looking for food. They don't look in packs — that would be silly. Once they find food, they go en masse. But NIH doesn't have the wisdom of ants. All these big studies to search for effects that have not been demonstrated in smaller studies.