
Don’t define reproducibility based on p-values

Lizzie Wolkovich writes:

I just got asked to comment on this article [“Genotypic variability enhances the reproducibility of an ecological study,” by Alexandru Milcu et al.]—I have yet to have time to fully sort out their stats, but the first thing that hit me about it was that they seem to be suggesting that a way to increase reproducibility is to increase some aspect that leads to important variation in the experiment (like genotypic variation in plants, which we know is important). But that doesn’t seem to make sense!

My response:

Regarding the general issue, I had a conversation with Paul Rosenbaum once about choices in design of experiments, where one can decide to perform: (a) a focused experiment with very little variation on x, which should improve precision but harm generalizability; or (b) a broader experiment in which one purposely chooses a wide range of x, which should reduce precision in estimation but allow the thing being estimated to be more relevant for out-of-sample applications. That sounds related to what’s going on here.

Regarding this particular paper, I am finding the details hard to follow, in part because they aren’t always so clear in distinguishing between data and parameters. For example, they write, “the net legume effect on mean total plant biomass varied among laboratories from 1.31 to 6.72 g dry weight (DW) per microcosm in growth chambers, suggesting that unmeasured laboratory-specific conditions outweighed effects of experimental standardization.” But I assume they are referring not to the effect but to the estimated effect, so that some of this variation could be explained as estimation error.

I also find it frustrating to read a paper about replication in which decisions are made based on statistical significance; for example, see lines 174-184 of text, and, even more explicitly, on lines 187-188: “To answer the question of how many laboratories produced results that were statistically indistinguishable from one another (i.e. reproduced the same finding) . . .”

Also there are comparisons of significance and non-significance, for example this: “Introducing genotypic CSV increased reproducibility in growth chambers but not in glasshouses,” followed by post-hoc explanations: “This observation is in line with the hypothesis put forward by Richter et al. . . .”

This is not to say that the claims in this paper are wrong, just that I’m finding it difficult to make sense of this paper and understand exactly what they mean by reproducibility, which is never defined in the paper.

Lizzie replied:

Yes, the theme of the paper seems to be, “When all you care about is an asterisk above your bargraph in one paper, but no asterisks when you compare papers.” They also do define reproducibility: “Because we considered that statistically significant differences among the 14 laboratories would indicate a lack of reproducibility….”

I guess what we’re saying here is that reproducibility is important, but defining it based on p-values is a mistake that kinda sends you around in circles.


  1. bill raynor says:

    Well, what did you and Paul conclude, beyond “it depends”?

    • Andrew says:


      This was all Paul’s work. He had a conclusion that made a lot of sense, but now I’m forgetting what his conclusion was, or where he wrote it up! I think it’s in one of his books.

      • David Bailey says:

        In his article “Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies,” Rosenbaum asks:
        “Does reducing the heterogeneity of experimental units strengthen causal claims? Or does reducing the heterogeneity without randomizing simply reduce the standard error of a biased estimator?”
        and he concludes:
        “In observational studies, reducing heterogeneity reduces both sampling variability and sensitivity to unobserved bias …. In contrast, increasing the sample size reduces sampling variability, which is, of course useful, but it does little to reduce concerns about unobserved bias.”

  2. Kleber says:

    I’m convinced p-values/significance are not good criteria here. Where I’m usually stuck is in how to do otherwise in a sensible way. What would be a better way to judge replication success?

  3. Marcel van Assen says:

    Andrew, I believe you made an error.

    By increasing heterogeneity one increases BOTH precision and generalizability wrt estimating the average effect. This is because the standard error of the average effect equals (in a simple design) sqrt( (tau2 + sigma2/n)/K ), with n the sample size of one experiment, K the number of experiments with different settings/manipulations of the same variable, and tau2 the heterogeneity of the effect size due to different settings/manipulations. As you can see from the formula, choosing only ONE setting can never reduce the se to a value below tau. However, precision is more easily increased by increasing K. So ONE experiment with n = 1,000,000 results in a less precise estimate than three studies with n = 100, even if heterogeneity is not that large… and with multiple studies you can check generalizability, which is of course impossible with one experiment of n = 1,000,000.

    From this perspective, the often-heard statement “we need more powerful studies” is misplaced, and might better be replaced by “we need more studies explicitly varying settings/manipulations,” to obtain more knowledge about an effect and how it depends on other factors.
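Marcel’s formula can be checked numerically. A minimal sketch, using hypothetical values tau = 0.5 (between-setting heterogeneity) and sigma = 2 (within-study standard deviation); the function name and numbers are illustrative, not from the paper:

```python
import math

def se_avg_effect(tau2: float, sigma2: float, n: int, k: int) -> float:
    """Standard error of the estimated average effect under the simple
    random-effects design in the comment: sqrt((tau2 + sigma2/n) / k)."""
    return math.sqrt((tau2 + sigma2 / n) / k)

# Hypothetical values: tau = 0.5, sigma = 2.
tau2, sigma2 = 0.25, 4.0

one_huge = se_avg_effect(tau2, sigma2, n=1_000_000, k=1)
three_small = se_avg_effect(tau2, sigma2, n=100, k=3)

# With k = 1 the se is pinned near tau no matter how large n gets,
# while three modest studies get below it.
print(round(one_huge, 3))    # ~0.5
print(round(three_small, 3)) # ~0.311
```

The single study of a million units bottoms out at se ≈ tau = 0.5, whereas three studies of 100 units each reach se ≈ 0.31, illustrating why increasing K pays off more than increasing n once heterogeneity matters.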

  4. Vagueness and ambiguity both implicated.
