“The idea of replication is central not just to scientific practice but also to formal statistics . . . Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects.”

Rolf Zwaan (whom we last encountered here in “From zero to Ted talk in 18 simple steps”), Alexander Etz, Richard Lucas, and M. Brent Donnellan wrote an article, “Making replication mainstream,” which begins:

Many philosophers of science and methodologists have argued that the ability to repeat studies and obtain similar results is an essential component of science. . . . To address the need for an integrative summary, we review various types of replication studies and then discuss the most commonly voiced concerns about direct replication. We provide detailed responses to these concerns and consider different statistical ways to evaluate replications. We conclude there are no theoretical or statistical obstacles to making direct replication a routine aspect of psychological science.

The article was published in Behavioral and Brain Sciences, a journal that runs articles with many discussants (see here for an example from a few years back).

I wrote a discussion, “Don’t characterize replications as successes or failures”:

No replication is truly direct, and I recommend moving away from the classification of replications as “direct” or “conceptual” to a framework in which we accept that treatment effects vary across conditions. Relatedly, we should stop labeling replications as successes or failures and instead use continuous measures to compare different studies, again using meta-analysis of raw data where possible. . . .

I also agree that various concerns about the difficulty of replication should, in fact, be interpreted as arguments in favor of replication. For example, if effects can vary by context, this provides more reason why replication is necessary for scientific progress. . . .

It may well make sense to assign lower value to replications than to original studies, when considered as intellectual products, as we can assume the replication requires less creative effort. When considered as scientific evidence, however, the results from a replication can well be better than those of the original study, in that the replication can have more control in its design, measurement, and analysis. . . .

Beyond this, I would like to add two points from a statistician’s perspective.

First, the idea of replication is central not just to scientific practice but also to formal statistics, even though this has not always been recognized. Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects—and in the analysis of replication studies it is important for the model to allow effects to vary across scenarios.

My second point is that in the analysis of replication studies I recommend continuous analysis and multilevel modeling (meta-analysis), in contrast to the target article, which recommends binary decision rules that I think are contrary to the spirit of inquiry that motivates replication in the first place.
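To give a concrete sense of what this kind of continuous, multilevel comparison looks like, here is a minimal sketch in Python. The numbers are made up for illustration (they are not from any actual replication project), and the simple moment-based random-effects calculation stands in for the fuller hierarchical model one would fit in practice:

```python
# Minimal sketch: pool an original study and its replications with a
# random-effects meta-analysis and report the average effect and the
# between-study variation, instead of a binary success/failure verdict.
# All numbers below are made up for illustration.
import numpy as np

# Hypothetical effect estimates and standard errors: original study first,
# then three replications.
est = np.array([0.48, 0.12, 0.05, 0.20])
se  = np.array([0.20, 0.10, 0.12, 0.15])

# DerSimonian-Laird moment estimate of the between-study variance tau^2.
w_fixed = 1.0 / se**2
mu_fixed = np.sum(w_fixed * est) / np.sum(w_fixed)
Q = np.sum(w_fixed * (est - mu_fixed)**2)
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - (len(est) - 1)) / c)

# Random-effects pooling: effects are allowed to vary across studies
# rather than being forced to a single common value.
w = 1.0 / (se**2 + tau2)
mu = np.sum(w * est) / np.sum(w)
mu_se = np.sqrt(1.0 / np.sum(w))

print(f"pooled effect: {mu:.3f} +/- {mu_se:.3f}")
print(f"between-study sd (tau): {np.sqrt(tau2):.3f}")
```

In a real analysis I would rather fit a full multilevel model to the raw data (for example in Stan), but the point of the sketch is the same: the studies are compared on a continuous scale, and the effects are allowed to vary across studies.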

Jennifer Tackett and Blake McShane wrote a discussion, “Conceptualizing and evaluating replication across domains of behavioral research,” which begins:

We discuss the authors’ conceptualization of replication, in particular the false dichotomy of direct versus conceptual replication intrinsic to it, and suggest a broader one that better generalizes to other domains of psychological research. We also discuss their approach to the evaluation of replication results and suggest moving beyond their dichotomous statistical paradigms and employing hierarchical / meta-analytic statistical models.

Also relevant is this talk on Bayes, statistics, and reproducibility from earlier this year.

28 thoughts on “The idea of replication is central not just to scientific practice but also to formal statistics . . . Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects.”

    • Replication should be restricted to referring to repeating empirical studies and not math (which is abstract).

      Simulation is just math done by computer – with the same random seeds and an equivalent program implementation – everyone should get the exact same answer for each “replication”.

      Now, let the random seed be fixed on the first run or varied haphazardly, and one may then say it’s an emulation of empirical-study replication under ideal conditions with fixed estimands.
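      For concreteness, a minimal toy sketch (illustrative Python; the function and numbers are made up, not from any particular study):

```python
# Toy illustration of the point above: with the same seed and the same
# program, every run of a simulation returns the exact same answer;
# letting the seed vary gives something closer to an emulation of
# repeating an empirical study under ideal conditions.
import numpy as np

def simulate_mean_estimate(seed, n=100, true_effect=0.3):
    """One toy 'study': draw n observations and return the estimated effect."""
    rng = np.random.default_rng(seed)
    y = rng.normal(loc=true_effect, scale=1.0, size=n)
    return y.mean()

# Same seed: identical results on every run ("duplication").
print(simulate_mean_estimate(seed=1), simulate_mean_estimate(seed=1))

# Different seeds: results vary around the fixed estimand ("emulation"
# of replicating the study under ideal, identical conditions).
print([round(simulate_mean_estimate(seed=s), 3) for s in range(5)])
```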

      • Keith:

        I’ve sometimes talked about “replication” vs. “duplication,” but I think we have to accept that the term “replication” is out there and will be used to refer to all sorts of things. So we’ll just have to clarify what we mean by replication, each time we use the word.

      • Your last paragraph is what I mean.

        It’s reasonably common to attempt to directly replicate existing simulation studies – because you don’t believe the results and/or are using previous work as a springboard to further data-generating mechanisms, new methods, etc. This is often just based on what people have reported, so your starting seed is different.

        Are you saying ’emulation’ of an empirical experiment because of pseudo-random (not ‘truly’ random) numbers?

        • > saying ’emulation’ of an empirical experiment because of pseudo-random (not ‘truly’ random) numbers?
          No (and, as Andrew pointed out, I am being a bit pedantic here), the pseudo-random issue is minor.

          Rather, the empirical study is being represented by the simulation, so it can only be an emulation, or, as Leonard Cohen poetically put it ;-)
          “The word butterfly is not a real butterfly. There is the word and there is the butterfly. If you confuse these two items people have the right to laugh at you.”

        • Andrew, sounds interesting. Where have you talked about ‘replication’ vs. ‘duplication’?

          Keith, I think your Cohen quote says we are thinking of different things.

          I am talking about some existing simulation study being the thing you are trying to replicate. It is the empirical experiment* (butterfly), in that the results are subject to Monte Carlo error and not exact (for a finite number of repetitions). You can try to mimic it and see if you get (about) the same results.

          Your butterfly-vs.-the-word-‘butterfly’ quote suggests you are talking about looking at some existing applied research (not involving simulation) and doing simulation to try and… do something relating to replication?
          Can you clear this up for me?

          *Some people complain that pseudo-random numbers mean sim studies are not ‘real’ empirical experiments and happily produce randomisation lists using them
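          To make the Monte Carlo error point concrete, a minimal toy sketch (illustrative Python; the simulation study and all numbers are made up):

```python
# Toy illustration: a reported simulation study is itself subject to
# Monte Carlo error, so a re-run from a different starting seed should
# agree with it only up to that error, not exactly.
import numpy as np

def simulation_study(seed, n_reps=2000, n=50, true_effect=0.3):
    """Toy simulation study: estimate the power of a one-sided z-test."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        y = rng.normal(loc=true_effect, scale=1.0, size=n)
        z = y.mean() * np.sqrt(n)  # z-statistic with known sd = 1
        rejections += int(z > 1.645)
    power_hat = rejections / n_reps
    mcse = np.sqrt(power_hat * (1 - power_hat) / n_reps)  # Monte Carlo SE
    return power_hat, mcse

original = simulation_study(seed=123)  # the "published" study
rerun = simulation_study(seed=456)     # attempted replication, new seed
print("original (power, MCSE):", original)
print("re-run   (power, MCSE):", rerun)
# The two estimates should differ by roughly the size of their combined
# Monte Carlo standard errors, not match exactly.
```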

        • We are largely just mis-communicating – the Cohen quote refers to using simulations to understand what should be happening when empirical studies are repeatedly done given everything stays the same.

          > I am talking about some existing simulation study being the thing you are trying to replicate.
          That is what I consider math – where the representation itself is just taken to be what is being represented. Like when we sketch a triangle: the angles will never add exactly to 180, but it still represents an abstract object whose angles do add exactly to 180 (followed by an unending number of 0 decimal places).

  1. I think that it is usually more valuable to “repeat” experiments with differences as opposed to their being “exactly” the same. The variations test the generality of the results. Yes, I see that it is useful to repeat “exactly” to get a gauge as to whether some result may have been a statistical fluke. But can a single or few repetitions really accomplish that anyway? If you have two experiments (that were repetitions) that get different results, which one should you rely on? You need to have some other information (perhaps a well-established theory or a reliable prior) to go on.

    • Very good observations. Yes, a well-established theory. But how does one go about assessing the merit of a theory or theories? This doesn’t seem to be a high priority at this juncture, judging from the emphases on the quantitative side. Theory building strikes me as a harder challenge today than, say, it was 40 years ago.

    • I think that from a design-of-experiments perspective there is an argument to be made for doing somewhat different experiments, rather than direct replications. In a perfect world, we would explore the design space by doing different experiments while simultaneously reducing uncertainty. In the real world, I think this may be difficult because the reported uncertainty estimates for an effect size may not be very accurate (e.g., due to “context”). Therefore, it may be difficult for a meta-analysis to synthesize results from experiments that are not direct replications.

      • This seems to miss the most important point: If the design used in the original experiment was flawed, and if there is a better design that is feasible to use, then it’s silly to do a replication using the original flawed design; it only makes sense to do a new experiment with the best design possible. (See Justin Smith’s comment below and my reply below it for an example from physics illustrating this.)

        • I think the deflection of the light around the sun is basically a point estimate (true value is not changing with time). If you are interested in reducing uncertainty about a single point then I would agree that you continue to refine the design and do improved replications. There may be other variables that affect the result, but they are basically nuisance parameters. If the “original experiment was flawed” then the new results largely take precedence over the old.

          The situation I was considering was one where we want to know how some effect varies across conditions. Doing a replication under the original conditions might be considered a waste of resources. Several papers measure the effect in different conditions. None of the experiments are flawed, but they are all slightly different. Eventually a meta-analysis could (theoretically) be performed to synthesize the results, not at a single point, but across a range of conditions (a multidimensional space). This would reduce uncertainty, but also provide a broader understanding (i.e., explore the design space).

        • Yes, if the purpose is to see how the effect varies over conditions, then of course one must measure under varying conditions. But my impression is that the term “replication” generally refers to the situation where a summary statistic (e.g., a mean) is being estimated over a population, which usually involves some varying conditions. And the point of not replicating a poor design is still important, even if one is looking at varying conditions.

        • Nat: Sorry, I had not seen this when I posted my comment (it was held up in the queue for a bit).

          > Eventually a meta-analysis could (theoretically) be performed to synthesize the results, not at a single point, but across a range of conditions (multidimensional space).
          Yup definitely the way forward and no reason not to attempt doing it.

          Martha is right about what the term “replication” generally refers to, but that is just part of the problem – it needs to change.

      • Nat – more purposeful meta-analyses do not synthesize just the average result of the somewhat arbitrary set of trials that have been conducted to date (which varied in design and quality of execution) but rather strive to do effect-surface estimation to extrapolate to ideal designs with the highest quality of execution. This is known as Rubin’s effect-surface approach to meta-analysis, and it would be better enabled by your design suggestions above.

        Unfortunately, most statisticians want to hear no variation, see no variation, and speak no variation other than what an extremely inefficient, very noisy random-effects approach will allow for, or they hope to define a weighted average of the haphazard selection of studies done so far that would be relevant for some population.

        See here http://statmodeling.stat.columbia.edu/2017/11/01/missed-fixed-effects-plural/ or here http://statmodeling.stat.columbia.edu/2017/10/05/missing-will-paper-likely-lead-researchers-think/
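        To make the effect-surface idea concrete, a minimal sketch (Python with made-up study summaries; a real analysis would use a hierarchical model and ideally the raw data): regress the study-level estimates on a condition variable and a quality score, then extrapolate to the condition of interest at the highest quality of execution.

```python
# Minimal sketch of effect-surface estimation: instead of averaging over
# whatever studies happen to exist, model the effect as a function of
# design/condition variables and study quality, then extrapolate to an
# ideal design with the highest quality of execution.
# All numbers below are made up for illustration.
import numpy as np

# Hypothetical study-level summaries: estimate, standard error, a
# condition variable (e.g., dose or context), and a quality score in [0, 1].
est     = np.array([0.50, 0.35, 0.20, 0.15, 0.30])
se      = np.array([0.20, 0.15, 0.10, 0.10, 0.12])
cond    = np.array([1.0,  0.8,  0.5,  0.3,  0.6])
quality = np.array([0.4,  0.5,  0.9,  0.8,  0.6])

# Inverse-variance weighted least squares on (intercept, condition, quality).
X = np.column_stack([np.ones_like(est), cond, quality])
sw = np.sqrt(1.0 / se**2)
beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * est, rcond=None)

# Extrapolate the surface to a chosen condition at "ideal" quality = 1.
target = np.array([1.0, 0.7, 1.0])
print("surface coefficients (intercept, condition, quality):", np.round(beta, 3))
print("extrapolated effect at condition 0.7, ideal quality:", round(float(target @ beta), 3))
```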

        • Thanks for the relevant links, Keith. The Rubin effect-surface approach sounds like the basic idea I was trying to describe. However, I had not considered how the “study quality factors” fit into this approach, which is very important, as pointed out by Martha.

    • Yes, I see that it is useful to repeat “exactly” to get a gauge as to whether some result may have been a statistical fluke. But can a single or few repetitions really accomplish that anyway? If you have two experiments (that were repetitions) that get different results, which one should you rely on?

      You should rely on none of them. But yes, the end result of this is no one ever being proved wrong since “my rats were 8 weeks old and yours were 6 weeks old,” etc. Then it becomes a thing to study why “6- vs. 8-week-old rats are so different.”

  2. Andrew,

    My Norton anti-virus wanted to block this as a ‘fraudulent web page’. I do not know if you can do anything about it.

    I obviously overrode it. Dumb software.

  3. I’ve read the paper by Zwaan et al. and subsequently wondered whether the peer review for it was blinded or not (I still keep finding it amazing that some, or perhaps even lots (?), of journals don’t even keep the author information hidden from the peer reviewers).

    I don’t really see what I should take from the paper, and I wonder why so many folks seem to have written their “replies” to this. To me, this paper is about 5 years past its possible usefulness and/or relevance.

    Anyway, here’s a blog post by Zwaan about how (what he calls) “concurrent” replications can be used in psychological science, which I found to be much more interesting, useful, and relevant at this point in time:

    https://rolfzwaan.blogspot.com/2017/05/concurrent-replication.html

    Should he (and/or his colleagues) think about possibly writing a paper about this, here are 5 links to some stuff that might be relevant and/or useful for that:

    1) http://statmodeling.stat.columbia.edu/2017/12/17/stranger-than-fiction/#comment-628652

    2) http://statmodeling.stat.columbia.edu/2017/12/17/stranger-than-fiction/#comment-630283

    3) http://statmodeling.stat.columbia.edu/2018/06/26/psychological-science-accelerator-probably-good-idea-im-still-skeptical/#comment-774026

    4) http://statmodeling.stat.columbia.edu/2018/06/26/psychological-science-accelerator-probably-good-idea-im-still-skeptical/#comment-773815

    5) http://statmodeling.stat.columbia.edu/2018/06/26/psychological-science-accelerator-probably-good-idea-im-still-skeptical/#comment-775297

  4. “Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects…” How do a reference set and a population of effects differ? The one seems too narrow, the other seems too loose, and otherwise it’s not clear how they’re not the same thing.

    If you view science as the current best knowledge of the way things are, something indistinguishable in practice from the current best practices in determining how we know the way things are, the replication problem isn’t just repeating the same experiments and getting the same results. It’s also getting the same results from different experiments, finding the same thing using different procedures. Even more, it’s also finding results that fit with all the other knowledge of the way things are. Not in the narrow sense of a neatly schematic unity of science, which doesn’t exist now, if it ever did, but in the wider sense of the unity of nature. Parapsychologists worked on repeating results for decades but forgot that communicating information faster than light didn’t fit with the knowledge of nature we already possessed.

    Well, at least that seems the same to me.

    • “the replication problem isn’t just repeating the same experiments and getting the same results. It’s also getting the same results from different experiments, finding the same thing using different procedures.”

      This requires the proviso that the different experiments/different procedures are all equally sound. The example Justin linked to is a situation where the initial experiments/procedures were flawed; the later, more sound experiments/procedures gave more credible estimates precisely because they were more sound.

      • Yes, but at some point the whole reason it’s deemed knowledge is the continued replication of experiments, which implies methods validly discovering the effect/phenomenon. If science is supposed to be refutation of conjectures, then a false negative prematurely terminates investigation.

        If science is the investigation of what is, then inaccurate positives for a real effect/phenomenon in the end serve to increase error bars and delay consensus. Inaccurate negatives will not be replicated, which makes for progress. If you squint, I think you can see this as what Popper meant. It does seem to me that there are certain experiments that succeed in changing the priors (if I use the jargon correctly). Excellent examples would be Pasteur’s experiments on spontaneous generation, de Lavoisier’s experiments measuring mass before and after reactions, van Helmont’s experiments measuring the mass of soil and water and plant before and after growth, and Torricelli’s excursion to a mountaintop with a barometer. Contra Popper, biogenesis, conservation of mass, plants using some part of air as a nutrient, and the absence of atmosphere beyond the earth were the new priors. And other possibilities were neglected, ignoring the supposedly provisional nature of science.

    • Thanks, Sameera! Here is our abstract – comments on the paper are welcome.

      Inferential Statistics are Descriptive Statistics

      There has been much discussion of a “replication crisis” related to statistical inference, which has largely been attributed to overemphasis on and abuse of hypothesis testing. Much of the abuse stems from failure to recognize that statistical tests not only test hypotheses, but countless assumptions and the entire environment in which research takes place. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. Considerable non-replication is thus to be expected even with the best reporting practices, and generalizations from single studies are rarely if ever warranted. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what data resulted, what analysis methods were used and why, and what problems occurred.

      https://peerj.com/preprints/26857

      • Of course, Valentin. I read the articles you post on your Twitter account.

        Very logical analysis, in keeping with the principles and practices of the Open Science network. I think the change of title reflected in this article is timely & relevant.

        Preprints, preregistrations, and blogs are critical toward improving science. I come at it from the international relations & national security arena, which must increasingly rely on rigorous empiricism as well; these are two arenas where assumptions are not sufficiently excavated & precisely spelled out. It speaks to the fact that the qualitative side sometimes lags.
