A completely reasonable-sounding statement with which I strongly disagree

In the context of a listserv discussion about replication in psychology experiments, someone wrote:

The current best estimate of the effect size is somewhere in between the original study and the replication’s reported value.

This conciliatory, split-the-difference statement sounds reasonable, and it might well represent good politics in the context of a war over replications—but from a statistical perspective I strongly disagree with it, for the following reason.

The original study’s estimate typically has a huge bias (due to the statistical significance filter). The estimate from the replicated study, assuming it’s a preregistered replication, is unbiased. I think in such a setting the safest course is to use the replication’s reported value as our current best estimate. That doesn’t mean that the original study is “wrong,” but it is wrong to report a biased estimate as if it’s unbiased.
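To see the size of this bias, here is a quick simulation sketch with made-up numbers (a small true effect of 0.1 and a standard error of 0.1 for each study), not taken from the study under discussion:

```python
# Toy simulation (assumed numbers) of the statistical significance filter:
# original estimates are "published" only if they reach p < 0.05, while the
# preregistered replications are reported no matter what.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1      # assumed small true effect
se = 0.1               # assumed standard error of each study
n_sims = 100_000

original = rng.normal(true_effect, se, n_sims)
replication = rng.normal(true_effect, se, n_sims)

# The filter: keep only the original studies that are statistically significant.
significant = np.abs(original) > 1.96 * se

print(original[significant].mean())     # about 0.25: more than double the true effect
print(replication[significant].mean())  # about 0.10: the filter never touches the replication
```

Under these assumed numbers, the significance-filtered original estimates average around 0.25, while the replications of those same studies average around the true 0.1.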

And this doesn’t even bring in the possibility of an informative prior distribution, which in these sorts of examples could bring the estimate even closer to zero.

20 thoughts on “A completely reasonable-sounding statement with which I strongly disagree”

  1. Shouldn’t we also be concerned that there is a null finding filter on replications? A replication that contradicts early evidence is more novel or interesting than a replication that produces the same result. I wouldn’t be comfortable assuming that either effect size is unbiased.

    • Thomas:

      As I wrote above, my statement about unbiasedness assumes the replication is preregistered. My impression is that when a replication is preregistered, it will be reported however it goes. In general I think the garden of forking paths and the statistical significance filter are a much bigger deal than selection of what to report. Once people go to the trouble of running a study, they’ll usually want to publish it in some way.

      Also, on the particular point, I disagree with your claim that a replication that contradicts early evidence is more novel or interesting than a replication that produces the same result. It depends on the claim. For example, a positive preregistered replication of Daryl Bem’s claims would be much more novel or interesting than a negative result. Indeed, it was my impression that one reason the negative replications were hard to get published was that they were no surprise.

      • Blob:

        The replication discussed in the listserv was preregistered, but it’s my impression that most replications are not preregistered. Then it’s a different story. You can even get cases where a researcher claims as a “replication” an experiment that was performed eight years earlier.

        • What if both the original study & the replication were pre-registered? Should we still use the replication’s result as our best estimate?

        • Rahul:

          If the original study were preregistered, then its estimate would be unbiased. But in the examples being discussed, the original study was never preregistered.

  2. In my experience, through many rejections of my replication work, most editors think that it is not enough to point to some error or omission in a previous study that materially affects the conclusion. They want new theory, new data, “important stuff.”

    In my view this is unscientific. But it is consistent with a convention whereby each published paper claims a new discovery. (If that were true, and at a rate of millions of papers published every year, then we would have solved all the world’s problems by now.)

    I think part of the problem is that most editors misunderstand what a replication standard ought to be. The current practice is to say that it involves making replication materials accessible to other scientists. My view is that, in addition, a replication standard must involve editors taking responsibility for what they publish. Thus, if the latest breakthrough you published turns out to be based on some error, then you have a duty to report this to your readers. Sharing replication data is necessary but not sufficient.

    • Fernando:

      I agree, and that’s one reason I prefer to speak of a “criticism crisis” rather than a “replication crisis.” There seems to be a lot of resistance to publishing findings of flaws in published papers. I discussed this in my 2013 paper, It’s too hard to publish criticisms and obtain data for replication.

    • That’s because most editors primarily care about the popularity & impact of their journal and not so much about whether or not they publish a fundamentally “correct” result.

      In many cases the conclusions in either direction are so relatively unimportant (e.g. whether women wear pink at peak fertility) that no substantial harm comes from publishing the wrong yet sensationalist result. And ergo there is no desire to change that.

      If not publishing errata would result in, say, bridges collapsing or reactors exploding, I am sure the editors would be more responsive & promptly attend to failed replications.

    • Everything I’ve ever submitted with the word “replicate” or “replication” in it has been rejected, with something like “not contributing enough knowledge” cited in the reviews. This is despite the fact that all such papers I’ve written have started with replications and followed them with additional analysis that builds on the replicated stuff. Reframing “replication” as “establishing consistency” or something along those lines has been fine. This is small-sample, anecdotal data (probably three or four papers), but it sends a clear message that my field is all talk when it comes to fixing the problems associated with publication.

  3. Naive question: Why does an informative prior distribution bring the estimate even closer to zero?

    Why couldn’t it push the estimate further away from zero? Wouldn’t it depend case to case?

    • Rahul, you’re right in general, but I think the cases that Andrew’s referring to – correct me if I’m wrong! – are ones where our prior information would suggest that we anticipate small effects, if any.

      • Jonah:

        Yes. And, even more generally, if there is really strong prior information that an effect should be approximately X, then any new experiment will typically be looking for changes from X, and the prior will pull toward zero any estimates of the change.
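        As a toy illustration (all numbers invented): with a raw estimate of 0.30, a standard error of 0.15, and an informative prior centered at zero with standard deviation 0.10, the usual normal-normal calculation pulls the estimate most of the way toward zero:

```python
# Toy normal-normal shrinkage calculation (assumed numbers, not from the thread).
raw_estimate = 0.30   # reported effect estimate (assumed)
se = 0.15             # its standard error (assumed)
prior_sd = 0.10       # informative prior: effect ~ Normal(0, 0.10^2) (assumed)

# Conjugate posterior mean: precision-weighted average of the prior mean (0)
# and the raw estimate.
shrinkage = prior_sd**2 / (prior_sd**2 + se**2)
posterior_mean = shrinkage * raw_estimate
posterior_sd = (1 / (1 / prior_sd**2 + 1 / se**2)) ** 0.5

print(posterior_mean)  # about 0.09: pulled most of the way toward zero
print(posterior_sd)    # about 0.08
```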

  4. Rahul:

    I agree. That is why I think much social science, as practiced today, is inconsequential.

    Indeed it points to a fundamental contradiction. The editors think they are publishing “important stuff”. Yet, by not having the type of replication policy I described above, they are acting as if the published work is totally inconsequential. And such actions speak louder than words.

    So here is one standard of an “important” work: if we publish this, and then someone shows it is likely wrong, will we put it on the front page of our next issue?

    Personally, I believe political science is hugely consequential. Yet, in my experience, many political scientists act as if it is completely inconsequential. Broken bridges are small potatoes compared to broken institutions that condemn whole generations to poverty, hardship, and disease.

    • I totally agree with your point that broken bridges are small potatoes compared to broken institutions.

      The question is whether anything that academic political science does has much of an impact on these institutions.

      OTOH, I’m not singling out Poli. Sci. There are a lot of other areas that have become like that.

      Even more generally, areas of the hard sciences have the same inconsequentiality problem: if what you are publishing is (a) a tiny effect, an incremental advance, or a weakly powered, noisy study, and (b) the consequences of choosing the “wrong” path at the fork are subtle, diffuse, unclear, and delayed (e.g. the efficacy of various drug-eluting stents), then the cost to the editors of publishing a “wrong” result is fairly low. E.g., does red wine make you live longer?

      Again, editors prioritize short-term newsworthiness over long term “correctness”.

  5. “I think in such a setting the safest course is to use the replication’s reported value as our current best estimate.”

    That does sound completely reasonable.

    But at the same time it seems that some might object to “throwing out” the data from the initial study.

    I was wondering if you think there might be ways to *estimate* the bias due to the statistical significance filter in the previous study, and then combine the estimates in the spirit of a meta-analysis?

    This seems like a hard problem, yes. I suspect it would require a careful examination of actual and potential departures from what a preregistered experimental plan and analysis would have looked like, and a lot of modeling from there. But perhaps someone has thought about a somewhat ad hoc “shrinkage” ratio to account for the number of “forking paths”?
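    For concreteness, here is the kind of toy calculation I have in mind (all numbers are invented, and the bias prior is exactly the quantity such an examination would have to justify):

```python
# Rough sketch (invented toy numbers) of combining an original, significance-
# filtered estimate with a preregistered replication while explicitly modeling
# the original's bias, in the spirit of a meta-analysis.
orig_est, orig_se = 0.40, 0.15   # original study (selected for significance)
rep_est, rep_se = 0.10, 0.10     # preregistered replication
bias_mean, bias_sd = 0.15, 0.10  # assumed prior on the original's upward bias

# Subtract the expected bias, inflate the original's variance by the bias
# uncertainty, then combine by inverse-variance weighting.
adj_est = orig_est - bias_mean
adj_var = orig_se**2 + bias_sd**2
w_orig, w_rep = 1 / adj_var, 1 / rep_se**2
combined = (w_orig * adj_est + w_rep * rep_est) / (w_orig + w_rep)
combined_se = (1 / (w_orig + w_rep)) ** 0.5

print(combined, combined_se)  # roughly 0.14 +/- 0.09: much closer to the replication
```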

    • JD: “then combine the estimates in the spirit of a meta-analysis? This seems like a hard problem”

      I do think it’s hard, and I did argue that one would need a credible informative prior on the direction and size of the bias in Meta-Analysis: Conceptual Issues of Addressing Apparent Failure of Individual Study Replication or “Inexplicable” Heterogeneity (2001). http://link.springer.com/chapter/10.1007/978-1-4613-0141-7_11#page-1 I used the term sweet and sour apples, sour being biased.

      I’m not aware of any serious challenges to this claim (which can be verified with a simple simulation). It often arises when folks consider the use of historical controls, so it’s important, to say the least.

      I’ll share a recent exchange with a very experienced Bayesian statistician who has worked in drug development.

      > [me] … presented a talk on using historical data in orphan drug evaluation and mentioned downweighting the historical data somehow. (Sorry, I should have made a note of the presenter’s name.)

      But it reminded me that we discussed this issue at JSM 2013, and I was trying to argue that downweighting was not good enough but that the direction and size of the bias had to be explicitly dealt with.
      (For instance, commensurate priors don’t, as they assume the non-exchangeability is symmetric.)

      > [Them] Regarding your substantive point, I agree. Assuming the information about bias is symmetric, then discounting would accommodate its uncertainty. If it is asymmetric then a greater amount of discounting may suffice, or of course one can incorporate an informative prior for the bias.

      So I would appreciate hearing about other points of view here.
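      To make the distinction concrete, here is a toy numerical check (all numbers invented): symmetric discounting of a biased historical estimate shrinks the resulting bias but never removes it, whereas modeling the direction and size of the bias directly can.

```python
# Toy check (assumed values): expected combined estimate when a historical
# estimate carries a systematic bias and is combined with an unbiased current
# estimate under increasing amounts of symmetric discounting.
true_effect = 0.10
hist_bias = 0.30                  # assumed systematic bias of the historical estimate
hist_var, new_var = 0.15**2, 0.10**2

for alpha in (1.0, 0.5, 0.2, 0.05):   # discount factor applied to the historical data
    w_hist, w_new = alpha / hist_var, 1 / new_var
    expected = (w_hist * (true_effect + hist_bias) + w_new * true_effect) / (w_hist + w_new)
    print(f"alpha={alpha}: expected combined estimate = {expected:.3f}")

# alpha=1.0 gives about 0.19 and alpha=0.05 still about 0.11: the bias shrinks
# but never vanishes. If instead the bias is modeled with a (correct) prior mean
# of 0.30 and subtracted, the historical estimate recenters at 0.10 and the
# combined expectation equals the true 0.10.
```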
