We have a ways to go in communicating the replication crisis

I happened to come across this old post today with this amazing, amazing quote from a Harvard University public relations writer:

The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.

This came up in the context of a paper by Daniel Gilbert et al. defending the reputation of social psychology, a field that has recently been shredded—and rightly so—by revelations of questionable research practices, p-hacking, gardens of forking paths, and high-profile failed replications.

When I came across the above quote, I mocked it, but in retrospect I think it hadn’t disturbed me enough. The trouble was that I was associating it with Gilbert et al.: those guys don’t know a lot of statistics so it didn’t really surprise me that they could be so innumerate. I let the publicist off the hook on the grounds that he was following the lead of some Harvard professors. Harvard professors can make mistakes or even be wrong on purpose, but it’s not typically the job of a Harvard publicist to concern himself with such possibilities.

But now, on reflection, I’m disturbed. That statement about the 100% replication rate is so wrong, it’s so inane, I’m bothered that it didn’t trigger some sort of switch in the publicist’s brain.

Consider the following statements:

“Harvard physicist builds perpetual motion machine”

“Harvard biologist discovers evidence for creationism”

That wouldn’t happen, right? The P.R. guy would sniff that something’s up. This isn’t the University of Utah, right?

I’m not saying Harvard’s always right. Harvard has publicized the power pose and all sorts of silly things. But the idea of a 100% replication rate, that’s not just silly or unproven or speculative or even mistaken: it’s obviously wrong. It’s ridiculous.

But the P.R. guy didn’t realize it. If a Harvard prof told him about a perpetual motion machine or proof of creationism, the public relations officer would make a few calls before running the story. But a 100% replication rate? Sure, why not, he must’ve thought.

We have a ways to go. We’ll always have research slip-ups and publicized claims that fall through, but let’s hope it’s not much longer that people can claim 100% replication rates with a straight face. That’s just embarrassing.

P.S. I have to keep adding these postscripts . . . I wrote this post months ago, it just happens to be appearing now, at a time in which we’re talking a lot about the replication crisis.

42 thoughts on “We have a ways to go in communicating the replication crisis”

  1. If it had been on purpose, this would have been a clever and ironic response to the problems with hypothesis testing. The response to your post may be: “Well, maybe social psychology doesn’t have a 100% replication rate overall, but I have identified some subfields where I cannot reject 100% replication (p<.05), which confirms my theory.”

  2. To tie lightning rods together, this is also what bothers me about the anti-anti-vaxxers. If you tell someone that the risk of debilitating chronic disease, say, is indistinguishable from zero – and then say ‘because it is less than 0.05 (1 in 20)’ – they will be justifiably concerned that you have NO idea how risk management works.

    So, how many bad studies do you need in a field before it becomes a crisis of credibility? How many from an individual researcher? How bad do they need to be?

  3. Hi, grad student in econ here. I have been reading many of your posts and a few of your published articles concerning the replication crisis, p-hacking, the garden of forking paths, etc., and I agree completely with the conceptual problem of p-hacking and the garden of forking paths. I think I understand the meaning of a p-value, and that for it to have literal meaning the implication is that, given a different sample from the same population, the same research decisions would have been made and the same hypothesis test would have been run. However, given recent arguments I have had with people, I have realized that I do not have the mathematical rigor to back up this claim. Taking a simple example, how do I explain to someone that they should put less weight on a drug study that looked at 20 outcomes and found that the drug has a statistically significant effect on blood pressure versus a pre-registered study that finds a statistically significant effect on blood pressure (and that is the only outcome looked at)? I can give all sorts of conceptual arguments for this, but I would like to hear anyone’s more mathematical argument for this claim.

    • Matt:

      Suppose a study has a data matrix y and J possible data summaries (which might be comparisons, regression coefficients, whatever), T_j(y), for j=1,…,J.

      Consider three possible scenarios:

      1. J=1. One can compare T_1(y) to its distribution T_1(y_rep) under a null model and perform a hypothesis test.

      2. J=20 and the researcher picks the best result (this could be via “p-hacking” in which all 20 tests are computed and the best one is chosen, or less formally through a “garden of forking paths” in which the data are set up opportunistically and tested in a way that makes sense, conditional on the values actually observed). In either case, the test being used is T(y) = max_j T_j(y), and if you want to perform a hypothesis test you need to figure out the distribution of T(y_rep) = max_j T_j(y_rep) under the null hypothesis. The way this works is that the T_j that’s picked will depend on the data. (A short simulation sketch of this scenario appears below the paper links.)

      3. J=20 and the researcher looks at all comparisons together. I’d suggest doing this using a hierarchical model.

      The following paper is relevant to option 2 above:
      http://www.stat.columbia.edu/~gelman/research/published/multiverse_published.pdf

      And this paper is relevant to option 3 (my preferred approach):
      http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf
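
      A minimal simulation sketch of scenario 2 (mine, not part of the original comment), under the simplifying assumptions of J=20 independent, normally distributed outcomes, true effects of zero, and a nominal 0.05 threshold:

      ```python
      # Sketch: under a null model with J = 20 independent outcomes, how often does
      # the single best result cross the nominal 0.05 threshold? Illustrative only.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n_sims, J = 100_000, 20

      false_positives = 0
      for _ in range(n_sims):
          z = rng.standard_normal(J)                    # null z-statistics, one per outcome
          best_p = 2 * stats.norm.sf(np.abs(z).max())   # only the largest |z| gets reported
          false_positives += best_p < 0.05

      # Analytically this is 1 - 0.95**20, about 0.64, rather than the nominal 0.05.
      print(f"share of null datasets with a 'significant' best outcome: {false_positives / n_sims:.2f}")
      ```

      The point is only that the reference distribution of max_j T_j(y_rep) is very different from that of a single pre-specified T_1(y_rep).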

      • Madam/Mr President – I concur with 3 if there is a reasonable/credible/responsible prior and data-generating model (a representation of the underlying reality we hope to be connected with); otherwise, 2.

        (Additionally, your new adviser’s point that 1 is just 3 with a point prior is seldom helpful in practice.)

      • Thanks for getting back to me. I agree with all this (except option 3, which I would have to read about because my understanding is quite limited there).

        This is still slightly fuzzy though. To keep with my example, suppose the first researcher preregisters and gets the effect on blood pressure. Then, suppose we can rewind time, and the same researcher gets to do the experiment over, but this time he looks at 20 outcomes – and again finds that blood pressure is significant (because the data are identical). Why should I not take away the same information from both of these hypothetical experiments?

        This is a hard concept to communicate, I think; even your explanation is still conceptual. It is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value. That is odd to me; perhaps it is not odd to a statistician. Anyways, just thinking out loud at this point. I guess the replication crisis is evidence that people are behaving according to option 2. However, I was trying to figure out the other day what an acceptable replication rate should be…and obviously that depends on what the “true” effects were in these experiments…so I’m not sure what we are even comparing this seemingly abysmal replication rate to (i.e., what rate should we expect?).

        • Matt:

          First off, it’s fine to be confused about this. As shown by the link in my post above, a Harvard professor of psychology and a Harvard professor of political science have difficulty with these concepts too, so they’re not simple. Indeed, the convoluted logic of hypothesis testing has confused many prominent researchers.

          To get to your example in your second paragraph there: When we get new information, our inferences change. I don’t know enough about blood pressure to comment on your specific example, but in general we understand effects better when we consider multiple outcomes. Different outcomes are related to each other, and it makes perfect sense to me that learning about 19 other outcomes will affect my inferences about effects on blood pressure.

          In your third paragraph, you write, “it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value.” I agree, this is odd, but unfortunately this is baked into the definition of the p-value. You have data y, a test statistic T(y), and the p-value is Pr(T(y_rep) >= T(y)) where y_rep is sampled from the null model. The point is that in this definition, it is necessary to define T(y_rep) as a function of y_rep, which means that to define a p-value, you need to make some assumption about what test statistic would be reported, for any y_rep. This assumption is absolutely necessary for the p-value to have any definition at all.
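
          A small numerical sketch of the definitional point (my numbers, not Matt’s: a hypothetical observed z-statistic of 2.1 for blood pressure and J = 20 candidate outcomes treated as independent):

          ```python
          # Same observed data, two assumed reporting rules, two different p-values.
          from scipy import stats

          z_obs, J = 2.1, 20   # hypothetical observed z and number of candidate outcomes

          # Rule A (pre-registered): T(y) is the blood-pressure z-statistic and nothing else.
          p_single = 2 * stats.norm.sf(z_obs)

          # Rule B (forking paths): T(y) = max_j |z_j(y)|, assuming 2.1 was the largest
          # observed |z| (which is why it got reported), so T(y_rep) is the maximum of
          # J independent |z|'s under the null model.
          p_max = 1 - (1 - 2 * stats.norm.sf(z_obs)) ** J

          print(f"p-value under rule A: {p_single:.3f}")   # about 0.04
          print(f"p-value under rule B: {p_max:.3f}")      # about 0.52
          ```

          The observed numbers never change; what changes is the assumption about which statistic would have been reported for other y_rep, and that assumption is part of the p-value’s definition.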

        • Note that in a Bayesian analysis you would take away the same thing from the 1-outcome and the 20-outcome experiments. It’s only because frequentist statistics asks "how often would X occur" that doing X many times, or having a bunch of different Xs you could do and choosing one through some path that biases you towards p < 0.05, becomes a problem.

          The Bayesian analysis answers a different question: not "how often would X occur" but "how much information do my model and my data give me about Y", where Y is some unknown, unobserved thing that leads to X.

        • More generally “it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of what human inquiry ought to be to connect us best with reality (that’s not directly accessible).”

          Currently re-reading how Russell, Wittgenstein and Ramsey struggled with this issue – of course Peirce figured it out but wrote multiple faulty drafts and the final draft was not clearly marked ;-)

          One thing that does seem to be clear is that you can’t evaluate the value of a method of inquiry in any single instance or group of instances – which is what you’re doing now. Rather you need to evaluate it over an inexhaustible set of inquiries – here the statistically significant outcome will vary and will not always be the same one.

          It applies to Bayesian analyses as well: with a reasonable/credible/responsible prior and data-generating model (an adequate representation of the underlying reality we hope to be connected with for the current purpose) you might get an unlucky data set. More likely you will have an inadequate representation of the underlying reality and not notice it in a given data set (this time or in the first n times).

          What is baked into the definition of the p-value, for the purposes it is often put to in many disciplines, is the amplification of how bad its evaluation is over an inexhaustible set of inquiries (relative to the naive unadjusted p-value).

    • >”how do I explain to someone that they should put less weight on a drug study that looked at 20 outcomes and found that the drug has a statistically significant effect on blood pressure versus a pre-registered study that finds a statistically significant effect on blood pressure (and that is the only outcome looked at).”

      If you still care about statistical significance rather than estimating the size of the effect (or better, figuring out a model that can reproduce the functional relationship between the two parameters, here a dose response), then I don’t think you get his point.

      He can correct me if he disagrees, but the multiple comparisons issue just piles more problems on top of an already dead paradigm; you shouldn’t be doing those tests anyway… The main problem with what you mention is that it leads to a literature filled with hugely overestimated effect sizes. For example, here:

      “I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

      And classical multiple comparisons procedures—which select at an even higher threshold—make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. ”
      http://statmodeling.stat.columbia.edu/2011/09/10/the-statistical-significance-filter/

      If you want the fleshed-out, more mathematical argument, the best way is to run your own Monte Carlo simulations.
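
      In that spirit, here is a minimal Monte Carlo sketch of the type M (magnitude) error point, with an assumed true effect of 0.1 and standard error 0.5 (illustrative values only):

      ```python
      # Sketch: estimates that survive the significance filter exaggerate the true effect.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      true_effect, se, n_sims = 0.1, 0.5, 100_000   # assumed values, for illustration

      estimates = rng.normal(true_effect, se, size=n_sims)   # one noisy estimate per simulated study
      p_values = 2 * stats.norm.sf(np.abs(estimates) / se)
      significant = estimates[p_values < 0.05]

      print(f"true effect:                       {true_effect}")
      print(f"mean estimate, all studies:        {estimates.mean():.2f}")
      print(f"mean |estimate|, significant only: {np.abs(significant).mean():.2f}")   # far larger than 0.1
      ```

      Only a few percent of the simulated studies reach significance, and those that do overstate the effect by roughly an order of magnitude.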

        • I don’t care about statistical significance vs. effect size; this was just an example. I’m an economist, so most empirical work I do will be grounded in theory. However, it is definitely interesting/disconcerting to read this stuff. What is the dead paradigm – frequentist statistics? Sorry about being such a noob on this stuff.

        • The dead paradigm is testing a default null hypothesis of zero difference between groups, or zero correlation between some parameters, and then trying to link rejection of this hypothesis to a theory or model of interest. The only time it makes sense is when your model predicts the zero difference/correlation.

          What happened is that mathematicians/statisticians turned the logic of science on its head. You are supposed to compare the predictions of your hypothesis to observation, not some other hypothesis. All the other issues follow from that initial error. Here is a good write up: http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

        • Thank you. That’s a great explanation. It’s funny — stripped of its technicalities, the conceptual mistake really is a simple one. Yet it fools most of science.

        • Thanks, it is encouraging to see it getting across to some people. The “fooling” going on is due to:

          1) People wanting to think they are making progress without doing the necessary hard work of figuring out the premises and deducing precise predictions from their speculations. In fields like medical research the vast majority study extremely dynamic systems without any need for tools like calculus. That alone should be a huge red flag.

          2) Overreliance on argument from authority and consensus heuristics. These are necessary tools, but when they fail it can be quite spectacular.

          3) The extreme cognitive dissonance that results amongst those who have spent a lot of time/effort/money on NHST when they take this realization to its logical conclusion. It took me a few years to really accept it and I realized the problem relatively early on.

        • Another quote, this one from Richard Royall: “The experimenter who is primarily interested in studying the distribution of X… will ordinarily make observations on other random variables, Y, Z, etc. in the same survey….To the researcher who conceived the study in the first place to obtain evidence about the distribution of X, who finds a mean that differs from zero by 2.5 standard errors, and who is now told that this observation is not statistically significant, something seems wrong….The problem…is that the significance level is being used in two roles, only one of which is valid. Whether or not the researcher performs tests of other hypotheses affects his overall probability of committing at least one Type I error, but it does not change his evidence about X or how that evidence should be interpreted.” (My note: this assumes that the other observations give no information about the distribution of X, in which case Andrew’s hierarchical approach actually does tell you more.)

        • Herman Rubin always said that he didn’t need any data at all to reject such a null hypothesis, since it is virtually certain that it is false. Even null hypotheses that you might be sure could be true (Jim Berger gives the example, “My plants grow better if I talk to them”) might not be exactly true (e.g., if your breath gives them extra CO2 which helps them to grow better).

        • I have to say that the Berger example comes from Jim Berger’s paper with Mohan Delampady. I should have given credit to both of them.

        • Bill: Jim also did an analysis of an HIV vaccine where the virologists seemed quite convinced the effect could be zero – intuitively it trained the body to recognize a now extinct version of HIV (they go extinct very quickly).

          Jim had put a non-zero probability on zero effect, which I complained about and had to back down given the virologists…

          Of course one needs to avoid sure things, i.e., putting probability 1 on zero effect, because RCTs are blind to the mechanism of effect and one may always be wrong about that (e.g., the extra CO2 in your example, and Herman’s use of the adjective “virtually”).

        • >”the virologists seemed quite convinced the effect could be zero – intuitively it trained the body to recognize a now extinct version of HIV (they go extinct very quickly).”

          I don’t see why the strain would need to be non-“extinct”. Off the top of my head:

          Vaccine -> Immune Response (eg Fever) -> Reduced libido for a few days -> lower HIV incidence

          Or if they included any antibody tests in the pipeline, you could get a cross reaction with the vaccine peptide, which will affect diagnosis rates.

        • Anoneuoid:

          I am not a virologist – but Jim’s virologists and mine agreed that a zero effect was probable (maybe > .5).

          The effect was defined as protection from currently circulating HIV not extinct versions.

          > included any antibody tests in the pipeline
          The vaccine development pipeline was very different in this case; for most vaccines, doing an RCT without being fairly sure of an effect would be extremely unlikely.

          Desperate researchers often waste resources and risk high false positive claims.

        • >”Jim’s virologists and mine agreed that a zero effect was probable (maybe > .5).”

          Yes, I am seeing now why there has been so little progress on the HIV vaccine front. The subject area experts are apparently extremely hubristic and uncreative. In the meantime I was looking for more info and found this paper:

          “Because in HIV vaccine efficacy trials the null hypothesis (of no efficacy) is scientifically plausible, the Bayesian analysis assigns a prior probability Pr(VE = 0%) to this hypothesis. An obvious choice is Pr(VE = 0%) = .5, so that there is an even chance of zero efficacy and of nonzero efficacy.”
          http://jid.oxfordjournals.org/content/203/7/969.short

          I would like to see more about how the zero effect was deemed to be not only scientifically plausible, but the by far most likely outcome. I highly doubt that would stand up to scrutiny, there are simply so many routes by which a vaccine could have an effect. I bet they only considered one favorite mechanism during the discussion and improperly conflated that with the statistical hypothesis. In other words, the usual NHST error.
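
          For what it’s worth, here is a minimal sketch (my own, with made-up infection counts) of the kind of spike-and-slab calculation the quoted paper describes: prior mass 0.5 on VE = 0, a uniform prior on VE otherwise, and a conditional-binomial reduction that assumes two equal-sized trial arms:

          ```python
          # Spike-and-slab sketch: prior Pr(VE = 0) = 0.5, VE ~ Uniform(0, 1) otherwise.
          # Conditional on m total infections across two equal-sized arms, the vaccine-arm
          # count is Binomial(m, theta) with theta = (1 - VE) / (2 - VE). Counts are made up.
          import numpy as np
          from scipy import stats

          y_vaccine, y_placebo = 30, 50   # hypothetical infection counts
          m = y_vaccine + y_placebo

          lik_null = stats.binom.pmf(y_vaccine, m, 0.5)   # VE = 0 implies theta = 0.5

          ve_grid = np.linspace(0.0, 1.0, 2001)           # uniform prior over nonzero efficacy
          theta_grid = (1 - ve_grid) / (2 - ve_grid)
          lik_alt = stats.binom.pmf(y_vaccine, m, theta_grid).mean()   # marginal likelihood, VE > 0

          prior_null = 0.5
          posterior_null = prior_null * lik_null / (prior_null * lik_null + (1 - prior_null) * lik_alt)
          print(f"posterior Pr(VE = 0): {posterior_null:.2f}")   # well below 0.5 for these counts
          ```

          Whether Pr(VE = 0) = .5 is a defensible prior is exactly the substantive question being argued here; the sketch only shows the mechanics.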

    • Matt,

      Different people may need different types of explanation to help them “get it”. You may find some of the explanations in the slides (under Course Notes) at http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html to be helpful — those for Day 2 and Day 4 are probably most relevant. Also, the link further down to Jerry Dallal’s Simulation of Multiple Testing (and the two items following it) can be helpful to some people.

    • Matt, as a statistical consultant for scientists I encounter similar situations far too often. Therefore, I have adopted the following strategy: I try to convince my clients that analyzing the effects of 30 covariates on 20 outcomes is OK as long as it is declared as exploratory data analysis and the results are not used for any inferential statements. My main argument is about reproducibility: results obtained by such an analysis are most often not reproducible, and thus practically worthless for publication. Instead, I encourage them to use the information from this analysis to plan a more detailed study of the effect of interest, which – if it is carried out correctly – most often yields reproducible results. If this argumentation does not work and the client insists on publishing the results as if they came from a planned experiment, I usually end the cooperation with this particular client at this stage and demand that my name not be brought into connection with any result of their analysis.

      Over the years I lost some clients because of this strategy. However, I have some regular clients so that I’m able to survive without those black sheep. What really annoys me is that they manage to survive as well, although they perform crappy research.

  4. More fundamentally, does it even make sense to talk about being “statistically indistinguishable from 100%”?

    If the null hypothesis is that 100% of replication studies replicate the original findings, then A SINGLE INSTANCE of a failed replication demonstrates with 100% certainty that the replication rate is not 100%. Statistics is not even necessary here; simple logic will do.

    • I have been asking myself the same question since I first read the Gilbert et al. comment. I really admire most of King’s work and don’t understand what went wrong there…

      • Marko:

        I’m not quite sure what went wrong there either. But you have to remember that King works well when collaborating with people who do know statistics. I have no idea exactly what went wrong with that Gilbert, King, Pettigrew, and Wilson paper, but it’s possible that: (a) Gilbert deferred to King under the impression that King was a statistics expert, and (b) King deferred to Gilbert under the impression that Gilbert was a subject-matter expert. This sometimes can happen with collaborations, that with multiple people involved, there’s no one to ultimately take responsibility for the conclusions. I’d like to think that either King or Gilbert acting alone would not have made these mistakes: King would not have been emboldened by Gilbert to take such a strong and mistaken position regarding psychology’s replication crisis, and Gilbert would not have been emboldened by King to make such strong and mistaken statistical claims. The whole episode was a disaster.

        • Thanks – but I would have thought King would be better at disaster recognition and recovery, given he was one of the few working on research on reproducible/replicable science in the early 2000s (when almost no one was successful at getting funding for it).

        • Keith:

          Yes, I was surprised too. I wouldn’t’ve expected King to understand much about how p-values and confidence intervals work, as this is kinda technical and lots of applied researchers and even textbook writers get confused on this point—but I was surprised to see him dismiss the value of replications. Here’s where I think he made the mistake of trusting Gilbert on the substance, and then conversely Gilbert naively trusted King on the stats. I have no idea what either King or Gilbert thinks about this now, but my guess is that King may have realized that he screwed up on this one, but he’s not sure whether to publicly admit his error or just quietly move on and hope that people forget this whole episode.
