3 more articles (by others) on statistical aspects of the replication crisis

A bunch of items came in today, all related to the replication crisis:

– Valentin Amrhein points us to this fifty-authored paper, “Manipulating the alpha level cannot cure significance testing – comments on Redefine statistical significance,” by Trafimow, Amrhein, et al., who make some points similar to those made by Blake McShane et al. here.

– Torbjørn Skardhamar points us to this paper, “The power of bias in economics research,” by Ioannidis, Stanley, and Doucouliagos, which is all about type M errors, but for a different audience (economics instead of psychology and statistics), so that’s a good thing.

– Jonathan Falk points us to this paper, “Consistency without Inference: Instrumental Variables in Practical Application,” by Alwyn Young, which argues, convincingly, that instrumental variables estimates are typically too noisy to be useful. Here’s the link to the replication crisis: If IV estimates are so noisy, how is it that people thought they were ok for so long? Because researchers had so many unrecognized degrees of freedom that they were able to routinely obtain statistical significance from IV estimates—and, traditionally, once you have statistical significance, you just assume, retrospectively, that your design had sufficient precision.

It’s good to see such a flood of articles of this sort. When it’s one or two at a time, the defenders of the status quo can try to ignore, dodge, or parry the criticism. But when it’s coming in from all directions, this perhaps will lead us to a new, healthy consensus.

22 thoughts on “3 more articles (by others) on statistical aspects of the replication crisis”

  1. I thought it was odd that the 3rd paper anonymizes its results so that one cannot discuss the robustness of individual papers. I can see why Young might want to avoid controversy, but if there are dozens of unreliable papers in the literature, wouldn’t it help to have a list of them? Or is the idea that all published analyses based on the instrumental variables approach should be presumed to be unreliable?

  2. re: the Alwyn Young paper – I don’t doubt the connection to researcher degrees of freedom, but I interpreted the results differently. IV has poor finite sample properties, and when instruments are weak, the asymptotics are hardly any better. Both of these facts mean you can get spurious significance, even with a pre-registered research design, and both are well understood within economics at this point. The issue is a quantitative one: is the finite sample bias associated with IV better or worse than the endogeneity bias we are trying to correct for? The Young paper suggests that in many cases, the cure is worse than the disease.

      • There are even forking paths there. One thing Young discusses is that F-statistics (and other criteria) are used to screen for strong enough instruments. So part of what is happening is that people get “lucky” with their first-stage F, even when the instrument is very weak.

        • Indeed, pre-testing for relevance distorts size in the second stage. Trying to figure out how to do robust inference with weak instruments is a huge literature in econometrics, going back to 2003 or earlier (see, e.g., Andrews and Stock, or Imbens and Rosenbaum). It would be interesting to know the age distribution of the papers Young examines, to see if things have gotten any better in the last 10 years or so. I guess what I’m trying to push back on is the idea that this is a “debacle”, or that the researcher degrees of freedom were unrecognized. It’s well known by now that some of the papers that kicked off this whole literature in economics (e.g., Angrist and Krueger (1991)) have this problem, so while it definitely looks bad from the outside, I assume most people working within the field take it for granted that a lot of these results will be fragile.
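
          To see concretely how that first-stage screening can distort second-stage inference, here is a minimal simulation sketch. It is not taken from Young’s paper; the sample size, instrument strength, error correlation, and F threshold are all made-up illustrative assumptions.

          # Hypothetical simulation (illustrative numbers only, not Young's analysis):
          # with a weak instrument, keeping only draws that pass a first-stage F
          # screen inflates the second-stage rejection rate even though the true
          # effect is zero.
          import numpy as np

          rng = np.random.default_rng(0)
          n, n_sims = 200, 5000
          pi_first = 0.1                             # weak first-stage coefficient (assumed)
          rejections, kept = 0, 0

          for _ in range(n_sims):
              z = rng.normal(size=n)                 # instrument
              u = rng.normal(size=n)                 # structural error
              v = 0.8 * u + 0.6 * rng.normal(size=n) # first-stage error, correlated with u
              x = pi_first * z + v                   # endogenous regressor
              y = 0.0 * x + u                        # true causal effect is zero

              zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()

              # First-stage F for a single instrument (squared t from regressing x on z)
              pi_hat = zc @ xc / (zc @ zc)
              res1 = xc - pi_hat * zc
              se_pi = np.sqrt(res1 @ res1 / (n - 2) / (zc @ zc))
              if (pi_hat / se_pi) ** 2 < 10:         # conventional "F > 10" screen
                  continue
              kept += 1

              # Just-identified IV estimate and its standard error
              b_iv = zc @ yc / (zc @ xc)
              res2 = yc - b_iv * xc
              se_iv = np.sqrt(res2 @ res2 / (n - 2) / (zc @ zc)) / abs(pi_hat)
              rejections += abs(b_iv / se_iv) > 1.96

          print(f"passed the F screen: {kept}/{n_sims}; "
                f"second-stage rejection rate among those: {rejections / max(kept, 1):.2f}")

          Under these assumptions, the draws that survive the screen should reject the (true) null far more often than the nominal 5%, which is the “lucky first-stage F” problem in miniature.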

  3. Alwyn Young’s paper looks great at an initial skim.

    It’s possible that he agreed to anonymity in order to get additional analysis details from the authors. The published details aren’t always sufficient to do an analytic replication with full confidence.

    Also, he’s making a general point about the method, which has implications well beyond these papers.

  4. I skimmed the first paper (Trafimow et al.), and it’s OK. Honestly, I didn’t see anything disagreeable in it. However, two points are missing:

    1) The tenuous (and usually convoluted) link between the research hypothesis and null hypothesis.

    They do cite Meehl 1967 but I don’t really see what it has to do with the context. The important point of that paper is this:

    “‘Statistical significance’ plays a logical role in psychology precisely the reverse of its role in physics.”

    By “physics”, Meehl means cases where:

    Research hypothesis -> prediction -> null hypothesis

    By “psychology”, Meehl means cases where:

    Research hypothesis -> prediction -> ! null hypothesis

    This simple difference is the underlying cause of so many problems. It’s like having buggy code where if(!x){…} is there instead of if(x){…}. That stray single character may be difficult to notice, but it will ruin everything. It will cause the process to generate seemingly valid results that are actually misleading garbage.
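
    To make the analogy concrete, here is a toy sketch (hypothetical code, not from any of the papers) of the two directions of reasoning Meehl contrasts:

    # Toy illustration of Meehl's contrast (hypothetical code, for the analogy only).

    def supports_theory_physics_style(null_rejected: bool) -> bool:
        # "Physics" case: the theory itself predicts the point null,
        # so failing to reject that predicted value is what counts as support.
        return not null_rejected

    def supports_theory_psychology_style(null_rejected: bool) -> bool:
        # "Psychology" case: the theory only predicts *some* nonzero difference,
        # so rejecting the null is taken as support -- the logic is reversed,
        # like the stray "!" above.
        return null_rejected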

    2) “The unmentionable”.

    Almost everybody who has been doing/teaching this NHST stuff has basically wasted their career or worse: likely made a negative contribution to science. Future researchers will need to reexamine all of the phenomena “studied” this way (basically redo everything since NHST was adopted in a given field, back to the 1940s in some cases). I think it is better to simply put this in the open. There is a huge sunk cost fallacy that needs to be dealt with.

    • Author on the first paper here:

      I strongly agree with the second point, but this is one of those diplomatic things where it’s hard to get fifty researchers to attach their name to a public statement which effectively dismisses entire fields of social science as a waste of time, and harder still to get it published — though I actually believe that it is true: there are certainly decades-long research programs in psychology built around effects which have been shown to be unreplicable. I’d be happy to see this stated bluntly in the literature.

  5. Given the premise that science is something that can be manufactured I suppose it makes sense to approach quality control in the same fashion as widget manufacturing. Too bad the premise is fatally flawed.

  6. Are pre-analysis plans reasonable to ask of IV analyses? I would think so but I wonder if someone from econ would feel differently/have arguments I haven’t thought of.

    Related, are there good IV papers with PAPs anyone would like to recommend?

  7. The Manipulating the Alpha Level paper seems very comprehensive and thoughtful.

    I especially liked this point: “The mere fact that researchers are concerned with replication, however it is conceptualized, indicates an appreciation that single studies are rarely definitive and rarely justify a final decision.”

    As for how replication is conceptualized, it seems too obvious to me that the primary focus should be whether the various study likelihoods re-weight a given prior concordantly (i.e., whether the prior probabilities of the parameters taken to be possibly common are moved in similar directions in all studies). For instance, if the likelihoods are quadratic, a forest plot would be a good start, if not adequate. For two-group binary-outcome randomized trials, the control group’s event rate is allowed to be arbitrary by study, and the commonness of the odds ratio is then assessed from the studies’ confidence intervals in a plot (a minimal sketch follows below). But maybe this is all too familiar from the meta-analysis literature and I can’t see the trees…

    If the likelihoods are not quadratic there is this more technical approach http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf
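
    Here is the kind of minimal sketch I have in mind for the two-group binary case; the study names and counts below are invented purely for illustration:

    # Forest-plot-style summary of study-specific odds ratios.
    # All counts are made up for illustration.
    import numpy as np

    # (events_treat, n_treat, events_ctrl, n_ctrl) for three hypothetical studies
    studies = {"Study A": (12, 100, 20, 100),
               "Study B": (30, 250, 45, 250),
               "Study C": (5,  60,  9,  60)}

    for name, (a, n1, c, n0) in studies.items():
        b, d = n1 - a, n0 - c                  # non-events in each arm
        log_or = np.log((a * d) / (b * c))     # study-specific log odds ratio
        se = np.sqrt(1/a + 1/b + 1/c + 1/d)    # Woolf standard error
        lo, hi = log_or - 1.96 * se, log_or + 1.96 * se
        print(f"{name}: OR = {np.exp(log_or):.2f} "
              f"(95% CI {np.exp(lo):.2f} to {np.exp(hi):.2f})")

    Each study’s control rate is free to differ; what the plot (or printout) lets you compare is whether the studies’ odds-ratio intervals point in the same direction.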

  8. “This last point sheds an interesting light on the methodologies that have evolved to fill the niche left by the abandonment of candidate-gene association. In genome-wide-association studies, data on hundreds of thousands of individual bits of DNA are collected in large samples and then searched for significant results at highly stringent p levels. If (as usually happens) no significant results are discovered the first time around, the process is repeated with even larger samples, continuing until something significant finally emerges. “Hits,” as they have come to be known, are now being accumulated for many behavioral characteristics, but the effect sizes for individual SNPs or alleles are vanishingly small (Chabris et al., 2015).

    But does this methodology sound familiar? Genome-wide association is unapologetic, high-tech p-hacking. In the modern era, when major social science journals discourage null-hypothesis significance tests and replication as opposed to significance has become an obsession, it is nothing short of odd that behavioral science at the bleeding edge of genomic technology has become an extended exercise in stringent but fundamentally old-fashioned significance testing. To assume that the current list of single nucleotide polymorphisms (SNPs) reaching significance for some behavioral trait will be significant again the next time someone collects DNA from 100,000 people is to make the most basic of errors about the relationship between statistical significance and replicable science. Some SNPs will replicate. Others will not. It will depend on context.”
    http://journals.sagepub.com/doi/full/10.1177/1745691615617442

    • Tk:

      Yes. p-values are super-noisy. For example, the z-scores corresponding to (two-sided) p-values 1e-5 and 1e-8 are 4.4 and 5.7, respectively. Thus, the difference between these two seemingly huge differences in “significance”—a factor of 1000 in the p-value—is well within the variation that would be expected by chance. (The comparison of two z-scores from independent studies would have a sd of sqrt(2), or 1.4.)

      Thus, if a researcher is feeling particularly virtuous and decides to screen out everything with a p-value greater than 1e-8, he or she can end up screening away lots of real effects. Screening can just be a way to add noise.
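
      For anyone who wants to check those numbers, here is a quick scipy sketch of the arithmetic:

      # Quick check of the z-scores behind the comparison above.
      from scipy.stats import norm

      z5 = norm.isf(1e-5 / 2)   # z for a two-sided p-value of 1e-5, about 4.42
      z8 = norm.isf(1e-8 / 2)   # z for a two-sided p-value of 1e-8, about 5.73
      # The gap, about 1.3, is smaller than sqrt(2) (about 1.41), the sd of the
      # difference between two independent z-scores.
      print(z5, z8, z8 - z5)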

  9. Perhaps this belongs here: the latest reply to the “Redefine statistical significance” paper:

    https://psyarxiv.com/bp2z4/

    “Using the same theoretical device (i.e., false positive rate under NHST) and empirical evidence (i.e., the psychology replication study in [16]), we have analyzed the RSS proposal in light of claims that it will improve reproducibility. By accounting for the effects of P-hacking, we see that the claimed benefits to false positive rate and replication rate are much less certain than suggested in [2]. In fact, if false positive rate were to decrease at all, it will be virtually unnoticeable, and will remain much higher than claimed in [2].”

    “Altogether, these observations point to one conclusion: The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.”

    “Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors.”
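
    To give a feel for the false-positive-rate device that both sides of this debate use, here is a toy calculation; the prior probability of a real effect, the power, and the crude p-hacking inflation factor are all invented assumptions, not the model in the preprint:

    # Toy false-positive-rate calculation (illustrative assumptions only).
    def false_positive_rate(alpha, power, prior_real, hacking_factor=1.0):
        """Share of 'significant' findings that are false positives."""
        eff_alpha = min(1.0, alpha * hacking_factor)  # crude p-hacking inflation
        false_pos = eff_alpha * (1 - prior_real)
        true_pos = power * prior_real
        return false_pos / (false_pos + true_pos)

    for alpha in (0.05, 0.005):
        for hack in (1.0, 10.0):
            fpr = false_positive_rate(alpha, power=0.5, prior_real=0.1,
                                      hacking_factor=hack)
            print(f"alpha = {alpha}, p-hacking inflation x{hack:g}: FPR = {fpr:.2f}")

    Holding power fixed across thresholds is of course a simplification; the point is only that enough p-hacking can wipe out the nominal gain from a stricter alpha.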
