Political scientist Anselm Rink points me to this paper by economist Alwyn Young which is entitled, “Channelling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results,” and begins,
I [Young] follow R.A. Fisher’s The Design of Experiments, using randomization statistical inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003 regressions in 53 experimental papers drawn from the journals of the American Economic Association. Randomization tests reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect whatsoever. Bootstrap methods support and confirm these results.
This is all fine, I’m supportive of the general point, but it seems to me that the title of this paper is slightly misleading, as the real difference here comes not from the randomization but from the careful treatment of the multiple comparisons problem. All this stuff about randomization and bootstrap is kind of irrelevant. I mean, sure, it’s fine if you have nothing better to do, if that’s what it takes to convince you and you don’t take it too seriously, but that’s not where the real juice is coming from. So maybe Young could take the same paper and replace “randomization tests” and “randomization inference” with “multiple comparisons corrections” throughout.
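To make the distinction concrete, here is a toy simulation (nothing here is from Young’s paper; the sample sizes, the number of “regressions,” and the simple difference-in-means setup are all made up for illustration). It runs a randomization test on each of 20 null comparisons, then applies a Bonferroni correction. The point is that the randomization test is just one way to get a p-value; it’s the multiple-comparisons step that cuts down the count of “significant” results.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(y, treat, n_perm=500, rng=rng):
    """Randomization (permutation) test for a difference in means:
    re-randomize the treatment labels and count how often the permuted
    difference is at least as large as the observed one."""
    observed = y[treat == 1].mean() - y[treat == 0].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(treat)
        diff = y[perm == 1].mean() - y[perm == 0].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)  # standard add-one adjustment

def bonferroni(pvals, alpha=0.05):
    """Multiple-comparisons correction: a result only counts as
    significant if p < alpha divided by the number of tests."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

# Simulate 20 "regressions" in which the null is true:
# the outcome is unrelated to the (randomized) treatment.
pvals = []
for _ in range(20):
    treat = rng.permutation(np.repeat([0, 1], 50))
    y = rng.normal(size=100)
    pvals.append(permutation_test(y, treat))

naive = sum(p < 0.05 for p in pvals)  # uncorrected significant count
corrected = sum(bonferroni(pvals))    # Bonferroni-corrected count
print(naive, corrected)
```

By construction the corrected count can never exceed the naive count, and with 20 true nulls the uncorrected procedure will flag spurious effects at roughly the usual 5% rate while the corrected one rarely flags anything.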