
2 new reasons not to trust published p-values: You won’t believe what this rogue economist has to say.

Political scientist Anselm Rink points me to this paper by economist Alwyn Young which is entitled, “Channelling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results,” and begins,

I [Young] follow R.A. Fisher’s The Design of Experiments, using randomization statistical inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003 regressions in 53 experimental papers drawn from the journals of the American Economic Association. Randomization tests reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect whatsoever. Bootstrap methods support and confirm these results.

This is all fine, I’m supportive of the general point, but it seems to me that the title of this paper is slightly misleading, as the real difference here comes not from the randomization but from the careful treatment of the multiple comparisons problem. All this stuff about randomization and bootstrap is kind of irrelevant. I mean, sure, it’s fine if you have nothing better to do, if that’s what it takes to convince you and you don’t take it too seriously, but that’s not where the real juice is coming from. So maybe Young could take the same paper and replace “randomization tests” and “randomization inference” by “multiple comparisons corrections” throughout.
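To make the multiple comparisons point concrete, here is a minimal sketch of a toy simulation. It is not Young's procedure: the function names (`abs_diff_in_means`, `randomization_pvalues`), the simulated data, and the choice of the maximum absolute difference in means as the omnibus statistic are all assumptions made for illustration. The idea is that each outcome gets its own randomization p-value, while the omnibus p-value asks how often a reshuffled treatment assignment produces a largest difference at least as big as the largest one observed.

```python
import numpy as np


def abs_diff_in_means(y, treat):
    """Absolute treated-minus-control difference in means, one per outcome column."""
    return np.abs(y[treat == 1].mean(axis=0) - y[treat == 0].mean(axis=0))


def randomization_pvalues(y, treat, n_draws=1000, seed=0):
    """Randomization p-values under the sharp null of no effect for any unit.

    Returns (per_outcome, omnibus). The omnibus statistic is the maximum
    absolute difference across outcomes, an illustrative choice that
    assumes the outcomes are on comparable scales.
    """
    rng = np.random.default_rng(seed)
    observed = abs_diff_in_means(y, treat)
    per_outcome_hits = np.zeros(y.shape[1])
    omnibus_hits = 0
    for _ in range(n_draws):
        # Under the sharp null, outcomes are fixed and only the labels move.
        shuffled = rng.permutation(treat)
        stat = abs_diff_in_means(y, shuffled)
        per_outcome_hits += stat >= observed
        omnibus_hits += stat.max() >= observed.max()
    # Add-one correction keeps every p-value strictly positive.
    per_outcome = (per_outcome_hits + 1) / (n_draws + 1)
    omnibus = (omnibus_hits + 1) / (n_draws + 1)
    return per_outcome, omnibus


# Hypothetical experiment: 200 units, 20 outcomes, and no treatment effect at all.
data_rng = np.random.default_rng(1)
treat = data_rng.permutation(np.repeat([0, 1], 100))
y = data_rng.normal(size=(200, 20))

per_outcome, omnibus = randomization_pvalues(y, treat)
print("outcomes with p < 0.05:", int((per_outcome < 0.05).sum()))
print("smallest per-outcome p:", per_outcome.min())
print("omnibus p:", omnibus)
```

By construction the omnibus p-value can never be smaller than the smallest per-outcome p-value, so screening on the omnibus test is stricter than picking the best-looking outcome. With outcomes on different scales one would probably want a studentized statistic instead of raw differences in means.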


  1. Gaurav says:


    I thought part of the reason behind the results was lower power of nonparam tests. I understand that it may not be a general rule. If you have the time, please say more.

  2. Z says:

    The abstract says that “randomization tests [alone, right?] reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent.” How is this not a lot of juice? Or is the abstract also misleading?

  3. Martha (Smith) says:

    I skimmed the paper as far as p. 5, and noticed a couple of choice phrases:

    p. 4 “The quantitative deconstruction of results is as follows:”

    p. 5 “Notwithstanding its results, this paper confirms the value of randomized experiments.”

    Is this the way economists talk? Or is it something distinctive to Young?

    • Dale says:

      Yes, this is the way economists talk. Not that other disciplines are free from such problems, but economists like to be as difficult to understand as possible – and cover it with overly complex mathematics. It enhances publish-ability and helps sustain the profession’s status as “the queen of the social sciences.”

  4. D.O. says:

    Eeeeh, but why is 0 treatment effect a reasonable null hypothesis? If someone takes victims of shootings and finds that their life expectancy is not changed compared to the controls (that is, people not being shot), isn’t that highly significant?

    • Anoneuoid says:

      Unfortunately, this is true. Any such findings about better ways to test the null hypothesis are only of technical interest (because it could be applied to testing real hypotheses in the future). So, while it is useful to investigate such matters, it is not really addressing how worthwhile any of the conclusions are.

  5. Mayo says:

    Based on what Gelman wrote this appears to show that the actual (adjusted) p-value gets it right by correcting for multiple comparisons. (I haven’t read the paper.)

    • Anonymous says:

      I don’t know. Would a Frequentist consider the following hypothetical example a success?

      Initially 30% of results with p-values less than .05 replicate, but when an adjustment for multiple comparisons is made (so an alpha much smaller than .05 is used) it turns out 60% of those with p-values below the cutoff replicate.

  6. R says:

    I find the conclusion that p-values aren’t as small under randomization tests as the ones obtained using more standard regression assumptions unsurprising. These aren’t testing the same null hypotheses (sharp null vs. ATE=0), model selection is influenced by the sample data, the big multiple comparisons issue, etc.

    The (ir)reproducible research angle is more interesting to me. For 53 papers in high profile economic journals, a lot of crap made its way through peer review:

    “Regressions as they appear in the published tables of journals in many cases do not follow the explanations in the papers. To give a few examples: (a) a table indicates date fixed effects were added to the regression, when what is actually added is the numerical code for the date; or location is entered in the regression, not as a fixed effect, but as simply the location code. (b) regressions are stacked, but not all independent variables are duplicated in the stacked regression. (c) clustering is done on variables other than those mentioned, these variables changing from table to table. (d) unmentioned treatment and non-treatment variables are added or removed between columns of a table. (e) cluster fixed effects are added in a regression where aspects of treatment are applied at the cluster level, so that those treatment coefficients are identified by two observations which miscoded treatment for a cluster (I drop those treatment measures from the analysis).”

    “(13a) appears to be an unfortunate error in Stata,[fn21] where the xtreg fixed effects command forgets to count the number of absorbed fixed effects kfe in k. This means that papers that cluster when using the xtreg fixed effects command, as opposed to, say, the otherwise identical areg command, get systematically lower p-values. Three papers in my sample use the xtreg fixed effects clustered command in a total of 100 regressions.”

    “I also reproduce any coding errors in the original do-files that affect treatment measures, e.g. a line of code that unintentionally drops half the sample or another piece that intends to recode individuals of a limited type to have a zero x-variable but unintentionally recodes all individuals in broader groups to have that zero x-variable.”

    • Fernando says:

      Science is a medieval artisanal craft. Hence the highly variable quality.

      Although scientists have more complex tools, the basic organization of production appears unchanged for centuries.

  7. Charlie says:

    So maybe Young could take the same paper and replace “randomization tests” and “randomization inference” by “multiple comparisons corrections” throughout.

    A second pub!
