The Fault in Our Stars: It’s even worse than they say

In our recent discussion of publication bias, a commenter linked to a recent paper, “Star Wars: The Empirics Strike Back,” by Abel Brodeur, Mathias Le, Marc Sangnier, and Yanos Zylberberg, who point to the notorious overrepresentation in scientific publications of p-values that are just below 0.05 (that is, just barely statistically significant at the conventional level) and the corresponding underrepresentation of p-values that are just above the 0.05 cutoff.

Brodeur et al. correctly (in my view) attribute this pattern not just to selection (the much-talked-about “file drawer”) but also to data-contingent analyses (what Simmons, Nelson, and Simonsohn call “p-hacking” and what Loken and I call “the garden of forking paths”). They write:

We have identified a misallocation in the distribution of the test statistics in some of the most respected academic journals in economics. Our analysis suggests that the pattern of this misallocation is consistent with what we dubbed an inflation bias: researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We have also quantified this inflation bias: among the tests that are marginally significant, 10% to 20% are misreported.

They continue with “These figures are likely to be lower bounds of the true misallocation as we use very conservative collecting and estimating processes”—but I would go much further. One way to put it is that there are (at least) three selection processes going on here:

1. (“the file drawer”) Significant results (traditionally presented in a table with asterisks or “stars,” hence the photo above) are more likely to get published.

2. (“inflation”) Near-significant results get jiggled a bit until they fall into the box.

3. (“the garden of forking paths”) The direction of an analysis is continually adjusted in light of the data.

Brodeur et al. point out that item 1 doesn’t tell the whole story, and they come up with an analysis (featuring a “lemma” and a “corollary”!) explaining things based on item 2. But I think item 3 is important too.
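
To make items 2 and 3 a little more concrete, here is a minimal sketch in R (an illustration, not Brodeur et al.’s model): every true effect is zero, but whenever a first test comes out “almost significant” the analyst tries one alternative specification (here, dropping the most extreme observation) and reports whichever p-value looks better. Even this single extra fork produces an excess of results just under 0.05.

    set.seed(123)
    nSims <- 10000 #number of simulated experiments, all with zero true effect
    p.final <- numeric(nSims) #container for the p-value that gets reported
    for(i in 1:nSims){
      x <- rnorm(n = 23, mean = 100, sd = 20) #control group
      y <- rnorm(n = 23, mean = 100, sd = 20) #treatment group, no true difference
      p <- t.test(x,y)$p.value
      if(p > 0.05 && p < 0.10){ #almost rejected: try one alternative specification
        y.alt <- y[-which.max(abs(y - mean(y)))] #for example, drop the most extreme observation
        p <- min(p, t.test(x, y.alt)$p.value) #and keep whichever p-value looks better
      }
      p.final[i] <- p
    }
    mean(p.final < 0.05) #exceeds the nominal 5% false-positive rate
    hist(p.final[p.final < 0.2], breaks = 40) #note the bump just below 0.05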

The point is that the analysis is a moving target. Or, to put it another way, there’s a one-to-many mapping from scientific theories to statistical analyses.

So I’m wary of any general model explaining scientific publication based on a fixed set of findings that are then selected or altered. In many research projects, there is either no baseline analysis or else the final analysis is so far away from the starting point that the concept of a baseline is not so relevant.

Although maybe things are different in certain branches of economics, in that people are arguing over an agreed-upon set of research questions.

P.S. I only wish I’d known about these people when I was still in Paris; we could’ve met and talked.

9 thoughts on “The Fault in Our Stars: It’s even worse than they say”

      • Thanks for linking to my site. I looked at the publication, and I don’t understand why the authors are not modeling a Z-score distribution consisting of some true effects with publication bias, some effects without publication bias, and some null effects. If you do this, I think you’d practically always end up with the shape they observe, with a dip just below 1.97. It’s just due to publication bias; I see no reason to assume anyone is p-hacking on a massive scale – but perhaps I’m missing something, or my approach is not valid. Anyway, my R code is below. Play around with it and you’ll see the pattern is easily simulated.

        #####EFFECT WITH PUBLICATION BIAS#########

        nSims <- 10000 #number of simulated experiments
        z1 <-numeric(nSims) #set up empty container for z-scores

        for(i in 1:nSims){ #for each simulated experiment
        x<-rnorm(n = 23, mean = 100, sd = 20) #produce simulated participants
        y<-rnorm(n = 23, mean = 114, sd = 20) #produce simulated participants
        t<-t.test(x,y) #perform the t-test
        if(t$p.value<0.05) {
        z1[i]<-qnorm(1-(t$p.value/2)) #convert to the z-score and store it
        }
        }

        z1<-z1[z1>0.0001] #remove empty cells of studies due to publication bias

        #####EFFECT WITHOUT PUBLICATION BIAS#########

        nSims <- 10000 #number of simulated experiments

        z2 <-numeric(nSims) #set up empty container for z-scores

        for(i in 1:nSims){ #for each simulated experiment
        x<-rnorm(n = 23, mean = 100, sd = 20) #produce simulated participants
        y<-rnorm(n = 23, mean = 110, sd = 20) #produce simulated participants
        t<-t.test(x,y) #perform the t-test
        z2[i]<-qnorm(1-(t$p.value/2)) #convert to the z-score and store it
        }

        #####NO EFFECT WITHOUT PUBLICATION BIAS#########

        nSims <- 10000 #number of simulated experiments
        z3 <-numeric(nSims) #set up empty container for z-scores

        for(i in 1:nSims){ #for each simulated experiment
        x<-rnorm(n = 23, mean = 100, sd = 40) #produce simulated participants
        y<-rnorm(n = 23, mean = 100, sd = 40) #produce simulated participants
        t<-t.test(x,y) #perform the t-test
        z3[i]<-qnorm(1-(t$p.value/2)) #convert to the z-score and store it
        }

        z<-c(z1,z2,z3)
        #now plot the histogram (NOTE: if you get an error that some 'x' are not counted, the 'breaks' may not span the range of 'x'; increase the upper end of the sequence from 8 to a higher value)
        hist(z, main="Histogram of z-scores", xlab="Observed z-score", breaks=seq(0,8,by=0.245))
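
        #Optional extra line (not in the original comment): marking the conventional two-sided
        #5% cutoff makes the dip just below z = 1.96 easier to see in the histogram.
        abline(v = qnorm(1 - 0.05/2), lty = 2) #dashed vertical line at z = 1.96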

  1. To redress the balance, it seems quite common on climate blogs to try and find the longest period ending at the current date such that the trend in global mean surface temperatures just fails to be statistically significant at the 95% level, i.e. p > 0.05, but only just ;o)

    2b. (“deflation”) Near-significant results get jiggled a bit just to make sure they fall out of the box.

    cf. Ross R. McKitrick, “HAC-Robust Measurement of the Duration of a Trendless Subsample in a Global Climate Time Series,” Open Journal of Statistics, 2014, 4, 527-535. http://dx.doi.org/10.4236/ojs.2014.47050
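
    Here is a rough sketch of that exercise in R (a simplification: simulated anomalies and a plain OLS trend test rather than McKitrick’s HAC-robust standard errors): find the longest window ending at the last year whose estimated trend just fails to be significant at the 5% level.

    set.seed(1)
    year <- 1970:2014
    temp <- 0.015*(year - 1970) + rnorm(length(year), sd = 0.15) #small warming trend plus noise
    start.longest <- NA
    for(start in year[1:(length(year) - 10)]){ #require at least ~10 years in the window
      keep <- year >= start
      p <- summary(lm(temp[keep] ~ year[keep]))$coefficients[2, 4] #p-value of the OLS slope
      if(p > 0.05){ start.longest <- start; break } #earliest start whose trend fails significance
    }
    start.longest #the longest not-quite-significant window runs from here to the last year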

  2. I think the difference between 2 and 3 is that 2 keeps the alternative hypothesis constant but jiggles whatever it takes to reject the null, while 3 goes through a menu of hypotheses until one is found that is significant.

    This is similar to the distinction in international trade between the intensive margin (exporting more of the same thing, perhaps with technical improvements; or more to the same destination) and the extensive margin (diversifying into different export categories; or destinations).

    So with significance you can work along the intensive margin (keeping H1 constant) or the extensive margin (fishing for H1*).
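
    A quick way to see the difference (a sketch with every null hypothesis true, not from the paper): on the intensive margin the analyst keeps H1 fixed and picks the more favorable of two specifications; on the extensive margin the analyst picks the more favorable of two different outcomes. Both inflate the false-positive rate above 5%, the extensive margin more so here because its two tests are less correlated.

    set.seed(2)
    nSims <- 5000
    hit.int <- hit.ext <- logical(nSims)
    for(i in 1:nSims){
      g <- rep(0:1, each = 25) #treatment indicator; no outcome is actually affected
      y1 <- rnorm(50); y2 <- rnorm(50) #two outcomes
      z <- rnorm(50) #an optional control variable
      p.spec1 <- summary(lm(y1 ~ g))$coefficients[2, 4] #H1, baseline specification
      p.spec2 <- summary(lm(y1 ~ g + z))$coefficients[2, 4] #same H1, alternative specification
      p.other <- summary(lm(y2 ~ g))$coefficients[2, 4] #a different H1 entirely
      hit.int[i] <- min(p.spec1, p.spec2) < 0.05 #intensive margin: report the best specification
      hit.ext[i] <- min(p.spec1, p.other) < 0.05 #extensive margin: report the best hypothesis
    }
    c(intensive = mean(hit.int), extensive = mean(hit.ext)) #both exceed the nominal 0.05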

  3. If your p-value lies between .05 and .055, just round it to two decimal places, then you can add an asterisk to denote p<.05.
    If your p-value lies between .055 and .10, then perform a one-tailed test: now p<.05 and you have your asterisk.
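
    For anyone who wants to check the arithmetic of these (tongue-in-cheek) tricks, two lines of R suffice: rounding hides the third decimal place, and for a symmetric test statistic the one-sided p-value is half the two-sided one, provided the effect goes in the predicted direction.

    round(0.054, 2) #prints 0.05, so the asterisk appears after rounding
    0.09 / 2 #a two-sided p of 0.09 becomes a one-sided p of 0.045 (if the sign cooperates)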
