X and I heard about this much-publicized recent paper by Val Johnson, who suggests changing the default level of statistical significance from z=2 to z=3 (or, as he puts it, going from p=.05 to p=.005 or .001). Val argues that you need to go out to 3 standard errors to get a Bayes factor of 25 or 50 in favor of the alternative hypothesis. I don’t really buy this, first because Val’s model is a weird (to me) mixture of two point masses, which he creates in order to make a minimax argument, and second because I don’t see why you need a Bayes factor of 25 to 50 in order to make a claim. I’d think that a factor of 5:1, say, provides strong information already, if you really believe those odds.
The real issue, as I see it, is that we’re getting Bayes factors and posterior probabilities we don’t believe, because we’re assuming flat priors that don’t really make sense. This is a topic that’s come up over and over in recent months on this blog, for example in this discussion of why I don’t believe that early childhood stimulation really raised earnings by 42%. That’s not because I think the study in question was horribly flawed (sure, it suffers from selection issues and more could be done in the analysis, but the same could be said of just about any observational study, including many if not all of mine) but because, fundamentally, a point estimate of 42% is a Bayes estimate of 42% if you have a flat prior, and I don’t have a flat prior: I think effects are typically much closer to zero.
Anyway, that’s all background. Val’s paper got enough attention that X and I thought it would be worth trying to clear the air about a couple of points, most notably where his 0.005 came from and how it could be interpreted.
Here’s what X and I wrote:
In his article, “Revised standards for statistical evidence,” Valen Johnson proposes replacing the usual p = 0.05 standard for significance with the more stringent p = 0.005. This might be good advice in practice but we remain troubled by Johnson’s logic because it seems to dodge the essential nature of any such rule, that it expresses a tradeoff between the risks of publishing misleading results and of important results being left unpublished. Ultimately such decisions should depend on costs, benefits, and probabilities of all outcomes.
Johnson’s minimax prior is not intended to correspond to any distribution of effect sizes; rather it represents a worst-case scenario under some mathematical assumptions. Minimax and tradeoffs do not play well together (Berger, 1985), and it is hard for us to see how any worst-case procedure can supply much guidance on how to balance between two different losses.
Johnson’s evidence threshold is chosen relative to a conventional value, namely Jeffreys’ target Bayes factor of 1/25 or 1/50, for which we do not see any particular justification except with reference to the tail-area probability of 0.025, traditionally associated with statistical significance.
To understand the difficulty of this approach, consider the hypothetical scenario in which R. A. Fisher had chosen p = 0.005 rather than p = 0.05 as a significance threshold. In this alternative history, the discrepancy between p-values and Bayes factors remains and Johnson could have written a paper noting that the accepted 0.005 standard fails to correspond to 200-to-1 evidence against the null. Indeed, 200:1 evidence in a minimax sense gets processed by his fixed-point equation γ = exp[z*sqrt(2 log(γ)) − log(γ)] at the value γ = 0.005, into z = sqrt(−2 log(0.005)) = 3.26, which corresponds to a (one-sided) tail probability of Φ(−3.26), approximately 0.0005. Moreover, the proposition approximately divides any small initial p-level by a factor of sqrt(−4π log p), roughly equal to 10 for the p’s of interest. Thus, Johnson’s recommended threshold p = 0.005 stems from taking 1/20 as a starting point; p = 0.005 has no justification on its own (any more than does the p = 0.0005 threshold derived from the alternative default standard of 1/200).
One might then ask, was Fisher foolish to settle for the p = 0.05 rule that has caused so many problems in later decades? We would argue that the appropriate significance level depends on the scenario, and that what worked well for agricultural experiments in the 1920s might not be so appropriate for many applications in modern biosciences. Thus, Johnson’s recommendation to rethink significance thresholds seems like a good idea that needs to include assessments of actual costs, benefits, and probabilities, rather than being based on an abstract calculation.
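For anyone who wants to check the arithmetic in the letter, here is a minimal Python sketch. It takes the cutoff z = sqrt(−2 log p) and the divisor sqrt(−4π log p) directly from the text above, not from Johnson's paper, and it uses only the standard library. The function names are made up for illustration.

```python
import math


def normal_upper_tail(z):
    """One-sided tail probability P(Z > z) = Phi(-z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def z_from_level(p):
    """Cutoff z = sqrt(-2 log p), as quoted in the letter for a starting level p."""
    return math.sqrt(-2.0 * math.log(p))


def shrink_factor(p):
    """Approximate divisor sqrt(-4 pi log p) applied to a small starting level p."""
    return math.sqrt(-4.0 * math.pi * math.log(p))


def summarize(p):
    """Return (z cutoff, exact one-sided tail at z, shortcut p / sqrt(-4 pi log p))."""
    z = z_from_level(p)
    return z, normal_upper_tail(z), p / shrink_factor(p)


if __name__ == "__main__":
    # Conventional 1/20 starting point, and the alternative-history 1/200.
    for p in (0.05, 0.005):
        z, tail, approx = summarize(p)
        print(f"start p={p}: z={z:.3f}, exact tail={tail:.5f}, shortcut={approx:.5f}")
```

Starting from 0.05 the cutoff is about z = 2.45 with a tail probability of roughly 0.007; starting from 0.005 it is about z = 3.26 with a tail between 0.0005 and 0.0006. The shortcut comes out slightly larger than the exact tail in both cases, which is what you would expect from the usual normal-tail approximation Φ(−z) ≈ φ(z)/z.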
X and I seem to be getting into a habit of writing “soft” papers (in particular, this little article, this book review, and our discussion and rejoinder on Feller). In our defense, let me point out that the above analysis does involve some algebra. Yes, it’s pretty simple, but we did a bunch of other calculations too; as usual, these things look simple at the end only after some careful thinking went on earlier. Also, we are trying to do some real research as well (including some work on Bayes factors and posterior probabilities motivated by our conversations about Val’s paper).
X presents our discussion on his blog here.