Seth sent along an article (not by him) from the psychology literature and wrote:
This is a good example of your complaint about statistical significance. The authors want to say that predictability of information determines how distracting something is and have two conditions that vary in predictability. One is significantly distracting, the other isn’t. But the two conditions are not significantly different from each other. So the two conditions are different more weakly than p = 0.05.
I don’t think the reviewers failed to notice this. They just thought it should be published anyway, is my guess.
To me, the interesting question is: where should the bar be? at p = 0.05? at p = 0.10? something else? How can we figure out where to put the bar?
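To put numbers on the pattern Seth describes, here is a minimal sketch. The estimates and standard errors are invented for illustration (they are not taken from the article he sent), and the calculation assumes two independent estimates with a normal approximation.

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value for estimate/se under a normal approximation."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical distraction effects (arbitrary units), NOT from the article.
est_a, se_a = 25.0, 10.0   # condition A: clears the 0.05 bar
est_b, se_b = 10.0, 10.0   # condition B: does not

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)

# The comparison that actually matters: A minus B.
# For independent estimates the variances add.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"A vs. zero: p = {p_a:.3f}")
print(f"B vs. zero: p = {p_b:.3f}")
print(f"A vs. B:    p = {p_diff:.3f}")
```

With these made-up inputs, one condition is significant, the other is not, and the difference between them is nowhere near the 0.05 threshold.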
My quick answer is that we have to get away from .05 and .10 and move to something that takes into account prior information. This could be Bayesian (of course) or could be done classically using power calculations, as discussed in this article.
I don’t think this is completely impossible. Just as it is de facto required to do a prospective power calculation to get an NIH grant, a retrospective power calculation could be an effective requirement for publication in a journal. Sure, many of these power calculations would be b.s. (just as they are for NIH proposals), but the very act of trying–of taking numbers from the literature review and making some assumptions–would be useful, I think.
To which Seth replied:
I think whether or not you use prior information to assess the contribution is orthogonal to the question I’m asking, which is: how much is enough? How much contribution is enough for publication? In this particular case I’m with the authors and the reviewers: 0.05 was too harsh.
Maybe if there is a switch to using prior information, the question of “how much is enough?” can be reopened. In that sense the two questions are not orthogonal.
Seth’s point, I think, is that setting a p<0.05 threshold is too strict: Such a rule encourages the study of already-known effects and also encourages studies to have very large sample sizes. From Seth’s perspective, both these incentives go in the wrong direction: He’d rather have more exploratory studies with smaller sample sizes (hence more flexibility, lower cost, etc.). So Seth would prefer a more relaxed threshold (perhaps p<0.2?) so that people could do small, opportunistic studies and still have a reasonable chance of being published.
On the other hand, if you’re really studying noise, p=0.2 is a license to spill all sorts of crap over the peer-reviewed pages. That’s why I prefer to present results in the context of prior knowledge.
For example, suppose you’re a sociologist interested in studying sex ratios. A quick review of the literature will tell you that the differences in the percentage of girl births, comparing race of mother, age of mother, birth order, etc., are less than 1%. So if you want to study, say, the correlation between parental beauty and sex ratio, you’re gonna be expecting very small effects which you’ll need very large sample sizes to find. Statistical significance has nothing to do with it: Unless you have huge samples and good measurements, you can pretty much forget about it, whether your p-value is 0.02 or 0.05 or 0.01.
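Here is a rough sketch of the kind of power calculation I have in mind for the sex-ratio example. The inputs are assumptions chosen for illustration: a baseline of 48.5% girl births, a true difference of 0.5 percentage points (somewhere under the 1% ceiling from the literature), and a hypothetical study with 1,500 births per group. It uses a simple two-sample z-test approximation with the same variance in both groups.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

def power_two_proportions(p_base, diff, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    se = math.sqrt(2 * p_base * (1 - p_base) / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = diff / se
    # Probability of landing beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

def n_per_group_needed(p_base, diff, power=0.80, alpha=0.05):
    """Approximate per-group sample size for the requested power."""
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    variance = 2 * p_base * (1 - p_base)
    return math.ceil((z_crit + z_power) ** 2 * variance / diff ** 2)

# All three numbers below are illustrative assumptions.
p_base = 0.485          # assumed baseline proportion of girl births
assumed_diff = 0.005    # assumed true difference (0.5 percentage points)
small_study_n = 1500    # hypothetical births per group

power_small = power_two_proportions(p_base, assumed_diff, small_study_n)
n_needed = n_per_group_needed(p_base, assumed_diff)

print(f"Power with n = {small_study_n} per group: {power_small:.3f}")
print(f"n per group for 80% power: {n_needed}")
```

Under these assumptions the small study has power barely above the 5% false-positive rate, and you would need well over a hundred thousand births per group to have a decent shot at detecting the effect. Change the assumed difference and the answer changes a lot, which is exactly why it is worth writing the assumption down.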
On the other hand, if you’re studying something more interesting and innovative–along the lines of Seth’s self-experiments–then, yeah, maybe the usual standards of p-values are too strict. But, even then, it depends on what p-value you’re looking at. With n=1 you can’t get any p-value at all, and I think that even Seth would agree that some formal replication of his methods by others (in addition to the existing anecdotal evidence) would be a plus.