Seth sent along an article (not by him) from the psychology literature and wrote:

This is a good example of your complaint about statistical significance. The authors want to say that predictability of information determines how distracting something is and have two conditions that vary in predictability. One is significantly distracting, the other isn’t. But the two conditions are not significantly different from each other. So the two conditions are different more weakly than p = 0.05.

I don’t think the reviewers failed to notice this. They just thought it should be published anyway, is my guess.

To me, the interesting question is: where should the bar be? at p = 0.05? at p = 0.10? something else? How can we figure out where to put the bar?

I replied:

My quick answer is that we have to get away from .05 and .10 and move to something that takes into account prior information. This could be Bayesian (of course) or could be done classically using power calculations, as discussed in this article.

I don’t think this is completely impossible. Just as it is de facto required to do a prospective power calculation to get an NIH grant, a retrospective power calculation could be an effective requirement for publication in a journal. Sure, many of these power calculations would be b.s. (just as they are for NIH proposals), but the very act of trying–of taking numbers from the literature review and making some assumptions–would be useful, I think.

To which Seth replied:

I think whether or not you use prior information to assess the contribution is orthogonal to the question I’m asking, which is: how much is enough? How much contribution is enough for publication? In this particular case I’m with the authors and the reviewers: 0.05 was too harsh.

Maybe if there is a switch to using prior information the question of how much is enough? can be reopened. In that sense the two questions are not orthogonal.

Seth’s point, I think, is that setting a p<0.05 threshold is too strict: Such a rule encourages the study of already-known effects and also encourages studies to have very large sample sizes. From Seth's perspective, both these incentives go in the wrong direction: He'd rather have more exploratory studies with smaller sample sizes (hence more flexibility, lower cost, etc.). So Seth would prefer a more relaxed threshold (perhaps p<0.2?) so that people could do small, opportunistic studies and still have a reasonable chance of being published.
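The pattern in the paper Seth describes (one condition significant, the other not, yet the two not significantly different from each other) is easy to reproduce. Here is a quick sketch with made-up estimates and standard errors; nothing below comes from the actual study:

```python
from statistics import NormalDist
from math import sqrt

norm = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a z-statistic."""
    return 2 * (1 - norm.cdf(abs(z)))

# Hypothetical estimates and standard errors for the two conditions.
est1, se1 = 25.0, 10.0   # condition 1: z = 2.5
est2, se2 = 10.0, 10.0   # condition 2: z = 1.0

p1 = two_sided_p(est1 / se1)   # about 0.01: "significant"
p2 = two_sided_p(est2 / se2)   # about 0.32: "not significant"

# The comparison the authors' claim actually rests on:
# the difference between the two conditions.
se_diff = sqrt(se1 ** 2 + se2 ** 2)
p_diff = two_sided_p((est1 - est2) / se_diff)   # about 0.29
```

So one condition clears p < 0.05 and the other does not, yet the comparison between them is nowhere near significant: the difference between "significant" and "not significant" is not itself statistically significant.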

On the other hand, if you’re really studying noise, p=0.2 is a license to spill all sorts of crap over the peer-reviewed pages. That’s why I prefer to present results in the context of prior knowledge.

For example, suppose you’re a sociologist interested in studying sex ratios. A quick review of the literature will tell you that the differences in %girl births, comparing race of mother, age of mother, birth order, etc, are less than 1%. So if you want to study, say, the correlation between parental beauty and sex ratio, you’re gonna be expecting very small effects which you’ll need very large sample sizes to find. Statistical significance has nothing to do with it: Unless you have huge samples and good measurements, you can pretty much forget about it, whether your p-value is 0.02 or 0.05 or 0.01.
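A back-of-the-envelope version of that power logic, using the standard normal-approximation sample-size formula for comparing two proportions (the specific proportions below are illustrative, not taken from the sex-ratio literature):

```python
from statistics import NormalDist
from math import ceil

norm = NormalDist()

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a half-percentage-point difference in %girl births
# (say, 48.5% vs 49.0%) takes on the order of 150,000 births per group.
n = n_per_group(0.485, 0.490)
```

With effects that small, no plausible beauty-and-sex-ratio study has the sample size to say anything, whatever p-value threshold you pick.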

On the other hand, if you’re studying something more interesting and innovative–along the lines of Seth’s self-experiments–then, yeah, maybe the usual standards of p-values are too strict. But, even then, it depends on what p-value you’re looking at. With n=1 you can’t get any p-value at all, and I think that even Seth would agree that some formal replication of his methods by others (in addition to the existing anecdotal evidence) would be a plus.

I agree that the universal norm of requiring p-value

Joseph Kadane noted that using a p-value of 0.05 is an agreed-upon rule of thumb or convention handed down from Fisher that is completely lacking in theoretical motivation. If someone can figure out a way to motivate the decision rule theoretically, that would be a huge contribution.

I think Berger and Sellke have made a convincing case that .05 is not a very stringent evidentiary threshold.
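One way to quantify that point is the Sellke–Bayarri–Berger lower bound on the Bayes factor in favor of the null, −e·p·log(p), valid for p < 1/e:

```python
from math import e, log

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor
    in favor of the null hypothesis; valid for p < 1/e."""
    assert 0 < p < 1 / e
    return -e * p * log(p)

# At p = 0.05 the bound is about 0.41: the data are at best
# roughly 2.5-to-1 against the null, which is weak evidence.
bf_05 = min_bayes_factor(0.05)
```

Even p = 0.01 only pushes the bound down to about 0.13, roughly 8-to-1 against the null.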

Hurlbert and Lombardi also address this in their 2009 article "Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian". They propose the term neoFisherian significance assessments (NFSA) and write, "Their role is assessment of the existence, sign and magnitude of statistical effects. The common label of null hypothesis significance tests (NHST) is retained for paleoFisherian and Neyman-Pearsonian approaches and their hybrids. The original Neyman-Pearson framework has no utility outside quality control type applications."

I believe significance always comes down to what you are testing. A significance threshold above .05 for the effectiveness of a medication may be looser than what the FDA would accept. Remember that p-values are related to our confidence that future occurrences will be predicted accurately. The nature of the study must be taken into account, as many sociology and psychology papers are normally accepted with larger critical p-values. Perhaps someone should write a paper that sets specific standards for each discipline.

Great point, and low-hanging fruit, but I don't see enough high-status statisticians making it in public. I don't understand how numberism arises (beyond the mere efficiency of it).

Like the whole "six sigma" cult of quality control.

I agree. Reminds me of the numberism involved in "six sigma".

The problem is not where the bar is placed, but that it is placed at all. First, the relevant metric is not p-value, but effect size. Second, even if it WERE p-value, selecting one p-value for all research is … well, there is a quote from Fisher saying this is silly. I can't find it, but I have a reference to it at work; I think it's in the book "The Cult of Statistical Significance".

Changing the permissible p value to 0.2 wouldn't be "a license to spill all sorts of crap over the peer-reviewed pages" because it would not increase the number of available pages. Plenty of papers with p less than 0.05 are rejected.

In most papers the actual significance level is weaker than p = 0.05 because the analysis wasn't specified in advance. Whether the average reviewer understands this I don't know. But it represents a system-wide weakening. Journals as prestigious as Science and Nature allow plenty of room for the investigator to adjust the analysis to favor a particular result. Often it doesn't matter; the effect is really strong. But sometimes it does.
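That system-wide weakening is easy to simulate: under a pure-noise null, if an investigator can choose among several analyses and report whichever comes out best, the chance of at least one "significant" result grows well past 5%. The number of candidate analyses below is arbitrary, and the simulated tests are independent, which understates the correlation among real alternative analyses of the same data:

```python
import random

random.seed(1)

def false_positive_rate(n_tests, n_sims=10_000, z_crit=1.96):
    """Fraction of null simulations in which at least one of
    n_tests independent z-statistics reaches |z| > z_crit."""
    hits = 0
    for _ in range(n_sims):
        if any(abs(random.gauss(0, 1)) > z_crit for _ in range(n_tests)):
            hits += 1
    return hits / n_sims

rate_one = false_positive_rate(1)    # about 0.05, as advertised
rate_five = false_positive_rate(5)   # about 1 - 0.95**5, roughly 0.23
```

So a nominal p = 0.05 after even a modest amount of analysis-shopping behaves more like Seth's p = 0.2.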

I have always wondered what you can write when you have, say, P = 0.09. The effect is significant at a cut-off of 0.1, yes. But most people use 0.05 as the cut-off.

-Some evidence for the alternative hypothesis? (but the P-value is not a measure of evidence, right?)

-The effect seems to be important? (in combination with a large effect size)

Suggestions??

When my favorite professor in grad. school, Herman Friedman, saw papers with tons of p-values, he would say that the authors were "p-ing all over the research"