In a discussion of some of the recent controversy over promiscuously statistically-significant science, Rafael Irizarry points out that there is a tradeoff between stringency and discovery and suggests that raising the bar of statistical significance (for example, to the .01 or .001 level instead of the conventional .05) will reduce the noise level but will also reduce the rate of identification of actual discoveries.
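To put rough numbers on that tradeoff, here is a minimal sketch in Python. It assumes a two-sided z-test and a hypothetical true effect sitting 2.8 standard errors from zero (picked so that power is about 80% at the .05 level); neither assumption comes from the discussion itself.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(alpha, effect_in_se):
    """Probability that a two-sided z-test rejects at level alpha.

    effect_in_se is the true effect divided by its standard error
    (a hypothetical input, not a number from any real study).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Chance of landing beyond the upper or lower critical value.
    upper = 1 - STD_NORMAL.cdf(z_crit - effect_in_se)
    lower = STD_NORMAL.cdf(-z_crit - effect_in_se)
    return upper + lower

if __name__ == "__main__":
    for alpha in (0.05, 0.01, 0.001):
        # With effect_in_se = 0 the "power" is just the false-positive rate.
        print(f"threshold {alpha}: false-positive rate {power(alpha, 0.0):.3f}, "
              f"power {power(alpha, 2.8):.2f}")
```

Under these assumptions, power falls from about 0.8 at the .05 level to about 0.3 at the .001 level, while the false-positive rate falls from .05 to .001.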
I agree. But I should clarify that when I criticize a claim of statistical significance, arguing that the claimed “p less than .05” could easily occur under the null hypothesis, given that the hypothesis test that is chosen is contingent on the data (see examples of clothing and menstrual cycle, arm circumference and political attitudes, and ESP), I am not recommending a switch to a more stringent p-value threshold. Rather, I would prefer p-values not to be used as a threshold for publication at all.
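A crude way to see how “p less than .05” can easily occur under the null is to simulate it. The sketch below is a simplified stand-in, not a model of any of the studies mentioned: it assumes the analyst has several independent comparisons available in pure-noise data and reports whichever one looks best. The function names, sample sizes, and the known-variance z-test are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(x, y, sd=1.0):
    """Two-sided z-test p-value for a difference in means, with known sd."""
    se = sd * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_comparisons, n_per_group=20, n_sims=2000, seed=1):
    """Share of pure-noise studies whose best comparison reaches p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_comparisons):
            # Both groups come from the same distribution: the null is true.
            x = [rng.gauss(0, 1) for _ in range(n_per_group)]
            y = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(x, y))
        # The analyst reports only the most striking comparison.
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_sims

if __name__ == "__main__":
    for k in (1, 5, 10):
        print(f"{k} available comparisons: "
              f"{false_positive_rate(k):.2f} of null studies reach p < .05")
```

With independent comparisons the expected rate is 1 - 0.95^k, or about 23% for five comparisons. Real data-contingent analysis is subtler than this, since the choices are usually made after seeing the data rather than by running every test, but the direction of the problem is the same.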
Here’s my point: The question is not whether something gets published, but rather where it is published, in what form, and how it is received and followed up.
In the era of online repositories, every study can and should be published. (I do think we can and should do better than Arxiv—or, I should say, I hope there will be an equivalent for fields other than mathematics and physics—but Arxiv has established the principle.)
For most if not all of the studies we’ve been discussing lately, I think the raw data should be published too. Sometimes the authors will share their data with people who request it, but that shouldn’t even be an issue if the data and survey forms are just posted.
In a world where everything can be posted, what is the point of publishing in Psychological Science? Publicity and that stamp of approval. Assuming that both of these are limited quantities in some flexible way (in the same way that the government cannot simply print unlimited amounts of money and that banks cannot simply give out unlimited amounts of government-backed loans), some selection is necessary.
I would make that selection based on the quality of the data collection and analysis, the scientific interest of the research, and the importance of the topic—but not on the significance level of the results.
There are exceptions, of course: I could imagine a clean, well-defined, internally-replicated study with a surprising effect, where statistical significance would be part of the argument for why to believe it. But this is usually not what we see. Instead, over and over again we see poorly-measured data with analyses that are iffy or data-dependent. Studies should earn our attention through their data quality or scientific importance, not because they are attention-grabbing and have a p-value of .04.
I think Rafa’s point about the tradeoff between stringency and discovery is important, and I’d like to move this discussion away from p-values and toward concerns of data quality.
Of course I also think data should be analyzed appropriately. For example, with Bem’s ESP study, a proper analysis would not pull out some wacky interactions and declare them statistically significant; instead, it would display all interactions and show the estimates and uncertainties for all of them. The point would be to see what can be learned from the data, not to attempt to obtain a claim of certainty.
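As a rough illustration of what displaying everything might look like, here is a small Python sketch. The data are simulated noise and the condition labels are made up; nothing here comes from Bem’s study. The only point is that every comparison is reported with its estimate and uncertainty, rather than a selected few being flagged as significant.

```python
import random
from statistics import mean, variance

def summarize_all(comparisons):
    """Return (label, estimate, std_error) for every comparison, none dropped."""
    rows = []
    for label, (treated, control) in comparisons.items():
        estimate = mean(treated) - mean(control)
        # Standard error of a difference in means, unequal variances allowed.
        std_error = (variance(treated) / len(treated)
                     + variance(control) / len(control)) ** 0.5
        rows.append((label, estimate, std_error))
    return rows

def make_fake_data(labels, n=25, seed=1):
    """Simulated pure-noise data for hypothetical conditions (illustration only)."""
    rng = random.Random(seed)
    return {
        label: ([rng.gauss(0, 1) for _ in range(n)],
                [rng.gauss(0, 1) for _ in range(n)])
        for label in labels
    }

if __name__ == "__main__":
    labels = ["condition A", "condition B", "condition C", "condition D"]
    for label, est, se in summarize_all(make_fake_data(labels)):
        low, high = est - 1.96 * se, est + 1.96 * se
        print(f"{label:12s} {est:+.2f}  (95% interval {low:+.2f} to {high:+.2f})")
```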
Attention and resources are limited, so there will always be some sort of selection of which studies get followed up. I’d just like that selection to be based on something other than p-values, especially considering all the small-N studies of small effects where any statistical significance is essentially noise anyway.
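The claim about small-N studies of small effects can be checked with a quick simulation. This is a sketch under assumed numbers: a hypothetical true effect of 0.1 standard deviations, 20 people per group, and a known-variance comparison of two means. It asks what the estimates look like in the subset of studies that happen to cross the .05 threshold.

```python
import random

def significance_filter(true_effect=0.1, n_per_group=20, n_sims=100_000, seed=1):
    """Simulate many small studies and keep only those with p < .05.

    Assumes true_effect > 0 and unit standard deviation in both groups.
    Returns (share significant, average exaggeration of the estimate,
    share of significant estimates with the wrong sign).
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    significant = []
    for _ in range(n_sims):
        # With known variance, the estimated difference is normal around the truth.
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > 1.96 * se:
            significant.append(estimate)
    share_significant = len(significant) / n_sims
    exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
    wrong_sign = sum(1 for e in significant if e < 0) / len(significant)
    return share_significant, exaggeration, wrong_sign

if __name__ == "__main__":
    share, exaggeration, wrong_sign = significance_filter()
    print(f"share of studies reaching p < .05: {share:.3f}")
    print(f"average |estimate| / true effect among those: {exaggeration:.1f}")
    print(f"share of those with the wrong sign: {wrong_sign:.2f}")
```

In this setup any estimate that reaches significance must be more than six times the true effect (the cutoff is 1.96 standard errors, about 0.62, against a true effect of 0.1), and a noticeable share of the significant estimates point in the wrong direction.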
The dangerous lure of certainty
Regarding the title of this post: I’m not saying that Rafa is suffering from the dangerous lure of certainty. Rather, I’m saying that p-value thresholds are connected to this dangerous lure, present among producers and consumers of science.
This is not a Bayes vs. non-Bayes debate. If Bayesians were going around using a posterior probability threshold (e.g., only accept a paper if the direction of its main finding is at least 95% certain), I’d be bothered there too.
And the lure of certainty even arises in completely non-quantitative studies. Consider disgraced primatologist Marc Hauser, who refused to let others look at his data tapes. I expect that he has a deterministic view of his own theories, that he views them as true and thus thinks any manipulation in that direction is valid in some deep sense.