Neal writes,

Thanks for bringing up the most interesting piece by Gerber and Malhotra and the Drum comment.

My own take is perhaps a bit less sinister but more worrisome than Drum’s interpretation of the results. The issue is how “tweaking” is interpreted. Imagine a preliminary analysis which shows a key variable to have a standard error as large as its coefficient (in a regression). Many people would simply stop analysis at that point. Now consider getting a coefficient one and a half times its standard error (or 1.6 times its standard error). We all know it is not hard at that point to try a few different specifications and find one that gives a magic p-value just under .05 and hence earns the magic star. But of course the magic star seems critical for publication.

Thus I think the problem is with journal editors and reviewers who love that magic star, and hence with authors who think that it matters whether t is 1.64 or 1.65. Journal editors could (and should) correct this.

When Political Analysis went quarterly we got it about a third right. Our instructions are:

“In most cases, the uncertainty of numerical estimates is better conveyed by confidence intervals or standard errors (or complete likelihood functions or posterior distributions), rather than by hypothesis tests and p-values. However, for those authors who wish to report “statistical significance,” statistics with probability levels of less than .001, .01, and .05 may be flagged with 3, 2, and 1 asterisks, respectively, with notes that they are significant at the given levels. Exact probability values may always be given. Political Analysis follows the conventional usage that the unmodified term “significant” implies statistical significance at the 5% level. Authors should not depart from this convention without good reason and without clearly indicating to readers the departure from convention.”

Would that I had had the guts to drop “In most cases” and stop after the first sentence. And even better would have been to simply demand a confidence interval.

Most (of the few) people I talk with have no difficulty distinguishing “insignificant” from “equals zero,” but Jeff Gill in his “The Insignificance of Null Hypothesis Significance Testing” (Political Research Quarterly, 1999) has a lot of examples showing I do not talk with a random sample of political scientists. Has the world improved since 1999?

BTW, since you know my obsession with what Bayes can or cannot do to improve life, this whole issue is, in my mind, the big win for Bayesians. Anything that lets people not get excited or depressed depending on whether a CI (er HPD credible region) is (-.01,1.99) or (.01,2.01) has to be good.
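To put numbers on how little separates those two intervals, here is a rough sketch in Python. It assumes both are symmetric 95% intervals built from a normal approximation (an assumption made for illustration; an HPD region need not be), and the function name `z_and_p_from_ci` is hypothetical.

```python
from statistics import NormalDist

def z_and_p_from_ci(lower, upper, level=0.95):
    """Recover the z-ratio and two-sided p-value implied by a symmetric,
    normal-theory interval (an assumption, not something the interval
    itself guarantees)."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    estimate = (lower + upper) / 2.0               # midpoint of the interval
    std_error = (upper - lower) / (2.0 * crit)     # half-width divided by crit
    z = estimate / std_error
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return z, p_value

for interval in [(-0.01, 1.99), (0.01, 2.01)]:
    z, p = z_and_p_from_ci(*interval)
    print(interval, "z = %.3f" % z, "p = %.4f" % p)
```

Under these assumptions the two p-values land on opposite sides of .05 while differing by less than .01, which is the whole point: the estimate and its uncertainty are practically identical.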

My take on this: I basically agree. In many fields, you need that statistical significance–even if you have to try lots of tests to find it.
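A back-of-the-envelope sketch of why trying lots of tests works so well: if there is truly no effect and each test is run at the .05 level, the chance that at least one comes up significant grows quickly with the number of tries. The code below assumes the tests are independent, which alternative specifications of the same regression are not, so treat it as an illustration rather than a model of actual practice. The function names are made up for this example.

```python
import random

def chance_of_a_star(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests is
    significant at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulated_chance_of_a_star(n_tests, n_trials=20000, crit=1.96, seed=0):
    """Monte Carlo check: draw n_tests standard-normal z-ratios per trial
    and count the trials in which any |z| exceeds crit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if any(abs(rng.gauss(0.0, 1.0)) > crit for _ in range(n_tests)):
            hits += 1
    return hits / n_trials

for k in (1, 3, 5, 10):
    print(k, round(chance_of_a_star(k), 3), round(simulated_chance_of_a_star(k), 3))
```

With correlated specifications the inflation would be smaller than this, but the direction is the same.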

This is an interesting point, but isn't it just reframing the same problem? If one gets a t-statistic of 1.60, then one could try to respecify and play around to get "significance." Asking things to be reported as confidence intervals doesn't really change incentives…

In the same case, I'd get a 95% confidence interval of, say, (-0.02,1.45) and I would play around to get (0.02,1.49). Right? I mean, we're all smart enough to know that "significance" means "zero is not in the 95% C.I." True, the change isn't as dramatic in the C.I. case, but aren't most empiricists smart enough to realize that a t-statistic of 1.68 and a C.I. that looks like (0.02, 1.49) both mean "barely significant?"

If we're allowed to keep dreaming … Don Rubin suggested that to make observational studies more objective, researchers should fully specify the design, data collection, and analysis before seeing the outcome data. Journal editors could take this a step further and base publication decisions on the intro and methods sections before seeing the results.

Nothing said on Oct. 5 is wrong, but p-values are misleading (sadly, in political science * means big, ** means quite big and *** means wow, whereas no star means zero). Clearly that is easy to translate into p-values with the same level of sadness. CIs are in the units of interest and to my mind convey the information that is wanted. Clearly once you have an estimate and a standard error and the god of asymptotic normality you can compute anything, and clearly people would probably continue to respecify until they got zero out of the CI.
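As a small illustration of "you can compute anything," the sketch below takes an estimate and a standard error, assumes asymptotic normality, and returns the z-ratio, the two-sided p-value, the confidence interval, and the asterisks under the .001/.01/.05 convention quoted from the Political Analysis instructions above. The function name `summarize` and the example numbers are invented for the illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def summarize(estimate, std_error, level=0.95):
    """Everything derivable from an estimate and its standard error,
    assuming the estimator is approximately normal."""
    z = estimate / std_error
    p_value = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    ci = (estimate - crit * std_error, estimate + crit * std_error)
    # Asterisks at the .001, .01, and .05 levels
    if p_value < 0.001:
        stars = "***"
    elif p_value < 0.01:
        stars = "**"
    elif p_value < 0.05:
        stars = "*"
    else:
        stars = ""
    return {"z": z, "p": p_value, "ci": ci, "stars": stars}

# Two made-up estimates with the same standard error
for est in (1.9, 2.0):
    print(est, summarize(est, 1.0))
```

The two made-up estimates differ by a tenth of a standard error, yet one gets a star and the other does not; the intervals, by contrast, look nearly the same.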