We interrupt our usual programming of mockery of buffoons to discuss a bit of statistical theory . . .
Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements.
The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.
How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example from Gelman and Weakliem (2009). A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio. The original analysis by Kanazawa (2007) had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 rather than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data. Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways, first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely. First off, a claimed 90% probability that θ>0 seems too strong. Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.
Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to difference in probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero—that is, exactly what one might expect from chance alone—would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty—5 to 1 odds—based on data that not only could occur by chance but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology are typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.
Rather than relying on noninformative priors, I prefer the suggestion of Greenland and Poole to bound posterior probabilities using real prior information.
OK, I did discuss some buffoonish research here. But, look, no mockery! I was using the silly stuff as a lever to better understand some statistical principles. And that’s ok.