I promised I wouldn’t do any new blogging until January but I’m here at this conference and someone asked me a question about the above slide from my talk.

The point of the story in that slide is that flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over, is not the same as the prior distribution that you’re using in the analysis, your Bayesian inferences in general will have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% *posterior* interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:

This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

True effect theta is simulated from normal(0, tau).

Data y are simulated from normal(theta, sigma).

Classical 95% interval is y +/- 2*sigma

Bayesian 95% interval is theta.hat.bayes +/- 2*theta.se.bayes,

where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)

and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2))

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.

We *are* assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.

**Multiple comparisons**

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you *don’t* have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:

setwd("~/AndrewFiles/research/multiplecomparisons")
library("arm")
spidey <- function(sigma, tau, N) {
cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
theta <- rnorm(N, 0, tau)
y <- theta + rnorm(N, 0, sigma)
signif_classical <- abs(y) > 2*sigma
cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1), "%) of the 95% classical intervals exclude 0\n", sep="")
cat("Mean absolute value of these classical estimates is", fround(mean(abs(y)[signif_classical]), 2), "\n")
cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_classical]), 2), "\n")
cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1), "%) of the 95% posterior intervals exclude 0\n", sep="")
cat("Mean absolute value of these Bayes estimates is", fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_bayes]), 2), "\n")
cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
}
sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)

Here's the first half of the results:

sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45
Mean absolute value of the corresponding true parameters is 0.56
13.9% of these are the wrong sign (Type S error)

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The next half of the output shows the results from the Bayesian intervals:

62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95
Mean absolute value of the corresponding true parameters is 0.97
3.2% of these are the wrong sign (Type S error)

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there *is* a Bayesian claim with confidence, it will be large---that makes sense; the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, and so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average for these million comparisons turns out to be 0.94.

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying *true effects* corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you'll find a type S error rate of a bit less 2.5% which is what you'd expect, given that these 95% posterior intervals exclude 0, so something less than 2.5% of the interval will be of the wrong sign.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)
And here's what we get:

sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64
Mean absolute value of the corresponding true parameters is 1.34
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68
Mean absolute value of the corresponding true parameters is 1.69
1.0% of these are the wrong sign (Type S error)

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

**Bayes does better because it uses more information**

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn’t available? I have two answers. First, in many cases, some prior information *is* available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework, to set the goals of estimating the sign and magnitudes of the thetas rather than to frame things in terms of the unrealistic and uninteresting theta=0 hypothesis.

**P.S.** Now I know why I swore off blogging! The analysis, the simulation, and the writing of this post took an hour and a half of my work time.

**P.P.S.** Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.