## Why I don’t like Bayesian statistics

Clarification: Somebody pointed out that, when people come here from a web search, they won’t realize that it’s an April Fool’s joke. See here for my article in Bayesian Analysis that expands on the blog entry below, along with discussion by four statisticians and a rejoinder by myself that responds to the criticisms that I raised.

Below is the original blog entry.

Bayesian inference is a coherent mathematical theory but I wouldn’t trust it in scientific applications. Subjective prior distributions don’t inspire confidence, and there’s no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it’s not). Where do prior distributions come from, anyway? I don’t trust them and I see no reason to recommend that other people do, just so that I can have the warm feeling of philosophical coherence.

Bayesian theory requires a great deal of thought about the given situation to apply sensibly, and recommending that scientists use Bayes’ theorem is like giving the neighborhood kids the key to your F-16. I’d rather start with tried and true methods, and then generalize using something I can trust, like statistical theory and minimax principles, that don’t depend on your subjective beliefs. Especially when the priors I see in practice are typically just convenient conjugate forms. What a coincidence that, of all the infinite variety of priors that could be chosen, it always seems to be the normal, gamma, beta, etc., that turn out to be the right choice?

To restate these concerns mathematically: I like unbiased estimates and I like confidence intervals that really have their advertised confidence coverage. I know that these aren’t always going to be possible, but I think the right way forward is to get as close to these goals as possible and to develop robust methods that work with minimal assumptions. The Bayesian approach–to give up even trying to approximate unbiasedness and to instead rely on stronger and stronger assumptions–that seems like the wrong way to go.

In the old days, Bayesian methods at least had the virtue of being mathematically clean. Nowadays, they all seem to be computed using Markov chain Monte Carlo, which means that, not only can you not realistically evaluate the statistical properties of the method, you can’t even be sure it’s converged, just adding one more item to the list of unverifiable assumptions.

People tend to believe results that support their preconceptions and disbelieve results that surprise them. Bayesian methods encourage this undisciplined mode of thinking. I’m sure that many individual Bayesian statisticians are acting in good faith, but they’re providing encouragement to sloppy and unethical scientists everywhere. And, probably worse, Bayesian techniques motivate even the best-intentioned researchers to get stuck in the rut of prior beliefs.

Bayesianism assumes: (a) Either a weak or uniform prior, in which case why bother?, (b) Or a strong prior, in which case why collect new data?, (c) Or more realistically, something in between, in which case Bayesianism always seems to duck the issue.

Nowadays people use a lot of empirical Bayes methods. I applaud the Bayesians’ newfound commitment to empiricism but am skeptical of this particular approach, which always seems to rely on an assumption of “exchangeability.” I do a lot of work in political science, where people are embracing Bayesian statistics as the latest methodological fad. Well, let me tell you something. The 50 states aren’t exchangeable. I’ve lived in a few of them and visited nearly all the others, and calling them exchangeable is just silly. Calling it a hierarchical or a multilevel model doesn’t change things–it’s an additional level of modeling that I’d rather not do. Call me old-fashioned, but I’d rather let the data speak without applying a probability distribution to something like the 50 states which are neither random nor a sample.

Also, don’t these empirical Bayes methods use the data twice? If you’re going to be Bayesian, then be Bayesian: it seems like a cop-out and contradictory to the Bayesian philosophy to estimate the prior from the data. If you want to do hierarchical modeling, I prefer a method such as generalized estimating equations that makes minimal assumptions.

And don’t even get me started on what Bayesians say about data collection. The mathematics of Bayesian decision theory lead inexorably to the idea that random sampling and random treatment allocation are inefficient, that the best designs are deterministic. I have no quarrel with the mathematics here–the mistake lies deeper in the philosophical foundations, the idea that the goal of statistics is to make an optimal decision. A Bayes estimator is a statistical estimator that minimizes the average risk, but when we do statistics, we’re not trying to “minimize the average risk,” we’re trying to do estimation and hypothesis testing. If the Bayesian philosophy of axiomatic reasoning implies that we shouldn’t be doing random sampling, then that’s a strike against the theory right there. Bayesians also believe in the irrelevance of stopping times–that, if you stop an experiment based on the data, it doesn’t change your inference. Unfortunately for the Bayesian theory, the p-value _does_ change when you alter the stopping rule, and no amount of philosophical reasoning will get you around that point.
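To make the stopping-rule point concrete, here is a minimal sketch. The coin-flip data (9 heads, 3 tails) and the null hypothesis of a fair coin are illustrative assumptions, not something from the post. The same data give different one-sided p-values depending on whether the plan was to flip 12 times or to keep flipping until the third tail.

```python
from fractions import Fraction
from math import comb

# Illustrative data (an assumption, not from the post): 9 heads, 3 tails.
HEADS, TAILS = 9, 3
N = HEADS + TAILS


def p_value_fixed_n(heads: int, n: int) -> Fraction:
    """One-sided p-value P(X >= heads) for a fair coin when n was fixed in advance."""
    return Fraction(sum(comb(n, k) for k in range(heads, n + 1)), 2 ** n)


def p_value_stop_at_tails(heads: int, tails: int) -> Fraction:
    """One-sided p-value P(at least `heads` heads before the `tails`-th tail).

    For a fair coin, that event happens exactly when the first
    heads + tails - 1 flips contain at most tails - 1 tails.
    """
    m = heads + tails - 1
    return Fraction(sum(comb(m, k) for k in range(tails)), 2 ** m)


p_binomial = p_value_fixed_n(HEADS, N)
p_negative_binomial = p_value_stop_at_tails(HEADS, TAILS)

print(f"flip exactly {N} times:   p = {float(p_binomial):.4f}")
print(f"flip until tail number {TAILS}: p = {float(p_negative_binomial):.4f}")
```

Under both designs the likelihood is proportional to p^9 (1 - p)^3, which is why the posterior distribution is the same either way while the p-value is not.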

I can’t keep track of what all those Bayesians are doing nowadays–unfortunately, all sorts of people are being seduced by the promises of automatic inference through the “magic of MCMC”–but I wish they would all just stop already and get back to doing statistics the way it should be done, back in the old days when a p-value stood for something, when a confidence interval meant what it said, and statistical bias was something to eliminate, not something to embrace.

1. GoodnessOfFit says:

got me.

2. jfalk says:

OK. I get it. I can read a calendar as well as the next guy. The only problem is, I agree with most of it.

3. 1/356 says:

I'm with jfalk.

At least the blogger and I can agree 1/365 of the time.

4. Andrew says:

Finally Gelman talks some sense!

5. Daniel Lakeland says:

Don't ruin the perfectly good Apr 1 joke by replying today, but TOMORROW I'm expecting an overview of your counterargument :-)

6. charlie w says:

So, for April 2nd I assume you'll give your responses to some of these objections? I just bought your Bayesian book, but I'd be curious to read your informal thoughts on these issues.

7. noahpoah says:

Well played, sir. I was very confused there for a bit. I actually briefly considered the possibility that the Andrew Gelman who coauthored Bayesian Data Analysis and the Andrew Gelman posting on this blog were two different people.

The last paragraph is beautiful.

8. Manoel Neto says:

Well,
I am a graduate student at Universidade de São Paulo, Brazil, who has always read your blog, but it never occurred to me to say anything. But now I have to say: what a joke! I expect your response to some of the claims, which are good ones indeed – at least to me.

regards

you had me going for a sec there; you are not afraid to be controversial after all…

See you in 15 days and we can debate this issue some more ;)

10. John says:

Yes, please make your next post a picking-apart of this one. That would be edifying.

11. greg says:

Brilliant!

12. John says:

"Bayesians also believe in the irrelevance of stopping times…."

Isn't that one of the methods so-called psychics use to fool researchers & the general public?

13. Yu-Sung says:

You almost got me fooled! The tone is so unlike you. So it is an April 1 joke!!

14. Anonymous says:

Yes, please write a response to this post, professor Gelman!

15. Anonymous says:

I am so embarrassed that I fell for the joke that I opt to remain anonymous. A counter argument would be cool. While you are at it, if you could explain how 90000 parameters could be estimated from data with 1200 observations using Bayesian MCMC methods, I would be much obliged.

16. Seth Roberts says:

Shouldn't "the old days" be "the good old days"?

17. Best nerdy laugh I've had in a while. Thanks Andy…

18. Anonymous says:

Apparently this is the one day a year that it's worth reading a Bayesian blog. A better signal to noise ratio than we normally expect from Bayesians. :)

19. JR says:

Re the "good old days", I prefer to use modern methods of statistical inference (approximately 50 to 70 years old) rather than superseded methods which are more than 240 years old.

20. Concerned says:

Andrew –

This post is now a top search result on Google and Bing for "Bayesian Statistics". Given that many people won't understand that this is an April fool's joke, I would modify the title and top of the article with an update to be explicit and link to your other posts and articles about why all these concerns are bogus or don't worry you.

For those looking for Gelman's real opinions, see his response here:

21. Andrew Gelman says:

Concerned: Unfortunately, the #1 Google hit for "Bayesian statistics" is the Wikipedia article on Bayesian inference, which I really really don't like, as it's entirely focused on discrete models. As I've discussed earlier on the blog, I much prefer Spiegelhalter and Rice's Scholarpedia article, which is #9 on the Google list. Maybe some time, instead of writing a couple blog entries, I'll edit the Wikipedia page. This would probably have more impact than any given day of blogging. On the other hand, I'm a little worried that some joker would come in and delete most of my changes; this seems to have happened the last time I added to a Wikipedia entry.

22. Keith O'Rourke says:

Would the Bayesian omelette smell as good – if made only from discretely cracked eggs?

Andrew complained that the Wiki Bayesian entry was “entirely focussed on discrete models.” Is there some _conceptual_ advantage of a continuous model over discrete model – i.e. something important about Bayes theorem or conceptually applying Bayesian analysis in an area of interest that can’t be shown (cartooned) in a discrete model?

Of course, anyone hoping to implement a realistic Bayesian analysis will need the convenience and conceptual flexibility of continuous models as well as MCMC, etc. but for non-statisticians to meaningfully and critically grasp the Bayesian approach?

Perhaps better put, what can’t be demonstrated without continuity? It must be something about the prior part of the joint probability model (possible parameters values) as the likelihood part is defined by observables that are always really discrete.

I re-read Spiegelhalter and Rice's Scholarpedia entry spurred by Andrew’s comment and I do think it deserves a _third cheer_. Compared to the Wiki entry (or maybe even most other intro Bayes material), I believe it will be much more helpful, especially for those trying to do science rather than those trying to pass an introductory course in statistics.

But as much as we would like to forget it, many researchers blank out completely when they see those conjugate formulas we so easily process and do not excuse themselves from doing research until they can grasp the math, but instead just plod ahead with their incomplete understanding. “Leaving them behind” is I believe a more wrong strategy than trying to help them and the Scholarpedia entry would benefit by providing something in addition to the single conjugate example – even if just by a link.

The wiki entry has a very simple discrete example where people should be able to see how to calculate the desired posterior directly [ P(Girl|Pants) ] and how it could alternatively and equivalently be calculated by [ P(Girl) * P(Pants|Girl) / P(Pants) ], which is _exactly_ what Bayes _theorem_ is for this simple discrete joint probability model. The explanation given in the Wiki entry though seems to _stumble_ over this obvious point.
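A minimal sketch of that equivalence follows. The numbers (40% girls, half of the girls in pants, all of the boys in pants) are illustrative assumptions rather than a quotation from the Wiki entry. The posterior is computed once by conditioning the joint table directly and once through Bayes' theorem.

```python
from fractions import Fraction

# Hypothetical joint model: P(sex) and P(pants | sex). Illustrative numbers only.
p_sex = {"girl": Fraction(2, 5), "boy": Fraction(3, 5)}
p_pants_given_sex = {"girl": Fraction(1, 2), "boy": Fraction(1)}

# Full joint table over (sex, clothing).
joint = {}
for sex, p in p_sex.items():
    joint[(sex, "pants")] = p * p_pants_given_sex[sex]
    joint[(sex, "skirt")] = p * (1 - p_pants_given_sex[sex])

# Direct route: condition the joint table on "pants".
p_pants = sum(v for (_, clothing), v in joint.items() if clothing == "pants")
direct = joint[("girl", "pants")] / p_pants

# Bayes' theorem route: P(Girl) * P(Pants | Girl) / P(Pants).
via_bayes = p_sex["girl"] * p_pants_given_sex["girl"] / p_pants

print("P(Girl | Pants), direct:   ", direct)
print("P(Girl | Pants), via Bayes:", via_bayes)
```

Both routes read the same cell of the same joint table, so they have to agree. In the discrete case the theorem is just that bookkeeping.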

Or the Scholarpedia’s entry could easily be directly simulated by first drawing from the joint distribution and then conditioning on having 20 infections.

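    n <- 10000000                                # number of draws from the joint distribution
    prior <- rgamma(n, shape = 10, rate = 1)     # parameter draws from the gamma prior
    data.possible <- rpois(n, prior * 4)         # one simulated count per parameter draw
    joint.sim <- cbind(param = prior, data = data.possible)
    # posterior: keep the parameter draws whose simulated count equals the observed 20
    posterior.sim <- joint.sim[joint.sim[, "data"] == 20, "param"]
    # likelihood: for each rounded parameter value, count the draws that produced 20
    likelihood.sim <- split(joint.sim[, "data"], round(joint.sim[, "param"]))
    likelihood.sim <- sapply(likelihood.sim, function(x) sum(x == 20))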
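
As a check on the simulation, and taking the model exactly as the code above writes it (a gamma prior with shape 10 and rate 1 on the parameter, and a Poisson count with mean 4 times the parameter), conjugacy gives the exact posterior that the draws in posterior.sim should approximate:

$$
p(\theta \mid y = 20) \;\propto\; \underbrace{\theta^{10-1} e^{-\theta}}_{\text{Gamma}(10,\,1)\ \text{prior}} \times \underbrace{(4\theta)^{20} e^{-4\theta}}_{\text{Poisson}(4\theta)\ \text{likelihood}} \;\propto\; \theta^{30-1} e^{-5\theta}
$$

That is, $\theta \mid y = 20 \sim \text{Gamma}(30,\, 5)$, with posterior mean $30/5 = 6$ and standard deviation $\sqrt{30}/5 \approx 1.10$. The sample mean and standard deviation of the retained draws should land close to these values.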

The Scholarpedia entry’s statement “For [scientific rather than personal?] inference, a full report of the posterior distribution is the correct and final conclusion of a statistical analysis” undermines many of the other informed comments in the entry about the tentativeness of models and the need for model checking and revision [assuming they meant scientific inference]. In particular, as pointed out early on this blog by Andrew and others, there are real difficulties for other researchers if the likelihood part cannot be recovered so that they can apply more appropriate (or at least their own) priors. Replication is perhaps the essence of statistics and these things block its fuller assessment.

“While an innocuous theory” might be better put as “innocuous theorem” [might have been a typo?]

The “high-dimensional problem” (especially going from regions to intervals) and “summarization of posteriors” (other than univariate ones) were not well discussed. Even in a simple two group experiment, going from a credible region for Pc and Pt to a credible region for just Pt/Pc, though less confusing than for confidence regions and intervals, is a definite conceptual step – especially for reporting and understanding the contributions of prior and likelihood (e.g. 0/40 versus 0/80).
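
A rough sketch of the 0/40 versus 0/80 point for a single proportion follows. It does not cover the Pt/Pc ratio. The Beta(1, b) prior family and the Beta(1, 20) alternative are assumptions chosen only because the posterior then has a closed-form upper bound; the function name is hypothetical.

```python
def upper_bound_zero_events(n: int, prior_b: float = 1.0, level: float = 0.95) -> float:
    """Upper credible bound for a proportion after 0 events in n trials.

    Assumes a Beta(1, prior_b) prior, so the posterior is Beta(1, prior_b + n)
    with CDF 1 - (1 - p) ** (prior_b + n), which inverts in closed form.
    """
    return 1.0 - (1.0 - level) ** (1.0 / (prior_b + n))


for n in (40, 80):
    flat = upper_bound_zero_events(n)             # uniform Beta(1, 1) prior
    informed = upper_bound_zero_events(n, 20.0)   # hypothetical Beta(1, 20) prior
    print(f"0/{n}: 95% upper bound {flat:.3f} (uniform), {informed:.3f} (Beta(1, 20))")
```

Doubling the trials roughly halves the bound under the uniform prior. Under the Beta(1, 20) prior the bounds are lower and closer together, because the prior acts like 20 extra event-free trials. That shows how much of the reported interval comes from the prior rather than the likelihood.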

The “sets of prior distributions” … “sharing unknown parameters” and “hierarchical models … constructed … on the assumption of shared prior distribution” seem to contradict the distinction made between “epistemological” and “aleatory” probability at the beginning of the entry. At least that is so in meta-analysis, where one is modeling an effect that does vary from study population to study population and that cannot be predicted – as a random effect. These sometimes inappropriate distinctions between prior and likelihood roles can be a problem – for instance in Indirect Treatment Comparison meta-analysis where, at a meeting, experienced researchers could not discern that it was a re-parameterization in the likelihood that enabled the indirect comparisons to be made – somehow thinking it must have something to do with the prior. The prior enabled MCMC, which makes the analysis easier, but that is a logically different role.

For “the philosophical rationale” the only thing that is settled is the math – everything will always be wrong but hopefully getting less and less wrong over time.

But, I would concur with Andrew, but not _discretely_ – the Scholarpedia entry is definitely a less wrong entry than the Wiki entry.

Keith