Psychology researcher Gary Marcus points me to this comment he posted regarding popular representations of Bayesian and non-Bayesian statistics. Gary guessed that I’d disagree with him, but I actually thought that what he wrote was pretty reasonable. (Or maybe that’s just my disagreeable nature: in this case I’m being contrarian by agreeing when I’m not supposed to!)

Here’s what Marcus wrote:

[In his recent book, Nate] Silver’s one misstep comes in his advocacy of an approach known as Bayesian inference. . . . Silver’s discussion of alternatives to the Bayesian approach is dismissive, incomplete, and misleading. . . .

A Bayesian approach is particularly useful when predicting outcome probabilities in cases where one has strong prior knowledge of a situation. . . . But the Bayesian approach is much less helpful when there is no consensus about what the prior probabilities should be. For example, in a notorious series of experiments, Stanley Milgram showed that many people would torture a victim if they were told that it was for the good of science. Before these experiments were carried out, should these results have been assigned a low prior (because no one would suppose that they themselves would do this) or a high prior (because we know that people accept authority)? In actual practice, the method of evaluation most scientists use most of the time is a variant of a technique proposed by the statistician Ronald Fisher in the early 1900s. Roughly speaking, in this approach, a hypothesis is considered validated by data only if the data pass a test that would be failed ninety-five or ninety-nine per cent of the time if the data were generated randomly. The advantage of Fisher’s approach (which is by no means perfect) is that to some degree it sidesteps the problem of estimating priors where no sufficient advance information exists. In the vast majority of scientific papers, Fisher’s statistics (and more sophisticated statistics in that tradition) are used. . . .

In any study, there is some small chance of a false positive; if you do a lot of experiments, you will eventually get a lot of false positive results (even putting aside self-deception, biases toward reporting positive results, and outright fraud)—as Silver himself actually explains two pages earlier. Switching to a Bayesian method of evaluating statistics will not fix the underlying problems; cleaning up science requires changes to the way in which scientific research is done and evaluated, not just a new formula.

It is perfectly reasonable for Silver to prefer the Bayesian approach—the field has remained split for nearly a century, with each side having its own arguments, innovations, and work-arounds—but the case for preferring Bayes to Fisher is far weaker than Silver lets on, and there is no reason whatsoever to think that a Bayesian approach is a “think differently” revolution.

This was similar to the comment of Deborah Mayo, who felt that Nate was too casually identifying Bayes with all the good things in the statistical world while not being aware of modern developments in non-Bayesian statistics. (Larry Wasserman went even further and characterized Nate as a “frequentist” based on Nate’s respect for calibration, but I think that in that case Larry was missing the point, because calibration is actually central to Bayesian inference and decision making; see chapter 1 of Bayesian Data Analysis or various textbooks on decision analysis).

I pretty much agreed with Marcus’s and Mayo’s general points. Bayesian and non-Bayesian approaches both can get the job done (see footnote 1 of this article for my definitive statement on that topic).

I do, however, dispute a few of Marcus’s points:

1. The paragraph about the Milgram experiments. From what I’ve read about the experiment (although I have to admit not being an expert in that field), Milgram’s data are so strong that the prior distribution would be pretty much irrelevant. The main concern would be potential biases in the experiment, generalizability to other settings, etc.—and for those problems, you pretty much have to use an assumption-based (whether or not formally Bayesian) approach (as discussed, for example, in the writings of Sander Greenland). All the hypothesis testing and randomization in the world won’t address the validity problem.

2. Marcus’s statement, “In any study, there is some small chance of a false positive.” I get what he means, and I think the general impression he’s giving is fine, but I disagree with the statement as written. Some experiments are definitely studying real effects, in which case a “false positive” is impossible.

3. Marcus writes, “there is no reason whatsoever to think that a Bayesian approach is a ‘think differently’ revolution.” I think “no reason whatsoever” is a bit strong! For some statisticians, it can truly be revolutionary to allow the use of external information rather than get trapped in the world of p-values. I agree that, if you’re already using sophisticated non-Bayesian methods such as those of Tibshirani, Efron, and others, Bayes is more of an option than a revolution. But if you’re coming out of a pure hypothesis testing training, then Bayes can be a true revelation. I think that is one reason that many methodologists in your own field (psychology) are such avid Bayesians: they find the openness and the directness of the Bayesian approach to be so liberating.
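
To make item 1 concrete, here is a minimal sketch (not anything taken from Marcus’s piece or from Milgram’s papers) of how a “low” and a “high” prior fare once the data are in. It assumes a simple beta-binomial model for the compliance rate. The counts (26 of 40) are the figure commonly cited for Milgram’s baseline study and are used here only as an illustrative assumption, and the two priors are hypothetical stand-ins, not anyone’s elicited beliefs.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """P(theta <= x) for Beta(a, b) with positive integer a and b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def posterior(prior_a, prior_b, successes, failures):
    """Conjugate beta-binomial update: returns the posterior (a, b)."""
    return prior_a + successes, prior_b + failures

# Assumed data (illustrative): 26 of 40 participants complied.
complied, refused = 26, 14

# Hypothetical priors: "low" expects little compliance, "high" expects a lot.
priors = {"low": (1, 4), "high": (4, 1)}

summary = {}
for name, (a0, b0) in priors.items():
    a, b = posterior(a0, b0, complied, refused)
    post_mean = a / (a + b)
    prob_above_30pct = 1 - beta_cdf_int(0.3, a, b)
    summary[name] = (post_mean, prob_above_30pct)
    print(f"{name:>4} prior: posterior mean {post_mean:.3f}, "
          f"P(rate > 0.3) = {prob_above_30pct:.5f}")
```

Under these assumptions the two posteriors differ somewhat in their means but agree on the qualitative finding that the compliance rate is far from negligible. The harder questions about bias and generalizability are untouched by this calculation.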

Despite these points of disagreement (and my items 2 and 3 are matters of emphasis more than anything else), I agree strongly with Marcus’s general message that Bayes is not magic. The key step is to abandon rigid textbook thinking on hypothesis testing and confidence intervals; one can move forward from there using Bayesian methods or various non-Bayesian ideas of regularization, meta-analysis, etc. I have not read Nate’s book but if Nate’s message is that modern statistics is about models rather than p-values, I support that message even if it’s not phrased in the most technically correct manner. And I also support Marcus’s message that it’s not so much about the word “Bayes” as about escaping out-of-date rigid statistical ideas.

You nailed it on what I thought Nate Silver’s point was (cautious modeling). He went a little overboard on the virtues of Bayesian updating; not as much as John Kruschke, but a little. OTOH, I thought Silver’s greater error was in his portrayal of climate science (see Michael Mann’s comments at http://www.huffingtonpost.com/michael-e-mann/nate-silver-climate-change_b_1909482.html). If the worst thing that comes from The Signal and the Noise is people who can correctly reason through the Monty Hall problem and think they understand Bayesian stats, the world is still in better shape.

One section of Mann’s commentary particularly jumped out at me:

“But [Silver] falls victim to a fallacy that has become all too common among those who view the issue through the prism of economics rather than science. Nate conflates problems of prediction in the realm of human behavior — where there are no fundamental governing ‘laws’ and any “predictions” are potentially laden with subjective and untestable assumptions — with problems such as climate change, which are governed by laws of physics, like the greenhouse effect, that are true whether or not you choose to believe them.”

I’m looking at problems in physical science. In that realm, if there’s data available then you can establish a reasonably objective prior. When you’re dealing with human behavior then I imagine that establishing a prior is more problematic. With those things in mind, it’s easy for me to imagine that one’s own experience with priors, i.e., the degree to which you believe them to be objective, could lead one to be biased in evaluating their utility in other fields.

The problem with Mann’s statement regarding the dichotomy between social science and physical science is that it’s wrong, via the route of being way too vague and over-generalized. You can find plenty of “laws” in human behavior if you look at the right scale and use the right data. Conversely, you can find plenty of cases in complex physical sciences, like climate science for example, where the supposed operation of various laws is murky, because, well, it’s a complex system. You can’t just say that, because we have a radiative forcing equation for carbon dioxide, this fact somehow explains everything about observed climatic changes in the world and thereby sets it apart from the social sciences, but that is exactly what he is doing with his statement. There are plenty of “subjective and untestable assumptions” embedded in various aspects of climate change, especially in Mann’s sub-discipline, paleoclimatology. The real reason he doesn’t like Silver’s statements on climate change is that Silver points out that the uncertainties are larger than Mann would like to have anyone point out.

> I also support Marcus’s message that it’s not so much about the word “Bayes” as about escaping out-of-date rigid statistical ideas.

Rigidity does have its virtues. Less room to guide your conclusions to what you want them to be?

Or are my fears misplaced? In the hands of an unscrupulous practitioner can the Bayesian tools do more damage?

Interesting point. But I think you have to distinguish between relying on rules (or rigidity) that have some sort of axiomatic basis (e.g. basic tenets of probability theory), and relying on rules that exist purely to constrain discretion or bias (e.g. p = 0.05 = significance). I’m not saying that the latter is wrong, but it’s important to make that distinction when you’re thinking about what constitutes sound reasoning.

Even relaxing the “p = 0.05 = significance” rule, I think Bayesian methods allow more leeway for hidden-from-sight tweaking.

I believe that Bayesian methods allow more leeway, but also make your choices explicit and thus less-hidden-from-sight.

To suggest that the conglomeration of tools surrounding non-Bayesian, frequentist, statistical hypotheses testing, confidence intervals, significance tests, experimental design, misspecification testing and statistical data analysis preclude, rather than require, the use of external information is misleading and unfair. It’s just that this external information, be it about background theories, about effects shown/not shown, about existing precision, about causes, and about flaws and foibles in previous attempts to learn about a phenomenon, is very rarely in the form of prior probabilities. Of course, if there are legitimate priors relevant to the problem, we would/do use them too. There is no reason to suppose that knowledge doesn’t enter simply because it is not formalizable in terms of prior probabilities of statistical parameters. Nor do I think Gelman would disagree.

http://errorstatistics.com/2011/10/30/background-knowledge-not-to-quantify-but-to-avoid-being-misled-by-subjective-beliefs/

Whether or not folks think I “go overboard” in advocating Bayesian methods, the main goal in my blog about the Marcus and Davis critique (http://doingbayesiandataanalysis.blogspot.com/2013/01/bayesian-disease-diagnosis-with.html) is to clarify what Bayesian methods do, and to challenge misconceptions about Bayesian methods. The post is not about anything Silver does or doesn’t say; it’s just about getting clarity on the potential of Bayesian methods.

John, I thought your response to M&D was right (though the difference between point priors and a distribution is a harder concept to grasp than you might think; that’s a teaching issue, not an issue about Bayesian stats). I was thinking primarily of your focus on p-value hypothesis testing as an argument in favor of a Bayesian approach (e.g., Chapter 11 of DBDA). I grasp and grant your point (actually several of them), but the frequentists on my floor are pretty laid-back and sensible and don’t resemble the rigid approach you’re fighting.

When I was being _evaluated_ about visiting at Duke, I was asked for my definition of frequentist statistics.

I replied “it’s trying to get by (as well as possible) without explicitly using a prior”.

I think that was what Fisher was up to (as well as getting invariance). But he did not have today’s computational resources, nor the clarity of not confusing the representation with what is being represented (aka all models are false, some are useful).

As Don Rubin told me once, even if you can’t use a Bayesian approach in your work (1990s), it’s always worthwhile to think through the Bayesian approach (and that’s getting more flexible and easier to do every day).

So – start fully Bayesian with a convenient informative prior and _then_ lessen the informativeness and calibrate repeated sampling properties, retreating to making that calibration fully uniform if you are forced to be NormalDeviate (Larry Wasserman has a nice post on uniform calibration).

The scaffolding helps prevent damage – but unscrupulous practitioners can always do unlimited damage using any method.

I would disagree. The non-Bayesian frequentist methods are interested in appraising and controlling error probabilities associated with methods and inferences to hypotheses that are not events. Ironically, the flaws in the most egregious cases of simple p-values that people get upset about are immediately precluded by an account that can pick up on selection effects, stopping rules, incomplete reports, biases and wishful thinking, and the like. They are picked up on by altered error probabilities and failed underlying assumptions. Yet Bayesians give us methods where stopping rules don’t matter, and differences between computed and actual error probabilities are not discerned.

If we had to start with a “convenient informative prior” reflecting someone’s beliefs in an exhaustive set of hypotheses (assignments which would vary depending on how much to give to “H is false”), we would rarely make progress in science. Understandably, after years of trying to use methods of subjective elicitation, these attempts are considered unreliable and a diversion from the work of model building, even by leading Bayesians. Well I’ve said all this before (errorstatistics.com).

http://www.rmm-journal.de/downloads/Article_Mayo.pdf

“differences between computed and actual error probabilities are not discerned”

Take data=mu+error and assume the errors are IID N(0,1). Suppose the actual errors are 2,2,2,2,2,-2,-2,-2,-2,-2 and further suppose mu is in the high probability manifold of the prior. Then any reasonable Bayesian interval computed using the NIID assumption, which contains the high probability manifold of the posterior, will contain the true value of mu.

This is true even though the “actual error probabilities” are violently non-random, non-normal, and non-independent. And it’s also true no matter what the future errors thrown off by the supposed “data generation mechanism” are (i.e. the long range frequency of errors is completely and totally irrelevant).

Now I know that for Dr. Mayo the real goal here is to produce intervals which are guaranteed to be wrong some objective percentage of the time in future measurements that won’t be made, based on assumptions that are almost certainly wrong and generally uncheckable (if they make any physical sense at all). I think that’s pretty much insane.

So for anyone out there who is simply interested in finding an interval for mu, based on the actual data at hand, which contains the true value, then it’s worth thinking about what conditions are required on the actual errors to make this happen. (Hint: the answer is not that computed and actual error probabilities coincide even approximately).

As an added bonus, you’ll get deeper insight into why the NIID often isn’t such a bad assumption in practice even though real errors usually aren’t “normally distributed”. This should help in picking likelihoods in other instances.

@Entsophy: Your Bayesian interval with this data *might* contain the true value of mu, depending on mu and on your prior, though you haven’t said what either is, so we can’t verify your claims. For many other datasets/priors the interval does not cover the true mu, and your argument becomes considerably weaker.

You might also consider the accuracy of standard errors and/or posterior standard deviations, computed under the iid assumption, which have little to zero robustness against violation of independence. These are important parts of inference, under any paradigm, and your “not such a bad” iid assumption can in fact be terrible in practice, regardless of Normality.
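
George’s point about independence can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not something from the thread: it draws errors from a hypothetical AR(1) process with lag-one correlation rho, builds the usual nominal 95% interval (sample mean plus or minus 1.96 standard errors, computed as if the data were iid), and counts how often that interval covers the true mu. The sample size, rho, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

def ar1_errors(n, rho, rng):
    """Errors with marginal variance 1 and lag-one correlation rho."""
    errors = [rng.gauss(0, 1)]
    innovation_sd = sqrt(1 - rho**2)
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0, innovation_sd))
    return errors

def coverage(rho, n=50, reps=2000, mu=0.0, seed=1):
    """Fraction of nominal 95% iid-based intervals that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        data = [mu + e for e in ar1_errors(n, rho, rng)]
        half_width = 1.96 * stdev(data) / sqrt(n)  # iid standard error
        hits += abs(mean(data) - mu) <= half_width
    return hits / reps

cov_iid = coverage(rho=0.0)   # independence holds
cov_ar = coverage(rho=0.8)    # strong positive dependence
print(f"coverage with independent errors: {cov_iid:.3f}")
print(f"coverage with AR(1) errors, rho=0.8: {cov_ar:.3f}")
```

In this setup the iid-based standard error understates the variability of the sample mean when errors are positively correlated, so the interval is too narrow and covers much less often than advertised, even though each error is marginally normal.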

George: Mine was a hard question to state and have its intent understood because Frequentists (and most Bayesians) want to think in terms of phantom distributions instead of the real concrete errors. So I’ll put it into different words.

Think of the process of assuming NIID and calculating intervals as just an algorithm. Remove any kind of probabilistic interpretation to it. It’s just an algorithm that spits out an interval. Now think of the errors only as the real actual numbers in the data actually collected (i.e. 2,2,2,2,2,-2,-2,-2,-2,-2). Forget about whatever distribution you think these numbers came from. In fact just assume no such stable distribution exists or that no future measurements are possible even in principle.

Now what condition do those error numbers have to satisfy so that the real mu lies in the interval generated by the algorithm?

The necessary condition is not “the histogram of errors has to look like a normal curve” or even more absurdly “the histogram of future errors has to look like a normal curve”.

Dr. Mayo thinks this is a necessary condition and if it doesn’t hold at least approximately something very bad happens. Her entire statistical philosophy is founded on it.

But this is trivially wrong. You can see this by looking at the example I gave. Almost any interval that a Bayesian or Frequentist would write down for mu, at almost any level of alpha, will contain the true value of mu (the average of the data will exactly equal the true value of mu for those errors). This happens despite the fact the errors are not random, or normal, or independent in any way whatsoever.

The condition Frequentists suppose is necessary to drive this inference isn’t even close to being necessary. Failure to see this simple point is why Frequentists like Feller are constantly surprised at how well the NIID assumption works in practice even though it can’t be “right”.

@Entsophy: you write that “Almost any interval … will contain the true value of mu (the average of the data will exactly equal the true value of mu for those errors)”. So I interpret what you wrote to mean that mu (whatever it represents) *must* exactly equal zero, the average of 2,2,2,2,2,-2,-2,-2,-2,-2. Huh? No statistician, of any flavor, would ever conclude this. Please state – maybe on your own blog – what you mean by mu, because it’s surely not what Bayesians and/or Frequentists mean.

George: the errors (not ‘data’) are for that example 2,2,2,2,2,-2,-2,-2,-2,-2. The data would be of the form: data=mu+error.

With those errors the average of the data will exactly equal mu, so any interval that includes the point estimate for mu will contain the true value of mu. The intervals that pretty much anyone would create from that data with these errors would be of the form:

sample mean +/- delta = true mu +/- delta
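
The arithmetic behind this example can be spelled out in a few lines, using the errors as first stated (five 2s and five -2s). In this sketch the true value mu = 10 is a hypothetical choice (the thread never specifies it), and the two intervals are the standard known-sigma z interval and the t interval with 9 degrees of freedom (critical value 2.262).

```python
from math import sqrt
from statistics import mean, stdev

mu = 10.0  # hypothetical true value; not given in the thread
errors = [2, 2, 2, 2, 2, -2, -2, -2, -2, -2]
data = [mu + e for e in errors]
n = len(data)

xbar = mean(data)  # equals mu because these errors sum to zero

# Interval 1: errors assumed N(0, 1), so sigma = 1 is treated as known.
z_half = 1.96 * 1.0 / sqrt(n)
z_interval = (xbar - z_half, xbar + z_half)

# Interval 2: sigma estimated from the data, t critical value for 9 df.
t_half = 2.262 * stdev(data) / sqrt(n)
t_interval = (xbar - t_half, xbar + t_half)

for name, (lo, hi) in [("z", z_interval), ("t", t_interval)]:
    print(f"{name} interval: ({lo:.3f}, {hi:.3f}), covers mu: {lo <= mu <= hi}")
```

Both intervals are centered exactly on mu because these particular errors average to zero; nothing in the calculation depends on the errors looking normal or independent. As George notes, other error vectors need not be so cooperative.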

Entsophy: I never said the goal is intervals guaranteed to be wrong with some probability. You must be talking about someone else. And it is the frequentist who can give guarantees against model violations.

The goal in creating a 100(1-alpha)% CI is to get intervals which incorrectly identify the location of the parameter 100*alpha% of the time.

> Bayesians give us methods where stopping rules don’t matter

I do agree that some do sloppily take “don’t matter” as literally true.

It is true (mathematically) that some things will not affect the transformation of the prior probabilities into posterior probabilities, but in actual applications there is a lot more that should be considered (e.g., checking prior(s) and data model(s), or splitting/assessing the contributions of prior and data).

I think Don Berry has written on this and I think specifically regarding stopping rules.
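
A standard textbook illustration (not taken from this thread) shows what is and is not affected by the stopping rule. Assume hypothetical coin-toss data of 9 heads and 3 tails, and a one-sided test of theta = 0.5 against theta > 0.5. The sketch compares two designs: toss exactly 12 times, or toss until the third tail appears. The function names are made up for this example.

```python
from math import comb

heads, tails = 9, 3  # hypothetical data

def p_value_fixed_n(heads, tails):
    """P(at least `heads` heads in heads + tails tosses) when theta = 0.5."""
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

def p_value_stop_at_tails(heads, tails):
    """P(at least `heads` heads before the `tails`-th tail) when theta = 0.5.

    Equivalent to seeing at most tails - 1 tails in the first
    heads + tails - 1 tosses.
    """
    m = heads + tails - 1
    return sum(comb(m, k) for k in range(tails)) / 2**m

def likelihood_fixed_n(theta):
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

def likelihood_stop_at_tails(theta):
    return comb(heads + tails - 1, heads) * theta**heads * (1 - theta)**tails

p_fixed = p_value_fixed_n(heads, tails)
p_stop = p_value_stop_at_tails(heads, tails)
ratios = [likelihood_fixed_n(t) / likelihood_stop_at_tails(t)
          for t in (0.2, 0.5, 0.8)]

print(f"p-value, fixed n = 12:       {p_fixed:.4f}")
print(f"p-value, stop at third tail: {p_stop:.4f}")
print("likelihood ratio between designs at theta = 0.2, 0.5, 0.8:", ratios)
```

The two p-values fall on opposite sides of 0.05 even though the data are identical, while the likelihoods differ only by a constant factor, so any prior gives the same posterior under either design. That is the mathematical sense in which the stopping rule “doesn’t matter”; it says nothing about whether the model and prior were sensible in the first place.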

> make progress in science.

That requires continually getting less wrong and there must be many ways to do that.

Here’s another attempt at communicating what Entsophy is trying to say:

Suppose we have a data set which we want to model as having been generated by a generating distribution F. Naturally, the form of F is unknown. Suppose we are interested in the mean (mu) of F. Then it makes sense to write each draw from F as mu + E, where E is a zero-mean quantity that we can choose to call an “error”. In a Bayesian context, there are now two probability distributions of interest:

1) the (unknown, and in principle unknowable) generating distribution of E (what Mayo refers to as the “actual” error probabilities)

2) the (computable) predictive distribution of future errors, given a particular choice of model (what Mayo refers to as the “computed” error probabilities).

E.g. (for 2) one might choose the maximum entropy model under the assumption that the data are informative only about the first two moments of the generating distribution. For real-valued data, this yields the normal distribution – this is true even when the generating distribution is not normal.

So in the Bayesian context, the conceptual distinction between “computed and actual error probabilities” is very explicit and (depending on available information) the differences can be large. By contrast, the frequentist context offers no clear-cut distinction (there is no notion of a predictive distribution that is computed conditioned on a model assumption; instead, frequentists just attempt to approximate the generative distribution and use that as a predictive distribution).

I don’t think that’s what Entsophy is trying to say. I think it’s more like this: suppose without loss of generality that the true parameter is zero. Ignore any probabilistic assumptions; just treat a given interval procedure as a map from a point in the complete sample space to an interval in parameter space. For what regions of sample space does the map generate an interval that covers zero?
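
Read this way, the question has a simple answer for at least one procedure. The sketch below assumes the nominal 95% interval with sigma = 1 treated as known, and takes the true parameter to be zero as suggested above; the sample vectors are hypothetical and chosen only to probe the boundary of the covering region.

```python
from math import sqrt

def z_interval(data, sigma=1.0, z=1.96):
    """Map a sample to the nominal 95% known-sigma interval for the mean."""
    n = len(data)
    center = sum(data) / n
    half_width = z * sigma / sqrt(n)
    return center - half_width, center + half_width

def covers_zero(data):
    """True if the interval produced from this sample contains zero."""
    lo, hi = z_interval(data)
    return lo <= 0.0 <= hi

# With the true parameter at zero, each sample is just its error vector.
samples = {
    "balanced +/-2": [2] * 5 + [-2] * 5,
    "all 0.5": [0.5] * 10,
    "all 0.7": [0.7] * 10,
    "all 2": [2] * 10,
}
results = {name: covers_zero(x) for name, x in samples.items()}
for name, covered in results.items():
    print(f"{name:>14}: covers zero = {covered}")
```

For this particular map, the interval covers zero exactly when the sample mean lies within 1.96 * sigma / sqrt(n) of zero. That is a condition on the realized sample, not on the shape of its histogram, though it does not by itself say how often such samples arise.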

Konrad: I’d never noticed this. No, you’ve got the computed vs actual error probabilities wrong. The computed p-value, for example, might be .05, despite multiple testing, while the actual might be much higher. The actual error rate is inflated, and the problem is not about long runs, but producing misleading evidence about this one hypothesis with this data! Maybe look at chapter 9 of my Error and the Growth of Experimental Knowledge (which can be found off my web page). Today’s post happens to be relevant actually.
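
The gap between a computed and an actual error rate under multiple testing is easy to quantify in a simplified setting. The sketch below assumes k independent tests, every null hypothesis true (so each p-value is uniform on [0, 1]), and a researcher who reports a finding whenever the smallest p-value is below 0.05. The choice of k = 20 and the function names are illustrative assumptions.

```python
import random

def familywise_error(k, alpha=0.05):
    """Chance that at least one of k independent null tests has p < alpha."""
    return 1 - (1 - alpha)**k

def simulate(k, alpha=0.05, reps=20000, seed=0):
    """Monte Carlo check: p-values are uniform when the null is true."""
    rng = random.Random(seed)
    hits = sum(min(rng.random() for _ in range(k)) < alpha
               for _ in range(reps))
    return hits / reps

k = 20
exact = familywise_error(k)
approx = simulate(k)
print(f"nominal level per test: 0.05")
print(f"actual chance of a 'significant' result among {k} tests: {exact:.3f}")
print(f"simulated: {approx:.3f}")
```

Under these assumptions the reported threshold stays at 0.05 while the chance of producing at least one nominally significant result is well over one half, which is the kind of inflation an error-probability account is meant to flag.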

Everybody, including Nate Silver, agrees that Bayes is not magic, right?

Obviously there’s a whole gigantic apparatus of statistics which goes beyond “when the needle goes past p=.05 you win,” but it can’t be denied that the pure hypothesis testing approach is widely used, often thoughtlessly. The problems with that are well-known to people who read this blog, but not to the general public, and I think Silver is helping on that score.

If the prior information is sparse, all you have to do is use uninformative priors. Priors cannot contain information one does not have. In this case the data dominates, but all the other advantages of using a state of knowledge rather than a state of nature (followed by ad hoc assumptions) are realized. E.T. Jaynes’ work contains the details.

Low quality data combined with limited prior information = a high level of uncertainty… this happens. But high quality data with limited prior information does not = a high level of uncertainty. If the priors become as important as the data, then at least one must rigorously define (describe) the priors. The priors are in plain view. The Fisher approach relies on ad hoc methods to transform a state of nature into a state of knowledge.

I don’t know where you guys get this stuff. I’d rather learn something about “the state of nature” as modelled than convert ignorance into knowledge as do the Bayesians who equate knowing nothing with equiprobability!

Efron’s recent article in the Bulletin of the American Mathematical Society, A 250-Year Argument: Belief, Behavior, and the Bootstrap, http://www.ams.org/journals/bull/2013-50-01/S0273-0979-2012-01374-5/S0273-0979-2012-01374-5.pdf presents an interesting perspective on Bayesian, frequentist, and things-in-between.

Thanks Martha, that is a wonderful article, and good for teaching too. I have always admired Efron’s writing — clear and entertaining. It is obvious that he enjoyed writing this paper.

Cheers,

E.J.

Martha: thank you so much for linking to this excellent paper.

Efron’s article is charming but I think it has many mistakes, indicating areas where he is not aware of modern Bayesian ideas. For him to say that Bayes is all about subjectivity and noninformative priors is analogous to a Bayesian saying that frequentist methods are all about p<0.05. In both cases, I think we're seeing a comfortable misunderstanding, comfortable in the sense that it can be pleasant to think that people following other schools of thought are simplistic in some ways.
