## The holes in my philosophy of Bayesian data analysis

I’ve been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper’s ideas about falsification and Kuhn’s ideas about scientific revolutions.

Here’s my long, somewhat technical paper with Cosma Shalizi.
Here’s our shorter overview for the volume on the philosophy of social science.
Here’s my latest try (for an online symposium), focusing on the key issues.

I’m pretty happy with my approach–the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking–but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links.

Consider a simple model with independent data y_1, y_2, ..., y_10 ~ N(θ,σ^2), with a prior distribution θ ~ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about θ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewness.
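To make the "straightforward" part concrete, here is a minimal sketch of the conjugate update and a posterior predictive check on the sample variance and skewness. The simulated data (true mean 3), the seed, and the number of replications are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 draws from N(3, 10^2); the true mean 3 is an assumption.
sigma = 10.0
y = rng.normal(3.0, sigma, size=10)
n = len(y)

# Conjugate update for theta ~ N(0, 10^2) with sigma known.
prior_mean, prior_sd = 0.0, 10.0
post_prec = 1.0 / prior_sd**2 + n / sigma**2
post_sd = post_prec ** -0.5
post_mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / post_prec

def skewness(a, axis=-1):
    """Sample skewness: third central moment over the cubed standard deviation."""
    d = a - a.mean(axis=axis, keepdims=True)
    return (d**3).mean(axis=axis) / (d**2).mean(axis=axis) ** 1.5

# Posterior predictive replications: draw theta, then a replicated dataset of size n.
n_rep = 4000
theta_draws = rng.normal(post_mean, post_sd, size=n_rep)
y_rep = rng.normal(theta_draws[:, None], sigma, size=(n_rep, n))

# Posterior predictive p-values for the two test statistics.
p_var = np.mean(y_rep.var(axis=1, ddof=1) >= y.var(ddof=1))
p_skew = np.mean(skewness(y_rep) >= skewness(y))

print(f"posterior for theta: N({post_mean:.2f}, {post_sd:.2f}^2)")
print(f"p-value (variance) = {p_var:.2f}, p-value (skewness) = {p_skew:.2f}")
```

A p-value near 0 or 1 would flag an aspect of the data that the model does not capture; deciding what to do about it is the part that stays outside the code.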

But now suppose we consider θ as a random variable defined on the integers. Thus θ = 0 or 1 or 2 or 3 or … or -1 or -2 or -3 or …, and with a discrete prior distribution formed by the discrete approximation to the N(0,10^2) distribution. In practice, with the sample size and parameters as defined above, the inferences are essentially unchanged from the continuous case, as we have defined θ on a suitably tight grid.
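A quick numerical sketch of that claim, under stated assumptions: the sample mean of 4.2 is made up, the discrete prior is taken to be proportional to the N(0,10^2) density evaluated at each integer, and the grid is truncated at ±100 (far enough out that the truncation is negligible here).

```python
import numpy as np

# Hypothetical data summary: n = 10 observations with sample mean 4.2 (assumed).
n, sigma, ybar = 10, 10.0, 4.2
prior_sd = 10.0

# Continuous case: conjugate normal posterior for theta.
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (n * ybar / sigma**2)
post_sd = float(np.sqrt(post_var))

# Discrete case: theta restricted to the integers, with prior weights
# proportional to the N(0, 10^2) density at each integer (an assumption).
grid = np.arange(-100, 101)
log_post = -0.5 * (grid / prior_sd) ** 2 - 0.5 * n * ((ybar - grid) / sigma) ** 2
w = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
w /= w.sum()
disc_mean = float((w * grid).sum())
disc_sd = float(np.sqrt((w * (grid - disc_mean) ** 2).sum()))

print(f"continuous: mean = {post_mean:.4f}, sd = {post_sd:.4f}")
print(f"discrete:   mean = {disc_mean:.4f}, sd = {disc_sd:.4f}")
```

The posterior standard deviation is about 3, so a grid with spacing 1 is fine enough that the two sets of summaries agree closely, which is exactly why the philosophical distinction feels awkward.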

But from my philosophical position, the discrete model is completely different: I have already written that I do not like to choose or average over a discrete set of models. This is a silly example but it illustrates a hole in my philosophical foundations: when am I allowed to do normal Bayesian inference about a parameter θ in a model, and when do I consider θ to be indexing a class of models, in which case I consider posterior inference about θ to be an illegitimate bit of induction? I understand the distinction in extreme cases–they correspond to the difference between normal science and potential scientific revolutions–but the demarcation does not cleanly align with whether a model is discrete or continuous.

Another incoherence in Bayesian data analysis, as I practice it, arises after a model check. Judgment is required to decide what to do after learning that an aspect of data is not fitted well by the model–or, for that matter, in deciding what to do in the other case, when a test does not reject. In either case, we must think about the purposes of our modeling and our available resources for data collection and computation. I am deductively Bayesian when performing inference and checking within a model, but I must go outside this framework when making decisions about whether and how to alter my model.

In my defense, I see comparable incoherence in all other statistical philosophies:

– Subjective Bayesianism appears fully coherent but falls apart when you examine the assumption that your prior distribution can completely reflect prior knowledge. This can’t be, even setting aside that actual prior distributions tend to be chosen from convenient parametric families. If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.

– Classical parametric statistics disallows probabilistic prior information but assumes the likelihood function to be precisely known, which can’t make sense except in some very special cases. Robust analysis attempts to account for uncertainty about model specification but relies on additional assumptions such as independence.

– Classical nonparametric methods rely strongly on symmetry, translation invariance, independence, and other generally unrealistic assumptions.

My point here is not to say that my preferred methods are better than others but rather to couple my admission of philosophical incoherence with a reminder that there is no available coherent alternative.

To put it another way, how would an “AI” do Bayesian analysis (or statistical inference in general)? The straight-up subjective Bayes approach requires the AI to already have all possible models specified in its database with appropriate prior probabilities. That doesn’t seem plausible to me. But my approach requires models to be generated on the fly (in response to earlier model checks and the appearance of new data). It’s clear enough how an AI could perform inference on a specified graphical model (or on a mixture of such models); it’s not so clear how an AI could do model checking. When I do the three steps of Bayesian data analysis, human input is needed to interpret graphs and decide on model improvements. But we can’t ask the AI to delegate such tasks to a homunculus. Another way of saying this is that, if Bayesian data analysis is a form of applied statistics, an AI can’t fully do statistics until it can do science. In the meantime, though, the computer is a hugely useful tool in each of the three stages of Bayesian data analysis, even if it can’t put it all together yet. I’m hoping that by automating some of the steps required to evaluate and compare models, we can get a better sense of what outside knowledge we are adding at each step.

P.S. Cosma Shalizi writes:

If your graphical model does not have all possible edges, then there are conditional independencies which could be checked mechanically (and there are programs which do things like that). I know you don’t like zeroes in the model, but it’s at least a step in the right direction, no?

Yup. Or at times it could be a step in the wrong direction, depending on the model. But if the program has the ability to check the model (or at least to pass the relevant graphs on to the homunculus), then I would think that working with conditional independence approximations could be a useful way to move forward.

That’s part of my incoherence. On one hand, I hate discrete model averaging. On the other hand, I typically end up with one model, which is just a special case of an average of models.

1. Corey says:

"The straight-up subjective Bayes approach requires the AI to already have all possible models specified in its database with appropriate prior probabilities. That doesn't seem plausible to me."

Ray Solomonoff did research on this question in the 60s and 70s. Short version: it's possible in a formal mathematical sense, but it's exactly as uncomputable as Chaitin's constant. So the problem becomes how to design good computable approximations.

2. Andrew Gelman says:

Corey:

I followed your link. Dude has a cool beard but I don't see that his model makes sense even if it could be computed.

3. Hassan says:

Andrew:

The version of the universal mixture presented in the Scholarpedia article is a little obscure. In fact the M in the article is equivalent to doing Bayesian inference by assigning a computable distribution p the prior 2^{-K(p)} where K(p) is a measure of the complexity of p (its Kolmogorov complexity).

Roughly, p computable means that there is a computer program hat{p} that given x as input, outputs p(x). Here x is an arbitrary sequence from a finite alphabet (so p can be any distribution that can be written down as a computer program). K(p) is the length of hat{p} (the fact that there are many such hat{p}s is dealt with in the theory), and it is easily shown (via Kraft inequality) that 2^{-K(p)} is a "semi-prior" – that is

sum_{computable p} 2^{-K(p)} <= 1

And this is good enough…

So you define the Bayes mixture

J(x) = sum_{computable p} p(x) 2^{-K(p)}

and do your usual Bayesian inference with this. It can be shown that the prediction J(y|x) of this mixture converges to the prediction of the best computable distribution p^*(y|x) fairly rapidly with increasing length of x.
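As a toy illustration of the mixture J, here is a sketch over a tiny, hand-picked model class. Everything specific in it is an assumption: the class is five i.i.d. Bernoulli(q) sources, and the "code lengths" K are made-up stand-ins for Kolmogorov complexity, which cannot actually be computed. The real construction sums over all computable distributions.

```python
import numpy as np

# Hypothetical model class: i.i.d. Bernoulli(q) sources. The code lengths K are
# invented stand-ins for Kolmogorov complexity, which is uncomputable.
q = np.array([0.5, 0.25, 0.75, 0.1, 0.9])
K = np.array([1, 3, 3, 4, 4])        # bits; "simpler" models get shorter codes
prior = 2.0 ** -K                    # semi-prior: the weights sum to at most 1
assert prior.sum() <= 1.0

def mixture_predict(x):
    """Return J(next = 1 | x) for the mixture J(x) = sum_p p(x) 2^{-K(p)}."""
    x = np.asarray(x)
    ones = x.sum()
    zeros = len(x) - ones
    log_lik = ones * np.log(q) + zeros * np.log1p(-q)
    w = prior * np.exp(log_lik - log_lik.max())   # unnormalized posterior weights
    w /= w.sum()
    return float(w @ q)                           # J(x1) / J(x)

# Usage: data from the q = 0.75 source; the prediction should drift toward 0.75.
rng = np.random.default_rng(1)
x = rng.random(2000) < 0.75
for t in (0, 10, 100, 2000):
    print(t, round(mixture_predict(x[:t]), 3))
```

With no data the prediction is just the prior-weighted average of the q values; as the sequence grows, the weight concentrates on the source in the class that predicts best.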

The intuition (maybe this should've come first): In practice, all we can ever deal with are computable distributions. So why not see what we can do if we just consider Bayesian inference with computable distributions only (while ignoring any and all kinds of constraints on computational resources).

Final Words: To me it seems very cool that you can get these kinds of convergence results (references in the Scholarpedia article) in such an abstract and general setting.

From the perspective of a Stats/ML researcher, it seems like a useful picture to keep in mind while developing your next inference algorithm. For instance, it was an inspiration for V. Vovk's version of online learning algorithms, and through it the whole field of Prediction with Expert Advice. However, trying to approximate it directly seems very difficult, if not impossible.

4. Andrew Gelman says:

Hassan: Asymptotic results are fine but it's not clear to me that this prior makes sense in finite samples. Just because the prior distributions are defined mathematically, that doesn't make them appropriate for applied statistical inference.

5. K? O'Rourke says:

Do the models average understandably/sensibly?

What first caught me about model averaging of even simple linear models was that the coefficients had a different meaning in each model.

A good frequency example might be Efron and Tibshirani's Prevalidation paper.

"three steps of model building, inference, and model checking"

Last summer some of us came up with fit, understanding, criticism and keeping – though I think you are right: criticism (via inference) comes before understanding – but that would have given a nastier acronym than FCUK ;-)

Nor is it linear.

Peirce's trichotomy could be put as possible models, implication of models and tentative understanding of models which moves on to his continuity of models that evolve – not a single 1-2-3 but over and over again.

So maybe in R code

for(i in 1:infinity)
fit.i; understand.i, (criticism and tentative keeping).i

K?

6. Manuel Moe G says:

You have the capability to consider multiple models: call them M1, M2… Obviously, you have a system for ranking models, all things considered equal, based on the models' situational performance. Call this S1. There will be sticky situations where you have to admit the possibility of other systems of ranking, call them S2, S3… Eventually, you arrive at a choice, so there must be a system for the ranking of S1, S2, S3, — call this S'1. There will be sticky situations where you have to admit the possibility of a S'2, S'3. So there must exist S''1, S''2, S'''1, S'''2, etc.

You can flatter yourself that there is an infinite regress, but there is not. The human brain is made of components that readily surrender to automatic fast and frugal heuristics. Ecologically, these heuristics perform quite well, or else our parents would have never reproduced. In natural corner cases or in artificial situations, these heuristics perform poorly – that is exactly how we know they are but automatic heuristics.

These automatic fast and frugal heuristics can be idiosyncratic, or widely shared, or universally shared by all humans. If your heuristic is universally shared, sigh in relief, you will never get called out on it, and there will be sympathy for failure from everywhere.

When you consider cases where one person exercises preference and makes the model and makes the choice and takes the action and enjoys the reward and bears the responsibility (defined as a claw-back of reward upon demonstration of failure of responsibility), these issues are embarrassingly clear, and never flattering. At this level, impossible to distinguish fool from king from knight from knave.

If the data collector, the model maker, the decision maker, the actor, the residual claimant are different people and all distinct from who bears the ultimate liability for failure, the description of the situation can be taffy-pulled into a huge gray block of text like in the Sokal hoax, flattering to all.

What I wrote is much snarkier than can be justified, so I guess my ignorance and poor communication skills are compounded by insufficient levels of caffeine. I will admit I lack the brainpower to intelligently separate preference, analysis, choice, action, and responsibility – when I consider one, I consider them all.

7. John says:

" … If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, …"

Well, you can, of course.

However, if you want your subjective posterior to be the same as if you had actually done the Bayesian updating, I don't think the conclusion follows, as it requires that you can do the Bayesian updating calculations precisely in your head. If you can't, you can certainly engage in the process of observing the data and writing a subjective posterior down, but it won't be as "good" a posterior as if you had done the calculations.

Being able to write your prior down also does not imply that you are able to do the calculations, e.g., because you would have had to have done the calculations correctly at some time in the past in order to obtain your current prior. For example, you could have a prior based on anecdotal evidence rather than data.

8. Murray Jorgensen says:

For a less artificial example of this type of problem you could consider the Richards family of growth curves. See eg
http://en.wikipedia.org/wiki/Generalised_logistic

For different choices of the continuous shape parameter you can get von Bertalanffy, Logistic, Gompertz curves and limiting cases include the step function and exponential growth up to a constant maximum. This shape parameter turns out to be intrinsically highly nonlinear in the Bates/Watts sense, so perhaps what you gain by considering all these models as the single Richards model is less than you might hope for. In any case setting a rigid boundary between parameter estimation and model selection seems problematical.
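A small sketch of how one continuous shape parameter moves between named curves. The parametrization below, K / (1 + nu * exp(-B (t - M)))^(1/nu) with nu > 0, is one common form and is an assumption here; the Wikipedia page linked above uses a more general version with additional parameters. Only the logistic (nu = 1) and Gompertz (nu approaching 0) cases are shown.

```python
import numpy as np

def richards(t, K=1.0, B=1.0, M=0.0, nu=1.0):
    """Richards curve in one common parametrization (assumed), valid for nu > 0."""
    z = np.exp(-B * (np.asarray(t, dtype=float) - M))
    # (1 + nu*z)^(-1/nu), written with log1p so that small nu stays accurate.
    return K * np.exp(-np.log1p(nu * z) / nu)

def logistic(t, K=1.0, B=1.0, M=0.0):
    return K / (1.0 + np.exp(-B * (np.asarray(t, dtype=float) - M)))

def gompertz(t, K=1.0, B=1.0, M=0.0):
    return K * np.exp(-np.exp(-B * (np.asarray(t, dtype=float) - M)))

t = np.linspace(-3, 3, 7)
print(np.max(np.abs(richards(t, nu=1.0) - logistic(t))))    # nu = 1: logistic
print(np.max(np.abs(richards(t, nu=1e-8) - gompertz(t))))   # nu -> 0: Gompertz
```

Nothing in the function distinguishes "estimating nu" from "choosing between the logistic and Gompertz models", which is the point about the boundary being hard to draw.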

Murray Jorgensen

PD My friend David Dowe will not forgive me if I fail to mention the Monash "Minimum Message Length" school of inference and the memorial conference for Ray Solomonoff that they are hosting:

9. Hassan says:

Andrew:

– I should stress that the asymptotics are only in terms of computational resources, that is the convergence results hold in the limit of unlimited resources. The bounds themselves are for finite samples. For instance, the most basic result says that J satisfies

sum_{n=1}^T E_{over x of length n} [sum_y (J(y|x) - p(y|x))^2 ] < 2 ln (K(p))

for any computable distribution p and any finite T. So the total expected squared error is bounded by the log of the complexity of p, which is a finite constant. This can also be extended to arbitrary bounded loss functions.

– I forgot to mention another very successful approximation of universal prior, context tree weighting, which is a method for predicting binary sequences. The algorithm does exact Bayesian inference over variable order Markov models of fixed and arbitrary depth and uses a complexity prior. The algorithm is also efficient computationally. In fact, it is probably one of the (if not the) best compression algorithms out there, the only problem being it's patented.

This context tree method was adapted to the online learning/prediction with expert advice case where we make no assumptions on the generating process and simply try to minimize the regret w.r.t the best predictor in hindsight. Also applies to non-binary sequences. (this is in chapter 5 of Prediction Learning and Games by Cesa-Bianchi and Lugosi – it seems to be available online for free).

– I do understand your deeper point, which is what is an applied statistician such as yourself to do with Solomonoff's work. Since I am a ML person I don't really have a good answer to this question and for the moment can only point to what I have written before. I guess given your recent postings on binary predictions, the context tree method (or its extensions) might be of interest…

10. D. Mayo says:

A quick note: frequentist statistics does not disallow probabilistic prior information if one is dealing with a random variable having a frequentist prior probability distribution, but this is rarely the case with statistical hypotheses of interest in science. Nor does it assume “the likelihood function to be precisely known,” but provides methods for checking statistical model assumptions. One of the purposes of an upcoming on-line symposium will be to bring out a handful of charges with which frequentist (error) statistical methods are often saddled, these being two.

A distinct issue that might be the focus for a possible discussion here or elsewhere concerns the central difference between frequentist error statistics and Bayesian statistics on the role of the sampling distribution—the basis for error probabilities— with its consideration of outcomes other than those observed in reasoning from data. The importance of such error statistical considerations underlies key intuitions in epistemological discussions that take place quite apart from statistics and yet, oddly, these intuitions conflict with Bayesian reasoning. So this might be a general linchpin for illuminating what is really at the heart of rival accounts of inductive-statistical inference.

11. Andrew Gelman says:

Mayo:

Indeed, frequentist statistics allows regularization procedures. But it is my impression that frequentist statisticians (and researchers in that area) have a soft spot in their hearts for maximum likelihood estimates and hang on to them a bit longer than necessary. Frequentists tend to use the mle without question until they are bludgeoned by the evidence into doing something else. I think this has something to do with the idea of the likelihood being "objective" (or more objective than any regularization procedure), but to me this makes little sense. Similarly, there is frequentist research on robustness to misspecified likelihoods, but in general the statisticians who espouse frequentist principles seem to be pretty accepting of whatever likelihood functions are out there, while holding an extreme skepticism when it comes to regularization procedures.

12. D. Mayo says:

I don't know what you mean by regularization procedures …or being bludgeoned by the evidence—

13. Andrew Gelman says:

Mayo:

By regularization in this context I mean maximum penalized likelihood. The penalty function (or, in a Bayesian context, the log-prior density) keeps the estimate well-behaved ("regularized").

By bludgeoned by the evidence in this context I mean that some estimates are so dumb that a sensible statistician will adjust his or her principles accordingly to rule them out. The funny thing, though, is that maximum likelihood estimation is more of a clever trick than a good principle. Maximum likelihood is a fine and useful trick–don't get me wrong–but there's no purely statistical (as opposed to sociological or historical) reason to consider it your default estimation procedure.

14. Nick Cox says:

E.B. Wilson pointed out that maximum likelihood alone would lead you to suppose from one toss of a coin that the coin was same on both sides (or at least that it always returned the same result).
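The one-toss example is easy to put next to the penalized likelihood mentioned earlier in the thread. In this sketch the penalty is the log density of a Beta(2, 2) prior, which is an arbitrary choice made for illustration; the function names are hypothetical.

```python
def mle(heads, n):
    """Maximum likelihood estimate of Pr(heads) from n tosses."""
    return heads / n

def penalized_mle(heads, n, a=2.0, b=2.0):
    """Maximizer of log-likelihood + (a-1)*log(p) + (b-1)*log(1-p),
    i.e. the posterior mode under a Beta(a, b) prior (an assumed penalty)."""
    return (heads + a - 1.0) / (n + a + b - 2.0)

# One toss, one head.
print(mle(1, 1))            # 1.0: the coin "always" lands heads
print(penalized_mle(1, 1))  # 2/3: pulled toward 1/2 by the penalty
```

With more tosses the two estimates converge, so the penalty matters most exactly where the raw maximum likelihood estimate is least believable.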

15. K? O'Rourke says:

An historical account of Maximum likelihood as a fine and useful trick is given by Stigler.

K?