Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert)

They say I did something bad, then why’s it feel so good–Taylor Swift

It’s a Sunday afternoon and I’m trying to work myself up to the sort of emotional fortitude where I can survive the Diamanda Galás concert that I was super excited about a few months ago, but now, as I stare down the barrel of a Greek woman vocalizing at me for 2 hours somewhere in East Toronto, I am starting to feel the fear.

Rather than anticipating the horror of being an emotional wreck on public transportation at about 10pm tonight, I’m thinking about Bayesian asymptotics. (Things that will make me cry on public transport: Baby Dee, Le Gateau Chocolate, Sinead O’Connor, The Mountain Goats. Things that will not make me cry on any type of transportation: Bayesian Asymptotics.)

So why am I thinking about Bayesian asymptotics? Well because somebody pointed out a thread on the Twitter (which is now apparently a place where people have long technical discussions about statistics, rather than a place where we can learn Bette Midler’s views on Jean Genet or reminisce about that time Cher was really into horses) that says a very bad thing:

The Bernstein-von Mises theorem kills any criticism against non-informative priors (for the models commonly used). Priors only matter if one wishes to combine one’s confirmation bias with small studies. Time to move to more interesting stuff(predictive inference)

I’ve written in other places about how Bayesian models do well at prediction (and Andrew and Aki have written even more on it), so I’m leaving the last sentence alone. Similarly the criticisms in the second last sentence are mainly rendered irrelevant if we focus on weakly informative priors. So let’s talk about the first sentence.

Look what you made me do

The Bernstein-von Mises theorem, like Right Said Fred, is both a wonder and a horror that refuses to stay confined to a bygone era. So what is it?

The Bernstein-von Mises theorem (or BvM when I’m feeling lazy) says the following:

Under some conditions, a posterior distribution converges as you get more and more data to a multivariate normal distribution centred at the maximum likelihood estimator with covariance matrix given by $latex n^{-1} I(\theta_0)^{-1}$, where $latex \theta_0$ is the true population parameter (Edit: here $latex I(\theta_0)$ is the Fisher information matrix at the true population parameter value).

A shorter version of this is that (under some conditions) a posterior distribution looks asymptotically like the sampling distribution of a maximum likelihood estimator.
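For the record, one standard way of writing that conclusion is in total variation distance: under the regularity conditions discussed below,

$latex \left\| \Pi\left( \cdot \mid X_1, \ldots, X_n \right) - \mathcal{N}\left( \hat{\theta}_n, \, n^{-1} I(\theta_0)^{-1} \right) \right\|_{\mathrm{TV}} \rightarrow 0$

in probability under the true parameter, where $latex \Pi\left( \cdot \mid X_1, \ldots, X_n \right)$ is the posterior and $latex \hat{\theta}_n$ is the maximum likelihood estimator.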

Or we can do the wikipedia version (which lacks both clarity and precision. A rare feat.):

[T]he posterior distribution for unknown quantities in any problem is effectively independent of the prior distribution (assuming it obeys Cromwell’s rule) once the amount of information supplied by a sample of data is large enough.

Like a lot of theorems that are imprecisely stated, BvM is both almost always true and absolutely never true.  So in order to do anything useful, we need to actually think about the assumptions. They are written, in great and loving detail, in Section 2.25 of  these lecture notes from Richard Nickl. I have neither the time nor energy to write these all out but here are some important assumptions:

  1. The maximum likelihood estimator is consistent.
  2. The model has a fixed, finite number of parameters.
  3. The true parameter value lies on the interior of the parameter space (ie if you’re estimating a standard deviation, the true value can’t be zero).
  4. The prior density must be non-zero in a neighbourhood of $latex \theta_0$.
  5. The log-likelihood needs to be smooth (two derivatives at the true value and some other stuff).

The first condition rules out any problem where you’d want to use a penalized maximum likelihood estimator. (Edit: Well this was awkwardly stated. You need the MLE to be unbiased [Edit: Consistent! Not unbiased. Thanks A Reader] and there to be a uniformly consistent estimator, so I’m skeptical these things hold in the situation where you would use penalized MLE.) The third one makes estimating variance components difficult. The fifth condition may not be satisfied after you integrate out nuisance parameters, as this can lead to spikes in the likelihood.
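To see how easily the first two conditions can fall over, here is a minimal simulation (mine, not the post’s) of the classic Neyman-Scott problem: every group brings its own nuisance mean, so the number of parameters grows with the data and the joint MLE of the variance never gets the right answer, no matter how many groups you collect.

```python
# Minimal Neyman-Scott sketch: m groups, 2 observations each, one nuisance
# mean per group. The joint MLE of sigma^2 converges to sigma^2 / 2, not
# sigma^2, so consistency (assumption 1) fails because the parameter count
# grows with the data (assumption 2).
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0

for n_groups in (10**2, 10**4, 10**6):
    mu = rng.normal(0.0, 3.0, size=n_groups)              # one mean per group
    y = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n_groups, 2))
    within_ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum()
    sigma2_mle = within_ss / (2 * n_groups)                # joint MLE over (mu, sigma^2)
    print(n_groups, sigma2_mle)                            # -> about 0.5, not 1.0
```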

I guess the key thing to remember is that this “thing that’s always true” is, like everything else in statistics, a highly technical statement that can be wrong when the conditions under which it is true are not satisfied.

Call it what you want

You do it ’cause you can. Because when I was younger, I couldn’t sustain those phrases as long as I could. Now I can just go on and on. If you keep your stamina and you learn how to sing right, you should get better rather than worse. – Diamanda Galás in Rolling Stone 

Andrew has pointed out many times that the problem with scientists misapplying statistics is not that they haven’t been listening, it’s that they have listened too well. It is not hard to find statisticians (Bayesian or not) who will espouse a similar sentiment to the first sentence of that ill-begotten tweet. And that’s a problem.

When someone says to me “Bernstein-von Mises implies that the prior only has a higher-order effect on the posterior”, I know what they mean (or, I guess, what they should mean). I know that they’re talking about a regular model, a lot of information, and a true parameter that isn’t on the boundary of the parameter space. I know that declaring something a higher-order effect is a loaded statement because the “pre-asymptotic” regime can be large for complex models.

Or, to put it differently, when someone says that, I know they are not saying that priors aren’t important in Bayesian inference. But it can be hard to know if they know this. And to be fair, if you make a super-simple model that can be used in the type of situation where you could read the entrails of a recently gutted chicken and still get an efficient, asymptotically normal estimator, then the prior is not a big deal unless you get it startlingly wrong.

No matter how much applied scientists may want to just keep on gutting those chickens, there aren’t that many chicken gutting problems around. (Let the record state that a “chicken gutting” problem is one where you only have a couple of parameters to control your system, and you have accurate, random, iid samples from your population of interest. NHSTs are pretty good at gutting chickens.) And the moment that data gets “big” (or gets called that to get the attention of a granting agency), all the chickens have left the building in some sort of “chicken rapture” leaving behind only tiny pairs of chicken shoes.

Big reputation, big reputation. Ooh, you and me would be a big conversation.

I guess what I’m describing is a communication problem. We spend a lot of time communicating the chicken gutting case to our students, applied researchers, applied scientists, and the public rather than properly preparing them for the poultry armageddon that is data analysis in 2017. They have nothing but a tiny knife as they attempt to wrestle truth from the menagerie of rhinoceroses, pumas, and semi-mythical megafauna that are all that remain in this chicken-free wasteland we call a home.

(I acknowledge that this metaphor has gotten away from me.)

The mathematical end of statistics is a highly technical discipline. That’s not so unusual–lots of disciplines are highly technical. What is unusual about statistics as a discipline is that the highly technical parts of the field mix with the deeply applied parts. Without either of these ingredients, statistics wouldn’t be an effective discipline.  The problem is, as it always is, that people at different ends of the discipline often aren’t very good at talking to each other.

Many people who work as excellent statisticians do not have a mathematics background and do not follow the nuances of the technical language. And many people who work as excellent statisticians do not do any applied work and do not understand that the nuances of their work are lost on the broader audience.

My background is a bit weird. I wasn’t trained as a statistician, so a lot of the probabilistic arguments in the theoretical stats literature feel unnatural to me. So I know that when people like Judith Rousseau or Natalia Bochkina or Ismael Castillo or Aad van der Vaart or any of the slew of people who understand Bayesian asymptotics more deeply than I can ever hope to speak or write, I need to pay a lot of attention to the specifics. I will never understand their work on the first pass, and may never understand it deeply no matter how much effort I put in.

The only reason that I now know more than nothing about Bayesian asymptotics is that I hit a point where I no longer had the luxury to not know. So now I know enough to at least know what I don’t know.

Replication is not just Taylor Swift’s new album

The main thing that I want to make clear about the Bernstein-von Mises theorem is that it is hard to apply it in practice. This is for the exact same reason that the asymptotic arguments behind NHSTs rarely apply in practice.

Just because you have a lot of data, doesn’t mean you have independent replicates of the same experiment.

In particular, issues around selection, experimental design, forking paths, etc all are relevant to applying statistical asymptotics. Asymptotic statements are about what would happen if you gather more data and analyze it, and therefore they are statements about the entire procedure of doing inference on a new data set. So you can’t just declare that BvM holds. The existence of a Bernstein-von Mises theorem for your analysis is a statement about how you have conducted your entire analysis.

Let’s start with the obvious candidate for breaking BvM: big data. Large, modern data sets are typically observational (that is, they are not designed and the mechanism for including the data may be correlated with the inferential aim of the study). For observational data, it is unlikely that the posterior mean (for example) will be a consistent estimator of the population parameter, which precludes a BvM theorem from applying.
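Here is a minimal sketch (mine, not the post’s) of the sort of thing that goes wrong. The selection mechanism is hypothetical, but the point is general: if inclusion in the sample depends on the outcome, the estimate stays biased no matter how big the sample gets.

```python
# Outcome-dependent selection: the sample mean does not converge to the
# population mean, so no consistency and no Bernstein-von Mises.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 0.0

for n in (10**3, 10**5, 10**7):
    y = rng.normal(true_mean, 1.0, size=n)
    # Hypothetical inclusion mechanism: larger outcomes are more likely to
    # end up in the "big" observational data set.
    keep = rng.random(n) < 1.0 / (1.0 + np.exp(-y))
    print(n, y[keep].mean())   # hovers around 0.4, not around 0, as n grows
```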

Lesson: Consistency is a necessary condition for a BvM to hold, and it is unlikely to hold for undesigned data.

Now onto the next victim: concept drift.  Let’s imagine that we can somehow guarantee that the sampling mechanism we are using to collect our big data set will give us a sample that is representative of the population as a whole.  Now we have to deal with the fact that it takes a lot of time to collect a lot of data. Over this time, the process you’re measuring can change.  Unless your model is able to specifically model the mechanism for this change, you are unlikely to be in a situation where BvM holds.
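A similarly minimal sketch (again mine, with a made-up drift) of what that looks like: the pooled estimate converges, but to a time-average of the drifting parameter rather than to its current value.

```python
# Concept drift: the quantity being measured changes while the data are
# collected, so the pooled mean estimates a time-average, not "the" parameter.
import numpy as np

rng = np.random.default_rng(2)

for n in (10**3, 10**5, 10**7):
    t = np.linspace(0.0, 1.0, n)        # collection time, rescaled to [0, 1]
    mean_at_t = 1.0 + 0.5 * t           # the process drifts as we collect
    y = rng.normal(mean_at_t, 1.0)
    print(n, y.mean())                  # -> roughly 1.25, the time-average,
                                        #    not the end-of-study value of 1.5
```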

Lesson: Big data sets are not always instantaneous snapshots of a static process, and this can kill off BvM.

For all the final girls: Big data sets are often built by merging many smaller data sets. One particular example springs to mind here: global health data. This is gathered from individual countries, each of which has its own data gathering protocols. To some extent, you can get around this by carefully including design information in your inference, but if you’ve come to the data after it has been collated, you may not know enough to do this. Once again, this can lead to biased inferences for which the Bernstein-von Mises theorem will not hold.

Lesson: The assumptions of the Bernstein-von Mises theorem are fragile and it’s very easy for a dataset or analysis to violate them. It is very hard to tell, without outside information, that this has not happened.

’Cause I know that it’s delicate

(Written after seeing Diamanda Galás, who was incredible, or possibly unbelievable, and definitely unreal. Sitting with a thousand or so people in total silence listening to the world end is a hell of a way to wrap up a weekend.)

Much like in life, in statistics things typically only ever get worse when they get more complicated.  In the land of the Bernstein-von Mises theorem, this manifests in the guise of a dependence on the complexity of  the model.  Typically, if there are $latex p$ parameters in a model and we observe $latex n$ independent data points (and all the assumptions of the BvM are satisfied), then the distance from the posterior to a Normal distribution is $latex \mathcal{O}\left(\sqrt{pn^{-1}}\right)$.  That is, it takes longer to converge to a normal distribution when you have more parameters to estimate.
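To get a feel for that scaling, here is a trivial back-of-the-envelope calculation (ignoring the constants hiding inside the $latex \mathcal{O}(\cdot)$, which can be large):

```python
# How much data before sqrt(p / n) drops below a tolerance? Constants in the
# O(.) are ignored, so treat these as orders of magnitude only.
import math

def n_needed(p, tol):
    """Smallest n with sqrt(p / n) <= tol."""
    return math.ceil(p / tol ** 2)

for p in (2, 50, 1000):
    print(f"p = {p:5d}  ->  n >= {n_needed(p, tol=0.1):,}")
# p =     2  ->  n >= 200
# p =    50  ->  n >= 5,000
# p =  1000  ->  n >= 100,000
```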

Do you have the fear yet? As readers of this blog, you might be seeing the problem already. With multilevel models, you will frequently have at least as many parameters as you have observations.  Of course, the number of effective parameters is usually much much smaller due to partial pooling. Exactly how much smaller depends on how much pooling takes place, which depends on the data set that is observed.  So you can see the problem.

Once again, it’s those pesky assumptions (like the Scooby gang, they’re not going away no matter how much latex you wear). In particular, the fundamental assumption is that you have replicates, which, in a multilevel model with as many parameters as data, essentially means that you can pool more and more as you observe more and more categories. Or that you keep the number of categories fixed and you see more and more data in each category (and eventually see an infinite amount of data in each category).

All this means that the speed at which you hit the asymptotic regime (ie how much data you need before you can just pretend your posterior is Gaussian) will be a complicated function of your data. If you are using a multilevel model and the data does not support very much pooling, then you will reach infinity very very slowly.

This is why we can’t have nice things

Rolling Stone: Taylor Swift?

Diamanda Galás: [Gagging noises]

The ultimate death knell for simple statements about the Bernstein-von Mises theorem is the case where the model has an infinite dimensional parameter (aka a non-parametric effect).  For example, if one of your parameters is an unknown function.

A common example relevant in a health context would be if you’re fitting survival data using a Cox Proportional Hazards model, where the baseline hazard function is typically modelled by a non-parametric effect.  In this case, you don’t actually care about the baseline hazard (it’s a nuisance parameter), but you still have to model it because you’re being Bayesian. In the literature, this type of model is called a “semi-parametric model” as you care about the parametric part, but you still have to account for a non-parametric term.
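In symbols, the Cox model writes the hazard for a subject with covariates $latex x$ as $latex h(t \mid x) = h_0(t) \exp(x^{T}\beta)$: the regression coefficients $latex \beta$ are the finite-dimensional part you actually care about, and the baseline hazard $latex h_0(\cdot)$ is the infinite-dimensional part you have to carry along to be Bayesian about the whole thing.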

To summarize a very long story that’s not been completely mapped out yet, BvM does not hold in general for models with an infinite dimensional parameter. But it does hold in some specific cases. And maybe these cases are common, although honestly it’s hard for me to tell. This is because working out exactly when BvM holds for these sorts of models involves poring through some really tough theory papers, which typically only give explicit results for toy problems where some difficult calculations are possible.

And then there are Bayesian models of sparsity, trans-dimensional models (eg models where the number of parameters isn’t fixed) etc etc etc.

But I’ll be cleaning up bottles with you on New Year’s Day

So to summarise, a thing that a person said on twitter wasn’t very well thought out. Bernstein-von Mises fails in a whole variety of ways for a whole variety of interesting, difficult models that are extremely relevant for applied data analysis.

But I feel like I’ve been a bit down on asymptotic theory in this post. And just like Diamanda Galás ended her show with a spirited rendition of Johnny Paycheck’s (Pardon Me) I’ve Got Someone to Kill, I want to end this on a lighter note.

Almost all of my papers (or at least the ones that have any theory in them at all) have quite a lot of asymptotic results, asymptotic reasoning, and applications of other people’s asymptotic theory. So I strongly believe that asymptotics are a vital part of our toolbox as statisticians. Why? Because non-asymptotic theory is just too bloody hard.

In the end, we only have two tools at hand to understand and criticize modern statistical models: computation (a lot of which also relies on asymptotic theory) and asymptotic theory. We are trying to build a rocket ship with an earthmover and a rusty spoon, so we need to use them very well. We need to make sure we communicate our two tools better to the end-users of statistical methods. Because most people do not have the time, the training, or the interest to go back to the source material and understand everything themselves.

24 thoughts on “Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert)”

  1. Under your 5 assumptions, and what they mean in practice, you seem to be implying that penalized MLE is not consistent. Why do you make this claim? Penalized MLEs are almost always biased, but as far as I know, usually consistent (for a fixed number of parameters and an increasing number of samples).

    • I’ve added a little clarification, but it doesn’t really matter if penalized MLE is consistent, only that the MLE is consistent (with the extra condition that there needs to be a uniformly consistent estimator that doesn’t necessarily have to be the MLE, so I guess the penalized ML could fulfil that role).

  2. > We need to make sure we communicate our two tools better to the end-users of statistical methods.
    Very much agree and the usual brief limitations section that can be interpreted (or the author can interpret) to rule out what users will likely think they can (or really really want to) use the methods for, does not cut it.

    Also, for those who may not be aware, Charlie Geyer http://www.stat.umn.edu/geyer/lecam/simple.pdf argued for the following attitude about asymptotics:
    • Asymptotics is only a heuristic. It provides no guarantees.
    • If worried about the asymptotics, bootstrap!
    • If worried about the bootstrap, iterate the bootstrap!

  3. Had I jumped right to “So to summarise, a thing that a person said on twitter wasn’t very well thought out.”, I could’ve saved myself a lot of trouble.

    I think you meant “nuisance parameter” instead of “nuance parameter” near the start.

    @Keith O’Rourke: Geyer is also the one who has been arguing that you can get away with only running a single chain, e.g., in section 1.11.3 of the intro chapter to the Handbook of MCMC. This can be dangerous advice in practice because you can get autocorrelation estimates from one chain that are consistent with convergence even when multiple chains would indicate the entire posterior isn’t being explored. Even though you can’t cover all regions of the space with initialization, I see no reason not to try a few chains; it helps avoid false positives in convergence assessment based only on autocorrelation estimates.

  4. On this topic, I also like sections 12.6 and 12.7 of Larry Wasserman’s
    http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf
    His last two paragraphs of section 12.7 are:

    This means that, in a topological sense, consistency is rare for Bayesian procedures. From this result, it can also be shown that most pairs of priors lead to inferences that disagree. (The agreeing pairs are meager.) Or as Freedman says in his paper: “ … it is easy to prove that for essentially any pair of Bayesians, each thinks the other is crazy.”

    Now, it is possible to choose a prior that will guarantee consistency in the frequentist sense. However, Freedman’s theorem says that such priors are rare. Why would a Bayesian choose such a prior? If they choose the prior just to get consistency, this suggests that they are realy (sic) trying to be frequentists. If they choose a prior that truly represents their beliefs, then Freedman’s theorem implies that the posterior will likely be inconsistent.

    • All that talk about “belief” in the context of non-parametric priors does not ring true to me. You simply can’t have a firm opinion on the infinite number of degrees of freedom needed, so non-parametric Bayes has nothing to do with belief.

      Once you restrict yourself to classes of NP priors (like a GP or a Dirichlet process), consistency becomes much more likely.

      Or, to say it shorter, I think Wasserman is being disingenuous here.

    • “I was asked to comment”… yeah right, it’s less embarrassing to just say “hey, I commented on your article”, because we all know nobody asked you to comment on it.

      • It’s precisely because saying “I was asked to comment on…” is more peculiar than saying “I wrote up a post about…” (and because I can’t see that any benefit accrues for saying the one rather than the other) that I’m inclined to believe that he actually was asked by someone to comment on the post.

    • Sean:

      I wrote “asymptotically we are all dead” a few years ago, but a quick google search finds an attribution to psychometrician Melvin Novick.

      I think the phrase is such an obvious idea (following the famous Keynes line, “In the long run we are all dead”) that it’s no surprise that people keep independently coming up with it.

  5. Just a small correction for anyone reading this post in 2018.

    In Briggs’s article he writes (http://wmbriggs.com/post/23244/):
    “No probability can be defined in frequentist theory unless infinite samples are available.”

    This is incorrect, I believe. The law of large numbers (in its finite-sample, Chebyshev form) says that for any e>0, the relative frequency of heads in m flips of a coin with probability of heads p will be in [p-e, p+e] with probability at least 1-1/(m*e^2). It does not say anything about “infinity”. In practice, the relative frequency of heads settles down rather fast.
