This blog has roughly a month’s worth of items waiting to be posted. I post about once a day, sometimes rescheduling posts to make room for something topical.

Anyway, it struck me that I know what’s coming up, but *you* don’t. So, here’s what we have for you during the next few days:

**Mon:** More on US health care overkill

**Tues:** My talks in Bristol this Wed and London this Thurs

**Wed:** How to think about “identifiability” in Bayesian inference?

**Thurs:** Stopping rules and Bayesian analysis

**Fri:** The popularity of certain baby names is falling off the clifffffffffffff

Plus anything our cobloggers might choose to post during these days. And, if Woody Allen or Ed Wegman or anyone else newsworthy asks us to publish an op-ed for them, we’ll consider it.

Enjoy.

I’m looking forward to identifiability in Bayesian approaches.

I’ve come across a confusing number of definitions and perspectives.

Topic request:

Entropy in statistics. As a self-trained quant, I’ve seen entropy used to discriminate between models, but I don’t really get it. What does it do? How does it work?

I humbly request the Prof. Gelman primer on entropy. Is it premised on Bayesian “thinking”? How does the concept of statistical entropy relate to what they talk about in thermodynamics? Your explanations of regression are so clear and understandable in your book that hopefully you can bring that level of clarity to entropy.

thanks,

Sean

Sean:

There’s this paper. I don’t think I have the slides to go with it, though; I gave the talk in 1988.

Entropy is closely linked to unresolved foundational issues in statistics. The reason you’re having trouble getting a handle on it is because the subject matter itself is screwed up, or at least very incomplete. It’s been quite some time since anyone’s made any real progress on it. Maybe one day a deeper and simpler understanding will be had, and it’ll be easy to explain, but that day isn’t today.

There are subareas and special problems, though, where things are clear and proven enough that they might be worth looking at. One is image reconstruction, as Gelman suggested. That Kullback book from the 50’s put out by Dover is not a bad place to look if you’ve had a more typical/Frequentist education. There are a few others. These won’t answer your questions, though, for the reasons given in the paragraph above.

If you want to do more than that, you might give this a try:

http://bayes.wustl.edu/etj/articles/stand.on.entropy.pdf

Actually, maybe I will have a go explaining the connection between Bayes, Entropy, Statistics, and Thermodynamics.

The reasoning is inherently Bayesian. Instead of interpreting P(x) as describing the histogram you’d get from observing many x’s, think instead of P(x) in a purely non-frequentist way. The high probability area of P(x) defines a region W where we think the true x is (or where it will show up in the future). Note this definition makes sense even if there is only ever one x.

When we find an outcome which is highly probable according to this Bayesian P(x), what that means is the following: almost all x’s in W lead to that same outcome. Since the true x that actually exists is in W, we have reason to believe the aforementioned outcome will hold for it as well.

This is exactly what’s happening in Stat Mech/thermodynamics. We have a P(x,p) which defines some big area in phase space which (hopefully) contains the true microstate of the system. Call this area W. Usually P(x,p) is found by using the fact that we know the energy of the true microstate, which confines the location of the true (x,p) in phase space. Obviously we could also use other measured functions of the microstate, like the total angular momentum.

If it should turn out that P(x,p) implies it’s “highly probable” the system will diffuse, then in concrete terms that means almost every microstate in W diffuses. Since the true microstate of the system is one of those elements of W, our best bet is to predict it will actually diffuse.
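To make "almost every microstate in W" concrete, here is a small counting sketch in Python. It is only a toy stand-in for phase space: the "microstates" are all sequences of N coin flips (an assumption made for illustration, not something from the comment above), and the "outcome" is that the fraction of heads lands near one half.

```python
import math

# Toy "phase space": every sequence of N coin flips is one microstate.
# N and the window [lo, hi] are illustrative choices, not from the post.
N = 1000
lo, hi = 450, 550

# Count microstates whose number of heads falls inside the window.
typical = sum(math.comb(N, k) for k in range(lo, hi + 1))
total = 2 ** N  # size of the whole region W

fraction_typical = typical / total
print(f"fraction of microstates with {lo}..{hi} heads: {fraction_typical:.6f}")
```

Because nearly all sequences share the property, betting that the one true sequence shares it is a good bet, even though no frequency interpretation was needed to get there.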

Of course, to carry this out mathematically we deal with things like S = ln |W|, which is Boltzmann’s definition of entropy. The “statistics” definition of entropy, -\int p ln p, is just a generalization of this expression.
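As a quick check of the claim that -\int p ln p generalizes ln |W|, the following Python sketch evaluates the discrete version, -sum p ln p, for a distribution that is uniform over W states. The value of W is an arbitrary illustrative choice.

```python
import math

def shannon_entropy(probs):
    """Discrete entropy -sum p ln p, skipping zero-probability states."""
    return -sum(p * math.log(p) for p in probs if p > 0)

W = 1000                      # hypothetical number of microstates in the region
uniform = [1.0 / W] * W       # equal weight on every microstate

entropy_uniform = shannon_entropy(uniform)
boltzmann = math.log(W)       # S = ln |W|

print(entropy_uniform, boltzmann)
```

For a uniform distribution the two expressions agree up to floating-point error; for a non-uniform distribution over the same states the entropy is smaller.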

Yeah, there’s nothing really magic about entropy (or really what we should be referring to as the Kullback-Leibler divergence).

The basic idea is that we want to have some way of valuing a probability distribution — mathematically we want some function from the space of measures to R. Of course we don’t just want any function — we want something that respects independence (so that if our probability distribution factors into a bunch of independent terms then so does our function), we want something that is everywhere positive, and we want something that’s independent of any coordinate choice.

It turns out that there is a unique function that accomplishes this, and that’s the Kullback-Leibler (KL) divergence. The one subtlety is that KL actually requires a base measure, so what we’re really doing is comparing two distributions (being proper Bayesians, however, all we do are comparisons anyways!). And that comparison is asymmetric: the divergence from p1 to p2 is not the same as the divergence from p2 to p1 (there’s a lot of meaning in that difference, but it’s relatively intuitive). The usual definition of entropy is recovered if you take the base measure to be uniform, but you can only do that in finite-dimensional spaces.
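Both points, the asymmetry and the link to ordinary entropy, can be checked numerically on a small discrete example. This is a minimal Python sketch; the distributions p1 and p2 are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum p ln(p/q); infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # zero-weight terms contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def entropy(p):
    """Discrete entropy -sum p ln p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p1 = [0.7, 0.2, 0.1]            # illustrative distributions
p2 = [0.4, 0.4, 0.2]
uniform = [1.0 / 3] * 3

d12 = kl_divergence(p1, p2)
d21 = kl_divergence(p2, p1)
print("D(p1||p2) =", d12, " D(p2||p1) =", d21)

# With a uniform base measure on n states: H(p) = ln n - D(p || uniform).
print(entropy(p1), math.log(3) - kl_divergence(p1, uniform))
```

The two divergences differ, and the entropy of p1 equals ln 3 minus its divergence from the uniform distribution, which is the sense in which entropy is KL with a uniform base measure (up to sign and a constant).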

So what’s the utility of the KL divergence? Well for one we can use it for model comparison. If we let p(y’ | y, M) be the posterior predictive distribution for model M given measured data y and p(y’) be the true distribution of the data then the KL divergence gives us some sense of the agreement between our model and the truth. In particular, let us compare the KL divergence of the posterior predictive distribution from two models, p(y’ | y, M1) and p(y’ | y, M2), from the truth, p(y’). If we use one version of KL the dependence on p(y’) cancels out and we recover usual Bayesian model comparison with the log odds; if we use the other we get something that depends on p(y’), approximations of which yield cross validation, WAIC, DIC, AIC, etc. Incidentally, this is the reason why cross-validated log loss has been so successful in machine learning lately.

The concepts carry over to thermo, where we want to model a distribution over phase space with knowledge only of a few global expectations such as average energy. The idea is to use the KL divergence from the uniform distribution as a variational equation and choose a distribution that minimizes the divergence (equivalently, maximizes the entropy) given the constraints of satisfying the global expectations (note that the usual issues with a uniform measure fall away here because the infinities are constant and don’t affect the variational optimization). In other words, we model the thermodynamical system with a distribution as uniform as possible provided it satisfies certain constraints. This is Jaynes’s “maximum entropy” approach to thermo (although the alternative ergodic approach is basically the same thing conditioned on classical mechanics being well-behaved).
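A toy version of the constrained problem can be worked numerically. The sketch below assumes a hypothetical four-level system with energies 0, 1, 2, 3 and a required average energy of 1.0 (neither is from the comment). It uses the standard result that the most-uniform distribution under a mean-energy constraint has the form p_i proportional to exp(-beta * E_i), and finds beta by bisection.

```python
import math

def gibbs(energies, beta):
    """Distribution with p_i proportional to exp(-beta * E_i)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, probs):
    return sum(e * p for e, p in zip(energies, probs))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def solve_beta(energies, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on beta; mean energy decreases as beta increases."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, gibbs(energies, mid)) > target:
            lo = mid            # mean too high: need a larger beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0, 3.0]   # hypothetical energy levels
target = 1.0                      # hypothetical measured average energy

beta = solve_beta(energies, target)
maxent = gibbs(energies, beta)

# Another distribution with the same average energy, for comparison.
other = [0.4, 0.3, 0.2, 0.1]

print("beta =", beta)
print("maxent entropy =", entropy(maxent), " other entropy =", entropy(other))
```

Both distributions satisfy the constraint, but the exponential-form one has the larger entropy, that is, the smaller KL divergence from uniform.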

Anyways, it’s all very straightforward and consistent, although maybe a bit long for a blog post.

That’s a very interesting and helpful comment. Thanks for taking the time to write it. If it’s not too much trouble, would you mind expanding a little on this bit:

“…that comparison is asymmetric — the divergence from p1 to p2 is not the same as the divergence from p2 to p1 (there’s a lot of meaning in that difference, but it’s relatively intuitive).”

After reading your comment, I skimmed the Wikipedia page on Kullback-Leibler divergence for enlightenment on that point. From there, I get the impression that KL(p1|p2) can be interpreted as a measure of how well p2 approximates the true distribution p1 (please correct me if that’s wrong). If that’s the case, then KL(p2|p1) should measure how well p1 approximates p2. It is not obvious to me that “approximates” should be an asymmetric property in this case. I mean, if p1 produces samples that look like samples from p2, shouldn’t p2 produce samples that look like samples from p1? Any intuition you could supply would be much appreciated.

The asymmetry comes from the fact that KL, like any good object in statistics, is an expectation. Recall that we have

D(p || q) = \int dx p(x) log[ p(x) / q(x) ].

Now we should run into infinities whenever p(x) or q(x) vanishes, but the weighting by p(x) prevents any problems when p(x) is zero. The only time we get infinite KL divergence is when p(x) has non-zero probability over some region but q(x) does not. In other words, KL only considers regions where p(x) is non-zero and that is the heart of the asymmetry.

Let’s consider the model comparison case. If we use KL( p(y’| y, M) || p(y’) ) (which yields Bayesian model comparison) then we only care about neighborhoods where p(y’| y, M) is non-zero, i.e. data that could be predicted by our model. We have no sensitivity to possible data that our model misses, so we could get burned if our model could generate only a very small space of data.

KL( p(y’) || p(y’| y, M) ), on the other hand, considers all neighborhoods where the true distribution of the data is non-zero (yielding cross validation and its ilk). This means a model is penalized for data sets it could not generate, which makes this form of model comparison more robust to models vulnerable to overfitting.
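The two directions can be compared on a deliberately extreme discrete example. In this Python sketch the "truth" spreads over three outcomes while a hypothetical "narrow model" puts zero probability on one of them; both distributions are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum p ln(p/q); infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # weighting by p removes these terms
        if qi == 0:
            return math.inf     # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

truth = [0.25, 0.25, 0.5]       # illustrative "true" data distribution
narrow_model = [0.5, 0.5, 0.0]  # model that can never produce outcome 3

model_to_truth = kl_divergence(narrow_model, truth)   # KL(model || truth)
truth_to_model = kl_divergence(truth, narrow_model)   # KL(truth || model)

print("KL(model || truth) =", model_to_truth)
print("KL(truth || model) =", truth_to_model)
```

KL(model || truth) stays finite because it only looks where the model has mass, so it never notices the missed outcome. KL(truth || model) is infinite, which is the penalty for data the model could not generate.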

Ah! Of course! A simple case of PTFF (“ponder the furnished formula”). Thanks for taking the time to spell it out for me.

Looking forward to identifiability, and to stopping rules!

Sean S.

I laid out the connection between Classical Statistics, Statistical Mechanics, and entropy as simply and explicitly as possible in this post:

http://www.entsophy.net/blog/?p=243