Andy McKenzie writes:

In their March 9 “counterpoint” in Nature Biotechnology to the proposal that we should try to integrate more sources of data into clinical practice (see the accompanying “point” arguing for this), Isaac Kohane and David Margulies claim:

“Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?”

Perhaps I am mistaken (thus this email), but it seems that this claim runs contrary to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies the cost. Perhaps the authors mean the word “relative” as “replacing,” but then I don’t see why they use the word “incremental.” So color me confused, and I’d love it if you could help me clear this up.

My reply:

We consider this in chapter 2 of Bayesian Data Analysis, I think in a couple of the homework problems. The short answer is that, *in expectation*, the posterior variance decreases as you get more information, but, depending on the model, in particular cases the variance can increase. For some models, such as the normal and binomial, the posterior variance can only decrease. But consider the t model with low degrees of freedom (which can be interpreted as a mixture of normals with common mean and different variances). If you observe an extreme value, that’s evidence that the variance is high, and indeed your posterior variance can go up.
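A quick numerical sketch of that t example (my own construction with made-up numbers, not the BDA homework problem itself): put a normal(0, 1) prior on mu, use a t likelihood with 2 degrees of freedom for a single observation, and compute the posterior variance on a grid.

```python
import numpy as np

def posterior_var(y, df=2.0, prior_sd=1.0):
    """Posterior variance of mu on a grid, for mu ~ N(0, prior_sd^2)
    and a single observation y ~ t_df(mu, 1)."""
    mu = np.linspace(-15.0, 15.0, 6001)
    log_post = -mu**2 / (2 * prior_sd**2)                   # normal prior
    log_post += -(df + 1) / 2 * np.log1p((y - mu)**2 / df)  # t likelihood
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    m = np.sum(w * mu)
    return float(np.sum(w * (mu - m) ** 2))

print(posterior_var(1.0))  # moderate observation: below the prior variance of 1
print(posterior_var(4.0))  # extreme observation: above the prior variance of 1
```

With these settings a moderate observation shrinks the posterior, while an extreme one inflates it beyond the prior variance: the extreme value is evidence for the high-variance components of the normal mixture.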

That said, the quote above might be addressing a different issue, that of overfitting. But I always say that overfitting is not a problem of too much information, it’s a problem of a model that’s not set up to handle a given level of information.

This is a really important point that needs to be discussed more. In fact, I consider it the "dark secret" of the so-called big data movement. There is a belief bordering on a religion that more data = better. The fallacy is that this belief assumes, again bordering on religion, that any data = information. A Bayesian posterior is better than a prior only if the new data contain useful information, that is, only if the conditional probabilities differ from the unconditional probabilities. If the new data, say, consist of a set of random numbers, they will not do anything useful, and could lead to false conclusions if the analyst is not careful.
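To make that concrete: if the new data are generated by a process that doesn't involve the parameter, the likelihood is flat in the parameter and the posterior is identical to the prior. A toy grid illustration (my own construction):

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)   # parameter grid
prior = theta * (1 - theta)           # some unnormalized prior (illustrative)
prior /= prior.sum()

# "big data" that is pure noise: its distribution does not involve theta,
# so p(data | theta) is the same number for every value of theta
likelihood = np.ones_like(theta)

posterior = prior * likelihood
posterior /= posterior.sum()
print(np.allclose(posterior, prior))  # True: the noise taught us nothing
```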

In this age of big data, a lot of the data are literally picked off the ground – they're not collected consciously or purposefully, they're available because they're sitting in some log files somewhere. There is no reason why such data necessarily improve inference.

Well, why do you think that greater variance is a negative thing?

If the posterior has greater variance than your prior, it only means that before observing the data, you were too certain of the value of the parameters. Leaving out data that would increase the variance amounts to deluding yourself about the posterior. And, yes, any data is good data, although there are indeed bad models.

This happens when you apply DLMs (dynamic linear models) to political forecasting: a single poll *usually* increases the certainty of your election-day estimate. But if it gives the model evidence that opinion is more volatile than it thought, it can end up increasing overall variance, since now opinion can move a lot farther between now and election day.

Bayesian estimates tend to regularize things to the point where that never happens, but it happens fairly frequently if you estimate with MLEs and have a low sample size.
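A minimal sketch of that mechanism (my own construction with made-up numbers, not the commenter's actual model): treat opinion as a random walk with unknown daily innovation sd tau, learn tau on a grid from observed poll-to-poll changes, and take days_left times E[tau^2 | data] as the random-walk part of the election-day forecast variance. One volatile swing raises the estimated volatility, and with it the forecast variance.

```python
import numpy as np

# grid over the daily innovation sd tau of a random-walk opinion model
tau = np.linspace(0.05, 5.0, 500)

def forecast_var(changes, days_left=30):
    """days_left * E[tau^2 | changes], treating each day-to-day poll
    change as N(0, tau^2) and ignoring sampling noise (a simplification)."""
    log_post = np.zeros_like(tau)  # flat grid prior (illustrative)
    for d in changes:
        log_post += -np.log(tau) - d**2 / (2 * tau**2)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    return days_left * float(np.sum(w * tau**2))

calm = [0.3, -0.2, 0.1, 0.2, -0.1]
jumpy = calm + [2.5]  # one volatile poll-to-poll swing

print(forecast_var(calm))
print(forecast_var(jumpy))  # larger: the jump is evidence of volatility
```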

Clarification: in the binomial model with a uniform prior, the posterior variance is always smaller than the prior variance (with other priors it can exceed it). But even under the uniform prior, the posterior variance can increase as further data arrive. For example, imagine starting with a uniform prior and observing three successes followed by three failures. After observing just the three successes, the posterior variance is 2/75; after observing all six outcomes, the posterior variance is 2/72.

(These results are taken from my notes for the homework problem referred to in the post.)
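Those numbers are easy to check: a Beta(a, b) distribution has variance ab/((a+b)²(a+b+1)), the uniform prior is Beta(1, 1), and each success increments a while each failure increments b. A sketch in exact arithmetic:

```python
from fractions import Fraction

def beta_var(a, b):
    """Exact variance of a Beta(a, b) distribution."""
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

prior = beta_var(1, 1)            # uniform prior: variance 1/12
after_successes = beta_var(4, 1)  # three successes: 2/75
after_all_six = beta_var(4, 4)    # then three failures: 2/72 = 1/36

print(after_successes < after_all_six < prior)  # True
```

Both posterior variances stay below the prior variance of 1/12, yet the variance rises from 2/75 to 2/72 as the three failures arrive.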

Another example of posterior variance increasing with additional information can occur in psychometrics. In what's known as the three-parameter logistic (3PL) model for predicting correct/incorrect responses, certain response patterns can be problematic: the model has a nonzero lower asymptote for guessing, but the upper asymptote is one, implying that a sufficiently able student will almost certainly get a question right. Imagine a student of high ability who gets a couple of early, easy questions wrong (for some reason not behaving according to the model). If they then recover and start answering difficult questions correctly, the posterior variance can actually increase. In plain language, the tension is between two hypotheses – is it a weak student getting lucky on the difficult questions, or a strong student who was unlucky on a couple of easy questions? As the information accumulates, the posterior becomes clearly bimodal, reflecting these two possibilities.
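A grid sketch of that story (illustrative numbers of my own, not from any of the papers mentioned): two easy items answered wrong followed by ten hard items answered right, under a 3PL with discrimination 2 and guessing floor 0.2.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL item response function: guessing floor c, upper asymptote 1."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 2001)
log_prior = -theta**2 / 2  # standard normal prior on ability

# hypothetical response pattern: two easy items (b = -1.5) answered WRONG,
# then ten hard items (b = +1.5) answered RIGHT; a = 2, c = 0.2 throughout
log_lik_easy_wrong = 2 * np.log(1 - p_correct(theta, 2, -1.5, 0.2))
log_lik_hard_right = 10 * np.log(p_correct(theta, 2, 1.5, 0.2))

def post_var(log_post):
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    m = np.sum(w * theta)
    return float(np.sum(w * (theta - m) ** 2))

v_before = post_var(log_prior + log_lik_easy_wrong)
v_after = post_var(log_prior + log_lik_easy_wrong + log_lik_hard_right)
print(v_before, v_after)  # v_after > v_before
```

With these numbers the posterior after all twelve responses has two well-separated modes (roughly one below zero, one above): weak student getting lucky versus strong student who was unlucky early on.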

File this in your "everything comes up in psychometrics" file. Bradlow and Wainer have discussed this sort of negative information. Kelly Rulison and I showed that it is relevant for computerized adaptive testing.