The bias-variance tradeoff

Joshua Vogelstein asks for my thoughts as a Bayesian on the above topic. So here they are (briefly):

The concept of the bias-variance tradeoff can be useful if you don’t take it too seriously. The basic idea is as follows: if you’re estimating something, you can slice your data finer and finer, or perform more and more adjustments, each time getting a purer—and less biased—estimate. But each subdivision or each adjustment reduces your sample size or increases potential estimation error, hence the variance of your estimate goes up.

That story is real. In lots and lots of examples, there’s a continuum between a completely unadjusted general estimate (high bias, low variance) and a specific, focused, adjusted estimate (low bias, high variance).

Suppose, for example, you’re using data from a large experiment to estimate the effect of a treatment on a fairly narrow group, say, white men between the ages of 45 and 50. At one extreme, you could just take the estimated treatment effect for the entire population, which could have high bias (to the extent the effect varies by age, sex, and ethnicity) but low variance (because you’re using all the data). At the other extreme, you could form an estimate using only data from the group in question, which would then be unbiased (assuming an appropriate experimental design) but would have a high variance. In between are various model-based solutions such as Mister P.
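A minimal simulation sketch of that example, under made-up assumptions: 20 equal-sized groups, a true effect that varies linearly across them, and normally distributed estimation noise. The function name `compare_estimators` and every number in it are hypothetical. The fixed 50/50 "halfway" estimate is only a crude stand-in for the model-based middle ground, not an implementation of Mister P.

```python
import numpy as np


def compare_estimators(n_sims=20000, seed=0):
    """Simulate three estimates of the treatment effect in one narrow group."""
    rng = np.random.default_rng(seed)

    # Hypothetical setup: all numbers are invented for illustration.
    n_groups = 20        # equal-sized demographic cells
    n_per_group = 50     # observations per cell
    sigma = 10.0         # noise scale; each cell estimate has sd sigma / sqrt(n)
    true_effects = np.linspace(1.0, 3.0, n_groups)  # effect varies by cell
    target = 0           # index of the narrow group of interest
    theta = true_effects[target]

    # Draw each cell's estimated effect from its sampling distribution.
    se_cell = sigma / np.sqrt(n_per_group)
    cell_est = rng.normal(true_effects, se_cell, size=(n_sims, n_groups))

    estimates = {
        "pooled": cell_est.mean(axis=1),   # entire population
        "subgroup": cell_est[:, target],   # target group only
    }
    # Crude compromise with a fixed weight (an assumption, not a fitted model).
    estimates["halfway"] = 0.5 * estimates["pooled"] + 0.5 * estimates["subgroup"]

    summary = {}
    for name, est in estimates.items():
        bias = est.mean() - theta
        variance = est.var()
        mse = np.mean((est - theta) ** 2)  # equals bias**2 + variance
        summary[name] = {
            "bias": float(bias),
            "variance": float(variance),
            "mse": float(mse),
        }
    return summary


if __name__ == "__main__":
    for name, stats in compare_estimators().items():
        print(name, stats)
```

With these particular numbers the pooled estimate is biased but stable, the subgroup estimate is unbiased but noisy, and the halfway estimate has the lowest mean squared error of the three. Different made-up numbers would move the best compromise elsewhere.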

The bit about the bias-variance tradeoff that I don’t buy is that a researcher can feel free to move along this efficient frontier, with the choice of estimate being somewhat of a matter of taste. The idea is that a conservative serious scientist type might prefer something unbiased, whereas a risk-lover might accept some trade-off.

One of the difficulties with this conventional view is that in some settings the unbiased estimate is taken to be the conservative choice while in other places it is considered more reasonable to go for low variance. Lots of classically-minded statistics textbooks will give you the sense that the unbiased estimate is the safe, sober choice. But then when it comes to subgroup analysis, these same sober advice-givers (link from Chris Blattman) will turn around and lecture you on why you shouldn’t try to estimate treatment effects in subgroups. Here they’re preferring the biased but less variable estimate—but they typically won’t describe it in that way because that would sound a bit odd. But, really, that’s what they’re saying.

So my first problem with the bias-variance tradeoff idea is that there’s typically an assumption that bias is more important—except when it’s not.

My second problem is that, from a Bayesian perspective, given the model, there actually is a best estimate, or at least a best summary of posterior information about the parameter being estimated. It’s not a matter of personal choice or a taste for unbiasedness or whatever.

How can this be? Where does Bayes come in? For a Bayesian, the problem with the “bias” concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness.
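One way to see the conditioning at work is a sketch with a simple normal-normal model, which is an assumption introduced here and not something from the discussion above: theta is drawn from N(0, tau^2) and a single observation y is drawn from N(theta, sigma^2). The function name `normal_normal_demo` and its default values are hypothetical. The posterior mean is `shrink * y`, which is biased for any fixed nonzero theta, while y itself is unbiased.

```python
import numpy as np


def normal_normal_demo(tau=1.0, sigma=1.0, n_sims=200_000, seed=1):
    """Compare the unbiased estimate y with the posterior mean shrink * y."""
    rng = np.random.default_rng(seed)
    # Posterior mean of theta given y under the assumed model is shrink * y.
    shrink = tau**2 / (tau**2 + sigma**2)

    # Bias of the posterior mean, conditional on a few fixed values of theta.
    cond_bias = {}
    for theta in (-2.0, 0.0, 2.0):
        y = rng.normal(theta, sigma, size=n_sims)
        cond_bias[theta] = float(np.mean(shrink * y) - theta)

    # Squared error when theta is unknown and drawn from the assumed prior.
    theta = rng.normal(0.0, tau, size=n_sims)
    y = rng.normal(theta, sigma)
    mse_unbiased = float(np.mean((y - theta) ** 2))
    mse_posterior = float(np.mean((shrink * y - theta) ** 2))
    return cond_bias, mse_unbiased, mse_posterior


if __name__ == "__main__":
    print(normal_normal_demo())
```

The size and sign of the bias depend on which theta you condition on, and that is exactly the quantity you don't know. Averaged over the assumed prior, the posterior mean has the smaller squared error, though that conclusion holds only given the model.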

I could say more on the topic (for starters, you can look up “bias” in the index of Bayesian Data Analysis), but this will do for now.

12 thoughts on “The bias-variance tradeoff”

  1. “For a Bayesian, the problem with the “bias” concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness.”

    I think for a quant social scientist this is actually a non-trivial observation, especially if you have been exposed to econometric thinking. It will take a lot of effort to convince many people about that, including many referees. I would be glad to hear more on that.

  2. I think the bias-variance tradeoff makes the most sense in the context of the design of the model. Some models are more flexible than others, e.g., splines are more flexible than linear models. Less flexible models are more biased but have lower variance. If you want to minimize squared error, you need to find a happy compromise between the two. The situation you describe, whether to use an entirely pooled model versus a subgroup model, is a special case of this. In this view, the tradeoff idea adds an extra bit of intuition behind the practice of picking the simplest model you can get away with.

    Also, the bias-variance decomposition of squared error can actually be used against the sober advice givers. Why is an unbiased estimate more “conservative” if a biased one has lower risk?

  3. What about the preferential use of unbiased estimates in Monte Carlo methodology? I always found it difficult to argue about a specific loss function to compare Monte Carlo methods, while biased solutions may sometimes have important drawbacks (no name!)…

  4. You sound so pro-Bayes in this post! I agree with the implied point that one of the main reasons to be a Bayesian is that, once you have specified your model (and I agree that is a big “once”), you don’t have any choice about your posterior conclusions; there are no knobs to turn. As for the big “once”, what @Charles said. But to employ a technique you yourself perhaps don’t like, you can even handle @Charles by putting in every model that @Charles has in mind, from a straight line to the loopiest spline, throw them *all* into the mix, weight all the outputs as Bayes instructs us, and once again you have no knobs to turn.

  5. One of the aspects of Bias/Variance is the issue that adding parameters or degrees of freedom can add bias rather than reduce it. With enough parameters you can fit your data exactly, and then whatever bias you have in your random sample is built in to your predictions. The main thing that I take away from the concept of Bias/Variance duality is that the smallest expected error comes from allowing some bias and some variance.

  6. This post articulates a point that I have always tried to express, which is that the classical and Bayesian methods have different starting points, and so it often feels hard to really compare them on level terms. Classicists take unbiasedness for granted and spend a lot of energy focusing on lowering variance. The unspoken part of what Andrew is stating here is that Bayesians take variance for granted: the best estimate is achieved despite the small sample size and high variance. And the posterior distribution measures the variance.

    I wonder if there are any psychological studies of how decision makers trade off between bias and variance. If told the chance of complications for a medical procedure for a small segment of people is between 10% and 50% (low bias, high variance), one might find it more difficult to use this information than, say, being told it’s 25%-35% (but high bias). What if the total error is the same in both cases? What would people prefer? How would they react in each case if the prediction turned out to be wrong?

  7. @Daniel I don’t think that’s right: the problem with adding degrees of freedom is that it increases the variance, not the bias. Imprecise estimates can lead to a poorly predictive model.

    I think that’s what he was getting at with the subgroup analysis. Subgroup estimation => more degrees of freedom => less bias, more variance. Leaving out the degrees of freedom of the subgroups means fewer degrees of freedom, more precise parameter estimates, but _more_ bias.

    • There is what you might call “structural” bias, that is for example bias because you average over several groups that should really be considered separately, and sampling bias, that is bias because the particular data you have is randomly or because of unknown sampling design problems different from the population.

      The more parameters you add, the better your model can fit fluctuations from group to group that are not real (that is, only appear in your data not in your population). A model with fewer parameters cannot fit this random or sampling bias and therefore can be better at out-of-sample performance. Is this bias or variance?

      I find the concept of what is bias and what is variance slightly confusing actually. I feel on better footing when saying that there are several sources of error: insufficient information to accurately estimate parameters, model mis-specification, sampling fluctuation and data collection bias, approximation during model fitting (finite MCMC sample size for example), and maybe other various things.

  8. @gelman: thank you for posting your ideas. some thoughts in response:

    i was surprised to see your description of bias/variance that seemed to be limited to discretization of data, as bias/variance is a concern much more generally. specifically, one must (at least implicitly) make a bias/variance decision upon choosing a model to operate within. for example, upon using an AR model, one must choose the order: higher orders lead to more variance and less bias. the interesting thing (to me) is that there is a “sweet spot”: you want to keep enough model complexity to describe the data, but not so much that you overfit. Trunk (1979) explains this in detail, and Jain (2000) discusses it in perhaps more friendly terms.

    another thought is that from an optimization or statistical decision theoretic perspective, there is a best estimator: that estimator that minimizes loss (or expected risk). it is interesting that one can decompose error into bias and variance, but the unbiased estimator is only optimal under a particular loss function. this is somewhat in contrast to the bayesian perspective, because all bayes estimators have bias.

    i think the reason for the appeal to have unbiased estimators is that from a mean-squared-error loss perspective, separating error into bias and variance makes sense. and while having a zero-variance estimator does not make sense, a zero bias estimator does make sense. so, some of us get a warm and fuzzy feeling whenever we can demonstrate unbiasedness. in practice, however, there are many occasions for not caring about bias (whenever the loss function doesn’t require it).

    one further point. if one considers a bayesian estimator as a frequentist estimator with a prior on top to regularize, then the bayesian estimator, in some sense, induces some additional bias (by pushing the answer towards the prior), but reduces the variance. thus, as long as the additional bias is not too large, the bayesian estimator will often do better than the frequentist one, especially for small sample sizes. i think this, more than anything else, is the practical motivation for using “priors” in the absence of strong prior information.

    • Joshua:

      I know that it’s a common view that unbiased estimation is a good thing, and that there is a tradeoff in the sense that you can reduce variance by paying a price in bias. But I disagree with this attitude.

      To use classical terminology, I’m all in favor of unbiased prediction but not unbiased estimation. Prediction is conditional on the true theta, estimation is unconditional on theta and conditional on observed data. We discuss this on pages 248-249 of Bayesian Data Analysis (2nd edition).

      • I’m really only familiar with the idea of a bias-variance tradeoff in the machine learning / prediction context, where you get ideas like “Mean Squared Error = Bias^2 + Variance.”

        In that situation, it seems true that you can trade bias for variance, but not that you can do so freely along an efficient frontier. There’s some optimal tradeoff for achieving your goal (e.g., minimizing MSE).
