Someone asked me about the distinction between bias and noise and I sent him some links. Then I thought this might interest some of you too, so here it is:

Here’s a recent paper on election polling where we try to be explicit about what is bias and what is variance:

And here are some other things I’ve written on the topic:

– The bias-variance tradeoff

– Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

– There’s No Such Thing As Unbiased Estimation. And It’s a Good Thing, Too.

– Balancing bias and variance in the design of behavioral studies

Finally, here’s the sense in which variance and bias can’t quite be distinguished:

– An error term can be mathematically expressed as “variance” but if it only happens once or twice, it functions as “bias” in your experiment.

– Conversely, bias can vary. An experimental protocol could be positively biased one day and negatively biased another day or in another scenario.

**P.S.** These two posts are also relevant:

– How do you think about the values in a confidence interval?

(The question was “Are all values within the 95% CI equally likely (probable), or are the values at the “tails” of the 95% CI less likely than those in the middle of the CI closer to the point estimate?”

And my answer was: In general, No and It depends.)

– Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

> The bias-variance tradeoff (15 October 2011)

> Suppose, for example, you’re using data from a large experiment to estimate the effect of a treatment on a fairly narrow group, say, white men between the ages of 45 and 50. At one extreme, you could just take the estimated treatment effect for the entire population, which could have high bias (to the extent the effect varies by age, sex, and ethnicity) but low variance (because you’re using all the data).

How would that be an acceptable estimator at all? And it’s not even guaranteed that the variance will be lower: imagine the measured effects are { 1.4 1.5 1.6 } for men and { 2.4 2.5 2.6 } for women (but maybe that’s why you said “could”).
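The numbers above can be checked directly (a quick Python sketch using the standard-library `statistics` module): the pooled measurements straddle two clusters, so their spread dwarfs the spread within either group.

```python
from statistics import pvariance

men = [1.4, 1.5, 1.6]
women = [2.4, 2.5, 2.6]
pooled = men + women

# The pooled data sit in two separated clusters, so the spread of the
# pooled measurements is much larger than the spread within either group.
print(pvariance(men))     # ~0.0067
print(pvariance(pooled))  # ~0.2567
```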

Carlos:

I don’t know what you mean by “acceptable.” The point is that researchers do this all the time: they do some study and then come up with some estimated treatment effect and consider it as universally valid. And then if for some particular policy reason you only care about some subset of the population, you take the published treatment effect and use it. There’s no estimate for the narrow subset because that would’ve been so noisy, it was never reported in the first place.

(I replied in the wrong place and with too many typos… I’m doing it properly now)

Usually one would want an estimator to be consistent. Otherwise it’s too easy: I could estimate the effect to be 42. That has zero variance and I wouldn’t even need to collect any data. This estimator may also be biased, though.

My point is that if the question was about “bias” as the concept is usually used in statistics, an answer that sits completely outside the frequentist framework may not be helpful. Or at least it may be worth pointing out that this “estimate” doesn’t correspond to what we would usually call a good estimator, one with some bias and some variance but with nice asymptotic frequentist properties.

Carlos:

I think the concept of “error” is important. The generic estimate of 42 can have a high error. I don’t really care if you call this “bias” or “variance.” Either way, high error is bad.

Even consistency isn’t quite the right concept — no property of a sequence that can ignore any finite initial subsequence actually captures what we want out of an inference procedure. Any such properties can be satisfied using Jonathan’s trick below; you can define your estimator as nonsense for the first N samples and then a consistent estimator thereafter, for any finite N. Something something monotonic decreasing MSE something something.

(https://www.youtube.com/watch?v=0oGMbAIcXCQ)
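The trick can be sketched in a few lines of Python (the cutoff of 1000 is arbitrary): an estimator that returns nonsense up to some finite sample size and the sample mean afterward is exactly as consistent as the sample mean, because consistency ignores any finite initial segment.

```python
def trick_estimator(xs, n_nonsense=1000):
    """Return nonsense (42) until the sample is large, the sample mean after.

    Consistency only constrains the limiting behavior, so this estimator
    is consistent no matter how large n_nonsense is -- yet it is useless
    for every sample size you will actually see below the cutoff.
    """
    if len(xs) <= n_nonsense:
        return 42.0
    return sum(xs) / len(xs)

print(trick_estimator([1.0] * 10))    # 42.0 -- useless at small n
print(trick_estimator([1.0] * 2000))  # 1.0  -- behaves like the mean past the cutoff
```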

I didn’t really want to go down the consistency route, but I had to give some idea of what I meant by “acceptable.” Of course one can have good asymptotic convergence with atrocious low-sample results (for values of “low” as large as we want). I guess the converse is also true: the estimator could behave nicely for every reachable value of N and then diverge. But estimating something using data for something else seems, to use the technical term, “just silly.” And at this point you will remind me of shrinkage estimators :-)

Carlos:

You say that “estimating something using data for something else” seems “just silly.” That may be so, but it’s what people do all the time! That’s the point of my post, that there’s no avoiding “estimating something using data for something else.”

As I wrote in my example, suppose you’re using data from a large experiment to estimate the effect of a treatment on a fairly narrow group, say, white men between the ages of 45 and 50. It is standard practice to take a single estimate for the entire population (or maybe just for men or just for women) and take it to apply to particular narrow groups. There’s no way of avoiding this. A study might include 500 people but only have data from 31 people in this group, and nobody would even think of producing an estimate from that group alone. Instead there will be some aggregate estimate, which can be fine, but then the estimate for the narrow group is based primarily on data on people outside that group.
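One way to see the tradeoff in this example is a small simulation. All numbers below are hypothetical: a 500-person study with 31 people in the narrow group, a subgroup effect that differs from the rest of the sample, and noisy individual measurements.

```python
import random

random.seed(0)

# Hypothetical numbers: 500 people total, 31 in the narrow group.
# The subgroup's true effect (1.0) differs from everyone else's (1.5),
# and individual measurements are noisy (sd = 4).
TRUE_SUB, TRUE_REST = 1.0, 1.5
N_TOTAL, N_SUB, SD = 500, 31, 4.0

def one_study():
    sub = [random.gauss(TRUE_SUB, SD) for _ in range(N_SUB)]
    rest = [random.gauss(TRUE_REST, SD) for _ in range(N_TOTAL - N_SUB)]
    pooled = sum(sub + rest) / N_TOTAL  # biased for the subgroup, low variance
    narrow = sum(sub) / N_SUB           # unbiased for the subgroup, high variance
    return pooled, narrow

def mse(estimates, truth):
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

pooled, narrow = zip(*(one_study() for _ in range(5000)))
print(mse(pooled, TRUE_SUB), mse(narrow, TRUE_SUB))
```

With these made-up numbers the biased pooled estimate has the lower mean squared error for the subgroup; shrink the noise or widen the gap between groups and the ranking flips.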

It’s true that people do silly things all the time. I plead guilty! Case at hand: I incorrectly assumed that a blog post titled “the bias-variance tradeoff” was about the “bias-variance tradeoff” as often discussed in a model selection context (for example in “The Elements of Statistical Learning”, chapter 9).

And upon further reflection I understand now how your point fits in that discussion. Sorry for the noise.

Estimating what happens to 45-50 year old males by looking at what happens to all males in the US isn’t nearly as silly as estimating it by asking the guy in the next cubicle “gee what do you think would happen if we set up policy X for 45-50 year old males in Arkansas?”

And yet, it’s clearly way less good than estimating 45-50 year olds by interpolating between a dataset of 30-40 year olds and a dataset of 60-70 year olds…

But as Phil has said “we go to war with the data we have” and if you don’t have a dataset of 45-50 year old males… you *have* to use some other data if you want a data based estimate.

I think the bigger problem is that if you stick to a “standard frequentist” concept of estimation, then you need a dataset containing samples from the relevant population, and you need a sample statistic. And if your dataset is for the whole population, you might have maybe 7 people in the relevant subgroup, so when some group collects a big dataset and computes a bunch of pre-tabulated statistics, they don’t necessarily break things down by all the combinations you’d like, because the sample sizes get ridiculously small.

“males between the ages of 45-50 who live alone and have income in the bottom quintile in Arkansas”

that has to be a nontrivial group, but it isn’t necessarily very highly represented in your national sample of all people in the US.

But that doesn’t mean you couldn’t use the enormous dataset across the whole country to estimate that quantity. It just means that you can’t estimate it accurately without doing so in a smart way: a way that includes people from other states and other ages and takes advantage of the fact that there’s more structure to the question than you have when you just “take a sample of the relevant population and compute the average of the sample.” But as soon as you start using people of the same age from other states, or people of slightly different ages, or whatever, you potentially bias your estimate away from the specific group you’re interested in, while gaining from the partial information in those closely related groups.

That’s the real-world version of a bias-variance tradeoff.
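The “smart way” described above can be sketched as precision-weighted partial pooling. All numbers here are invented for illustration; in a real analysis the between-group variance `tau2` would itself be estimated, e.g. with a hierarchical model.

```python
def partial_pool(sub_mean, sub_n, overall_mean, sigma2, tau2):
    # Precision-weighted compromise between a noisy subgroup mean and a
    # stable overall mean. sigma2 is the within-group measurement variance;
    # tau2 encodes how far true subgroup effects are believed to stray from
    # the overall mean. Both are analyst-supplied assumptions in this sketch.
    w = (sub_n / sigma2) / (sub_n / sigma2 + 1.0 / tau2)
    return w * sub_mean + (1.0 - w) * overall_mean

# Made-up numbers: 31 people in the narrow group (noisy mean of 2.3),
# lots of data overall (mean 1.5).
est = partial_pool(sub_mean=2.3, sub_n=31, overall_mean=1.5, sigma2=4.0, tau2=0.25)
print(est)  # lands strictly between 1.5 and 2.3
```

The subgroup estimate is pulled toward the overall mean exactly as much as its own noisiness warrants: more data in the subgroup, or a larger `tau2`, means less pooling.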

+1

Finite things don’t concern me, maestro. Only infinities interest a careerist with sophistication.

This was a response to Corey.

Carlos, I definitely don’t think Andrew is recommending that people take the overall average across the population. He’s just saying that’s the kind of thing people do all the time.

In the consumer expenditure survey, for example, they give, say, housing expenditure for people in the bottom quintile of income…

Well, this includes individuals, married couples, families of two adults with 3 kids… just widely varying housing requirements. It’s essentially useless.

> Usually one would want an estimator to be consistent. Otherwise it’s too easy: I could estimate the effect to be 42. That has zero variance and I wouldn’t even need to collect any data. This estimator may also be biased, though.

42/(n+1) + sum(x)/(n+1) is consistent for the mean of x and still gives an answer of 42 when you don’t have any data.
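Taking both denominators in the formula as n + 1 (so that it is defined at n = 0), a quick check in Python:

```python
import random

def shrink_to_42(xs):
    # Equals 42 with no data; the influence of 42 fades as 1/(n+1),
    # so the estimator converges to the sample mean (it is consistent).
    n = len(xs)
    return 42.0 / (n + 1) + sum(xs) / (n + 1)

print(shrink_to_42([]))  # 42.0

random.seed(1)
data = [random.gauss(5.0, 1.0) for _ in range(100_000)]
print(shrink_to_42(data))  # close to 5.0
```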

One could debate whether that is an “acceptable” estimator. But I think the one I proposed and the one proposed in the fragment I quoted are clearly “unacceptable”. Wouldn’t you agree?

I was thinking along similar lines. If I understand Andrew correctly, I _think_ he’s suggesting that the constant estimator has nonzero variance in the grand scheme of things because someone else might have chosen a different constant. Maybe he can clarify?

David:

The estimate of “42” is just silly. Nobody would do it. I’m not particularly interested in studying things that nobody would do.

Other than Douglas Adams, who gets a pass. And Scott Adams.

Speaking of bias/variance trade-offs and causal inference… I first came to this comment section just about exactly 4 years ago when I “randomly” stumbled on a post about the bias/variance tradeoff.

http://andrewgelman.com/2013/03/14/everyones-trading-bias-for-variance-at-some-point-its-just-done-at-different-places-in-the-analyses/#comment-143858

My training in causal inference teaches me I now have a within-person natural experiment generating variation in exposure to Andrew’s thoughts on bias/variance. Plugging in an interesting outcome variable – I estimate that the causal effect of Andrew’s blog on commenter earnings is about 400% over 4 years. I’m pretty sure that is statistically significant.

But seriously, I mostly came here to publish my poetry. And then Andrew totally reneged on me.

http://andrewgelman.com/2016/06/20/clarkes-law-of-research/#comment-279830

If it weren’t for the major earnings boost, I’d be outa here.

Perhaps http://www.annualreviews.org/doi/full/10.1146/annurev-statistics-010814-020310

Keith:

I doubt that I’d disagree with any of the specifics in that paper, but I feel like with all that “Holy Grail” talk they’re making it all seem more mysterious than it really is. It’s just Bayesian inference and partial pooling.

Can you say a bit more explicitly what bias is conditional upon, compared with variance?

I think of each, in a frequentist scenario, as conditional on 1. a true parameter value, and 2. a model (= subset of the parameter space) which may or may not contain the true value. As the model subset gets bigger, and closer to including the true parameter value, bias goes down, but variance goes up.

(To be even more explicit: I’m thinking of the “fitting a squiggly regression line w/ higher- and higher-order polynomials” scenario.)
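The squiggly-regression scenario can be shown in miniature with just degrees 0 and 1 (a toy Python simulation; all numbers are invented): fit repeated noisy draws from a straight line and compare the bias² and variance of each model’s prediction at a single point.

```python
import random

random.seed(2)

def fit_constant(xs, ys):
    # Degree-0 model: predict the mean of y everywhere (small model space).
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    # Degree-1 least squares: larger model space, which contains the truth here.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def predictions(fitter, x0=1.0, reps=2000, n=10):
    # Refit on fresh noisy samples from y = 2x + noise; collect predictions at x0.
    xs = [i / (n - 1) for i in range(n)]
    preds = []
    for _ in range(reps):
        ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]
        preds.append(fitter(xs, ys)(x0))
    return preds

def bias2_and_var(preds, truth):
    m = sum(preds) / len(preds)
    return (m - truth) ** 2, sum((p - m) ** 2 for p in preds) / len(preds)

truth = 2.0  # E[y] at x0 = 1 under the true line y = 2x
const_b2, const_var = bias2_and_var(predictions(fit_constant), truth)
line_b2, line_var = bias2_and_var(predictions(fit_line), truth)
print(const_b2, const_var)  # large bias^2, small variance
print(line_b2, line_var)    # near-zero bias^2, larger variance
```

The degree-0 model is badly biased at the edge of the range but very stable; the line is unbiased but its prediction wiggles more from sample to sample, which is the pattern described above.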

I’m also aware of what Stephen Senn calls “inverse bias” which is E(theta – theta_hat | theta_hat), i.e. the difference between an estimate and a posterior mean in a fuller/Bayesian model. Is this connected with what you’re saying?

Leon:

The key is in your phrase, “true parameter value.” In real-world applications, the parameter value varies. I’m not talking Bayesian here, I’m just saying that the treatment effect is different for men than for women, the population average is different in 2015 than in 2016, etc. A study designed to give an unbiased estimate of some parameter will not be unbiased for the corresponding parameter that is needed for future decision making.