What We Talk About When We Talk About Bias

Shira Mitchell wrote:

I gave a talk today at Mathematica about NHST in low power settings (Type M/S errors). It was fun and the discussion was great.

One thing that came up is bias from doing some kind of regularization/shrinkage/partial-pooling versus selection bias (confounding, nonrandom samples, etc). One difference (I think?) is that the first kind of bias decreases with sample size, but the latter won’t. Though I’m not sure how comforting that is in small-sample settings. I’ve read this post which emphasizes that unbiased estimates don’t actually exist, but I’m not sure how relevant this is.

I replied that the error is to think that an “unbiased” estimate is a good thing. See p.94 of BDA.

And then Shira shot back:

I think what is confusing to folks is when you use unbiasedness as a principle here, for example here.

Ahhhh, good point! I was being sloppy. One difficulty is that in classical statistics, there are two similar-sounding but different concepts, unbiased estimation and unbiased prediction. For Bayesian inference we talk about calibration, which is yet another way that an estimate can be correct on average.

The point of my above-linked BDA excerpt is that, in some settings, unbiased estimation is not just a nice idea that can’t be done in practice or can be improved in some ways; rather it’s an actively bad idea that leads to terrible estimates. The key is that classical unbiased estimation requires E(theta.hat|theta) = theta for any theta, and, given that some outlying regions of theta are highly unlikely, the unbiased estimate has to be a contortionist in order to get things right for those values.
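
To make this concrete, here is a minimal simulation sketch (the numbers are illustrative assumptions, not from BDA): when the plausible values of theta are small but the measurements are noisy, the classical unbiased estimate is correct on average for every theta, yet for the theta values that actually occur it is far less accurate than a shrunken (posterior-mean) estimate that is biased conditional on theta.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: plausible effects are small, theta ~ N(0, 0.5),
# but each study measures them with a large standard error, se = 2.
prior_sd, se = 0.5, 2.0
n_sims = 100_000
theta = rng.normal(0.0, prior_sd, n_sims)   # "reasonable" true values
y = rng.normal(theta, se)                   # one noisy estimate per theta

unbiased = y                                # E(y | theta) = theta for every theta
shrink = prior_sd**2 / (prior_sd**2 + se**2)
shrunken = shrink * y                       # posterior mean: biased toward 0 given theta

print("RMSE, unbiased estimate:", np.sqrt(np.mean((unbiased - theta) ** 2)))
print("RMSE, shrunken estimate:", np.sqrt(np.mean((shrunken - theta) ** 2)))
# The shrunken estimate is biased conditional on any particular theta, but
# averaged over the thetas that actually occur it is far more accurate.
```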

But in certain settings the idea of unbiasedness is relevant, as in the linked post above where we discuss the problems of selection bias. And, indeed, type M and type S errors are defined with respect to the true parameter values. The key difference is that we’re estimating these errors—these biases—conditional on reasonable values of the underlying parameters. We’re not interested in these biases conditional on unreasonable values of theta.
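
For readers who want to see the calculation, here is a sketch, in the spirit of the retrodesign calculation in Gelman and Carlin (2014), of estimating type S and type M errors by simulation. The true effect and standard error below are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed numbers for illustration: a plausible true effect of 0.1 with a
# standard error of 0.5, a low-power setting of the kind discussed above.
true_effect, se, z_crit = 0.1, 0.5, 1.96
est = rng.normal(true_effect, se, 1_000_000)   # replications of the estimate

significant = np.abs(est) > z_crit * se        # "statistically significant"
power = significant.mean()
type_s = (est[significant] * np.sign(true_effect) < 0).mean()   # wrong sign
type_m = np.abs(est[significant]).mean() / abs(true_effect)     # exaggeration

print(f"power ~ {power:.3f}, Type S ~ {type_s:.3f}, Type M ~ {type_m:.1f}")
# Conditional on this reasonable true effect, significant estimates are rare,
# often have the wrong sign, and exaggerate the magnitude several-fold.
```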

Subtle point, worth thinking about carefully. Bias is important, but only conditional on reasonable values of theta.

P.S. Thanks to Jaime Ashander for the above picture.

23 thoughts on “What We Talk About When We Talk About Bias”

  1. I don’t want to nitpick, but confounding is not a selection bias. Anyway, bias due to selection into a study is best controlled during the design phase, not during statistical analysis. Biased/unbiased estimates due to systematic vs. random errors are handled differently, and sample size will not reduce bias due to systematic errors.

    • Ayse:

      In the above post I wrote, “the linked post above where we discuss the problems of selection bias.” And in the linked post we did discuss the problems of selection bias. Perhaps the confusion is that there were many linked posts.

        • Sure, selection bias can lead to confounding variables. There are different forms of selection, not all determined by the researcher.

        • Yes, of course, there are many ways selection bias can be introduced (Berksonian bias, missing data, loss to follow-up, etc.), but they are called selection bias, not confounding. It is about how these terms are defined in epidemiology/causal inference, not about whether selection bias confounds the association. That’s why I called it nitpicking: we understand the context, but the terminology has implications when communicating in public.

        • On second thought, perhaps I should not minimize the importance of understanding the difference between confounding bias and selection bias. If we have data on confounders, we can adjust for them in the analysis, but selection bias is almost impossible to correct in the analysis: a few fancy models can attempt it, with lots of juggling and assumption-making. It is a great threat to the external validity of a study; that is why we invest so many resources in preventing selection bias up front, in the design and data-collection phases of the study.

  2. I think the key distinction to be made here is between biased statistical inference (biased estimators) and biased causal inference (biased identification of the causal effect). Confounding, selection bias, etc., are examples of the latter, and that is why those biases do not decrease with sample size. (A simulation sketch of this point appears at the end of this thread.)

    Suppose the “theta” you are discussing above is an observable parameter. In causal inference, we are not interested in theta, but in some unobservable causal parameter, for example the average treatment effect. If you have confounding bias, the problem is NOT that E(theta.hat|theta) is not equal to theta; it very well may be equal, but that question is orthogonal to whether there is confounding. Rather, the key issue is whether theta is equal to the average treatment effect, i.e. whether your identification of the causal effect is biased.

    Statistical biases and causal biases have very different properties, and it is very confusing when they are discussed as if they are the same phenomenon. For reasonable statistical estimators, the statistical bias is generally small and quantifiable, and it may be sensible to argue that statistical unbiasedness is “not a good thing” – this is not something I take a position on. However, it is very misleading when this argument is used in favor of reducing the emphasis on unbiased identification.

    Not only do biases due to confounding and selection bias tend to be much larger than biases due to biased statistical estimators, it is also generally not possible to quantify their magnitude in a given study. Therefore, for any given estimate, you generally have no idea how wrong it could be. In my view, this makes it impossible to interpret the results of the study, or to use it in decision making.

    Randomized trials guarantee that you get unbiased identification of the intention-to-treat effect. This ensures both that the bias is 0, and that you can know that the bias is 0. In my view, the latter is more important. If a study design existed such that you could guarantee that the bias was at most “X”, then that design would have most of the advantages of a randomized trial: what I am defending is my ability to have a reasonably accurate impression of how wrong/biased the results could be. That is going to be almost impossible in the presence of confounding bias and selection bias.

    Note that even in randomized trials, you could in principle estimate the ITT effect using a biased statistical estimator. In practice, there is no point in doing this, since you can just use the non-parametric sample ITT effect as an unbiased statistical estimator. The point I am trying to make is that the advantage of randomization is NOT to avoid deviations of E(theta.hat|theta) from theta.

    • This is a great post.

      Are you able to comment on the relation between an identification strategy for a causal estimand and parameter identification in a statistical model?

      Or are these just unfortunately unrelated concepts that share a common keyword?

      • They’re somewhat related. In both cases, identification means that you can construct an estimate from data that converges to what you’re interested in as you collect more data.

        In the case of a statistical parameter, this happens when your likelihood function has a unique maximum.

        In the case of causal inference, this happens when certain typically unverifiable assumptions about counterfactuals hold (e.g. no confounding, or some set of instrumental variable assumptions, etc.)

      • (In causal inference, we’re usually interested in non-parametric identification, i.e. whether there exists a function of the data that converges to the causal effect of interest without any parametric assumptions on the data generating process. That’s why I didn’t say anything about a likelihood being necessary for causal identification.)

        • Putting it all together, can we say that

          a ‘strategy’ to identify a causal estimand from your hypothetical causal model is a combination of:

          1) the “data collection design” (whether to stratify/block, and the sample size) + 2) “unverifiable assumptions about counterfactuals” (e.g., presence/absence of links in a DAG, selection on observables).

          The analyst can stop after steps 1+2 if they just care about non-parametric identification with a valid ‘identification strategy’ as above.

          But if the analyst makes further assumptions, 3) picking a particular parametric function, they now consider “likelihood identification” of a parameter.

          Here the parameter should be related to the causal estimand through the choices in 1+2+3.

    • I agree with Anders substantially. Statistical biases may also entail causal biases. Here I don’t know why it is couched as ‘statistical’ vs. ‘causal’. I advance this because it’s not necessarily in the actual study that one picks up on the types of biases, including causal ones, that are implicated; it is in informal conversations that their detection may be more feasible.

    • I’ve explored these issues here https://onlinelibrary.wiley.com/doi/full/10.1111/evo.12406 and here https://www.biorxiv.org/content/early/2017/12/04/133785. In the 2014 paper, I try to model the size of the error due to confounding based on simple models of data in observational designs. In the second, I try to really distinguish between the two different parameters that are estimated (theta – the conditional “effect” or regression parameter and beta – the causal effect parameter). Pound away (I welcome constructive criticism!).
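
Returning to the point at the top of this thread (and to Shira’s original question), here is a minimal simulation sketch of why confounding bias does not go away as the sample gets larger. The data-generating process and numbers are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumption: a confounder u raises both the chance of treatment
# and the outcome; the true treatment effect is 1.
def naive_estimate(n):
    u = rng.normal(size=n)                              # unmeasured confounder
    treat = (rng.normal(size=n) + u > 0).astype(float)
    y = 1.0 * treat + 2.0 * u + rng.normal(size=n)
    return y[treat == 1].mean() - y[treat == 0].mean()  # confounded contrast

for n in (100, 10_000, 1_000_000):
    ests = [naive_estimate(n) for _ in range(20)]
    print(f"n = {n:>9}: naive estimate = {np.mean(ests):.2f} "
          f"+/- {np.std(ests):.3f}  (true effect = 1)")
# The sampling noise shrinks as n grows, but the estimate converges to the
# wrong number: the bias comes from the design, not from the sample size.
```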

  3. Isn’t this related to Stein’s paradox and the bias/variance tradeoff? Shrinkage works when the parameter vector is more than two-dimensional and noise offsets work.
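
A quick sketch of the Stein phenomenon the comment above refers to (the dimension and mean vector are arbitrary choices for illustration): for a normal mean in three or more dimensions, the James-Stein shrinkage estimator beats the unbiased maximum likelihood estimator in total squared error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate a p-dimensional normal mean from one observation y ~ N(theta, I).
p, n_sims = 10, 50_000
theta = rng.normal(0, 1, p)                      # an arbitrary fixed mean vector
y = rng.normal(theta, 1.0, size=(n_sims, p))

mle = y                                          # unbiased estimator
norms = np.sum(y ** 2, axis=1, keepdims=True)
js = (1 - (p - 2) / norms) * y                   # James-Stein shrinkage toward 0

print("MLE total squared error:        ", np.mean(np.sum((mle - theta) ** 2, axis=1)))
print("James-Stein total squared error:", np.mean(np.sum((js - theta) ** 2, axis=1)))
# For p >= 3 the shrinkage estimator dominates the unbiased MLE in total
# squared error, even though it is biased for every individual component.
```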

  4. One of the confusing aspects of estimation in Bayesian inference is that if we do full Bayes, we typically marginalize out the parameters rather than estimating.

    But we are estimating—we’re estimating expectations of functions of the model parameters,

    $latex \displaystyle \mathbb{E}\left[ f(\theta) \right] = \int_{\Theta} f(\theta) \, p(\theta | y) \, \mathrm{d}\theta$

    Now the confusing part is that even though this marginalizes out the estimate of $latex \theta$, if we take $latex f$ to be the identity function, the expectation is the posterior mean. The posterior mean is the standard Bayesian point estimate of $latex \theta$, because it minimizes expected squared error in the estimate. But if we take $latex f(\theta)$ to be something like an indicator function, e.g., $latex f(\theta) = \mathrm{I}[\theta_1 > \theta_2]$, the expectation is an event probability. In this case, we are estimating the event probability, not the parameters.
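
A small sketch of the distinction in this comment, using hypothetical posterior draws in place of a fitted model: the same set of draws gives a posterior mean when f is the identity and an event probability when f is an indicator.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy posterior (assumed for illustration): independent normal posteriors for
# two parameters, represented by Monte Carlo draws as in a typical Bayesian fit.
theta1 = rng.normal(0.30, 0.10, 10_000)
theta2 = rng.normal(0.25, 0.10, 10_000)

# f = identity: the posterior expectation is the usual point estimate.
print("posterior mean of theta1:", theta1.mean())

# f = indicator: the posterior expectation is an event probability.
print("Pr(theta1 > theta2 | y) :", np.mean(theta1 > theta2))
```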

  5. Jaynes has a good discussion of unbiased estimators in Sections 17.2 and 17.3 of _Probability Theory: The Logic of Science_. One example he gives is that the only unbiased estimator of lambda^2 for a Poisson distribution yields the logically impossible estimate lambda^2 = 0 when the observed count is n = 1.
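
For a single observed Poisson count n, the unique unbiased estimator of lambda^2 is n(n-1), which is presumably the estimator Jaynes has in mind. A quick Monte Carlo check (with an arbitrarily chosen lambda) that it is unbiased and gives the impossible value 0 when n = 1:

```python
import numpy as np

rng = np.random.default_rng(5)

# For a single Poisson count n, the unique unbiased estimator of lambda^2
# is n*(n-1): E[n(n-1)] = lambda^2.  Monte Carlo check with an arbitrary lambda.
lam = 3.0
n = rng.poisson(lam, 1_000_000)
print("mean of n(n-1):", np.mean(n * (n - 1)), "  lambda^2:", lam ** 2)

# But when the observed count is n = 1, the estimate is 1*(1-1) = 0, i.e.
# lambda = 0, which is impossible once a count has actually been observed.
print("estimate at n = 1:", 1 * (1 - 1))
```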
