Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

Some things I respect

When it comes to meta-models of statistics, here are two philosophies that I respect:

1. (My) Bayesian approach, which I associate with E. T. Jaynes, in which you construct models with strong assumptions, ride your models hard, check their fit to data, and then scrap them and improve them as necessary.

2. At the other extreme, model-free statistical procedures that are designed to work well under very weak assumptions—for example, instead of assuming a distribution is Gaussian, you would just want the procedure to work well under some conditions on the smoothness of the second derivative of the log density function.

Both the above philosophies recognize that (almost) all important assumptions will be wrong, and they resolve this concern via aggressive model checking or via robustness. And of course there are intermediate positions, such as working with Bayesian models that have been shown to be robust, and then still checking them. Or, to flip it around, using robust methods and checking their implicit assumptions.

I don’t like these

The statistical philosophies I don’t like so much are those that make strong assumptions with no checking and no robustness. For example, the purely subjective Bayes approach in which it’s illegal to check the fit of a model because it’s supposed to represent your personal belief. I’ve always thought this was ridiculous, first because personal beliefs should be checked where possible, and second because it’s hard for me to believe that all these analysts happen to be using logistic regression, normal distributions, and all the other standard tools out of personal belief. Or the likelihood approach, advocated by those people who refuse to make any assumptions or restrictions on parameters but are willing to rely 100% on the normal distributions, logistic regressions, etc., that they pull out of the toolbox.

Unbiased estimation is a snare and a delusion

Unbiased estimation used to be a big deal in statistics and remains popular in econometrics and applied economics. The basic idea is that you don’t want to be biased; there might be more efficient estimators out there but it’s generally more kosher to play it safe and stay unbiased.

But in practice one can only use the unbiased estimates after pooling data (for example, from several years). In a familiar Heisenberg’s-uncertainty-principle sort of story, you can only get an unbiased estimate if the quantity being estimated is itself a blurry average. That’s why you’ll see economists (and sometimes political scientists, who really should know better!) doing time-series cross-sectional analysis pooling 50 years of data and controlling for spatial variation using “state fixed effects.” That’s not such a great model, but it’s unbiased—conditional on you being interested in estimating some time-averaged parameter. Or you could estimate separately using data from each decade but then the unbiased estimates would be too noisy.

To say it again: the way people get to unbiasedness is by pooling lots of data. Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses.
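
To see the tradeoff in miniature, here’s a toy simulation (a sketch in Python with made-up numbers, not any dataset discussed in this post): an effect that drifts across five decades, estimated once by pooling everything and once decade by decade.

```python
import numpy as np

rng = np.random.default_rng(0)

n_decades, n_per_decade = 5, 40
true_effects = np.linspace(0.2, 1.0, n_decades)   # the effect drifts over time

# made-up observations: each decade's data centered on that decade's effect
y = np.concatenate([mu + rng.normal(0, 1, n_per_decade) for mu in true_effects])
decade = np.repeat(np.arange(n_decades), n_per_decade)

# pooled estimate: small standard error, but it targets the 50-year average
pooled = y.mean()

# per-decade estimates: track the drift, but each one is much noisier
per_decade = np.array([y[decade == d].mean() for d in range(n_decades)])

print("true effects:", np.round(true_effects, 2))
print("pooled      :", round(pooled, 2))
print("per decade  :", np.round(per_decade, 2))
```

The pooled number looks precise, but it answers “what was the average effect over fifty years,” not “what is the effect now.”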

It seems (to me) to be lamentably common for classically trained researchers to first do a bunch of pooling without talking about it, and then get all rigorous about unbiasedness. I’d rather fit a multilevel model and accept that practical unbiased estimates don’t in general exist.

B-b-b-but . . .

What about the following argument in defense of unbiased estimation: Sure, any proofs of unbiasedness are based on assumptions that in practice will be untrue—but it makes sense to see how a procedure works under ideal circumstances. If your estimate is biased even under ideal circumstances, you’ve got problems. I respect this reasoning, but the problem is that the unbiased estimate doesn’t always come for free. As noted above, typically the price people pay for unbiasedness is to do some averaging, which in the extreme case leads to the New York Times publishing unreasonable claims about the Tea Party movement (which started in 2009) based on data from the cumulative General Social Survey (which started in 1972). I’m not in any way saying that the New York Times writer was motivated by unbiasedness; I’m just illustrating the real-world problems that can arise from pooling data over time.

Here’s the cumulative GSS:

And here are the data just from 2009-2010 (which is what is directly relevant to the Tea Party movement):

The choice to pool is no joke; it can have serious consequences. Where we can, we should avoid pooling and instead use multilevel modeling and other regularization techniques.

The tough-guy culture

There seems to be a tough-guy attitude prevalent among academic economists, the idea that methods are better if they are more mathematically difficult. For many of these tough guys, Bayes is just a way of cheating. It’s too easy. Or, as they would say, it makes assumptions they’d rather not make (but I think that’s ok; see point 1 at the very top of this post). I’m not saying I’m right and the tough guys are wrong; as noted above, I respect approach 2 as well, even though it’s not what I usually do. But among many economists it’s more than that: I think it’s an attitude that only the most theoretical work is important, that everything else is stamp collecting. As I wrote a couple years ago, I think this attitude is a problem in that it encourages a focus on theory and testing rather than on modeling and scientific understanding. It was a bit scary to me to see that when applied economists Banerjee and Duflo wrote a general-interest overview paper about experimental economics, their discussion of data analysis cited papers such as these:

Bootstrap tests for distributional treatment effects in instrumental variables models
Nonparametric tests for treatment effect heterogeneity
Testing the correlated random coefficient model
Asymptotics for statistical decision rules

I worry that the tough-guy attitude that Banerjee and Duflo have inherited might not be allowing them to get the most out of their data–and that they’re looking in the wrong place when researching better methods. The problem, I think, is that they (like many economists) think of statistical methods not as a tool for learning but as a tool for rigor. So they gravitate toward math-heavy methods based on testing, asymptotics, and abstract theories, rather than toward complex modeling. The result is a disconnect between statistical methods and applied goals.

Lasso to the rescue?

One of the (many) things I like about Rob Tibshirani’s lasso method (L_1 regularization) is that it has been presented in a way that gives non-Bayesians (and, oddly enough, some Bayesians as well) permission to regularize, to abandon least squares and unbiasedness. I’m hoping that lasso and similar ideas will eventually make their way into econometrics—and that once they recognize that the tough guys in statistics have abandoned unbiasedness, the econometricians will follow, and maybe be more routinely open to ideas such as multilevel modeling that allow more flexible estimates of parameters that vary over time and across groups in the population.
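
To make the shrinkage concrete, here’s a minimal sketch (simulated data, and scikit-learn is my choice of tool here, not something from the lasso papers themselves): least squares spreads noise across twenty coefficients, while the lasso deliberately biases estimates toward zero and, in exchange, zeroes out most of the irrelevant predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(1)

n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # only three predictors actually matter
y = X @ beta + rng.normal(0, 1, size=n)

ols = LinearRegression().fit(X, y)     # (nearly) unbiased, noisy coefficients
lasso = Lasso(alpha=0.1).fit(X, y)     # shrunken, mostly sparse coefficients

print("OLS  :", np.round(ols.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```

The lasso estimates are biased, and that’s the point: a little bias buys a lot less variance.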

29 thoughts on “Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses”

  1. Pingback: Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses | Curious Young Statistician.

  2. Naively I still prefer #2 over #1 (given a choice) because I worry that often there isn’t enough data to validate one strong assumption versus another. Robustness, if an option, seems far safer.

    Am I wrong?

    • I think strong assumptions are usually easier to disprove because they make strong predictions. This is one of the values of hard-core modeling. If you have two or three competing hard-core models that all have similar high quality fit to data, you should be asking yourself if these models are somehow closely related in some abstract sense you aren’t immediately seeing. If you really do have different models that produce the same predictions, you should look for what region of outcomes they diverge in, and then try to collect more data to see which one is predicting better in that region.

      This is more or less the real honest-to-goodness scientific method: observation, model, prediction, validation or revision; wash, rinse, repeat.

      • Note, not that I think robustness isn’t good, but robustness is maybe better when you want to predict than when you want to *understand*.

        If your model is sensitive to the choice of distribution, for example, then this is a way to understand what distribution makes more sense for the process. An example of this is a paper I read a couple years back where they were trying to model the time to loss-of-immunity for natural Pertussis infection vs Pertussis vaccine. The model was sensitive to the choice of distribution, and fit best when an exponential time to loss distribution was used. They concluded that we tend to lose immunity to Pertussis much faster than people really think.

        A nonparametric/robust model of this process would probably predict the average and/or the standard deviation of time to loss well without needing to make assumptions about the distribution of time to loss, but it wouldn’t give you the understanding of the time to loss distribution that strong distributional assumptions followed by model-checking would.
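
        A toy version of that kind of comparison (simulated waiting times, nothing to do with the actual Pertussis paper) would be to fit a couple of candidate distributions by maximum likelihood and compare their log-likelihoods:

        ```python
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)

        # simulated times to loss of immunity (exponential truth, mean 8 years)
        times = rng.exponential(scale=8.0, size=200)

        # maximum-likelihood fits, with the location parameter fixed at zero
        exp_scale = times.mean()
        shape, _, gamma_scale = stats.gamma.fit(times, floc=0)

        ll_exp = stats.expon.logpdf(times, scale=exp_scale).sum()
        ll_gamma = stats.gamma.logpdf(times, shape, scale=gamma_scale).sum()

        print("exponential log-likelihood:", round(ll_exp, 1))
        print("gamma log-likelihood:      ", round(ll_gamma, 1))
        ```

        If the gamma’s extra shape parameter buys essentially nothing over the exponential, that’s the kind of distributional evidence I mean.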

        • So, do you think there won’t be other distributions that give an equally good (or close to) fit?

          Typically, does one end up with a unique assumption (ignoring closely related ones) that lets us make sense of the data? I doubt it.

        • Well, either your model fit is sensitive to the distributional assumptions, in which case you should be able to come up with a “unique” answer (ie. small family of related distributions) that give “best fit”, or it’s *not* sensitive to distributional assumptions, and you wind up with the same answer so long as your distribution is reasonable. (and there’s a spectrum in between too)

          In the first case, you can find out interesting things about the process by comparing distributions, and in the second case you discover that there is something robust about the process itself. In the middle ground, perhaps you just muddle around. Depending on your area of research you might spend more time in the middle ground than at either end.

  3. Very interesting post, Professor Gelman! I’ve been aspiring to “think Bayesian” lately, and a lot of these issues are coming to mind.

    Could you clarify your thoughts on economists vs political scientists? In your Aug 2010 post (the one you alluded to above), you write “[economists] gravitate toward math-heavy methods based on testing, asymptotics, and abstract theories, rather than toward complex modeling.” For example, economists like to pool data across time instead of using multilevel modeling. What are some other examples? Are you basically equating economists to “classically trained” statisticians, and in a “bad” way?

    (fyi, could I suggest you label the top graph, to make it clear it represents the CUMULATIVE GSS information?)

    CZ

  4. I enjoyed this post very much. The last part on “tough-guy” methods reminded me of Jaynes’s complaint in PTLOS about one-upmanship.

    “the purely subjective Bayes approach in which it’s illegal to check the fit of a model because it’s supposed to represent your personal belief”

    I regard Bayesian models as hypothetical statements about *someone’s* state of personal belief. I suppose it’s a bit like Jaynes’s robot. It might be (close enough to) mine, or it might be something that obviously isn’t mine but I just want to know its consequences anyway.

  5. In the concluding paragraph of the post Andrew uses a phrase “more flexible estimates of parameters”

    What does “flexible” mean in this context? And is “flexible” a good thing when estimating parameters?

    • Rahul:

      By “flexible” I mean various things including nonparametric estimates and partial pooling. Instead of the crude options of no pooling and complete pooling. And, yes, I think “flexible” is better than “rigid” when estimating parameters. The rigid models are applied in arbitrary ways, for example the decision of how many years to pool. Once you get out of the straitjacket you can make modeling decisions that are less arbitrary.
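
      Here’s a toy version of partial pooling (made-up numbers, and I’m treating the between-group and within-group standard deviations as known, which a real multilevel model would estimate rather than assume):

      ```python
      import numpy as np

      rng = np.random.default_rng(3)

      # made-up example: 8 groups with very different sample sizes
      tau, sigma = 0.5, 1.0                    # between- and within-group sd
      n_j = np.array([5, 5, 10, 10, 50, 50, 200, 200])
      theta = rng.normal(0, tau, size=8)       # true group effects
      ybar = theta + rng.normal(0, sigma / np.sqrt(n_j))   # observed group means

      grand_mean = np.average(ybar, weights=n_j)

      # partial pooling: shrink each group mean toward the grand mean,
      # with more shrinkage where the group has less data
      w = n_j / (n_j + (sigma / tau) ** 2)
      partial = w * ybar + (1 - w) * grand_mean

      print("no pooling      :", np.round(ybar, 2))
      print("partial pooling :", np.round(partial, 2))
      print("complete pooling:", round(grand_mean, 2))
      ```

      The groups with 5 observations get pulled strongly toward the overall mean; the groups with 200 barely move. That’s the flexible middle ground between the two crude options.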

  6. Amen! The set of interesting policy problems addressable with clever-dick IVs is so disappointingly small, especially when you’re working in a country with only a few decent economic datasets. But don’t worry too much–folk like yourself, Andrew, are influencing many young econometricians.

  7. I really appreciated this post. It puts clearly something that I had thought myself but couldn’t organize well enough to articulate: what I didn’t like about econometricians’ approach to statistics. Actually, I think it’s so important that you really should turn it into a paper and publish it in an economics journal. Your point 1 is well developed in your paper with Cosma. Point 2 may need more elaboration, but someone who works with non-parametric statistics could help develop it. Then you just need a few examples of econometrics papers focusing on unbiased estimates.

    The only point that may need to be addressed is the following: When we study the Rubin causal model (for instance in the book Mostly Harmless Econometrics), selection bias is a problem because it causes the estimate of the average treatment effect to be biased. As far as I know, there is no discussion of selection bias decreasing variance (as in the bias-variance trade-off). I know that multilevel models allow one to estimate varying treatment effects. But, again, with randomization shouldn’t I be able to estimate average treatment effects within groups?

  8. This was cross-posted to the economics job rumors site, where I responded:

    Poster A: “Economists don’t care about validation of functional form and distributional assumptions. They of course care about testing theory”

    Poster B: “How can one be independent of the other one?”

    Tudor:
    They can’t, but we can make our tests as robust as possible to such assumptions, hence all the “tough” math. It’s partly cultural, arising from our study of utility models and the effect of their functional form on choice behavior. In case we need reminders about how functional assumptions can ruin your day, look no further than the effect bivariate Gaussian copulas had on VaR forecasts during this latest crisis.

    However, while I agree that econometricians are a bit insular, it’s not because we want to be “tough guys,” but because economics has unique problems that are present nearly nowhere else. It started with the Cowles Commission looking at causality, which statisticians are finally waking up to now that Pearl has rediscovered it and framed it in Bayesian networks. The Lucas critique led us further down the path of carefully modeling data processes in ways no scientist or statistician would consider.

    I maintain a point from a previous post that the problem with Bayesian methods isn’t the prior; it’s all the distributional assumptions you have to bake into the model to get a likelihood. Sure, Bayesians using stupid priors are a problem, but those are usually easily identified and argued about. However, the sensitivity of the model to different assumptions about error distribution and persistence is much more subtle and difficult to tease out.

    I believe that I’m arguing that modern econometrics is pursuing the second type of statistical analysis you admire, but using it for program evaluation and testing theory. As such, we require minimum-distance rather than likelihood-based estimators. I think you would argue for a multilevel model or empirical-likelihood methods that allow for flexible distributions in approximating the true likelihood of the data, but these often mask the most relevant features of endogeneity in economic settings. What scientific experiment has results that completely change if you tell the particle you’re doing an experiment and it might not be treated?

    I’m skeptical that pattern recognition algorithms could effectively measure something like returns to schooling, although the double-LASSO techniques being developed by econometricians have potential for automated discovery. Of course, econometric methods would never be able to move a prosthesis using decoded neural signals in an amputee, but that’s not the objective of economics.

    The world is not constructed only of nails, which is why we need tools beyond a hammer. Claiming that people are using the tools they do to suit their arrogance rather than their problems’ demands does more to build walls between research communities than to break them down. I have papers using Bayesian methods, L1 and L2 regularization, pattern recognition, and “tough guy” econometrics. All these tools have an irreplaceable role in the pantheon of inferential methods.

    • Sorry, there’s a formatting error above, the text beginning with “I believe that…” did not appear in the original post.

    • Tudor:

      I agree that all sorts of statistical methods can be useful, and there is no need for researchers to restrict themselves to a narrow toolkit. There is, though, a technical error that I’ve seen made by many applied researchers in economics and political science, and that is to have a sort of lexicographic attitude in which all that matters is getting bias down to zero. The reason I see this as a technical error is that it makes an implicit assumption that there are no interactions across the range being pooled (for example, combining data from 1972-2010). Hence the title of this post.

    • Tudor:

      > Sure, Bayesians using stupid priors are a problem, but those are usually easily identified and argued about.

      Really, have you looked into this, or do you have a reference?

  9. I’ve avoided asking for a while, but as a student of applied economics and econometrics, I have to know. This post shares a common theme with the blog in general: Economists are analyzing data wrong.
    That seems totally plausible, but I find myself unclear on where to go to understand what the problem is or how else to think about data analysis. And to suggest abandoning least squares, non-parametric estimates of heterogeneity in treatment effects, and questions about biased estimators can make an econometrics student feel pretty lost indeed.

    So where can a young man go to understand the problems with current statistical methods in social science, and for a glimpse of what it ought to become? (I’d read your textbooks, but a more succinct introduction wouldn’t go amiss.)

    • Eli:

      I don’t think Banerjee and Duflo (the economists mentioned in my post above) are analyzing their data wrong. I haven’t looked at their recent work, but my impression is that they engage in careful experimental designs in which simple methods can yield good, robust inferences. What I’m saying is that if they want to go further, they might do well to consider some hierarchical modeling rather than restricting their attention to topics such as asymptotics for statistical decision rules, etc.

      Similarly, sometimes economists are so concerned with bias that I think they avoid modeling, which leads them to implicitly make strong assumptions about stationarity of the processes they are studying.

      I’m not sure what quick things I can refer you to. You could look at my applied research publications, I suppose. I don’t think economists should abandon their methods, I just think it would help for them to be a bit more open to other approaches. And I think their misunderstanding of bias and variance might be the source of some of these problems.

  10. Nice post.

    One comment on “… the likelihood approach, advocated by those people who refuse to make any assumptions or restrictions on parameters but are willing to rely 100% on the normal distributions, logistic regressions, etc., that they pull out of the toolbox.” As someone who has relied on likelihood ratio tests (and makes liberal use of priors when estimating the model parameter values necessary to compute the likelihood ratios themselves) I’ll say the presumption of normality gets tossed very quickly if your primary interest is in getting a good answer to the question at hand. Two of my favorite quotes:

    “Everybody believes in the Gaussian law of errors: the mathematicians because they think it has been empirically demonstrated by experimenters, and the experimenters because they think the mathematicians have proven it a priori.”
    – Roger Koenker (paraphrasing Poincare, I think)

    “All distributions are normal in the middle.”
    – John W. Tukey

    With those two comments in mind: it’s usually the tails which make analysis challenging, and nature seems to love heavy tails. That’s why I believe your comment “… instead of assuming a distribution is Gaussian, you would just want the procedure to work well under some conditions on the smoothness of the second derivative of the log density function.” is spot on.

    Back to likelihood ratio tests: F tests, for example. F tests are wonderfully useful in many circumstances; just don’t presume that when you compute F-values using real data, the distributions under H0 and H1 will follow the F-distributions you’d expect for multivariate-normal-distributed data. Check your presumptions. (Careful with that p-value!) Look at the actual distributions under H0 and H1; dig down a level and look at the distributions of RSS0 and RSS1. Are they chi-squared distributed with the expected number of d.o.f.? (Probably not. Nature doesn’t hand you many pure normal distributions.) Check your assumptions! Check your assumptions! Check your assumptions! Use your observations to revise your assumptions. Adapt – modify your analysis if appropriate. Repeat as necessary.
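
    To be concrete about what “look at the distributions of RSS0 and RSS1” can mean, here is a little simulation sketch (my own toy setup, not anything from the post): generate data under H0 with normal errors and then with heavy-tailed errors, and check whether the residual sum of squares behaves like the chi-squared you’d expect.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    n, p, n_sims = 50, 3, 2000
    X = rng.normal(size=(n, p))
    H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix for this fixed design
    dof = n - p

    def rss_under_null(draw_errors):
        """Residual sum of squares when the true coefficients are all zero."""
        rss = np.empty(n_sims)
        for s in range(n_sims):
            y = draw_errors()
            resid = y - H @ y
            rss[s] = resid @ resid
        return rss

    rss_normal = rss_under_null(lambda: rng.normal(0, 1, n))
    rss_heavy = rss_under_null(lambda: rng.standard_t(3, n))   # heavy tails

    # under normal errors, RSS should be chi-squared with n - p degrees of freedom
    print("normal errors: mean", round(rss_normal.mean(), 1),
          "var", round(rss_normal.var(), 1), "(chi-squared:", dof, "and", 2 * dof, ")")
    print("t(3) errors  : mean", round(rss_heavy.mean(), 1),
          "var", round(rss_heavy.var(), 1))
    ```

    With normal errors the simulated mean and variance land near n-p and 2(n-p); with the heavy-tailed errors they don’t, and any F-test p-value built on that chi-squared assumption is off.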

  11. Pingback: Data Sets with Year-to-Year Country Dyads » Duck of Minerva

  12. Can I ask a really quick and stupid question that someone with my training should be smart enough not to have to ask (for real, I feel like I should understand this better, and apologize if I’m way off base): Is this post arguing that we should use less robust estimators because we get smaller standard errors and can maybe make stronger claims that way? If I were to respond at a purely gut level to that, I’d say Economists have been making overly-strong claims about their empirical backing for a very long time, and if it swings the other way towards requiring more rigorous proof, I’m not so torn up by that. I’d also say that we never believe one estimate anyway (regardless of p-value) and just use the evidence to update our previous position on the effect of some thing in the world.

    At a less gutty level, I’d also say that when we use random-effects models or other parametric specifications of the V/C matrix to estimate standard errors in state-year panels, we seem to get really lousy inference properties (rejection rates on placebo treatments). If you want to convince some of us practitioners that there are gains to careful modelling of the V/C matrix and more careful data-set construction, re-doing Duflo’s inference paper (How much can we trust Diff-in-Diff) would be pretty darn convincing.

    Of course, maybe I missed the boat here completely, and we aren’t just talking about modelling error terms the way I’m thinking. It just seems to me that I would rather have a point estimate I can believe in magnitude-wise, and then use p-values to help me weight how much I update my general priors on the effect of some economic variable in the world.

    One last point: I think that fixed-effects estimators are not always used simply to reduce bias, but to answer fundamentally different kinds of questions. I might use a county-by-year fixed effect to ask something like “how much do local differences in X correlate with Y” and I might use state and year fixed effects to ask “How much does a state level policy change inducing a change in X affect Y”. They are really different kinds of questions, and I was trained to think about what variation in the world a model is latching onto, and thus what question it is posing to the data and what thing about the world is being revealed, and not to just write out some fake equation about the stochastic processes of Y and do some fancy math to make some greek letter disappear from it. So I guess I take a little exception to the idea that we use the models we use simply to eliminate bias. We use them to answer particular kinds of questions. Ignoring interpretation seems like a weakness of this argument.

    Sorry if this sounds snarky or rude. It wasn’t my intention and maybe I’ve missed a whole historiography of this line of inquiry (likely). I just think that 1) the very few papers I’ve seen testing parametric models of the V/C matrix in controlled, placebo environments show that these models don’t perform particularly well and 2) fixed-effects models aren’t used just to “eliminate bias” in some fancy-math sense – different fixed-effects models latch on to different variation in the world and thus teach us different things about it, and that nuance is part of what makes contemporary applied econometrics strong, at least in the hands of skilled practitioners.

    • Jrc:

      You ask, “Is this post arguing that we should use less robust estimators because we get smaller standard errors and can maybe make stronger claims that way?”

      No. Robust is good. What I’m arguing is that we should be doing less pooling. Instead of assuming, for example, “state fixed effects” that are constant over a forty-year period, we should allow our estimates to vary over time. This will require more modeling of the data within each year or each decade, but that’s a price I’m more than willing to pay to get out of the trap of assuming stationarity. My point is that so-called unbiased estimates are only unbiased under the assumption of no interactions, and that’s not an assumption I want to make. A very clear example is shown in those GSS graphs in the above post. Pool the decades and you get the wrong answer. Analyze a shorter time period and your sample size is smaller, but at least you’re attacking the question you want to attack. In practice, the ideology of unbiasedness-above-all requires large sample sizes (otherwise variances are too large for estimates to be useful), and large sample sizes require pooling.

      • Andrew,

        Thanks for getting back. I see what you are saying about the problems of just using some cookie-cutter method without thinking about it, and bringing as much data as you can to that problem to satisfy its methodological demands without thinking about the data itself, and I wholeheartedly agree with you. We should think very clearly about what relationship in the world we are really trying to understand, and focus narrowly on estimating that as best we can. I was misinterpreting the thrust of the post.

        I guess what threw me was the focus on modelling. For me, the best work is the work with the least modelling necessary, because the researcher has convincingly found some real-world environment that allows for clean estimation of their parameter of interest. It’s why Snow’s Table IX, reprinted in the Statistical Methods and Shoeleather paper, is like my favorite table ever…the art and argument are in the research design, and not the statistical analysis. But I’m certainly aware that there are a lot of questions and problems where such clean identification is not possible, that modelling can help, and that we’ve perhaps too hastily abandoned thinking hard about modelling the situation at hand in favor of some abstract claims about unbiasedness.

        Thanks again. This has had me thinking for a couple of days now, inducing that pleasant feeling where the dark, confused parts of my brain light up.

        • Jrc:

          Modeling helps when data are sparse. When we’re lucky enough to have a clean question and abundant data, we can sometimes get away without modeling (or, equivalently, we can get conclusions that are not sensitive to our model assumptions). Most of the time we’re not so lucky—or, if we are, we go to the next level and study subgroups, etc. What I’m objecting to is the practice of pooling lots of data in order to get the illusion of data density.
