Fixed effects and identification

Tom Clark writes:

Drew Linzer and I [Tom] have been working on a paper about the use of modeled (“random”) and unmodeled (“fixed”) effects. Not directly in response to the paper, but in conversations about the topic over the past few months, several people have said to us things to the effect of “I prefer fixed effects over random effects because I care about identification.” Neither Drew nor I have any idea what this comment is supposed to mean. Have you come across someone saying something like this? Do you have any thoughts about what these people could possibly mean? I want to respond to this concern when people raise it, but I have thus far been unable to pin down what is meant, and so do not know what to say.

My reply:

I have a “cultural” reply, which is that so-called fixed effects are thought to make fewer assumptions, and making fewer assumptions is considered a generally good thing that serious people do, and identification is considered a concern of serious people, so they go together. Maybe there is more going on, though. Let’s see what the blog commenters have to say.

P.S. See also this paper with Joe Bafumi. I generally prefer for my varying coefficients to be modeled. I’m no fan of so-called fixed effects identification. It’s just another model, just not as flexible as the multilevel version.

50 thoughts on “Fixed effects and identification”

  1. I think AG nails it. Sounds like too many people have been listening to economists (esp. of the applied micro and development species).

    FWIW, both FE and RE are examples of conditional models, and can be contrasted with marginal models like GEEs. The latter are completely different than the former, something that is not widely appreciated. For a nice, older take on the differences — that’s a bit hard on marginal models, IMO — see Lindsey and Lambert (1998 Stat. Med.).

  2. Another translation of that comment is: “I am willing to accept a lot of variance in order to minimize bias.” People who say that may be telling you that their objective function is not optimized by minimizing mean square error, but rather by purely (or at least preferentially) minimizing bias.

      • Jeff, I think anytime people decide between specifications (FE v. RE) by relying on the Hausman test, they are implicitly expressing a lexicographic preference against bias. If I were to think about it carefully, I’d bet there are lots of common modeling choices that reflect a comparable kind of preference.

        • But sometimes the Hausman test says the RE and FE estimates are close enough, while (outside of happy magic land) there is always at least some bias. So when the Hausman test says “you’re ok with RE,” it’s saying the bias is small enough, relative to the noise, not to worry about, no? So that would not be lexicographic. Unless… we mean I lexicographically prefer not to be confident there is much bias. Something like that.

    • B:

      But the so-called fixed effects model does not in general minimize bias. It only minimizes bias under some particular models. As I wrote above, “it’s just another model.”

      Another way to see this, in the time-series cross-sectional case, is to recognize that there’s no reason to think of group-level coefficients as truly “fixed.” One example I remember was a regression on some political outcomes, with 40 years of data for each of 50 states, where the analysis included “fixed effects” for states. I’m sorry but it doesn’t really make sense to think of Vermont from 1960 through 2000 as being “fixed” in any sense.

      • “it doesn’t really make sense to think of Vermont from 1960 through 2000 as being “fixed” in any sense.”

        Nice example.

  3. I have a guess at what they mean. A feature of proper Bayesian regression / random effects that I’ve encountered is that it allows estimation with singular design matrices. If your prior is chosen to look reasonable, then the output does also. That is, it allows you to not put the needed thought into what the estimable contrasts are. For fixed-effects designs to work you can’t use an over-complete basis and have to think about which data contrasts identify your target of inference.
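
    A minimal sketch of that aliasing point (my own toy example with made-up numbers, not anything from this thread): with flat priors an over-complete dummy basis leaves individual coefficients unidentified, which lm() signals by returning NA for a redundant column, while modeling the levels (partial pooling via lmer, standing in for a proper prior) returns finite estimates for everything even though only the contrasts are truly identified.

    set.seed(1)
    g <- factor(rep(1:4, each = 25))
    y <- rnorm(100, mean = c(1, 2, 3, 4)[g])
    X <- cbind(intercept = 1, model.matrix(~ 0 + g))  # intercept plus all four dummies: over-complete basis
    coef(lm(y ~ 0 + X))                               # one coefficient comes back NA: not identified
    library(lme4)
    fit <- lmer(y ~ 1 + (1 | g))                      # modeled ("random") group effects
    fixef(fit); ranef(fit)                            # finite answers for every group, identified or not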

    • But wouldn’t everyone notice this when they plot the implied prior and posterior for the contrasts of interest (i.e., if the posterior/prior ratio is pretty flat)?

  4. Well, I hesitate to comment if Andrew doesn’t have an answer. Perhaps I’m over-simplifying the question; identification problems generally stem from over-parameterization. Identification dilemmas are generally handled by random effects with zero means. In that light, it seems to me that fixed effects and random effects are apples and oranges. I suppose that there are exceptions, but here’s the way that I use them in my work: fixed effects model the mean and random effects model the variance, i.e. variance components.

  5. Fixed effects models deprive the estimates of a large portion of the information in the data, by only looking at within-group variation and not at between-group variation.

    I never understood why that was considered to be a good idea.

    • To play devil’s advocate here: maybe because the between-group relationships aren’t understood – either that’s what’s being examined, or there is not sufficient evidence to know beforehand what the correct relationship between groups should be.

  6. I guess that by identification they mean unbiased estimators. But I’m not sure, so it’s better to hear from other people.

    In any case, one simple answer to this argument is to point to the bias-variance trade-off and note that hierarchical models have lower variance (though they are biased).
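
    A quick simulation of that trade-off (my own sketch; the group sizes, variances, and number of replications are made up): with small groups, the unpooled ("fixed") group estimates are unbiased but noisy, while the partially pooled ("random") estimates are shrunk toward the grand mean, slightly biased but much less variable, and they typically come out ahead on total error.

    library(lme4)
    set.seed(2)
    rmse_fe <- rmse_re <- numeric(200)
    for (s in 1:200) {
      g <- factor(rep(1:20, each = 5))              # 20 groups, 5 observations each
      alpha <- rnorm(20, 0, 1)                      # true group effects
      y <- alpha[g] + rnorm(100, 0, 2)
      fe <- tapply(y, g, mean)                      # unpooled ("fixed effects") estimates
      re <- coef(lmer(y ~ 1 + (1 | g)))$g[, 1]      # partially pooled estimates
      rmse_fe[s] <- sqrt(mean((fe - alpha)^2))
      rmse_re[s] <- sqrt(mean((re - alpha)^2))
    }
    c(FE = mean(rmse_fe), RE = mean(rmse_re))       # RE usually has the lower RMSE in this setup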

  7. Not sure the bias-variance trade-off is in general a good answer, as often there are enough data that the parameters can be precisely estimated even without further modeling assumptions. Many of these “lots of data” scenarios are invoked to make the case for fixed effects. There might be other reasons why one would prefer random effects …

  8. “Fixed effects are thought to make fewer assumptions” – isn’t it true? I am *not* saying that fewer assumptions is a good thing, but isn’t it true that random effects make additional assumptions that might not hold in some scenarios? But I agree that some people are scared of it just because these additional assumptions are new to them; they usually don’t bother with other assumptions, for example linearity and additivity in regressions, even when these are patently wrong.

  9. Andrew — let me put this in the education context (by the way, I’ve studied with Pat Wolf). What I usually see, and have done, in education studies is student-level fixed effects, for example, if you want to estimate whether students do better when matched with a teacher of the same race. So you might have all sorts of teacher and school characteristics as predictors, along with an indicator for student-race matching in a particular year. The reason I hear for not using random effects is that the student intercepts are probably correlated with at least some of those other predictors.

    How exactly would you do that in a random effects model — i.e., what would the levels look like?
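
    For concreteness, here is one way the levels could look (a sketch with simulated data and made-up variable names such as score, race_match, and student; this is just the group-mean fix discussed elsewhere in this thread, not a recommendation for any particular study): keep the student intercepts modeled, but also include the student-level mean of the suspect predictor so the correlation between the intercepts and that predictor is absorbed by the mean.

    library(lme4)
    set.seed(3)
    student <- rep(1:200, each = 4)                             # 200 students, 4 years each
    ability <- rnorm(200)[student]                              # unobserved student effect
    race_match <- rbinom(length(student), 1, plogis(ability))   # matching correlated with the student effect
    score <- 0.2 * race_match + ability + rnorm(length(student))
    race_match_mean <- ave(race_match, student)                 # student-level mean of the predictor
    lmer(score ~ race_match + race_match_mean + (1 | student))  # coefficient on race_match is the within-student effect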

  10. It is possible that I’m completely wrong, but I am sort of surprised about the comments thus far, because I thought I immediately understood what people meant by “I prefer fixed effects over random effects because I care about identification”, but no one has attempted a translation yet. As Andrew has pointed out in a paper, “fixed effects” is used in various meanings, but I suppose that what’s meant is the standard usage – fixed effects as unit dummy variables (and perhaps, additionally, time dummies) that control for unmeasured variance between units (points in time) to the extent that it is stable over time (units). Which presupposes you have longitudinal data, although some authors have presented models that use, for example, family fixed effects (dummies). In econometrics, “the identification problem” refers to the problem of being able to make causal claims on the basis of observational data; fixed effects are thought to help with that because they control for an extra portion of the variance (see Stuart Buck’s example above).

    For a programmatic view, see Halaby, Charles N., 2004: “Panel Models in Sociological Research,” Annual Review of Sociology 30: 507–44; for the technique’s limitations, see Bjerk, David, 2009: “How Much Can We Trust Causal Interpretations of Fixed-Effects Estimators in the Context of Criminality?” Journal of Quantitative Criminology 25: 391–417.

      • Quite. But, not knowing the context, my suspicion was that the commenters the correspondent cited were referring to the above view of fixed effects and identification. Also see Cyrus’s comments below.

  11. I think the comments on lexicographic preferences are a reasonable interpretation, or some people might argue that even for minimizing MSE, with a large enough sample they’ll prefer FE, because as the number of “large units” (e.g., people in a panel study) increases, the efficiency disadvantage of FE diminishes but the bias advantage doesn’t.

    But FE is not necessarily less biased than RE. Mostly Harmless Econometrics ch. 5 discusses both the arguments for FE and some caveats. See also Bound and Solon’s very nice paper on twin studies of the returns to schooling. These issues are very relevant to the point Judea Pearl made here a few years ago (controlling for a pre-treatment variable can increase bias).

    • Well, I’m not sure about your claim that as the number of large units “increases, the efficiency disadvantage of FE diminishes, but the bias advantage doesn’t.” In the paper that Drew and I are writing, we perform a series of simulation studies to assess this. What we find is that, provided the covariate of interest is not too sluggish, with enough data there is no bias in the RE estimate, regardless of the underlying level of correlation between the regressor and the unit effects.

      • Tom, I was just referring to the usual panel-data asymptotics, with the no. of people going to infinity but holding the no. of time periods fixed.

  12. So in a fixed effects model, using your notation (except in a way that I can type):
    Yi = a0 [fixed intercept per individual] + a1*X [other predictors] + ei.

    But a random effects model, substituting in the other level, would be:
    Yi = a0 [random intercept per individual?] + a1*X-bar + a2*X + [ei + ni].

    So in terms of Stata syntax, would the second option be as simple as adding columns for means of all the X variables to xtreg/re? That would take care of the endogeneity issue? If so, this is not common knowledge among econometricians, I don’t think. I’m pretty sure Wooldridge doesn’t discuss it, for example (see http://books.google.com/books?id=64vt5TDBNLwC&pg=PA493&lpg=PA493&dq=wooldridge+%22random+effects%22&source=bl&ots=JyYZlGcsva&sig=a_UX7JWX0CLXRCaxWT2CpWtKWkY&hl=en&sa=X&ei=URp6T4yCDNPq2wW06ey1Bg&ved=0CDYQ6AEwAg#v=onepage&q&f=false )

  13. I suspect that the “several people” to whom Tom spoke are referring precisely to the arguments developed in, e.g., Mostly Harmless Econometrics, Chapter 5. The point about identification is captured in the first expression at the top of page 222. To paraphrase (though not necessarily to endorse): suppose an unmeasured, time-invariant confounder. FE purges the variation coming from it, contributing, in theory, to identifying the effect of the time-varying variable of interest on the outcome. So, to address these folks, you would want to discuss what happens to the type of RE model that you are fitting in the face of such time-invariant confounding.

    I teach a class that teaches fixed effects regression in precisely this light. See lectures 12 and 13 here:

    http://cyrussamii.com/?page_id=1069

    Some issues with FE estimation, such as improper aggregation of heterogeneous effects, are discussed in lecture 13. Easy fixes, when available, are discussed too.

    • Continuing, for the sake of illustration: next to the notes for lecture 13 at the link I mentioned above, I added R code that simulates the canonical comparison of RE vs FE, where the between slope goes in one direction and the within slope in another. If we take the variable “a” to be the unmeasured, time-invariant confounder of the otherwise causal relationship between x and y, then clearly RE fails to estimate the causal relationship between x and y, while FE gets it right. All the points about aggregating heterogeneous effects would continue to apply, of course.
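
      For readers who don't want to click through, here is a minimal reconstruction of that kind of data-generating process (my own sketch, not the actual fevsre.R; the names mypanel, i, x, y, and a are borrowed from this thread, and the coefficients and sample sizes are made up): the unobserved, time-invariant a pushes x up but y down, so the between and within slopes point in opposite directions.

      set.seed(4)
      i <- rep(1:500, each = 4)                           # 500 units, 4 periods each
      a <- rnorm(500)[i]                                  # unmeasured, time-invariant confounder
      x <- 0.5 * a + rnorm(length(i))                     # x is correlated with a
      y <- 1 * x - 3 * a + rnorm(length(i), sd = 3)       # true within slope on x is 1
      mypanel <- data.frame(i = factor(i), x = x, y = y)
      library(lme4)
      coef(lm(y ~ x, data = mypanel))["x"]                # pooled OLS: dragged toward the negative between slope
      fixef(lmer(y ~ x + (1 | i), data = mypanel))["x"]   # plain RE: pulled part of the way toward it
      coef(lm(y ~ x + i, data = mypanel))["x"]            # FE (unit dummies): close to the true value of 1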

        • Yeah, that’s okay too, because the group mean of X stands in for the group dummy. But you shouldn’t try to interpret the coefficient on the group mean of X, because it will incorporate variation due to the unmeasured confounder. That is, with an unmeasured group level confounder, the partial association between the group mean of X and Y is not….identified.

        • Cyrus:

          I just don’t see how setting the group-level variance to infinity can be better than estimating it from the data or setting it to a reasonable finite value.

          That said, the big advantage of multilevel (“random effects”) modeling comes when you are interested in the varying coefficients themselves, or if you’re interested in predictions for new groups, or if you want the treatment effect itself to vary by group.

          On a slightly different note, I’m unhappy with many of the time-series cross-sectional analyses I’ve seen because I don’t buy the assumption of constancy over time. That is, I don’t really think those effects are “fixed”!

          P.S. Thanks for enlivening this comment thread by contributing links to actual research!

  14. I don’t interpret FE as “setting the group-level variance to infinity.” I just see it as OLS on “swept” data, where the sweeping takes care of the time invariant confounding problem with minimal hassle, and OLS coefficients have well-understood sampling distributions even under mis-specification (thanks to H. White, who just passed away yesterday), meaning that I can be confident in the size of my tests, etc.

    • Cyrus:

      I don’t know that there’s anything much that’s time-invariant in what I study. But, in any case, the so-called fixed-effects analysis is mathematically a special case of multilevel modeling in which the group-level variance is set to infinity. I agree that there’s no need to “believe” that model for the method to work; however, I think it works because of some implicit additivity assumptions. I’d prefer to (a) allow the group-level variance to be finite, and (b) work in the relevant assumptions more directly.

      See the example in Section 9.3 of Bayesian Data Analysis (the artificial example of estimating the total population of a state by sampling 100 cities and towns from a list of 800) for further discussion of the implicit assumptions involved in simple methods that work. My conclusion there is that ideally we’d like to identify the key assumptions and formally incorporate them in our model rather than simply using a method that makes use of the assumptions implicitly.
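
      On the “group-level variance set to infinity” point, here is a toy numerical check (my own, using the standard partial-pooling formula for a normal hierarchical model with known variances; all the numbers are made up): as the group-level sd grows, the multilevel estimate of a group effect approaches the raw group mean, which is exactly the no-pooling, “fixed effects” estimate.

      ybar_j <- 5; mu <- 0; n_j <- 10; sigma_y <- 2        # group mean, grand mean, group size, data sd (made up)
      partial_pool <- function(sigma_alpha) {
        w <- (n_j / sigma_y^2) / (n_j / sigma_y^2 + 1 / sigma_alpha^2)  # weight on the raw group mean
        w * ybar_j + (1 - w) * mu
      }
      sapply(c(0.1, 1, 10, 1e6), partial_pool)             # approaches ybar_j = 5 as sigma_alpha grows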

      • I agree with what I believe Cyrus is saying: the point of the “fixed effect” is to take care of omitted variable bias by ‘sweeping out’ time-invariant omitted variables. An additive random effect does not do this. They deal with separate issues: consistency of your estimator (FE) versus efficiency (RE). Both can require strong assumptions to deal with each issue they try to address.

        Also, I think economists use the term identification as a synonym for consistency. If I remember correctly, one of my econometrics books said something to the effect of, ‘consistency is the bare minimum requirement for an estimator – it is an extremely poor estimator that remains inaccurate as your sample size approaches infinity.’ You may be able to characterize the size or the direction of the bias, but this is difficult as most estimating equations have many omitted variables. Hence economists search for exogenous variation – instrumental variables, experimental variation, etc.

        • Cyrus, I looked at your R code (fevsre.R), which is useful. But if you append the code below, you see that a random effects approach yields the same “within” result as fixed effects, plus reveals the “between” effect too. I guess that’s Andrew’s point.

          mypanel$xM <- as.numeric(by(mypanel, mypanel$i, function(xx) mean(xx$x)))[mypanel$i] # group means of x
          mypanel$xD <- mypanel$x-mypanel$xM # differenced x
          library(lme4)
          lmer(y ~ xD + (1|i), mypanel) # same "within" result as fixed effects
          lmer(y ~ xD + xM + (1|i), mypanel) # "within" plus "between" results

        • Malcolm:
          I was only trying to make the “canonical” point about how RE and FE differ. But yes, if you add group intercepts as “fixed” parts of the model (or group means, or basically anything that stands in at the group level to soak up between variation), you control for the time invariant confounding in a way that is similar to the FE model. This is where the semantics get tortured: you’ve now got group intercepts that are “fixed effects” in a model that treats any other source of between variation as “random effects.” The RE model that I fit treated all sources of between variation as random. Is the fixed+random effects model the ultimate solution? Not so sure. There are other issues that make applied people like me lean in favor of simpler OLS methods—e.g., test size, especially under misspecification.
          C.

  15. First-time caller here, but longtime (occasional) listener. Thanks for the interesting discussion.
    I’d like to replicate the apparently simple simulations in the Bafumi and Gelman 2006 paper, but can’t follow the procedure, and if code is available somewhere I haven’t found it.
    Can anyone (Andrew?) point out where I’m going wrong with the following R code? This generates results, but they aren’t close to those in the paper. The correct code might be a useful resource to others as well, to illustrate the point.

    library(lme4.0)
    library(multicore)
    sims <- 1000
    N <- 100
    dgp.bg <- function(N) {
      bg <- data.frame(x = rnorm(N, sd = 2))
      bg$x2 <- rnorm(N, mean = rep(c(1, -1, -3, 2), each = N/4), sd = 0.001)
      bg$grp <- rep(1:4, each = N/4)
      bg$y <- bg$x + bg$x2 + rnorm(N, sd = 7)
      bg
    }
    mod1 <- function(dat) {
      mod <- lmer(y ~ x + x2 + (1 | grp), data = dat)  # fit model to data generated by dgp.bg
      fixef(mod) / vcov(mod)@factors$correlation@sd    # calculate t statistics
    }
    system.time(res1 <- do.call("rbind", mclapply(1:sims, function(x) mod1(dgp.bg(N = N)))))
    colMeans(res1)

  16. This has been extremely useful and I think LemmusLemmus has the right answer. Digging around a bit more, it’s also in Mostly Harmless Econometrics page 7: “Angrist and Krueger (1999) used the term ‘identification strategy’ to describe the manner in which a researcher uses observational data (i.e., data not generated by a randomized trial) to approximate a real experiment.”

    Setting aside the question of whether FE or RE models can actually produce *causal* inferences from observational data (though thanks for the above debate), my reaction to this is, WHAT? How can you take a term, “identification,” that already has a meaning in statistics (i.e., there’s one solution to the model), and decide that it means something else completely different?

    • Drew, this might be a job for Jesse Sheidlower or Language Log, but you could try browsing Chuck Manski’s 1995 book Identification Problems in the Social Sciences. (The title nicely contrasts with his mentor Frank Fisher’s 1966 book The Identification Problem in Econometrics.)

      Manski writes (p. 4): “It is useful to separate the inferential problem into statistical and identification components. Studies of identification seek to characterize the conclusions that could be drawn if one could use the sampling process to obtain an unlimited number of observations. Studies of statistical inference seek to characterize the generally weaker conclusions that can be drawn from a finite number of observations.”

      • This is all very educational. I understand the term as K? O’Rourke describes it below. But it’s evidently not general knowledge that there’s this other meaning of “identification” floating around out there. Just look at the range of guesses at the top of this thread.

      • I think this citation is very confusing … to me it goes from the obscure first sentence to the obvious second sentence.

        • I think Manski’s meaning is similar to K? O’Rourke’s first paragraph below. Identification doesn’t have to start with a parametric model. E.g., we can ask what combinations of assumptions and design features are sufficient to identify an average treatment effect. When people say “identification strategy”, it’s in that spirit, although the MHE sentence is a little loose.

          When people say “I prefer fixed effects over random effects because I care about identification”, that seems to mean they believe there’s a plausible set of assumptions under which FE consistently estimates the causal effect of interest, whereas they do not believe this to be true for RE. (It sounds a wee bit self-righteous, though, like “I care about identification, unlike you who have no scruples”, or “I support X because I care about children”.)

          So the usage has broadened but hasn’t become something completely different, and most of the commenters here basically had the right answer implicitly or explicitly.

        • One amendment to my comment: People who talk about “identification strategies” are usually referring to internal validity (unlike Manski, who puts at least as much emphasis on external validity). “I care about identification” may have roughly the same meaning as “I care about internal validity”.

  17. I am actually a little confused about what people mean by the term “identification”: I don’t think it has a unique meaning.

    • Without checking the technical literature: not identified simply means that more than one point in the parameter space gives the same probability of generating the sample in hand (so the sample cannot “identify” the parameter that more likely generated the data), and this will not change with unending increases in the sample. If this is true only for the current sample, it is only aliasing rather than non-identification; as David Dunson has pointed out, many people confuse the two.

      But more picturesquely (or Bayesianly), I prefer to re-cast it as a set of positive probability in the continuous parameter space (yes, I prefer continuity for parameters) where the posterior has exactly the same shape as the prior (i.e. the relative belief, posterior/prior, equals 1). Now the size and shape (the direction the set runs in) become very important. A small flat spot, or even a fairly large one in just the direction of a nuisance parameter, can be harmless. Fortunately one only needs to look at plots in applications to find the important ones.

      As for how this plays into observational studies, there is a series of nice papers by Sander Greenland on this exact topic.
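
      To make the “flat set” picture concrete, a toy example of my own (not from the comment above): only a + b enters the likelihood, so the log-likelihood is constant along any direction that keeps a + b fixed, a ridge rather than a peak, and with a flat prior the posterior stays flat along that same set.

      set.seed(5)
      y <- rnorm(20, mean = 1.5)                            # data depend only on a + b (true value 1.5)
      loglik <- function(a, b) sum(dnorm(y, mean = a + b, sd = 1, log = TRUE))
      loglik(0.5, 1.0); loglik(1.0, 0.5); loglik(-3, 4.5)   # identical: a + b is identified, a and b separately are not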

  18. For my methods class (linked in a comment above; see the first lecture) I proposed a general definition of identification as the ability to draw a conclusion given the type of data available. It is based on Manski’s definition and, I think, encompasses all of the different notions mentioned here (including the Mostly Harmless Econometrics usage and the parametric-models usage). Other applications that this definition encompasses include, e.g., in rational choice theory, an IIA assumption allows you to “identify” a preference ordering over three choices based on as few as two pairwise comparisons (A v B, B v C); in an age-period-cohort model, assumptions of additivity and constant effects allow you to identify age, period, and cohort effects from as few as three periods of data. Identification is a basic concept in deductive practice, referring to the ways that types of data and assumptions complete each other. It has many domain-specific applications. One merely needs to be mindful of the context.

  19. I read the referenced paper and was wondering how your proposed solution (see quote below) would look in R code for, say, lmer.

    “This is accomplished in R and Stata by generating a variable equal in length to
    the individual-level predictors but varying only across the units or groups. So, this new
    group-level predictor will have the same value for each group of, say, states.”

    This would be a variable similar to the grouping variable, but instead of the group label, the value for each observation would be the (group-specific) mean value of the outcome? Or of the correlated predictor? And would this variable be added as a fixed effect?

    Or is this no real problem when I fit a model with varying intercept and slope (typically using time as random slope in longitudinal models)?

    Thanks in advance and best regards
    Daniel

    • Daniel:

      Jennifer and I discuss this in our book. Something like this. Suppose you want to regress y on x with grouping variable “group” and group-level predictor u. Then you can do things like:

      library(lme4)
      lm(y ~ x)                             # complete pooling, ignoring groups
      lmer(y ~ x + (1 | group))             # varying intercepts
      lmer(y ~ x + (1 + x | group))         # varying intercepts and slopes
      u_full <- u[group]                    # expand the group-level predictor to the length of the data
      lmer(y ~ x*u_full + (1 | group))      # add the group-level predictor (and its interaction with x), varying intercepts
      lmer(y ~ x*u_full + (1 + x | group))  # same, with varying slopes as well
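
      For these calls to run you need y, x, group, and u in the workspace; here is one made-up setup (sizes and coefficients are arbitrary):

      set.seed(6)
      J <- 30                                # number of groups
      group <- rep(1:J, each = 10)           # group index for each observation
      u <- rnorm(J)                          # group-level predictor, one value per group
      x <- rnorm(length(group))
      y <- 1 + 2 * x + 0.5 * u[group] + rnorm(J)[group] + rnorm(length(group))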
      
  20. If I’m not mistaken, the terminology “fixed” vs “random” effect first arose in observational astronomy, where a star could be considered to have a “fixed effect” (for practical purposes; stars don’t usually change their properties very quickly), but observations were also affected by things such as weather conditions, that varied from day to day independently of the star being observed. So these other effects were most appropriately modeled as “random” or “variable”.
