Fixed effects, followed by Bayes shrinkage?

Stuart Buck writes:

I have a question about fixed effects vs. random effects. Amongst economists who study teacher value-added, it has become common to see people saying that they estimated teacher fixed effects (via least squares dummy variables, so that there is a parameter for each teacher), but that they then applied empirical Bayes shrinkage so that the teacher effects are brought closer to the mean. (See this paper by Jacob and Lefgren, for example.)

Can that really be what they are doing? Why wouldn’t they just run random (modeled) effects in the first place? I feel like there’s something I’m missing.

My reply: I don’t know the full story here, but I’m thinking there are two goals, first to get an unbiased estimate of an overall treatment effect (and there the econometricians prefer so-called fixed effects; I disagree with them on this but I know where they’re coming from) and second to estimate individual teacher effects (and there it makes sense to use so-called random effects, although in general I would shrink toward a regression model with teacher-level predictors rather than to an overall mean).
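Here’s a minimal sketch in R of the two routes being compared (simulated data; the variable names and the crude variance estimates are illustrative assumptions, not taken from any of the papers under discussion): route (a) estimates a dummy-variable coefficient per teacher and then applies a simple empirical Bayes shrinkage toward the mean, while route (b) fits the teacher effects as modeled (“random”) effects in one step with lme4.

```r
# Minimal sketch (simulated data; names and the variance estimates are
# illustrative assumptions, not from the papers discussed).
library(lme4)

set.seed(1)
n_teachers <- 40
n_per      <- 20
teacher     <- factor(rep(1:n_teachers, each = n_per))
prior_score <- rnorm(n_teachers * n_per)
true_effect <- rnorm(n_teachers, 0, 0.3)
score <- 0.5 * prior_score + true_effect[as.integer(teacher)] +
         rnorm(n_teachers * n_per)

## (a) least-squares dummy variables, then empirical Bayes shrinkage
fe_fit <- lm(score ~ 0 + teacher + prior_score)   # one coefficient per teacher
fe_hat <- coef(fe_fit)[1:n_teachers]
se2    <- summary(fe_fit)$sigma^2 / n_per         # rough sampling variance per dummy
tau2   <- max(var(fe_hat) - se2, 0)               # crude between-teacher variance
shrunk <- mean(fe_hat) + (fe_hat - mean(fe_hat)) * tau2 / (tau2 + se2)

## (b) modeled ("random") teacher effects in a single step
re_fit <- lmer(score ~ prior_score + (1 | teacher))
blups  <- ranef(re_fit)$teacher[, 1]              # shrunken deviations from the mean

plot(shrunk - mean(fe_hat), blups)                # typically close, but not identical
```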

21 thoughts on “Fixed effects, followed by Bayes shrinkage?”

  1. I think you have this backwards. They prefer unbiased estimates of teacher effects (to use as direct output) and efficient (biased) estimates when they are estimating other quantities. The EB approach is there to handle estimation variance in the teacher effects when using them as covariates (i.e., treating it as an errors-in-predictors problem).

    • Ryan:

      In most of the econ papers I’ve looked at, they’re interested in one or two specific “betas.” I agree that once you get into an errors-in-variables regression, you’re already in Bayesian (or, as a classical statistician might say, predictive) territory.

      I should also add that I think the focus on unbiasedness is misguided and naive, in that in practice one can only use the unbiased estimates after pooling data (for example, from several years). In a familiar Heisenberg’s-uncertainty-principle sort of story, you can only get an unbiased estimate if the quantity being estimated is itself a blurry average.

      • I think these estimators do pool over several years. And I agree that economists use the fixed effects as measures of teachers’ value added, while random effects enter at a different level (e.g., the school).

        Formally, I think economists are more concerned with consistency than with unbiasedness. However, unbiasedness and consistency are often talked about as if they were equivalent.

        • Peter:

          The thing I was getting at was what I perceive as a “tough guy” attitude among some econometricians and theoretical statisticians, in which unbiasedness is king. To which I reply that the only way people get to unbiasedness, in practice, is by pooling lots of data. So everyone’s trading bias for variance at some point; it’s just done at different places in the analysis. It seems (to me) lamentably common for econometricians/statisticians of the “tough guy” school to first do a bunch of pooling without talking about it and then get all rigorous about unbiasedness. I’d rather fit a multilevel model and accept that practical unbiased estimates don’t in general exist.

      • My comment was descriptive for this paper rather than normative. It’s an oddity that, because they want to use the teacher effects as an intermediate stage in the analysis, they end up making the reverse of the usual choice (efficient estimates per subject, consistent estimates when estimating other parameters). I also don’t think that consistency when estimating individual teacher effects is a particularly important goal, given how limited the information per teacher is. I don’t want to wade into the long-running debate over bias-variance tradeoffs with fixed vs. random effects in panel data, but it seems like they often have enough data, and perverse enough selection effects, that the trade makes sense.

  2. Can anyone give a reference for, or explain, why econometricians actually prefer so-called dummy regression (i.e., one parameter per unit of observation) to repeated-measures ANOVA when there are repeated measures?
    Coming from experimental psychology, where you usually have multiple measurements per participant, this approach is (a) uncommon and (b) liable to drastically reduce power when the repeated-measures factors have only a few levels (as the number of df is comparatively low). For me it’s really hard to see why, and furthermore it isn’t discussed in the standard statistics textbooks.

    • My basic take is that, in the data economists usually deal with, the individual-level effects are felt to dominate the results but are generally uninteresting for the purpose of measuring the effects of whatever the important covariates are. Lots of effort goes into measuring the effects of the important variables in these short panels, while the dummy variables just capture some mean quality that is generally not felt to be very important (except perhaps as to its distribution). A small sketch of the two formulations is below.
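      Here’s a minimal sketch in R (simulated data; the names are hypothetical) contrasting the least-squares dummy-variable regression with a repeated-measures ANOVA that treats subject as an error stratum; in this simple balanced one-factor case the two give the same F-test for the condition effect, so the difference is mainly in what gets reported and emphasized.

      ```r
      # Minimal sketch (simulated data; names are hypothetical): dummy-variable
      # regression vs. repeated-measures ANOVA for one within-subject factor.
      set.seed(1)
      d <- expand.grid(subject = factor(1:30), condition = factor(c("a", "b", "c")))
      d$y <- as.numeric(d$condition) + rnorm(30)[as.integer(d$subject)] + rnorm(nrow(d))

      # (a) least-squares dummy variables: one parameter per subject
      fit_lsdv <- lm(y ~ condition + subject, data = d)

      # (b) repeated-measures ANOVA: subject treated as an error stratum
      fit_rm <- aov(y ~ condition + Error(subject), data = d)

      anova(fit_lsdv)   # F-test for condition
      summary(fit_rm)   # same F for condition in this balanced one-factor case
      ```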

  3. I additionally wonder what they’re doing, in the most literal sense. That is, if you estimate fixed effects for individual teachers (i.e., indicator variables for each teacher) and then apply what they’re calling empirical Bayes shrinkage, are the individual effects you estimate for each teacher:

    1) The exact equivalent of doing random effects and then extracting the BLUPs (via ranef in R) for the individual teachers?

    OR

    2) Something in between fixed effects and true random effects?

    OR

    3) Something else?

    • This is, I think, a case where two disciplines are separated by a common language. But let me take a stab at answering this question in my language (econometrics rather than statistics), in the hope that this will be comprehensible to Stuart and others.

      I’m pretty sure the answer is 2. Here’s why: Imagine a model where we are using the end-of-the-year score on the left side and the (student-level) prior-year score along with the teacher effect on the right side. Random effects would be forced to be uncorrelated with the prior-year score. Fixed effects, by contrast, _can_ be correlated with the prior-year score, and this remains true after shrinking the estimated effects toward the grand mean. So this isn’t the exact equivalent of doing random effects and extracting the BLUPs, though in practice it isn’t too far off.

      Also, I should mention that not all economists who estimate value-added models do it this way. Some (e.g., a recent paper by Chetty-Friedman-Rockoff that has gotten a lot of attention) use a two-step procedure, where they first regress the end-of-year score on a long list of individual, classroom, and school covariates, then take the average residual for each teacher and shrink it toward the grand mean. This isn’t fixed effects — it imposes the random effects orthogonality restriction. But it isn’t exactly random effects either — I think of it as inefficient random effects. So I guess it gives you LUPs, but not BLUPs.
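      To make that two-step average-residual procedure concrete, here is a minimal sketch in R (simulated data; the variable names and the particular shrinkage formula are my own illustrative assumptions, not taken from the Chetty-Friedman-Rockoff paper):

      ```r
      # Minimal sketch (simulated data; names and the shrinkage formula are
      # illustrative assumptions) of the two-step average-residual procedure.
      set.seed(2)
      n_teachers <- 40
      n_per      <- 25
      teacher     <- factor(rep(1:n_teachers, each = n_per))
      prior_score <- rnorm(n_teachers * n_per)
      true_effect <- rnorm(n_teachers, 0, 0.3)
      score <- 0.6 * prior_score + true_effect[as.integer(teacher)] +
               rnorm(n_teachers * n_per)

      # Step 1: regress the outcome on the student-level covariates only
      step1      <- lm(score ~ prior_score)
      resid_mean <- tapply(resid(step1), teacher, mean)    # average residual per teacher
      n_j        <- tapply(resid(step1), teacher, length)  # class sizes

      # Step 2: shrink each teacher's average residual toward the grand mean
      # (zero here, since OLS residuals average to zero)
      sigma2_e <- summary(step1)$sigma^2                          # within-teacher noise variance
      tau2     <- max(var(resid_mean) - sigma2_e / mean(n_j), 0)  # crude between-teacher variance
      shrunk   <- resid_mean * tau2 / (tau2 + sigma2_e / n_j)     # empirical Bayes shrinkage
      ```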

    • Thanks, Jesse.

      The distinction between fixed and random effects disappears, however, if one includes a group-level mean as another independent variable. Mundlak (1978) made this point; see also Bafumi and Gelman ( http://www.stat.columbia.edu/~gelman/research/unpublished/Bafumi_Gelman_Midwest06.pdf ) or Bell and Jones, especially pages 21-23 ( http://polmeth.wustl.edu/media/Paper/FixedversusRandom_1_2.pdf ). Thus, in the teacher context, if we fear that teacher quality is itself correlated with the students’ prior scores, that doesn’t mean we can’t use random effects; instead, the random-effects model needs to include an independent variable for the class-average prior score. (At least I think that’s how it would work.)

      Anyway, I’m hoping that someone will write out the matrix algebra for how the two estimators look, side by side. Probably too much to expect in blog comments, though!

      I agree with what you say about the average residual approach — when I’ve seen it done, I wonder why they’re not using random effects . . . .

      • Stuart — Yes, you can include a group-level mean as an independent variable. In the economics literature, this goes under the name “correlated random effects” — Wooldridge and Chamberlain have (separately) written a great deal about these models. It’s a good question whether this would be exactly the same as the fixed-effects-plus-shrinkage approach. Suppose that you included group-level means of all of the individual controls. Then the individual covariate coefficients would be identical, I think, as in the FE model. So the only question, I guess, is whether the shrinkage factor would be the same as what you’d use in the correlated RE model to get the BLUPs.

        My guess is that the answer is no. Separate the random effect into the part predictable by the group-level average Xs and the remaining part. BLUPs will, I think, be the sum of the whole former part and a fraction of the average residual, while FE-plus-shrinkage will shrink both parts in equal amounts. No?
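        For anyone who wants to try this, here is a minimal sketch in R of the correlated-random-effects / Mundlak device being discussed (simulated data; the names are hypothetical), with prior scores deliberately made correlated with teacher quality:

        ```r
        # Minimal sketch (simulated data; names are hypothetical) of the Mundlak /
        # correlated-random-effects device: add the class mean of the student-level
        # predictor as its own regressor.
        library(lme4)

        set.seed(3)
        n_teachers <- 40
        n_per      <- 25
        teacher <- factor(rep(1:n_teachers, each = n_per))
        quality <- rnorm(n_teachers, 0, 0.3)
        # better teachers are assigned students with higher prior scores
        prior_score <- rnorm(n_teachers * n_per, mean = 0.5 * quality[as.integer(teacher)])
        score <- 0.6 * prior_score + quality[as.integer(teacher)] + rnorm(n_teachers * n_per)

        class_mean_prior <- ave(prior_score, teacher)   # group-level mean of the predictor

        re_plain   <- lmer(score ~ prior_score + (1 | teacher))                     # plain RE
        re_mundlak <- lmer(score ~ prior_score + class_mean_prior + (1 | teacher))  # correlated RE

        fixef(re_plain)["prior_score"]    # pulled toward the between-teacher slope
        fixef(re_mundlak)["prior_score"]  # recovers the within-teacher ("fixed effects") slope
        ```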

      • Makes sense. Maybe Wooldridge and his co-authors should do one of their Monte Carlo studies of teacher value-added where the teacher effect is correlated with one or more student-level variables, to compare: 1) FE with empirical Bayes shrinkage, 2) RE with group-level averages. (Their 2012 paper doesn’t seem to use the correlated random effects approach: http://education.msu.edu/epc/library/documents/Guarino-Reckase-Wooldridge-May-2012-Can-Value-Added-Measures-of-Teacher-Performace-Be-Truste.pdf ).

  4. Andrew: Can you give an example or two where you have done this?

    “in general I would shrink toward a regression model with teacher-level predictors rather than to an overall mean”

    Thanks

  5. Why not run an analysis in which the teacher effects are modelled with a DPP (Dirichlet Process Prior) or even a Dirichlet Process Mixture if what is required is something that behaves somewhere between a fixed and a random effect?

    There was an interesting paper by Ohlssen et al published in Stat Med a few years ago showing how DPPs (and DPMs) can be used to effect this sort of pooling in the context of random effects meta-analysis.
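    For intuition, here is a minimal sketch in R (purely a prior simulation; the truncation level, concentration parameter, and base measure are illustrative assumptions) of a truncated stick-breaking Dirichlet process prior on teacher effects. Teachers assigned to the same atom are pooled together, which is what produces behaviour somewhere between a single common effect and fully separate effects.

    ```r
    # Minimal sketch: one draw from a truncated stick-breaking DP prior on
    # teacher effects (truncation level, alpha, and base measure are
    # illustrative assumptions).
    set.seed(1)
    K     <- 20    # truncation level
    alpha <- 1     # DP concentration parameter
    tau   <- 0.3   # sd of the normal base measure for teacher effects

    v     <- rbeta(K, 1, alpha)             # stick-breaking fractions
    w     <- v * cumprod(c(1, 1 - v[-K]))   # mixture weights
    atoms <- rnorm(K, 0, tau)               # candidate effect values

    n_teachers <- 50
    cluster <- sample(K, n_teachers, replace = TRUE, prob = w)
    teacher_effect <- atoms[cluster]        # teachers sharing an atom are pooled
    table(cluster)                          # typically a few clusters dominate
    ```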

  6. The choice of fixed versus random effect for causal identification is a theoretical question.

    If the unit effect is d-separated from (i.e., independent of) the regressors of interest, use random effects, and use fixed effects otherwise.

    For prediction, as opposed to causal inference, it’s almost always best to use a hierarchical model.

    Note that even if a variable has no causal effect it might still be a useful predictor if inter alia it is correlated with other omitted causes.
