This is an echo of yesterday’s post, Basketball Stats: Don’t model the probability of win, model the expected score differential.
As with basketball, so with baseball: as the great Bill James wrote, if you want to predict a pitcher’s win-loss record, it’s better to use last year’s ERA than last year’s W-L.
As with basketball and baseball, so with epidemiology: as Joseph Delaney points out in my favorite blog that nobody reads, you will see much better prediction if you first model change in the parameter (e.g. blood pressure) and then convert that to the binary disease state (e.g. hypertension) than if you just develop a logistic model for prob(hypertension).
As with basketball, baseball, and epidemiology, so with political science: instead of modeling election winners, better to model vote differential, a point that I made back in 1993 (see page 120 here) but which seems to continually need repeating. A forecasting method should get essentially no credit for correctly predicting the winner in 1960, 1968, or 2000 and very little for predicting the winner in 1964, but there’s information in vote differential, all the same.
As with basketball, baseball, epidemiology, and political science, so with econometrics: Even in recent years, with all the sophistication in economic statistics, you’ll still see people fitting logistic models for binary outcomes even when the continuous variable is readily available. (See, for example, the second-to-last paragraph here, which is actually an economist doing political science, but I’m pretty sure there are lots of examples of this sort of thing in econ too.)
OK, ok, if this is all so obvious, why do people do the other thing? Why do people keep modeling the discrete variable? Some of the answer is statistical naivety, a simple “like goes with like” attitude that it makes sense to predict W-L from W-L rather than ERA.
More generally there’s the attitude that we should be modeling what we ultimately care about. If the objective is to learn about wins, we should study wins directly. To which I reply, sure, study wins, but it will be more statistically efficient to do this in a two-stage process: first study vote differential given X, then study wins given vote differential and X. The key is that vote differential is available, and simply fitting a logit model for wins alone is implicitly treating this differential as latent or missing data, thus throwing away information.
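To make the efficiency claim concrete, here is a rough simulation sketch, not anything from the posts linked above. It assumes a single predictor x, a differential that is linear in x with logistic noise (chosen so that the logit model for wins is exactly right and gets a fair shot), and made-up values for the slope and noise scale. The function names `fit_logit` and `simulate` are just illustrative. Both approaches estimate the same quantity, the slope of x on the log-odds-of-winning scale; one uses the observed differential and the other sees only win/loss.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton's method."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        b += np.linalg.solve(hess, grad)
    return b

def simulate(n_games=200, n_reps=400, beta=12.0, scale=8.0, seed=0):
    """Compare two estimates of the log-odds slope beta / scale.

    All parameter values are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    true_slope = beta / scale
    err_two_stage, err_logit = [], []
    for _ in range(n_reps):
        x = rng.normal(size=n_games)
        X = np.column_stack([np.ones(n_games), x])
        # Continuous outcome: differential with logistic noise (assumed).
        diff = beta * x + rng.logistic(0.0, scale, size=n_games)
        win = (diff > 0).astype(float)

        # Two-stage: regress the differential on x, then convert to the
        # log-odds scale. A logistic distribution with scale s has
        # standard deviation s * pi / sqrt(3).
        coef, *_ = np.linalg.lstsq(X, diff, rcond=None)
        resid = diff - X @ coef
        resid_sd = np.sqrt(resid @ resid / (n_games - 2))
        scale_hat = resid_sd * np.sqrt(3.0) / np.pi
        err_two_stage.append(coef[1] / scale_hat - true_slope)

        # Direct: logit model that only sees win/loss.
        err_logit.append(fit_logit(X, win)[1] - true_slope)

    def rmse(errors):
        return float(np.sqrt(np.mean(np.square(errors))))

    return rmse(err_two_stage), rmse(err_logit)

rmse_two_stage, rmse_logit = simulate()
print(f"RMSE of slope, using the differential: {rmse_two_stage:.3f}")
print(f"RMSE of slope, using win/loss only:    {rmse_logit:.3f}")
```

Under these assumptions the estimate built from the differential should come out noticeably less noisy than the one built from wins alone. How big the gap is depends on the invented settings, and the second stage here leans on knowing the noise distribution, so treat this as an illustration of the information argument rather than a recipe.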
Finally, from the econometrics direction, I see a bias or robustness argument. The idea is that it’s safer, in some way, to model the outcome of interest, as this model will not be sensitive to assumptions about the distribution of the intermediate variable. For example, a linear model for score differentials could be inappropriate for games where one team runs up the score (or, conversely, for those games where the team that’s winning sends in the subs so that the score is less lopsided than it would be if both teams were playing their hardest). In response to this, I would make my usual argument that your models already have bias and robustness issues in that, to do your regression at all, you’re already pooling data from many years, many places, many different situations, etc. If the use of continuous data can increase your statistical efficiency—and it will—this in turn will allow you to do less pooling of data to construct estimates that are reliable enough for you to work with.