[See update at end of this entry.]
Jeff Lax pointed me to the book “Discrete Choice Methods with Simulation,” by Kenneth Train, as a useful reference for logit and probit models as they are used in economics. The book looks interesting, but I have one question. On page 28 of his book (go here and click through to page 28), Train writes, “the coefficients in the logit model will be √1.6 times larger than those for the probit model . . . For example, in a mode choice model, suppose the estimated cost coefficient is −0.55 from a logit model . . . The logit coefficients can be divided by √1.6, so that the error variance is 1, just as in the probit model. With this adjustment, the comparable coefficients are −0.43 . . .”
This confused me, because I’ve always understood the conversion factor to be 1.6 (i.e., the variance scales by 1.6^2, so the coefficients themselves scale by 1.6). I checked via a little simulation in R:
> library (arm)  # provides invlogit() and display()
> n <- 100
> x <- rnorm (n)
> a <- 1.3
> b <- -0.55
> y <- rbinom (n, 1, invlogit (a + b*x))
> M1 <- glm (y ~ x, family=binomial(link="logit"))
> display (M1)
glm(formula = y ~ x, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept)  0.88     0.22
x           -0.44     0.24
n = 100, k = 2
residual deviance = 118.6, null deviance = 122.2 (difference = 3.6)
> M2 <- glm (y ~ x, family=binomial(link="probit"))
> display (M2)
glm(formula = y ~ x, family = binomial(link = "probit"))
            coef.est coef.se
(Intercept)  0.54     0.13
x           -0.26     0.14
n = 100, k = 2
residual deviance = 118.6, null deviance = 122.2 (difference = 3.5)
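I did it a few more times and got different results, but the ratio of the logit coefficients to the probit coefficients was always between 1.6 and 1.8 (which is consistent with the literature, e.g., Amemiya, 1981). In the run above, for example, the ratio for the slope is 0.44/0.26 = 1.7.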
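To see how stable that ratio is, here is a rough sketch that repeats the comparison over many simulated datasets. It uses base R's plogis() in place of invlogit(); the number of replications, the larger sample size, and the seed are illustrative choices, not part of the original check.

```r
# Repeat the logit-vs-probit comparison over many simulated datasets.
# n_sims, n, and the seed are illustrative assumptions.
set.seed(123)
n_sims <- 200
n <- 1000          # larger than n = 100, to reduce noise in the ratio
a <- 1.3
b <- -0.55

ratio <- replicate(n_sims, {
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(a + b * x))   # plogis() is base R's inverse logit
  fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
  fit_probit <- glm(y ~ x, family = binomial(link = "probit"))
  # ratio of the logit slope to the probit slope for this dataset
  unname(coef(fit_logit)["x"] / coef(fit_probit)["x"])
})

print(summary(ratio))
print(sqrt(1.6))   # the factor in question, for comparison
```

If the conversion factor were √1.6, the ratios would cluster near 1.26 rather than in the 1.6 to 1.8 range.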
Train also refers to a factor of pi^2/6, which is the variance of a single utility in the logit model (so that the difference has a variance of pi^2/3; see p.39 of his book here). Because pi^2/3 is a variance, we need to take its square root to get a scale factor: pi/√3 ≈ 1.8, which is indeed the sd of the unit logistic distribution. However, as Amemiya (1981) and others have noted, the logistic distribution function actually matches the normal more closely, over most of the range of the curve, if we scale by 1.6 rather than 1.8. Either way, the factor is about 1.6, not √1.6. Anyway, I think that’s right.
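The comparison between scaling by 1.6 and by 1.8 can be checked numerically. The sketch below uses one particular criterion, the maximum absolute gap between the rescaled logistic curve and the normal curve on a grid; that criterion, the grid, and the helper name max_gap are assumptions made for illustration, and other measures of fit could give somewhat different answers.

```r
# Compare the logistic cdf, rescaled by s, with the standard normal cdf.
# The grid and the max-absolute-gap criterion are illustrative assumptions.
x_grid <- seq(-6, 6, by = 0.001)

# max_gap() is a hypothetical helper: largest |plogis(s*x) - pnorm(x)| on the grid
max_gap <- function(s) max(abs(plogis(s * x_grid) - pnorm(x_grid)))

sd_logistic <- pi / sqrt(3)        # sd of the unit logistic, about 1.8
gap_1.6 <- max_gap(1.6)
gap_sd  <- max_gap(sd_logistic)

# Scale factor that minimizes the maximum gap, searched over [1.5, 2]
best <- optimize(max_gap, interval = c(1.5, 2))$minimum

print(c(sd_logistic = sd_logistic, gap_at_1.6 = gap_1.6,
        gap_at_sd = gap_sd, best_scale = best))
```

Under this criterion, matching the standard deviations is not the same as matching the curves, which is why the commonly quoted factor is lower than pi/√3.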
I talked with Dr. Train and we realized that we’re talking about two different (although related) models. I’m working with logit/probit for binary outcomes, or ordered logit/probit for multinomial outcomes, in which there’s a single latent variable (with logistic(0,1) or normal(0,1) error term). Train is working with a utility model in which each alternative has its own independent error term (extreme-value or normal(0,1)), so that the difference in two utilities is either logistic(0,1) or normal with variance 2. Hence the sqrt(2) difference in our sd’s. The parameterization/model I use is more common in statistics and, I believe, in econometric analysis of discrete data (e.g., Maddala’s book), but I can see that Train’s parameterization/model would make sense in settings with different random utility for each person and each outcome.
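The arithmetic behind Train's √1.6 can be sketched with a short simulation of the two utility models. The number of draws, the seed, and the helper name rgumbel_std are illustrative assumptions; the extreme-value draws use the standard inverse-cdf construction.

```r
# Difference of two independent error terms under each utility model.
# n_draws and the seed are illustrative assumptions.
set.seed(1)
n_draws <- 1e6

# rgumbel_std() is a hypothetical helper: standard extreme-value (Gumbel) draws
rgumbel_std <- function(n) -log(-log(runif(n)))

d_ev   <- rgumbel_std(n_draws) - rgumbel_std(n_draws)  # logistic(0,1)
d_norm <- rnorm(n_draws) - rnorm(n_draws)              # normal with variance 2

sd_ratio <- sd(d_ev) / sd(d_norm)

print(c(sd_ev_diff = sd(d_ev), theory = pi / sqrt(3)))
print(c(sd_norm_diff = sd(d_norm), theory = sqrt(2)))
print(c(sd_ratio = sd_ratio, theory = pi / sqrt(6), sqrt_of_1.6 = sqrt(1.6)))

# Train's example: rescale the logit cost coefficient of -0.55
print(round(-0.55 / sqrt(1.6), 2))
```

The ratio of the two standard deviations is pi/√6, whose square is pi^2/6 ≈ 1.6, so in this model dividing by √1.6 is the sd-matching conversion; in the single-latent-variable model the corresponding factor is pi/√3, larger by √2.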
These are not two different parameterizations of the same model, with one parameterization being more common than the other. They are two different models, each with its own parameterization that is common for that model.