Just to refresh your memory, here’s a simple logistic regression with only a constant term and no separation, nothing pathological at all:
> y <- rep (c(1,0),c(10,5))
> display (glm (y ~ 1, family=binomial(link="logit")))
glm(formula = y ~ 1, family = binomial(link = "logit"))
(Intercept) 0.69 0.55
n = 15, k = 1
residual deviance = 19.1, null deviance = 19.1 (difference = 0.0)
And here’s what happens when we give it the not-outrageous starting value of -2:
> display (glm (y ~ 1, family=binomial(link="logit"), start=-2))
glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2)
(Intercept) 71.97 17327434.18
n = 15, k = 1
residual deviance = 360.4, null deviance = 19.1 (difference = -341.3)
glm.fit: fitted probabilities numerically 0 or 1 occurred
As I discussed in the comments to my original post, there were reasons we were running glm with prespecified starting points—this blow-up occurred originally in a real problem and then I stripped it down to get to this simple and clear example.
Mount explains what’s going on:
From a theoretical point of view the logistic generalized linear model is an easy problem to solve. The quantity being optimized (deviance or perplexity) is log-concave. This in turn implies there is a unique global maximum and no local maxima to get trapped in. Gradients always suggest improving directions. However, the standard methods of solving the logistic generalized linear model are the Newton-Raphson method or the closely related iteratively reweighted least squares method. And these methods, while typically very fast, do not guarantee convergence in all conditions. . . .
The problem is fixable, because optimizing logistic divergence or perplexity is a very nice optimization problem (log-concave). But most common statistical packages do not invest effort in this situation.
Mount points out that, in addition to patches which will redirect exploding Newton steps, “many other optimization techniques can be used”:
- stochastic gradient descent
- conjugate gradient
- EM (see “Direct calculation of the information matrix via the EM.” D Oakes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999 vol. 61 (2) pp. 479-482).
Or you can try to solve a different, but related, problem: “Exact logistic regression: theory and examples”, C R CR Mehta and N R NR Patel, Statist Med, 1995 vol. 14 (19) pp. 2143-2160.
There’s also the new algorithm of Polson and Scott, which looks pretty amazing, I should really try it out on some examples and report back to all of you on this.
Maybe we should also change what we write in Bayesian Data Analysis about how to fit a logistic regression.