Maximum likelihood gives the beat fit to the training data but in general overfits, yielding overly-noisy parameter estimates that don’t perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data.

This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.

In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you end up with something too noisy. Here’s an example of what you might want:

The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision.

Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. By stopping early, you’re compromising between the prior and the data estimates.

Early stopping is a simple implementation but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.

A key question, though, is where to stop? Different rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. This would stop some of the more extreme varieties of overfitting. Such a procedure, however, is *not* the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter.

I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.

The key ideas of the article appear in section 5 on page 433.

**Non-statistical content**

The background to the above paper is as follows. In 1986 or 1987 I was browsing the statistics section of the once-great Stanford bookstore and noticed (and bought) an interesting-looking book on maximum entropy and Bayesian methods. The book was indeed thought-provoking, especially its chapters by Jaynes, Gull, and Skilling. Some months later I saw an announcement of a maximum entropy conference to be held in Cambridge, England, in the summer of 1988. That sounded appealing so I submitted an abstract, which was accepted. This was my first academic talk, and in preparing it I followed Hal Stern’s advice: Don’t try to blow them away, just do something solid. The conference was ok. I went to every talk (there were no parallel sessions) and wrote about 100 postcards (I had to keep myself busy somehow while this was all happening.

The participants at the conference were very pleasant. I was the only statistician there, and they treated me with respect. They were pretty much in the margins of their academic fields and didn’t seem to care much about status.

My own presentation came right before lunch on the last day, or maybe it was the second-to-last day. I do remember that the talks before mine went too long, to the extent that by the time my 15-minute talk came up, it was already 20 minutes late! It went ok, though, and I’ve tried to follow Hal’s advice ever since. (Also Joe Schafer’s advice, to reduce the number of words on the screen.)

The other thing I remember from the conference was that many of the participants were politically conservative. One speaker, who was from Texas, told me that he registered as a Democrat so that in the primaries he could vote for the most left-wing candidate who would then be the easiest target for the Republicans. Somebody else told me that the organizers of the conference were extreme Thatcherites. And I have no idea of the politics of Ed Jaynes, but based on photographs, he certainly *looks* like a Republican. (I couldn’t tell much from this mini-bio, but I did learn that Jaynes spent some time at the Naval Research Lab, a place where I worked several decades later.)

I don’t see any significance to the politics of the maximum-entropy researchers; it just seemed worth noting because it’s unusual in the civilian scientific research fields I’ve encountered.

It seems like there might be some other things going on here that you do not mention (sorry if I overlooked them). It would be interesting to get your thoughts on these.

For one thing, it seems worth making a distinction between models with hidden variables (where we might use EM) and those without (where we don't need to). In the former case, the parameter estimation problem (maximizing the observed data likelihood) is nonconvex, while in the latter case it is convex. So in the presence of hidden variables, one must settle with, at best, a locally optimal solution to the optimization problem, while in the latter case one is guaranteed to get the globally optimal solution.

In cases where we get the globally optimal solution, if the solution doesn't satisfy your criteria, the right approach seems to be to modify the problem so it matches your intentions (either via constraints or regularization or modifying the objective entirely somehow), since the fact that this is an issue means that the estimation problem as stated just isn't what was intended. In the nonconvex case, however, it may just be the case that that we get stuck in some local solution that is far inferior to the true solution. In this case, though it would still be preferable to modify the optimization problem (and do a version of EM that carries out MAP estimation) rather than use a heuristic, the fact that you're only getting locally optimal solutions anyway makes it less of an issue to use some extra heuristics.

Also, you mention maximum entropy and maximum likelihood. But maximum likelihood estimation and maximum entropy are equivalent in exponential families, because the two optimization problems are (convex) duals of each other. So it seems like if there is a particular formulation of a maximum entropy problem that works well, then its dual will be a maximum likelihood-like problem that should do equally well, and the precise form of this dual problem may give some additional insight as to why the particular entropy-based formulation is doing better than vanilla maximum likelihood.

(Slightly different point: the term "early stopping" has some bad associations in some areas because some people do or used to do this in extremely unprincipled ways, e.g., stop at some arbitrary cutoff after a certain small number of iterations, or set a termination condition with such high tolerances that you basically know nothing about the suboptimal solution you're getting. And this seems to be done by people who don't know about regularization or priors or such things at all. Obviously, your comment is different, since you're pointing out a case where you want a statistically motivated stopping condition that doesn't seem like it can be encoded as a penalty. But it might be worth highlighting the distinction in general.)

Anon:

I think of early stopping as an algorithmic way to effectively alter the optimization criterion. I prefer a more formal approach (penalized optimization or full Bayes) but I wanted to understand early stopping because it's something that people do.