Flexibility is good

If I made a separate post for each interesting blog discussion, we’d get overwhelmed. That’s why I often leave detailed responses in the comments section, even though I’m pretty sure that most readers don’t look in the comments at all.

Sometimes, though, I think it’s good to bring such discussions to light. Here’s a recent example.

Michael wrote:

Poor predictive performance usually indicates that the model isn’t sufficiently flexible to explain the data, and my understanding of the proper Bayesian strategy is to feed that back into your original model and try again until you achieve better performance.

Corey replied:

It was my impression that — in ML at least — poor predictive performance is more often due to the model being too flexible and fitting noise.

And Rahul agreed:

Good point. A very flexible model will describe your training data perfectly and then go bonkers when unleashed on wild data.

But I wrote:

Overfitting comes from a model being flexible and unregularized. Making a model inflexible is a very crude form of regularization. Often we can do better.

This is consistent with Michael’s original comment and also with my favorite Radford Neal quote:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

I’ll give Radford the last word for now (until anyone responds in the comments).

26 thoughts on “Flexibility is good”

  1. I think it might be helpful to define “flexibility” and “penalized.” Is flexibility a reference to the number of parameters, or to the range of values those parameters can take, or both? Idem for regularization.

    • I think of “flexibility” as referring to the dimensionality of the space of possible solutions, and “regularized” as a statement about soft restrictions within this space.

      • I suppose by dimensionality you are then thinking about the number of parameters, not their value ranges. If so, by “soft restrictions” I imagine a process by which the computer optimizes over the set of all possible parameterizations, as opposed to humans choosing among a subset of all possibilities (e.g., a subset presently in their minds). In this way regularization appears to be a finer process of simplification, continuous rather than discretized.

        At the same time, humans discretize the world all the time, partly on the basis of prior experience and as a result of their own internal regularization. Put differently, something is lost when we let the computer do all the regularization.

        One way to combine both is to put priors on the parameters, possibly including the regularization parameter itself. The latter says that, in addition to our current guesses about some explicit parameters, we think the underlying model is probably simple. But I have no experience whatsoever with these kinds of models.
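        In symbols, one generic version of that idea (just a sketch; the specific distributions are illustrative choices, not anything proposed above) puts a prior on the shrinkage scale itself:

        ```latex
        y_i \sim \mathrm{N}\!\left(f(x_i;\beta),\, \sigma^2\right), \qquad
        \beta_j \sim \mathrm{N}\!\left(0,\, \tau^2\right), \qquad
        \tau \sim \text{Half-Cauchy}(0, 1)
        ```

        A small τ says the underlying model is probably simple (coefficients near zero); a large τ lets the data push coefficients away from zero. The appropriate amount of regularization is then inferred along with everything else rather than fixed in advance.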

      • Can someone elaborate on what exactly regularization is? I’m still confused.

        So if I try fitting quadratics, cubics, and increasingly higher-order functions to my data, am I exploiting flexibility, because my number of descriptive parameters increases?

        If I just restrict myself to, say, a cubic and vary its four parameters until I get better and better fits, is that regularization? Or am I confusing things?

        • Regularization means, more or less, restricting the model’s flexibility. If you have 35 data points you can fit them exactly with a degree-34 polynomial (35 coefficients), but this thing could oscillate wildly, predicting intermediate values that are extremely unlikely. One way to regularize is to set all the higher-order terms to zero (thereby making it a lower-degree polynomial). Lower-degree polynomials can’t oscillate as rapidly, so they can’t produce as radically weird a function. Another way to regularize would be, for example, to place strong priors on the higher-order terms, constraining their coefficients to be closer and closer to zero as the degree increases. This has the advantage that if there is really strong evidence in the data for some particular high-order term, it can overwhelm the prior, and you can fit a function like 1 + x^2 + x^11 + noise, which you couldn’t do if you had truncated your polynomial to order 3 or less.

          In essence, regularization keeps the model from taking full advantage of its unregularized flexibility without eliminating that flexibility entirely (the toy sketch below shows both the hard and the soft versions).
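          For concreteness, here is a toy numpy sketch of the two approaches just described; the data, the degrees, and the penalty weights are all invented for illustration.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          # Toy data: a smooth signal plus noise.
          x = np.linspace(-1, 1, 35)
          y = 1 + x**2 + 0.1 * rng.normal(size=x.size)

          def design(x, degree):
              """Polynomial design matrix with columns 1, x, x^2, ..., x^degree."""
              return np.vander(x, degree + 1, increasing=True)

          # Unregularized high-degree fit: free to chase the noise and oscillate.
          X = design(x, 20)
          beta_unreg = np.linalg.lstsq(X, y, rcond=None)[0]

          # "Hard" regularization: truncate to a cubic, i.e. force higher-order terms to zero.
          X3 = design(x, 3)
          beta_trunc = np.linalg.lstsq(X3, y, rcond=None)[0]

          # "Soft" regularization: keep all 21 terms but shrink higher-order coefficients
          # more strongly -- the penalty weight grows with the degree of the term.
          penalty = np.diag([0.01 * 4.0**j for j in range(X.shape[1])])
          beta_soft = np.linalg.solve(X.T @ X + penalty, X.T @ y)

          # Compare each fit against the underlying signal on a fine grid between the data points.
          x_new = np.linspace(-1, 1, 200)
          signal = 1 + x_new**2
          for name, deg, beta in [("unregularized deg 20", 20, beta_unreg),
                                  ("truncated cubic     ", 3, beta_trunc),
                                  ("soft-shrunk deg 20  ", 20, beta_soft)]:
              err = np.abs(design(x_new, deg) @ beta - signal).max()
              print(name, "worst deviation from the underlying curve:", round(float(err), 3))
          ```

          The diagonal penalty acts like a Gaussian prior whose variance shrinks as the degree of the term grows, so high-order terms are allowed but have to earn their keep; on a typical run the unregularized fit strays furthest from the underlying curve between the data points.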

        • Daniel:

          That way you would have to put priors on all the parameters, from the 0th to the nth. Maybe it’s better to say that the sum of squared parameters has to be < x, as in ridge regression roughly speaking, and then put a prior on x, no? (I am new to this.)

        • I think regularization is usually implemented by altering the objective function that is optimized when picking a model. For example, a standard fitting procedure of some sort might determine a model’s parameter values p (a vector of fixed length, say n) given a data set x by minimizing a function F(p|x), where F represents the lack of fit of your model to the data set x given the parameter vector p. Concretely, F might be something like the sum of squared differences between predicted and realized values.

          Regularization, I think, most often means modifying the optimization problem by defining a new function P(p), a “penalty” function that increases as the complexity of the model increases, and adding this new function to the original F(p|x). A simple case might have the penalty function P(p) = n, where n is the length of the vector p. With regularization, the optimization problem that determines your model becomes: minimize F(p|x) + P(p). We’re now minimizing over both the values of the components of the vector p and its length, whereas previously we had fixed the length of p and were just minimizing over the values of its components (a toy version of this search is sketched below).

          That’s what I’ve seen people do before, anyway.
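          A toy version of that two-part objective (the data and the penalty constant are made up, with polynomials standing in for the model family) just searches over the length of p and lets least squares handle the component values at each length:

          ```python
          import numpy as np

          rng = np.random.default_rng(1)

          # Made-up data: a quadratic signal plus noise.
          x = np.linspace(0, 1, 40)
          y = 2 - 3 * x + 5 * x**2 + 0.2 * rng.normal(size=x.size)

          def lack_of_fit(p, x, y):
              """F(p | data): sum of squared differences between predicted and realized values."""
              return float(np.sum((np.polyval(p, x) - y) ** 2))

          c = 0.5  # arbitrary penalty per parameter for this sketch
          best = None
          for n_params in range(1, 10):               # minimize over the length of p ...
              p = np.polyfit(x, y, deg=n_params - 1)  # ... and over its component values
              score = lack_of_fit(p, x, y) + c * n_params   # F(p|x) + P(p), with P(p) = c * len(p)
              if best is None or score < best[0]:
                  best = (score, n_params, p)

          print("chosen number of parameters:", best[1])
          print("fitted coefficients (highest degree first):", np.round(best[2], 2))
          ```

          The inner polyfit call is the fixed-length minimization described first; the outer loop plus the c·n term is the extra piece that the penalty adds.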

        • P.S. I guess you could get a similar effect by choosing an appropriate prior on the joint distribution of the parameter values and the number of parameters, then taking the usual expected value? Probably they’re equivalent under a broad range of situations, for appropriate choices of P(p) and priors on (p, length(p)).

        • No “probably” about it. Consider your sum of squared error figure-of-merit; add the regularizer, multiply by -1/2, take the exponential. Result: normal likelihood multiplied by a prior density.

        • Thanks, Corey! I figured there must be a proof of the equivalence, since the well-versed bayesians here on Prof. Gelman’s blog seem to refer to the two approaches as equivalent.

          Do you happen to know if there’s a specific theorem or set of theorems that captures the equivalence (between regularization as optimization and regularization via priors) in a very general setting? That sounds like it might be interesting to read.

        • I’m with phil… I frequently come across folk knowledge of these regularization equivalences and it certainly makes sense, but I’d like to know where to read about these things in more depth.

        • Yes, in principle they ought to be equivalent, but in practice human psychology probably means that our priors on individual parameters and our priors on the number of parameters are not consistent.

        • I don’t know of any paper that spells it out in general. Usually you can see a correspondence if you take exponentials of (the negative of) the objective function.
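          For the quadratic-penalty (ridge) case, that recipe works out as follows, assuming Gaussian noise; this is the standard penalized-least-squares/MAP correspondence rather than anything from a particular paper:

          ```latex
          \exp\!\left\{-\frac{1}{2\sigma^2}\left[\sum_i \bigl(y_i - f(x_i;\beta)\bigr)^2
              + \lambda \sum_j \beta_j^2\right]\right\}
          \;\propto\;
          \prod_i \mathrm{N}\!\bigl(y_i \mid f(x_i;\beta),\, \sigma^2\bigr)
          \;\times\;
          \prod_j \mathrm{N}\!\bigl(\beta_j \mid 0,\, \sigma^2/\lambda\bigr)
          ```

          So the minimizer of the penalized objective is the posterior mode under a normal likelihood with independent normal priors; swap the quadratic penalty for an absolute-value one and the implied prior becomes Laplace (the lasso).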

        • Anonipsmous:

          I followed the link and saw this:

          There are two major routes to address linear inverse problems. Whereas regularization-based approaches build estimators as solutions of penalized regression optimization problems, Bayesian estimators rely on the posterior distribution of the unknown, given some assumed family of priors. While these may seem radically different approaches, recent results have shown that, in the context of additive white Gaussian denoising, the Bayesian conditional mean estimator is always the solution of a penalized regression problem.

          “While these may seem radically different approaches”? Huh? It’s commonplace that regularization can typically be seen as something close to maximum penalized likelihood, which in turn can be interpreted as being Bayesian. I’m not slamming the content of the paper (which I haven’t read); I just think it’s a bit odd to see a claim that these approaches, which are generally seen as different formulations of the same basic idea, “may seem radically different.”

  2. Another great Radford quote related to this topic:
    “Why select a subset of the observed features?

    3) You don’t want to think too hard about your prior.
    The bad effects of bad priors may be greater in high dimensions.
    …”

    Sometimes crude forms of regularization (like cross-validation) can save you time, but of course it would be more elegant to come up with a good prior (though how much can we tune the prior of the complex model based on the information that some simpler model performed well?).

  3. Well, you don’t have to sell me on the benefits of regularization aka soft constraints aka log-priors.

    I think it makes sense to see priors with hyperparameters that are also tuned (via cross-validation or a full multilevel treatment with a hyperprior) as enabling the model to interrogate the data about the appropriate level of flexibility. AG, of your anti-parsimony stance I have written:

    Gelman wants to throw everything he can into his models — and then use multilevel (a.k.a. hierarchical) models to share information between exchangeable (or conditionally exchangeable) batches of parameters. The key concept: multilevel model structure makes the “effective number of parameters” become a quantity that is itself inferred from the data. So he can afford to take his “against parsimony” stance (which is really a stance against leaving potentially useful predictors out of his models) because his default model choice will induce parsimony just when the data warrant it.

    Do you think that’s fair?

    • But this cannot be “against parsimony.” Hierarchical models are more parsimonious than fully saturated models. AG is not, I think, espousing that kind of extreme anti-parsimony but a middle way (the toy sketch below tries to show that middle way with numbers).
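      A rough numpy sketch of the “effective number of parameters” idea quoted above (the groups and numbers are invented, and the final line is only a heuristic): group means are partially pooled toward the grand mean, with the amount of pooling estimated from the data.

      ```python
      import numpy as np

      rng = np.random.default_rng(2)

      # Made-up example: 8 groups, each with its own mean, observed with noise.
      true_means = rng.normal(0.0, 1.0, size=8)
      n_per_group = 20
      data = true_means[:, None] + rng.normal(0.0, 2.0, size=(8, n_per_group))

      ybar = data.mean(axis=1)                        # per-group sample means
      se2 = data.var(axis=1, ddof=1) / n_per_group    # squared standard errors of those means

      # Crude moment estimate of the between-group variance tau^2.
      tau2 = max(float(ybar.var(ddof=1) - se2.mean()), 0.0)
      grand_mean = ybar.mean()

      # Partial pooling: each group mean is pulled toward the grand mean, more strongly
      # when tau^2 is small relative to that group's standard error.
      # (If tau2 comes out 0, shrink is 1 everywhere and we get complete pooling.)
      shrink = se2 / (se2 + tau2)
      pooled = shrink * grand_mean + (1 - shrink) * ybar

      print("raw group means: ", np.round(ybar, 2))
      print("partially pooled:", np.round(pooled, 2))
      # A rough "effective number of parameters": one grand mean plus however much
      # each group estimate is allowed to move away from it.
      print("effective number of parameters ~", round(1 + float(np.sum(1 - shrink)), 1),
            "out of", len(ybar), "group means")
      ```

      With very noisy groups the shrink factors approach 1 and the model behaves almost like a single grand mean; with clearly distinct groups they approach 0 and the model spends nearly the full eight parameters. That is the sense in which the flexibility is inferred from the data rather than fixed in advance.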

  4. If we are developing a classification model for example and the data are linearly separable, what complexities would lead us to use a more complex model than, say, drawing a line?

    I use this example as we have seen it in practice when classifying EEG. The problem sounds complex, classifying what someone is thinking, and the data itself looks sufficiently complex to warrant complex models. But I still have a hard time justifying a complex model when we achieve excellent results using a simple (linear) model.

    • Jake, if your model works as well as you want then there’s no need to fit a more complicated one. I’d say that’s a general rule. I know that in my case I sometimes get all I need by taking a couple of means and standard deviations, even when the data are clearly not iid normal.

      The reason to use a more complicated model is because your simpler model is inadequate for some reason. Usually this means it either yields poor predictions, or it yields OK predictions but doesn’t give good estimates for other things you’re interested in (such as specific parameter values). To use an example Andrew has cited before, if you have a model that predicts the rate at which a drug is cleared from people’s bodies, and it fits the time decay data well but only if you assume everyone has a 30-pound liver, that might be fine if you’re trying to predict drug clearance rates but not if you’re trying to predict the probability of liver damage.

    • Jake,

      I think your question assumes three things: 1) the modeler has a full toolbox of modeling tools and knows when to use each; 2) the data are not very noisy and are in abundant supply; and 3) the data-collection process is well conceived (i.e., you reliably get the appropriate data). If all you are familiar with are extremely flexible techniques, and if your data are noisy enough, you’ll fit the noise of your sample, i.e. overfit (the toy sketch after this comment shows what that looks like).

      I also think of a machine learning class I took. A fellow in the front row wanted to predict electricity usage, and he was convinced that the teacher was holding out “the more sophisticated methods” on us. The methods we were using couldn’t predict any more accurately than anyone else’s, and he just knew it was because we weren’t using sophisticated-enough methods. The allure of more sophisticated (i.e., more flexible) methods, together with model worship, has created houses of cards in several disciplines.
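      Here is what fitting the noise looks like in a toy numpy sketch (the data-generating process, the degrees, and the noise level are all made up): the very flexible fit does great on the training sample and, typically, much worse on new data from the same process.

      ```python
      import numpy as np

      rng = np.random.default_rng(3)

      def make_data(n):
          """Made-up process: a gentle linear trend buried in plenty of noise."""
          x = rng.uniform(-1, 1, size=n)
          y = 0.5 * x + rng.normal(0.0, 0.5, size=n)
          return x, y

      x_train, y_train = make_data(30)    # small, noisy training set
      x_test, y_test = make_data(1000)    # "wild" data from the same process

      for degree in (1, 15):
          coefs = np.polyfit(x_train, y_train, deg=degree)   # degree 15 = very flexible
          train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
          test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
          print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
      ```

      Regularizing the flexible fit, by any of the means discussed upthread, is how you keep its flexibility without paying that test-error price.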
