What is “overfitting,” exactly?

This came from Bob Carpenter on the Stan mailing list:

It’s not overfitting so much as model misspecification.

I really like this line. If your model is correct, “overfitting” is impossible. In its usual form, “overfitting” comes from using too weak a prior distribution.

One might say that “weakness” of a prior distribution is not precisely defined. Then again, neither is “overfitting.” They’re the same thing.

P.S. In response to some discussion in comments: One way to define overfitting is when you have a complicated statistical procedure that gives worse predictions, on average, than a simpler procedure.

Or, since we’re all Bayesians here, we can rephrase: Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.

I’m assuming full Bayes here, not posterior modes or whatever.

Anyway, yes, overfitting can happen. And it happens when the larger model has too weak a prior. After all, the smaller model can be viewed as a version of the larger model, just with a very strong prior that restricts some parameters to be exactly zero.
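
To make that concrete, here is a minimal simulation sketch (the sample sizes, prior scales, and the little posterior-mean helper are arbitrary choices for illustration): a regression with one real predictor and 20 spurious ones, fit with independent normal priors. A weak prior on everything behaves essentially like least squares and predicts worse out of sample; a very strong prior on the spurious block recovers, in effect, the smaller model.

```python
# Toy sketch: "smaller model = larger model with a very strong prior at zero".
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p_spurious, sigma = 30, 1000, 20, 1.0

def make_data(n):
    x = rng.normal(size=(n, 1 + p_spurious))              # first column is the real predictor
    y = 2.0 * x[:, 0] + rng.normal(scale=sigma, size=n)   # the spurious coefficients are truly zero
    return x, y

def posterior_mean(x, y, tau):
    """Posterior mean of the coefficients under beta_j ~ N(0, tau_j^2), known noise sd sigma."""
    penalty = np.diag(sigma**2 / tau**2)
    return np.linalg.solve(x.T @ x + penalty, x.T @ y)

x_train, y_train = make_data(n_train)
x_test, y_test = make_data(n_test)

weak = np.full(1 + p_spurious, 100.0)              # weak prior on every coefficient
strong = np.r_[100.0, np.full(p_spurious, 0.01)]   # near-zero prior on the spurious block

for label, tau in [("weak prior everywhere", weak), ("strong prior on spurious terms", strong)]:
    beta = posterior_mean(x_train, y_train, tau)
    mse = np.mean((y_test - x_test @ beta) ** 2)
    print(f"{label:32s} test MSE: {mse:.3f}")
```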

120 thoughts on “What is ‘overfitting,’ exactly?”

  1. Overfitting also comes from using an inference procedure that gives a point estimate at a maximum fit point.

    Even many Bayesian estimates with good priors overfit at the MAP as I’ve discovered when I try to shortcut long Stan runs using optimization.

    • Interesting thought. If you do an optimization, then along the way towards the peak you find out something about the distribution. Given the general idea that the typical set is the set where the log probability density is nearly some constant, can we estimate that constant from the path history on the way to the peak? Then perhaps draw some small number of samples in the vicinity of whatever step was closest to the typical set on your way to the peak?

      The reason I ask is that if this could be made into a say 2 minute procedure, then for those of us who have models that take something like 25 hours to run to full high quality sample, we could debug models much more quickly.

      I’m already informally doing this by watching Stan spit out the samples, waiting until it calms down in terms of changing lp__, and then killing it and looking at those few final samples to see if things make sense. This can sometimes take something like a few minutes, whereas a high quality sample might take an hour or three. If things are way off by the time I get 3 or 5 samples with “constant” lp__, I can then go ahead and do things like rework the model to make things make better sense.

      • “Given the general idea that the typical set is the set where the log probability density is nearly some constant, can we estimate that constant from the path history on the way to the peak? Then perhaps draw some small number of samples in the vicinity of whatever step was closest to the typical set on your way to the peak?”

        You can do this, but not the way you describe. This is essentially how nested sampling works, but then you only get a few points from the typical set which makes function expectation estimates highly imprecise (even assuming that the rest of nested sampling is working ideally, which can be a stretch in practice).

        • Yeah, I think this is ok for debugging purposes. I mean, for example, suppose you want some big vector parameter to encode some function, and you have some idea of what that function should look like. You provide a prior, attempting to encode your knowledge, and you run Stan in this way to get some small number of samples. If the function doesn’t look anything like what you thought it was going to, you know you made a mistake encoding your information… figuring this out after 3 minutes instead of 3 hours is great!
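
A quick numerical illustration of the “typical set sits near a constant log density” idea discussed above, using a d-dimensional standard normal as a stand-in for a posterior (a sketch only, with arbitrary d):

```python
# Sketch: random draws (the typical set) sit near a roughly constant value of
# the log density, far below the log density at the mode.
import numpy as np

rng = np.random.default_rng(1)
d = 1000
draws = rng.normal(size=(4000, d))

log_p = -0.5 * np.sum(draws**2, axis=1) - 0.5 * d * np.log(2 * np.pi)
log_p_mode = -0.5 * d * np.log(2 * np.pi)   # log density at the mode (the MAP)

print("log p at the mode:    %.1f" % log_p_mode)
print("log p over the draws: %.1f +/- %.1f" % (log_p.mean(), log_p.std()))
# The draws cluster about d/2 below the mode with a spread of only ~sqrt(d/2),
# which is what makes "wait until lp__ stabilizes" a plausible rough diagnostic.
```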

  2. What does overfitting look like in its usual form in a Bayesian setting? The only way to overfit that I can think of (but I’ve not thought much about it) is by changing the prior to “fit” the data.

    • When there’s very little theory to go on, I think it’s also very possible to rework the likelihood too much. For example, a simple regression with a universal approximator (say a basis expansion, splines, or whatever), or throwing in more and more covariates, trying to chase your posterior predictive standard deviation down to zero-ish.

      • Changing the prior or the likelihood is more or less the same: tweaking the model. But the problem is not that the prior was weak to start with, the problem is modifying the model to make it fit better.

        • I think this is exactly right. Because Bayesian procedures correctly account for uncertainty, a single application of a Bayesian procedure will not overfit. But if we keep changing our model, then we overfit via the garden of forking paths, which is not properly accounted for in any individual execution of a Bayesian procedure. (There is work in adaptive data analysis that seeks to account for this. Has anyone been following that?)

          Overfitting in the classical setting arises from choosing a model class that is too flexible and then selecting only one fitted model from that class. Such estimators have high variance, and the resulting error is what we call “overfitting” (because it usually results from fitting the noise in the data as well as the signal). Of course, cross-validation, when applicable, can provide a way of adjusting the model class complexity to match the data complexity, but it is only approximate.

          We can use holdout data to detect overfitting. But one can of course fork paths by consulting the holdout data many times. An intriguing idea from Cynthia Dwork is to use differential privacy to create a reusable holdout data set. My understanding is that that reduces but does not eliminate the overfitting risk.

          Ultimately, there is no purely statistical/computational magic. We must rely on doing additional experiments to gather more data if we hope to approach the truth.
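
As a bare-bones illustration of the cross-validation point above (a sketch with made-up data from a cubic; nothing here is specific to the comment): 5-fold CV picks a polynomial degree near the truth, which is exactly the approximate guard against over-complexity being described.

```python
# Choosing polynomial degree by 5-fold cross-validation on noisy cubic data.
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 5
x = rng.uniform(-2, 2, size=n)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=n)

folds = np.array_split(rng.permutation(n), k)

def cv_error(degree):
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[held_out])
        errs.append(np.mean((y[held_out] - pred) ** 2))
    return np.mean(errs)

for degree in range(1, 11):
    print(degree, round(cv_error(degree), 3))
# CV error typically bottoms out near degree 3 and creeps up again for higher
# degrees -- an approximate guard against overfitting, not a guarantee.
```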

        • I don’t agree that “Bayesian procedures correctly account for uncertainty”. It’s just a method for updating probability distributions. It’s not a guarantee the resulting distribution is sensible.

          A friend of mine refuses to read or watch anything with wizards in it. He calls them “literary MacGuffins”.

        • Yes. But, to paraphrase Dorian Corey, that’s not a guarantee, that’s a fact.

          My pet hate is people thinking that computing some approximation to a posterior means they’ve accounted for uncertainty.

        • I don’t think any statistical methodology can totally account for uncertainty; statistical methods are means for *trying* to account for uncertainty. Phrases such as “we have accounted for uncertainty” or “we have controlled for …” (in contrast to phrases like “we have attempted to account for uncertainty by …” or “we have attempted to account for …”) just lead people to believe that statistical techniques can do more than is realistically possible.

        • +1 Martha

          I try to get this idea across to my students (undergrads, 1st or 2nd semester stats) by pointing out how absurd it would be to take the contents of a model literally. For instance, in simple regression we are supposing that our data came into being through a procedure in which numbers are randomly plucked from a normal distribution and then added to a perfectly straight line. This is preposterous! I’m not aware of all the forces that combined to create the measurements I see in some spreadsheet, but I know it wasn’t “straight line plus normal error”. And we aren’t supposed to believe that it was, we just think that this extreme simplification of reality will help us answer questions we are interested in.

          And as you note, the big problems come when people sincerely believe that, for instance, a regression coefficient tells them “the effect of X on Y while controlling for Z”, and that all of the uncertainty in this statement is quantified by the standard error. I try to teach my students to see standard errors and the like as lower bounds on uncertainty. They give us an estimate of uncertainty under the foolish assumption that our models are real.

        • I guess I need to be more careful about my wording. The Bayesian procedure correctly transforms the uncertainty of the prior into the uncertainty of the posterior under the assumption that the likelihood model is correct. It does not quantify or account for our uncertainty about our modeling choices (beyond those that can be captured by the prior). Indeed, I believe such an accounting is impossible in a finite model.

    • Overfitting with non-informative priors using MCMC looks exactly the same as overfitting using MLE methods: for example, if a model has a Gaussian error term with unknown variance, which can be perfectly (over) fit in sample, then the posterior distribution of the error term is degenerate at sigma = 0.

      • I guess it depends on what we mean by overfitting. If I fit a model y=a+b*x+error with two data points y(x=1)=1 and y(x=2)=2 using least squares I clearly overfit. The predictions I get from the model are precisely the same as the data (or the interpolation/extrapolation for other values of x). But maybe the true model was a=b=0 (and the data was just noise). Using Bayesian inference I get a posterior predictive distribution that is quite wide (and the less informative the prior, the more diffuse the model prediction).

        • Is that really the case?

          You have an unbounded likelihood as a, b approach the line given by the data and sigma approaches 0. So as long as you have non-zero density at 0 for sigma in your prior, the posterior density is unbounded at that single point, and differentiable in a neighborhood around that point.

          Of course, that doesn’t necessarily mean you have a degenerate posterior distribution; something like a Weibull with shape less than 1 has unbounded density at 0 but is not degenerate. I don’t quite recall what is mathematically required to ensure that it will not be degenerate (outside that a neighborhood around a point must have a finite integral). But at the very least, I think Stan should fail; one of the basic assumptions of most MCMC algorithms is a bounded posterior density (and I think Stan’s algorithm is no different).

        • an optimization procedure that encounters an unbounded density is in trouble, but I don’t think that’s true for MCMC. For example, just try drawing via MCMC from beta(1/2,1/2).

          The main issue with an unbounded density is that it needs to be normalizable: integral(f(x), dx, all X) = 1

        • Stan uses transformations to take parameters supported on intervals or the positive reals to the entire set of reals. Even if the density grows without limit as it approaches a boundary of the support set, as long as the density in the original parameterization is normalizable the transformed density will inevitably flatten out. If the divergence isn’t on the boundary, I imagine Stan will choke on the discontinuity.
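
Returning to the two-point example a few comments up, here is a rough numerical check (my own sketch; the weak priors and the brute-force importance sampling are arbitrary choices for this toy model): least squares reproduces the two observations exactly, while the full Bayesian fit gives a wide posterior predictive distribution at a new x.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])
x_new = 3.0

# Least squares: the line y = 0 + 1*x passes through both points exactly,
# so the in-sample residuals are zero (a "perfect" fit).
b_ls, a_ls = np.polyfit(x, y, 1)
print("least-squares prediction at x=3:", a_ls + b_ls * x_new)

# Bayesian fit with weak priors: a, b ~ N(0, 10), sigma ~ Exponential(1),
# approximated crudely by importance sampling from the prior.
rng = np.random.default_rng(3)
m = 200_000
a = rng.normal(0, 10, m)
b = rng.normal(0, 10, m)
sigma = rng.exponential(1.0, m)
resid = y[None, :] - (a[:, None] + b[:, None] * x[None, :])
log_lik = np.sum(-0.5 * (resid / sigma[:, None]) ** 2 - np.log(sigma[:, None]), axis=1)
w = np.exp(log_lik - log_lik.max())
w /= w.sum()

y_pred = a + b * x_new + sigma * rng.normal(size=m)   # posterior predictive draws (weighted)
mean = np.sum(w * y_pred)
sd = np.sqrt(np.sum(w * (y_pred - mean) ** 2))
print("posterior predictive at x=3: mean %.1f, sd %.1f" % (mean, sd))
```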

    • Poor generalization is one result of using a high-variance point estimate. So while it is perhaps not the definition of overfitting, it is certainly a consequence.

  3. I don’t see why a weak prior leads to overfitting. It will just lead to a very broad posterior distribution. A weak prior combined with MAP estimation will definitely result in overfitting. Overfitting occurs when we use a high-variance point estimate to make decisions. I suppose it could also result if we did a poor job of sampling the posterior so that we failed to properly assess our posterior uncertainty. That is not a point estimate, but it is on the spectrum between a proper full posterior and a point estimate.

  4. > Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.

    Doesn’t the concept of overfitting somehow include the idea of fitting the observed data too well, while failing to predict data not included in the analysis with the same precision?

    • How about something like: overfitting is matching your current dataset more accurately than your current dataset matches future datasets, or datasets not included in the current set.

  5. My working definition of an “overfitting” procedure (that we used in the PC priors paper) is that it’s one where, in the absence of data, the procedure prefers a more complex part of model space to a simpler part. (It’s our “Informal Definition 1” here: https://arxiv.org/pdf/1403.4630.pdf. Reviewers hated it, but it’s the correct concept.)

    • The more I think about it, the more I don’t think you need a concept of “data” to define overfitting.

      In moderate to high dimensions, a N(0, sigma^2 I) prior on regression coefficients will overfit because the sum of squares of the regression coefficients will be strongly concentrated away from zero a priori, which precludes the simplest model of them all being zero. Whereas a horseshoe prior has non-negligible mass around all of them being zero (at least the Finnish variety).

      So I don’t think Andrew’s idea of “on average fitting new data worse than current data” is the simplest form of the definition. It’s certainly true, but you need a data-generating mechanism to make it work…
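
A toy check of that prior-mass argument (a sketch; the dimension, the Monte Carlo size, and the use of the plain rather than regularized “Finnish” horseshoe are my own choices):

```python
# Compare the prior distribution of sum(beta_j^2) under iid N(0,1)
# coefficients and under a plain horseshoe prior.
import numpy as np

rng = np.random.default_rng(4)
d, m = 100, 50_000

beta_normal = rng.normal(size=(m, d))
ss_normal = np.sum(beta_normal**2, axis=1)

tau = np.abs(rng.standard_cauchy(size=(m, 1)))   # global half-Cauchy scale
lam = np.abs(rng.standard_cauchy(size=(m, d)))   # local half-Cauchy scales
z = rng.normal(size=(m, d))
ss_horseshoe = np.sum((tau * lam * z) ** 2, axis=1)

for name, ss in [("iid normal", ss_normal), ("horseshoe", ss_horseshoe)]:
    print("%-11s P(sum of squares < 1) = %.4f" % (name, np.mean(ss < 1.0)))
# The iid normal prior puts essentially no mass near the all-zero model
# (the sum of squares concentrates around d), while the horseshoe keeps
# non-negligible mass there.
```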

        • I suppose the usual reply is that this ‘is a modelling assumption’, but it seems a bit forced to me to describe/motivate it in these terms

        • I don’t think it’s one or the other. Check your model before you see data AND validate your model after. Otherwise you’re asking for trouble.

          Sometimes one of these is easier than the other, but they should both be done as well as you can.

        • I guess I don’t really think you can check a model without data, unless against eg physical constraints like ‘does it violate conservation of [whatever] or invariance under [some transformation]’.

          Preference for simplicity over complexity should, it seems to me, be motivated by something like this or by predictive/robustness concerns about what other data might look like.

        • Good talked about the “Device of imaginary results”, which is just that.

          To some extent, pre-data model validation is a step in the modelling process, but it’s useful to make it explicit.

          I don’t really agree with the idea (is it Jaynesian? I don’t care enough about dead statisticians to remember which one is which) that physical constraints or invariances are enough to set priors in realistic models. If there are obvious ones they should be respected, but for even moderately complex models, deriving them all and then working out how to incorporate them seems a Herculean task.

          So I guess the question I have is for what types of experiment is “prefer simplicity” not, pre data, a sensible prior state?

        • > the question I have is for what types of experiment is “prefer simplicity” not, pre data, a sensible prior state?

          I would say the question is relative to what sort of data are available. Most statistical procedures (including Bayes, according to me contra Lindley) only really work given identifiability constraints.

          There is only so much you can estimate. A complex model (family) just means there will be lots of equivalent models that you can’t tell apart. So you may as well restrict to models that you can tell apart.

        • But how do you do that?

          And then given that you’ve restricted to identifiable submanifolds, how do you set priors?

        • > the question I have is for what types of experiment is “prefer simplicity” not, pre data, a sensible prior state?

          My first inclination would be to think of systems where simplicity emerges from averaging over disorder. It is not a typical system for Bayesian inference but spin glasses (https://en.wikipedia.org/wiki/Spin_glass) fit the bill. They have a simple phase transition that emerges from a randomly-coupled model. In this case, the phenomenon is quite simple and physically realistic but emerges precisely because of the high entropy/complexity of the parameters. I admit this is an unusual system, although it is a pretty good way to think about things like emergent behavior of social graphs, where you would similarly expect apparently random coupling between individuals.

        • > But how do you do that?

          You mean determine the effective number of parameters that can be estimated given the data? The usual ways for estimating complexity (or whatever) parameters would be a start, right? Training/test, CV, empirical Bayes, hierarchical Bayes, profiling etc etc.

          > And then given that you’ve restricted to identifiable submanifolds, how do you set priors?

          Probably doesn’t matter too much at this point, but I suppose however you normally set priors?

        • ( a key point required for my answer to not contradict my others is that imo you need a concept of identifiability, or overfitting…, in order to estimate a complexity parameter. So you can use standard tools but the goal is not quite the same as standard estimation assuming identifiability)

        • (Obviously there’s no such thing as an objective, assumption free analysis. So you need to know what the properties of your inference procedure are as much as possible *before* you see data)

      • I don’t buy your definition of overfitting, because it seems like you are conflating overfitting, a concept that always needs data, with the probability of overfitting, which you approximate fairly reasonably. It seems like the natural endpoint of your argument is Solomonoff induction, but there are many low Kolmogorov complexity models that don’t have many zero parameters.

        • Yeah, we depart on the first sentence there. I just don’t agree: when you’re building a Bayesian model, the correct thing to look at is whether or not the model can overfit.

          And I don’t see a general way (imagine dependent data to break Andrew’s definition) of checking if overfitting has occurred.

          I also don’t quite know what low Kolmogorov complexity models that aren’t zero have to do with it.

          And I think that if you’ve followed an argument all the way to Solomonoff induction you’ve gone too far. It’s like following your satnav into a lake.

        • I laughed out loud at the satnav joke. My point was just that counting zeros is a rather arbitrary way to define model complexity, and you seem intent (quite reasonably) on spreading your prior mass evenly in model complexity.

          “Can overfit” still implies that overfitting is a property of data+model not just the parametric model (otherwise you should say that a model does overfit for N points regardless of the data). We are getting a bit semantic at this point, though.

          Dependent data doesn’t necessarily break Andrew’s definition, so long as you hold out a portion of it for validation (very rarely is this impossible given the structure of the model). A Bayesian model that doesn’t allow held-out data and makes no predictions about future data seems rarely useful.

        • I’m not trying to spread mass evenly anywhere! A priori, I want it to be exponentially unlikely to have a model that’s much more complex than the base model. It’s not counting zeros: it’s just that for some model components, zero is a sensible base state. Sometimes it’s something else.

          Regarding data+model, you need to allow for the case where the data *needs* one of the higher complexity models to describe it. What I want the prior to do is ensure it’s the data calling the shots, not the prior. If there’s low information in the data informing the model component in question, the prior won’t wash out. So you need some sort of argument for the prior.

          Hold-out validation is really hard in a lot of problems. (Point processes, for example: holding out leans very strongly on your stationarity assumption.) Hold-out methods work well when you can break your data into approximately exchangeable components. Otherwise you’ve typically got to think of something else!

        • “Regarding data+model, you need to allow for the case where the data *needs* one of the higher complexity models to describe it. ”

          This statement sounds like you are talking about “the data that have been collected,” which doesn’t make sense to me in context. Is what you are really talking about “the type of situation from which you are collecting what type of data”? That would make sense to me.

        • The once and future data. This is stats, so I’m assuming that at some point in this process there is data, and that’s the data I’m talking about.

  6. Perhaps a naive question, but is there a nice mathematical characterization of model mis-specification when using full Bayesian inference? In this case, I am thinking of how the maximum likelihood estimator minimizes the KL divergence of the empirical distribution against the model, which characterizes what happens when the true data generating process is not representable in the model.

    In the infinite data limit on a finite model, Bayesian inference must recapitulate maximum likelihood and therefore also minimize the KL divergence. Are there any useful mental models of the optimal posterior in the finite data case where the prior is still important but the true data generating process is not representable?
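
One useful mental model for the finite-but-growing-data, misspecified case, sketched below with made-up numbers: under the usual regularity conditions the posterior piles up around the “pseudo-true” parameter that minimizes the KL divergence from the true data-generating process to the model. Here the model is a normal, the data are exponential, and the posterior mean heads toward the KL-minimizing values (mean 1, sd 1) as n grows:

```python
# Fit a normal model to exponential data with a flat prior on (mu, log sigma)
# over a grid; the posterior concentrates on the pseudo-true values
# mu* = E[X] = 1 and sigma* = sd[X] = 1 even though no normal is the truth.
import numpy as np

rng = np.random.default_rng(5)
mu_grid = np.linspace(0.2, 2.2, 401)
log_sigma_grid = np.linspace(-1.5, 1.5, 301)
MU, LOG_S = np.meshgrid(mu_grid, log_sigma_grid)
SIGMA = np.exp(LOG_S)

for n in [20, 200, 2000, 20000]:
    x = rng.exponential(1.0, size=n)
    xbar, s2 = x.mean(), x.var()
    # Normal log likelihood written in terms of the sufficient statistics.
    log_post = -n * LOG_S - n * (s2 + (xbar - MU) ** 2) / (2 * SIGMA**2)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    print("n = %5d: posterior mean of mu = %.3f, of sigma = %.3f"
          % (n, np.sum(post * MU), np.sum(post * SIGMA)))
```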

    • > In the infinite data limit on a finite model, Bayesian inference must recapitulate maximum likelihood and therefore also minimize the KL divergence.

      FWIW This is not actually true for infinite-dimensional parameter spaces. In these cases your prior typically matters forever ever ever….(I don’t think Bayesians care too much for whatever reasons)

      • Bayesians care.

        But with infinite dimensional parameters modelling choices/ estimator choices tend to persist whatever your framework.

        This is sort of obvious: infinite parameters, finite data. The best you can hope for is that the stuff you care about is resolved in asymptotia. (Because even with infinite data you only see the process “once”, so you need some asymptotic assumptions to bring in some sense of replications)

      • Infinite dimensional models definitely have a new set of issues. The downside is that the prior lives on forever, but the upside is that in some cases the data generating process is always representable. I am thinking of Dirichlet process mixtures of Gaussians as always converging in the infinite data limit for smooth densities (I assume), although so does kernel density estimation.

        • The point I was trying to make is that both bayesians and non-bayesians have to make really strong, uncheckable assumptions to estimate infinite dimensional parameters. So persistence of these assumptions isn’t just a Bayesian issue

      • “Infinite dimensional” models obviously don’t really exist. They’re best thought of as a family of models in which some sufficiently large N is sufficient to describe the real scientific thing to within some measurement ability. Typically “infinite dimensional” models are actually WAY WAY lower dimensional than reality. Think for example Navier Stokes via finite elements or something. The true model is 3 positions and 3 velocities for 10^28 molecules, but the “infinite dimensional” model is approximated as say 800 finite elements with several dimensions per element. a factor of 10^26 dimensional reduction

        • I don’t understand your point. If you approximate a GP by 50 Fourier features, or a DP mixture by a mixture of 3 components the prior washes out asymptotically. So what?

        • For example, you’re doing an annual time-series analysis. You could:

          1) Use dummies for each day of the year (365 parameters to estimate)
          2) Use a GP with a squared exponential covariance function, with a single time-scale parameter and a single covariance degree, and a noise term, (3 parameters to put priors on)

          The second one is “infinite dimensional” but has actually substantially fewer parameters to estimate.
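
A small sketch of option 2 (the hyperparameter values below are placeholders, just to show the mechanics): three numbers plus a covariance formula pin down a full 365-dimensional distribution over daily effects.

```python
import numpy as np

days = np.arange(365.0)
scale, length_scale, noise = 1.0, 30.0, 0.1   # the three quantities you put priors on

# Squared-exponential covariance, plus a noise term on the diagonal.
diff = days[:, None] - days[None, :]
K = scale**2 * np.exp(-0.5 * (diff / length_scale) ** 2) + noise**2 * np.eye(365)

rng = np.random.default_rng(6)
draw = rng.multivariate_normal(np.zeros(365), K)  # one realization of the 365 daily effects
print(draw.shape, draw[:5].round(2))
```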

        • Not any more so than y[i] = exp(-t[i]/s) + epsilon[i]

          Most people look at this and say “it has one parameter, the scale s” but of course, there’s also all the individual values of any number of epsilons. If you put an independent normal on epsilon, you are in fact specifying a nonstandard gaussian process for the errors (a white noise).

          The purpose of a GP is to specify a lot of information with just a small number of unknowns, the 3 covariance function quantities.

          The purpose of Navier Stokes is to get a solution with just a few hundred finite elements instead of modeling the molecular interactions of 10^27 molecules.

          The purpose of continuous functions is to not have to worry about all the more realistic details you’d get in terms of giga-samples-per-second discrete time sampling process.

          Infinite dimensionality is best thought of as a computational rule for generating a finite dimensional object of whatever dimension is suitable for your purposes. Every continuous function is a point in an infinite dimensional Hilbert space; in this sense, classical least squares on 3rd degree polynomials is an “infinite dimensional” model too.

        • – We need to teach these people to count! :p I see 365+365+1

          – I’m old fashioned in requiring a GP has a continuum parameter space, in which case iid standard normals is a bad definition of white noise.

          – there are still four unknowns. One of them is the sample path of the GP. How big it is is up for grabs (O(log n) with fixed parameters, O(n^a) for the right priors etc), but that’s the point. Your prior in the parameters controlling the GP tells you how big that infinite dimensional parameter “really” is. You need to decide if that should be near the number of observations (n) or if it should be significantly smaller and this information should be encoded into the prior. In some situations, having this effective dimension close to n is overfitting, but in others (e.g. Almost perfectly measured data) it may not be.

          – I haven’t used my fluid dynamics in a while, but that’s quite a striking notion of the purpose of Navier-Stokes

          – I don’t know if I agree with this value judgement about continuous functions.

          – this is incorrect. It’s a four dimensional set of models that form a linear sub space of that particular Hilbert space. Assuming, of course, that you drag the inner product along with you. Otherwise it’s a topological sub space but not a Hilbert subspace.

        • I don’t think we disagree on the way the models are used, I think we simply disagree on the implications for the existence of the infinite object. If you will kindly calculate the 10^10^10^10^10^10^10^10^10^10 th digit in the decimal expansion of pi I will admit its existence in the world.

        • Here’s where I see all this going. Every scientific model provides some rule for generating predictions. Imagine it as a lambda calculus expression. The rule can be applied to any number of scenarios. When the rule is probabilistic, you can imagine it augmented with an arbitrarily long sequence of random binary digits from which it generates random numbers. In this sense, every model is capable of doing an arbitrary number of calculations, if you apply it to an arbitrarily large number of questions. But in actual fact, it will only ever be used to do some finite number of calculations. In most cases when people choose to use “infinite dimensional” models, its to *reduce* the kolmogorov complexity of the model. The structure imposed by the infinite dimensional model makes fewer outcomes possible and the lambda expression shorter.

          The finiteness of the number of questions asked of the model is very relevant to the goodness of the model. A model capable of extrapolating to before the big bang which is never asked to do so is not worse because it would give nonsense in that scenario. If your Gaussian process was chosen for the purpose of modeling a particular 365 day dataset, and is never asked to generate any more than 365 dim vectors, its hard to see in what way its anything other than a 365 dimensional model specified in a kolmogorov complexity lower than the dummy variable approach.

        • I see elsewhere you mention you have limited background on Kolmogorov Complexity. So, in this sense, just think of the complexity as the length of the smallest lambda expression that computes the same model. The point of something like a GP is that by specifying the covariance function, using just a few symbols, something like c(a,b) = s*exp(-((a-b)/l)^2)+n you have encoded all of the information needed to do the computation. Whereas for something like the dummy variables approach with say independent priors on each coefficient, you need to specify a prior over each dummy coefficient, in terms of 365 n digit numbers for locations and 365 n digit numbers for scales, and maybe some number of parameters for shape parameters at each point.

          The infinite dimensional thing is just a mental device for getting you a compact lambda expression for your finite dimensional model. It provides a structure to generate compact models with N, the number of data points, a free parameter, which nevertheless in every actual usage gets bound to some actual finite value.

        • Sure. But that’s a different model. Sometimes wildly different. People use infinite dimensional models usually because they’re easier than other things. You can build principled approximations to them, but it’s hard to work out how that approximation propagates and only really exists for fairly easy models.

          But I’m still not sure why you’re making this point. If the dimension of your model is as large or larger than your data, which is a common situation, it might as well be infinite. And to pull it back to the topic, in these cases you need to be careful with your modelling or else you will overfit. (Whatever that means)

        • > But I’m still not sure why you’re making this point. If the dimension of your model is as large or larger than your data, which is a common situation, it might as well be infinite.

          Agreed. Infinite dimensional ill-posedness is a good indication of finite but large dimensional ill-conditioning.

        • This is shorthand for “if the state of the universe is even representable as some mathematical object, this object has some enormous but finite number of dimensions”. I think this is axiomatic for me as there’s no real way to verify this.

          However, there’s a stronger sense of this, since models are made by humans. No individual human will ever carry out a calculation on an infinite number of data values or with an infinite number of parameters. And if you disagree with that, then I think we really do have fundamentally different views of the world.

        • What if I calculate the real-valued result of a problem involving continuous functions by computing with rational coefficients of a real-valued polynomial?

          > I think we really do have fundamentally different views of the world.

          I think I’d probably agree ;-) In my more philosophical moments I prefer continuity to discreteness. Perhaps just for aesthetics.

        • We may not be as far apart as you think. I just found this paper:

          https://arxiv.org/pdf/1703.00425.pdf

          See page 40 (as well as the rest of the paper, seems really interesting and good overview of some ideas) on the link between intuitionism and NSA in which they describe the relationship between a nonstandard number in IST and the idea of a non-constructable number.

          For example suppose you can prove in non-constructive way that Goldbach’s conjecture is false, but for *every computable number* it’s true. then in some sense “in the real world” all the numbers satisfy Goldbach’s conjecture, and only “theoretical” numbers inaccessible to any construction technique may violate it.

          This divides the world into naive integers that satisfy goldbach and are constructive, and non-naive integers that don’t. This is essentially the same partition as “standard” and “nonstandard”

          So, the idea of IST, in which some of the real numbers/integers etc are standard and some are not, but they’re all just real numbers… this has a flavor of intuitionism / constructability. I’m very much in favor of treating mathematics algorithmically.

          I don’t dislike continuous math, I just like it as an interpretation over a set of algorithms indexed by N, the fineness of a grid. The “continuous” concept is just a function of N, where N is free, which when applied to N returns a particular discretization. The nonstandard view is then that for all N sufficiently large (nonstandard), the results are the same to within the power of algorithmic construction to distinguish.

        • Thanks.

          Funnily enough, as discussed previously I think, when it comes to resurrecting infinitesimals I prefer so-called smooth infinitesimal analysis to NSA ;-)

          PS at what point do you think Andrew bans us for treating his blog as a miscellaneous math/stat/whatever forum? He’s remarkably tolerant…

        • You are both not too wrong: continuity is required for possibilities but does not apply to actualities.

          Daniel’s Goldbach’s conjecture is a nice example – there are “can be” numbers that violate it but you won’t encounter any that “are” violations. As Peirce put it, existence breaks the continuum of possibilities.

      • Also, is KL divergence really a desirable way to measure distances, especially in the context of misspecification and finite information? I’m pretty dubious (for whatever that’s worth!)

        • Why do you think that?

          It’s definitely not a distance, but it is a measure of how much complexity would not be accounted for if a model was replaced by the base model, which makes sense to me as a measure of how far apart two models indexed by a single parameter are.

          But I’d love to hear your thoughts!

        • Correct me if I’m wrong, but two models can be arbitrarily far apart in terms of KL but generate datasets that are arbitrarily close in terms of Kolmogorov distance.

          If you, e.g., generated data and rounded it to some number of significant figures, they would be the same. So you have to really believe your model to distinguish two models that generate the same dataset.

          Of course you could add additional regularity conditions to make this less of an issue but you could also just use a different ‘distance’.
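
A concrete toy version of the rounding point (my own sketch; the sample size and rounding level are arbitrary): the Kolmogorov (sup-CDF) distance between a N(0,1) sample and the same sample rounded to two decimals is tiny, even though the rounded distribution is discrete, so the KL divergence between the two distributions is infinite (neither is absolutely continuous with respect to the other).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100_000)
y = np.round(x, 2)                  # "report to a couple of decimal places"

def ecdf_sup_distance(a, b):
    """Largest gap between the two empirical CDFs, evaluated at the pooled points."""
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(fa - fb).max()

print("Kolmogorov distance between the two samples:",
      round(ecdf_sup_distance(x, y), 4))
```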

        • Thanks for the link :-)

          I definitely do think the issue is practical. Bayesians tend not to agree but people like eg Huber, Tukey etc from robust stats and data analysis also think/thought it was practical. It’s not surprising that the paper you link comes from the machine learning world – at least some of them seem to have a fair amount of influence from and/or overlap with those areas of stats. Again, someone like Vapnik also comes to mind too.

        • Hmm… this is an interesting paper. I suspect that the non-existence of a density is however an indicator of model misspecification (that is, if your problem occurs on a low dimensional manifold embedded in some higher dimensional space, then it’s your excess of dimensions that’s the issue, not the lack of density). As for the use of various divergences/distances, I see these as measures of how well a theoretical frequency distribution fits an observed frequency distribution. That some measures of this goodness of fit are better than others for getting models that make sense is no different than, for example, using a gamma distribution for the errors in your regression when they have some skewness, or a t distribution when they have outliers, instead of say a normal.

          The role that the metric / pseudo-metric plays is a comparison between model and data. If I compare two people by the color of their eyes, it’s no surprise that I can’t distinguish when one of them is rude and the other is kind and caring. In general, measures of comparison are important because they describe what it means to be a good fit vs a bad fit, in other words, they describe what it is you’re modeling.

        • Yes, the model is definitely misspecified in the technical sense. You need a pretty complex model since the data generating process is basically “pixels from a picture of someone’s room”. You can imagine that there is a very large excess of dimensions but identifying the hidden manifold is the whole point of the exercise.

        • Right, but by respecifying the model as “Here is a computer program that takes N numbers as inputs, and it outputs a K x K grid of pixels” and then specifying a prior over the numbers a[i] ~ normal(0,1) or whatever you like, and specifying a comparator C(Image, GeneratedImage) in terms of an appropriate information theoretic metric, and giving a probability distribution over the output of the comparator function (a “likelihood”), we can infer a density over the N numbers which, in the high probability manifold of the a values produces K x K grids of pixels that are not bad approximations to the image we’re “compressing”

          However, the sense in which it’s “not bad” is the sense in which C(Image, GeneratedImage) is in the high probability manifold of the “likelihood”. If that happens to correspond to the sense in which “this looks like the picture to me” then your choice of C and your choice of “likelihood” also corresponded to the theory “when C is in the high probability region of my likelihood, the picture will “look like the original”.

          As I said, if you compare people by their eye color you will find that the “similar” people may in fact have wildly different personalities. On the other hand, if you compare them by closeness in Meyers-Briggs scores, they might be pretty similar personalities, but the eye colors could vary widely.

        • I know literally nothing about Kolmogorov complexity. Do you mean that for any M and epsilon>0, I can find 2 models f and g with M < KL(f,g) < infinity such that if x~f and y~g are two samples, then d(x,y) < epsilon?

          This seems weird to me… do you have a ref for me to read?

        • KL means Kullback-Leibler divergence. But yes, there are very weird things mathematically that occur when you admit infinite dimensions. For example the indicator function on the irrational numbers differs from the constant function y=1 at a countably infinite set of points, but the Lebesgue integral between any two real numbers is the same.

        • I think my problem is more that I’m not sure what the Kolmogorov distance between two data sets is.

          (My measure theory is up to snuff, but google is failing me. Is it the L-infinity norm between empirical CDFs?)

        • Incidentally, if we’re trading “measure theory facts that blew my mind back in the day”, my favourite will always be that if mu is a Gaussian measure on an infinite dimensional compact Hilbert space (for simplicity), then H(mu), its reproducing kernel Hilbert space (or Cameron-Martin space), is the space where, given some noisy Gaussian observations of the process, the posterior mean will lie. The prior probability (and hence posterior probability) that any realisation of the process is in H(mu) is zero.

          So the MAP predictor (or posterior mean) is always smoother than any realisation from the process!

        • > Kolmogorov complexity

          Not that…

          > Is it the L-infinity norm between empirical CDFs?

          Yup, that. The basic idea is that the KL is a strong ‘metric’ (or whatever it’s really called) while the Kolmogorov is a weak one. You get to densities from distributions via an unbounded operator: differentiation.

        • Even if the resulting datasets (actual numbers) can come out the same to whatever sig fig under models arbitrarily far apart in the strong topology?

          I mean, tbh I’m happy for people to be happy with the strong topology, but personally I’m not (anymore).

        • All you’re saying is that a finite data set can’t always tell between two models. That’s vacuously true, so it doesn’t bother me. In the context, these aren’t arbitrary models, it’s a (typically) one dimensional sub-manifold of a well behaved model space. And the choice that has been made is that if there are two indistinguishable models, we’ll go with the simplest one.

        • ‘All you’re saying…vacuously true’

          Sometimes these sort of points are important points.

          As far as I can tell you have no theoretical justification for preferring simplicity. The ideas can be analysed in terms of empirical prediction, regularisation theory etc which attempt to understand when and why we might prefer simplicity to complexity and when not. See eg Vapnik for one attempt.

          Fine if you want to start from particular axioms and treat all exceptions as vacuous truths, but that’s not really appealing to me personally.

        • And of course you only find out two models are indistinguishable given the data when you go back to the weak topology. In the strong topology you think they’re different.

        • Sorry. That was obviously vague. What I was trying to say is that there’s nothing deep about a finite set of data not being able to separate models. It’s just true unless you put some serious mathematical effort into making it not true. So a topology based on empirical cdfs (by definition from finite data sets) is too weak to worry about. It’s also a data dependent topology, which is hard to think about when you’ve got a sequential experiment.

          It’s not vacuously true that entropy-based topologies aren’t too strong. Although Higdon’s comment here talks about the games you can play with intermediate topologies: https://xianblog.wordpress.com/2013/09/11/bayesian-brittleness-again/#comments

          And yes you’ve got to do predictive checks, but you’ve also got to build a good model first. Overfitting is a property of model+data. If the model doesn’t allow for overfitting it can’t happen. If the data is strong enough to prevent overfitting it can’t happen (although this is less likely in high dimensions).

          There’s a mirror to this entire conversation about underfitting.

        • I don’t think the Owhadi et al paper is necessarily the best expression of the basic issues (which, as Christian Hennig points out there, go back at least to the robust stats folk, as well as e.g. Vapnik and all those folk) but yes, it is based on same issues.

          It really is quite an important sticking point for many – some look at such things and say ‘eh, Bayes in the strong topology is still fine by me’ and others go ‘eek, maybe it’s not for me.’

          There are plenty of respectable folk on either side of the issue. I’m more of an unrespectable type who’s been converted to the ‘eek’ side.

        • I can’t see the eek. It’s like a magician cutting a woman in half. The first step is to make sure she’s not in the critical part of the box. Once you’ve got that much freedom, you can do almost anything you like. (For instance, that paper shows that you can’t trust posterior predictive checks because, as they are posterior functional and you’re allowed to move your assistant around the box, they are also brittle)

          But this has drifted quite far from overfitting.

        • Someone should design a survey to determine what best predicts the ‘eek’ and ‘eh’ classes of responses to this.

          Hopefully the results don’t depend on whether the analysis is eek-based or eh-based…

        • Ojm: Robust statistics is totally compatible with Bayes. The real issue is that many people identify Bayes with generative modeling of individual data points. The ABC approach is seen as an approximation. But it’s a first class citizen as soon as you recognize a deeper truth. Bayes is still constrained by frequency interpretations for many people. They think they need to give probability as frequency distributions over data points, not over say nonlinear transforms of whole data sets. But that isn’t a requirement in any way.

        • > Robust statistics is totally compatible with Bayes.

          Well…this is actually quite a subtle issue. For precisely the reasons discussed here.

          As an example, the last chapter of the second edition of Huber’s book ‘Robust statistics’ is called ‘Bayesian Robustness’ but is fairly negative on the feasibility/compatibility.

          First he discusses e.g. prior sensitivity studies which he sees as pretty orthogonal to the main issues. Then he considers more relevant solutions but most of these seem to amount to pretty much abandoning the main tenets of the Bayesian approach.

          To misquote Andrew – ‘Once you have abandoned literal belief in the B[ayes], the question soon arises: why follow it at all?’

          [PS Andrew has said something to the effect that Huber wasn’t aware of posterior predictive checks etc, but I’ve come to the position that this doesn’t really address the key issues raised]

        • ojm: I’m guessing you’re not talking about things like modeling noise with t-distribution rather than normal when you say “robust statistics”, because obviously that is totally compatible with Bayesian approach. But I’m curious what kind of “robust statistics” you have in mind. Are we talking about approaches that still work by specifying likelihoods? If so, how are they incompatible with Bayes?

        • Chris: the biggest thing that comes to mind for me regarding robust statistics is things like M-estimators with insensitive “cost” functions, winsorizing, and so forth. The goal is to be insensitive to the fact that some fraction of the data is corrupted, things like digit transpositions and instruments that get unplugged, and intermittent electrical noise, and whatever, where you don’t have a model for what happened, you just know that sometimes it’s really very different from what usually happens.

          I can think of two aspects in Bayes. The first is something like what you said: choosing distributions that allow for outliers, choosing finite mixture models where some fraction of the observations come from much more extreme distributions, and so forth. The second is something more like ABC, where using some computational method you generate “forward” data predictions, and then write your “likelihood” in terms of statistics of how well the whole set of data predictions match the whole set of data. You could for example calculate Tukey’s biweight error function, and then specify a probability distribution over the result.

          If you see a likelihood as a *frequency* distribution of errors, then this makes no sense. But if you see a likelihood as a measure of something else, this makes perfect sense. I’m working on a paper where I describe “what else” I think the Bayesian posterior measures. I think ojm has convinced me that “truth of a proposition” is not in general the best way to think about what Bayes does. Ironically, one area where “truth of a proposition” makes good sense is when you’re using an RNG to sample from a finite population. Then, there really *is* a “true” say mean, or sd or range or median. Sometimes that is a fine model of what you’re doing, but other times you know that there’s no “Truth” involved, just some kind of process you’re describing and a model that has more “goodness” vs more “badness”.

        • There is this idea in Bayes of conditioning not on the data itself but on robust summary statistics.

          The challenge that arises is that often there is not a closed-form likelihood function for that (something giving the probability of observing those robust summary statistics at all points in the parameter space).

          That thinking led to “Robust Bayesian inference via coarsening” by Jeffrey W. Miller and David B. Dunson, which seems to work better: https://arxiv.org/abs/1506.06101

        • Keith, I don’t know to what extent coarsening works better; that may be true, and in fact it may be an essential aspect of what we’re really doing in Bayes (no measurements are infinitely fine, all measurements are coarse at some level, but usually we ignore this for convenience).

          Still, I’m not quite sure what it means “often there is not a closed form likelihood function for that (something giving the probability of the observing those robust summary statistics for all points in the parameter space.)”

          I think this comes out of a desire to have the likelihood represent frequency of something. Restricting yourself to that case is a mistake I think. In ABC for example you generate some forward estimate of the quantities of interest, and then you calculate a summary statistic or 2 or 3, and then you have several options:

          1) accept if the summary statistics are within epsilon of a critical value (i.e. differences are within epsilon of 0). This was the original ABC approach.

          2) accept with probability proportional to some continuous non-negative function with a peak at a critical value that decreases away from this peak, say normal(0,1) or the like.

          3) accept with probability proportional to a joint function over the 2 or 3 summary statistics….

          and at each stage we get closer to the idea that the likelihood is just any non-negative function. For inference we don’t even need it to be normalizable since we’ll always have a finite number of data points, but for prior or posterior predictive purposes we do need it normalizable.

          Is there an interpretation of what this means that makes sense philosophically and mathematically? I think the answer is yes, I’ll be very interested in what you have to say about it. I will put up some kind of blog posts and a paper for comments in the next … hopefully week or so.
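
For concreteness, here is a minimal sketch of options 1 and 2 on a toy model (the normal location model, the median summary, the prior, and the kernel width are all arbitrary choices, not anything from the thread):

```python
# ABC for the location of a normal, conditioning on the sample median:
# hard epsilon-rejection versus soft Gaussian-kernel acceptance.
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(3.0, 1.0, size=50)
obs_summary = np.median(data)              # robust summary statistic

m = 100_000
mu = rng.normal(0.0, 10.0, size=m)         # weak prior on the location
sims = rng.normal(mu[:, None], 1.0, size=(m, 50))
disc = np.median(sims, axis=1) - obs_summary   # summary discrepancy per prior draw

# 1) hard rejection: keep draws whose summary is within epsilon of the observed one
eps = 0.1
post_hard = mu[np.abs(disc) < eps]

# 2) soft acceptance: accept with probability given by a Gaussian kernel on the discrepancy
accept = rng.uniform(size=m) < np.exp(-0.5 * (disc / 0.1) ** 2)
post_soft = mu[accept]

for name, post in [("hard rejection", post_hard), ("Gaussian kernel", post_soft)]:
    print("%-15s posterior mean %.2f, sd %.2f, kept %d draws"
          % (name, post.mean(), post.std(), len(post)))
```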

        • Daniel: It was all explained in my DPhil thesis ;-)

          I’ll put something together as it would be much simpler to put today than in 2007.

  7. This came from Bob Carpenter on the Stan mailing list: “It’s not overfitting so much as model misspecification.”
    I really like this line. If your model is correct, “overfitting” is impossible.

    I agree, I never thought about it this way before. It is one of those genius insights that seem so obvious in retrospect.

    I was thinking about something similar earlier actually. Often what we really want is something like F = G*m1*m2/R^2 while assuming all objects are point masses and ignoring everything below a certain mass. A more complicated model that could account for stuff like perturbations for all asteroids and comets, 3d shape of the objects, etc. is also useful, but plays a different role. It seems to me most of stats/ml is concerned about “exactness” and thus designed to solve problems like the latter, which is prone to overfitting.

  8. > Even if the resulting datasets (actual numbers) can come out the same to whatever sig fig under models arbitrarily far apart in the strong topology?

    “if reality does not fit the concept, too bad for reality”

    It seems to me that “over-fitting” is being defined here simply as “complexity” and not as “poor generalization” (which may of course be the consequence of “over-complexity”).

  9. What is the etymology of overfitting, if it is known? I always thought it was a metaphor from tailoring, if your bespoke suit fits you too closely (sample), it will rupture when you move (out-of-sample).
    In that sense I considered it a good metaphor, because nontechnical people easily get it this way.
    But English is my second language and some googling seems to show tailors wouldn’t actually use this word.
    Leaves me to wonder: is this not a problem tailors have?

  10. It was a refreshing post. I always have a hard time telling people that cross-validation does not prevent ‘overfitting’. In industry it is a common misconception, I think, even if one has only a single model.

  11. “People are worried about overfitting.”

    Readers of Matt Levine know that people are constantly worried about bond market liquidity, unicorns, and many other issues. Overfitting is not a recurring worry yet, but it could become one!

    “Overfitting is partly a statistical problem, about how we can extrapolate rules from data, but it is also a deep worry about whether the world is understandable, whether it is subject to rules, and whether those rules are comprehensible to humans.”

    https://www.bloomberg.com/view/articles/2017-07-18/liquidity-bankruptcy-and-paperwork

  12. @Andrew Thank you for this conceptually fundamental post and your comment.

    (a) I think there is a link to linear algebra too. I was referring to ‘regularisation’ from an inverse problems point of view, such as LASSO, where applying regularisation reduces the condition number of a ‘design matrix’. Some authors call this ‘inverse crimes’ jokingly.

    (b) Douglas Hawkins, in his paper ‘The Problem of Overfitting’ (2004), states that:

    “Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well.”

    Comparing with your definition,

    “Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.”

    Hawkins’s definition is stronger. Let’s say we have a well-generalised model (M1) and there is a ‘simpler model’ (M2) that reaches similar ‘predictive power’. By Hawkins’s definition, M1 is still overfitting.

    From both definitions, can we infer the following? ‘Overfitting’ is not about the generalisation error alone, since it is a comparison problem; hence we cannot really resolve ‘overfitting’ by just looking at, say, cross-validation error?
