(a) I think there is a link to linear algebra too. I was referring to ‘regularisation’ from an inverse-problems point of view, such as LASSO, where applying regularisation reduces the condition number of a ‘design matrix’. Some authors jokingly call this ‘inverse crimes’.
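To make the condition-number point concrete, here is a minimal sketch. It uses a ridge (Tikhonov) penalty rather than LASSO, because the ridge penalty simply adds a multiple of the identity to the Gram matrix, so its effect on conditioning is easy to see. The nearly collinear design, the penalty `lam`, and the function name are all assumptions made up for illustration.

```python
import numpy as np

def gram_condition_numbers(n=50, lam=1.0, seed=0):
    """Condition number of X^T X with and without a ridge penalty lam * I."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + 1e-3 * rng.normal(size=n)   # second column is nearly a copy of the first
    X = np.column_stack([x1, x2])         # hypothetical, badly conditioned design matrix
    gram = X.T @ X
    return np.linalg.cond(gram), np.linalg.cond(gram + lam * np.eye(2))

plain, ridged = gram_condition_numbers()
print(f"cond(X^T X)         = {plain:.3g}")
print(f"cond(X^T X + lam I) = {ridged:.3g}")
```

Adding `lam` to every eigenvalue of the Gram matrix shrinks the ratio of the largest to the smallest eigenvalue, which is why the second number comes out smaller.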

(b) Douglas Hawkins, in his paper ‘The Problem of Overfitting’ (2004), states that:

“Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well.”

Comparing with your definition,

“Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.”

Hawkins’s definition is stronger. Say we have a well-generalised model (M1), and there is a ‘simpler model’ (M2) that reaches similar ‘predictive power’. Then, by Hawkins’s definition, M1 is still overfitting.

From both definitions, can we infer the following? ‘Overfitting’ is not about finding the generalisation error alone; since it is a comparison problem, we cannot really resolve ‘overfitting’ just by looking at, say, cross-validation error?

]]>Overfitting is a general concept—optimizing fit to training data does not optimize fit to test data—which does not need to have any connection to matrices at all.
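A small simulation can illustrate the gap between fit to training data and fit to test data. This is only a sketch under assumed settings: the sine "truth", the noise level, the sample sizes, and the polynomial degrees are all invented for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def train_test_mse(degree, n_train=15, n_test=200, noise=0.3, seed=0):
    """Fit a polynomial by least squares; return (training MSE, test MSE)."""
    rng = np.random.default_rng(seed)      # same seed => same data for every degree
    truth = lambda x: np.sin(np.pi * x)    # assumed data-generating curve
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = truth(x_tr) + noise * rng.normal(size=n_train)
    x_te = rng.uniform(-1, 1, n_test)
    y_te = truth(x_te) + noise * rng.normal(size=n_test)
    coefs = P.polyfit(x_tr, y_tr, degree)
    mse_tr = float(np.mean((P.polyval(x_tr, coefs) - y_tr) ** 2))
    mse_te = float(np.mean((P.polyval(x_te, coefs) - y_te) ** 2))
    return mse_tr, mse_te

results = {d: train_test_mse(d) for d in (1, 3, 9)}
for degree, (mse_tr, mse_te) in results.items():
    print(f"degree {degree}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```

Because the polynomial families are nested, the training error can only go down as the degree goes up. The test error has no such guarantee, and comparing the two columns across degrees is the kind of model-versus-model comparison the Hawkins quote is about.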

]]>I’ll put something together as it would be much simpler to put today than in 2007.

]]>Readers of Matt Levine know that people are constantly worried about bond market liquidity, unicorns, and many other issues. Overfitting is not a recurring worry yet, but it could become one!

“Overfitting is partly a statistical problem, about how we can extrapolate rules from data, but it is also a deep worry about whether the world is understandable, whether it is subject to rules, and whether those rules are comprehensible to humans.”

https://www.bloomberg.com/view/articles/2017-07-18/liquidity-bankruptcy-and-paperwork

]]>Still, I’m not quite sure what it means that “often there is not a closed form likelihood function for that (something giving the probability of observing those robust summary statistics for all points in the parameter space.)”

I think this comes out of a desire to have the likelihood represent frequency of something. Restricting yourself to that case is a mistake I think. In ABC for example you generate some forward estimate of the quantities of interest, and then you calculate a summary statistic or 2 or 3, and then you have several options:

1) accept if the summary statistics are within epsilon of a critical value (ie. differences are within epsilon of 0). This was the

2) accept with probability proportional to some continuous non-negative function with a peak at a critical value that decreases away from this peak, say normal(0,1) or the like.

3) accept with probability proportional to a joint function over the 2 or 3 summary statistics….

and at each stage we get closer to the idea that the likelihood is just any non-negative function. For inference we don’t even need it to be normalizable since we’ll always have a finite number of data points, but for prior or posterior predictive purposes we do need it normalizable.
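The first two acceptance rules above can be sketched as follows. This is a toy setup, not anyone's actual method: the normal model with known unit variance, the Normal(0, 5) prior, the sample mean as the single summary statistic, and the tolerance `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def abc_posterior(observed, n_draws=20000, eps=0.1, kernel="hard", seed=0):
    """Rejection ABC for the mean of a unit-variance normal (toy model)."""
    rng = np.random.default_rng(seed)
    n = observed.size
    s_obs = observed.mean()                                  # observed summary statistic
    mu = rng.normal(0.0, 5.0, size=n_draws)                  # draws from the assumed prior
    sims = rng.normal(mu[:, None], 1.0, size=(n_draws, n))   # forward-simulated data sets
    diff = sims.mean(axis=1) - s_obs                         # simulated minus observed summary
    if kernel == "hard":
        accept = np.abs(diff) < eps                          # option 1: within eps of 0
    else:
        weight = np.exp(-0.5 * (diff / eps) ** 2)            # option 2: smooth, peak 1 at diff = 0
        accept = rng.uniform(size=n_draws) < weight
    return mu[accept]

rng = np.random.default_rng(1)
observed = rng.normal(2.0, 1.0, size=20)                     # made-up "data"
hard = abc_posterior(observed, kernel="hard")
soft = abc_posterior(observed, kernel="soft")
print("hard kernel:", hard.size, "accepted, posterior mean", round(float(hard.mean()), 3))
print("soft kernel:", soft.size, "accepted, posterior mean", round(float(soft.mean()), 3))
```

The only difference between the two branches is the non-negative function of `diff` used to decide acceptance, which is the sense in which the "likelihood" here is just some non-negative function with a peak where the summaries match.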

Is there an interpretation of what this means that makes sense philosophically and mathematically? I think the answer is yes, I’ll be very interested in what you have to say about it. I will put up some kind of blog posts and a paper for comments in the next … hopefully week or so.

]]>The challenge that arises is that often there is not a closed form likelihood function for that (something giving the probability of observing those robust summary statistics for all points in the parameter space.)

That thinking led to this paper, “Robust Bayesian inference via coarsening” by Jeffrey W. Miller and David B. Dunson, which seems to work better: https://arxiv.org/abs/1506.06101

]]>I can think of two aspects in Bayes. The first is something like what you said: choosing distributions that allow for outliers, choosing finite mixture models where some fraction of the observations come from much more extreme distributions, and so forth. The second is something more like ABC, where using some computational method you generate “forward” data predictions, and then write your “likelihood” in terms of statistics of how well the whole set of data predictions match the whole set of data. You could for example calculate Tukey’s biweight error function, and then specify a probability distribution over the result.
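As a rough sketch of the biweight idea: compute Tukey's biweight loss on the residuals between forward predictions and data, and then treat the negative summed loss as an unnormalized log "likelihood". That last step is just one assumed way to "specify a probability distribution over the result"; the function names and the `scale` parameter are hypothetical. The tuning constant 4.685 is the conventional choice for the biweight.

```python
import numpy as np

def tukey_biweight_loss(residuals, c=4.685):
    """Tukey's biweight rho: roughly quadratic near 0, constant (c^2 / 6) beyond |r| = c."""
    r = np.asarray(residuals, dtype=float) / c
    inside = np.abs(r) <= 1.0
    rho = np.full(r.shape, c**2 / 6.0)                     # capped loss for gross outliers
    rho[inside] = (c**2 / 6.0) * (1.0 - (1.0 - r[inside] ** 2) ** 3)
    return rho

def pseudo_log_likelihood(observed, predicted, scale=1.0):
    """Assumed form: negative summed biweight loss of the scaled residuals."""
    residuals = (np.asarray(observed) - np.asarray(predicted)) / scale
    return -float(np.sum(tukey_biweight_loss(residuals)))

predicted = np.zeros(5)
clean = np.array([0.1, -0.2, 0.3, 0.0, -0.1])
dirty = clean.copy()
dirty[0] = 1e6                                             # one wild outlier
print(pseudo_log_likelihood(clean, predicted))
print(pseudo_log_likelihood(dirty, predicted))
```

Because the loss is capped, a single wild observation can lower the pseudo log-likelihood by at most `c**2 / 6`, which is what makes this kind of "likelihood" robust.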

If you see a likelihood as a *frequency* distribution of errors, then this makes no sense. But if you see a likelihood as a measure of something else, this makes perfect sense. I’m working on a paper where I describe “what else” I think the Bayesian posterior measures. I think ojm has convinced me that “truth of a proposition” is not in general the best way to think about what Bayes does. Ironically, one area where “truth of a proposition” makes good sense is when you’re using an RNG to sample from a finite population. Then, there really *is* a “true” value of, say, the mean, sd, range or median. Sometimes that is a fine model of what you’re doing, but other times you know that there’s no “Truth” involved, just some kind of process you’re describing and a model that has more “goodness” vs more “badness”.

]]>Well…this is actually quite a subtle issue. For precisely the reasons discussed here.

As an example, the last chapter of the second edition of Huber’s book ‘Robust statistics’ is called ‘Bayesian Robustness’ but is fairly negative on the feasibility/compatibility.

First he discusses e.g. prior sensitivity studies which he sees as pretty orthogonal to the main issues. Then he considers more relevant solutions but most of these seem to amount to pretty much abandoning the main tenets of the Bayesian approach.

To misquote Andrew – ‘Once you have abandoned literal belief in the B[ayes], the question soon arises: why follow it at all?’

[PS Andrew has said something to the effect that Huber wasn’t aware of posterior predictive checks etc., but I’ve come to the position that this doesn’t really address the key issues raised]

]]>However, the sense in which it’s “not bad” is the sense in which C(Image, GeneratedImage) is in the high probability manifold of the “likelihood”. If that happens to correspond to the sense in which “this looks like the picture to me”, then your choice of C and your choice of “likelihood” also corresponded to the theory “when C is in the high probability region of my likelihood, the picture will ‘look like the original’”.

As I said, if you compare people by their eye color you will find that the “similar” people may in fact have wildly different personalities. On the other hand, if you compare them by closeness in Myers-Briggs scores, they might have pretty similar personalities, but the eye colors could vary widely.

]]>The role that the metric / pseudo-metric plays is a comparison between model and data. If I compare two people by the color of their eyes, it’s no surprise that I can’t distinguish when one of them is rude and the other is kind and caring. In general, measures of comparison are important because they describe what it means to be a good fit vs a bad fit, in other words, they describe what it is you’re modeling.

]]>I definitely do think the issue is practical. Bayesians tend not to agree, but people like e.g. Huber, Tukey etc. from robust stats and data analysis also think/thought it was practical. It’s not surprising that the paper you link comes from the machine learning world – at least some of them seem to have a fair amount of influence from and/or overlap with those areas of stats. Again, someone like Vapnik also comes to mind.

]]>Daniel’s Goldbach conjecture example is a nice one – there “can be” numbers that violate it but you won’t encounter any that “are” violations. As Peirce put it, existence breaks the continuum of possibilities.

]]>I don’t think the etymology is from tailoring. My understanding is that the usage “fit a curve to the data” originated first; then I would guess that the word “overfitting” was adopted by analogy with words like “overdoing it”. Perhaps you haven’t encountered the quote from John von Neumann that says “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

]]>But two cool things:

https://en.wikipedia.org/wiki/Ganzfeld_effect

https://en.wikipedia.org/wiki/Apophenia

In that sense I considered it a good metaphor, because nontechnical people easily get it this way.

But English is my second language and some googling seems to show tailors wouldn’t actually use this word.

Leaves me to wonder: is this not a problem tailors have? ]]>

Funnily enough, as discussed previously I think, when it comes to resurrecting infinitesimals I prefer so-called smooth infinitesimal analysis to NSA ;-)

PS at what point do you think Andrew bans us for treating his blog as a miscellaneous math/stat/whatever forum? He’s remarkably tolerant…

]]>https://arxiv.org/pdf/1703.00425.pdf

See page 40 (as well as the rest of the paper, which seems really interesting and a good overview of some ideas) on the link between intuitionism and NSA, in which they describe the relationship between a nonstandard number in IST and the idea of a non-constructible number.

For example, suppose you can prove in a non-constructive way that Goldbach’s conjecture is false, but for *every computable number* it’s true. Then in some sense “in the real world” all the numbers satisfy Goldbach’s conjecture, and only “theoretical” numbers inaccessible to any construction technique may violate it.

This divides the world into naive integers that satisfy Goldbach and are constructive, and non-naive integers that don’t. This is essentially the same partition as “standard” and “nonstandard”.

So, the idea of IST, in which some of the real numbers/integers etc. are standard and some are not, but they’re all just real numbers… this has a flavor of intuitionism / constructibility. I’m very much in favor of treating mathematics algorithmically.

I don’t dislike continuous math, I just like it as an interpretation over a set of algorithms indexed by N, the fineness of a grid. The “continuous” concept is just a function of N, where N is free, which when applied to N returns a particular discretization. The nonstandard view is then that for all N sufficiently large (nonstandard), the results are the same to within the power of algorithmic construction to distinguish.
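One way to picture "a function of N which returns a particular discretization" is a numerical integral whose grid size is left free until it is called. The sketch below is only an illustration of that reading; the midpoint rule, the integrand, and the function name are assumptions, not anything specified above.

```python
import numpy as np

def integral_on_grid(f, a, b, n):
    """Midpoint-rule approximation of the integral of f on [a, b] using n grid cells."""
    edges = np.linspace(a, b, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(f(mids)) * (b - a) / n)

# The "continuous" object is the map n -> result; each call binds n to a finite value.
for n in (10, 100, 1000, 10000):
    print(n, integral_on_grid(np.sin, 0.0, np.pi, n))
```

For large enough `n` the printed values agree to more digits than one would usually care to distinguish, which is the finite-precision analogue of "the results are the same for all N sufficiently large".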

]]>Hopefully the results don’t depend on whether the analysis is eek-based or eh-based…

]]>But this has drifted quite far from overfitting.

]]>It really is quite an important sticking point for many – some look at such things and say ‘eh, Bayes in the strong topology is still fine by me’ and others go ‘eek, maybe it’s not for me.’

There are plenty of respectable folk on either side of the issue. I’m more of an unrespectable type who’s been converted to the ‘eek’ side.

]]>It’s not vacuously true that entropy-based topologies aren’t too strong. Although Higdon’s comment here talks about the games you can play with intermediate topologies: https://xianblog.wordpress.com/2013/09/11/bayesian-brittleness-again/#comments

And yes you’ve got to do predictive checks, but you’ve also got to build a good model first. Overfitting is a property of model+data. If the model doesn’t allow for overfitting it can’t happen. If the data is strong enough to prevent overfitting it can’t happen (although this is less likely in high dimensions).

There’s a mirror to this entire conversation about underfitting.

]]>Sometimes these sorts of points are important points.

As far as I can tell you have no theoretical justification for preferring simplicity. The ideas can be analysed in terms of empirical prediction, regularisation theory etc which attempt to understand when and why we might prefer simplicity to complexity and when not. See eg Vapnik for one attempt.

Fine if you want to start from particular axioms and treat all exceptions as vacuous truths, but that’s not really appealing to me personally.

]]>You can do this, but not the way you describe. This is essentially how nested sampling works, but then you only get a few points from the typical set which makes function expectation estimates highly imprecise (even assuming that the rest of nested sampling is working ideally, which can be a stretch in practice).

]]>You mean determine the effective number of parameters that can be estimated given the data? The usual ways for estimating complexity (or whatever) parameters would be a start, right? Training/test, CV, empirical Bayes, hierarchical Bayes, profiling etc etc.

> And then given that you’ve restricted to identifiable submanifolds, how do you set priors?

Probably doesn’t matter too much at this point, but I suppose however you normally set priors?

]]>This statement sounds like you are talking about “the data that have been collected,” which doesn’t make sense to me in context. Is what you are really talking about “the type of situation from which you are collecting what type of data”? That would make sense to me.

]]>The infinite dimensional thing is just a mental device for getting you a compact lambda expression for your finite dimensional model. It provides a structure to generate compact models with N, the number of data points, a free parameter, which nevertheless in every actual usage gets bound to some actual finite value.

]]>My first inclination would be to think of systems where simplicity emerges from averaging over disorder. It is not a typical system for Bayesian inference, but spin glasses (https://en.wikipedia.org/wiki/Spin_glass) fit the bill. They have a simple phase transition that emerges from a randomly-coupled model. In this case, the phenomenon is quite simple and physically realistic but emerges precisely because of the high entropy/complexity of the parameters. I admit this is an unusual system, although it is a pretty good way to think about things like emergent behavior of social graphs, where you would similarly expect apparently random coupling between individuals.

]]>“if reality does not fit the concept, too bad for reality”

It seems to me that “over-fitting” is being defined here simply as “complexity” and not as “poor generalization” (which may of course be the consequence of “over-complexity”).

]]>The finiteness of the number of questions asked of the model is very relevant to the goodness of the model. A model capable of extrapolating to before the big bang which is never asked to do so is not worse because it would give nonsense in that scenario. If your Gaussian process was chosen for the purpose of modeling a particular 365-day dataset, and is never asked to generate any more than 365-dimensional vectors, it’s hard to see in what way it’s anything other than a 365-dimensional model specified with a lower Kolmogorov complexity than the dummy variable approach.

]]>I mean, tbh I’m happy for people to be happy with the strong topology, but personally I’m not (anymore).

]]>Not that…

> Is it the L-infinity norm between empirical CDFs?

Yup that. The basic idea is that the KL is a strong ‘metric’ (or whatever it’s really called) while the Kolmogorov is a weak one. You get to densities from distributions via an unbounded operator: differentiation.
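A toy discrete example of the strong/weak contrast: two distributions on a fine grid that interleave, one on the even points and one on the odd points. Their CDFs are nearly identical, so the Kolmogorov (sup-norm) distance is tiny, but their supports are disjoint, so the KL divergence is infinite. The construction and function names are made up for illustration.

```python
import numpy as np

def kolmogorov_distance(p, q):
    """L-infinity distance between the CDFs of two pmfs on the same ordered grid."""
    return float(np.max(np.abs(np.cumsum(p) - np.cumsum(q))))

def kl_divergence(p, q):
    """KL(p || q) for pmfs; returns inf when p puts mass where q has none."""
    mask = p > 0
    with np.errstate(divide="ignore"):        # log(0) = -inf is intended here
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def interleaved(n):
    """p is uniform on even grid points, q is uniform on odd grid points."""
    p = np.zeros(2 * n)
    q = np.zeros(2 * n)
    p[0::2] = 1.0 / n
    q[1::2] = 1.0 / n
    return p, q

p, q = interleaved(1000)
print("Kolmogorov distance:", kolmogorov_distance(p, q))
print("KL divergence:      ", kl_divergence(p, q))
```

Refining the grid drives the Kolmogorov distance toward zero while the KL divergence stays infinite, so closeness in the weak sense says nothing about closeness in the strong sense.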

]]>