Comments on: What is “overfitting,” exactly?

By: msuzen

msuzen — Sat, 19 Aug 2017 02:50:51 +0000

In reply to msuzen. PS: I meant M1 is still overfitting.

By: msuzen

msuzen — Sat, 19 Aug 2017 02:49:27 +0000

@Andrew Thank you for this conceptually fundamental post and your comment.

(a) I think there is a link to linear algebra too. I was referring to ‘regularisation’ from an inverse-problems point of view, such as LASSO, where applying regularisation reduces the condition number of a ‘design matrix’. Some authors call this ‘inverse crimes’ jokingly.

(b) Douglas Hawkins, in his paper ‘The Problem of Overfitting’ (2004), states that:

“Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well.”

Comparing with your definition,

“Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.”

Hawkins’s definition is stronger. Let’s say we have a well-generalised model (M1) if there is a ‘simpler model’ that reaches similar ‘predictive power’ (M2). So, M2 is still overfitting.

From both definitions, can we infer the following? ‘Overfitting’ is not about finding generalisation error alone, since it is a comparison problem; hence we cannot really resolve ‘overfitting’ just by looking at, say, cross-validation error?
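A minimal sketch of the condition-number remark in (a), under stated assumptions: it uses ridge (Tikhonov) regularisation, which adds lam to every eigenvalue of X^T X, rather than LASSO, whose L1 penalty does not act on the matrix so directly. The two predictor columns are made-up, nearly collinear numbers, and the function names are only illustrative.

```python
import math

def gram(x1, x2):
    """Entries (a, b, d) of the 2x2 Gram matrix X^T X = [[a, b], [b, d]]."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    return a, b, d

def condition_number(a, b, d, lam=0.0):
    """Condition number of [[a, b], [b, d]] + lam * I via its two eigenvalues."""
    a, d = a + lam, d + lam
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return (mean + radius) / (mean - radius)

# Two nearly collinear predictors (invented for illustration).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.01]
a, b, d = gram(x1, x2)

# The ridge penalty lam shifts both eigenvalues up, shrinking their ratio.
for lam in (0.0, 0.1, 1.0):
    print(lam, condition_number(a, b, d, lam))
```

A well-conditioned matrix says nothing by itself about predictive performance, which is the point made in the reply below.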

By: Andrew

Andrew — Mon, 07 Aug 2017 03:44:38 +0000

Msuzen:

Overfitting is a general concept—optimizing fit to training data does not optimize fit to test data—which does not need to have any connection to matrices at all.
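A small matrix-free illustration of that statement, offered as a sketch rather than anyone's actual analysis: a k-nearest-neighbour smoother on simulated data. The data-generating process (sine curve plus unit normal noise), the sample sizes, and the function names are all assumptions made for the example.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared prediction error of the k-NN smoother on `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def simulate(n, rng):
    """Draw n points from y = sin(x) + standard normal noise (made-up process)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 1.0)) for x in xs]

rng = random.Random(0)
train = simulate(300, rng)
test = simulate(1000, rng)

# k = 1 reproduces the training data exactly; k = 20 smooths over the noise.
for k in (1, 20):
    print(k, mse(train, train, k), mse(train, test, k))
```

With k = 1 the training error is exactly zero because each training point is its own nearest neighbour, yet that is the setting that does worse on the held-out draws.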

By: msuzen

msuzen — Mon, 07 Aug 2017 03:26:37 +0000

I had the impression that ‘overfitting’ was about the condition number of a design matrix. Not sure how ‘condition-number’ is interpreted in a Bayesian way, though.

By: Keith O'Rourke

Keith O'Rourke — Tue, 18 Jul 2017 18:03:00 +0000

In reply to Daniel Lakeland. Daniel: It was all explained in my DPhil thesis ;-) I'll put something together as it would be much simpler to put today than in 2007.

By: Carlos Ungil

Carlos Ungil — Tue, 18 Jul 2017 14:56:31 +0000

“People are worried about overfitting.”

Readers of Matt Levine know that people are constantly worried about bond market liquidity, unicorns, and many other issues. Overfitting is not a recurring worry yet, but it could become one!

“Overfitting is partly a statistical problem, about how we can extrapolate rules from data, but it is also a deep worry about whether the world is understandable, whether it is subject to rules, and whether those rules are comprehensible to humans.”

https://www.bloomberg.com/view/articles/2017-07-18/liquidity-bankruptcy-and-paperwork

By: Daniel Lakeland

Daniel Lakeland — Tue, 18 Jul 2017 13:39:34 +0000

In reply to Daniel Lakeland.

Keith, I don’t know to what extent maybe coarsening works better, that may be true, in fact that may be an essential aspect of what we’re really doing in Bayes (no measurements are infinitely fine, all measurements are coarse at some level, but usually we ignore this for convenience).

Still, I’m not quite sure what it means “often there is not a closed form likelihood function for that (something giving the probability of the observing those robust summary statistics for all points in the parameter space.)”

I think this comes out of a desire to have the likelihood represent frequency of something. Restricting yourself to that case is a mistake I think. In ABC for example you generate some forward estimate of the quantities of interest, and then you calculate a summary statistic or 2 or 3, and then you have several options:

1) accept if the summary statistics are within epsilon of a critical value (ie. differences are within epsilon of 0). This was the

2) accept with probability proportional to some continuous non-negative function with a peak at a critical value that decreases away from this peak, say normal(0,1) or the like.

3) accept with probability proportional to a joint function over the 2 or 3 summary statistics….

and at each stage we get closer to the idea that the likelihood is just any non-negative function. For inference we don’t even need it to be normalizable since we’ll always have a finite number of data points, but for prior or posterior predictive purposes we do need it normalizable.
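Options 1 and 2 above can be sketched as follows. This is a toy under stated assumptions, not a recipe: the model (normal data with unknown mean and unit sd), the uniform prior, the single summary statistic (the sample mean), and the tolerance 0.1 are all invented for illustration.

```python
import math
import random

def hard_accept(diff, eps):
    """Option 1: accept when the summary difference is within eps of 0."""
    return abs(diff) < eps

def kernel_weight(diff, scale):
    """Option 2: unnormalised normal-shaped weight, peaked at diff = 0."""
    return math.exp(-0.5 * (diff / scale) ** 2)

def abc_posterior_mean(observed_mean, n, draws, rule, rng):
    """Rejection ABC for a normal mean; returns the mean of accepted draws."""
    accepted = []
    for _ in range(draws):
        mu = rng.uniform(-5.0, 5.0)                   # draw from the prior
        sim = [rng.gauss(mu, 1.0) for _ in range(n)]  # forward simulation
        diff = sum(sim) / n - observed_mean           # compare summaries
        if rule(diff, rng):
            accepted.append(mu)
    return sum(accepted) / len(accepted)

rng = random.Random(1)
hard = abc_posterior_mean(1.0, 20, 10000,
                          lambda d, r: hard_accept(d, 0.1), rng)
soft = abc_posterior_mean(1.0, 20, 10000,
                          lambda d, r: r.random() < kernel_weight(d, 0.1), rng)
print(hard, soft)
```

The kernel in option 2 never needs a normalising constant here, since only the ratio of weights matters for which draws are kept.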

Is there an interpretation of what this means that makes sense philosophically and mathematically? I think the answer is yes, I’ll be very interested in what you have to say about it. I will put up some kind of blog posts and a paper for comments in the next … hopefully week or so.

By: Keith O'Rourke

Keith O'Rourke — Tue, 18 Jul 2017 13:22:13 +0000

In reply to Daniel Lakeland.

There is this idea in Bayes of conditioning not on the data itself but robust summary statistics.

The challenge that arises is that often there is not a closed form likelihood function for that (something giving the probability of observing those robust summary statistics for all points in the parameter space).

That thinking led to this paper, “Robust Bayesian inference via coarsening” by Jeffrey W. Miller and David B. Dunson, which seems to work better: https://arxiv.org/abs/1506.06101

By: msuzen

msuzen — Tue, 18 Jul 2017 13:18:19 +0000

It was a refreshing post. I always have a hard time telling people that cross-validation does not prevent ‘overfitting’. In industry, it is a common misconception, I think, especially if one has only a single model.

By: Daniel Lakeland

Daniel Lakeland — Tue, 18 Jul 2017 12:56:07 +0000

In reply to Daniel Lakeland.

Chris: the biggest thing that comes to mind for me regarding robust statistics is things like m-estimators with insensitive “cost” functions, winsorizing, and so forth. The goal is to be insensitive to the fact that some fraction of the data is corrupted, things like digit transpositions and instruments that get unplugged, and intermittent electrical noise, and whatever, where you don’t have a model for what happened, you just know that sometimes it’s really very different from what usually happens.

I can think of two aspects in Bayes, the first is something like what you said, choosing distributions that allow for outliers, choosing finite mixture models where some fraction of the observations come from much more extreme distributions, and so forth. The second is something more like ABC, where using some computational method you generate “forward” data predictions, and then write your “likelihood” in terms of statistics of how well the whole set of data predictions match the whole set of data. You could for example calculate Tukey’s biweight error function, and then specify a probability distribution over the result.

If you see a likelihood as a *frequency* distribution of errors, then this makes no sense. But if you see a likelihood as a measure of something else, this makes perfect sense. I’m working on a paper where I describe “what else” I think the Bayesian posterior measures. I think ojm has convinced me that “truth of a proposition” is not in general the best way to think about what Bayes does. Ironically, one area where “truth of a proposition” makes good sense is when you’re using an RNG to sample from a finite population. Then, there really *is* a “true” say mean, or sd or range or median. Sometimes that is a fine model of what you’re doing, but other times you know that there’s no “Truth” involved, just some kind of process you’re describing and a model that has more “goodness” vs more “badness”.
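The biweight idea above can be sketched roughly like this. Tukey's biweight loss is standard, and c = 4.685 is a conventional tuning constant; the exponential distribution placed over the total error, and the names `pseudo_log_likelihood` and `rate`, are assumptions chosen only to make the example concrete.

```python
import math

def tukey_biweight(residual, c=4.685):
    """Tukey's biweight loss: near-quadratic around 0, constant for |r| > c."""
    if abs(residual) > c:
        return c * c / 6.0
    u = 1.0 - (residual / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)

def pseudo_log_likelihood(observed, predicted, rate=1.0):
    """Hypothetical 'likelihood': exponential density on the total biweight
    error between the whole data set and the whole set of predictions."""
    total = sum(tukey_biweight(o - p) for o, p in zip(observed, predicted))
    return math.log(rate) - rate * total

predicted = [1.0, 2.0, 3.0, 4.0]
print(pseudo_log_likelihood([1.0, 2.0, 3.0, 4.0], predicted))
print(pseudo_log_likelihood([1.0, 2.0, 3.0, 400.0], predicted))
print(pseudo_log_likelihood([1.0, 2.0, 3.0, 4000.0], predicted))
```

Because the loss is capped, a corrupted reading of 400 and one of 4000 cost exactly the same, which is the insensitivity described above.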

By: Chris Wilson

Chris Wilson — Tue, 18 Jul 2017 12:00:59 +0000

In reply to Daniel Lakeland. ojm: I'm guessing you're not talking about things like modeling noise with t-distribution rather than normal when you say "robust statistics", because obviously that is totally compatible with Bayesian approach. But I'm curious what kind of "robust statistics" you have in mind. Are we talking about approaches that still work by specifying likelihoods? If so, how are they incompatible with Bayes?

By: ojm

ojm — Tue, 18 Jul 2017 05:50:15 +0000

In reply to Daniel Lakeland.

> Robust statistics is totally compatible with Bayes.

Well…this is actually quite a subtle issue. For precisely the reasons discussed here.

As an example, the last chapter of the second edition of Huber’s book ‘Robust statistics’ is called ‘Bayesian Robustness’ but is fairly negative on the feasibility/compatibility.

First he discusses e.g. prior sensitivity studies which he sees as pretty orthogonal to the main issues. Then he considers more relevant solutions but most of these seem to amount to pretty much abandoning the main tenets of the Bayesian approach.

To misquote Andrew – ‘Once you have abandoned literal belief in the B[ayes], the question soon arises: why follow it at all?’

[PS Andrew has said something to the effect that Huber wasn’t aware of posterior predictive checks etc but I’ve come to the position that this doesn’t really address the key issues raised]

By: Daniel Lakeland

Daniel Lakeland — Tue, 18 Jul 2017 05:21:00 +0000

In reply to John Jumper.

Right, but by respecifying the model as “Here is a computer program that takes N numbers as inputs, and it outputs a K x K grid of pixels” and then specifying a prior over the numbers a[i] ~ normal(0,1) or whatever you like, and specifying a comparator C(Image, GeneratedImage) in terms of an appropriate information theoretic metric, and giving a probability distribution over the output of the comparator function (a “likelihood”), we can infer a density over the N numbers which, in the high probability manifold of the a values produces K x K grids of pixels that are not bad approximations to the image we’re “compressing”

However, the sense in which it’s “not bad” is the sense in which C(Image, GeneratedImage) is in the high probability manifold of the “likelihood”. If that happens to correspond to the sense in which “this looks like the picture to me” then your choice of C and your choice of “likelihood” also corresponded to the theory “when C is in the high probability region of my likelihood, the picture will “look like the original”.

As I said, if you compare people by their eye color you will find that the “similar” people may in fact have wildly different personalities. On the other hand, if you compare them by closeness in Myers-Briggs scores, they might be pretty similar personalities, but the eye colors could vary widely.

By: John Jumper

John Jumper — Tue, 18 Jul 2017 04:00:08 +0000

In reply to John Jumper. Yes, the model is definitely misspecified in the technical sense. You need a pretty complex model since the data generating process is basically "pixels from a picture of someone's room". You can imagine that there is a very large excess of dimensions but identifying the hidden manifold is the whole point of the exercise.

By: Daniel Lakeland

Daniel Lakeland — Tue, 18 Jul 2017 03:42:32 +0000

In reply to Daniel Lakeland. Ojm: Robust statistics is totally compatible with Bayes. The real issue is that many people identify Bayes with generative modeling of individual data points. The ABC approach is seen as an approximation. But it's a first class citizen as soon as you recognize a deeper truth. Bayes is still constrained by frequency interpretations for many people. They think they need to give probability as frequency distributions over data points, not over say nonlinear transforms of whole data sets. But that isn't a requirement in any way.

By: Daniel Lakeland

Daniel Lakeland — Tue, 18 Jul 2017 03:20:43 +0000

In reply to John Jumper.

Hmm… this is an interesting paper. I suspect that the non-existence of a density is however an indicator of model misspecification (that is, if your problem occurs on a low dimensional manifold embedded in some higher dimensional space, then it’s your excess of dimensions that’s the issue, not the lack of density). As for the use of various divergences/distances I see these as measures of how well a theoretical frequency distribution fits an observed frequency distribution. That some measures of this goodness of fit are better than others for getting models that make sense is no different than for example using a gamma distribution for the errors in your regression when they have some skewness, or a t distribution when they have outliers, instead of say a normal.

The role that the metric / pseudo-metric plays is a comparison between model and data. If I compare two people by the color of their eyes, it’s no surprise that I can’t distinguish when one of them is rude and the other is kind and caring. In general, measures of comparison are important because they describe what it means to be a good fit vs a bad fit, in other words, they describe what it is you’re modeling.

By: ojm

ojm — Tue, 18 Jul 2017 00:43:22 +0000

In reply to John Jumper.

Thanks for the link :-)

I definitely do think the issue is practical. Bayesians tend not to agree but people like eg Huber, Tukey etc from robust stats and data analysis also think/thought it was practical. It’s not surprising that the paper you link comes from the machine learning world – at least some of them seem to have a fair amount of influence from and/or overlap with those areas of stats. Again, someone like Vapnik comes to mind too.

By: Tom Dietterich

Tom Dietterich — Mon, 17 Jul 2017 22:51:52 +0000

In reply to Daniel Simpson. I guess I need to be more careful about my wording. The Bayesian procedure correctly transforms the uncertainty of the prior into the uncertainty of the posterior under the assumption that the likelihood model is correct. It does not quantify or account for our uncertainty about our modeling choices (beyond those that can be captured by the prior). Indeed, I believe such an accounting is impossible in a finite model.

By: Corey

Corey — Mon, 17 Jul 2017 19:49:47 +0000

In reply to Cliff AB. Stan uses transformations to take parameters supported on intervals or the positive reals to the entire set of reals. Even if the density grows without limit as it approaches a boundary of the support set, as long as the density in the original parameterization is normalizable the transformation will inevitably flatten out. If the divergence isn't on the boundary I imagine Stan will choke on the discontinuity.
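A worked toy version of the boundary case, as a sketch under stated assumptions: a Beta(1/2, 1/2) density on (0, 1), which is unbounded at both ends but normalizable, mapped to the whole real line with a logit transform plus its Jacobian. This mirrors the kind of interval transform described above in simplified form; it is not Stan code, and the function names are illustrative.

```python
import math

def beta_half_density(x):
    """Beta(1/2, 1/2) density on (0, 1); unbounded as x -> 0 or x -> 1."""
    return 1.0 / (math.pi * math.sqrt(x * (1.0 - x)))

def inv_logit(y):
    """Map an unconstrained real y back to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def unconstrained_density(y):
    """Density of y = logit(x), including the Jacobian |dx/dy| = x * (1 - x)."""
    x = inv_logit(y)
    return beta_half_density(x) * x * (1.0 - x)

# The constrained density blows up near the boundary...
print(beta_half_density(1e-8))

# ...while on the unconstrained scale it is bounded and decays in both tails.
# (Moderate y values only: for very large |y|, inv_logit rounds to 0 or 1.)
for y in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(y, unconstrained_density(y))
```

Algebraically the transformed density is sqrt(x(1 - x)) / pi, so the Jacobian factor more than cancels the divergence at the boundary.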

By: John Jumper

John Jumper — Mon, 17 Jul 2017 13:49:22 +0000

In reply to ojm.

Sometimes the difference between information theory metrics and other metrics can even be practically important: https://arxiv.org/pdf/1701.07875.pdf.

By: Keith O'Rourke

Keith O'Rourke — Mon, 17 Jul 2017 13:04:43 +0000

In reply to Daniel Simpson.

Cool and the second gets even cooler when considered as a spectrum https://en.wikipedia.org/wiki/Apophenia#.22Randomania.22 that in turn provides supports for ESP ;-)

By: Keith O'Rourke

Keith O'Rourke — Mon, 17 Jul 2017 12:50:02 +0000

In reply to ojm.

You are both not too wrong: continuity is required for possibilities but does not apply to actualities.

Daniel’s Goldbach’s conjecture is a nice example – there are “can be” numbers that violate it but you won’t encounter any that “are” violations. As Peirce put it, existence breaks the continuum of possibilities.

By: Martha (Smith)

Martha (Smith) — Mon, 17 Jul 2017 04:11:55 +0000

In reply to Ruben.

Ruben,

I don’t think the etymology is from tailoring. My understanding is that first the usage “fit a curve to the data” originated; then I would guess that the word “overfitting” was adopted by analogy with words like “overdoing it”. Perhaps you haven’t encountered the quote from John Von Neumann that says “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

By: Daniel Simpson

Daniel Simpson — Mon, 17 Jul 2017 03:34:17 +0000

In reply to Ruben.

I suspect the etymology is just over+fit.

But two cool things:
https://en.wikipedia.org/wiki/Ganzfeld_effect
https://en.wikipedia.org/wiki/Apophenia

By: Ruben

Ruben — Mon, 17 Jul 2017 01:49:29 +0000

What is the etymology of overfitting, if it is known? I always thought it was a metaphor from tailoring, if your bespoke suit fits you too closely (sample), it will rupture when you move (out-of-sample).
In that sense I considered it a good metaphor, because nontechnical people easily get it this way.
But English is my second language and some googling seems to show tailors wouldn’t actually use this word.
Leaves me to wonder: is this not a problem tailors have?

By: ojm

ojm — Mon, 17 Jul 2017 00:59:15 +0000

In reply to ojm.

Thanks.

Funnily enough, as discussed previously I think, when it comes to resurrecting infinitesimals I prefer so-called smooth infinitesimal analysis to NSA ;-)

PS at what point do you think Andrew bans us for treating his blog as a miscellaneous math/stat/whatever forum. He’s remarkably tolerant…

By: Daniel Lakeland

Daniel Lakeland — Mon, 17 Jul 2017 00:38:22 +0000

In reply to ojm.

We may not be as far apart as you think. I just found this paper:

https://arxiv.org/pdf/1703.00425.pdf

See page 40 (as well as the rest of the paper, seems really interesting and good overview of some ideas) on the link between intuitionism and NSA in which they describe the relationship between a nonstandard number in IST and the idea of a non-constructable number.

For example, suppose you can prove in a non-constructive way that Goldbach’s conjecture is false, but for *every computable number* it’s true. Then in some sense “in the real world” all the numbers satisfy Goldbach’s conjecture, and only “theoretical” numbers inaccessible to any construction technique may violate it.

This divides the world into naive integers that satisfy Goldbach and are constructive, and non-naive integers that don’t. This is essentially the same partition as “standard” and “nonstandard”.

So, the idea of IST in which some of the real numbers/integers etc are standard and some are not, but they’re all just real numbers… this has a flavor of intuitionism / constructability. I’m very much in favor of treating mathematics algorithmically.

I don’t dislike continuous math, I just like it as an interpretation over a set of algorithms indexed by N, the fineness of a grid. The “continuous” concept is just a function of N, where N is free, which when applied to N returns a particular discretization. The nonstandard view is then that for all N sufficiently large (nonstandard), the results are the same to within the power of algorithmic construction to distinguish.

By: ojm

ojm — Mon, 17 Jul 2017 00:32:54 +0000

In reply to Daniel Lakeland.

Someone should design a survey to determine what best predicts the ‘eek’ and ‘eh’ classes of responses to this.

Hopefully the results don’t depend on whether the analysis is eek-based or eh-based…

By: Daniel Simpson

Daniel Simpson — Mon, 17 Jul 2017 00:18:50 +0000

In reply to Daniel Lakeland.

I can’t see the eek. It’s like a magician cutting a woman in half. The first step is to make sure she’s not in the critical part of the box. Once you’ve got that much freedom, you can do almost anything you like. (For instance, that paper shows that you can’t trust posterior predictive checks because, as they are posterior functionals and you’re allowed to move your assistant around the box, they are also brittle.)

But this has drifted quite far from overfitting.

By: ojm

ojm — Sun, 16 Jul 2017 23:47:58 +0000

In reply to Daniel Lakeland.

I don’t think the Owhadi et al paper is necessarily the best expression of the basic issues (which, as Christian Hennig points out there, go back at least to the robust stats folk, as well as e.g. Vapnik and all those folk) but yes, it is based on the same issues.

It really is quite an important sticking point for many – some look at such things and say ‘eh, Bayes in the strong topology is still fine by me’ and others go ‘eek, maybe it’s not for me.’

There are plenty of respectable folk on either side of the issue. I’m more of an unrespectable type who’s been converted to the ‘eek’ side.

By: Daniel Simpson

Daniel Simpson — Sun, 16 Jul 2017 23:27:15 +0000

In reply to Daniel Lakeland.

Sorry. That was obviously vague. What I was trying to say is that there’s nothing deep about a finite set of data not being able to separate models. It’s just true unless you put some serious mathematical effort into making it not true. So a topology based on empirical cdfs (by definition from finite data sets) is too weak to worry about. It’s also a data dependent topology, which is hard to think about when you’ve got a sequential experiment.

It’s not vacuously true that entropy-based topologies aren’t too strong. Although Higdon’s comment here talks about the games you can play with intermediate topologies: https://xianblog.wordpress.com/2013/09/11/bayesian-brittleness-again/#comments

And yes you’ve got to do predictive checks, but you’ve also got to build a good model first. Overfitting is a property of model+data. If the model doesn’t allow for overfitting it can’t happen. If the data is strong enough to prevent overfitting it can’t happen (although this is less likely in high dimensions).

There’s a mirror to this entire conversation about underfitting.

By: ojm

ojm — Sun, 16 Jul 2017 22:30:06 +0000

In reply to Daniel Lakeland. (which is at least one reason why you have to do predictive checks)

By: ojm

ojm — Sun, 16 Jul 2017 22:17:49 +0000

In reply to Daniel Lakeland. And of course you only find out two models are indistinguishable given the data when you go back to the weak topology. In the strong topology you think they're different.

By: ojm

ojm — Sun, 16 Jul 2017 22:04:39 +0000

In reply to Daniel Lakeland.

‘All you’re saying…vacuously true’

Sometimes these sort of points are important points.

As far as I can tell you have no theoretical justification for preferring simplicity. The ideas can be analysed in terms of empirical prediction, regularisation theory etc which attempt to understand when and why we might prefer simplicity to complexity and when not. See eg Vapnik for one attempt.

Fine if you want to start from particular axioms and treat all exceptions as vacuous truths, but that’s not really appealing to me personally.

By: Daniel Simpson

Daniel Simpson — Sun, 16 Jul 2017 21:54:15 +0000

In reply to Daniel Simpson. The once and future data. This is stats, so I'm assuming that at some point in this process there is data and that's the data I'm talking about.

By: Daniel Simpson

Daniel Simpson — Sun, 16 Jul 2017 21:51:35 +0000

In reply to Daniel Lakeland. All you're saying is that a finite data set can't always tell between two models. That's vacuously true, so it doesn't bother me. In the context, these aren't arbitrary models, it's a (typically) one dimensional sub-manifold of a well behaved model space. And the choice that has been made is that if there are two indistinguishable models, we'll go with the simplest one.

By: ojm

ojm — Sun, 16 Jul 2017 21:43:57 +0000

In reply to Daniel Simpson. ( a key point required for my answer to not contradict my others is that imo you need a concept of identifiability, or overfitting..., in order to estimate a complexity parameter. So you can use standard tools but the goal is not quite the same as standard estimation assuming identifiability)

By: Daniel Lakeland

Daniel Lakeland — Sun, 16 Jul 2017 20:30:35 +0000

In reply to Michael Betancourt. Yeah, I think this is ok for debugging purposes. I mean for example suppose you want some big vector parameter to encode some function, and you have some idea of what that function should look like, you provide a prior, attempting to encode your knowledge, and you run Stan in this way to get some small number of samples. If the function doesn't look anything like what you thought it was going to, you know you made a mistake encoding your information... figuring this out after 3 minutes instead of 3 hours is great!

By: Michael Betancourt

Michael Betancourt — Sun, 16 Jul 2017 20:17:47 +0000

In reply to Daniel Lakeland. "The general idea that the typical set is the set with log probability density nearly some constant, can we estimate that constant from the path history on the way to the peak? Then perhaps draw some small number of samples in the vicinity of whatever step was closest to the typical set on your way to the peak?" You can do this, but not the way you describe. This is essentially how nested sampling works, but then you only get a few points from the typical set which makes function expectation estimates highly imprecise (even assuming that the rest of nested sampling is working ideally, which can be a stretch in practice).

By: ojm

ojm — Sun, 16 Jul 2017 20:03:45 +0000

In reply to Daniel Simpson.

> But how do you do that?

You mean determine the effective number of parameters that can be estimated given the data? The usual ways for estimating complexity (or whatever) parameters would be a start, right? Training/test, CV, empirical Bayes, hierarchical Bayes, profiling etc etc.

> And then given that you’ve restricted to identifiable submanifolds, how do you set priors?

Probably doesn’t matter too much at this point, but I suppose however you normally set priors?

By: Martha (Smith)

Martha (Smith) — Sun, 16 Jul 2017 19:43:54 +0000

In reply to Daniel Simpson.

“Regarding data+model, you need to allow for the case where the data *needs* one of the higher complexity models to describe it. ”

This statement sounds like you are talking about “the data that have been collected,” which doesn’t make sense to me in context. Is what you are really talking about “the type of situation from which you are collecting what type of data”? That would make sense to me.

By: Daniel Lakeland

Daniel Lakeland — Sun, 16 Jul 2017 15:23:22 +0000

In reply to Daniel Simpson. I see elsewhere you mention you have limited background on Kolmogorov Complexity. So, in this sense, just think of the complexity as the length of the smallest lambda expression that computes the same model. The point of something like a GP is that by specifying the covariance function, using just a few symbols, something like c(a,b) = s*exp(-((a-b)/l)^2)+n you have encoded all of the information needed to do the computation. Whereas for something like the dummy variables approach with say independent priors on each coefficient, you need to specify a prior over each dummy coefficient, in terms of 365 n digit numbers for locations and 365 n digit numbers for scales, and maybe some number of parameters for shape parameters at each point. The infinite dimensional thing is just a mental device for getting you a compact lambda expression for your finite dimensional model. It provides a structure to generate compact models with N, the number of data points, a free parameter, which nevertheless in every actual usage gets bound to some actual finite value.
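To make the "few symbols" point concrete, here is a sketch that expands the covariance function above into a full 365 x 365 matrix from three numbers. One assumption needs flagging: n is read as a noise (nugget) term added only when a == b, which the formula as written does not spell out. The hyperparameter values are invented.

```python
import math

def covariance(a, b, s, l, n):
    """c(a, b) = s * exp(-((a - b) / l)^2), plus n when a == b (assumed nugget)."""
    value = s * math.exp(-((a - b) / l) ** 2)
    return value + n if a == b else value

def covariance_matrix(days, s, l, n):
    """Expand the three-number description into the full covariance matrix."""
    return [[covariance(a, b, s, l, n) for b in days] for a in days]

days = list(range(365))
K = covariance_matrix(days, s=2.0, l=30.0, n=0.1)

# 365 * 365 = 133225 entries, all determined by s, l and n.
print(len(K), len(K[0]), K[0][0], K[0][1], K[0][100])
```

The dummy-variable alternative would need its locations and scales written out one day at a time, which is the difference in description length being argued for.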

By: Moritz

Moritz — Sun, 16 Jul 2017 15:01:33 +0000

Overfitting is when your model learns too much from the data.

By: John Jumper

John Jumper — Sun, 16 Jul 2017 13:26:40 +0000

In reply to Daniel Simpson.

> the question I have is for what types of experiment is “prefer simplicity” not, pre data, a sensible prior state?

My first inclination would be to think of systems where simplicity emerges from averaging over disorder. It is not a typical system for Bayesian inference but spin glasses (https://en.wikipedia.org/wiki/Spin_glass) fit the bill. They have a simple phase transition that emerges from a randomly-coupled model. In this case, the phenomenon is quite simple and physically realistic but emerges precisely because of the high entropy/complexity of the parameters. I admit this is an unusual system, although it is a pretty good way to think about things like emergent behavior of social graphs, where you would similarly expect apparently random coupling between individuals.

By: Carlos Ungil

Carlos Ungil — Sun, 16 Jul 2017 08:41:16 +0000

> Even if the resulting datasets (actual numbers) can come out the same to whatever sig fig under models arbitrarily far apart in the strong topology?

“if reality does not fit the concept, too bad for reality”

It seems to me that “over-fitting” is being defined here simply as “complexity” and not as “poor generalization” (which may of course be the consequence of “over-complexity”).

By: Daniel Lakeland

Daniel Lakeland — Sun, 16 Jul 2017 07:11:19 +0000

In reply to Daniel Simpson. Here's where I see all this going. Every scientific model provides some rule for generating predictions. Imagine it as a lambda calculus expression. The rule can be applied to any number of scenarios. When the rule is probabilistic, you can imagine it augmented with an arbitrarily long sequence of random binary digits from which it generates random numbers. In this sense, every model is capable of doing an arbitrary number of calculations, if you apply it to an arbitrarily large number of questions. But in actual fact, it will only ever be used to do some finite number of calculations. In most cases when people choose to use "infinite dimensional" models, it's to *reduce* the Kolmogorov complexity of the model. The structure imposed by the infinite dimensional model makes fewer outcomes possible and the lambda expression shorter. The finiteness of the number of questions asked of the model is very relevant to the goodness of the model. A model capable of extrapolating to before the big bang which is never asked to do so is not worse because it would give nonsense in that scenario. If your Gaussian process was chosen for the purpose of modeling a particular 365 day dataset, and is never asked to generate any more than 365 dim vectors, it's hard to see in what way it's anything other than a 365 dimensional model specified in a Kolmogorov complexity lower than the dummy variable approach.

By: ojm

ojm — Sun, 16 Jul 2017 06:06:28 +0000

In reply to Daniel Lakeland. Even if the resulting datasets (actual numbers) can come out the same to whatever sig fig under models arbitrarily far apart in the strong topology? I mean, tbh I'm happy for people to be happy with the strong topology, but personally I'm not (anymore).

By: Daniel Simpson

Daniel Simpson — Sun, 16 Jul 2017 05:59:38 +0000

In reply to Daniel Simpson. Boring.

By: Daniel Simpson

Daniel Simpson — Sun, 16 Jul 2017 05:58:41 +0000

In reply to Daniel Lakeland. Ah ok. That doesn't worry me so much. I'm happy with the strong topology

By: ojm

ojm — Sun, 16 Jul 2017 05:47:42 +0000

In reply to Daniel Lakeland.

> Kolmogorov complexity

Not that…

> Is it the L-infinity norm between empirical CDFs?

Yup that. The basic idea is that the KL is a strong ‘metric’ (or whatever it’s really called) while the Kolmogorov is a weak one. You get to densities from distributions via an unbounded operator: differentiation.
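A minimal sketch of the weak side of that contrast: the L-infinity (Kolmogorov) distance between two empirical CDFs, on made-up samples. The second sample is the first shifted by 0.001, so the two empirical distributions share no support points (and the KL divergence between them is infinite), yet they are close in the Kolmogorov sense. The function name and the samples are assumptions for illustration.

```python
from bisect import bisect_right

def kolmogorov_distance(sample_a, sample_b):
    """Sup-norm (L-infinity) distance between two empirical CDFs.

    Both ECDFs are step functions that change only at data points, so it is
    enough to compare them at every observed value.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

grid = [i / 100 for i in range(100)]
shifted = [x + 0.001 for x in grid]

print(kolmogorov_distance(grid, shifted))   # small: about one step of the ECDF
print(set(grid) & set(shifted))             # no shared support points
```

Going the other way, two samples can be made arbitrarily close in this distance while the models that generated them stay far apart in KL, which is the situation raised earlier in the thread.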