Louis, one subtlety in Consequence 2 is that it concerns simulated data, not actual data: the “posterior” in Consequence 2 is what you would obtain after updating with a single simulated datapoint $y^{(0)}$, so the actual data do not matter, because they are not what is being described. It’s quite a particular posterior being described in Consequence 2, not the posterior you normally get after updating with data.

I agree with Louis in that I don’t see how $\theta^{(0)}$ is a draw from $p(\theta|y^{(0)})$, because $\theta^{(0)}$ was drawn from the prior $p(\theta)$, so the “Bayes step” of multiplying prior by likelihood was not performed.

When summarizing / approximating generated data, the umbrella cry of “overfitting” can be attributed to any of these sources.

If it’s thought of as ABC, one of the earlier references is Rubin’s “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician” (1984): https://projecteuclid.org/euclid.aos/1176346785

But Francis Galton also noticed and made use of it: Stigler, Stephen M. 2010. Darwin, Galton and the statistical enlightenment. Journal of the Royal Statistical Society: Series A 173(3):469–482. http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.2010.00643.x/abstract

(So Andrew went to the RSS talk in London?)


By the way, you can include LaTeX in WordPress blogs. The problem is there’s no way to edit comments on WordPress unless you’re a blog administrator.

One problem in the literature is that “overfitting” is never precisely defined anywhere. Wikipedia is no help here—by the time they’ve moved into machine learning, they’re talking about overfitting arising from not stopping an iterative algorithm soon enough (i.e., not enough ad hoc regularization).

I think there is an important point underlying this – the argument is obvious, but it’s not obvious to many (most?), and is likely a taboo topic for many (i.e., mixing Bayesian and frequentist perspectives).

It’s unlikely to be discussed in BDA or in most other texts (some realize it clearly in the usual mathematical expositions of Bayes, but I think it’s rare).

When I used these ideas of two-stage sampling to explain Bayes to epidemiology students in 2005, they googled it, found nothing, and so announced in the next class that I must be wrong. The only published reference I could find on it at the time was a preprint of Cook et al. So in over 10 years nothing has changed: http://andrewgelman.com/2006/12/18/example_of_a_co/

I do think there should be some pedagogical advantage of using this two-stage sampling approach with simulation in teaching Bayes even to aspiring professional statisticians.

Simulate a joint distribution and form 95% intervals:

1. Note, conditional on y’ these contain the theta that generated that y’ 95% of the time, for every y’ – a probability interval.

2. Note, conditional on theta’ these contain the theta’ that generated the data sometimes more and sometimes less than 95% of the time – not a probability interval or a confidence interval.

3. Note, you can search for a prior by re-weighting the prior (to change it and the intervals) so that it provides (approximately) 95% (or more) coverage conditional on every possible theta – an (approximate) confidence interval and a _reference_ prior.
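Notes 1 and 2 can be checked directly by simulation. Here is a minimal sketch using a conjugate normal model of my own choosing (theta ~ N(0,1), y | theta ~ N(theta,1)), where the exact 95% posterior interval has closed form, so no MCMC is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two-stage sampling from the joint: theta ~ N(0,1), then y | theta ~ N(theta,1).
theta = rng.normal(0.0, 1.0, n)
y = rng.normal(theta, 1.0)

# Exact posterior for this conjugate model is N(y/2, 1/2), so the
# exact 95% posterior interval is y/2 +/- 1.96/sqrt(2).
half = 1.96 / np.sqrt(2.0)
lo, hi = y / 2 - half, y / 2 + half
covered = (lo < theta) & (theta < hi)

# Note 1: over the joint draws (conditional on y', averaged over y'),
# coverage is 95% -- a probability interval.
print(f"marginal coverage: {covered.mean():.3f}")  # close to 0.950

# Note 2: conditional on a fixed theta', coverage can be far from 95%.
for theta_fixed in (0.0, 3.0):
    y_rep = rng.normal(theta_fixed, 1.0, n)
    lo_r, hi_r = y_rep / 2 - half, y_rep / 2 + half
    cov = ((lo_r < theta_fixed) & (theta_fixed < hi_r)).mean()
    print(f"coverage at theta = {theta_fixed}: {cov:.3f}")
```

With this model the conditional coverage is well above 95% near the center of the prior and well below it in the tails, which is exactly the point of note 2.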

At some point you will want to take them to Michael’s level of discourse if you can (http://andrewgelman.com/2017/04/12/bayesian-posteriors-calibrated/#comment-464863), but only after they get what it’s all about.

3′. Note that if you are successful in 3 and get a reference prior and confidence intervals, you can’t use that posterior as a prior for the next sample you get and still have confidence intervals (though they will still be probability intervals for that posterior taken as a prior).

(a) sample first theta ~ p(theta) and then y ~ p(y|theta)

or the other

(b) sample first y ~ p(y) and then theta ~ p(theta|y)

If you repeat procedure (a) many times and look at the distribution of theta conditional on y, it will indeed be distributed as p(theta|y).
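A quick way to convince yourself of this is a discrete simulation; the prior and likelihood values below are arbitrary illustrative choices, not from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Procedure (a): theta ~ p(theta), then y | theta ~ p(y|theta).
# Illustrative model: theta in {0,1} with p(theta=1) = 0.3,
# and y | theta ~ Bernoulli(0.8 if theta == 1 else 0.2).
theta = (rng.random(n) < 0.3).astype(int)
p_y = np.where(theta == 1, 0.8, 0.2)
y = (rng.random(n) < p_y).astype(int)

# Condition on y = 1 and look at the distribution of theta.
empirical = theta[y == 1].mean()

# Bayes' rule: p(theta=1 | y=1) = 0.3*0.8 / (0.3*0.8 + 0.7*0.2)
exact = 0.3 * 0.8 / (0.3 * 0.8 + 0.7 * 0.2)
print(f"empirical: {empirical:.3f}, exact: {exact:.3f}")
```

The empirical conditional frequency matches the posterior probability from Bayes’ rule, even though no “Bayes step” was ever performed in the simulation.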

But I agree that the way it is presented in the post is confusing: how can theta be a random draw from a distribution conditional on the data, which is in turn conditional on that very same value of theta?

The post suggests that if I simulate $\theta^{(0)}$ according to some prior $p(\theta)$, then this is also a draw from the posterior $p(\theta|y^{(0)})$, regardless of what the data look like (see Consequence 2).

I am a bit puzzled by this.

Nowadays I think of, e.g., 90% frequentist calibration as “90% worst-case Bayesian *predictive* calibration”. To flesh this out: if I = I(y, theta) is the event that the interval formed from data y covers the hidden theta, then

1. Bayesian prior predictive is Pr{I(replicate y, theta)} = 90%

2. Bayesian posterior is Pr{I(observed y, theta) | observed y} = 90%

3. Bayesian posterior predictive is Pr{I(replicate y, theta) | observed y} = 90%

4. Frequentist coverage is Pr{I(replicate y, theta) | theta} = 90% for the worst-case choice of theta.

Note that the probability in (4.) is equivalent to

1. The Bayesian prior predictive with a point mass prior on the worst case theta

3. The Bayesian posterior predictive with a point mass prior on the worst case theta

because given theta, the replicate and observed data are iid.
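The equivalence can be checked by simulation: with a point-mass prior at some theta0, every “draw from the prior” is just theta0, and the prior predictive coverage reduces to the frequentist coverage at theta0. A sketch, using the standard 90% z-interval y ± 1.645 for a single N(theta, 1) observation (my illustrative choice of model and interval, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(2)

def coverage_point_mass_prior(theta0, n=200_000):
    """Bayesian prior predictive coverage with a point-mass prior at theta0.

    theta ~ point mass at theta0 (every draw is theta0), then y ~ N(theta, 1);
    this is exactly the frequentist coverage Pr{I(y, theta) | theta = theta0}.
    """
    theta = np.full(n, theta0)           # "draws" from the point-mass prior
    y = rng.normal(theta, 1.0)
    lo, hi = y - 1.645, y + 1.645        # standard 90% z-interval
    return ((lo < theta) & (theta < hi)).mean()

# The z-interval has ~90% coverage at every theta0, so here the
# worst case over theta0 is also ~90%.
for theta0 in (-3.0, 0.0, 5.0):
    print(f"theta0 = {theta0}: coverage = {coverage_point_mass_prior(theta0):.3f}")
```

For this particular interval the coverage happens to be the same at every theta0; in general the worst-case theta0 is what point (4.) refers to.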

On the other hand, I don’t think there’s any way to sensibly compare the frequentist and posterior coverage in general.

In my personal philosophy of statistics this translates to: “although frequentist coverage and Bayesian *predictive coverage* are about the same thing (worst vs. average case), frequentist coverage and posterior coverage are answering in-principle different questions, and so aren’t really comparable”.

Sampling from the prior and the conditional likelihood produces a frequency-calibrated sample from the posterior.

It’s not that the posterior is somehow “calibrated by definition” but rather that this sampling process converts probability into frequency.

The ensemble is great for testing, but you need much stronger conditions to guarantee good performance on a given analysis. Then again, that’s why we have posterior predictive checks (both in sample and out of sample).

A lot of people seem to think that overfitting is a phenomenon due to using the same data to evaluate the model as you used to estimate the parameters, but it seems to have as much to do with the choice to use optimization to estimate the parameters rather than drawing from their posterior distribution with NUTS. That, and specifying bad models.
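A toy illustration of the optimization point, under an assumed normal hierarchical setup of my own (theta_j ~ N(0,1), y_j ~ N(theta_j, 1)): the maximum-likelihood estimate of each theta_j is just y_j, while the posterior mean shrinks it by half, and the shrunk estimates predict replicate data better. Closed-form shrinkage stands in for actual posterior sampling with NUTS here.

```python
import numpy as np

rng = np.random.default_rng(3)
J = 100_000

# Hierarchical toy model: theta_j ~ N(0,1), y_j ~ N(theta_j, 1).
theta = rng.normal(0.0, 1.0, J)
y = rng.normal(theta, 1.0)
y_rep = rng.normal(theta, 1.0)   # out-of-sample replicate for the same thetas

mle = y              # optimization: each theta_j estimated by its own y_j
post_mean = y / 2.0  # exact posterior mean under the conjugate model

mse_mle = np.mean((mle - y_rep) ** 2)         # overfits: chases noise in y
mse_post = np.mean((post_mean - y_rep) ** 2)  # shrinkage helps out of sample
print(f"MLE MSE: {mse_mle:.3f}, posterior-mean MSE: {mse_post:.3f}")
```

The overfitting shows up without any reuse of data for evaluation: the MLE loses purely because optimizing the likelihood fits the noise that the posterior averages over.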
