End of thinking out loud on other people’s blogs…for now

]]>I suppose the same sort of idea applies – an ‘evidence’ statistic (or statistics) and the stability of this evidence statistic under challenges – but I’m not too sure how one would provide guidelines on measuring ‘evidence’ in the first place.

Example: murder weapon found in bedroom. Clearly ‘evidence’. Stability challenge: was found by cop with history of planting evidence.

Less clear: genetic evidence matches to high but not certain probability. Clearly ‘evidence’. But only versus those who lack same genetic signature. None vs someone who also matches. Is this a question of evidence measure or of stability or both or?

]]>a) compute some summaries of interest eg a straight line summary

b) think about how these vary under sensible ‘challenges’ eg dropping or resampling observations etc
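As a rough illustration of a) and b), here is a minimal R sketch. The data, the choice of the slope as the summary, and the names `slope`, `slope_drop` and `slope_boot` are all made up for the example; it is one way to look at stability, not a recommended procedure.

```r
set.seed(1)
# Hypothetical data: a noisy straight line (invented for illustration)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

# a) summary of interest: the fitted slope of a straight line
slope <- function(x, y) unname(coef(lm(y ~ x))[2])
slope_full <- slope(x, y)

# b) challenge 1: drop each observation in turn
slope_drop <- sapply(seq_along(x), function(i) slope(x[-i], y[-i]))

# b) challenge 2: resample observations with replacement
slope_boot <- replicate(500, {
  idx <- sample(seq_along(x), replace = TRUE)
  slope(x[idx], y[idx])
})

slope_full
range(slope_drop)                        # spread under dropping one point
quantile(slope_boot, c(0.025, 0.975))    # spread under resampling
```

If the summary barely moves under these challenges it is ‘stable’ in the sense above; what counts as a sensible challenge is of course problem dependent.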

]]>One can probably argue that in this case the assumption of linearity plus normal residuals elsewhere is technically violated, so as n goes to infinity residual analysis will tell you that the model is wrong (though it will not highlight the outlier(s) as the source). For moderate n, however, one can easily jot down datasets in which, looking at all the information, the leverage outlier is crystal clear but the residuals on their own won’t look suspicious at all.
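A small R sketch of such a dataset, under made-up numbers: fifteen points follow a line on [0, 1], and one point sits far out at x = 10 and well off that line. All values here are hypothetical and chosen only to show the effect.

```r
set.seed(2)
# Hypothetical data: 15 points on the line y = 1 + 2x, plus one point far out
# in x that does NOT follow the line (the line would give y = 21 at x = 10)
x <- c(seq(0, 1, length.out = 15), 10)
y <- c(1 + 2 * x[1:15] + rnorm(15, sd = 0.3), 5)

fit <- lm(y ~ x)
e <- residuals(fit)   # raw residuals
h <- hatvalues(fit)   # leverages (diagonal of the hat matrix)

# The far-out point has leverage close to 1, so the fit is pulled through it
# and its raw residual is not the largest one
round(cbind(x = x, y = y, resid = e, leverage = h), 3)
which.max(abs(e))
```

The raw residuals on their own do not single out the last point, while the leverages (or simply a scatterplot) do.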

Note that I don’t have time to write anything down in detail right now so you’ve got to live with what I get from my memory on the fly. I made a number of stupid mistakes with integrals in my lecture today so you shouldn’t necessarily trust me. ;-)

]]>a) T is minimal sufficient for the model family indexed by theta

b) y|T is used to judge the model

I’m interested in cases where b) fails while a) holds.

]]>BTW I then wonder – In what sense is the model judged ‘wrong’?

Eg is it that your statistics T clearly don’t capture everything you are ‘interested in’ (model independent judgement) but these don’t ‘show up’ in the conditional distribution? Or?

]]>True but I _think_ part of the idea is that if y|T looks ‘suspect’ then the model and hence factorisation is suspect.

But all this is with my Bayesian hat on – more generally I agree that it is probably best to separate this stuff from modelling assumptions even further.

]]>Interesting… but it’s part of the model assumptions that the factorisation is possible in this way and one may want to check this, too, for which more than the “leftover” would be needed. ]]>

Yes but this is a case where the answer seems clear. We can then look at how to get _approximations_ to this.

Some quotes from the first Mike Evans paper I posted:

> It is our claim that effectively (1) [OJM: basically, factoring the distribution as above] shows us how to proceed to avoid double use of the information and, as such, avoid double use of the data. Of course, as mentioned in the paper, it may be difficult, with complicated models to determine [p(y|T) etc] in meaningful ways. Accordingly, it seems reasonable to weaken this requirement in such contexts to having this hold asymptotically in some sense…

also

> In frequentist statistical theory, inference about parameters depends on the data only through the minimal sufficient statistic and, what is left over in the data (the residual), is available for model checking. Mixing these up would seem to correspond to an inappropriate statistical analysis. We believe this is equally applicable in Bayesian formulations….Of course, this restriction could be weakened ….only satisfy (1) in some asymptotic sense. The motivation for this would seem to arise from the complexity of some situations. Still, (1) can be implemented exactly with many models of considerable importance, so it isn’t just of theoretical relevance.

He also discusses checking priors and checking hierarchical models.

]]>veiw -> view

etc -> etc ]]>

> ojm, I don’t fully follow your distinction btw ‘sufficient statistics’ and ‘parameters of a model’.

A sufficient statistic is a statistic (function of the data), while a parameter is a…parameter? E.g. for a parameter defined via a Fisher-consistent functional it is the value of the statistic evaluated on the ‘full’ model rather than just the observed distribution.

For a sufficient statistic T(y) and model p(y;theta) we have

p(y;theta) = p(y|T(y))p(T(y);theta)

by definition of sufficient statistic.

In particular, the term p(y|T(y)) is independent of theta by definition.

In Bayes the data only enters parameter estimation via the likelihood function which is _only defined up to a data-dependent constant_. This is the important point, and noted by both ‘Frequentists’ (e.g. David Cox) and ‘Bayesians’ (e.g. Mike Evans).

Any purely data-dependent factor such as p(y|T(y)) drops out of estimating theta. So this is ‘unused’ info in this sense.

Equivalently, you can take the set of likelihood ratios or the normalised likelihood function as the minimal sufficient statistic. Again the data-dependent constant factor drops out. E.g.

[p(y|T(y))p(T(y);theta_1)]/[p(y|T(y))p(T(y);theta_2)] = p(T(y);theta_1)/p(T(y);theta_2)

in which p(y|T(y)) drops out of the comparisons. You can also write this all out in terms of a Bayesian posterior – again, something like p(y|T(y)) drops out.
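A quick numerical check of this in R, as a sketch for the normal model only, where T(y) is the sample mean and sd. The two datasets and the parameter values `th1` and `th2` are invented; `y2` is built to share T(y) with `y1` while having a very different shape.

```r
set.seed(3)
y1 <- rnorm(10, mean = 1, sd = 2)
z  <- rexp(10)                                   # deliberately non-normal shape
y2 <- mean(y1) + sd(y1) * as.numeric(scale(z))   # same mean and sd as y1

# Normal log likelihood and log likelihood ratio between two parameter values
loglik <- function(y, mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))
llr <- function(y, th1, th2) {
  loglik(y, th1[1], th1[2]) - loglik(y, th2[1], th2[2])
}

th1 <- c(0.5, 1.5)   # hypothetical (mu, sigma)
th2 <- c(2, 3)       # hypothetical (mu, sigma)
llr(y1, th1, th2)
llr(y2, th1, th2)    # agrees with the line above up to rounding
```

Estimation cannot tell `y1` from `y2`; any difference between them lives entirely in the ‘residual’ y|T(y).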

The intuition is that your choice of model determines your choice of minimal sufficient statistics or vice-versa. Knowing one gives you the other.

This tells you which parts of the data you are paying attention to.

However, you can also ask whether you are paying attention to the right things – what does the ‘residual’ y|T(y) look like? If this shows unusual patterns then you doubt you’ve used a good minimal sufficient statistic i.e. you doubt you’ve used a good model.

An issue with _nontrivial_ minimal sufficient statistics is that they don’t generally exist outside of special families.

I’d probably argue that one should proceed the other way – choose what statistics are of interest and are a compact summary (‘minimal sufficient’ from a data analysis point of veiw) and _then_, if you still want to do parametric estimation, choose a model for which these are the minimal sufficient stats.

(This is also probably not far off the same procedure as something like MaxEnt, but minus the mysticism and, possibly, issues with working with entropy).

]]>I then do a posterior predictive check on yrep|y which is basically an integral over P(yrep|mu,sigma)*P(mu,sigma|y)*dmu*dsigma

Here mu and sigma are parameters of my model. Whether or not there exist sufficient statistics for estimating them in a simple way depends on my likelihood and priors, no? For simple cases that amount to max likelihood, such statistics exist. But PPC pretty clearly needs to extend to complex cases. So I don’t understand how sufficient statistics really apply (big caveat that I am not a statistician). In particular, you say

“The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)”

But I don’t see how this is generally true in Bayesian models, and thus for how PPC happens in practice, except in special cases where T(y) are available. What am I missing?
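To make the posterior predictive integral above concrete, here is a minimal Monte Carlo sketch in R for the one case where everything is available in closed form: a normal model with the standard prior p(mu, sigma^2) proportional to 1/sigma^2. That prior, the skewed example data and the helper `skew` are assumptions made for the illustration; a complex model would need MCMC draws instead.

```r
set.seed(5)
# Hypothetical, deliberately skewed data (exponential quantiles),
# fitted with a normal model
y <- qexp(ppoints(50))
n <- length(y)
n_draws <- 2000

# Posterior draws of (mu, sigma), ASSUMING p(mu, sigma^2) proportional to 1/sigma^2
sigma <- sqrt((n - 1) * var(y) / rchisq(n_draws, df = n - 1))
mu    <- rnorm(n_draws, mean(y), sigma / sqrt(n))

# Sample skewness, a statistic that is not a function of (mean, sd)
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

# One replicated dataset per posterior draw: the Monte Carlo version
# of integrating P(yrep | mu, sigma) * P(mu, sigma | y)
rep_stats <- sapply(seq_len(n_draws), function(i) {
  yrep <- rnorm(n, mu[i], sigma[i])
  c(mean = mean(yrep), skew = skew(yrep))
})

p_mean <- mean(rep_stats["mean", ] >= mean(y))   # statistic the model fits automatically
p_skew <- mean(rep_stats["skew", ] >= skew(y))   # statistic outside T(y)
c(p_mean = p_mean, p_skew = p_skew)
```

In this sketch a check based on the mean cannot reveal much, because the mean is part of T(y), while a check based on skewness looks at the part of the data the fit did not use.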

I’m glad that I found Mike Evans’ comments on the relevance of sufficient stats to the double use of data topic, so I don’t think I’m completely wrong. But would be nice to hear any other thoughts on the topic. Even Andrew seemed reluctant to take up the specific point and hence our big detour into ‘foundations’ territory.

Does anyone out there have thoughts on posterior predictive checks, sufficient stats, and double use of data?

]]>The ML + curvature based estimate for a confidence interval can be seen as a first-order correction to the zeroth order estimate that the parameter value is just the ML value (a delta function).

A *proper* confidence interval, however, while I agree it doesn’t integrate over the parameter space, *is* supposed to guarantee coverage regardless of what the true value is. The typical confidence interval construction procedure in ML fitting relies on asymptotic normality of the estimator, and only gives proper coverage as N goes to infinity. So for example if you do your simulation study for N = 100 instead of N = 4 and for mildly-informative order-of-magnitude priors, what do you observe?

So, for small data sets, the ML + curvature estimate is probably systematically too small to be a proper CI, because a proper CI requires some kind of minimax construction (i.e. the minimum-length interval having 95% coverage at the worst-case parameter value).

It’s not that frequentist theory here says we *should* use the ML + Curvature, it’s that it has no computationally tractable way to get a true coverage guarantee and so it relies on an approximation.

At least, I think that’s what’s going on.
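One way to check this is a small coverage simulation in R. This is only a sketch for the normal-mean case with N = 4; the true values `mu` and `sigma` are arbitrary choices, and the t interval is used as the comparison because it is the exact interval for this particular model.

```r
set.seed(4)
n <- 4
mu <- 0.03       # arbitrary true mean
sigma <- 0.01    # arbitrary true sd
n_sim <- 5000

covers <- replicate(n_sim, {
  y <- rnorm(n, mu, sigma)
  ybar <- mean(y)
  # ML + curvature: the ML sd divides by n, and the interval uses normal quantiles
  se_ml <- sqrt(mean((y - ybar)^2) / n)
  # exact interval for the normal model: sd divides by n - 1, t quantiles
  se_t <- sd(y) / sqrt(n)
  c(wald = abs(ybar - mu) <= qnorm(0.975) * se_ml,
    t    = abs(ybar - mu) <= qt(0.975, df = n - 1) * se_t)
})

coverage <- rowMeans(covers)   # fraction of simulated intervals containing mu
coverage
```

In this setup the curvature-based interval falls well short of the nominal 95% at N = 4, while the t interval comes out close to it.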

]]>Practically speaking, software may spit out a shorter interval, but we need to consider whether the shorter interval arises from using an asymptotic approximation that is inappropriate rather than actually having a correct CI.

]]>No, I don’t think this is correct.

Mathematically, the 95% CI that occurs when a Frequentist procedure based on likelihood theory + a flat prior is used to construct a confidence interval, *is* the 95% posterior interval that occurs when the Bayesian uses the Bayesian procedure and an improper flat prior. In other words, whatever you get from running Stan with a flat prior, is *what you should have gotten with the CI procedure*.

There’s a mathematical isomorphism, the math that you get is the same in the two cases, the intervals you get are the same in the two cases.

Now, *as an approximation* to the *correct CI* often a MAP + curvature calculation is done by software like “lm” etc. The calculation is based on an asymptotic theory, and perhaps this may systematically under-estimate the correct size of the interval when the dataset is small. The fact is, this is a *calculation error* and the resulting interval *is not a correct 95% CI*.

Taking the results of this MAP/ML + curvature *calculation error* as evidence that the CI can be smaller and hence “better” in some sense, is just saying “often the simple calculation that is done leads you to a systematic error in which your CI is wrong because it’s too small, and this is ‘good’”.

Yes, I agree with you that often the point of a Frequentist interval is really just to pick out a number (the MAP/ML estimate). If Frequentists just don’t care about the size of their CI then fine… But when the model is mathematically equivalent to a Bayes CI, the intervals *need to be the same*; that’s what “equivalent” means in this context.

]]>The “flat” prior corresponds to the limit as sd goes to infinity. Let [a,b] be a “flat prior based confidence interval”. If you make sd something less than infinity you increase the prior probability that the parameter is in [a,b] because as sd goes to infinity, the prior probability to be in any fixed interval goes to zero.

If we increase the prior probability, and the likelihood would pick out this region under the flat prior, then we should increase the posterior probability as well. This means we can shorten the interval and keep the same posterior probability.

Now, your point about MAP + curvature may mostly be an observation that for small data, the *approximation procedure of MAP + curvature* systematically underestimates the appropriate size of the confidence interval. That’s a different story ;-)
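The shortening can be seen directly in the simplest conjugate case. The R sketch below assumes a normal likelihood with *known* sigma and a normal prior on the mean; the function `post_interval` and the value `sigma_known = 0.01` are assumptions for illustration, and the data are the four values from the example elsewhere in this thread.

```r
# 95% posterior interval for a normal mean with known sigma and a
# normal(prior_mean, prior_sd) prior; prior_sd = Inf gives the flat-prior interval
post_interval <- function(y, sigma, prior_mean = 0, prior_sd = Inf) {
  n <- length(y)
  prec <- n / sigma^2 + 1 / prior_sd^2                      # posterior precision
  m <- (sum(y) / sigma^2 + prior_mean / prior_sd^2) / prec  # posterior mean
  m + c(-1, 1) * qnorm(0.975) / sqrt(prec)
}

dat <- c(0.02, 0.025, 0.04, 0.031)
sigma_known <- 0.01                  # ASSUMPTION: sd treated as known
prior_sds <- c(Inf, 1, 0.05, 0.01)

widths <- sapply(prior_sds, function(s) {
  diff(post_interval(dat, sigma_known, prior_sd = s))
})
data.frame(prior_sd = prior_sds, width = widths)
```

In this conjugate setting the width depends only on the prior sd, not on where the prior is centred; the centre of the interval does move towards the prior mean, which is where a badly placed prior would do its damage.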

]]>Of course, priors should contain information that is not in the data, I took this as a given. In my (limited) experience, this is not very hard, because often weakly informative priors suffice.

I also agree that when the data is too weak to estimate the model, e.g. when parameter estimates depend more on priors than on data, one should rather admit that and collect better data than employ priors.

Lastly, when simple models are estimated based on weak data, priors can make sense because they can protect from falling for unreasonably large effect sizes (In the field I am working in).

In the end, also given my failed attempt ;-), I am wondering if general statements about the usefulness of priors that reflect more than common sense are possible.

]]>e.g.

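library(rstan)

# Model with tight priors on mu and sigma
stan_model2 <- '
data {
  int N;
  vector[N] dat;
}
parameters {
  real mu;
  real<lower=0> sigma;  // a standard deviation must be positive
}
model {
  mu ~ normal(0, 0.05);
  sigma ~ normal(0, 0.05);
  dat ~ normal(mu, sigma);
}
'

# Same likelihood with implicit flat priors
stan_simpleCI <- '
data {
  int N;
  vector[N] dat;
}
parameters {
  real mu;
  real<lower=0> sigma;  // a standard deviation must be positive
}
model {
  dat ~ normal(mu, sigma);
}
'

dat <- c(0.02, 0.025, 0.04, 0.031)
stan_dat <- list(N = 4, dat = dat)

# Full posterior under the flat prior (model passed as a string via model_code)
demo <- stan(model_code = stan_simpleCI, data = stan_dat,
             chains = 3, iter = 500)
# divergences can be suppressed with adapt_delta and are not important
# to the point I'm making

# compare to the mode + curvature approximation
demo_mod <- stan_model(model_code = stan_simpleCI)
opt_demo <- optimizing(demo_mod, data = stan_dat, hessian = TRUE)
print(opt_demo)
# the Hessian is computed on the unconstrained scale, so with sigma
# constrained to be positive the second entry refers to log(sigma)
sqrt(diag(solve(-opt_demo$hessian)))

# or
summary(lm(dat ~ 1))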

It all boils down to the observation that a prior is helpful if the information encoded in it is good and reliable, otherwise rather not. ]]>

Fair enough, I agree that this depends on the problem.

“However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.”

…and this looks far too general a statement to me. A prior can help if it represents genuine reliable information that is not in the data. If it doesn’t, I don’t see what you get from it. Surely you only want the prior to do some work here if the work done by it is good and helpful.

Sometimes I think it’s better to be honest and say that the available information is not strong enough to estimate your complex model at the required precision, and use graphs and non-probabilistic reasoning instead (or to use probability in a purely exploratory fashion) – keeping in mind also that if the data are too weak to estimate your model, chances are that they will also be too weak to check your model well.

Centered more on the data, perhaps, depends on the situation. Whether that’s a good thing also depends on the situation, for example the noisiness of the data collection process and the amount of background info that informs the prior.

]]>I think it is hard to make or understand statements about usefulness of priors (or trade-offs when defining them) without specifying the statistical problem one is dealing with.

If one has lots of data and/or very good measurements and one is estimating relatively simple models, I could also be happy not to use priors.

However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.

So maybe disagreements about the trade-offs involved in formulating priors stem in part from different implicit assumptions about the statistical problems to be solved?

]]>OK. I still see these terms being used commonly and I used them because of such common usage. I will avoid doing so in future! I thank you for pointing out the paper by you and Christian; I accept your points and fully appreciate the views in the paper and the discussion.

I am still unsure about what term I should apply to a probability calculated using:

(1) carefully observed, documented data alone with a mathematical model (which of course makes some unverifiable assumptions) and incorporating Bayes rule

(2) hypothetical data (“made up” as Christian says) as well as carefully observed, documented data with the same model (based on the same unverifiable assumptions) and incorporating Bayes rule

In his first reply to your blog, Christian uses the terms ‘frequentist’ (“when using Bayesian analyses with one specific prior”) and “Bayesian” (the latter when “making up” a prior). So in accordance with Christian’s usage, do you think my comment above should have used the term ‘Bayesian’ instead of ‘subjective’ for (2) and ‘frequentist’ instead of ‘objective’ for (1)? I must say that I also feel uncomfortable about using the terms ‘frequentist’ and ‘Bayesian’ too! Perhaps my last sentence should have read “Frequentists should make use of this ‘frequentist Bayesian’ approach too.”

]]>“But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak” – if I remember correctly, you always claim that specifying a prior is not complicated and whatever has high density at the true value is fine… what you say here may depend a lot on what your prior does elsewhere, at least as long as you don’t have large amounts of data.

“You could reject Bayesian Decision Rules” – I don’t (I’m quite undogmatic and have been caught defending Bayes against frequentists). I’m fine with them as long as you need a decision and your prior and loss function are well justified. Except that this in my practice is a minority situation.

]]>I recommend abandoning the terms “subjective” and “objective” for reasons discussed in detail here by Christian Hennig, myself, and 53 others.

]]>Adding hypothetical likelihood distributions is a very powerful way of doing sensitivity analyses to anticipate how an existing study might give different results or the result of an attempted replication in a different setting. However, it is wrong to regard the uniform base-rate prior probability distribution as simply one more hypothetical prior. On the contrary, the uniform base rate prior probability is a fundamental part of random sampling mathematical models. Frequentists should make use of this ‘objective Bayesian’ approach too.

]]>https://projecteuclid.org/download/pdf_1/euclid.ba/1340370946 ]]>

http://andrewgelman.com/2017/04/26/using-prior-knowledge-frequentist-tests/

It’s part of a quest to use Bayesian posteriors with informative priors as a method for achieving what was quoted above as ‘pure’ Frequentist (emphasis on optimality, decisions, coverage). The key is the realization that informative priors in Bayesian statistics carry the information that in ‘pure’ frequentist statistics is carried by an informative loss function.

]]>But as you mentioned sufficient stats rarely exist for more complex problems. The guiding intuition is perhaps the same though.

> while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes

That’s precisely the point!

]]>But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak. You could reject Bayesian Decision Rules, but then you have to explain how Wald’s theorem plays into your justification for rejection…

In the end, I think my biggest concern is that the frequentist confidence interval and/or point estimation procedure feel like a misfired attempt to make good decisions based on data. It seemed intuitive, but it wasn’t fully thought through.

]]>From wiki: “In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if ‘no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter’.”

Sometimes in a Bayesian model things collapse to a small set of sufficient statistics, like the mean and sd of a normal. But other times that’s not true. I think in the Cauchy case there is no sufficient statistic according to the standard definition of sufficient statistics (I’ve never looked too carefully into that but I see it repeated in discussions, so I assume it’s a standard result). So, while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes, and that comes down to whether it predicts “well” after fitting. The “well” has something to do with a utility rather than an inference. The fact is as fallible humans we can’t expect that we’ve specified the model exactly in the way we will want it specified until we’ve had a lot of experience with using it…

]]>That is, Bayes estimation only uses the sufficient stat so PPC can be perfectly well justified from a traditional point of view when based on y|T(y), just as in e.g. David Cox’s ‘Fisherian reduction’.

I got the feeling that Andrew didn’t care too much about emphasising this correspondence, despite mentioning on occasion that PPC should be based on aspects of the model that are not ‘fit automatically’. Perhaps because he had (rightly) moved on to more interesting things. But still.

]]>