Nic Lewis writes:

I have made some progress with my work on combining independent evidence using a Bayesian approach but eschewing standard Bayesian updating. I found a neat analytical way of doing this, to a very good approximation, in cases where each estimate of a parameter corresponds to the ratio of two variables each determined with normal error, the fractional uncertainty in the numerator and denominator variables differing between the types of evidence. This seems a not uncommon situation in science, and it is a good approximation to that which exists when estimating climate sensitivity. I have had a manuscript in which I develop and test this method accepted by the Journal of Statistical Planning and Inference (for a special issue on Confidence Distributions edited by Tore Schweder and Nils Hjort). Frequentist coverage is almost exact using my analytical solution, based on combining Jeffreys’ priors in quadrature, whereas Bayesian updating produces far poorer probability matching. I also show that a simple likelihood ratio method gives almost identical inference to my Bayesian combination method. A copy of the manuscript is available here: https://niclewis.files.wordpress.com/2016/12/lewis_combining-independent-bayesian-posteriors-for-climate-sensitivity_jspiaccepted2016_cclicense.pdf .

I’ve since teamed up with Peter Grunwald, a statistics professor in Amsterdam whom you may know – you cite two of his works in your 2013 paper ‘Philosophy and the practice of Bayesian statistics’. It turns out that my proposed method for combining evidence agrees with that implied by the Minimum Description Length principle, which he has been closely involved in developing. We have a joint paper under review by a leading climate science journal, which applies the method developed in my JSPI paper.

I think the reason Bayesian updating can give poor probability matching, even when the original posterior used as the prior gives exact probability matching in relation to the original data, is that conditionality is not always applicable to posterior distributions. Conditional probability was developed rigorously by Kolmogorov in the conext of random variables. Don Fraser has stated that the conditional probability lemma requires two probabilistic inputs and is not satisfied where there is no prior knowledge of a parameter’s value. I extend this argument and suggest that, unless the parameter is generated by a random process rather than being fixed, conditional probability does not apply to updating a posterior corresponding to existing knowledge, used as a prior, since such a prior distribution does not provide the required type of probability. As Tore Schweder has written (in his and Nils Hjort’s 2016 book Confidence, Likelihood, Probability) it is necessary to keep epistemic and aleatory probability apart. Bayes himself, of course, developed his theory in the context of a random parameter generated with a known probability distribution.

I don’t really have time to look at this but I thought it might interest you, so feel free to share your impressions. I assume Nic Lewis will be reading the comments.

This also seemed worth posting, given that Yuling, Aki, Dan, and I will soon be releasing our own paper on combining posterior inferences from different models fit to a single dataset. Not quite the same problem but it’s in the same general class of questions.

The problem is that *Bayesian Probabilities are NOT SUPPOSED TO BE the frequency of anything*.

As soon as you decide what problem you want to solve, you will make more progress solving it.

1) I have the output of a random number generator and I want to determine what its probability distribution is. (Frequentist inference)

2) I know some things about the world and I want to determine what things I can approximately infer from this information. (Bayesian Inference)

Even more confusing for people can be where you have both… you have some facts about the world that tell you about the output of a random number generator and you want to determine what things you can infer about the frequency distribution.

You will wind up with a probability distribution over frequency distributions.

Simple example: a polling company randomly selected people in the US and measured their weight (due to random selection, it’s a kind of RNG), they then give you the average weight in each quintile. Now you’d like to infer the full shape of the weight distribution. You could do a gaussian mixture model with say 3 mixtures, and then you’ll have 9 unknown parameters (location, scale, and weight). you’ll get a posterior distribution over these 9 parameters that will induce a posterior distribution over frequency distributions of weight in the population. (that is, for every sample of the 9 parameters from the posterior over parameters, you get ONE guess at the frequency distribution)

Which is the probability distribution and which is the frequency distribution? If you can answer this question clearly in your mind, you will see that the Bayesian probability distribution over the parameters is a pure probability distribution that has nothing to do with the frequency of anything. But, there exists a frequency distribution of weights in the population of people, and if you’ve done your job right, the maximum a-posteriori parameter vector will induce a GMM that is a close approximation to the actual frequency distribution.

It’s no surprise that the frequency distribution you get as a mixture model of GMMs from the posterior parameter set is not necessarily a good approximation. This tells you only that your uncertainty over the frequency distribution is still large and you have to consider a wide range of possibilities.

“Bayesian Probabilities are NOT SUPPOSED TO BE the frequency of anything”

Yes, but in science one generally wants results objectively to reflect the observations and uncertainties involved – one isn’t after a subjective personal degree of belief – so the appropriateness of the Subjective Bayesian paradigm is questionable if the data are not strong enough to dominate the prior. As Andrew has written, ‘correspondence to observed reality [is] a core concern regarding objectivity’ and ‘the ultimate source of scientific consensus.’

In my experience, most scientists want their stated uncertainty ranges for estimates of fixed but uncertain parameters to be valid as confidence intervals. Posteriors from an objective Bayesian approach using probability matching priors can typically generate credible intervals that are, at least approximately, valid confidence intervals. My point is that, when combining evidence from different sources about such a parameter using an objective Bayesian approach, using standard Bayesian updating is not appropriate, at least if one wants to achieve probability matching. That applies equally where there is observationally-based prior knowledge regarding the parameter value (but not if the parameter value was determined by a random process and the prior knowledge is about the characteristics of that process).

“I have the output of a random number generator and I want to determine what its probability distribution is.”

That is not usually the aim when estimating the value of a fixed but unknown parameter from observational evidence. Rather, the aim is to determine a central estimate for the parameter given the values of the observed variables – which are typically viewed as being affected by random errors whose probability distributions are either already known approximately or are estimated from multiple observations – and uncertainty ranges around the central estimates (or a complete uncertainty distribution). This can be achieved either by frequentist or Bayesian methods, but credible intervals from subjective Bayesian approaches may reflect the error characteristics of the observational evidence less well, possibly greatly so, than those from objective Bayesian approaches or frequentist confidence intervals.

So trivially this will be diachronic-Dutch-bookable, but I don’t think anyone except Jay Kadane cares about that anymore. Almost as trivially, this will violate the product rule functional equation desideratum of Cox’s theorem. Jaynesians, can we push this to some limit and show that it violates propositional logic?

I haven’t got to the part where you purport to show this, but I gotta tell you that

actualstandard Bayesian methodology is trivially order-consistent:p(theta| d_1, d_2) is equal to p(theta) p(d1, d2 | theta) / p(d1, d2)

and p(d1, d2 | theta) satisfies the usual relationship between join and conditional distributions:

p(d1, d2 | theta) = p(d1 | theta) p(d2 | d1, theta) = p(d2 | theta) p(d1 | d1, theta)

Oh I get it: you’re just saying that the so-called “objective Bayes” approach of picking a reference prior relative to the first experiment and then using the posterior from the first experiment as the prior for the second experiment isn’t order-consistent. I’d view that as an argument against reference priors, not Bayesian updating, but then I’m not interested in uniform frequentist coverage over parameter space.

Daniel – where do you get your definition of frequentist inference from? It seems pretty wrong to me.

I wouldn’t take that to be a *definition* of Frequentist inference, more like one case of frequentist inference. The key thing about Frequentist Inference is that you want to precisely match long run frequencies by definition.

So, if his goal was to match the long run frequency with which a certain experimental set up would produce a certain observed result, assuming the experimental setup is precisely enough controlled that it can be approximated as producing data indistinguishable from a particular RNG… then he’s got a Frequentist goal.

But I see ultimately that this was a red herring in this case. AFAICT what he’s doing is using an RNG to produce pseudo-data that matches an assumed experimental set-up and then demanding that the frequency with which his inference has some property is something or other. In this case, you really are working with an RNG… and so it’s not the failure of the world to behave like an RNG that is the issue (which is what I initially thought, that his comparison to actual experimental data showed mismatched frequencies).

“This also seemed worth posting, given that Yuling, Aki, Dan, and I will soon be releasing our own paper on combining posterior inferences from different models fit to a single dataset. Not quite the same problem but it’s in the same general class of questions.”

Is this paper have anything to do with a recent (feb2016) discussion in the Stan users mailing list “Bayesian model averaging without reversible-jump MCMC on highly-dimensional models”

Seems similar to idea of a Mixture-model-Bayesian-stacked-like-regression-thing that was being proposed.

Having skimmed through the paper, I can’t understand what the issue is. You seem to be interested in the problem where you have data about A and B with normally distributed measurement errors, and you’re interested in the quantity A/B.

In Stan

parameter{

real A;

real B;

}

model{

A ~ somepriorA();

B ~ somepriorB();

DataA ~ normal(A,sigmaA);

DataB ~ normal(B,sigmaB);

}

generated quantities{

real RAT;

RAT = A/B;

}

Now, you have some interesting analytical approximations, and choices of prior which allow you to get a distribution for RAT directly, analytically. But, in your Figure 2 you show that your method gives exact probability match for your simulated data, whereas “bayesian updating” doesn’t.

However, I can’t figure out in your paper what it is that you actually mean by “bayesian updating”. It seems to me that the above *is* bayesian updating and it should match your procedure exactly if you put in the appropriate priors. (note, if you have real prior information here, then putting in the real prior info will get you to the right place faster, but asymptotic for large data, you wind up at the same place)

in your graph you say (Exp. A then B) and (Exp. B then A) which makes me think that you’re doing something like trying to do Bayesian inference in a strange way in which you *throw away* the information on the underlying A, B and try to purely do successive inference on RAT. If you do inference simultaneously on A, and B, or you do inference on A and then on B, or on B and then on A, you will wind up with the same distribution for RAT. The distribution on RAT is induced by push-forward from the A,B distributions. If you throw away the info on A,B it doesn’t seem that surprising that you wind up with less than efficient/correct inference.

The problem is to infer A and B… once that is done, samples of RAT are trivial.

Am I missing something?

“Am I missing something?”

Yes. You deal with the case where there is a single source of evidence, consisting of data for A and B. In that case, I agree that the approach you outline is fine (if I understand the Stan statements correctly). I have myself used essentially that approach in suitable cases.

The prime focus of my paper, however, is on combining evidence from different sets of observations. So as well as observations of A and B, one also has observations of C and D, which have different uncertainties from A and B. (There could be more different sets of paired observations, of course.)

A standard Bayesian approach would be to infer the parameter, theta, as the normalised product of some prior p(theta) and p(A B|theta), the joint likelihood for A and B, giving a posterior p(theta|A B), and then to update that posterior, treated as a prior, by the joint likelihood for C and D: p(theta|A B C D)= p(theta) p(A B|theta) p(C D|theta). My point is that the prior that produces good inference when observations only of A and B are available will not also, in general, produce good inference when inferring from observations C and D as well as A and B.

In the case I analyse in the paper, the analytical approximation for theta in the single evidence (data on A and B) case makes it simple to compute the prior that will produce good inference in the combined evidence (data on A B C and D) case, or indeed for any number of different sets of evidence.

I think I understand your graph Figure 2 to show that you do a Bayesian update as above with two experiments. To make it less confusing, let’s call your parameter X,Y and RAT = X/Y. then *experiment A* provides some data on X,Y and you get a posterior on X,Y call it p(X,Y | DataA)

From this you can easily calculate p(RAT | DataA) as a push-forward measure (ie. generate X,Y and calculate X/Y and you get a sample)

Now, it sounds to me as if you *throw away* the posterior on X,Y and start with P(RAT | DataA) as a prior on RAT, and try to do

p(RAT | DataA,DataB) = P(DataB | RAT, DataA)P(RAT|DataA) / P(DataB|DataA)

now you can see that you’ve learned a bunch of stuff about X,Y in the first step, but you’ve lost that information when you go to do the second step and collapse the 2D posterior over X,Y into a 1D posterior over RAT.

Instead, simply retain your knowledge about X,Y and do:

p(X,Y | DataA,DataB) \propto P(DataB | X,Y) p(X,Y|DataA)

and from P(X,Y|DataA,DataB) calculate the push-forward measure on RAT (ie. generate samples of X,Y and calculate X/Y)

If there is an underlying 2D space and you collapse the probability measure down to a 1D space, that you lose information is exactly what you’d expect, there are after all an infinity of possible pairs (X,Y) for which X/Y = 1 for example.

That would be relevant if Data A and Data B were both providing information about the same X and Y. But they aren’t. There is no connection between the numerator and denominator variables in the two experiments, either as to values or observational uncertainty.

Aha, this is I think what I was missing in your paper, and maybe it’s in there explained somewhere but I didn’t find the details.

So, now I’m imagining that in experiment A you learn about X, Y and in experiment B you learn about P,Q and then some ratio R is both equal to X/Y and also in separate type of experiment, equal to P/Q and you want to combine information about both experiments. You have:

(Andrew, I love Stan pseudocode too!)

parameters{

real X;

real Y;

real P;

real Q;

real R;

}

model{

X ~ priors go here;

Y ~

P ~

Q ~

R ~

DataXA ~ normal(X,xsigma);

DataYA ~ normal(Y,ysigma);

DataPB ~ normal(P,psigma);

DataQB ~ normal(Q,qsigma);

R ~ something(X/Y,…);

R ~ somethingelse(P/Q,…);

}

Now, the issue is this. P(R | X,Y) = deltafunction(X/Y) and P(R | P,Q) = deltafunction(P/Q), that is, if you told me the true values of X,Y or P,Q I would know immediately the true value of R. Unfortunately, there’s no way to express this in Stan, but mathematically you could express it to within a given epsilon by :

R ~ normal(X/Y,eps);

R ~ normal(P/Q,eps);

This may well have a hard time sampling because it constrains R,X,Y,P,Q to fly around in a very narrow tunnel in their 5 dimensional space.

I think you can think of this as if you were doing

generated quantities{

real RXY;

real RPQ;

RXY = X/Y;

RPQ = P/Q;

}

and then taking the samples of RXY and RPQ, doing a kernel density estimate using an eps width normal kernel, and then multiplying the two kernel densities to get p(R|X,Y)p(R|P,Q)

If you can make that sample, I think it should give you the right answer (in the limit of very small epsilon) and be invariant to any “order” effects.

What about simply constraining your parameter space:

parameters{

real X;

real Y;

real Q;

}

transformed parameters{

real R;

real P;

R = X/Y;

P = Q*R;

}

model{

X ~ priors go here;

Y ~

P ~

Q ~

R ~

DataXA ~ normal(X,xsigma);

DataYA ~ normal(Y,ysigma);

DataPB ~ normal(P,psigma);

DataQB ~ normal(Q,qsigma);

}

Yep, this was the solution Corey suggested over at my blog. I posited the folk theorem of “whenever you want a delta function, what you really need is clever algebra”.

Daniel:

Just separate from everything else . . . I looove the idea of Stan pseudocode. This is a great thing.

Stan pseudocode is the best!

I’m not sure whether it makes sense to call something “Bayesian” that removes what Bayes contributed to Bayesian statistics.

Maybe, but in that case nor should objective Bayesian methods in general be called Bayesian, as they are logically inconsistent with standard Bayesian updating (conditionality), as noted in Kass and Wasserman’s 1996 review paper (and by Teddy Seidenfeld much earlier).

Further, on the same argument it also seems questionable calling subjective Bayesian methods Bayesian when they are applied to inference about a fixed but unknown parameter. Bayes’ analysed a fundamentally different case, where the parameter was random, with a known distribution.

Moving away from Subjective vs Objective, and towards Cox/Jaynes… we can say that if you want to use a real number to calculate some kind of credence or plausibility, and you want it to agree with logic in the limit, then you should follow the algebra of probability. In this case:

p(R|DataA,DataB) = p(R | X,Y,P,Q) p(X,Y | DataA) p(P,Q|DataB)

and your main problem is really that p(R|X,Y,P,Q) has a nonstandard density delta(R-X/Y)delta(R-P/Q).

You’ve got the core of a discussion for the incoming discussion paper of Andrew and Christian on Wednesday!

Nic: At least Bayes came up with Bayesian updating; otherwise his paper is somewhat ambiguous and different people have read different things into it when it comes to the more general philosophy. So I think going away from Bayesian updating removes more Bayes from Bayesian statistics than the other issues you mention.

> …by using the uniform Jeffreys prior to infer phi1 and phi2, which, as they are location parameters, is completely noninformative here.

Is a uniform prior adequate? There is no translation invariance in the ECS example (the values are defined as changes from a baseline) and I guess the same will happen in other cases where the ratio phi1/phi2 may be of interest.

I coded this up in Stan, and ran it, took about 20 seconds to compile and less than a second to run on 10 observations of 2 variables in each of 2 experiments. Link to blog:

http://models.street-artists.org/2017/04/09/inference-on-a-ratio-from-two-separate-experiments-a-case-study-in-stan-awesomeness/

I agree that the problem can alternatively be solved numerically by sampling. The case you coded involves only two experiments, and relatively small fractional errors in the observations. Sampling will however become progressively more computationally demanding as the number of experiments is increased. It also provides no insight into the relative contributions of the various experiments to constraining the parameter value in different areas of parameter space.

Nic: I am working on a problem in survey analysis involving about 2000 localities and over 200,000 data points. Sure it takes about 20 minutes to get me about 200 effective samples, but the model is way more complicated (fitting 2000 separate functions of household size).

By the time you’re talking about 2000 experiments and 200,000 data points each on the *same* thing, well I don’t think that’s the real issue that your paper solves, because you’re likely to find out the ratio to within epsilon in that case regardless of the prior. Also, with sampling you can easily find out a posterior over the ratio within each separate experiment.

Here’s the thing about this paper, it has a nice analytical analysis of some mathematics, and it gives you a useful result that gives insight into the problem, especially when there are a small number of experiments (like say less than 10) and you want to understand clearly how each potentially controversial noisy experiment affects the overall picture of what’s going on. But, it’s written as if there is some inherently wrong thing about Bayesian updating and you’ve discovered the cure. And this is highly controversial, wrong on its face, and your actual point, which is correct, is easily misunderstood. If you’re looking to stir up a storm… then saying stuff that reads like “Bayesian updating is not order independent” and soforth is gonna do it.

What’s actually “wrong” here is that some old-school principles for choosing priors don’t in general provide a level of “uninformity” what they claim to provide, a fact that has actually been known for decades. These days, people advocating Bayesian statistics (such as Andrew, and myself, and Corey and others here at the blog) don’t actually advocate for “uninformative priors”. In fact, what we want is something like “moderately informative priors”. So, for example, in your ratio case, a typical thing that I’d do is use

Rat ~ exponential(1.0/Guess)

Where Guess is some guess at how big the ratio is. This is the maximum entropy prior for a positive number whose expected value is Guess. If I know that Rat can’t really be near zero, I’d modify this as

Rat ~ gamma(n, n/Guess)

with n a number larger than 1. I usually pick n by plotting the density. In this case there’s a boundary-layer near 0 that pushes the quantity away from zero, and the larger n is the more concentrated the prior is around Guess. There’s some more math you can do here, as gamma is also a maximum entropy prior for a slightly different case (where you know not only the mean value, but also the mean of the logarithm, or in other words, the order of magnitude)

These kinds of moderately informative priors have good properties. They prevent things from going off the rails, especially when the number of data points is small.

So, I think it might make sense to re-work the paper so that the subject is closer to “A choice of prior and an analytical approximation for Bayesian combination of N separate experiments each of which separately informs the numerator and denominator of a ratio”. But then, that doesn’t have the controversial punch of something that sounds like “Bayesian updating doesn’t work, and Minimum Description Length solves the problem”

“These days, people advocating Bayesian statistics (such as Andrew, and myself, and Corey and others here at the blog) don’t actually advocate for “uninformative priors”

You are mistaken. Some people do advocate for use of “noninformative” priors.

“In fact, what we want is something like “moderately informative priors”.”

Not all people advocating a Bayesian approach to inference would agree with you.

“But, it’s written as if there is some inherently wrong thing about Bayesian updating”

I believe that there is, in cases such as I am considering.

Question for you, Nic, but first let me make sure I understand your position correctly. Suppose that your data sets are expensive and time-consuming to collect to the point where it makes sense to do a fresh analysis after each one becomes available. If I understand you correctly, you would advocate that each analysis should result in a distribution estimate (in the sense of point estimate, interval estimate, distribution estimate) with the probability-matching property — a confidence distribution, in other words. The confidence distribution will be computed by multiplying the (approximate) likelihood functions of each data set with a, let’s call it a weighting function on parameter space, and normalizing. The weighting function will not be the same from analysis to analysis because if it were then we’d lose the probability-matching property.

Corey, what does the probability matching property even mean here? I am confused on that. Let’s say I run N separate RNG simulations, each of which generates data for a numerator and a denominator n_ij and d_ij where i is the experiment index, and j is the data point index.

Then, by probability matching we mean what? the frequency with which n_ij/d_ij = R for all values of R? But the experiments indexed by i are of very different nature apparently, and so although they all give info about the same ratio, they have different measurement errors etc. So are we fixing the makeup of the experimental set-ups? Or do we have a distribution over experiments as well? Because I can only imagine that the way we’d evaluate the “frequency properties” in practice is to conduct more experiments, but each one has varying character and so the ensemble of experiments is changing through time. Unless you specify a distribution over the character of the experiments you can’t even talk about the frequency properties under a sequence of experiments, only the frequency properties under repetition of the same exact ensemble of experiments.

So this brings me back to the idea that we imagine there are N labs each having some experimental set ups, each one collects some data points, and then we want to estimate the distribution of n_ij/d_ij under a scenario where we call all the labs and say “fire up your measurement apparatus and send us a new data set” (say runs indexed by k) and we just keep running the same ensemble of experiments over and over for k=1..K.

Why we’d care about this is confusing to me, as it seems like it would never happen, and besides if it did happen we wouldn’t want to throw away all the previous data sets, we’d want to just jam all the data from all the experimental runs into Stan and run one big model.

Your prior is “probability-matching” if your posterior credible intervals are also valid confidence intervals.

Right, but a valid confidence interval is relative to a particular generating process, and I can think of two generating processes:

1) A fixed set of labs all repeat a set of experiments over and over

2) Every day a new lab does a new experiment of the same general type but with different apparatus and different choice of numerator and denominator, but each case having the same underlying ratio.

In practice, I guess the probability matching must apply to case (1) but also in practice the way in which science is done is (2)

Process #2 is what we’re talking about; the precisions are arbitrary but known.

Daniel/Corey:

Both 1 and 2 could be possibly relevant reference sets or targets of inference.

More important than the difference between them is getting get past the usually silly assumption of a having the exact same underlying ratio.

An old comment by David Cox seems most relevant “The way I look at it is that it is implausible that different studies give exactly the same true effect. That is, treatment by study interaction is to be expected. It should be explained rationally if possible (noting that the omnibus tests commonly used for heterogeneity are extremely insensitive). If that is not possible there is no real alternative to supposing that the interaction (instability in treatment effect) is generated by a stochastic mechanism and that an average over an ensemble of such repetitions is of interest. That is, it is the interaction that is being replicated not the centres themselves.”

Keith, you make a very good point. Even doing the same exact experiment across different labs, if there is any bias in any of the measurement instruments, then the bias is consistent for that lab. Suppose we do an experiment and we have

n_i ~ normal(n* + bn_i,sigman_i);

d_i ~ normal(d* + bd_i,sigmad_i);

for a vector of data from the ith lab and a bias bn and bd for the ith lab we should probably include this lab bias for each measurement in realistic models

we could have a small number of individual labs so that the average biases don’t cancel, but a large number of data points from each lab, so that the measurement errors don’t matter, and still wind up with an irreducible uncertainty.

I’m reminded of the polling for Hillary Clinton

Corey: that’s broadly right. In climate science you can’t just go out and generate new independent data sets, because there common internal climate system variability affects all or most datasets relating to a given period, and short period data is of little relevance. There are only a small number of meta-analysis studies; most published observationally-based estimates of climate sensitivity involve analysing a single dataset, some and the weighting function, and producing a distribution estimate based thereon. It is not that I advocate this, but the reality of what occurs. I would say it is normal practice in science to produce estimates that reflect information as to the value of the parameter(s) being estimated that is obtained from the data analysed in the study concerned rather than combining it with data analysed in other studies or results derived therefrom (although such results might affect the experimental design).

“The weighting function will not be the same from analysis to analysis because if it were then we’d lose the probability-matching property.”

Yes, in general, where one is estimating a fixed but unknown parameter And nor will the weighting function be the same for analysis of combined datasets as for analysing any of them separately. I agree that it is best to regard the “prior” as a weighting function.

Still trying to be super careful to understand what is going on.

So we have: choose an N,D (a kind of experiment to do) such that N/D = R all of which are unknown. we have no hypothesized frequency distribution for N,D, and theoretically since R is fixed, we could say really just choose a D and then N=RD by definition of R. But as I say, there’s no frequency distribution for D it’s a choice as to what each lab will decide to measure.

collect some data n_i and d_i

Now, p(n_i | N,sigman) and p(d_i | D,sigmad) are known normal frequency distributions, but N,D are unknown.

estimate a bayesian probability distribution for N,D, we get to choose a prior for D and R here, N is then induced directly

from this calculate using priors on D,R

p(R | Data) = p(Data | R,D) p(D) P(R) / p(Data)

Now, demand this is the frequency distribution of something, even though we’ve already stated that the frequency distribution of R is a delta function around a constant unknown value?

Clearly I am misunderstanding something.

As best I can come up with we want an algorithm for choosing a point estimate R*, and we want a frequency distribution under repeated Data collection for the error R*-R

but then we need to specify the method by which we choose the R* point estimate in order to evaluate the frequency properties. Also, Wald’s theorem, Bayesian Decision Theory, and Why are we doing this?

Again, I could be confused. I mean, Frequentism basically always confuses me.

Ok thinking more about it… suppose we get

p(R | Data) using a Bayesian procedure and a choice of prior p(R)

now we form INT(p(R | Data), alpha) as the 1-alpha high probability density interval from this.

Now we demand that under repeated Data collection 1-alpha of these intervals contain R for all alpha?

I think that must be what I’m missing. It’s not the point estimate that is of interest, but rather interval construction procedures.

Now, suppose I have an informative p(R) such that when I set alpha = 0.05 I always get that 99.99999% of my intervals contain R. This is considered obviously undesirable?

Suppose I have p(R) such that my 95% intervals tend to be +- 0.001 for a parameter whose value is R = 1 but only 94.75% of the intervals contain the true value. That’d be … bad? Suppose there’s a proof that the smallest an interval can be and reliably contain 95% of the samples in repeated sampling is width = 1

so by giving up calibration under repeated sampling, I can reduce the size of my interval by a factor of a thousand, but I’m a tiny bit miscalibrated, and that would of course be … bad?

Perhaps we’re just interested in “There exists a fixed p(R) such that all p(R | Data) intervals are approximately calibrated to within 1%” as a mathematical fact… ?

My principled approach is: I want a procedure that tells me how much credence I should assign to the idea that R is a certain value after seeing some specific actual data that some people actually collected.

And, from that perspective, the existence of a choice procedure Choose(p(R)) such that p(R) coming out of Choose(p(R)) causes INT(p(R|Data),alpha) to be a confidence interval under repeated future data collections… doesn’t seem that relevant.

Daniel, I went through the paper again and got a better sense of what’s going on.

Start with single data set case. Use flat priors for N and D and get a posterior for, not

R, but a pivotal quantity calledzin the paper. Approximate this posterior with a nicer distribution (the standard normal distribution is pretty nice!). Do a change of variables fromztoR. Call the Jacobian of the transformation the “prior” (even though it’s data-dependent) and call the part you get by direct substitution (still looks standard-normal-ish, butRdoes does double duty: it’s the center and the variance is a quadratic function of it) the “likelihood” even though it does not contain all of the data-dependent bits. Usefully, this “likelihood” is has no dependence on N and D other than through R. (Calling this bit the “likelihood” isn’t quite as arbitrary as it sounds — other research that is just as confused from a Jaynesian point of view has also derived this expression as some kind of likelihood sorta thing.) Credible intervals from this posterior have good frequentist coverage — yay! (This is an actually posterior albeit approximate, so no scare quotes.) It’s important to note that the “prior” depends not only on the data but also on the variances of the measurements of N and D.Now consider a second data set. Multiply the posterior from the first data set with the “likelihood” from the second data set. It doesn’t have good frequentist coverage — boo! If you flip this around and use the posterior from the second data set and the “likelihood” from the first data set, it also doesn’t have good frequentist coverage — boo again! But if you take the product of the two “likelihoods” and do the Jeffreys prior thing (square root of the Fisher information, and s) and treat that as the “posterior” (scare quotes!)

thathas good frequentist coverage — yay!Oh what I wouldn’t give for an “edit comment” button.

Corey, hey the whole thing at least let me show what a badass piece of software Stan is. I’m honestly shocked that it did as well as it did with the “pseudo-delta-function” problem I handed it. Obviously your reparameterization (also suggested by Elrod above) is better, but it’s kind of shocking that Stan can fly down a special N dimensional tunnel 0.001 units wide without any input from me.

Seems familiar, N and D are un-shared nuisance parameters and R is the only shared parameters.

So Frequentistically, you will want to evade any between study dependence on N and D and get a pivot just involving R.

(In a Bayesian approach you will want to avoid any sharing between studies for N and D – no shrinkage – and get a marginal fully shared posterior for R)

Also reminds me of this skirmish between Fisher and Lindley https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15279/1/283.pdf

Personally I don’t find this line of research profitable.

> you want to understand clearly how each potentially controversial noisy experiment affects the overall picture of what’s going on.

There is this way of doing that (the last journal I submitted this to said the method was nice but there was not enough technical development to warrant publication in their journal) http://andrewgelman.com/2011/05/14/missed_friday_t/#comments

Nic:

I do believe its better to strip off whatever priors were used or implied in the studies and focus on the likelihoods (or even better still the raw data if its available).

Combining likelihoods (whether just the reported likelihood functions from the studies or obtained by re-analyzing the raw data) is trivial and has not changed much from when Laplace did it – they multiply if independent (or after conditioning if not independent). I first learned that from Don Fraser in the first stats course I ever took.

Believe the real challenge is to get a combined prior for a combined analysis that combines with the combined likelihoods.

(One of my colleagues is working on that with a graduate student right now).

Additionally the consensus at https://www.samsi.info/wp-content/uploads/2010/09/final-report-meta.pdf was that the role of a prior would be minimal other than informative priors on between study parameters. Within study common parameters almost always will have very concentrated likelihoods.

Finally, I have no problem using a wooden or silicon model for a mouse that is not either if – it suits the purpose. The same with a random model for an unknown fixed parameter.

Nic: You might find some of this useful – https://phaneron0.wordpress.com/2015/08/03/dphil-thesis-on-meta-analysis/

Kevin: thanks, I’ve downloaded your thesis and will take a look.

I agree that the likelihoods should be focussed on; as you say likelihoods are easy to combine when the data is conditionally independent. In the case of estimating climate sensitivity (ECS), the likelihood functions are typically pretty wide, with the likelihood often remaining a material fraction of its peak value even at extremely high ECS values, so the choice of prior has a major impact on the resulting ECS estimates.

“Believe the real challenge is to get a combined prior for a combined analysis that combines with the combined likelihoods.”

That is what my approach is designed to achieve, looking at a particular case in detail, from an objective Bayesian viewpoint.

However, one can of course infer a central estimate and uncertainty ranges – indeed, a full approximate confidence distribution – directly from the combined likelihood using a frequentist likelihood ratio method, either in a simple form or modified to provide superior probability matching. In my paper, I show that the combined-experiment inference results from doing so, using the simple signed-root likelihood ratio method, are almost identical to those using my objective Bayesian method. I offer some suggestions in the paper as to why this is the case.

Thanks.

> However, one can of course infer a central estimate and uncertainty ranges – indeed, a full approximate confidence distribution – directly from the combined likelihood

Doing anything other than that (getting the combined likelihood) unless that can’t be done – doesn’t make sense to me.

And having done that, its no longer about combination per se but rather getting _good_ uncertainty intervals.

So this statement in your paper seems _wrong_ “the standard Bayesian method for using likelihood

information to combine evidence from independent sources (Bayesian updating)” as

p(u|s1,s2) = p(u) * p(s1) * p(s2) or p(u) * p(s2) * p(s1) or p(u)^(.5) * p(s1) * p(u)^(.5) p(s2) for s1 and s2 independent.

With nuisance parameters its more complicated but “Fortunately, these lower dimensional foci (marginal posteriors) can be split into marginal prior and marginal likelihood (Berger, et al 1999). The marginal posterior obtained from integrating the full dimensional posterior is proportional to the marginal (integrated) prior times the marginal (integrated) likelihood.” http://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf

Sorry – missed part of your statement “the standard Bayesian method for using likelihood information to combine evidence from independent sources (Bayesian updating) differs from frequentist approaches”

I wonder if you have seen this paper? https://osf.io/2ukxj/ When are sample means meaningful? The role of modern estimation in psychological science.

It seems to me that, while they know quite a lot of statistics, the authors are not writing from a statistical perspective. They do not acknowledge the central role the mean plays in statistical theory. Maybe I am missing the point, but in the intro it seems like they are trying to reinvent the wheel wrt ANOVA. I get that sometimes shrinkage estimators are more desirable, but sometimes we need an unbiased estimator! It seems silly that they propose doing away with putting much emphasis on the arithmetic mean in primary school. I mean, they actually state “In our view, sample means, sample medians, and modes should not be taught as descriptives.” haha

Would be interested to hear criticism from a statistician, but I understand if you’re busy. Thanks.

That osf site is far too fancy for its own good. They are excluding a lot of people by having so much unnecessary junk going on.

Nic, while I have your attention: in my opinion, Williamson 2009 is riddled with errors and no one should cite it ever. In particular, the argument against diachronic Dutch book as a justification for conditionalization begs the question. He writes:

But we already know that Dutch books can only be avoided via standard Bayesian updating and we also know that under such updating the prior expectation of the posterior is equal to to the prior — that is, Bayesian updating induces a martingale. So “if it is generally known… that your degree of belief in

awill not decrease” then it’s immediately obvious from the martingale property that it can’tincreaseeither — if one is to avoid being Dutch booked. So in effect Williamson demonstrates that evidence that is known to be (prior-)probabilistically independent ofacan have no effect on one’s degree of belief ina, and from that somehow draws the conclusion “don’t trust dynamic Dutch books: it can be more important to change degrees of belief to reflect new evidence than to avoid certain loss.” (Needless to say, Williamson does not accept that this is question-begging [private communication].)Corey: I note your opinion of Williamson 2009. Whether it is flawed or not, it seems clear to me that Bayesian updating is not in general satisfactory for combining inference from different experiments / sources of evidence about a fixed but unknown, continuously-valued parameter, if one is concerned about the results accurately reflecting the observational uncertainties.

One last thing: the quantity you’ve called the likelihood is not, contra Efron, actually a likelihood that would be used in standard Bayesian methodology. (Notwithstanding that Efron’s a giant of the field, pretty much everything he’s done that’s Bayesian-looking should be filed under “Efron-style empirical Bayes” and not “standard Bayesian methodology”). Likewise for the prior; in fact, the flat priors for (in your notation) phi_1 and phi_2 already induce the

actual(improper) joint prior for theta and phi_2. If I’ve done my Jacobian correctly it’s proportional to the absolute value of phi_2; when considering more than one data set, it would be the product of the absolute value of all of the individual phi_2 parameters. Standard Bayesian methodology would be to integrate all the phi_2 parameters out of the complete likelihood with respect to this prior. The integrals will factor, and my scribblings suggest that they’ll by analytically tractable, so you’ll get one integrated likelihood* term per data set, satisfying your desire to know the contribution of each data set to the final posterior. And since the joint prior is flat w.r.t. theta, the posterior is just the (normalized) product of all of those integrated likelihoods. This is the distribution from which Daniel’s Stan code will generate samples (modulo the small amount of slack in the constraint that forces the ratios in the two data sets to be equal).Perhaps tonight I’ll use Wolfram Alpha to give me the expression for the integrated likelihood term and then I’ll subject it to the same kind of investigation of its frequentist coverage properties that you carried out for your version of Bayesian updating.

* some people call this the marginal likelihood, but other people use the phrase “marginal likelihood” for a different quantity

Corey:

That’s what this was about – “Fortunately, these lower dimensional foci (marginal posteriors) can be split into marginal prior and marginal likelihood (Berger, et al 1999). The marginal posterior obtained from integrating the full dimensional posterior is proportional to the marginal (integrated) prior times the marginal (integrated) likelihood.” http://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf