Thinking out loud about the N=3, n=1 case…

From a pure likelihood perspective, and considering the ‘full parameter’ problem, I wanna say that if we observe the first element of c and we know that c_i in {0,1} then

(1,0,0), (1,0,1), (1,1,0), (1,1,1)

have the same normalised likelihood value of 1, as above, and all other combos e.g. (0,…) have likelihood 0.

Suppose the goal is inference for the sum of c, theta(c), from a pure likelihood POV. Then we can write the full parameter as c = (theta(c), c’) where c’ = c-theta(c) i.e. we substract the (scalar) sum theta from the (vector) parameter c to get the (vector) differences c’.

Then we can write L(c) = L(theta,c’).

To do inference about theta we consider the profile likelihood

Lp(theta) = sup_{c’} L(theta,c’).

This is simply the flat likelihood again for each compatible theta. I.e. the profile likelihood says that theta = 1, 2, 3 are ‘equi-possible’ (profile likelihood can be interpreted as a ‘possibilistic’ inference a la https://en.wikipedia.org/wiki/Possibility_theory).

Alternatively, if I was to try to do things ‘predictively’ I might decide to randomly (uniformly) draw from the set of remaining consistent vectors above to compute the completion.

In this case theta = 2 would have probably 1/2 and theta = 1 and theta = 3 would have probably 1/4, since (1,0,1) and (1,1,0) both give theta = 2.

But in neither case do I get your answer of theta = 3 being the most supported. Perhaps this is related to whether the reduction is carried out before or after the inference? If so, reminds me a little of the marginalisation paradoxes.

]]>To make it clear: let’s say there N=3 points, we observe one single point and its value is 1.

P(one observation, the value is 1| theta) = P(the value is 1| one observation, theta) * P(one observation)

The second term is trivial and if the values are in {0,1} we can calculate the first term. For example, for theta = 1 the equiprobable cases are

(1,0,0), (0,1,0) and (0,0,1) so the probability of the value being 1 (getting 1 in the first position) is 1/3. For theta = 2 the equiprobable cases are (1,1,0), (1,0,1) and (0,1,1) and the probability of the value being one is 2/3. For theta = 3, the probability of the value being 1 is 1.

The likelihood function is (ignoring the constant term):

P(theta=0| one observation, the value is 1) = 0

P(theta=1|…) = 1/3

P(theta=2|…) = 2/3

P(theta=3|…) = 1

Fixing theta doesn’t determine the values of the c_i’s in the general case. For example, if the values are in {0, 1/2, 1} we cannot get the probability of observing the data for theta = 2. The possible combinations are (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1/2, 1/2), (1/2, 1, 1/2), (1/2, 1/2, 1). If we have two 1’s and one 0, two out of three are compatible with the data. If we have one 1 and two 1/2’s, one out of three is compatible with the data.

]]>Hmm well as you say the problem is incomplete if stated only in terms of the mean. I would think the full parameter is required for it to be a generative model. But I’ll have another think about your answer.

]]>I think we agree that the likelihood function for what you call the full parameter (the vector of c_i’s) given the observed data (s_i indentifies the points observed, and d_i the corresponding values) is

L(c_1,…,C_N;d_1,…,d_N,s_1,…,s_N) = L(c_1;d_1,s_1) * … * L(c_N;d_N,s_N)

where each factor L(c_i;d_i,s_i) is equal to p^s_i*(1-p)^(1-s_i) = (1-p) when s_i = 0

and is equal to p^s_i*(1-p)^(1-s_i)*delta(s_i,d_i) = p*delta(s_i,d_i) when s_i = 1.

Knowing some of the c_i’s doesn’t give any information about the unobserved c_i’s unless the model includes some codependence between them.

In a Bayesian analysis, you can always make inference about the uncertain values because you have a distribution for them (equal to the prior if the c_i’s are unrelated, or informed by the observed c_i’s if the distribution depends on a common hyperparameter, for example).

> You do inference for the full parameter and then consider the implications for functions of the full parameter like mean(c).

Why? You can do much better by writing the model in terms of the parameter of interest theta = mean(c).

> I think the key point here is that for this problem the ordering actually does matter

It does, but it doesn’t. Randomizing which fixed values c_i you observe is like sampling the observed values from the distribution of c_i’s (without replacement).

The estimator for theta proposed by Wasserman is based on that kind of sampling without making it explicit.

I showed how the correct likelihood function for the parameter of interest L(theta) can be easily written down and used to calculate a better (maximum likelihood) estimator.

]]>> And to be for it to be truly capable of generating data it needs a prior to first generate the parameters that need to be set.

This makes no sense to me. Eg

p(y | theta)p(theta | lambda)

just requires one to set a lambda etc. Infinite regress and all that.

I’d rather just say p(y;theta) specifies a distribution over y for any choice of theta. It defines a family of generative models.

Almost every simulation code works like this. Eg a PDE code that takes a diffusion coefficient and some ICs/BCs and gives an answer.

A sensible inverse problem would be to ask eg what ICs give results consistent with observations and you can just plug-in a series of values to determine the inverse set image:

{ICs | F(ICs) approx = observations}

The model is perfectly generative for each choice of ICs and you don’t need to choose the ICs to plug in in any ‘random’ way – you could eg use a for loop.

]]>But the likelihood should be written as a function of the full parameter before any restriction is considered. You do inference for the full parameter and then consider the implications for functions of the full parameter like mean(c).

> If we consider the sufficient statistics S (the number of observed values) and X (the number of ones observed), the correct likelihood function is…

I think the key point here is that for this problem the ordering actually does matter – we assume that we actually do observe e.g. the first component of c, the second component of c etc, not just the c value of an unknown unit. The only thing that is random is the choice of _which_ unit we observe.

And this, I think, is why the likelihood is flat/deterministic (see response above to Keith).

]]>I missed this.

I think Wasserman defines it in a roundabout way, but the idea is roughly correct I think, subject to the zero/one requirement. But this zero/one requirement arises because of the point you make – we observe the c values.

In (sloppy) detail:

You have a set of units indexed by i available to take measurements on, and each measurement has a value c_i.

You observe only n of the units, i.e.

in = (i_1,i_2,…,i_n)

Corresponding to this is

cn = (c_i_1,c_i_2,…c_i_n)

The given data are the vectors

(in,cn)

That is, you know which units where observed and what their c values were.

Assuming the c are deterministic values of the i – i.e. they are measured without noise – then the probability of observing those c that you did, i.e. cn, is simply the probability of observing the units you did, i.e. the probability of in. Which is the ‘randomisation’ probability or whatever you call it.

But there is a key point here: the deterministic function needs to be included in the likelihood. Write cm = c(im) for the function giving the cm vector corresponding to any im vector. Intuitively: the c values corresponding to the actually observed units.

Then

p(observed data| full parameter)

= p(Cn=cn,In=in|C=c)

= p(Cn=cn|In=in,C=c)p(In=in|C=c)

Now

p(Cn=cn|In=in,C=c) = 1 if cn = c(in)

and

p(Cn=cn|In=in,C=c) = 0 if cn != c(in)

Hence the likelihood is

p(In=in|C=c) if cn = c(in)

and

0 if cn != c(in)

The component p(In=in|C=c) is constant for the randomised case and can be normalised to 1.

Intuitively: the likelihood is zero for those ‘full’ vectors c that don’t match the ‘partial’ actually observed vectors cn.

For example, if I observed

cn = (4,5,6) and in = (1,2,3)

then

c = (4,5,6,1,2) and i = (1,2,3,4,5)

has likelihood 1 but

c = (4,1,6,5,2) and i = (1,2,3,4,5)

has likelihood 0.

Intuitively: by observing the first three components of c we know that the full c can only be one with those three components. We have learned something.

For inference about the mean of c then we need to consider all possible completions of c that are consistent with the observed data.

One could report all possible completions, report the max and min based on bounds on the c values or report a distribution over the mean of completions induced by a distribution over the remaining c values. Etc.

]]>I think one could argue what Larry gives is not a likelihood or at least it is a degenerate likelihood.

“The likelihood is the probability of the observations for various values of the parameters … A more formal definition of likelihood is that it is simply a mathematical function L(theta; y) = c(y) Pr(y; theta). The formal definition as a mathematical function though, may blur that the likelihood is the probability of re-observing what was actually observed.” http://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf

Larry is taking the ci as being available for taking their average – so those are the “actually observed” – not the Si. The Si just indicate the selection of which ci to “actually observe”.

I have used data generating model as a synonym for likelihood in the past and I am starting to think I should stop doing that – at least if likelihood is being understood as something other that a probability model of how the “actually observed” came about. And to be for it to be truly capable of generating data it needs a prior to first generate the parameters that need to be set. Together the prior and data generating probability model should be able to emulate the reality that we have no direct access to, but wish to act in without being frustrated by it.

]]>It’s true that if the model assumes that the constants are independent we cannot infer anything about the non-observed parameters (and I think the independence assumption makes more sense than the “super population” model in this case). But the parameters of interest are not the c_i’s !

In fact the likelihood function that we have to write is L(theta). The function Wasserman gives is incomplete, making inference impossible, but we know how to write the probability of the data given the parameter.

If we consider the sufficient statistics S (the number of observed values) and X (the number of ones observed), the correct likelihood function is

L(theta) = p^S (1-p)^(N-S) binomial(S,X) binomial(N-S, N*theta-X) / binomial(N, N*theta)

where p is the probability of observing each value and the binomial factors are the number of arrangements of X observed ones out of S observed values, N*theta-X non-observed ones out of N-S non-observed values and N*theta ones out of N values.

Keeping only the terms which depend on theta, we find that the likelihood is proportional to

L (theta) ~ (N*theta)! / (N*theta-X)! * (N-N*theta)! / ((N-N*theta)-(S-X))!

= N*theta * (N*theta-1) * (N*theta-2) * … [ X terms ] … * (N-N*theta) * (N-N*theta-1) * (N-N*theta-2) * … [ S-X terms ] ….

One can easily estimate theta using maximum likelihood. There is no need to create ad-hoc estimators.

We can also use this likelihood to do a Bayesian analysis. Of course, a prior distribution for the parameter theta will be required.

]]>This tells us something: some c that were possible before the data are no longer possible after the data. Basu also discusses this in detail in the link above.

The Bayesian posterior is the restriction of an additional assumed ‘super population’ model p(c) to this set. If the c are independent then nothing is learned except what the likelihood tells you as an indicator function for possible c values. When they are assumed to arise from a common source then observing some of c can tell you about the unobserved values, but not otherwise.

In general, I basically see the likelihood function as the indicator function for the inverse set image of the data ie the indicator function for {theta | f(theta) = y0}. Except for probably models we need a real valued membership value p(y0;theta) for theta instead of a 0/1 indicator. It’s been noted in the literature that pure likelihood theory is basically the same as fuzzy set/possibility theory, presumably for this reason.

In the present case there is no probability model for the c and no specified dependence so all we can do is report the inverse image of the data, ie those parameters that are consistent with the observations. This is a 0/1 likelihood function playing the role of indicator function.

Though if you have a bound on the other c values you can use this to deterministically bound what the mean of any completion of c would be. Or you could construct a probability model for the completion of c, which is pretty much Bayes.

]]>When S_i = 1 the likelihood includes a term which is 1 for c_i = d_i and 0 otherwise (when S_i = 0 the likelihood term does not depend on c_i). We can use maximum likelihood to estimate the values of c_i for the terms with S_i = 1 and the answer is unsurprisingly c_i = d_i. But we cannot estimate the values of c_i for the terms with S_i = 0.

A) We are asked to estimate theta the mean of all the c_i values. The estimator proposed is unbiased but has a lot of variance and other undesirable properties. It can give estimates higher than 1, even though we know that the mean of values from {0,1} has to be in the range [0,1].

B) A better estimate would be to normalize by the actual number of observed points sum(S_i) and not the expected number of observed points (the parameter PI which is assumed known). Essentially, we take the observed average as representative for the whole population: theta = sum(S_i*d_i)/sum(S_i).

The only issue is what to do when all the points are hidden, as we would be dividing by 0 and the sample average is not defined. If this case is excluded, the estimator is unbiased and has lower MSE. If we set some value (like 1/2) in the cases where the estimator is not well defined there may be bias but vanishes as N grows.

C) What about the Bayesian treatment? If we model the constants as being independent, the prior for each c_i will be 0/1 with 50%/50% probability. For the observed points, the posterior is d_i and for the others the posterior is still 0/1 with 50%/50% probability.

We can calculate the posterior distribution for the sum of the constants (N*theta) which will be the binomial distribution for the uncertain constants Binom(size=N-sum(S_i),prob=0.5) plus the sum of the constants which are precisely known sum(S_i*d_i).

The expected value of this distribution is simply the sum of the observed values plus 0.5 for each unobserved value. The posterior expected value for theta is the weighted average of the sample mean for the observed values and 1/2 for the unobserved values.

D) An alternative Bayesian calculation would be to assume that the prior for each constant is a binomial with probability p unknown. In that case the observed constants are precisely defined and for those that remain uncertain the posterior can be updated (as long as any values have been observed, of course).

The posterior distribution for theta will be more complex in this case but if I am not mistaken when the prior for p is flat we recover the same estimator given in B) above. The expected value for each unknown constant is p, which we estimate as the average of the observed values. The expected value for theta is therefore equal to the observed average: theta = sum(S_i*d_i)/sum(S_i).

]]>See Section 7.6 of BDA3 for more than you’ll ever want to know about this topic!

]]>In Wasserman's example he uses cj in {0,1} which he states 'is not important', which seems misleading to me: the range of the values, e.g. the a and b in [a,b] appear explicitly in Hoeffding’s inequality. Using 0 and 1 means the dependence on these constants is hidden.

On the other hand, even assuming e.g. cj in [a,b] is a far weaker assumption than assuming a specific probability model.

]]>Yup, that’s math. No conclusions without assumptions.

]]>If I’m understanding correctly, the key steps are:

– Introduce (in the notation above) an additional model p(c|theta), effectively treating c as data rather than parameter (though of course this distinction is blurred in Bayes)

– Do inference for the ‘superpopulation’ parameter theta (e.g. add a prior p(theta) and get the posterior)

– For the inference about c this is a ‘data’ or ‘predictive’ inference so you compute the posterior predictive for c under the theta posterior.

Really the key step is introducing p(c|theta). This connects to what I said above:

> Without further assumptions/constraints on how the c_i relate to each other though it is difficult if not impossible to generalise from observed sample to population

From a non-Bayesian perspective without a probability model p(c|theta), you still would seem to need some sort of constraint on the parameter space, e.g. F(c,theta) = 0 I guess, or at least regularity conditions, in order for the observed c to reliably inform you about the unobserved c.

]]>https://link.springer.com/chapter/10.1007/978-1-4612-3894-2_11

]]>In chapter 8 we give a Bayesian derivation of the classical estimate. The Bayesian estimate is as simple as the classical (“frequentist”) solution. For more difficult problems such as I deal with in my own research in political science, the classical solution won’t work.

]]>Which doesn’t mean wrong of course. And this all links back to design vs model based approaches and eg the role of randomisation (which you discuss but still restrict to particular uses in a Bayes framework). So I find examples like this interesting.

]]>From a Bayesian standpoint, the notation at that link is basically irrelevant. In the Bayesian world, all unknowns (whether they are parameters, predictive quantities, or whatever) have a joint probability distribution. Terms such as “likelihood” or “prior” are just for convenient. Mathematically, you have a model, you have p(everything), and your posterior distribution is p(everything unobserved | everything observed). I’m not saying it’s all trivial—there are challenges for example when considering the properties of Bayesian methods when the model is wrong—but the notation at that linked article doesn’t really do anything for Bayesian inference because it does not assign a probability distribution to the unknowns. The linked article represents a different way of performing inference for survey sampling, which is fine, but it doesn’t pose any problem for a Bayesian approach.

]]>I’m not quite sure what is your question. You refer to “these sorts of difficulties” but I don’t see any difficulties at all to Bayesian inference from simple random sampling. It’s a straightforward problem whose straightforward solution we explain in the book!

]]>Without further assumptions/constraints on how the c_i relate to each other though it is difficult if not impossible to generalise from observed sample to population. Eg if c=(1,2,99999999) and you happen to observe the 1 and 2 I doubt you’re gonna guess 99999999 without further info.

The frequentist solution, perhaps, relies on the idea of picking a representative sample with high prob? But could still be badly influenced by outliers or irregularity?

]]>Yes, we cover Bayesian inference for survey sampling in chapter 8 of BDA3 (chapter 7 of the first two editions).

]]>Yes the upshot of the example is, I think, that in this case the likelihood is irrelevant to the problem. But we can estimate the quantity of interest quite straightforwardly using non-likelihood-based methods.

So say the probs are all 1/2. What does the Bayesian do? Do you cover a similar example in eg BDA3?

]]>I guess the point where we depart is whether or not things like n and p are data or if they’re design information. I think of them as the latter, so priors that depend on n and p are not “design dependent”. X is a different matter, but I think there are a lot of strong reasons to scale your model so parameters (and hence priors) are O(1), even if you feel this introduces design dependency.

As for those quotes – well, I’m not sure they’re different things. The quote from the paper is definitely applicable here: you are seeing your prior on the sub-data scales and so your inference will be sensitive to that aspect of the prior. In the post, I’m (clumsily) trying to argue (back with simulations from Betancourt) that if your interest is prediction, then you will do much better cutting off those parts of the posterior. The key thing you see in the simulations is that the behaviour at those ridiculous scales does not just wash away at aggregate, but does effect point summaries as well as simulations.

It really comes down to the idea that design information isn’t data, it’s meta-data. I don’t see the philosophical objection to conditioning on meta-data, and you get some serious practical benefits. (Like you can understand exactly why the Bayesian Lasso doesn’t work)

> Bayesian models don’t overfit … essentially because they don’t fit .. they condition..

They really can over-fit! For conditioning to prevent over-fitting, you need the joint to not have much mass on predictions that do not generalize, which really needs you to specify your prior correctly.

>are both of the use of the word “small” here correct?

Nope. The second one should be large. That damn bottle of wine.

> Broadly isn’t Kevin van Horn right?

Well… I don’t think either of us are completely right. This one is a hard topic and we’re viewing it through two very different lenses. My point is not that these models are somehow nested, but rather information about how big the model is (meta-data) is necessary to set priors.

]]>\prod P(C_n|S_n,\theta) P(S_n|\pi)

Where C takes 3 values 0, 1 and unobserved (and is unobserved iff S=0).

]]>I followed the link. Regarding the example, it would be even cleaner if the probabilities were all 1/2 because in that case the likelihood function is just a constant. Or, even simpler, if the probabilities were all 1: In that case you have an entire census, and again the likelihood function is a constant and is irrelevant to the problem entirely. No need to bring up Hoeffding’s inequality or whatever!

]]>The rest of the post (and the last one) are fascinating.. but so much of it seems counter to my understanding its a bit difficult to respond to. Or rather its a bit hard to respond to only one point..

Surely the procedure you describe in the previous post would create a probabilistic specification P(Y_1,…,Y_M|X_1,..X_M) which is not a marginal of P(Y_1,…,Y_M+N|X_1,..X_M+N).. Broadly isn’t Kevin van Horn right?

“In this case, priors that do not recognise the limitation of the design of the experiment will lead to poorly behaving posteriors. In this case, it manifests as the Gaussian processes severely over-fitting the data.”

This quote jars a bit (and I understand its a humorously written post done in your spare time).. but to quote your paper which I find really clear and insightful:

“In such cases, the posterior eventually concentrates not on a point but rather around some extended submanifold of parameter

space—and the projection of the prior along this submanifold continues to impact the posterior even as

more and more data are collected.”

Isn’t that what is going on here.. the likelihood is flat in a significant region of the parameter space and in that region the posterior will equal the prior.. We may be talking past each other due to terminology.. but I don’t see why this should require an adjustment of the prior – in fact isn’t an argument for leaving it alone?.. The posterior will contain all the complex functions that lie appropriately close to the data which in aggregate is a highly regularised function. I am sure I could dig up an old Radford Neal or Savage quote .. Bayesian models don’t overfit … essentially because they don’t fit .. they condition..

The big point that confuses me in the two posts is if you are defending using data dependant priors from a foundational point of view.. or if you are employing them as a heuristic device..

“Of these two scenarios, it turns out that the inference is much less sensitive to the prior on small values of \kappa (ie ranges longer than the data) than it is on small values of \kappa (ie ranges shorter than the data).”

are both of the use of the word “small” here correct? I don’t mean to be pedantic.. I just don’t follow..

Thanks for such an interesting post! Apologies that I may have missed some things.. I will continue to reread and think about it..

David

http://www.stat.cmu.edu/~larry/=stat705/Lecture6.pdf

from a Bayesian perspective.

]]>Or I could just be having fun.

]]>Don’t take the comments too seriously. Just as we contribute to the general public discussion by blogging for free, and in exchange we expect people to take our ideas seriously but to cut us some slack on our idiosyncratic ways of expressing ourselves, so should we recognize that our commenters are contributing to the general public discussion by commenting for free, and in exchange we should cut them some slack and allow them to give us a hard time on occasion, as long as they also other times contribute to the intellectual discussion,as most of our commenters do.

This is not to say you shouldn’t talk back to commenters—heck, I even respond to trolls sometimes!—just that it’s a privilege of commenters to have some fun from time to time, as that’s one of the things people can do to make commenting a more appealing pastime. And, as I expect you’re aware, our comment section is wonderful, arguably the best on the internet.

]]>Of course suggesting this in the final paragraph of such a wordy post is very me.

But mere wordiness was never the problem I had with your style. Perhaps — rather contrary to my expectation — prose purplosity varies *inversely* with inebriation, in which case all I can say is…

(On this, BDA is right, but at tension with a lot of what is taught)

]]>Regarding your remark, “Design matters even if you’re Bayesian.” Why “even”? Design is central to statistics; it means as much to Bayesians as to anyone else. See chapter 8 of BDA (chapters 7 of the first two editions), for example.

]]>