A mapping from ‘background’ knowledge to probability models

Knowledge -> P(Data,Parameters)

can be considered as a family of probability models indexed by ‘knowledge’. If you are confident that only one is correct and you just don’t know which, then a probability measure over these would likely make sense. Then you have

P(Data,Parameters | Knowledge)

and

P(Knowledge)

and knowledge is ‘within’ the overall model too. If not, you still have an indexed family available (e.g. a likelihood function). You can also use Bayes to update within the model

P(parameters | data, knowledge)

etc where background knowledge always stays on the right hand side.

But the same issue arises with non-identifiable parameters – they are problematic to update _within_ the probability model because you no longer satisfy normalisation/additivity etc – multiple inconsistent parameter sets can all have possibility one, but they cannot all have probability one. When you start dealing with large parameter spaces, enforcing the ‘true but unknown’ assumption required for probability becomes more questionable. You can still fall back on possibility though – possibility being essentially probability without normalisation, being e.g. maxitive rather than additive.
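The additive/maxitive contrast can be made concrete in a few lines of Python; the three-point parameter space and its weights below are invented purely for illustration:

```python
# Toy contrast between an additive (probability) and a maxitive
# (possibility) measure on a made-up three-point parameter space.

prob = {"theta1": 0.5, "theta2": 0.3, "theta3": 0.2}   # must sum to 1
poss = {"theta1": 1.0, "theta2": 1.0, "theta3": 0.4}   # several can be 1

def prob_of(event):
    # Probability is additive over disjoint alternatives.
    return sum(prob[t] for t in event)

def poss_of(event):
    # Possibility is maxitive: Poss(A or B) = max(Poss(A), Poss(B)).
    return max(poss[t] for t in event)

# Under probability, at most one parameter value can carry weight 1.
assert abs(sum(prob.values()) - 1.0) < 1e-9

# Under possibility, two mutually exclusive parameter values can both be
# fully possible -- there is no normalisation constraint to violate.
assert poss_of(["theta1"]) == poss_of(["theta2"]) == 1.0
assert poss_of(["theta1", "theta2", "theta3"]) == 1.0
```

Dropping normalisation is exactly what lets two incompatible parameter sets each sit at the top of the scale, which is the situation non-identifiability forces on you.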

Bayes works by p(Data,Parameters | Knowledge) – that is, the joint probability over observed and unobserved quantities that we assign given a large database of stylized facts.

Now, I’m perfectly happy with the idea that the mapping from “Knowledge” to p(Data,Parameters | Knowledge) – that is, the assignment of formulas for probability distributions – is itself an imprecise problem which is NOT amenable to modeling through Bayes. This is kind of the Gödel incompleteness of Bayes: you have an uncertainty about what meanings to assign to the facts in your knowledge set which is not in general amenable to Bayesian calculation.

Is this the essence of what you have been bringing up? or at least close?

My view is that identifiability, the catchall, etc. are all entangled via implicit uniqueness assumptions. Having a prior that integrates to one amounts to assuming there is a true solution and we just don’t (yet) know which. This is particularly problematic in non-identifiable problems, for which there is inherently not a unique solution.

I don’t want priors to ‘fix’ identifiability issues – I want to explore identifiability issues. Bayes works well when you are confident there is a set of mutually exclusive, identifiable possibilities of which only one is true but you are a little quantitatively uncertain which it is.

RE: Cox. Yes, he looks at a slightly more complicated system involving multiple propositions and immediately has to drop one of the stronger conditions from the simpler system and come up with an alternative (here entropy).

This measure he’s got is basically “how much less information does your model have about what would actually come about than if someone told you the oracular outcome?” Or alternatively, “How sharply peaked would your model be around the real data if you had the right parameter values?” or even alternatively “how many bits per observation do you need to correct your model predictions to be exact?”

To the extent that your model, given the proper parameter values, predicts nearly perfectly, it will have not much less information than an oracle. That’s what’s meant by goodness of fit in his construction.

His construction makes more sense in a discrete case. It’s closely related to minimum message length versions of Bayes.
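In the discrete case, the “bits per observation needed to correct your model” reading is the Kullback–Leibler divergence from the model’s predictive distribution to the true one. A small sketch, with both distributions made up for illustration:

```python
import math

# "How many bits per observation do you need to correct your model's
# predictions?" -- for discrete outcomes this is the expected extra code
# length from coding draws from the true distribution with a code
# optimised for the model, i.e. the Kullback-Leibler divergence.

truth = {"a": 0.5, "b": 0.25, "c": 0.25}   # oracle / true frequencies
model = {"a": 0.4, "b": 0.4,  "c": 0.2}    # model's predictive distribution

def kl_bits(p, q):
    # KL(p || q) in bits: sum over outcomes of p(x) * log2(p(x)/q(x))
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

assert kl_bits(truth, truth) == 0.0   # a perfect model needs no correction
assert kl_bits(truth, model) > 0.0    # any mismatch costs extra bits
```

To the extent the model matches the oracle, the correction cost per observation goes to zero, which is the sense of “goodness of fit” in play.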

Cox’s project here is basically to derive entropy as a quantification of the informativeness of answers to questions in a way similar to the way he derives probability as a quantification of the plausibility of propositions. No one is talking about “all foundational mathematical and philosophical projects” — the question at hand is “how do we learn from data”. If people want to update possibility and necessity measures in light of data, more power to them, but I personally will not resort to that approach or other similar ones unless I can’t get Bayes to work in the same problem — that is, unless the Cox desiderata prove unsuitable to the task at hand. I keep an eye on those other approaches in case I need a fallback, but I haven’t yet. I’d be happier with possibility theory in particular if it had a well-developed decision theory whose application is comparable in difficulty to that of applying the principle “minimize posterior expected loss”.

“If you set the prior probability of kid A leaving it as 0.5, is the probability of ‘not kid A’ 0.5? Does ‘not kid A’ include ‘it fell out of my own pocket without me noticing’. Is ‘not kid A’ a well-defined proposition?”

This is the catch-all hypothesis issue, not identifiability (i.e., that “multiple mutually incompatible models can be equally consistent with your observables”). I’m happy to concede that the catch-all hypothesis issue is thorny philosophically but not too difficult to handle in practice; Christian and Daniel’s discussion below about forking paths is basically about this.

Here’s identifiability: consider a non-linear model that has a sigmoidal shape (e.g., a logistic function): as a function of the (1D real) predictor, the output is level for a stretch, then smoothly rises to a new level. There are four parameters; my preferred parameterization is: central point (that’s two parameters), slope at central point, and vertical range. (There will also be a noise model, but let’s assume it’s known.) For some data sets all parameters will be identified, but if the data don’t actually saturate on both ends and instead have a hockey stick shape or a straight line shape then the range parameter is lower-bounded but not upper-bounded and some directions of parameter space have flat likelihood functions. This isn’t even strict non-identifiability — it’s data-dependent — but all of the problems of non-identifiability can potentially arise. I’ve given a fair bit of thought to approaches for setting reasonable priors for this model.
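The data-dependent flat direction is easy to exhibit numerically. Below is a hedged sketch using one possible tanh parameterization (centre at (x0, y0), slope m at the centre, vertical range A); when the data look like a straight line through the centre, the log-likelihood is lower-bounded in A but essentially flat above that bound:

```python
import numpy as np

# One possible parameterization of the sigmoid described above: the slope
# at the centre stays m for any range A, and the curve saturates at +/- A/2.
def sigmoid(x, x0, y0, m, A):
    return y0 + (A / 2.0) * np.tanh(2.0 * m * (x - x0) / A)

def loglik(A, x, y, sigma=0.1):
    # Gaussian log-likelihood with centre and slope held at their true
    # values, looking only along the range-parameter direction.
    resid = y - sigmoid(x, 0.0, 0.0, 1.0, A)
    return -0.5 * np.sum(resid**2) / sigma**2

# Data with a straight-line shape: no saturation on either end.
x = np.linspace(-1.0, 1.0, 21)
y = x.copy()   # noiseless for clarity

ll = {A: loglik(A, x, y) for A in (1.0, 10.0, 100.0, 1000.0)}

# A = 1 forces saturation inside the data window and fits badly; above
# some lower bound the likelihood is essentially flat in A.
assert ll[1.0] < ll[100.0] - 50.0
assert abs(ll[1000.0] - ll[100.0]) < 1e-3
```

This is the sense in which the range parameter is lower-bounded but not upper-bounded: the data cannot distinguish A = 100 from A = 1000, so only the prior controls that direction.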

As ojm points out, Bayesian math isn’t the only math in the world (in particular it’s got nothing to say about existential quantifiers and the hyperreal number line or 2 person iterated games or computability or ….), but it IS the UNIQUE math that describes a real-valued unit-normalized weighting function over logical options.

A Bayesian *can* hack by trying 4 different models and then settling on a single one, pretending that the priors they had on the other 3 were small and the likelihood didn’t strongly favor them, so they are just approximating things by truncating out near-zero values. But they don’t have to: they have the math in place to allow a variety of options simultaneously. A person CAN drive a screw into a board with a hammer, but a person who doesn’t have a screwdriver has no other choice, while a person who admits the existence of screwdrivers does.

The math to allow a variety of options simultaneously just doesn’t exist in a frequentist testing scenario, because the weighting functions over different options are not admissible probability distributions for frequentism.

“I hereby declare that I have a mixture model in which I place a uniform prior over all Stan models that compile and run weighted by epsilon in addition to (1-epsilon) for whatever particular Stan model I’m running today, but am approximating this model by truncating epsilon to 0. There, now are you happy ? :~)”

Well, you’ve got to make an attempt to convince me that this approximation is any good. You’re in a bit of a catch there if on one hand you are happy to say that the prior probability for this-or-that is small enough that it makes sense to approximate it by zero to start with, but then later claim that it was still big enough to be picked out by data that didn’t behave quite as the model for which your approximated probability was 1.

I hereby declare that I have a mixture model in which I place a uniform prior over all Stan models that compile and run weighted by epsilon in addition to (1-epsilon) for whatever particular Stan model I’m running today, but am approximating this model by truncating epsilon to 0. There, now are you happy ? :~)

Seriously though, the forking paths issue in p values isn’t the same thing AT ALL. Here’s why.

Suppose I run an experiment, I do an analysis in which I place 100% prior probability on my data coming from some default frequentist distribution, say normal(m,s) and I do a test to see if m ~ 0 and… damn I can’t reject it, no biscuit for me.

So, I now go and look for some other stuff in my data, and then I place 100% prior probability on some interaction between hormones and shirt color and weather… and lo and behold I can detect nonzero mean… hooray, gimme that biscuit.

But, this isn’t the same thing at all as setting up a bayesian model in which there are broad ranges of possibilities such as maybe a positive interaction with weather, maybe nearly zero, maybe a negative interaction with weather, maybe some interaction with age, maybe some interaction with race or certain cultural backgrounds… etc etc.

When I broaden the range of possibilities, it makes it harder for the data to strongly pick out one possibility. Whereas when I take a sequence of trials in which I place delta functions on the one possibility for an interaction and simultaneous delta functions on no interactions of any other sort… I get lots of combinations to check, each one inconsistent with any coherent view.

Suppose you have a sequence of numbers which you have labeled A,B,C,D,E,F,G,H,I,J, and for some reason I believe that when sorted in descending order they decay rapidly. So I tell you “sort these in descending order and tell me the first 2 of them”; you tell me A,B, and I add them up and divide each one by the total… and I say there is 90% probability for A and 10% for B. But then you come back to me and say “well, C is almost as big as B”, and I know that my calculation was too imprecise and I have to go back and calculate more precisely. That’s all I’m suggesting, except that instead of actually sorting in descending order, in a Bayesian analysis we sort in what we SUSPECT is the descending order, so there’s even more chance we’d be wrong in our calculation. That’s just math, it’s not a special “license” or anything.
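The truncate-and-renormalise procedure being described can be sketched directly (the numbers are made up):

```python
# Truncated approximation to a normalised weighting, as in the A..J story.
values = {"A": 90.0, "B": 10.0, "C": 9.0, "D": 0.5, "E": 0.1}

def top_k_weights(vals, k):
    # Keep the k largest values and renormalise them to weights.
    top = sorted(vals.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(v for _, v in top)
    return {name: v / total for name, v in top}

# Truncating after two terms says 90% A, 10% B ...
w2 = top_k_weights(values, 2)
assert round(w2["A"], 2) == 0.9 and round(w2["B"], 2) == 0.1

# ... but C is almost as big as B, so the truncation was too coarse;
# redoing the calculation with three terms shifts the weights.
w3 = top_k_weights(values, 3)
assert w3["A"] < w2["A"] and "C" in w3
```

The Bayesian version is the same arithmetic, except the ordering of the terms is itself a guess, so the check-and-recompute step matters even more.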

Many “forking path” examples work exactly like this. You don’t get a significant p-value in the analysis you initially planned, but you discover that by combining some subgroups or applying certain transformations you can. “Broadening the set of possibilities” is exactly the issue.

Christian’s phrase “they are modes of thinking and communication” underlines why I prefer the term representation to model, information or knowledge – as there is much more in representations (i.e. abductions or good guesses).

Now, I used to make a distinction between aleatory and epistemic which (by an argument by authority from Andrew ;-) ) Andrew seemed to ignore. I am not sure anymore.

Let me put it this way, using two stage simulation as a conceptual way to represent _a_ _Bayesian_ approach.

Rep1. You have a representation of an empirical phenomenon with unknown parameters, a random model for how the parameters’ values were set/determined, and a random model for how observables came about and were observed/recorded.

Rep2. Given observations in hand, we restrict the representation to ones having exactly/approximately those observation values.

Rep1 gives a distribution of parameters’ values and Rep2 gives an arguably more appropriate distribution of parameters’ values.

Bayes theorem is just a diagrammatical representation of the relationship of Rep1 to Rep2 which allows us to experiment and think about it. Out of this we get labels like Bayes, prior, likelihood (noticed as ~ Rep2/Rep1), posterior, etc., but there is no justification for taking prior and likelihood as separate entities. So why are we worried about re-combining them?

I DON’T think it’s good practice to NARROW the model after data. This is the essence of forking paths… start with the obvious thing; if that doesn’t seem to get you the p value you want, look at something more specific and more specific until you find something (an interaction between interactions or whatever).

What I think is OK is to broaden the set of possibilities you’re entertaining to include new possibilities that you had initially thought were probably irrelevant (I assumed my evacuated tube didn’t have any leaks, but in fact there was always some possibility that the air steadily leaked back in at some slow rate… etc.).

The “assumed no leaks” is equivalent to a delta function on the leak rate at r = 0. The broader model is maybe exponential(1.0/rsmall).
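A minimal grid-posterior sketch of this broadening, with an invented linear leak model and made-up numbers throughout:

```python
import numpy as np

# Broadening the "no leaks" delta prior at r = 0 to exponential(1.0/rsmall).
# The linear leak model y = r * t and all numbers here are illustrative.
t = np.linspace(0.0, 10.0, 50)
y = 0.5 * t                       # data generated by a true leak rate of 0.5
sigma = 0.1                       # known measurement noise

r = np.linspace(0.0, 2.0, 2001)   # grid over the leak rate
rsmall = 0.05                     # prior mean: leaks, if any, are slow

log_prior = -r / rsmall           # exponential prior, up to a constant
log_lik = np.array([-0.5 * np.sum((y - ri * t) ** 2) / sigma**2 for ri in r])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

# The delta model is the r = 0 column alone; the broadened prior spreads
# weight over r > 0 and lets the data re-concentrate it near the truth,
# even though the prior strongly favoured rates near zero.
post_mean = np.sum(r * post)
assert 0.45 < post_mean < 0.55
```

The prior pulls the posterior mean only negligibly below 0.5 here because the likelihood is far sharper than the prior, which is exactly the “data’s job to re-concentrate things” point.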

Similarly, if there are discrete categories of things that were left out, you can put them back in without any problem; it always spreads the probability onto a larger set of possibilities, and it becomes the data’s job to re-concentrate things.

What wouldn’t be ok would be something like saying “hormone cycle affects what color shirt you wear” and then later when there’s no support for that say “hormone cycle affects what shade of red you wear”, that is, to add in some restriction on the set of possibilities which then necessarily sucks probability weight out of the expanded set and plops it down on your favorite sub-hypothesis.

Perhaps you don’t license yourself to do anything, but you’d be willing to do certain things that are unspecified in the beginning and you only figure them out after looking at the data. That’s forking paths for me.

Anyway, I don’t even disagree about whether this could at times be a reasonable thing to do. I’m just noting that when it comes to discussing frequentism, you insist that models should be perfectly true from the start (despite frequentists being happy to reject and adjust after looking at the data, which I am aware by the way comes with its own problems), whereas for the Bayesian you consider this apparently good practice.

We always simplify, because it’s too big of a task to put every possibility in at first, especially when some seem at first unlikely. But after data we can realize that, even though the prior said it was unlikely, the posterior would have picked out that option if we had included it.

I’m not really convinced by the idea that you set up a model, claim that this only models knowledge and not the data generating process, then you see some data from that process and claim, “oh, the knowledge I had before the data was actually different from how I initially set up my model” and change your prior. You may be right that formally one can show that under so-and-so conditions this means that you can only get things so-and-so wrong, but I’d expect that in practice one wouldn’t normally worry so much about these so-and-so conditions and whether they are fulfilled, and it’s hard to do that anyway if you licence yourself to change the prior model in all kinds of ways after seeing the data. Certainly it’s not transparent unless you specify everything in advance, otherwise it’s a fine Bayesian garden of forking paths, isn’t it?

I’d still claim that your Friday model is stronger than your knowledge. You don’t actually know that your observations on Friday are symmetric, you only don’t know any reason why they shouldn’t be. If you only take 10 observations on Friday, it may not matter. If you take 10,000, a good data analyst may well spot a difference that you have committed yourself to ignore by using exchangeability, and not because any proper knowledge forces you to do that.

Point being: 1/2, 1/2 and 1/2 instead of 1/3, 1/3 and 1/3 also satisfies everything except normalisation. But dropping normalisation blocks key steps of Cox’s argument.

Anyway, this feels pretty pointless – some people will never be convinced to consider alternatives once they’re committed to their axioms. I’ve tried many times to raise the simple possibility that the Jaynes-Cox argument is nowhere near as convincing of ‘the one true way’ as it seems to be made out. Maybe alternatives are also interesting?

So again, I’m not saying probability is not a useful tool; I’m saying there are many reasons why Jaynesian proselytizing falls on deaf ears.

I’ll leave Andrew to confirm, deny or remain silent on your and my interpretation of his philosophy.

Andrew’s been pretty clear on why he doesn’t like probabilities for models — it’s because they have a sensitive dependence on details of the prior that have negligible effect on inferences *within* the model. Also because the Occam factor argument that physicists tend to use to argue in favour of that dependence places a value on simplicity which is inappropriate in the social science context of the models he works with.

Regarding predicate calculus and quantifiers and such, all I can say is that I’ve never felt the lack.

“we may conclude that there is no analog, or at least complete analog, in the algebra of systems [of propositions], to contradiction in the algebra of propositions”

In his notation there is no A or ~A for systems of propositions. The lazy answer to this is ‘we only need propositional logic’, but that program seems to me to have failed in all foundational mathematical and philosophical projects.

If you have some doubts about whether there should be a trend or a change point or some such thing that eliminates exchangeability, then there’s nothing keeping you from going back and putting those in the model when they seem to be warranted after seeing the data. We’ve been over this before about how any given Bayesian analysis is always some kind of truncated asymptotic expansion of the full set of models you’d be willing to entertain, and when that expansion becomes singular you’re free to go back and add in the “missing” components.

Quickly – the point is that probability is an additive measure of uncertainty. This makes sense for observables. Not so much for unobservables, for which our uncertainty is often non-additive. This comes up all the time in practice when dealing with non-identifiable models.

Why did you say to use the ‘obvious’ equivalence class above? To restore additivity, it seems to me.

Why doesn’t Andrew like the idea of the probability of a model, preferring predictive checks? I’d argue it’s an at least implicit recognition of non-additive uncertainty.

“Once more, if you start from an exchangeability model, no observation whatsoever can get you out.”

This is simply not true. Suppose on Friday I evacuate a tube and use it to drop balls with several timers in the absence of air resistance. I decide my apparatus is well behaved and all the observations are exchangeable.

On Monday I come back and drop the balls in free air… there is absolutely nothing that prevents me from using this information to change the model I use for the errors in the Monday dataset. The Bayesian way is to condition on a set of knowledge, and the knowledge I have about the Friday dataset is different from the Monday dataset, and I can immediately apply it to my model. This holds for any and all future experiments that I do as well.

Exchangeability is a property of the knowledge set used for assigning the probability expressions.

If you follow Jaynes formally, this means that you define the model and some alternative in the framework of a supermodel, and goodness-of-fit testing shifts probabilities around between submodels of the supermodel, which in itself you can’t test. However, it may well be that Jaynes also made remarks to the effect that you can test the supermodel without setting up a Bayesian super-supermodel, which then means that you go out of the Bayesian framework. I have read the Probability Theory book but I can’t claim to be the very best expert on Jaynes and can’t claim for sure that he consistently sticks to the Bayesian setup overall.

What I mean is still that epistemic probabilities do not model the data generating process in itself, only a person’s knowledge about it, and this holds for “sampling models” as well, and Jaynes is quite clear about this in his book.

Daniel: You know the wrong frequentists. Frequentists do model checking. Empirically I’m fairly sure that they do more of it than the Bayesians (Andrew may have some good influence there).

If you get the same people you are referring to to do a Bayesian analysis instead, they will make the same hash of it.

“Suppose a Bayesian is told that a computer is outputting a sequence of random numbers from some univariate distribution with a mean and standard deviation… Are you saying this Bayesian isn’t logically warranted in collecting a large set of data, and then define say a Gaussian Mixture Model (GMM) with a large number of parameters, and then fit those parameters in such a way that they specify p(Data | Parameters) = GMM(Data | Parameters) because this GMM doesn’t represent a state of knowledge it represents a physical frequency enforced by computer program?”

I don’t know whether this was for me, but I’m not the one who is overly prescriptive here. Surely the Bayesian can do that but still it’s either fully aleatory or fully epistemic, and I’d like the Bayesian to tell me what it is.

ojm, if I have multiple mutually incompatible models that can be equally consistent with my observables, I can’t see how rejecting “P xor not-P” helps me; the issues seem orthogonal to me. What connection do you see?

I don’t recall Cox investigating vectors of propositions *per se*. I do recall him working on a logic of questions which did involve systems of collections of propositions; IIRC the aim was to provide a quantitative relationship between the informativeness of answers to various questions in the same way that Cox’s theorem gets at a quantitative relationship between the plausibilities of uncertain propositions. There’s a guy named Kevin Knuth who has picked up and continued that program.

I’ll have a look at Dubois’s stuff. I started the Colyvan paper and it’s making me grit my teeth — both Cox and Jaynes were very clear from the outset that they did not purport to quantify all uncertainty (like the definitional uncertainty Colyvan is going on about) but only uncertainty about propositions for which “P xor not-P” makes sense. But I’ll persevere… (A dude named Alain Drory pulled a similar trick about Jaynes’s analysis of Bertrand’s problem: he set up a straw man, misattributed it to Jaynes, and then knocked it down by arguing for the position Jaynes actually held.)

Once more, if you start from an exchangeability model, no observation whatsoever can get you out. It means that you don’t only treat the situation as symmetric before observations; it means that you commit yourself to treating it symmetrically forever, regardless of what the observations are.

I don’t think that such knowledge ever exists. I rather think that the exchangeability model is an idealised *model* of the knowledge ignoring some of its complicating subtleties, as is the frequentist model an idealised model of the data generating process that commits us to ignore some subtleties for the sake of simplicity.

Suppose a Bayesian is told that a computer is outputting a sequence of random numbers from some univariate distribution with a mean and standard deviation… Are you saying this Bayesian isn’t logically warranted in collecting a large set of data, and then define say a Gaussian Mixture Model (GMM) with a large number of parameters, and then fit those parameters in such a way that they specify p(Data | Parameters) = GMM(Data | Parameters) because this GMM doesn’t represent a state of knowledge it represents a physical frequency enforced by computer program?

That seems extremely odd to me.
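For what it’s worth, the GMM-fitting move described above can be sketched with a bare-bones 1D EM loop; the generating distribution and every setting here are invented for illustration:

```python
import numpy as np

# Minimal 1D two-component Gaussian mixture fit by EM, as a sketch of
# "learn p(Data | Parameters) = GMM(Data | Parameters) from a big sample".
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])

def em_gmm(x, iters=200):
    # Crude initialisation: weights equal, means at the data extremes.
    w, mu, var = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([1.0, 1.0])
    for _ in range(iters):
        # E-step: responsibilities of each component for each point.
        d = x[:, None] - mu[None, :]
        logp = -0.5 * d**2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(w)
        resp = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: reweighted mixture weights, means and variances.
        n = resp.sum(axis=0)
        w = n / n.sum()
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return w, mu, var

w, mu, var = em_gmm(data)
# With a big enough sample the fitted mixture tracks the output frequencies.
m = sorted(mu)
assert abs(m[0] - 0.0) < 0.3 and abs(m[1] - 5.0) < 0.3
```

Whether one reads the fitted GMM as a state of knowledge or as a learned physical frequency is exactly the point under dispute; the arithmetic is the same either way.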

My complaint is that the frequentist is willing to go the other route: before seeing any data they simply specify that the GMM is actually definitely a one-component mixture with unknown mean and standard deviation, and then proceed to “guarantee” that their interval estimates of the mean and standard deviation will be right 95% of the time that they collect data, even though they haven’t established in any way that this initial assumption of gaussianity is relevant (and the same thing happens with lognormals, gammas, betas, whatever: they choose a particular family based on a hunch, and maybe a failure to reject using a test, and then predict frequencies from it!)

What’s more, they tend to apply this in settings like 30 samples from human randomized controlled drug trials, where serious deviations from whatever your assumed shape is are essentially guaranteed even in the medium-short run (if you do this in 5 or 10 different problems, one of them will have a serious violation that you can’t detect because you only have 18 data points).

Is Probability the Only Coherent Approach to Uncertainty?

I have no problem with dealing with stable frequencies, I just want to concentrate the inevitable Bayesian uncertainty around a particular shape of the distribution before doing the frequency calculations.

The Bayesian can say “the frequency of outcomes is f(Data | Params)” and these are exchangeable among the observed data and all the future data of interest to me…. But this is a model assumption that the Bayesian doesn’t HAVE to make. The Bayesian can simply say “my probability over what would occur is p(Data | Params) for the particular set of data run in this particular subset of my current experiment” and the assumption means nothing about symmetry of the physics, it means only symmetry of the knowledge.

On the other hand, I see no way around assuming symmetry of the physics in Frequentist logic.

As I mentioned, Jaynes discusses “goodness of fit” tests. If this is not testing a model against data, what do you mean precisely?

Gelman and Shalizi in their “Philosophy and the practice of Bayesian statistics” list Jaynes as one of the writers who “emphasized the value of model checking and frequency evaluation as guidelines for Bayesian inference” and later single out Jaynes in particular: “A more direct influence on our thinking about these matters is the work of Jaynes (2003), who illustrated how we may learn the most when we find that our model does not fit the data – that is, when it is falsified – because then we have found a problem with our model’s assumptions.”

Are they talking about testing a model against data?

But that’s the thing: Jaynes models the state of knowledge of his robot, not the data generating process. Which means that the full model (I’m not talking about shifting probabilities within the full model around between submodels) cannot be tested against data, because it is simply not about where the data come from. Daniel, who surely has a good grasp of Jaynes-type Bayesianism, explains the same thing above: “If you collect an infinite amount of data on the Bayesian model, you can not falsify the model with observed frequencies, because the model doesn’t specify the frequencies it specifies the knowledge you had at the beginning about unobserved things.”

Also I think that you’d need to decide: Are your probabilities epistemic or aleatory? I don’t think it works to have an epistemic prior and an aleatory sampling model, because none of the approaches to the foundations of probability licenses you, in this case, to use both of these kinds of probabilities in the same calculus.

Why isn’t it desirable to assume P or not P in general? Well, identifiability for one: multiple mutually incompatible models can be equally consistent with your observables. There is a uniqueness issue. I saw a generalisation of Cox’s approach to constructive logic, in which P or not P also drops out. Note also in Cox’s book that when he extends to vectors of propositions he also obtains a weaker logic than Boolean logic. I feel like Jaynes didn’t read that far or something.

“It is a foregone conclusion before you collect any data that your frequency model is false.”

It’s a model. If you take it too literally, it’s false, yes. Same with Bayesian models of knowledge.

“Suppose for example that you are modeling stock market returns, and we are all actually living in a computer simulation. In the computer simulation daily returns are *actually* normal(0,1) * 0.9997 + cauchy(0,1) *0.9993 then there IS NO average and yet you find this out only after something like 10000 days and bankrupting an entire industry.”

…which has nothing to do with whether your model is Bayesian or frequentist; using a plain normal model in a setup that is prone to outliers will get the Bayesian as easily into trouble as the frequentist.

“Most of the problems with (1) come from bolting on things that fail to follow logic (such as p < 0.05 means round-off to 0.0) or failing to have a way to make assumptions other than ‘irreducible uncertainty with stable patterns’”

Nothing in the frequentist interpretation of probability enforces treating p<0.05 as 0.0, and frequentists can make all kinds of assumptions that the Bayesian can make, because every Bayesian model can be given a frequentist interpretation.

A situation can easily arise where you do care about predictions — your action space and loss function could be predictive after all — and then you’d best start caring about predictive calibration… There’s an example in BDA in which a log-normal model works very well for one estimand and disastrously for another due to failure to get the tail correct.

Elsewhere I’ve made the point that among all of the isomorphic plausibility systems allowed by Cox’s theorem, only one — probability — appears in the Law of Large Numbers for exchangeable random variables, and it’s this strong connection between probability and expected frequency that helps resolve the underdetermination of the Cox theorem result. I can’t have my cake and eat it too — and in any event, I’ve always been more concerned than you and Joseph about the lack of predictive calibration inherent in the truth-in-the-high-density-region-is-good-enough stance.

I think the likelihood represents a model of the world in the Bayesian setting as it does in the frequentist setting. I still don’t get your point. What kind of evidence regarding the model do alternative inference frameworks provide which is missing in the Bayesian framework?

p(X | K) = product(p(x[i] | K), i = 1..N)

This holds because K is a constant, rather than something that changes for each data point. As soon as you have a time-series structure, or a potential change-point model, or whatever, this symmetry property doesn’t hold and you don’t have exchangeability. For example you might do a gaussian process for a signal through time, where every data point is really just part of ONE function whose values you observe with error at various times or places. The fact is, you’re putting probability on *one* object, a vector of observations.
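The factorisation, and how it fails for a GP-style joint model, can be checked numerically; the kernel and numbers below are invented for illustration:

```python
import numpy as np

# Under IID/exchangeable structure the joint log-probability is the sum of
# identical per-observation terms; under a Gaussian-process-style model the
# probability sits on the whole vector and does not factorise.
def mvn_logpdf(x, cov):
    # Log-density of a zero-mean multivariate normal at vector x.
    n = len(x)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x @ np.linalg.solve(cov, x) + logdet + n * np.log(2 * np.pi))

t = np.linspace(0.0, 1.0, 5)
x = np.sin(2 * np.pi * t)                   # an observed vector

iid_cov = np.eye(5)                         # exchangeable IID case
gp_cov = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.1) + 1e-6 * np.eye(5)

# IID joint = sum of per-observation marginals ...
marginals = sum(mvn_logpdf(np.array([xi]), np.eye(1)) for xi in x)
assert abs(mvn_logpdf(x, iid_cov) - marginals) < 1e-9

# ... but the GP joint is not the sum of its (unit-variance) marginals:
# the probability is assigned to the whole vector at once.
assert abs(mvn_logpdf(x, gp_cov) - marginals) > 1e-3
```

The second assertion is the whole point: once the kernel correlates the observations, no product over individual data points reproduces the joint.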

Though in actual fact, most biochemistry examples I’ve done involve “these are all from one batch, and then this bunch are another batch, and Joe did some extra ones… and then we repeated our first batch with the reagents we re-ordered from a different supplier…”

So that in the end, exchangeability is only within small groups of data points.

Taken as an expression of symmetry of knowledge, the Bayesian IID exchangeability is un-complicated. Taken as an assertion about the world that *frequencies of outcomes* will be constant, it’s a highly objectionable assumption that is essentially always wrong on the face of it.
