(Apologies for the double posting. Andrew, if it’s not too much trouble you can delete the other copy of this comment – I had intended to reply to Dale but messed it up.)

]]>Is this a case of just not citing related literature, or are any of the technical results in the paper actually not novel? I suspect you mean the former but it would be good to know if it’s the latter.

FWIW, I’m also an economist, and I also get annoyed when I see members of the tribe failing to cite the statisticians who first came up with the ideas. But in the examples that I can think of, the econometricians come off well and it’s the economists who are lazy at citation. The classic example is what economists would call “White standard errors” … even though IIRC White cited Huber in his 1980 paper.

]]>I do suspect that if they’d carefully digest the work of my collaborators and myself, this could improve their understanding. But it’s more important to me that they get things right—or, close to right—than that they cite me. I’ve been frustrated for a long time with many economists’ naive views regarding identification, rigor, hypothesis testing, unbiasedness, etc., so if they come to discover type M and type S errors through the back door, that’s great. And if what’s necessary for them to believe it is that it be written in economists’ language, so be it.

]]>That would involve discerning the (marginal) likelihood of just what was observed – the reported results of significance tests.

So the resulting re-weighting of the prior distribution is larger for a failure to reject than for a rejection – interesting.

]]>Alberto Abadie (MIT) has just come out with a working paper that is (I think) similar to your Mars rover setup except for frequentist significance tests rather than frequentist CIs:

“[W]e formally adopt a limited information Bayes perspective. In this setting, agents representing journal readership or the scientific community have priors, P, over some parameters of interests, θ ∈ Θ. That is, a member p of P is a probability density function (with respect to some appropriate measure) on Θ. While agents are Bayesian, we will consider a setting where journals report frequentist results, in particular, statistical significance. Agents construct limited information Bayes posteriors based on the reported results of significance tests. We will deem a statistical result informative when it has the potential to substantially change the prior of the agents over a large range of values for θ.”

And from the conclusion:

“In this article, we have shown that rejection of a point null often carries very little information, while failure to reject is highly informative.”

Alberto Abadie, “Statistical Non-Significance in Empirical Economics”

Working Paper, March 2018

https://economics.mit.edu/files/14851
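A rough numerical sketch of the limited-information idea, not Abadie’s own derivation. Everything specific here is an assumption chosen for illustration: a Normal(0, 1) prior on θ, an estimate with standard error 0.1, a two-sided z-test at 1.96, a grid approximation, and total variation distance as the measure of how far the posterior moves from the prior.

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x, mu=0.0, sd=1.0):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def reject_prob(theta, se, crit=1.96):
    """P(|estimate / se| > crit) when estimate ~ Normal(theta, se)."""
    shift = theta / se
    return norm_cdf(-crit - shift) + 1.0 - norm_cdf(crit - shift)

def limited_info_posteriors(prior_sd=1.0, se=0.1, step=0.01, half_width=6.0):
    """Posteriors over a theta grid given only 'rejected' or 'not rejected'."""
    n = int(round(2 * half_width / step)) + 1
    grid = [-half_width + i * step for i in range(n)]
    prior = [norm_pdf(t, 0.0, prior_sd) for t in grid]
    total = sum(prior)
    prior = [p / total for p in prior]           # normalise on the grid
    p_rej = [reject_prob(t, se) for t in grid]   # likelihood of "reject"
    marg = sum(p * r for p, r in zip(prior, p_rej))
    post_reject = [p * r / marg for p, r in zip(prior, p_rej)]
    post_keep = [p * (1.0 - r) / (1.0 - marg) for p, r in zip(prior, p_rej)]
    return prior, post_reject, post_keep, marg

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

prior, post_reject, post_keep, marg_reject = limited_info_posteriors()
tv_reject = total_variation(prior, post_reject)
tv_keep = total_variation(prior, post_keep)
print(marg_reject, tv_reject, tv_keep)
```

Under these assumed settings rejection is the expected outcome, so learning “rejected” moves the prior less than learning “not rejected” does. Change the prior or the standard error and the comparison can flip.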

One thing though, it often doesn’t make sense to give probabilities to a high number of significant digits. When Andrew mentions this there’s sometimes push-back from some more thoughtful Bayesians.

However, from the view of Jaynes’s Ap chapter, the width of the Ap distribution places a bound on how many significant figures it makes sense to quote probabilities to (this is for logical probabilities, not just frequencies masquerading as “probabilities”. Obviously estimating real frequencies is error-prone for separate reasons).

]]>As far as the logical consistency thing, I don’t think you need any more than to avoid assigning zero probability to things which could be true. The result is going to be as “consistent” as the underlying structure. So if you’re extending Propositional logic, the result would be as consistent as Propositional logic is.

]]>Sometimes I feel like I’m having the following conversation:

Me: “abstractly, the probability calculus is THE tool for handling situations where everything isn’t known”

Someone else: “yeah, but it doesn’t always apply because in situation xyz, we don’t know enough to carry out the computations”

Me: “uh….I think I see a way out of this…”

]]>There are of course some difficulties in using propositional calculus with real-world applications. For example suppose we wish to figure out the truth table for

Q(A,B)

and A is a proposition like “Blarg(X) is a computation that halts and returns 0”

Nevertheless, I do think there’s plenty of opportunity for using Bayes in these scenarios, I just don’t know how far any kind of logical consistency guarantees really extend. The Gödel completeness theorem applies to base propositional calculus, where you assume the truth or falsity of “atomic” propositions is well decidable. If you start making propositions about Blarg(X), you’re only complete in the sense that *if you tell us whether A is true or false* then Q is definitely decidable.

]]>In particular, suppose probability is the ratio of favorable to total cases, p=F/T as before, and there’s uncertainty as to the correct values of F and T. Then you wind up “estimating” F/T, or taking expectation values over Jaynes’s Ap distribution. It’s like “estimating a probability”.

Many Bayesians who only partially got what Jaynes was saying claim it doesn’t make sense to estimate a (non-frequency) probability. But if you read Jaynes carefully, it sometimes does make sense.

]]>Truth is definable enough in propositional logic, which Jaynes’s probability theory generalizes.

In propositional logic given a set of atomic propositions a1, a2,…, then to determine the truth of a compound proposition Q(a1,a2,…) dependent on them, you merely construct the truth table for Q and cycle through every possible “valuation” or true/false combination for each atomic proposition. Given n atomic propositions there are 2^n possible valuations. Given enough time you can check all 2^n and determine if Q evaluates to True for all of them.
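A minimal sketch of that brute-force check. The names `is_tautology`, `implies` and `modus_ponens` are made up for illustration; a compound proposition Q is represented as a Python function of n booleans.

```python
from itertools import product

def is_tautology(q, n):
    """Check Q under all 2**n valuations of its n atomic propositions."""
    return all(q(*valuation) for valuation in product([False, True], repeat=n))

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

def modus_ponens(a, b):
    """(a and (a -> b)) -> b, which should hold under every valuation."""
    return implies(a and implies(a, b), b)

def just_or(a, b):
    """a or b, which fails when both are False."""
    return a or b

print(is_tautology(modus_ponens, 2), is_tautology(just_or, 2))
```

The loop visits 2^n rows, so it is the “given enough time” part that becomes the bottleneck as n grows.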

If this computation can’t be done or won’t be done, then it represents a source of uncertainty. One on a deeper level than we usually deal with in statistics, but uncertainty nevertheless. No matter what you think of it, or how you think about it, the bottom line is you can consistently handle this with the same equations (sum/product rules) as any other uncertainty.

By “consistent”, I mean something like, if further computations verifying Q are made later, thereby reducing this kind of uncertainty, you can “update” in a Bayesian style to get better results which don’t inherently contradict what was claimed before.

Nor is the situation fundamentally different if we switch to Predicate logic (first order logic) since it too is semantically complete, according to a theorem by Gödel, which is all that’s really needed for this.

]]>I’ll concede that this area seems suspect, and probably not well resolved. In many practical cases we probably do well taking Joseph’s approach from a utilitarian perspective.

If J will send me his current email to my well known one I would appreciate it ;-)

]]>You don’t need to give up Bayes, merely recognize some hidden assumptions. Consider Laplace’s definition of a probability, namely, it’s the ratio of the favorable cases to the total cases. Write this as p=F/T. This is not a frequency of occurrence, but merely counting of possibilities.

This definition is great and can serve as the basis for a Jaynes style foundation for statistics as an extension of propositional logic. But there is a hidden assumption that F and T are actually known. If you relax that assumption to consider cases when we only have partial information about F and T, and use the sum/product rules to manipulate that added uncertainty, you get what looks like a “probability of a probability”.

This actually works though. See Jaynes’s Chapter 18 on the “Ap distribution”, for example.

In the case where our information in principle fixes F/T but we can’t “effectively” compute F/T, then we can still assign probabilities to various F and T based on what we can effectively compute. As long as you assign some probability to every value which could be true, you won’t run into any logical difficulties. For example, if our information implies a “contradiction” so that F=0, then you’ll be OK as long as you assigned some probability to that possibility and didn’t set Pr(F=0) = 0.
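A toy illustration of the point about not setting Pr(F=0) = 0. This is not Jaynes’s Ap construction itself; the belief table over (F, T) pairs and the helper names are invented for the example.

```python
from fractions import Fraction

def expected_ratio(belief):
    """Expectation of F/T; belief maps (F, T) pairs to probabilities."""
    return sum(p * Fraction(f, t) for (f, t), p in belief.items())

def condition(belief, keep):
    """Update by keeping only the (F, T) pairs consistent with new information."""
    mass = sum(p for ft, p in belief.items() if keep(ft))
    if mass == 0:
        raise ValueError("the new information had been given zero probability")
    return {ft: p / mass for ft, p in belief.items() if keep(ft)}

# Hypothetical beliefs: T is known to be 6, F is uncertain.
belief = {
    (0, 6): Fraction(1, 10),
    (1, 6): Fraction(3, 10),
    (2, 6): Fraction(6, 10),
}
estimate = expected_ratio(belief)

# Later we work out that F = 0. Because F = 0 was not ruled out, the update works.
updated = condition(belief, lambda ft: ft[0] == 0)

# Had we set Pr(F=0) = 0 at the start, the same update would be impossible.
dogmatic = {(1, 6): Fraction(1, 3), (2, 6): Fraction(2, 3)}
try:
    condition(dogmatic, lambda ft: ft[0] == 0)
    update_failed = False
except ValueError:
    update_failed = True

print(estimate, expected_ratio(updated), update_failed)
```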

I suppose you could call this an extension of a very strict interpretation of Jaynes, but since you’re still using the same equations to manipulate uncertainty (just at a deeper level), it makes more sense to me to consider it still “Bayes”. I don’t think Jaynes would have been bothered by that since he wrote that Chapter 18 on the Ap distribution after all.

]]>I’m fine with saying this all seems a little scary and that we should proceed with caution, but I’m honestly not clear on how computing blargness with a machine and computing the percentage of iron atoms with a machine would be different. The second one seems to be pretty clearly the kind of thing we do with approximate data collection. The blargness thing isn’t obviously different though, particularly if, for example, you can output a sequence of intermediate results from the blargness computation that you could update your probability on the basis of.

Not trying to “gotcha” here or anything, I honestly think this is an interesting bit of philosophy of science and / or logic. And I suspect ojm would chime in here on something related to constructivist logic and the blargness hypothesis (is that like the best band name ever?).

]]>what is your Bayesian probability that x is blarg

Not sure yet, get back to me after the Big Freeze.

I’ll use probability for logical uncertainty provisionally because it seems to work and because research into a foundation of logical uncertainty seems to show that something like probability theory works there too. Cox’s theorem doesn’t cover it though.

]]>Suppose that it is possible with a machine to deconstruct a rock atom by atom and count the iron atoms but the machine takes about 1 second per atom. Logically it is true or false that the number is or is not blarg, and it is true or false that the rock has more or less than say 1/2 iron atoms. Both require “only” the pure computation of a result by a computing machine. How do they differ?

]]>Daniel, the example of the rover was not about getting a confidence interval for an undefined parameter but for a well defined quantity (iron content) and using a well defined procedure (collect 100 samples from the rock, and construct a confidence interval for the mean iron content using standard CI procedure C, and transmit the interval to us).

> I think after seeing the interval (which is now our data) and having no other information to condition on (eh…) we should assign a probability distribution for the mean iron content that has 95% probability mass over this interval.

I can agree with that, but if you had received the complete set of data and you didn’t have any other information to condition on you would also assign a probability distribution for the mean iron content that has 95% probability mass over that interval. If you have no other information to condition on, you use a flat prior and the confidence interval is a credible interval.

Depending on the details (and it seems to be the case for this location parameter example if the likelihood is just dependent on mu and sigma), the bounds of the confidence interval can be a sufficient statistic. If you want to do a Bayesian analysis, the information sent by the rover is enough in that case.

In general, if you have a prior for the parameter you have a prior for the probability of the confidence interval returned by the rover containing the true value. Your posterior probability for the interval containing the true value does not have to be 95%. If your prior probability was 100%, the posterior probability will be 100%. If it was 0% it will be 0%. It can be 95%, but I guess in most cases it will be somewhere between your prior probability and 95%. Even if it cannot be calculated explicitly if you don’t have a model, the 95% CI can be interpreted as evidence supporting an increase (or maybe decrease, if it was higher) of your prior probability for the interval containing the true value.
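A small sketch of both claims for the simplest case I can think of: a normal mean with known standard error and a conjugate normal prior. The function name and all the numbers are assumptions for illustration, not anything the rover example specifies.

```python
from statistics import NormalDist

def posterior_mass_in_ci(xbar, se, prior_mean, prior_sd, level=0.95):
    """Posterior probability that the mean lies inside the reported CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lo, hi = xbar - z * se, xbar + z * se        # the interval the rover sends
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)    # conjugate normal update
    post_mean = post_var * (prior_prec * prior_mean + data_prec * xbar)
    post = NormalDist(post_mean, post_var ** 0.5)
    return post.cdf(hi) - post.cdf(lo)

# Nearly flat prior: the confidence interval behaves as a 95% credible interval.
flat = posterior_mass_in_ci(xbar=5.0, se=1.0, prior_mean=0.0, prior_sd=1e6)

# Informative prior that disagrees with the data: well below 95%.
conflict = posterior_mass_in_ci(xbar=5.0, se=1.0, prior_mean=0.0, prior_sd=1.0)

# Informative prior that agrees with the data: above 95%.
agree = posterior_mass_in_ci(xbar=5.0, se=1.0, prior_mean=5.0, prior_sd=1.0)

print(flat, conflict, agree)
```

In this model the interval endpoints determine xbar and se, so conditioning on the interval is the same as conditioning on the data.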

]]>Either I’m not following your point, or you’re not following mine. Perhaps this will help illustrate: are you comfortable saying that since you don’t want to sit down and do the math/coding required, you’re fine with saying your personal probability that 12909809723450982345 is divisible by 7 is 1/7?

I’m fine with saying that. Now, even after I type in “mod(12909809723450982345, 7)”, I’m still fine with saying “Conditioning on what I just saw spit out by my computer, my personal probability that 129…45 is divisible by 7 is 0 (or 1). But before I saw that, my personal probability was 1/7”.
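For what it’s worth, the before/after can be written down in a few lines. This is only a sketch; the function names are invented, and the 1/7 assignment assumes indifference over the seven possible remainders.

```python
from fractions import Fraction

N = 12909809723450982345  # the number from the comment above

def prior_divisible(d):
    """Before computing anything: indifference over the d remainders gives 1/d."""
    return Fraction(1, d)

def posterior_divisible(n, d):
    """After actually computing n mod d, the probability is 0 or 1."""
    return Fraction(1) if n % d == 0 else Fraction(0)

# Sanity check on the 1/7: one in every seven consecutive integers is a multiple of 7.
multiples = sum(1 for k in range(1, 7001) if k % 7 == 0)

before = prior_divisible(7)
after = posterior_divisible(N, 7)
print(before, after)
```

Both numbers are legitimate probabilities; they just condition on different information.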

I don’t think this is just being annoying. I think it’s crucial to the interpretation that a Bayesian posterior is an update of *a* prior, and there’s lots of different priors, some better than others.

]]>On the other hand, if there are values in the credible interval that seemed very unlikely to you a priori, you might need to reconsider if these values are really so unlikely.

Realistically, you should first reconsider if you had a reasonable prior/likelihood function.

]]>“If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what is the interval. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.”

In Daniel’s “Mars rover” example, you can’t condition on the data because you don’t have the data – all the rover sent was the 95% interval it calculated. (I really do like this example!) And if you have the interval and nothing more, and the CI procedure assumptions are met and it’s a “standard problem” (no empty CIs possible etc.), then the claim is that you can assign a probability of 95% that the parameter is in the interval. Or am I misunderstanding your point here?

]]>Or alternatively, suppose someone gives you a number, and they say it’s from an RNG that gives numbers that are FLORG 50% of the time. You have no idea what FLORG means, but it’s a well defined thing. You have the number. Conditional on your information, you can only say it has 50% probability of being FLORG.

Your point is essentially amplifying what I already said, which is that conditioning ONLY on the knowledge that an interval came from a particular RNG / CI procedure is usually the wrong thing to do. But it’s a good amplification because it shows how background information is important at even the most basic level. We have LOTS of background information on every real world problem.

]]>For example, Carlos’s example with even/odd numbers. Conditional on only knowing that you have a function that returns an even number 50% of the time, if you *just* condition on this fact, then any number you get, you can say “conditional only on the procedure, there is a 50% chance this number is even”. So if the function returns a 42, if you only condition on the procedure and not your expert information about even and odd numbers, you say “conditional only on what I know about this function and nothing else, there is a 50% probability that 42, the number returned by this function, is even”. Of course, you can also say “conditional on what I know about this function and what I learned in kindergarten, there’s 100% probability that 42 is even”.

To demonstrate further, suppose I use a discrete uniform rng and get output 8912437587987614581234095. If I don’t use a computer nor care to waste any time doing long division, I’m perfectly happy with saying “Given what I know about the rng + my mathematics background, the probability that the number above is divisible by 7 is 1/7”. After I check on my computer, I’m happy to update my posterior to 0 or 1, but I recognize that this is conditional on me having checked.

]]>The way around it, I think, is to say “for standard problems only” (like with the simple bet-proof case) which means this can’t happen and the conditional probability claim will be ok. I guess you can say this is included on the right hand side of the vertical bar. But it’s different in that it’s something you know about the method rather than something you know about the parameter you’re estimating.

]]>Let’s say a rng gives you a number which is even with 50% probability. If that’s *all* you know (in particular, you do not know the number) it may be justified to assign 50% probability to the event that the number is even. On the other hand, if you know that the number generated is 42 it’s not justified to say that the probability that it is even is 50%.

If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what is the interval. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.

]]>If all you know is an rng gave you a random output with certain frequency behavior it’s justified to assign the probability to the event whose frequency is known… Like rolling a well made well rolled die, or calling a well tested rng function.

]]>Thanks! Very clear. And maybe the “unfortunately” isn’t warranted, at least for me. I like the Mars rover example and the conclusion that “p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95” because it looks like it could be useful in a teaching context.

My problem all along is that teaching students how to calculate CIs and at the same time telling them “don’t try to interpret realized CIs, wrong, can’t do that” doesn’t work too well.

The Mars rover example – the frequentist robot hands its result to a Bayesian human, who interprets it (is that fair?) – looks like something that (a) students will understand and remember, and (b) is actually correct.

Maybe it needs a footnote so that “CI procedure assumptions are met” includes some extra assumptions (possibly the same ones that the “bet-proof” interpretation needs, i.e., it’s a “standard problem”)? Otherwise you could have a CI procedure with 95% coverage that sometimes generates intervals that are empty or the entire real line. But that’s OK. “Standard problem” includes almost everything that we teach at this level.

]]>Unfortunately I think the only thing we came to is that p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95

This isn’t enough information to give a posterior distribution over the parameter Theta; it just constrains a particular integral of that posterior. We can say that provided the CI procedure’s assumptions are met, we should assign 95% mass to the interval, but we should assign 5% mass to “outside the interval” and we don’t have a general way to make a useful probability distribution from those two pieces of info when the parameter space is unbounded.

Any information we add which would allow us to make a proper probability distribution would be added information, and the combination of this added information, and the CI procedure/interval would potentially alter the probability being assigned to the interval.

]]>Isn’t that a mixing of two interpretations of probability, though? Or are you defining “confidence interval” outside of the technical definition in the frequentist interpretation of probability?

I would say that a confidence interval only expresses uncertainty inasmuch as it agrees with a Bayesian credible interval, and then you have to say which Bayesian credible interval you mean.

]]>You end by talking about coffee drinkers in a population who drink milk. Instead of coffee drinkers, I use the example of people who are bilingual in my pre-print, and discuss the issues of uniform base-rate priors, non-uniform and non-base rate priors and posterior probabilities carefully and in some detail.

]]>The mathematics of probability is the same whether they are thought of as proportions or degrees of plausibility. This is true. However using this fact to invent some sort of proportion story around a Bayesian analysis has been the single biggest source of confusion around interpretation of Bayesian analysis, so I’m generally not in favor of that. It becomes even more confusing when we think about Bayesian analysis of proportions.

Suppose we want to estimate the Bayesian probability under some model that the proportion of coffee drinkers who add milk is less than 30%…

You could imagine the set of coffee drinkers. Then you could imagine sampling from them uniformly using an rng. We now have a frequency probability that the sample will contain less than 30% milk takers. And this is conceptually totally different from the Bayesian probability that the full set of coffee drinkers has less than 30% of its population milk takers. There is not a need for any confusing “two kinds” of priors here.

]]>I explain this in the first few pages of Chapter 13 of the 3rd edition of the Oxford Handbook of Clinical Diagnosis (see http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13 ). I am aware that Bayesians claim that probabilities are degrees of belief and not observed frequencies but the point I make is that probabilities obey the same rules as proportions even if they are imaginary proportions. I am in the process of rewriting this chapter for the 4th edition and sent an extract in my entry to this blog at 3.08pm on 3 January.

I am not disputing the way Bayesians use ‘Bayesian priors’ but am simply putting these Bayesian priors in the wider context of probability theory. The way that I explain the basics of probability theory makes it consistent with the way that my medical colleagues and I use the concepts verbally during discussions with each other, patients, etc.

In my recent paper preprint (https://arxiv.org/ftp/arxiv/papers/1710/1710.07284.pdf), I show that during random sampling to estimate the value of a fixed parameter, the underlying ‘unconditional prior’ (AKA ‘base rate prior’ AKA ‘prior probability conditional on a universal set’) is uniform even though the Bayesian prior probability conditional on other unspecified evidence is not uniform. This also allows us to calculate a frequentist posterior probability distribution based on data alone that can be combined with a Bayesian probability distribution. An advantage of ‘looking behind’ Bayesian probability distributions at the underlying uniform ‘base rate’ priors is that it allows frequentist and Bayesian concepts to be combined.

I hope that this clarifies my reasoning.

]]>I know you answer seriously and genuinely attempt to understand others’ perspectives in your responses on this blog, which is why I am confused by your response here: it seems you are determined to misunderstand Huw’s labels, which seem obvious. Huw is using an epidemiological example of the incidence of an event in a population, and using the term base rate to refer to that incidence. The notion is that a screening tool is only useful if it can provide information that improves the identification of someone with the condition above random selection, for which the base rate would be the estimated population probability.

I don’t understand what is confusing about that. If you are saying that this term is specific to diagnosis and selection problems and not to problems of physical distance, then sure I suppose I understand your point, but it is simply another term for the probability of incidence in a population which can be used to inform a model.

Let’s say we created a binary logistic model using a beta prior and estimated theta for the LRLQ. How would you assess whether this model combined with this screening tool provides utility to the diagnostician?

]]>Next, philosophically any quantity can have a Bayesian probability distribution assigned to it. In particular, it can have any dimensions, for example something like length^3/time/temperature. So if you have a historical record of that quantity and use it to assign a location parameter to the distribution over it, in what sense is that a “base rate”? I’m truly completely lacking an answer.

It seems Huw has some ideas in mind that don’t align.

]]>How does a base rate differ philosophically from a point estimate from previous research used to specify the mean of a prior?

]]>The usual characterization of Bayes is that it calculates a degree of plausibility of an assertion about a true or false claim. The next more subtle thing is something I’m working on where it’s more of a degree of accordance with both theory and data. This enables you to have a meaningful discussion about Bayesian models for things where the model isn’t “perfect” and so “truth” is not well defined. I’ve got a half written paper on that.

But in all of these philosophical discussions one thing has been true, there is never anything called the “base rate” which meaningfully enters into the philosophy. A “Base rate” might be one piece of information you would use to assign a degree of plausibility or accordance or whatever, but it doesn’t hold any fundamental position in the philosophy. Given this, your description reads to me like someone coming from some existing well developed background very different from “ours” here at the blog, and having a lot of specific ideas couched in that framework, but we don’t recognize that framework and so we’re all talking past each other.

Perhaps it’s just a terminology issue, but as Corey says, it mostly seems very idiosyncratic.

]]>“Bayesian statisticians emphasise the importance of specifying an ‘informal prior probability’ based on informal evidence that is then combined with substantiated probabilities (i.e. based on observations that others can share). For example, a Bayesian might suggest on the basis of such ‘informal evidence’ that the ‘prior probability’ of finding someone with appendicitis in a study is 0.6. When this ‘informal evidence’ is combined with another finding (e.g. LRLQ pain) a new posterior probability is created. This posterior probability then becomes the new prior probability if the evidence so far is combined with yet another finding (e.g. guarding).

It should be emphasised at this stage that there are two types of prior probability (1) the ‘base rate prior’ based on the universal set and (2) the non-base rate prior based on the universal set and one or more of its subset(s). The base rate prior proportion and probability for appendicitis is 100/400, if the universal set is a group of 400 patients studied to which patients with all the other findings belong (i.e. those with appendicitis, no appendicitis, LRLQ pain, no LRLQ pain, guarding, the ‘informal evidence’, etc.). The patients showing the ‘informal evidence’ used for the Bayesian prior cannot be assumed to be a ‘universal set’ of which those patients with LRLQ pain, guarding, appendicitis and NSAP were subsets. We have to assume therefore that those with the ‘informal evidence’ could be a subset of the 400 studied, giving rise to a non-base rate prior of 0.6.

The ‘non-base rate’ prior probability of 0.6 can be used to calculate a ‘posterior probability’ of appendicitis (Appx) by combining the ‘informal evidence’ (IE) and LRLQ pain:

1/{1+[Pr(No Appx|IE)/Pr(Appx|IE)] [Pr(LRLQ pain|No Appx)/Pr(LRLQ pain|Appx)]} = 1/{1+[(1-0.6)/0.6] [(125/300)/(75/100)]} = 0.73

The above calculation implies that there is statistical independence between the frequency of occurrence of the ‘informal evidence’ (IE) and LRLQ pain in those with appendicitis, and in those without appendicitis. For example, if the proportion of patients with the ‘informal evidence’ in those with appendicitis was 9/100 and its frequency in those without appendicitis had been 6/300, then the assumption of statistical independence means that the proportion with the informal evidence and LRLQ pain in those with appendicitis would be assumed to be 9/100 × 75/100 = 6.75/100. Similarly the proportion with the informal evidence and LRLQ pain (i.e. ‘IE & LRLQ pain’) in those without appendicitis would be assumed to be 6/300 × 125/300 = 2.5/300. We can now calculate the estimated proportion with appendicitis by using the base-rate prior proportion of 100/400 for appendicitis in the group studied (SG). Again, it is 0.73:

1/{1+[Pr(No Appx|SG)/Pr(Appx|SG)] [Pr(IE & LRLQ pain|No Appx)/Pr(IE & LRLQ pain|Appx)]} = 1/{1+[(300/400)/(100/400)] [(2.5/300)/(6.75/100)]} = 0.73
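The two calculations can be checked with exact fractions. This is only a verification sketch; the function name `posterior` is made up, and the numbers are the ones quoted in the extract.

```python
from fractions import Fraction

def posterior(prior, lik_no_appx, lik_appx):
    """1 / (1 + prior odds against appendicitis * likelihood ratio against)."""
    odds_against = (1 - prior) / prior
    return 1 / (1 + odds_against * (lik_no_appx / lik_appx))

# Non-base rate prior of 0.6 (from the informal evidence), updated on LRLQ pain.
non_base = posterior(
    Fraction(6, 10),
    lik_no_appx=Fraction(125, 300),
    lik_appx=Fraction(75, 100),
)

# Base rate prior of 100/400, updated on 'IE & LRLQ pain' under the
# independence assumption: the joint proportions are products.
base = posterior(
    Fraction(100, 400),
    lik_no_appx=Fraction(6, 300) * Fraction(125, 300),
    lik_appx=Fraction(9, 100) * Fraction(75, 100),
)

print(non_base, base, round(float(base), 2))
```

Both routes give the same exact fraction. That is expected, because 9 patients with the informal evidence among the 100 with appendicitis and 6 among the 300 without gives 9/(9+6) = 0.6, the non-base rate prior.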