Comments on: Stupid-ass statisticians don’t know what a goddam confidence interval is

By: Mark Schaffer

Mark Schaffer — Tue, 13 Mar 2018 21:04:25 +0000

Ah, then that’s not so bad. It’s a very short working paper and this is presumably an early version, the focus is on the technical results, there are some papers by statisticians that show up in the references but not (yet?) in the main text … maybe the lit review will get updated before it gets published. :)

(Apologies for the double posting. Andrew, if it’s not too much trouble you can delete the other copy of this comment – I had intended to reply to Dale but messed it up.)

By: Mark Schaffer

Mark Schaffer — Tue, 13 Mar 2018 21:01:41 +0000

By: Dale Lehman

Dale Lehman — Tue, 13 Mar 2018 20:49:19 +0000

In reply to Mark Schaffer.

It is the former case (I haven't read it carefully enough to say whether the technical results are novel or not). I also anticipated Andrew's response. In terms of making progress, I also would not care whether people discover what others have discovered, as long as the movement is in the right direction. However, given the recent discussion about how to get published, I think this is an example of what bothers me about the advice to "not read too much" as a good formula for increasing publications.

By: Mark Schaffer

Mark Schaffer — Tue, 13 Mar 2018 20:34:00 +0000

In reply to Dale Lehman.

Dale,

Is this a case of just not citing related literature, or are any of the technical results in the paper actually not novel? I suspect you mean the former but it would be good to know if it’s the latter.

FWIW, I’m also an economist, and I also get annoyed when I see members of the tribe failing to cite the statisticians who first came up with the ideas. But in the examples that I can think of, the econometricians come off well and it’s the economists who are lazy at citation. The classic example is what economists would call “White standard errors” … even though IIRC White cited Huber in his 1980 paper.

By: Andrew

Andrew — Tue, 13 Mar 2018 19:51:47 +0000

In reply to Dale Lehman.

Dale:

I do suspect that carefully digesting the work of my collaborators and myself could improve their understanding. But it’s more important to me that they get things right, or close to right, than that they cite me. I’ve been frustrated for a long time with many economists’ naive views regarding identification, rigor, hypothesis testing, unbiasedness, etc., so if they come to discover type M and type S errors through the back door, that’s great. And if what’s necessary for them to believe it is that it be written in economists’ language, so be it.

By: Dale Lehman

Dale Lehman — Tue, 13 Mar 2018 19:10:20 +0000

In reply to Mark Schaffer.

I just moments ago came across the same paper (on Marginal Revolution – and left a comment there). What is unbelievable is that there is no reference to Andrew’s work. This relates to an earlier discussion on this blog. I think it is an example of how economists get things published – don’t read too much – then pretend you discovered something new.

By: Keith O'Rourke

Keith O'Rourke — Tue, 13 Mar 2018 17:40:21 +0000

In reply to Mark Schaffer.

> limited information Bayes posteriors based on the reported results of significance tests
That would involve discerning the (marginal) likelihood of just what was observed – the reported results of significance tests.

So the resulting re-weighting of the prior distribution is greater when the result is a failure to reject than when it is a rejection – interesting.

By: Mark Schaffer

Mark Schaffer — Tue, 13 Mar 2018 17:00:35 +0000

Daniel et al.,

Alberto Abadie (MIT) has just come out with a working paper that is (I think) similar to your Mars rover setup, except with frequentist significance tests rather than frequentist CIs:

“[W]e formally adopt a limited information Bayes perspective. In this setting, agents representing journal readership or the scientific community have priors, P, over some parameters of interests, θ ∈ Θ. That is, a member p of P is a probability density function (with respect to some appropriate measure) on Θ. While agents are Bayesian, we will consider a setting where journals report frequentist results, in particular, statistical significance. Agents construct limited information Bayes posteriors based on the reported results of significance tests. We will deem a statistical result informative when it has the potential to substantially change the prior of the agents over a large range of values for θ.”

And from the conclusion:

“In this article, we have shown that rejection of a point null often carries very little information, while failure to reject is highly informative.”

Alberto Abadie, “Statistical Non-Significance in Empirical Economics”
Working Paper, March 2018
https://economics.mit.edu/files/14851
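A rough numerical sketch of the "limited information" updating described in the quote, for readers who want to see the asymmetry. This is not the paper's model: the normal prior, the two-sided z-test, and the values `prior_sd=1.0` and `se=0.1` are all illustrative assumptions. The agent sees only "reject" or "fail to reject" and reweights the prior by the probability of that outcome at each value of θ.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reject_prob(theta, se, crit=1.96):
    """P(|estimate / se| > crit) when estimate ~ Normal(theta, se^2)."""
    shift = theta / se
    return 1.0 - (normal_cdf(crit - shift) - normal_cdf(-crit - shift))

def limited_info_posteriors(prior_sd=1.0, se=0.1, grid_n=4001, span=6.0):
    """Grid posteriors for theta given only the test outcome (assumed setup)."""
    thetas = [(-span + 2.0 * span * i / (grid_n - 1)) * prior_sd
              for i in range(grid_n)]
    prior = [math.exp(-0.5 * (t / prior_sd) ** 2) for t in thetas]
    total = sum(prior)
    prior = [p / total for p in prior]
    power = [reject_prob(t, se) for t in thetas]
    # Marginal probability that the journal reports "significant".
    p_reject = sum(p * w for p, w in zip(prior, power))
    post_reject = [p * w / p_reject for p, w in zip(prior, power)]
    post_keep = [p * (1.0 - w) / (1.0 - p_reject) for p, w in zip(prior, power)]
    return thetas, prior, post_reject, post_keep, p_reject

def weighted_sd(thetas, weights):
    """Standard deviation of a discrete distribution on the grid."""
    mean = sum(t * w for t, w in zip(thetas, weights))
    return math.sqrt(sum(w * (t - mean) ** 2 for t, w in zip(thetas, weights)))

thetas, prior, post_reject, post_keep, p_reject = limited_info_posteriors()
print("P(reject) under the prior:", round(p_reject, 3))
print("prior sd:", round(weighted_sd(thetas, prior), 3))
print("sd given reject:", round(weighted_sd(thetas, post_reject), 3))
print("sd given fail to reject:", round(weighted_sd(thetas, post_keep), 3))
```

With a precise test (small `se` relative to the prior spread), rejection is what the agent expected to see almost regardless of θ, so the posterior barely moves, while a failure to reject concentrates the posterior near zero.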

By: Daniel Lakeland

Daniel Lakeland — Sun, 07 Jan 2018 21:25:52 +0000

In reply to Corey.

In some sense my example shows how badly we need something like Bayes for logical uncertainty. The halting problem is in general undecidable, so no amount of work will in general help us eliminate the uncertainty of certain logical statements. Nevertheless, I agree with you that it seems useful to have "meta Bayes" and it will probably work right much of the time. I'd need to see some kind of more formal proof of something to really understand what the limits are. I really do need to reread that Ap distribution stuff. I'm pretty sure I didn't get it the first time I tried a few years back.

By: Anonymous

Anonymous — Sun, 07 Jan 2018 21:15:49 +0000

In reply to Corey.

Jaynes’s Ap chapter seems perfectly designed for being ignored by applied statisticians. But it’s applicable and once you see the point, it’s a very natural and convenient approach to many problems. It would be worth writing up a bunch of examples, but I have a feeling it would go over the heads of the denizens of the stat community.

One thing though, it often doesn’t make sense to give probabilities to a high number of significant digits. When Andrew mentions this there’s sometimes push-back from some more thoughtful Bayesians.

However, from the view of Jaynes’s Ap chapter, the width of the Ap distribution places a bound on how many significant figures it makes sense to quote probabilities to (this is for logical probabilities, not just frequencies masquerading as “probabilities”. Obviously estimating real frequencies is error prone for separate reasons).

By: Anonymous

Anonymous — Sun, 07 Jan 2018 21:01:57 +0000

In reply to Corey.

Daniel,

As far as the logical consistency thing, I don’t think you need any more than to avoid assigning zero probability to things which could be true. The result is going to be as “consistent” as the underlying structure. So if you’re extending Propositional logic, the result would be as consistent as Propositional logic is.

By: Anonymous

Anonymous — Sun, 07 Jan 2018 20:58:24 +0000

In reply to Corey.

Daniel,

Sometimes I feel like I’m having the following conversation:

Me: “abstractly, the probability calculus is THE tool for handling situations where everything isn’t known”

Someone else: “yeah, but it doesn’t always apply because in situation xyz, we don’t know enough to carry out the computations”

Me: “uh….I think I see a way out of this…”

By: Daniel Lakeland

Daniel Lakeland — Sun, 07 Jan 2018 19:38:29 +0000

In reply to Corey.

Anon / J,

There are of course some difficulties in using propositional calculus with real-world applications. For example suppose we wish to figure out the truth table for

Q(A,B)

and A is a proposition like “Blarg(X) is a computation that halts and returns 0”

Nevertheless, I do think there’s plenty of opportunity for using Bayes in these scenarios; I just don’t know how far any kind of logical consistency guarantee really extends. The Gödel completeness theorem applies to base propositional calculus, where you assume the truth or falsity of “atomic” propositions is well decidable. If you start making propositions about Blarg(X), you’re only complete in the sense that *if you tell us whether A is true or false* then Q is definitely decidable.

By: Corey Yanofsky

Corey Yanofsky — Sun, 07 Jan 2018 16:55:12 +0000

In reply to Corey.

Okay, you've convinced me.

By: Anonymous

Anonymous — Sun, 07 Jan 2018 08:57:37 +0000

In reply to Corey.

I should add, this viewpoint has a lot more applications than it might look on the surface, and so does Jaynes’s Chapter 18 on the Ap distribution for that matter.

In particular, suppose probability is the ratio of favorable to total cases, p=F/T as before, and there’s uncertainty as to the correct values of F and T. Then you wind up “estimating” F/T, or taking expectation values over Jaynes’s Ap distribution. It’s like “estimating a probability”.

Many Bayesians who only partially got what Jaynes was saying, claim it doesn’t make sense to estimate a (non-frequency) probability. But if you read Jaynes carefully, it sometimes does make sense.

By: Anonymous

Anonymous — Sun, 07 Jan 2018 08:46:56 +0000

In reply to Corey.

Corey,

Truth is definable enough in propositional logic, which Jaynes’s probability theory generalizes.

In propositional logic given a set of atomic propositions a1, a2,…, then to determine the truth of a compound proposition Q(a1,a2,…) dependent on them, you merely construct the truth table for Q and cycle through every possible “valuation” or true/false combination for each atomic proposition. Given n atomic propositions there are 2^n possible valuations. Given enough time you can check all 2^n and determine if Q evaluates to True for all of them.
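The truth-table check described above can be sketched in a few lines. The function names and the two example propositions are illustrative choices, not anything from the thread; the loop simply cycles through all 2^n valuations.

```python
from itertools import product

def is_tautology(q, n):
    """True if q evaluates to True under every one of the 2**n valuations."""
    return all(q(*valuation) for valuation in product([False, True], repeat=n))

def fraction_true(q, n):
    """Share of the 2**n valuations under which q is True."""
    valuations = list(product([False, True], repeat=n))
    return sum(bool(q(*v)) for v in valuations) / len(valuations)

def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b

def modus_ponens(a, b):
    """((a -> b) and a) -> b, which holds under every valuation."""
    return implies(implies(a, b) and a, b)

print(is_tautology(modus_ponens, 2))  # every row of the table is True
print(is_tautology(implies, 2))       # fails for a=True, b=False
print(fraction_true(implies, 2))      # 3 of the 4 rows are True
```

The cost is the point: the loop visits 2^n rows, so for large n the check may simply never be run, which is the source of uncertainty discussed next.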

If this computation can’t be done or won’t be done, then it represents a source of uncertainty. One on a deeper level than we usually deal with in statistics, but uncertainty nevertheless. No matter what you think of it, or how you think about it, the bottom line is you can consistently handle this with the same equations (sum/product rules) as any other uncertainty.

By “consistent”, I mean something like, if further computations verifying Q are made later, thereby reducing this kind of uncertainty, you can “update” in a Bayesian style to get better results which don’t inherently contradict what was claimed before.

Nor is the situation fundamentally different if we switch to Predicate logic (first order logic) since it too is semantically complete, according to a theorem by Gödel, which is all that’s really needed for this.

By: Daniel Lakeland

Daniel Lakeland — Sun, 07 Jan 2018 03:32:40 +0000

In reply to Corey.

Corey, my naive understanding of intuitionist/constructivist logic is that the answer to “is x blarg” simply doesn’t exist until you’ve computed it. Or maybe until you’ve exhibited a computer program that would compute it… there’s obviously a difference.

I’ll concede that this area seems suspect, and probably not well resolved. In many practical cases we probably do well taking Joseph’s approach from a utilitarian perspective.

If J will send me his current email to my well known one I would appreciate it ;-)

By: Corey Yanofsky

Corey Yanofsky — Sun, 07 Jan 2018 02:55:41 +0000

In reply to Corey.

Hey Big J, like naïve set theory, that approach will appear to work in limited domains but will run into problems with the undefinability of truth in fairly short order. http://intelligence.org/files/DefinabilityTruthDraft.pdf

By: Corey Yanofsky

Corey Yanofsky — Sun, 07 Jan 2018 02:42:37 +0000

In reply to Corey.

The difference between the blargness of the number of iron atoms and the question of whether more than 50% of the atoms are iron is that the blargness of any particular number is a logical consequence of my prior information (which I’m assuming here has imported enough axiom schemata and whatnot from first-order logic that we can actually reason about numbers and blargness) and the proportion of iron is not. As part and parcel of the fact that we aim to extend propositional logic, Cox’s theorem (and Van Horn’s uniqueness theorem even more explicitly) takes as a premise that all logical implications of the prior information get the same plausibility value as a tautology. If your system of probability doesn’t do that, it’s not an extension of propositional logic — which is fine, because we need to go beyond propositional logic to account for bounded computational resources. I’ll be satisfied when a system of probability exists that *actually describes how* to go about updating logical probabilities on the basis of a sequence of intermediate results from some ongoing computation. The closest thing I’ve seen to such a system is behind the link I gave “a reader”.

By: Anonymous

Anonymous — Sun, 07 Jan 2018 02:21:41 +0000

In reply to Corey.

Corey,

You don’t need to give up Bayes, merely recognize some hidden assumptions. Consider Laplace’s definition of a probability, namely, it’s the ratio of the favorable cases to the total cases. Write this as p=F/T. This is not a frequency of occurrence, but merely counting of possibilities.

This definition is great and can serve as the basis for a Jaynes style foundation for statistics as an extension of propositional logic. But there is a hidden assumption that F and T are actually known. If you relax that assumption to consider cases when we only have partial information about F and T, and use the sum/product rules to manipulate that added uncertainty, you get what looks like a “probability of a probability”.

This actually works though. See Jaynes’s Chapter 18 on the “Ap distribution”, for example.

In the case where our information in principle fixes F/T but we can’t “effectively” compute F/T, then we can still assign probabilities to various F and T based on what we can effectively compute. As long as you assign some probability to every value which could be true, you won’t run into any logical difficulties. For example, if our information implies a “contradiction” so that F=0, then you’ll be Ok as long as you assigned some probability to that possibility and didn’t set Pr(F=0) =0.
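A toy discrete version of this idea, offered as a sketch rather than as Jaynes's Ap formalism itself: T is taken as known, F is uncertain, and the weights over candidate values of F are made-up numbers. The quoted probability is the expectation of F/T over those weights, and later partial computation is handled by ordinary conditioning.

```python
from fractions import Fraction

def expected_probability(weights, total_cases):
    """Expectation of F/T over plausibility weights on candidate F values.

    weights maps each candidate F to a non-negative integer weight
    (not necessarily normalized); total_cases is the known T.
    """
    norm = sum(weights.values())
    return sum(Fraction(w, norm) * Fraction(f, total_cases)
               for f, w in weights.items())

def condition(weights, still_possible):
    """Drop candidate F values ruled out by a later computation."""
    return {f: w for f, w in weights.items() if still_possible(f)}

# Hypothetical state of knowledge: T = 10, F unknown. F = 0 keeps a
# nonzero weight, so a later "contradiction" would not break the update.
weights = {0: 1, 3: 4, 4: 4, 5: 1}
print(expected_probability(weights, 10))   # 33/100

# Suppose further computation shows F is at least 4.
updated = condition(weights, lambda f: f >= 4)
print(expected_probability(updated, 10))   # 21/50
```

The update only removes candidates that were already given some weight, so the later answer refines the earlier one instead of contradicting it.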

I suppose you could call this an extension of a very strict interpretation of Jaynes, but since you’re still using the same equations to manipulate uncertainty (just at a deeper level), it makes more sense to me to consider it still “Bayes”. I don’t think Jaynes would have been bothered by that since he wrote that Chapter 18 on the Ap distribution after all.

By: Daniel Lakeland

Daniel Lakeland — Sun, 07 Jan 2018 01:59:51 +0000

In reply to Corey.

How is computing blargness different from counting the percentage of iron atoms though?

I’m fine with saying this all seems a little scary and proceed with caution, but I’m honestly not clear on how computing blargness with a machine and computing percentage of iron atoms with a machine would be different, the second one seems to be pretty clearly the kind of thing we do with approximate data collection. The blargness thing isn’t obviously different though. Particularly for example if you can output a sequence of intermediate results from the blargness computation that you could update your probability on the basis of.

Not trying to “gotcha” here or anything, I honestly think this is an interesting bit of philosophy of science and / or logic. And I suspect ojm would chime in here on something related to constructivist logic and the blargness hypothesis (is that like the best band name ever?).

By: Corey Yanofsky

Corey Yanofsky — Sun, 07 Jan 2018 01:28:32 +0000

In reply to Corey.

> what is your Bayesian probability that x is blarg

Not sure yet, get back to me after the Big Freeze. I'll use probability for logical uncertainty provisionally because it seems to work and because research into a foundation of logical uncertainty seems to show that something like probability theory works there too. Cox's theorem doesn't cover it though.

By: Daniel Lakeland

Daniel Lakeland — Sun, 07 Jan 2018 00:40:06 +0000

In reply to Corey.

Corey, suppose that blarg is a well-defined true or false property of numbers but requires exponential computing power in the size of the number. Blarg(100) takes one year of computing. Further suppose that a mathematical proof exists that a random number generator produces blarg numbers with probability 1/2. The RNG outputs x=8912437587987614581234095. What is your Bayesian probability that x is blarg?

Suppose that it is possible with a machine to deconstruct a rock atom by atom and count the iron atoms but the machine takes about 1 second per atom. Logically it is true or false that the number is or is not blarg, and it is true or false that the rock has more or less than say 1/2 iron atoms. Both require "only" the pure computation of a result by a computing machine. How do they differ?

By: Carlos Ungil

Carlos Ungil — Sat, 06 Jan 2018 23:38:17 +0000

In reply to Corey.

Mark, in this case the data is what the rover provides: the pair ( lower-bound , upper-bound ) that defines the interval.

Daniel, the example of the rover was not about getting a confidence interval for an undefined parameter but for a well defined quantity (iron content) and using a well defined procedure (collect 100 samples from the rock, and construct a confidence interval for the mean iron content using standard CI procedure C, and transmit the interval to us).
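A simulation sketch of the kind of procedure described here, to make the "95%" concrete as a long-run property. The normal measurement model, the true mean of 0.30, the spread of 0.05, and the z-based interval are hypothetical stand-ins for "standard CI procedure C".

```python
import random
import statistics

def rover_interval(samples, z=1.96):
    """Mean plus or minus z standard errors, computed from the samples alone."""
    n = len(samples)
    mean = statistics.fmean(samples)
    se = statistics.stdev(samples) / n ** 0.5
    return mean - z * se, mean + z * se

def simulated_coverage(true_mean=0.30, sd=0.05, n=100, rocks=5000, seed=2):
    """Fraction of simulated rocks whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(rocks):
        samples = [rng.gauss(true_mean, sd) for _ in range(n)]
        lo, hi = rover_interval(samples)
        hits += lo <= true_mean <= hi
    return hits / rocks

print(simulated_coverage())  # close to 0.95 over many simulated rocks
```

The coverage figure describes the procedure across many rocks; what to believe about one transmitted interval is the separate question the rest of this comment addresses.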

> I think after seeing the interval (which is now our data) and having no other information to condition on (eh…) we should assign a probability distribution for the mean iron content that has 95% probability mass over this interval.

I can agree with that, but if you had received the complete set of data and you didn’t have any other information to condition on you would also assign a probability distribution for the mean iron content that has 95% probability mass over that interval. If you have no other information to condition on, you use a flat prior and the confidence interval is a credible interval.

Depending on the details (and it seems to be the case for this location parameter example if the likelihood is just dependent on mu and sigma), the bounds of the confidence interval can be a sufficient statistic. If you want to do a Bayesian analysis, the information sent by the rover is enough in that case.

In general, if you have a prior for the parameter you have a prior for the probability of the confidence interval returned by the rover containing the true value. Your posterior probability for the interval containing the true value does not have to be 95%. If your prior probability was 100%, the posterior probability will be 100%. If it was 0% it will be 0%. It can be 95%, but I guess in most cases it will be somewhere between your prior probability and 95%. Even if it cannot be calculated explicitly if you don’t have a model, the 95% CI can be interpreted as evidence supporting an increase (or maybe decrease, if it was higher) of your prior probability for the interval containing the true value.
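A sketch of how the posterior mass inside a reported interval depends on the prior, under assumptions that are not in the comment itself: a normal likelihood with known standard error, a conjugate normal prior, and an interval of the form xbar ± 1.96·se (so the interval alone lets you recover xbar and se).

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def posterior_mass_in_ci(xbar, se, prior_mean=0.0, prior_sd=math.inf, z=1.96):
    """Posterior probability that mu lies in [xbar - z*se, xbar + z*se].

    prior_sd = inf is the flat-prior case, where the posterior is
    Normal(xbar, se^2) and the CI coincides with a credible interval.
    """
    lo, hi = xbar - z * se, xbar + z * se
    if math.isinf(prior_sd):
        post_mean, post_sd = xbar, se
    else:
        precision = 1.0 / prior_sd ** 2 + 1.0 / se ** 2
        post_mean = (prior_mean / prior_sd ** 2 + xbar / se ** 2) / precision
        post_sd = math.sqrt(1.0 / precision)
    return normal_cdf(hi, post_mean, post_sd) - normal_cdf(lo, post_mean, post_sd)

print(posterior_mass_in_ci(5.0, 1.0))                               # flat prior
print(posterior_mass_in_ci(5.0, 1.0, prior_mean=0.0, prior_sd=1.0))  # prior disagrees
print(posterior_mass_in_ci(5.0, 1.0, prior_mean=5.0, prior_sd=1.0))  # prior agrees
```

In this setup the flat prior gives about 0.95, a prior centred far from the interval pulls the mass well below that, and a prior centred on it pushes the mass above.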

By: Corey Yanofsky

Corey Yanofsky — Sat, 06 Jan 2018 23:35:39 +0000

In reply to Corey.

No, I'm not fine with that, in the sense that the doctrinaire Bayesian in me refuses to assign probabilities that could result in conditioning on a contradiction. It is a constant irritant niggling at me that using probability for logical uncertainty in that fashion (and especially as in the case of Bayesian numerical integration) works as well as it does since I know of no foundations to justify that use case.

By: a reader

a reader — Sat, 06 Jan 2018 23:19:55 +0000

In reply to Corey.

Corey:

Either I’m not following your point, or you’re not following mine. Perhaps this will help illustrate: are you comfortable saying that since you don’t want to sit down and do the math/coding required, you’re fine with saying your personal probability that 12909809723450982345 is divisible by 7 is 1/7?

I’m fine with saying that. Now, even after I type in “mod(12909809723450982345, 7)”, I’m still fine with saying “Conditioning on what I just saw spit out by my computer, my personal probability that 129…45 is divisible by 7 is 0 (or 1). But before I saw that, my personal probability was 1/7”.

I don’t think this is just being annoying. I think it’s crucial to the interpretation that a Bayesian posterior is an update of *a* prior, and there are lots of different priors, some better than others.
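The before-and-after described above can be written out as a small sketch. The function names are made up for illustration, and the long-run check draws 64-bit integers as a stand-in for "a number I haven't bothered to divide".

```python
import random

def prior_divisible_by_7():
    """Probability before doing the division, using only 'one in seven'."""
    return 1 / 7

def posterior_divisible_by_7(x):
    """Probability after conditioning on the computed remainder."""
    return 1.0 if x % 7 == 0 else 0.0

def long_run_frequency(trials=70_000, bits=64, seed=0):
    """Share of random integers divisible by 7 (supports the 1/7 prior)."""
    rng = random.Random(seed)
    hits = sum(rng.getrandbits(bits) % 7 == 0 for _ in range(trials))
    return hits / trials

x = 12909809723450982345
print(prior_divisible_by_7())       # before looking at the computer output
print(posterior_divisible_by_7(x))  # after conditioning on x % 7
print(long_run_frequency())         # close to 1/7
```

Both numbers are legitimate probabilities; they differ only in what sits on the right-hand side of the conditioning bar.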

By: Corey Yanofsky

Corey Yanofsky — Sat, 06 Jan 2018 23:04:29 +0000

In reply to Corey.

reader, Bayesian foundations (of the Cox-Jaynes variety) postulate that if B => A then Pr(A | B) = 1; that is, Bayesian probability models a “logically omniscient” reasoner who has an oracle for the logical implications of any set of assumptions. (That’s also helpful for postulating that we never condition on a contradiction.) For logical uncertainty we need something else; what, exactly, is not yet known but progress is being made: https://intelligence.org/2016/09/12/new-paper-logical-induction/

By: a reader

a reader — Sat, 06 Jan 2018 22:11:28 +0000

In reply to Corey.

Part of what I like about viewing a confidence interval in this way is that it very thoroughly points out that if there are values in the confidence interval that seem very unlikely to you, you shouldn’t just accept them as now being (relatively) likely values!

On the other hand, if there are values in the credible interval that seemed very unlikely to you a priori, you might need to reconsider if these values are really so unlikely.

Realistically, you should first reconsider if you had a reasonable prior/likelihood function.

By: Mark Schaffer

Mark Schaffer — Sat, 06 Jan 2018 22:08:40 +0000

In reply to Corey.

Carlos:

“If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what is the interval. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.”

In Daniel’s “Mars rover” example, you can’t condition on the data because you don’t have the data – all the rover sent was the 95% interval it calculated. (I really do like this example!) And if you have the interval and nothing more, and the CI procedure assumptions are met and it’s a “standard problem” (no empty CIs possible etc.), then the claim is that you can assign a probability of 95% that the parameter is in the interval. Or am I misunderstanding your point here?

By: Daniel Lakeland

Daniel Lakeland — Sat, 06 Jan 2018 22:08:33 +0000

In reply to Corey.

Carlos: suppose you have brain damage, and you don’t know what it means to be even… (i.e. “I know what evenness is” isn’t on the right side of your conditioning bar). Then if someone tells you, here’s the number 42, it came out of a random number generator that gives even numbers 50% of the time, what is the probability to you that the number is even?

Or alternatively, suppose someone gives you a number, and they say it’s from an RNG that gives numbers that are FLORG 50% of the time. You have no idea what FLORG means, but it’s a well defined thing. You have the number. Conditional on your information, you can only say it has 50% probability of being FLORG.

Your point is essentially amplifying what I already said, which is that conditioning ONLY on the knowledge that an interval came from a particular RNG / CI procedure is usually the wrong thing to do. But it’s a good amplification because it shows how background information is important at even the most basic level. We have LOTS of background information on every real world problem.

By: a reader

a reader — Sat, 06 Jan 2018 22:04:42 +0000

In reply to Corey.

Maybe I just see Bayesian statistics in a different light…but I think this all follows from my earlier point.

For example, Carlos’s example with even/odd numbers. Conditional on only knowing that you have a function that returns an even number 50% of the time, if you *just* condition on this fact, then any number you get, you can say “conditional only on the procedure, there is a 50% chance this number is even”. So if the function returns a 42, if you only condition on the procedure and not your expert information about even/odd numbers, you say “conditional only on what I know about this function and nothing else, there is a 50% probability that 42, the number returned by this function, is even”. Of course, you can also say “conditional on what I know about this function and what I learned in kindergarten, there’s 100% probability that 42 is even”.

To demonstrate further, suppose I use a discrete uniform rng and get output 8912437587987614581234095. If I don’t use a computer nor care to waste any time doing long division, I’m perfectly happy with saying “Given what I know about the rng + my mathematics background, the probability that the number above is divisible by 7 is 1/7”. After I check on my computer, I’m happy to update my posterior to 0 or 1, but I recognize that this is conditional on me having checked.

By: Mark Schaffer

Mark Schaffer — Sat, 06 Jan 2018 18:28:53 +0000

In reply to Corey.

Of course, it’s not that it can’t be empty, it’s that if it’s empty you know the conditional probability claim is wrong. (Which is what I think you meant.) And even if you only know that it’s possible that it’s empty, then you also know that if it’s not empty the conditional probability claim is also wrong. (If sometimes it’s going to be empty, the rest of the time it’ll be too wide.)

The way around it, I think, is to say “for standard problems only” (like with the simple bet-proof case) which means this can’t happen and the conditional probability claim will be ok. I guess you can say this is included on the right hand side of the vertical bar. But it’s different in that it’s something you know about the method rather than something you know about the parameter you’re estimating.
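A sketch of the kind of "nonstandard" procedure being discussed, constructed purely for illustration: it returns an empty interval some fixed fraction of the time and compensates by widening the rest, so the overall coverage is still 95%. The 3% empty rate and the known-standard-error normal model are assumptions.

```python
import random
from statistics import NormalDist

def odd_ci(xbar, se, rng, p_empty=0.03, level=0.95):
    """Return None (an empty interval) with probability p_empty,
    otherwise an interval widened so overall coverage equals level."""
    if rng.random() < p_empty:
        return None
    cond_level = level / (1.0 - p_empty)  # coverage needed when non-empty
    z = NormalDist().inv_cdf(0.5 + cond_level / 2.0)
    return (xbar - z * se, xbar + z * se)

def coverage(mu=0.0, se=1.0, trials=100_000, seed=1):
    """Overall coverage, and coverage among the non-empty intervals."""
    rng = random.Random(seed)
    covered = nonempty = 0
    for _ in range(trials):
        xbar = rng.gauss(mu, se)
        ci = odd_ci(xbar, se, rng)
        if ci is None:
            continue  # an empty interval never covers mu
        nonempty += 1
        if ci[0] <= mu <= ci[1]:
            covered += 1
    return covered / trials, covered / nonempty

overall, given_nonempty = coverage()
print(overall)         # about 0.95
print(given_nonempty)  # noticeably above 0.95
```

Seeing an empty interval tells you the 95% claim is false for that realization, and seeing a non-empty one means the right number is higher than 95%, which is why the claim needs the "standard problems only" qualifier.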

By: Carlos Ungil

Carlos Ungil — Sat, 06 Jan 2018 17:58:53 +0000

In reply to Corey.

> If all you know is an rng gave you a random output with certain frequency behavior it’s justified to assign the probability to the event whose frequency is known…

Let’s say a rng gives you a number which is even with 50% probability. If that’s *all* you know (in particular, you do not know the number) it may be justified to assign 50% probability to the event that the number is even. On the other hand, if you know that the number generated is 42 it’s not justified to say that the probability that it is even is 50%. If you get a confidence interval with 95% frequentist coverage it may be justified to say that the probability of covering the true value is 95% but only as long as you don’t know what the interval is. If you do, you should condition on the data and the frequentist coverage guarantee is no longer valid.

By: Daniel Lakeland

Daniel Lakeland — Sat, 06 Jan 2018 17:54:55 +0000

In reply to Corey.

Mark: in reality, yes; in the formalism I'm less sure. What is the information you're using to infer that the CI can't be empty? It seems it should be something else included on the right hand side of the vertical bar. But I think the point is well made: we ALWAYS have useful information about real problems. The biggest problem with interpreting a CI as a credible interval is that it treats the problem as if you are algorithmically testing the quality of a pseudo-random number generator. That's never what you're doing.

By: Mark Schaffer

Mark Schaffer — Sat, 06 Jan 2018 17:40:01 +0000

In reply to Corey.

Oops! "If the realized CI that the Mars rover SENDS is empty".

By: Mark Schaffer

Mark Schaffer — Sat, 06 Jan 2018 17:37:38 +0000

In reply to Corey.

Thanks Daniel. I think I get your point about prior science knowledge. But if you had in mind my passing remark about “standard problems” etc., I think that’s different. If the realized CI that the Mars rover is empty – possible for “nonstandard problems” – you know the conditional probability claim about this realized CI has to be wrong (and not because of prior science knowledge etc.). But maybe that’s not what you meant, in which case apologies (and thanks for continuing to engage … much appreciated).

By: Daniel Lakeland

Daniel Lakeland — Sat, 06 Jan 2018 16:34:30 +0000

In reply to Corey.

Mark, I think the biggest issue is that the information the Bayesian conditions on is the absolute minimum information for the random CI generation procedure to have its frequency properties, and no more. In particular, any knowledge of the underlying science or measurement tools, or some logic such as that you can’t have negative iron, changes the conclusion and changes the modeling method. For example, Corey suggests making the data unknown parameters subject to your science knowledge, and then putting distributions on them subject to them resulting in the given CI… It could radically alter the resulting conclusion.

If all you know is an rng gave you a random output with certain frequency behavior it’s justified to assign the probability to the event whose frequency is known… Like rolling a well made well rolled die, or calling a well tested rng function.

By: Chris Wilson

Chris Wilson — Sat, 06 Jan 2018 13:42:27 +0000

In reply to Frank Harrell.

Yep! Regardless of what some have suggested here and there, most researchers discuss and interpret their CIs (from least squares, max likelihood, whatever) like credible intervals, i.e. automatically and unconsciously apply a uniform prior over parameter space. I have never seen a research presentation where it was made clear that frequentist CIs are basically *procedural*, or an error statistical perspective was used.

By: Mark Schaffer

Mark Schaffer — Sat, 06 Jan 2018 12:18:37 +0000

In reply to Corey.

Daniel,

Thanks! Very clear. And maybe the “unfortunately” isn’t warranted, at least for me. I like the Mars rover example and the conclusion that “p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95” because it looks like it could be useful in a teaching context.

My problem all along is that teaching students how to calculate CIs and at the same time telling them “don’t try to interpret realized CIs, wrong, can’t do that” doesn’t work too well.

The Mars rover example – the frequentist robot hands its result to a Bayesian human, who interprets it (is that fair?) – looks like something that (a) students will understand and remember, and (b) is actually correct.

Maybe it needs a footnote so that “CI procedure assumptions are met” includes some extra assumptions (possibly the same ones that the “bet-proof” interpretation needs, i.e., it’s a “standard problem”)? Otherwise you could have a CI procedure with 95% coverage that sometimes generates intervals that are empty or the entire real line. But that’s OK. “Standard problem” includes almost everything that we teach at this level.

By: Daniel Lakeland

Daniel Lakeland — Sat, 06 Jan 2018 06:30:58 +0000

In reply to Corey.

Mark Schaffer:

Unfortunately I think the only thing we came to is that p(Theta in CI | CI Transmitted from Mars Rover, CI procedure assumptions are met) = 0.95

This isn’t enough information to give a posterior distribution over the parameter Theta; it just constrains a particular integral of that posterior. We can say that provided the CI procedure’s assumptions are met, we should assign 95% mass to the interval, but we should assign 5% mass to “outside the interval” and we don’t have a general way to make a useful probability distribution from those two pieces of info when the parameter space is unbounded.

Any information we add which would allow us to make a proper probability distribution would be added information, and the combination of this added information, and the CI procedure/interval would potentially alter the probability being assigned to the interval.

By: Alex

Alex — Fri, 05 Jan 2018 14:59:45 +0000

In reply to Andrew.

> The key point is that confidence intervals are used to express uncertainty

Isn’t that a mixing of two interpretations of probability, though? Or are you defining “confidence interval” outside of the technical definition in the frequentist interpretation of probability?

I would say that a confidence interval only expresses uncertainty inasmuch as it agrees with a Bayesian credible interval, and then you have to say which Bayesian credible interval you mean.

By: Huw Llewelyn

Huw Llewelyn — Thu, 04 Jan 2018 23:28:34 +0000

In reply to Huw Llewelyn. Thank you everyone for your comments, which have been very helpful.

By: Huw Llewelyn

Huw Llewelyn — Thu, 04 Jan 2018 22:37:57 +0000

In reply to Huw Llewelyn.

Daniel. I am talking about something more universal than diagnosis – and also more universal than the Bayesian approach to diagnosis and statistical inference for that matter. It is the use of probability theory to explain human verbal reasoning, which happens in all walks of life but is done intensely in medical settings. You must not forget that a working diagnosis is simply an example of a hypothesis and that a final diagnosis is an example of a theory. The same probability theory applies to both.

You end by talking about coffee drinkers in a population who drink milk. Instead of coffee drinkers, I use the example of people who are bilingual in my pre-print, and discuss the issues of uniform base-rate priors, non-uniform and non-base rate priors and posterior probabilities carefully and in some detail.

By: Daniel Lakeland

Daniel Lakeland — Thu, 04 Jan 2018 17:58:53 +0000

In reply to Huw Llewelyn.

Curious, thanks for the kind words. I truly am confused though. My impression was that Huw was talking about something more universal than diagnosis. Perhaps that is what confused me.

The mathematics of probability is the same whether probabilities are thought of as proportions or degrees of plausibility. This is true. However, using this fact to invent some sort of proportion story around a Bayesian analysis has been the single biggest source of confusion around interpretation of Bayesian analysis, so I’m generally not in favor of that. It becomes even more confusing when we think about Bayesian analysis of proportions.

Suppose we want to estimate the Bayesian probability under some model that the proportion of coffee drinkers who add milk is less than 30%…

You could imagine the set of coffee drinkers. Then you could imagine sampling from them uniformly using an rng. We now have a frequency probability that the sample will contain less than 30% milk takers. And this is conceptually totally different from the Bayesian probability that the full set of coffee drinkers has less than 30% of its population milk takers. There is not a need for any confusing “two kinds” of priors here.
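A rough sketch of the two different quantities, using only the Python standard library. All numbers here are made up for illustration: a population proportion of 0.5, a sample of 10, an observed count of 2 milk takers, and a uniform Beta(1, 1) prior. The first function is a sampling (frequency) probability about the sample; the second is a Monte Carlo estimate of a posterior probability about the population.

```python
import math
import random

def freq_prob_sample_below(p_true, n, threshold=0.30):
    """P(sample proportion < threshold) when drawing n people at random
    (binomial model) from a population whose true proportion is p_true."""
    return sum(
        math.comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
        for k in range(n + 1)
        if k / n < threshold
    )

def bayes_prob_population_below(k, n, threshold=0.30, a=1.0, b=1.0,
                                draws=100_000, seed=0):
    """Monte Carlo estimate of P(population proportion < threshold | k of n observed)
    under an assumed Beta(a, b) prior, i.e. a Beta(a + k, b + n - k) posterior."""
    rng = random.Random(seed)
    below = sum(rng.betavariate(a + k, b + n - k) < threshold for _ in range(draws))
    return below / draws

if __name__ == "__main__":
    print(freq_prob_sample_below(p_true=0.5, n=10))   # statement about a future sample
    print(bayes_prob_population_below(k=2, n=10))     # statement about the population
```

The first number needs the true proportion as an input and describes samples; the second takes observed data as input and describes the unknown proportion itself.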

By: Huw Llewelyn

Huw Llewelyn — Thu, 04 Jan 2018 16:08:58 +0000

In reply to Huw Llewelyn.

You may know the ‘base rate’ prior probability (a term used in the phrase ‘the base-rate fallacy’) as the ‘unconditional prior probability’ [e.g. written as the p(A) or p(B)] of Bayes rule. Thus Bayes rule is p(A|B) = p(A) x p(B|A) / p(B). The term ‘unconditional’ is another confusing statistical misnomer of course because by p(A) and p(B) we really mean p(A|U) and p(B|U), U being some universal set such that A⊆U and B⊆U so that p(U|A) = 1 and p(U|B) = 1.
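As a small check of this reading of Bayes rule, the sketch below treats probabilities as proportions of a finite universal set. The counts are borrowed from the appendicitis example elsewhere in this thread (400 patients, 100 with appendicitis, LRLQ pain in 75 of those and in 125 of the other 300); the set construction and the helper `p` are assumptions made purely for illustration.

```python
from fractions import Fraction

# Universal set U: 400 patient IDs. A: appendicitis. B: LRLQ pain.
U = set(range(400))
A = set(range(100))                          # 100 of 400 have appendicitis
B = set(range(75)) | set(range(100, 225))    # 75 with appendicitis, 125 without

def p(event, given):
    """Conditional proportion p(event | given) by counting members."""
    return Fraction(len(event & given), len(given))

direct = p(A, B)                             # p(A|B) by counting
via_bayes = p(A, U) * p(B, A) / p(B, U)      # p(A|U) x p(B|A) / p(B|U)

if __name__ == "__main__":
    print(direct, via_bayes, p(U, A), p(U, B))
```

Because A and B are subsets of U, p(U|A) and p(U|B) both come out as 1, and the two routes to p(A|B) agree.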

I explain this in the first few pages of Chapter 13 of the 3rd edition of the Oxford Handbook of Clinical Diagnosis (see http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13 ). I am aware that Bayesians claim that probabilities are degrees of belief and not observed frequencies but the point I make is that probabilities obey the same rules as proportions even if they are imaginary proportions. I am in the process of rewriting this chapter for the 4th edition and sent an extract in my entry to this blog at 3.08pm on 3 January.

I am not disputing the way Bayesians use ‘Bayesian priors’ but am simply putting these Bayesian priors in the wider context of probability theory. The way that I explain the basics of probability theory makes it consistent with the way that my medical colleagues and I use the concepts verbally during discussions with each other, patients, etc.

In my recent paper preprint (https://arxiv.org/ftp/arxiv/papers/1710/1710.07284.pdf), I show that during random sampling to estimate the value of a fixed parameter, the underlying ‘unconditional prior’ (AKA ‘base rate prior’ AKA ‘prior probability conditional on a universal set’) is uniform even though the Bayesian prior probability conditional on other unspecified evidence is not uniform. This also allows us to calculate a frequentist posterior probability distribution based on data alone that can be combined with a Bayesian probability distribution. An advantage of ‘looking behind’ Bayesian probability distributions at the underlying uniform ‘base rate’ priors is that it allows frequentist and Bayesian concepts to be combined.

I hope that this clarifies my reasoning.

By: Curious

Curious — Thu, 04 Jan 2018 14:43:07 +0000

In reply to Huw Llewelyn.

Daniel,

I know you seriously answer and genuinely attempt to understand others’ perspectives in your responses on this blog, which is why I am confused by your response here: it seems you are determined to misunderstand Huw’s labels, which seem obvious. Huw is using an epidemiological example of the incidence of an event in a population and using the term base rate to refer to that incidence. The notion is that a screening tool is only useful if it can provide information that improves the identification of someone with the condition above random selection, for which the base rate would be the estimated population probability.

I don’t understand what is confusing about that. If you are saying that this term is specific to diagnosis and selection problems and not to problems of physical distance, then sure I suppose I understand your point, but it is simply another term for the probability of incidence in a population which can be used to inform a model.

Let’s say we created a binary logistic model using a beta prior and estimated theta for the LRLQ. How would you assess whether this model combined with this screening tool provides utility to the diagnostician?

By: Daniel Lakeland

Daniel Lakeland — Thu, 04 Jan 2018 05:54:35 +0000

In reply to Huw Llewelyn.

Curious, well first off I hear base rate and I think frequency. So if we are trying to estimate a frequency then fine, but if we are demanding that a probability associated with a parameter be a frequency, then this is not what a Bayesian probability is… So that seems confusing.

Next, philosophically, any quantity can have a Bayesian probability distribution assigned to it, and in particular that quantity can have any dimensions, for example something like length^3/time/temperature. If you have a historical record of such a quantity and use it to assign a location parameter to the distribution over that parameter, in what sense is that a “base rate”? I’m truly completely lacking an answer.

It seems Huw has some ideas in mind that don’t align.

By: Curious

Curious — Thu, 04 Jan 2018 02:27:08 +0000

In reply to Huw Llewelyn. Daniel, How does a base rate differ philosophically from a point estimate from previous research used to specify the mean of a prior?

By: Daniel Lakeland

Daniel Lakeland — Thu, 04 Jan 2018 00:50:55 +0000

In reply to Huw Llewelyn.

Huw, I would say we’ve had many philosophical discussions on Bayesianism on this blog and your take on this seems to be in a different direction than the way that is commonly discussed here. So much so that I don’t recognize what you’re really talking about by the names you use etc.

The usual characterization of Bayes is that it calculates a degree of plausibility of an assertion about a true or false claim. The next more subtle thing is something I’m working on where it’s more of a degree of accordance with both theory and data. This enables you to have a meaningful discussion about Bayesian models for things where the model isn’t “perfect” and so “truth” is not well defined. I’ve got a half written paper on that.

But in all of these philosophical discussions one thing has been true, there is never anything called the “base rate” which meaningfully enters into the philosophy. A “Base rate” might be one piece of information you would use to assign a degree of plausibility or accordance or whatever, but it doesn’t hold any fundamental position in the philosophy. Given this, your description reads to me like someone coming from some existing well developed background very different from “ours” here at the blog, and having a lot of specific ideas couched in that framework, but we don’t recognize that framework and so we’re all talking past each other.

Perhaps it’s just a terminology issue, but as Corey says, it mostly seems very idiosyncratic.

By: Huw Llewelyn

Huw Llewelyn — Wed, 03 Jan 2018 20:08:35 +0000

In reply to Huw Llewelyn. The terminology of statistics is already a nightmare and I apologise because this has been compounded by some typos of mine (e.g. omitting the term ‘distribution’). One source of confusion is a failure to distinguish clearly between ‘base-rate’ and ‘non-base rate’ prior probabilities (I dislike these terms!). A Bayesian prior probability distribution for an unknown parameter is of the ‘non base-rate’ variety. It is combined with a likelihood probability distribution based on data to create an updated posterior probability distribution. (I contend that when random sampling is used to estimate the value of some single parameter, although the non-base rate prior may not be uniform, the base-rate prior is uniform.) The following extract from the next edition of the Oxford Handbook of Clinical Diagnosis tries to explain the difference between a ‘base-rate’ and ‘non base-rate’ prior. (I would also like to point out that I refer to the ‘product rule’ as an assumption of ‘statistical independence’).

"Bayesian statisticians emphasise the importance of specifying an ‘informal prior probability’ based on informal evidence that is then combined with substantiated probabilities (i.e. based on observations that others can share). For example, a Bayesian might suggest on the basis of such ‘informal evidence’ that the ‘prior probability’ of finding someone with appendicitis in a study is 0.6. When this ‘informal evidence’ is combined with another finding (e.g. LRLQ pain) a new posterior probability is created. This posterior probability then becomes the new prior probability if the evidence so far is combined with yet another finding (e.g. guarding). It should be emphasised at this stage that there are two types of prior probability: (1) the ‘base rate prior’ based on the universal set and (2) the non-base rate prior based on the universal set and one or more of its subset(s).

The base rate prior proportion and probability for appendicitis is 100/400, if the universal set is a group of 400 patients studied to which patients with all the other findings belong (i.e. those with appendicitis, no appendicitis, LRLQ pain, no LRLQ pain, guarding, the ‘informal evidence’, etc.). The patients showing the ‘informal evidence’ used for the Bayesian prior cannot be assumed to be a ‘universal set’ of which those patients with LRLQ pain, guarding, appendicitis and NSAP were subsets. We have to assume therefore that those with the ‘informal evidence’ could be a subset of the 400 studied, giving rise to a non-base rate prior of 0.6. The ‘non-base rate’ prior probability of 0.6 can be used to calculate a ‘posterior probability’ of appendicitis (Appx) by combining the ‘informal evidence’ (IE) and LRLQ pain:

$$\frac{1}{1 + \dfrac{\Pr(\text{No Appx} \mid \text{IE})}{\Pr(\text{Appx} \mid \text{IE})} \cdot \dfrac{\Pr(\text{LRLQ pain} \mid \text{No Appx})}{\Pr(\text{LRLQ pain} \mid \text{Appx})}} = \frac{1}{1 + \dfrac{1 - 0.6}{0.6} \cdot \dfrac{125/300}{75/100}} = 0.73$$

The above calculation implies that there is statistical independence between the frequency of occurrence of the ‘informal evidence’ (IE) and LRLQ pain in those with appendicitis, and in those without appendicitis. For example, if the proportion of patients with the ‘informal evidence’ in those with appendicitis was 9/100 and its frequency in those without appendicitis was 6/300, then the assumption of statistical independence means that the proportion with the informal evidence and LRLQ pain in those with appendicitis would be assumed to be 9/100 × 75/100 = 6.75/100. Similarly the proportion with the informal evidence and LRLQ pain (i.e. ‘IE & LRLQ pain’) in those without appendicitis would be assumed to be 6/300 × 125/300 = 2.5/300. We can now calculate the estimated proportion with appendicitis by using the base-rate prior proportion of 100/400 for appendicitis in the group studied (SG).

Again, it is 0.73:

$$\frac{1}{1 + \dfrac{\Pr(\text{No Appx} \mid \text{SG})}{\Pr(\text{Appx} \mid \text{SG})} \cdot \dfrac{\Pr(\text{IE \& LRLQ pain} \mid \text{No Appx})}{\Pr(\text{IE \& LRLQ pain} \mid \text{Appx})}} = \frac{1}{1 + \dfrac{300/400}{100/400} \cdot \dfrac{2.5/300}{6.75/100}} = 0.73$$"
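The arithmetic in the extract can be checked with exact fractions. This is only a sketch: the helper name `posterior` is made up here, while the counts are the ones quoted in the extract.

```python
from fractions import Fraction as F

def posterior(prior, lik_given_no, lik_given_yes):
    """Odds form of Bayes rule: 1 / (1 + prior odds against x likelihood ratio)."""
    prior_odds_against = (1 - prior) / prior
    return 1 / (1 + prior_odds_against * (lik_given_no / lik_given_yes))

# Route 1: non-base rate prior 0.6 (given the informal evidence), updated with LRLQ pain.
route1 = posterior(F(6, 10), F(125, 300), F(75, 100))

# Route 2: base rate prior 100/400, updated with 'IE & LRLQ pain', whose likelihoods
# are products under the statistical independence assumption.
route2 = posterior(
    F(100, 400),
    F(6, 300) * F(125, 300),   # = 2.5/300 in those without appendicitis
    F(9, 100) * F(75, 100),    # = 6.75/100 in those with appendicitis
)

if __name__ == "__main__":
    print(route1, float(route1))
    print(route2, float(route2))
```

Both routes reduce to the same fraction, 27/37, which rounds to 0.73. They agree because 9/100 and 6/300 imply a proportion of 9/(9+6) = 0.6 with appendicitis among those showing the informal evidence.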