I prefer 50% to 95% intervals for 3 reasons:

1. Computational stability,

2. More intuitive evaluation (half the 50% intervals should contain the true value),

3. A sense that in applications it’s best to get an idea of where the parameters and predicted values will be, not to attempt an unrealistic near-certainty.
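Point 2 is easy to check by simulation. A minimal sketch (the normal-approximation interval, sample sizes, and seed here are all illustrative choices, not anything from the post):

```python
import random
import statistics

def coverage_of_50pct_intervals(n_sims=2000, n=30, mu=0.0, sigma=1.0, seed=1):
    """Simulate normal data repeatedly and return the fraction of
    50% intervals for the mean that contain the true mean mu."""
    random.seed(seed)
    z50 = 0.6745  # 75th percentile of the standard normal: xbar +/- z50*se covers ~50%
    hits = 0
    for _ in range(n_sims):
        draws = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(draws)
        se = statistics.stdev(draws) / n ** 0.5
        if xbar - z50 * se <= mu <= xbar + z50 * se:
            hits += 1
    return hits / n_sims
```

With these settings the fraction comes out close to one half — exactly the kind of direct check that is hopeless with 95% intervals and a few dozen cases.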

This came up on the Stan list the other day, and Bob Carpenter added:

I used to try to validate with 95% intervals, but it was too hard because there weren’t enough cases that got excluded and you never knew what to do if 4 cases out of 30 were outside the 95% intervals.

(3) is a two-edged sword because I think people will be inclined to “read” the 50% intervals as 95% intervals out of habit, expecting higher coverage than they have. But I like the point about not trying to convey an unrealistic near-certainty (which is exactly how I think people look at 95% intervals, because of the p-value convention at .05).

And remember to call them uncertainty intervals.

It’s the point at which a bet becomes worthwhile

… depending on the bet payoff

I like 50% because it is easy to describe in words. Some alternatives could be:

68% as it is similar to a standard deviation.

66% as it is also easy to describe. Using the UN IPCC guidance note on uncertainty you could call this the “likely range”. https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf

This is a very interesting document! Thanks for providing it.

(Not directly related to today’s post, just of general interest. If there’s a different place you’d prefer comments like this, let me know.)

You wrote recently about researchers explaining away failed replications by postulating all sorts of new modulators, and said that “This was what the power pose authors said about the unsuccessful replication performed by Ranehill et al.” That’s not quite true; only Amy Cuddy said that.

Dana Carney, the lead author of the original power pose paper, now has a statement on her website (http://faculty.haas.berkeley.edu/dana_carney/vita.html, “My position on ‘Power Poses'”) that begins:

“Reasonable people, whom I respect, may disagree. However since early 2015 the evidence has been mounting suggesting there is unlikely any embodied effect of nonverbal expansiveness (vs. contractiveness)—i.e., “power poses” — on internal or psychological outcomes.

As evidence has come in over these past 2+ years, my views have updated to reflect the evidence. As such, I do not believe that “power pose” effects are real…”

The citation to the paper on her website is now immediately followed by a note:

” ***This result failed to replicate in an adequately powered sample. See: Ranehill, Dreber, Johannesson, Leiberg, Sul, & Weber (2015). [.pdf] and a p-curve analysis also suggested the effect is not likely real [.pdf]”

I feel this blog could use more positive examples of researchers responding well to failed replications, and this is an excellent example of that.

Af:

When I wrote about power pose before the recent Carney statement, what I wrote was that Cuddy was defending the work and Carney and Yap were keeping quiet. I wondered if Carney and Yap were not speaking up because they were hoping the whole thing would go away. After Carney’s statement came out, I wrote that Carney had changed her position but Cuddy was still defending the work. Yap remains quiet, but last time I checked, the power pose paper was still featured on his website.

In a case where truth indeed is stranger than fiction, Prof. Yap lists one of his research interests (in his CV) as “Dishonesty, Deception and Unethical Behavior”.

Llewelyn:

There’s something that fascinates me about the name Andy Yap, because my name is Andy and I yap all the time!

So you’re Andy Yapper

Sure. You’ve been reasonable about it. I just hadn’t seen you comment on the statement and wanted to be sure you’d seen it.

From deep down in the comment section, continuing beyond that, and then mentioned as a P.P.S. to a post. Hard to find, I know, but many of us applauded her behavior:

http://andrewgelman.com/2016/09/24/a-break-in-the-thin-blue-line/#comment-314390

Thanks for the link. I missed that completely when looking for responses here.

I like Richard McElreath’s approach in his book Statistical Rethinking a lot. He just picks prime numbers to illustrate that these numbers are entirely arbitrary (and probably to annoy people who think there’s something magical about 0.05 or 95%). So why not a 53% interval instead of 50%? I usually go for 89% intervals and smile every time I do.

Or you could use the Bill and Ted interval:

https://www.youtube.com/watch?v=XsC8zEgZEfo

+100

People already go into shock if I don’t report a p-value and instead just show a 95% credible interval. If I now switch to something other than a 95% interval, I think that would be the last straw for them.

I like my intervals wide (if possible; sometimes for graphing it is too ugly) – they should really be like lower/upper bounds. Sometimes it is best to even double the span of a 99.99% interval. That way you can look at the most extreme possibilities that maybe are still consistent with observation. What would that mean for your theory? What can the theory tolerate? Using a narrower one will cause more confusion, I think.

Lower and upper bounds are often infinite—posterior support is usually equal to prior support.

Priors are going to have a big effect on the endpoints of 99.99% intervals. And you’re going to need a huge sample to estimate the necessary .00005 and .99995 quantiles—on the order of millions of draws, all of which will need to be saved, or you’ll need a special online algorithm (which can handle extreme empirical quantiles with very limited memory).
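The instability of extreme empirical quantiles is easy to see by simulation: the Monte Carlo noise in an empirical .99995 quantile dwarfs that of a central quantile from the same number of draws. A rough sketch (the draw counts and the standard-normal target are illustrative assumptions):

```python
import random
import statistics

def quantile_sd(q, n_draws=5000, n_reps=100, seed=0):
    """Monte Carlo standard deviation of the empirical q-quantile of
    standard-normal draws, estimated by repeating the simulation."""
    random.seed(seed)
    estimates = []
    for _ in range(n_reps):
        draws = sorted(random.gauss(0, 1) for _ in range(n_draws))
        idx = min(int(q * n_draws), n_draws - 1)
        estimates.append(draws[idx])
    return statistics.stdev(estimates)

# With only 5,000 draws the empirical .99995 "quantile" is just the sample
# maximum, and it bounces around far more than the .75 quantile (a 50% endpoint).
tail_sd = quantile_sd(0.99995)
mid_sd = quantile_sd(0.75)
```

The tail endpoint's Monte Carlo standard deviation comes out many times larger than the central one's, which is why millions of draws (or an online quantile algorithm) are needed for a 99.99% interval.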

As an asymptotic approximation I usually just double the interval (-inf,inf)

;-)

Look, I deserve some credit here for being the first to propose 100% confidence intervals.

That would be a good thing, surely? People’s preconceptions need to be challenged. Then again, maybe you want to get your papers published!

In my experience (clinical research) almost everybody has little idea what a confidence interval means, and they just produce 95% CIs because (a) that gives them an easy way of seeing if the result is “significant” and (b) statisticians told them they should do this about 20 years ago (and journals took it on board). Confidence intervals are almost invariably given a Bayesian interpretation (most likely values etc) – not surprisingly, because that is what people want to know. Given that, it makes sense to use a Bayesian analysis – but virtually nobody does. Why is this difficult or controversial?

> it makes sense to use a Bayesian analysis

Yes, if one has credible/sensible priors and data models (that have been checked) as well as good insight for what to make of the resulting posterior probabilities.

> but virtually nobody does

And most that do just use default priors and data models that most likely were not checked – just pulling the Bayesian crank and claiming “all done”.

(Andrew provided an example of how bad this can be for repeated use of the credible intervals implied by a default prior.)

> Why is this difficult or controversial?

I think both, but the insurmountable opportunity (challenge) is to make it less difficult and thoughtfully less controversial.

Keith – I would say that those comments apply even more to traditional analyses. There’s a better chance of people interpreting posterior probabilities correctly than p-values and confidence intervals, you need to consider priors when interpreting frequentist results anyway (everyone does this, but subconsciously), and you need to make choices about models in the same way. I guess it’s easy to do any analysis wrong (“statistics is hard,” as someone may have said).

What really puzzles me is why there is so much opposition to Bayesian methods (from personal experience) – I think the answer is largely to do with people in the field feeling heavily invested in the existing paradigm, not understanding the problems, and not wanting change.

>”why there is so much opposition to Bayesian methods”

I wouldn’t take the opposition to Bayesian methods from clinical/biomed researchers seriously. The vast majority are just following rote instructions/flow charts when it comes to data analysis. They never had the time/inclination/need to figure out what a distribution is, etc. That opposition fades away as soon as you give them the reason and tools to do MCMC parameter estimation, etc.

Now, when it comes to testing a null hypothesis vs their hypothesis, they should know better. In that case it is just an attempt to save face because what has been passing for scientific reasoning is so ridiculously and obviously fatally flawed.

Anoneuoid said, “They never had the time/inclination/need to figure out what a distribution is, etc. …

Now, when it comes to testing a null hypothesis vs their hypothesis, they should know better.”

People who never had the inclination to figure out what a distribution is can’t be expected to understand what is the problem with testing a null hypothesis vs their hypothesis. There is, regrettably, a surprising amount of ignorance out there. Once there was a proposal among some in my university for math to teach a combined calculus/statistics course for biological science students. One proponent said “Don’t waste time talking about distributions. Just teach them how to read an ANOVA table.” (Please don’t take this as a diatribe about all biological scientists. My experience is that they range widely in their understanding of statistics and their openness to being told when/why they are doing something unwarranted.)

Agreed – most of the people I work with (clinicians of various sorts) are not particularly knowledgeable or interested in statistics. They generally know what the traditional approach to studies is (significance tests, primary outcomes, sample size calculations and so on), and that’s what they expect. So there is some hope for re-education there. More concerning to me is the amount of opposition from statisticians, who really should know better. Or maybe I’ve just been unlucky.

Further to Simon’s comment – and stealing his prose ;-)

Most of the statisticians I have worked with who apply statistics (hospital research institutions, Cochrane Collaboration, Oxford Centre for Statistics in Medicine, regulatory agencies, etc.) are not particularly knowledgeable or interested in non-standard or advanced statistics (i.e., anything they did not study when they were in graduate school).

They generally know what the traditional approaches to statistics are (ANOVA, regression, GLMs, repeated-measures ANOVA, naive random effects or generalized estimating equations, and possibly default-prior Bayes), and that’s what they expect to be using in their work.

There may be some hope for re-education here, especially given good blogs, online video webinars, and running prototype analysis scripts (e.g., GitHub repositories).

Or maybe I’ve just been unlucky.

Perhaps the bigger challenge will be teaching the statistical toolbox referred to here http://www.dcscience.net/Gigerenzer-Journal-of-Management-2015.pdf – whatever that needs to be.

Simon:

Agreed that statistics is seldom not hard, and I was not suggesting that traditional analyses should be a standard or even a usual default.

> There’s a better chance of people interpreting posterior probabilities correctly

If those are taken as omnipotent (or near-omnipotent, as with well-understood disease-screening applications) then the probabilities likely will be better interpreted (at least with some training).

Now here, http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/ – is the problem all in the Bayesian setting: a flat prior (or near-flat normal prior with huge variance), often used as a default, versus Bayes with a _sensible prior_? Or is it Bayes with a sensible prior versus frequentist with no prior at all?

Logically, the first versus seems the same to me as the second versus.

I do worry about thoughtless Bayes becoming the new default analysis ritual http://www.dcscience.net/Gigerenzer-Journal-of-Management-2015.pdf

Also, I do think coverage in repeated use of Bayes does need to be considered http://andrewgelman.com/2016/11/05/why-i-prefer-50-to-95-intervals/#comment-341799 except in exceptional applications.

You can probably argue that “thoughtless Bayes” would be better than “thoughtless NHST” (or maybe that is just standard NHST, as I haven’t yet found a thoughtful version), but that isn’t a great choice to have. Thoughtful analysis trumps both.

Andrew,

Have you ever seen this app? https://www.getguesstimate.com/ I think it is pretty neat to play around with. Also it gets to a question of uncertainty for what purpose?

Elin:

Looks interesting. I looked at their documentation but couldn’t figure out what they’re doing. I guess that’s standard with non-academic software, they want to keep their algorithm a secret? Maybe we should get someone to build something similar using Stan.

Sounds good to me!

Is there a good reason not to report both intervals? I think that each interval gives a different kind of information about what I know (similar to the different moments of a distribution).

I’ve become very fond of caterpillar plots that show both a narrow (you like 50%, I like 66%) and wide (95%) intervals.

+1

Isn’t it even better to present the posterior distribution? I guess you don’t then get the precise ends of intervals but you can quote specific intervals if there is a reason to.
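Quoting whichever interval you need from posterior draws is cheap either way; a minimal sketch using empirical quantiles (the simulated standard-normal “posterior” is just a stand-in):

```python
import random

def central_interval(draws, mass):
    """Central interval containing the given probability mass,
    read off the empirical quantiles of the draws."""
    s = sorted(draws)
    n = len(s)
    lo = (1 - mass) / 2
    return s[int(lo * n)], s[min(int((1 - lo) * n), n - 1)]

random.seed(3)
draws = [random.gauss(0, 1) for _ in range(20_000)]
narrow = central_interval(draws, 0.50)  # close to (-0.674, 0.674)
wide = central_interval(draws, 0.95)    # close to (-1.96, 1.96)
```

A caterpillar plot then just draws the wide interval as a thin line and the narrow one as a thick line for each parameter.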

I think 95% intervals make more sense for making decisions, and 50% intervals may make more sense for descriptives. 50% is intuitively more meaningful to me for prediction, but saying “95% of the posterior density falls in ___ region” is much more useful for making a decision one way or another. Although many may think this is due to the α = .05 mentality, in my own case I think it’s more about quite literally what 95% posterior probability implies. I don’t think we should adopt a 95% credible interval range simply because a field also tests at α = .05.

Suppose that there’s a 95% posterior probability that your expensive effort to launch a rocket and deflect an asteroid heading to the earth will fail. So what?

Making decisions based on 95% intervals pretty much works when those intervals are small and the consequences of the decision lie in a narrow region of outcome space… in other words when there really isn’t much uncertainty.

Suppose a 95% interval for the 1-year return on a $1000 investment is 1500 to 1555; now suppose it’s -1300 to 7000.

I don’t really grasp the problem here. In the first example, you’d obviously want to compute the expected value here; if there’s a 5% probability of success (and there are no alternatives to this problem), I’m guessing expected value = .05*value of success would be quite large; that is, even with a 5% chance of success, the expected value, so to speak, of sending the rocket is extremely high (since otherwise, the value is 0 — we’re all dead)?

If you have a 95% posterior probability that the return is 1500-1555, then make the investment. If it’s [-1300, 7000], it’s not a clear cut decision then, is it? Likewise, if the 95% interval is [1, 10] (only 1-10 dollar return), it would still be a safe bet, and I’d take that bet.

These arguments go in circles because probabilities (frequencies or otherwise) shouldn’t be the basis for decisions in the absence of a utility function.

That’s one of the mistakes of the whole p-value/significance enterprise. IMO the handwaving around the importance of multiple comparison corrections is just a backdoor means of introducing implicit utilities and implicit base rates.

Yes this. Even with the definitely positive return example you only make the investment if you don’t have an even better option.

Expected return is the general purpose method.
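The investment example above can be made concrete with an explicitly assumed utility. A sketch (the normal posteriors matching those intervals and the risk-aversion coefficient are my assumptions, not anything from the thread):

```python
import math
import random
import statistics

def expected_return(draws):
    """Posterior expected return: the average of posterior draws."""
    return statistics.mean(draws)

def expected_utility(draws, a):
    """Expected exponential (CARA) utility u(x) = -exp(-a*x) over draws.
    The risk-aversion coefficient a > 0 is an assumption of this sketch."""
    return statistics.mean(-math.exp(-a * x) for x in draws)

random.seed(2)
# Hypothetical normal posteriors roughly matching the intervals above:
# ~95% mass in [1500, 1555] and ~95% mass in [-1300, 7000].
tight = [random.gauss(1527.5, 14) for _ in range(10_000)]
wide = [random.gauss(2850, 2100) for _ in range(10_000)]
```

With these draws the wide investment wins on expected return, but a risk-averse utility (e.g., a = 0.001 here) prefers the tight one – the interval alone decides nothing until the payoff structure is specified.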

I reckon there should be two classes of terms:

* “Confidence interval”, “uncertainty interval”, “credible interval” — which connote data-conditional (epistemic) properties of the interval itself, and

* “Calibrated interval”, “coverage interval” — which connote truth-conditional (sampling) properties of the interval-generating method.

Of course more often than not they line up (under *some* interpretation), but generally speaking the former implies the latter.

Agree, two terms are needed as there sometimes seems to be confusion over what is being claimed.

> generally speaking the former implies the latter.

Yes, if the prior and data model are true – which of course they never actually are – so two terms for “calibrated interval” / “coverage interval” are also needed. Sander Greenland coined “omnipotent” for coverage assuming the prior and data model are true.

Maybe this is captured implicitly in your comment about stability, but I also think that as you go to larger intervals, you become more sensitive in general to details or assumptions of your model. For example, you would never believe that you had correctly identified the 99.9995% interval. Or at least that’s my intuition. If it’s right, it argues that 50% intervals should be better estimated than 95% ones – or maybe “more robustly” estimated, since they are less sensitive to assumptions about the noise.

There is a lively discussion of your comment (and of Andrew’s post) going on at http://stats.stackexchange.com/questions/248113.

Amoeba:

All this discussion is good because people are moving beyond unthinking use of 95% intervals.