
Why do we communicate probability calculations so poorly, even when we know how to do it better?

Haynes Goddard writes:

I thought to do some reading in psychology on why Bayesian probability seems so counterintuitive, making it difficult for many to learn and apply. Indeed, that is the finding of considerable research in psychology. It turns out that it is counterintuitive because of the way it is presented, following no doubt the way the textbooks are written. The theorem is usually expressed first with probabilities instead of frequencies, or “natural numbers” (counts, in the binomial case).

The literature is considerable, starting at least with a seminal piece by David Eddy (1982), “Probabilistic reasoning in clinical medicine: problems and opportunities,” in Judgment Under Uncertainty: Heuristics and Biases, eds. D. Kahneman, P. Slovic, and A. Tversky. Also much cited are Gigerenzer and Hoffrage (1995), “How to improve Bayesian reasoning without instruction: frequency formats,” Psychological Review, and Cosmides and Tooby (1996), “Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty,” Cognition.

This literature has amply demonstrated that people actually can readily and accurately reason in Bayesian terms if the data are presented in frequency form, but have difficulty if the data are given as percentages or probabilities. Cosmides and Tooby argue that this is so for evolutionary reasons, and their argument seems compelling.

So taking a look at my several texts (not a random sample, of course), including Andrew’s well-written text, I wanted to know how many authors introduce the widely used Bayesian example of determining the posterior probability of breast cancer after a positive mammography in frequency terms or counts first, then shift to probabilities. None do, although some do provide an example in frequency terms later.

Assuming that my little convenience sample is somewhat representative, it raises the question of why the psychologists’ recommendations have not been adopted.

This is a missed opportunity, as the psychological findings indicate that the frequency approach makes Bayesian logic instantly clear, making it easier to comprehend the theorem in probability terms.

Since those little medical inference problems are very compelling, it would make the lives of a lot of students a lot easier and increase acceptance of the approach. One can only imagine how much sooner the sometimes acrimonious debates between frequentists and Bayesians would have diminished if not ended. So there is a clear lesson here for instructors and textbook writers.

Here is an uncommonly clear presentation of the breast cancer example, and there are numerous comments from beginning statistics students noting its clarity.

My response:

I agree, and in a recent introductory course I prepared, I did what you recommend and started right away with frequencies, Gigerenzer-style.

Why has it taken us so long to do this? I dunno, force of habit, I guess? I am actually pretty proud of chapter 1 of BDA (especially in the 3rd edition with its new spell-checking example, but even all the way back to the 1st edition in 1995) in that we treat probability as a quantity that can be measured empirically, and we avoid what I see as the flaw of seeking a single foundational justification for probability. Probability is a mathematical model with many different applications, including frequencies, prediction, betting, etc. There’s no reason to think of any one of these applications as uniquely fundamental.

But, yeah, I agree it would be better to start with the frequency calculations: instead of “1% probability,” talk about 10 cases out of 1000, etc.
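To make that concrete, here is a minimal sketch of the mammography calculation done both ways. The specific rates (1% prevalence, 80% sensitivity, 9.6% false-positive rate) are my illustrative assumptions, roughly the numbers Gigerenzer uses, not values from the post:

```python
# Mammography example two ways: frequency format vs. probability format.
# The rates below are illustrative assumptions, not numbers from the post.
prevalence = 0.01    # P(cancer)
sensitivity = 0.80   # P(positive test | cancer)
false_pos = 0.096    # P(positive test | no cancer)

# Frequency format: imagine 1000 women.
n = 1000
have_cancer = n * prevalence                     # 10 women have cancer
true_positives = have_cancer * sensitivity       # 8 of them test positive
false_positives = (n - have_cancer) * false_pos  # ~95 healthy women test positive
posterior_freq = true_positives / (true_positives + false_positives)

# Probability format: Bayes' theorem applied directly.
posterior_prob = (sensitivity * prevalence) / (
    sensitivity * prevalence + false_pos * (1 - prevalence))

print(posterior_freq, posterior_prob)  # both about 0.078
```

Same arithmetic either way, but “8 out of about 103 positives actually have cancer” is the version that students grasp immediately.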

P.S. It’s funny that Goddard cited a paper by Cosmides and Tooby, as they’re coauthors on that notorious fat-arms-and-political-attitudes paper, a recent gem in the garden-of-forking-paths, power=.06 genre. Nobody’s perfect, I guess. In particular, it’s certainly possible for people to do good research on the teaching and understanding of statistics, even while being confused about some key statistical principles themselves. And even the legendary Kahneman has been known, on occasion, to overstate the strength of statistical evidence.

“Another bad chart for you to criticize”

Perhaps in response to my lament, “People used to send me ugly graphs, now I get these things,” Stuart Buck sends me an email with the above comment and a link to this “Graphic of the day” produced by some uncredited designer at Thomson Reuters:

[Image: the Thomson Reuters “Graphic of the day” under discussion]

From a statistical perspective, this graph is a disaster in that the circular presentation destroys the two-way structure (countries x topics) which has to be central to any understanding of these data. In addition, to the extent that you’d want to get something out of the graph, you’ll end up having to perform mental divisions of line widths.

At this point I’d usually say something like: On the plus side, this is a thought-provoking display (given its tentacle-like appearance, one might even call it “grabby”) that draws viewers’ attention to the subject matter. But I can’t really even say that, because the subject of the graph—nationalities of Nobel Prize winners—is one of the more overexposed topics out there, and really the last thing we need is one more display of these numbers. Probably the only thing we need less of is further analysis of the Titanic survivors data. (Sorry, Bruno: 5 papers on that is enough!)

Another stylized fact bites the dust

According to economist Henry Farber (link from Dan Goldstein):

In a seminal paper, Camerer, Babcock, Loewenstein, and Thaler (1997) find that the wage elasticity of daily hours of work of New York City (NYC) taxi drivers is negative and conclude that their labor supply behavior is consistent with target earning (having reference-dependent preferences). I replicate and extend the CBLT analysis using data from all trips taken in all taxi cabs in NYC for the five years from 2009 to 2013. The overall pattern in my data is clear: drivers tend to respond positively to unanticipated as well as anticipated increases in earnings opportunities. This is consistent with the neoclassical optimizing model of labor supply and does not support the reference dependent preferences model.

I explore heterogeneity across drivers in their labor supply elasticities and consider whether new drivers differ from more experienced drivers in their behavior. I find substantial heterogeneity across drivers in their elasticities, but the estimated elasticities are generally positive and only rarely substantially negative. I also find that new drivers with smaller elasticities are more likely to exit the industry while drivers who remain learn quickly to be better optimizers (have positive labor supply elasticities that grow with experience).

It’s good to get that one out of the way.

A silly little error, of the sort that I make every day


Ummmm, running Stan, testing out a new method we have that applies EP-like ideas to perform inference with aggregate data—it’s really cool, I’ll post more on it once we’ve tried everything out and have a paper that’s in better shape—anyway, I’m starting with a normal example: a varying-intercept, varying-slope model where the intercepts have population mean 50 and sd 10, the slopes have population mean -2 and sd 0.5 (for simplicity I’ve set up the model with intercepts and slopes independent), and the data standard deviation is 5. Fit the model in Stan (along with other stuff; the real action here’s in the generated quantities block, but that’s a story for another day), and here’s what we get:

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_a[1]    49.19    0.01 0.52 48.14 48.85 49.20 49.53 50.20  2000    1
mu_a[2]    -2.03    0.00 0.11 -2.23 -2.10 -2.03 -1.96 -1.82  1060    1
sigma_a[1]  2.64    0.02 0.50  1.70  2.31  2.62  2.96  3.73   927    1
sigma_a[2]  0.67    0.00 0.08  0.52  0.61  0.66  0.72  0.85   890    1
sigma_y     4.97    0.00 0.15  4.69  4.86  4.96  5.06  5.27  2000    1

We’re gonna clean up this output—all these significant digits are ridiculous, and I’m starting to think we shouldn’t be foregrounding the mean and sd, as these can be unstable; median and IQR would be better, maybe—but that’s another story too.

Here’s the point. I looked at the above output and noticed that the sigma_a parameters are off: the sd of the intercept is too low (it’s around 2 and it should be 10) and the sd of the slopes is too high (it’s around 0.6 and it should be 0.5). The correct values aren’t even in the 95% intervals.

OK, it could just be this one bad simulation, so I re-ran the code a few times. Same results. Not exactly the same numbers, but the sd of the intercepts was consistently underestimated and the sd of the slopes was consistently overestimated.

What up? OK, I do have a flat prior on all these hypers, so this must be what’s going on: there’s something about the data where intercepts and slopes trade off, and somehow the flat prior allows inferences to go deep into some zone of parameter space where this is possible.

Interesting, maybe ultimately not too surprising. We do know that flat priors cause problems, and here we are again.

What to do? I’d like something weakly informative, this prior shouldn’t boss the inferences around but it should keep them away from bad places.

Hmmm . . . I like that analogy: the weakly informative prior (or, more generally, model) as a permissive but safe parent who lets the kids run around in the neighborhood but sets up a large potential-energy barrier to keep them away from the freeway.

Anyway, to return to our story . . . I needed to figure out what was going on. So I decided to start with a strong prior focused on the true parameter values. I just hard-coded it into the Stan program, setting normal priors for mu_a[1] and mu_a[2]. But then I realized, no, that’s not right, the problem is with sigma_a[1] and sigma_a[2]. Maybe put in lognormals?

And then it hit me: in my R simulation, I’d used sd rather than variance. Here’s the offending code:

library(MASS)  # mvrnorm comes from the MASS package
a <- mvrnorm(J, mu_a, diag(sigma_a))

That should've been diag(sigma_a^2). Damn! Going from univariate to multivariate normal, the notation changed.
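The same slip translates directly to other languages. As a sketch (my illustration, not the actual code from this example), numpy's multivariate_normal likewise expects a covariance matrix, so handing it a diagonal of standard deviations silently distorts the spread:

```python
# Sketch of the sd-vs-variance slip: like MASS::mvrnorm, numpy's
# multivariate_normal expects a *covariance* matrix, so passing
# diag(sigma) instead of diag(sigma**2) quietly gives the wrong spread.
import numpy as np

rng = np.random.default_rng(0)
J = 10000
mu_a = np.array([50.0, -2.0])
sigma_a = np.array([10.0, 0.5])   # intended *standard deviations*

wrong = rng.multivariate_normal(mu_a, np.diag(sigma_a), size=J)
right = rng.multivariate_normal(mu_a, np.diag(sigma_a**2), size=J)

# Wrong call: sample sds come out near sqrt(10) ~ 3.16 and sqrt(0.5) ~ 0.71,
# the same direction of error as in the Stan output above.
print(wrong.std(axis=0))
# Right call: sample sds near 10 and 0.5, as intended.
print(right.std(axis=0))
```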

On the plus side, there was nothing wrong with my Stan code. Here's what happens after I fixed the testing code in R:

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_a[1]    48.17    0.11 1.62 45.08 47.07 48.12 49.23 51.38   211 1.02
mu_a[2]    -2.03    0.00 0.10 -2.22 -2.09 -2.02 -1.97 -1.82  1017 1.00
sigma_a[1] 10.98    0.05 1.18  8.95 10.17 10.87 11.68 13.55   496 1.01
sigma_a[2]  0.57    0.00 0.09  0.42  0.51  0.56  0.63  0.75   826 1.00
sigma_y     5.06    0.00 0.15  4.78  4.95  5.05  5.16  5.35  2000 1.00

Fake-data checking. That's what it's all about.


And that's why I get so angry at bottom-feeders like Richard Tol, David Brooks, Mark Hauser, Karl Weick, and the like. Every damn day I'm out here working, making mistakes, and tracking them down. I'm not complaining; I like my job. I like it a lot. But it really is work, and it's hard work sometimes. So to encounter people who just don't seem to care, who just don't give a poop whether the things they say are right or wrong, ooohhhhh, that just burns me up.

There's nothing I hate more than those head-in-the-clouds bastards who feel deep in their bones that they're right. Whether it's an economist fudging his numbers, or a newspaper columnist lying about the price of a meal at Red Lobster, or a primatologist who won't share his videotapes, or a b-school professor who twists his stories to suit his audience---I just can't stand it, and what I really can't stand is that it doesn't even seem to matter to them when people point out their errors. Especially horrible when they're scientists or journalists, people who are paid to home in on the truth and have the public trust to do that.

A standard slam against profs like me is that we live in an ivory tower, and indeed my day-to-day life is far removed from the sort of Mametian reality, that give-and-take of fleshy wants and needs, that we associate with "real life." But, y'know, a true scholar cares about the details. Take care of the pennies and all that.


Mistaken identity

Someone I know sent me the following email:

The person XX [pseudonym redacted] who posts on your blog is almost certainly YY [name redacted]. So he is referencing his own work and trying to make it sound like it is a third party endorsing it. Not sure why but it bugs me. He is an ass as well so pretty much everything he does bugs me. . . .

OK, fair enough. I was curious, so I searched the blog archives for commenter XX. It turns out he was not a very frequent commenter, but he had a few comments, and he did refer to the work of YY. But I’m almost certain that XX is not YY. I’m no Sherlock Holmes when it comes to the internet, but I checked the URLs, and XX appears to be commenting from a different country than the location of YY. And, looking at the comments themselves, I can’t believe this is some elaborate attempt at deception.

No big deal. But it’s an interesting example of how it’s possible to be so sure of oneself and happen to be wrong.

New research in tuberculosis mapping and control

Mapping and control. Or, as we would say, descriptive and causal inference.

Jon Zelner informs us about two ongoing research projects:

1. TB Hotspot Mapping: Over the summer, I [Zelner] put together a really simple R package to do non-parametric disease mapping using the distance-based mapping approach developed by Caroline Jeffery and Al Ozonoff at Harvard. The package is available here. And I also mention it on my website.

We’ve been using this package to map hotspots of multidrug-resistant (MDR) TB and spatial clustering of specific TB strains in Lima. Here’s a paper from a few years ago by Justin Manjourides (he’s one of the coauthors on our new paper that’s currently under review) that used this approach to do something similar using administrative data from Lima and has some figures that do a good job of demonstrating the application of this approach to the MDR problem in Lima. I think it’s a pretty cool way to do this kind of mapping, and it has the virtue of being less sensitive to irregular point patterns than KDE and better able to deal with really large datasets than a GP smoother.

The package is pretty simple and easy to use, and there’s a tutorial in the GitHub repo. The only caveat is that the current version of the master branch only uses a fixed-bandwidth smoother, but I’m hoping to push a version with variable-bandwidth smoothing at some point (the version I have of that for my own work relies on a hacky combination of R and Julia and isn’t quite fit for public consumption yet; hoping to re-implement the slower bits using Rcpp for an R-only version soon).

Ummm . . . I’d say do all the damn smoothing in Stan and then you don’t have to be hacky at all.
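For readers who haven't seen this kind of method, here is a generic sketch of distance-based smoothing of case proportions. To be clear, this is my own illustrative fixed-bandwidth kernel smoother, not the Jeffery and Ozonoff method or the package's actual internals:

```python
# A generic fixed-bandwidth, distance-based smoother of case proportions.
# Illustrative only: not the Jeffery-Ozonoff method or the package internals.
import numpy as np

def smooth_risk(grid, points, is_case, bandwidth):
    """Gaussian-kernel-weighted case proportion at each grid location."""
    grid = np.asarray(grid, float)
    points = np.asarray(points, float)
    is_case = np.asarray(is_case, float)
    # pairwise squared distances, shape (n_grid, n_points)
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    return (w @ is_case) / w.sum(axis=1)

# Toy example: cases clustered near (0, 0), controls near (5, 5).
rng = np.random.default_rng(1)
cases = rng.normal(0, 1, size=(50, 2))
controls = rng.normal(5, 1, size=(50, 2))
pts = np.vstack([cases, controls])
lab = np.array([1] * 50 + [0] * 50)
risk = smooth_risk([[0, 0], [5, 5]], pts, lab, bandwidth=1.0)
print(risk)  # high near the case cluster, low near the control cluster
```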

And here’s Zelner’s other tuberculosis project:

2. Developing targeted interventions for TB control: In this group of analyses (the first is in this paper from the American Journal of Epidemiology), we used data from a large household-based cohort study in Lima, Peru, that my coauthors collected to estimate age-specific rates of TB from exposure in the community and to household cases. In this paper, we tried to tackle the question of whether screening for latent TB infection and providing preventive therapy to the household contacts of TB cases that presented at community health centers could be effective for individuals older than 5 (which is the current WHO cutoff for this type of screening and preventive therapy).

What we found is that it looks like there are a good number of new infections from this type of exposure up to about 15 years. And then a second paper of ours, which came out in the American Journal of Respiratory and Critical Care Medicine last year, suggested that preventive therapy provided to these younger individuals was very effective at preventing them from developing TB disease during the year following enrollment into the study, which is encouraging. What makes this analysis cool, I think, is how we used differences in the infectivity of different types of household index cases (the ones that showed up in the community health centers) to estimate the age-specific rates of community and household transmission. This kind of thing is relatively new in TB epidemiology, where we typically rely on more broad-brush kinds of policies.

I think there are also some selection issues that we tried to deal with, around what kind of households get what kind of index cases. But I’ve been thinking there’s probably a post-stratification-style solution to this issue that would be more elegant than what we did in the paper (basically a lot of sensitivity analysis). I’m actually working on a Stan-based extension to the AJE paper right now to see how robust our conclusions are to spatial variation in community infection rates and was hoping to ping you about a better way of tackling the potential selection problem at some point.

Poststratification good. Stan good.

This is not an application area that I know anything about but I wanted to share this interesting stuff with you.

How can teachers of (large) online classes use text data from online learners?


Dustin Tingley sends along a recent paper (coauthored with Justin Reich, Jetson Leder-Luis, Margaret Roberts, and Brandon Stewart), which begins:

Dealing with the vast quantities of text that students generate in a Massive Open Online Course (MOOC) is a daunting challenge. Computational tools are needed to help instructional teams uncover themes and patterns as MOOC students write in forums, assignments, and surveys. This paper introduces to the learning analytics community the Structural Topic Model, an approach to language processing that can (1) find syntactic patterns with semantic meaning in unstructured text, (2) identify variation in those patterns across covariates, and (3) uncover archetypal texts that exemplify the documents within a topical pattern. We show examples of computationally-aided discovery and reading in three MOOC settings: mapping students’ self-reported motivations, identifying themes in discussion forums, and uncovering patterns of feedback in course evaluations.

This sounds like it could be useful, especially if the data collection and analysis is all automatic. I’m sure the model will have a lot of problems—all models do—but that’s ok. The instructor could run this program, look at the results, see what makes sense, and see what doesn’t make sense. Ideally the program would come with some feedback options so that Reich et al., as developers of the software, can improve the model and make it more useful. Thus, a system with its own built-in mechanism for improvement. Perhaps my posting here can start that process going.

Comparison of Bayesian predictive methods for model selection

This post is by Aki

We mention the problem of bias induced by model selection in A survey of Bayesian predictive methods for model assessment, selection and comparison, in Understanding predictive information criteria for Bayesian models, and in BDA3 Chapter 7, but we haven’t had a good answer for how to avoid that problem (except by not selecting any single model, but instead integrating over all of them).

We (Juho Piironen and I) recently arXived a paper, Comparison of Bayesian predictive methods for model selection, which I can finally recommend as giving a useful practical answer for how to do model selection with greatly reduced bias and overfitting. We write:

The results show that the optimization of a utility estimate such as the cross-validation score is liable to finding overfitted models due to relatively high variance in the utility estimates when the data is scarce. Better and much less varying results are obtained by incorporating all the uncertainties into a full encompassing model and projecting this information onto the submodels. The reference model projection appears to outperform also the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that the model selection can greatly benefit from using cross-validation outside the searching process both for guiding the model size selection and assessing the predictive performance of the finally selected model.

Our experiments were made with Matlab, but we are working on Stan+R code, which should be available in a few weeks.
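The key point, that optimizing a noisy cross-validation score tends to select overfit models, can be seen in a toy simulation. This is my own illustration, not an experiment from the paper: suppose many candidate models are equally good, but each one's utility estimate is noisy. Picking the model with the best estimate then overstates the selected model's performance, and more so as the noise grows, as it does when data are scarce:

```python
# Toy demonstration of selection-induced optimism (illustrative only).
import numpy as np

def selection_optimism(noise_sd, n_models=100, n_reps=2000, seed=2):
    """Average estimated utility of the 'best' model, when every model's
    true utility is 0 and the estimates are independent noise."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(0.0, noise_sd, size=(n_reps, n_models))
    # Select the model with the highest estimated utility in each replication;
    # the selected score is biased upward relative to the true utility of 0.
    return estimates.max(axis=1).mean()

low_noise = selection_optimism(0.05)   # precise estimates (lots of data)
high_noise = selection_optimism(0.2)   # noisy estimates (scarce data)
print(low_noise, high_noise)  # both positive; the optimism grows with the noise
```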

Outside pissing in


Coral Davenport writes in the New York Times:

Mr. Tribe, 73, has been retained to represent Peabody Energy, the nation’s largest coal company, in its legal quest to block an Environmental Protection Agency regulation that would cut carbon dioxide emissions from the nation’s coal-fired power plants . . . Mr. Tribe likened the climate change policies of Mr. Obama to “burning the Constitution.”

But can we really trust a reporter with this name on the topic of global warming? Coral is, after all, on the front line of climate change risk.

Coral outcrop on Flynn Reef

So what’s happened to Laurence “ten-strike” Tribe?

Last we heard from him, he was asking Obama for a “newly created DOJ position dealing with the rule of law.” Maybe if Tribe had gotten the damn job, he’d keep it on the down low about the whole “burning the Constitution” thing.

The noted left-leaning Harvard Law professor has gone rogue! And, like our favorite rogue economist, seems to have become disillusioned with our Hawaiian-born leader.

The news article also contains this juicy bit:

In addition to the brief, Mr. Tribe wrote a lengthy public comment on the climate rules that Peabody submitted to the E.P.A. Mr. Tribe’s critics note that his comment, which he echoed in an op-ed article in The Wall Street Journal in December, includes several references to the virtues of coal, calling it “a bedrock component of our economy.”

The comment also has phrases frequently used by the coal industry. . . .

Laurence Tribe using phrases written by others! That could never happen, right?

P.S. I was curious so I googled *Laurence Tribe bedrock* which took me to a legal document with this wonderful phrase:

This bedrock principle, one familiar to anyone who has taken an elementary civics class in any halfway adequate high school . . .

Damn! I knew that was my problem. My high school didn’t offer a civics class.

If this case ever gets to the Supreme Court, I expect Tribe will have some difficulty explaining these concepts to Sotomayor. As he put it so eloquently in his job-seeking letter to the C-in-C, she’s not nearly as smart as she seems to think she is. Elementary civics might be a bit beyond her.

And . . . our featured 2015 seminar speaker is . . . Thomas HOBBES!!!!!


Just in case you’ve forgotten where this all came from:

This came in the departmental email awhile ago:

The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of Arts and Sciences are proud to present
You are invited to apply for a seminar led by Professor Bruno Latour on Tuesday, September 23, 12-3pm. Twenty-five graduate students from throughout the university will be selected to participate in this single seminar given by Prof. Latour. Students will organize themselves into a reading group to meet once or twice in early September for discussion of Prof. Latour’s work. They will then meet to continue this discussion with a small group of faculty on September 15, 12-2pm. Students and a few faculty will meet with Prof. Latour on September 23. A reading list will be distributed in advance.
If you are interested in this 3-4 session seminar (attendance at all 3-4 sessions is mandatory), please send
Your School:
Your Department:
Year you began your terminal degree at Columbia:
Thesis or Dissertation title or topic:
Name of main advisor:
In one short, concise paragraph tell us what major themes/keywords from Latour’s work are most relevant to your own work, and why you would benefit from this seminar. Please submit this information via the site

The due date for applications is August 11 and successful applicants will be notified in mid-August.

This is the first time I’ve heard of a speaker who’s so important that you have to apply to attend his seminar! And, don’t forget, “attendance at all 3-4 sessions is mandatory.”

At this point you’re probably wondering what exactly is it that Bruno Latour does. Don’t worry—I googled him for you. Here’s the description of his most recent book, “An Inquiry Into Modes of Existence”:

The result of a twenty five years inquiry, it offers a positive version to the question raised, only negatively, with the publication, in 1991, of “We have never been modern”: if “we” have never been modern, then what have “we” been? From what sort of values should “we” inherit? In order to answer this question, a research protocol has been developed that is very different from the actor-network theory. The question is no longer only to define “associations” and to follow networks in order to redefine the notion of “society” and “social” (as in “Reassembling the Social”) but to follow the different types of connectors that provide those networks with their specific tonalities. Those modes of extension, or modes of existence, account for the many differences between law, science, politics, and so on. This systematic effort for building a new philosophical anthropology offers a completely different view of what the “Moderns” have been and thus a very different basis for opening a comparative anthropology with the other collectives – at the time when they all have to cope with ecological crisis. Thanks to a European research council grant (2011-2014) the printed book will be associated with a very original purpose built digital platform allowing for the inquiry summed up in the book to be pursued and modified by interested readers who will act as co-inquirers and co-authors of the final results. With this major book, readers will finally understand what has led to so many apparently disconnected topics and see how the symmetric anthropology begun forty years ago can come to fruition.

Huh? I wonder if this is what they mean by “one short, concise paragraph” . . .

Update: We just got an announcement in the mail. The due date has been extended a second time, this time to Aug 18. This seems like a good sign: apparently fewer Columbia grad students than expected wanted to jump through the hoops to participate in this seminar.

The ultimate bracket

So . . . I had the idea that we could do better, and I gathered 64 potential speakers, eight current or historical figures from each of the following eight categories:
– Philosophers
– Religious Leaders
– Authors
– Artists
– Founders of Religions
– Cult Figures
– Comedians
– Modern French Intellectuals.

And Paul Davidson put them in a bracket, which, as of a few days ago, looked like this:

Bracket v1

And yesterday we had the final round, which was won by Hobbes based on this positive argument from X:

from “Hobbes’s State of Nature: A Modern Bayesian Game-Theoretic Analysis” by Hun Chung:

I personally think that applying game theory to political theory is misguided only when one tries to apply the wrong model; and, not all game-theoretic models are wrong. This is why I believe conserving the details of Hobbes’s logic is important. I believe that the model provided in this paper is the correct game-theoretic model that represents Hobbes’s state of nature in a way that Hobbes had originally intended it to be.

We need to know what Hobbes thinks of Chung’s Bayesian analysis!

And this negative argument from an anonymous commenter:

I think Dick is bowing out of the competition with this quote:

Probability, Joe said to himself. A science in itself. Bernoulli’s theorem, the Bayes-Laplace theorem, the Poisson Distribution, Negative Binomial Distribution…coins and cards and birthdays, and at last random variables. And, hanging over it all, the brooding specter of Rudolf Carnap and Hans Reichenbach, the Vienna Circle of philosophy and the rise of symbolic logic. A muddy world, in which he did not quite care to involve himself.

If Dick does not care to involve himself with probability, I don’t care to involve myself with him!

Best of all, was this comment from Jonathan:

[Hobbes] got off this scatological sally directed at Wallis, arguing the superiority of graphics to equations. (Note: Pappus was a 4th-century geometer who proved things with pictures.)

“When did you see any man but yourselves publish his Demonstrations by signs not generally received, except it were not with intent to demonstrate, but to teach the use of Signes? Had Pappus no Analytiques? Or wanted he the wit to shorten his reckoning by Signes? Or has he not proceeded Analytically in a hundred Problems (particularly in his seventh Book), and never used Symboles? Symboles are poor unhandsome (though necessary) scaffolds of Demonstration; and ought no more appear in publique, than the most deformed necessary business which you do in your Chambers.

Poop jokes and an argument that graphs are better than tables. Plus he’s a political scientist. Thomas Hobbes is my man.

What a great way to end our tournament, demonstrating that the earlier rounds were all worth it to lead up to this point.

Thank you all for participating!