
Three ways to present a probability forecast, and I only like one of them

To the nearest 10%:


To the nearest 1%:


To the nearest 0.1%:

I think the National Weather Service knows what they’re doing on this one.
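For concreteness, here is a minimal sketch of what the three presentations do to the same underlying number. The function name `format_forecast` and the probability 0.637 are made up for illustration; they are not taken from any forecaster's actual code or output.

```python
from math import floor


def format_forecast(p: float, step_percent: float) -> str:
    """Round probability p (between 0 and 1) to the nearest step_percent
    and return it as a percent string. Hypothetical helper for illustration."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    percent = p * 100
    # floor(x + 0.5) rounds halves up; Python's built-in round() would
    # send exact ties to the nearest even number instead.
    steps = floor(percent / step_percent + 0.5)
    rounded = steps * step_percent
    # Show one decimal place only when the step is finer than 1%.
    decimals = 0 if step_percent >= 1 else 1
    return f"{rounded:.{decimals}f}%"


if __name__ == "__main__":
    p = 0.637  # made-up forecast probability
    for step in (10, 1, 0.1):
        print(f"To the nearest {step}%: {format_forecast(p, step)}")
```

The same forecast comes out as 60%, 64%, or 63.7%, depending on how much precision you choose to claim.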

On deck this week

Mon: Three ways to present a probability forecast, and I only like one of them

Tues: Try a spaghetti plot

Wed: I ain’t got no watch and you keep asking me what time it is

Thurs: Some questions from our Ph.D. statistics qualifying exam

Fri: Solution to the helicopter design problem

Sat: Solution to the problem on the distribution of p-values

Sun: Solution to the sample-allocation problem

“Your Paper Makes SSRN Top Ten List”

I received the following email from the Social Science Research Network, which is a (legitimate) preprint server for research papers:

Dear Andrew Gelman:

Your paper, “WHY HIGH-ORDER POLYNOMIALS SHOULD NOT BE USED IN REGRESSION DISCONTINUITY DESIGNS”, was recently listed on SSRN’s Top Ten download list for: PSN: Econometrics, Polimetrics, & Statistics (Topic) and Political Methods: Quantitative Methods eJournal.

As of 02 September 2014, your paper has been downloaded 17 times. You may view the abstract and download statistics at:

Top Ten Lists are updated on a daily basis. . . .

The paper (with Guido Imbens) is here.

What amused me, though, was how low the number was. 17 downloads isn’t so many. I guess it doesn’t take much to be in the top 10!

Hoe noem je?

Haynes Goddard writes:

Reviewing my notes and books on categorical data analysis, the term “nominal” is widely employed to refer to variables without any natural ordering. I was a language major in UG school and knew that the etymology of nominal is the Latin word nomen (from the Online Etymological Dictionary: early 15c., “pertaining to nouns,” from Latin nominalis “pertaining to a name or names,” from nomen (genitive nominis) “name,” cognate with Old English nama (see name (n.)). Meaning “of the nature of names” (in distinction to things) is from 1610s. Meaning “being so in name only” first recorded 1620s.)

So variables without a natural order such as gender (male-female), transport mode (walk, bicycle, bus, train, car) and so on are just coded 0, 1 and so on. Yet the textbook writers do not explain that nominal just means name which it seems to me would help the students better understand the application.

Do you know when this usage was first introduced into statistics?

I have no idea but maybe you, the readers, can offer some insight?

How do companies use Bayesian methods?

Jason May writes:

I’m in Northwestern’s Predictive Analytics grad program. I’m working on a project providing Case Studies of how companies use certain analytic processes and want to use Bayesian Analysis as my focus.

The problem: I can find tons of work on how one might apply Bayesian Statistics to different industries but very little on how companies actually do so except as blurbs in larger pieces.

I was wondering if you might have ideas of where to look for cases of real life companies using Bayesian principles as an overall strategy.

Some examples that come to mind are pharmaceutical companies that use hierarchical pharmacokinetic/pharmacodynamic modeling, as well as people on the Stan users list who are using Bayes in various business settings. And I know that some companies do formal decision analysis which I think is typically done in a Bayesian framework. And I’ve given some short courses at companies, which implies that they’re interested in Bayesian methods, though I don’t really know if they ended up following my particular recommendations.

Perhaps readers can supply other examples?

Prediction Market Project for the Reproducibility of Psychological Science

Anna Dreber Almenberg writes:

The second prediction market project for the reproducibility project will soon be up and running – please participate!

There will be around 25 prediction markets, each representing a particular study that is currently being replicated. Each study (and thus market) can be summarized by a key hypothesis that is being tested, which you will get to bet on.

In each market that you participate, you will bet on a binary outcome: whether the effect in the replication study is in the same direction as the original study, and is statistically significant with a p-value smaller than 0.05.

Everybody is eligible to participate in the prediction markets: it is open to all members of the Open Science Collaboration discussion group – you do not need to be part of a replication for the Reproducibility Project. However, you cannot bet on your own replications.

Each study/market will have a prospectus with all available information so that you can make informed decisions.

The prediction markets are subsidized. All participants will get about $50 on their prediction account to trade with. How much money you make depends on how you bet on different hypotheses (on average participants will earn about $50 on a Mastercard (or the equivalent) gift card that can be used anywhere Mastercard is used).

The prediction markets will open on October 21, 2014 and close on November 4.

If you are willing to participate in the prediction markets, please send an email to Siri Isaksson by October 19 and we will set up an account for you. Before we open up the prediction markets, we will send you a short survey.

The prediction markets are run in collaboration with Consensus Point.

If you have any questions, please do not hesitate to email Siri Isaksson.

Statistical Communication and Graphics Manifesto

Statistical communication includes graphing data and fitted models, programming, writing for specialized and general audiences, lecturing, working with students, and combining words and pictures in different ways.

The common theme of all these interactions is that we need to consider our statistical tools in the context of our goals.

Communication is not just about conveying prepared ideas to others: often our most important audience is ourselves, and the same principles that suggest good ways of communication with others also apply to the methods we use to learn from data.

See also the description of my statistical communication and graphics course, where we try to implement the above principles.

[I'll be regularly updating this post, which I sketched out (with the help of the students in my statistical communication and graphics course this semester) and put here so we can link to it from the official course description.]

My course on Statistical Communication and Graphics


We will study and practice many different aspects of statistical communication, including graphing data and fitted models, programming in Rrrrrrrr, writing for specialized and general audiences, lecturing, working with students and colleagues, and combining words and pictures in different ways.

You learn by doing: each week we have two classes that are full of student participation, and before each class you have a pile of readings, a homework assignment, and jitts.

You learn by teaching: you spend a lot of time in class explaining things to your neighbor.

You learn by collaborating: you’ll do a team project which you’ll present at the end of the semester.

The course will take a lot of effort on your part, effort which should be aligned with your own research and professional goals. And you will get the opportunity to ask questions of guest stars who will illustrate diverse perspectives in statistical communication and graphics.

See also the statistical communication and graphics manifesto.

[I'll be regularly updating this post, which I sketched out (with the help of the students in my statistical communication and graphics course this semester) and put here so we can link to it from the official course description.]

The Fault in Our Stars: It’s even worse than they say


In our recent discussion of publication bias, a commenter linked to a recent paper, “Star Wars: The Empirics Strike Back,” by Abel Brodeur, Mathias Le, Marc Sangnier, and Yanos Zylberberg, who point to the notorious overrepresentation in scientific publications of p-values that are just below 0.05 (that is, just barely statistically significant at the conventional level) and the corresponding underrepresentation of p-values that are just above the 0.05 cutoff.

Brodeur et al. correctly (in my view) attribute this pattern not just to selection (the much-talked-about “file drawer”) but also to data-contingent analyses (what Simmons, Nelson, and Simonsohn call “p-hacking” and what Loken and I call “the garden of forking paths”). They write:

We have identified a misallocation in the distribution of the test statistics in some of the most respected academic journals in economics. Our analysis suggests that the pattern of this misallocation is consistent with what we dubbed an inflation bias: researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We have also quantified this inflation bias: among the tests that are marginally significant, 10% to 20% are misreported.

They continue with “These figures are likely to be lower bounds of the true misallocation as we use very conservative collecting and estimating processes”—but I would go much further. One way to put it is that there are (at least) three selection processes going on here:

1. (“the file drawer”) Significant results (traditionally presented in a table with asterisks or “stars,” hence the photo above) are more likely to get published.

2. (“inflation”) Near-significant results get jiggled a bit until they fall into the box.

3. (“the garden of forking paths”) The direction of an analysis is continually adjusted in light of the data.

Brodeur et al. point out that item 1 doesn’t tell the whole story, and they come up with an analysis (featuring a “lemma” and a “corollary”!) explaining things based on item 2. But I think item 3 is important too.

The point is that the analysis is a moving target. Or, to put it another way, there’s a one-to-many mapping from scientific theories to statistical analyses.

So I’m wary of any general model explaining scientific publication based on a fixed set of findings that are then selected or altered. In many research projects, there is either no baseline analysis or else the final analysis is so far away from the starting point that the concept of a baseline is not so relevant.

Although maybe things are different in certain branches of economics, in that people are arguing over an agreed-upon set of research questions.

P.S. I only wish I’d known about these people when I was still in Paris; we could’ve met and talked.

I didn’t say that! Part 2

Uh oh, this is getting kinda embarrassing.

The Garden of Forking Paths paper, by Eric Loken and myself, just appeared in American Scientist. Here’s our manuscript version (“The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time”), and here’s the final, trimmed and edited version (“The Statistical Crisis in Science”) that came out in the magazine.

Russ Lyons read the published version and noticed the following sentence, actually the second sentence of the article:

Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.

How horrible! Russ correctly noted that the above statement is completely wrong, on two counts:

1. To the extent the p-value measures “confidence” at all, it would be confidence in the null hypothesis, not confidence in the data.

2. In any case, the p-value is not not not not not “the probability that a perceived result is actually the result of random variation.” The p-value is the probability of seeing something at least as extreme as the data, if the model (in statistics jargon, the “null hypothesis”) were true.
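To make the correct definition concrete, here is a minimal sketch using a toy example that is not in our paper: 60 heads in 100 coin flips, with “the coin is fair” as the null hypothesis. The function name and the numbers are assumptions chosen for illustration. The calculation counts how much probability the null model puts on outcomes at least as far from 50 heads as the one observed.

```python
from math import comb


def two_sided_binomial_p_value(heads: int, flips: int) -> float:
    """Probability, if the coin is fair, of a head count at least as far
    from flips/2 as the observed count. Hypothetical helper for illustration."""
    observed_distance = abs(heads - flips / 2)
    # Under the null, each of the 2**flips sequences is equally likely,
    # so count the sequences whose head count is at least as extreme.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme_sequences / 2 ** flips


if __name__ == "__main__":
    p = two_sided_binomial_p_value(60, 100)
    print(f"p-value for 60 heads in 100 flips of a fair coin: {p:.4f}")
```

Note what goes into the calculation: only the null model and the observed count. Nothing here is the probability that the coin is fair, or the probability that the result is “the result of random variation”; getting there would require a prior and an alternative model.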

How did this happen?

The editors at American Scientist liked our manuscript but it was too long, and parts of it needed explaining for a nontechnical audience. So they cleaned up our article and added bits here and there. This is standard practice at magazines. It’s not just Raymond Carver and Gordon Lish.

Then they sent us the revised version and asked us to take a look. They didn’t give us much time. That too is standard with magazines. They have production schedules.

We went through the revised manuscript but not carefully enough. Really not carefully enough, given that we missed a glaring mistake—two glaring mistakes—in the very first paragraph of the article.

This is ultimately not the fault of the editors. The paper is our responsibility and it’s our fault for not checking the paper line by line. If it was worth writing and worth publishing, it was worth checking.

P.S. Russ also points out that the examples in our paper all are pretty silly and not of great practical importance, and he wouldn’t want readers of our article to get the impression that “the garden of forking paths” is only an issue in silly studies.

That’s a good point. The problems of nonreplication etc affect all sorts of science involving human variation. For example there is a lot of controversy about something called “stereotype threat,” a phenomenon that is important if real. For another example, these problems have arisen in studies of early childhood intervention and the effects of air pollution. I’ve mentioned all these examples in talks I’ve given on this general subject, they just didn’t happen to make it into this particular paper. I agree that our paper would’ve been stronger had we mentioned some of these unquestionably important examples.