Just in case

Hi, R. Could you please prepare 50 handouts of the attached draft course plan (2-sided printing is fine) to hand out to students? I prefer to do this online but it sounds like there’s some difficulty with that, so we can do handouts on this first day of class.


My Amtrak train has been rescheduled and is now due to arrive in Boston at 4:35. This should leave me plenty of time to get to class, but Amtrak is sometimes delayed. So if class begins and I am not there yet, please start without me!

If I’m not there, please do the following:

- Get to the room 10 minutes early. Before class begins, chat with the students as they are coming in. You can talk about any topic, as long as it’s statistical: tell them about your qualifying exam, or discuss how to express uncertainty in weather forecasts, or talk about the Celtics (ha ha). No need to be lecturing here, just get them on track, thinking and talking about statistics. Also during that time, please get the projector set up so that, when I do arrive, I can plug in my laptop and be all ready to go.

- Once class begins (I don’t remember the convention at Harvard; will it start exactly at the scheduled time, or 5 minutes later?), start right away with a statistics story. I have stories of my own prepared, but if I’m not there, you can do one yourself. Prepare something; feel free to use the blackboard. It doesn’t have to be a long story; 5 or 10 minutes will be fine.

- Then write the following on the blackboard: “(a) Say something about yourself or your work in relation to statistics, (b) Why are you in this class?”

- Have the students divide into pairs. In pairs, they meet each other:
(3 min) A talks to B
(2 min) B asks a question to A, and A responds
(3 min) B talks to A
(2 min) A asks a question to B, and B responds
They are supposed to be talking to each other about their work in relation to statistics.

- If not all the students fit in the room, that’s not really a problem; you can have the overflow people in the lounge area, doing the same thing.

Once the students have done the intros in pairs, take a few volunteers (or, if there are no volunteers, pick some students and ask them to pick other students) to stand up and answer questions (a) and (b) above. Use these to lead the class into discussions that loop around to consider the relevance and different varieties of statistical communication.

Really, this can take all the class period. But I assume that at some point I’ll arrive—how delayed could Amtrak be, after all?? I just wanted to give you some contingency plan so that nobody has to worry if it’s 6:25 and I’m still not there.


See you

About a zillion people pointed me to yesterday’s xkcd cartoon

I have the same problem with Bayes factors, for example this:

[Image: Screen Shot 2015-01-27 at 4.42.52 PM]

and this:

[Image: Screen Shot 2015-01-27 at 4.45.03 PM]

(which I copied from Wikipedia, except that, unlike you-know-who, I didn’t change the n’s to d’s and remove the superscripting).

Either way, I don’t buy the numbers, and I certainly don’t buy the words that go with them.

I do admit, though, to using the phrase “statistically significant.” It doesn’t mean so much, but, within statistics, everyone knows what it means, so it’s convenient jargon.

Crowdsourcing data analysis: Do soccer referees give more red cards to dark skin toned players?

Raphael Silberzahn, Eric Luis Uhlmann, Dan Martin, Pasquale Anselmi, Frederik Aust, Eli Christopher Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, Rickard Carlsson, Felix Cheung, Garret Christensen, Russ Clay, Maureen A. Craig, Anna Dalla Rosa, Lammertjan Dam, Mathew H. Evans, Ismael Flores Cervantes, Nathan Fong, Monica Gamez-Djokic, Andreas Glenz, Shauna Gordon-McKeon, Tim Heaton, Karin Hederos Eriksson, Moritz Heene, Alicia Hofelich Mohr, Fabia Högden, Kent Hui, Magnus Johannesson, Jonathan Kalodimos, Erikson Kaszubowski, Deanna Kennedy, Ryan Lei, Thomas Andrew Lindsay, Silvia Liverani, Christopher Madan, Daniel Molden, Eric Molleman, Richard D. Morey, Laetitia Mulder, Bernard A. Nijstad, Bryson Pope, Nolan Pope, Jason M. Prenoveau, Floor Rink, Egidio Robusto, Hadiya Roderique, Anna Sandberg, Elmar Schlueter, Felix S, Martin Sherman, S. Amy Sommer, Kristin Lee Sotak, Seth Spain, Christoph Spörlein, Tom Stafford, Luca Stefanutti, Susanne Täuber, Johannes Ullrich, Michelangelo Vianello, Eric-Jan Wagenmakers, Maciej Witkowiak, SangSuk Yoon, and Brian A. Nosek write:

Twenty-nine teams involving 61 analysts used the same data set to address the same research questions: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players and whether this relation is moderated by measures of explicit and implicit bias in the referees’ country of origin. Analytic approaches varied widely across teams. For the main research question, estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. Twenty teams (69%) found a significant positive effect and nine teams (31%) observed a nonsignificant relationship. The causal relationship however remains unclear. No team found a significant moderation between measures of bias of referees’ country of origin and red card sanctionings of dark skin toned players. Crowdsourcing data analysis highlights the contingency of results on choices of analytic strategy, and increases identification of bias and error in data and analysis. Crowdsourcing analytics represents a new way of doing science; a data set is made publicly available and scientists at first analyze separately and then work together to reach a conclusion while making subjectivity and ambiguity transparent.

“It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

I recently bumped into this 2013 paper by Christian Robert and myself, “‘Not Only Defended But Also Applied’: The Perceived Absurdity of Bayesian Inference,” which begins:

Younger readers of this journal may not be fully aware of the passionate battles over Bayesian inference among statisticians in the last half of the twentieth century. During this period, the missionary zeal of many Bayesians was matched, in the other direction, by a view among some theoreticians that Bayesian methods are absurd—not merely misguided but obviously wrong in principle. Such anti-Bayesianism could hardly be maintained in the present era, given the many recent practical successes of Bayesian methods. But by examining the historical background of these beliefs, we may gain some insight into the statistical debates of today. . . .

The whole article is just great. I love reading my old stuff!

Also we were lucky to get several thoughtful discussions:

“Bayesian Inference: The Rodney Dangerfield of Statistics?” — Steve Stigler

“Bayesian Ideas Reemerged in the 1950s” — Steve Fienberg

“Bayesian Statistics in the Twenty First Century” — Wes Johnson

“Bayesian Methods: Applied? Yes. Philosophical Defense? In Flux” — Deborah Mayo

And our rejoinder, “The Anti-Bayesian Moment and Its Passing.”

Good stuff.

“The Statistical Crisis in Science”: My talk this Thurs at the Harvard psychology department

Noon Thursday, January 29, 2015, in William James Hall 765 room 1:

The Statistical Crisis in Science

Andrew Gelman, Dept of Statistics and Dept of Political Science, Columbia University

Top journals in psychology routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims, that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or embodied cognition, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

Here are the slides from the last time I gave this talk, and here are some relevant articles:

[2014] Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641–651. (Andrew Gelman and John Carlin)

[2014] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management. (Andrew Gelman)

[2013] It’s too hard to publish criticisms and obtain data for replication. Chance 26 (3), 49–52. (Andrew Gelman)

[2012] P-values and statistical practice. Epidemiology. (Andrew Gelman)

The (hypothetical) phase diagram of a statistical or computational method

[Figure: phase_diagram_sketch]

So here’s the deal. You have a new idea, call it method C, and you try it out on problems X, Y, and Z and it works well—it destroys the existing methods A and B. And then you publish a paper with the pithy title, Method C Wins. And, hey, since we’re fantasizing here anyway, let’s say you want to publish the paper in PPNAS.

But reviewers will—and should—have some suspicions. How great can your new method really be? Can it really be that methods A and B, which are so popular, have nothing to offer anymore?

Instead of just declaring victory, give a sense of the bounds of your method. Under what conditions does it win, and under what conditions does it not work so well?

In the graph above, “Dimension 1” and “Dimension 2” can be anything: they could be sample size and number of parameters, or computing time and storage cost, or bias and sampling error, whatever. The point is that a method can be applied under varying conditions. And, if a method is great, what that really means is that it works well under a wide range of conditions.

So, make that phase diagram. Even if you don’t actually draw the graph or even explicitly construct a definition of “best,” you can keep in mind the idea of exploring the limitations of your method, coming up with places where it doesn’t perform so well.
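
Here is a minimal sketch of what building such a phase diagram could look like in Python. Everything specific in it is an assumption made for illustration, not something from the post: "method A" is the sample mean, "method C" is the sample median, Dimension 1 is sample size, Dimension 2 is the fraction of wild outliers in the data, and "best" means lowest mean squared error.

```python
import random
import statistics

# Hypothetical stand-ins (not from the post): "method A" is the sample mean,
# "method C" is the sample median; both estimate a true center of 0.
METHODS = {"A": statistics.mean, "C": statistics.median}


def draw_sample(n, outlier_frac, rng):
    """n draws centered at 0; a fraction of them come from a much wider normal."""
    return [
        rng.gauss(0.0, 10.0 if rng.random() < outlier_frac else 1.0)
        for _ in range(n)
    ]


def phase_diagram(sample_sizes, outlier_fracs, reps=400, seed=1):
    """Map each (sample size, outlier fraction) cell to the method with lowest MSE."""
    rng = random.Random(seed)
    winners = {}
    for n in sample_sizes:
        for frac in outlier_fracs:
            sq_err = {name: 0.0 for name in METHODS}
            for _ in range(reps):
                sample = draw_sample(n, frac, rng)  # same data for every method
                for name, estimator in METHODS.items():
                    sq_err[name] += estimator(sample) ** 2
            winners[(n, frac)] = min(sq_err, key=sq_err.get)
    return winners


if __name__ == "__main__":
    sizes, fracs = [10, 40, 160], [0.0, 0.1, 0.3]
    diagram = phase_diagram(sizes, fracs)
    print("n \\ outlier fraction: " + "  ".join(str(f) for f in fracs))
    for n in sizes:
        print(f"{n:>4}  " + "    ".join(diagram[(n, f)] for f in fracs))
```

In this toy setup the mean should tend to win where the data are clean and the median where outliers are common, so neither method "wins" everywhere. The boundary between the two regions is the interesting part of the diagram.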

On deck this week

Mon: The (hypothetical) phase diagram of a statistical or computational method

Tues: “It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

Wed: Six quick tips to improve your regression modeling

Thurs: “Another bad chart for you to criticize”

Fri: Cognitive vs. behavioral in psychology, economics, and political science

Sat: Economics/sociology phrase book

Sun: Oh, it’s so frustrating when you’re trying to help someone out, and then you realize you’re dealing with a snake

Tell me what you don’t know

We’ll ask an expert, or even a student, to “tell me what you know” about some topic. But now I’m thinking it makes more sense to ask people to tell us what they don’t know.

Why? Consider your understanding of a particular topic to be divided into three parts:
1. What you know.
2. What you don’t know.
3. What you don’t know you don’t know.

If you ask someone about 1, you get some sense of the boundary between 1 and 2.

But if you ask someone about 2, you implicitly get a lot of 1, you get a sense of the boundary between 1 and 2, and you get a sense of the boundary between 2 and 3.

As my very rational friend Ginger says: More information is good.

Postdoc opportunity here, with us (Jennifer Hill, Marc Scott, and me)! On quantitative education research!!

Hop the Q-TRAIN: that is, the Quantitative Training Program, a postdoctoral research program supervised by Jennifer Hill, Marc Scott, and myself, and funded by the Institute of Education Sciences.

As many of you are aware, education research is both important and challenging. And, on the technical level, we’re working on problems in Bayesian inference, multilevel modeling, survey research, and causal inference.

There are various ways that you can contribute as a postdoc: You can have a PhD in psychometrics or education research, and this is your chance to go in depth with statistical inference and computation, or maybe you can do all sorts of Bayesian computation and you’d like to move into education research. We’re looking for top people to join our team.

If you’re interested, send me an email with a letter describing your qualifications and reason for applying, a C.V., and at least one article you’ve written, and have three letters of recommendation sent to me. All three of us (Jennifer, Marc, and I) will evaluate the applications.

We have openings for two 2-year postdocs. As per federal government regulations, candidates must be United States citizens or permanent residents.

“What then should we teach about hypothesis testing?”

Someone who wishes to remain anonymous writes in:

Last week, I was looking forward to a blog post titled “Why continue to teach and use hypothesis testing?” I presume that this scheduled post merely became preempted by more timely posts. But I am still interested in reading the exchange that will follow.

My feeling is that we might have strong reservations about the utility of NHST [null hypothesis significance testing], but realize that they aren’t going away anytime soon. So it is important for students to understand what information other folks are trying to convey when they report their p-values, even if we would like to encourage them to use other frameworks (e.g. a fully Bayesian decision theoretic approach) in their own decision making.

So I guess the next question is, what then should we teach about hypothesis testing? What proportion of the time in a one semester upper level course in Mathematical Statistics should be spent on the theory and how much should be spent on the nuance and warnings about misapplication of the theory? These are questions I’d be interested to hear opinions about from you and your thoughtful readership.

A related question I have is on the “garden of forking paths” or “researcher degrees of freedom”. In applied research, do you think that “tainted” p-values are the norm, and that editors, referees, and readers basically assume some level of impurity of reported p-values?

I wonder, because, if applied statistics textbooks are any guide, the first recommendation in a data analysis often seems to be: plot your data. And I suspect that many folks might do this *before* settling in on the model they are going to fit. For example, if they see nonlinearity, they will then consider a transformation that they wouldn’t have considered before. So whether they make the transformation or not, they might have, thus affecting the interpretability of p-values and whatnot. Perhaps I am being an extremist. Pre-registration, replication studies, or simply splitting a data set into training and testing sets may solve this problem, of course.

So to tie these two questions together, shouldn’t our textbooks do a better job in this regard, perhaps in making clear a distinction between two types of statistical analysis: a data analysis, which is intended to elicit the questions and perhaps build a model, and a confirmatory analysis which is the “pure” estimation and prediction from a pre-registered model, from which a p-value might retain some of its true meaning?

My reply: I’ve been thinking about this a lot recently because Eric Loken, Ben Goodrich, and I have been designing an introductory statistics course, and we have to address these issues. One way I’ve been thinking about it is that statistical significance is more of a negative than a positive property:

Traditionally we say: If we find statistical significance, we’ve learned something, but if a comparison is not statistically significant, we can’t say much. (We can “reject” but not “accept” a hypothesis.)

But I’d like to flip it around and say: If we see something statistically significant (in a non-preregistered study), we can’t say much, because garden of forking paths. But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal, and that can be important.
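
To see why a significant result from a non-preregistered study says so little, here is a rough simulation sketch in Python. It is a crude stand-in for the garden of forking paths, under assumptions that are not in the post: there is no true effect at all, the researcher has `n_paths` independent comparisons available (a hypothetical parameter), each is a two-sided z-test with known unit variance, and only the smallest p-value gets reported. Real forking paths are subtler, since the analyst need not actually run more than one test, but the arithmetic points the same way.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming known unit variance."""
    n_a, n_b = len(group_a), len(group_b)
    z = (mean(group_a) - mean(group_b)) / (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significance_rate(n_paths, n_per_group=20, reps=2000, alpha=0.05, seed=1):
    """Share of pure-noise studies whose best available comparison has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p_values = []
        for _ in range(n_paths):
            # Both groups come from the same distribution: the null is true.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p_value(a, b))
        if min(p_values) < alpha:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(f"{k:>2} available comparisons: rate of p < 0.05 = {significance_rate(k):.3f}")
```

With a single prespecified comparison the rate should sit near the nominal 0.05. With ten independent comparisons the theoretical rate is 1 - 0.95^10, about 0.40, even though nothing is going on in the data.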