
Stop screaming already: Exaggeration of effects of fan distraction in NCAA basketball


John Ezekowitz writes:

I have been reading your work on published effect sizes, and I thought you might be interested in this example, which is of small consequence but grates on me as a basketball and data fan. Kevin Quealy and Justin Wolfers published an analysis in The NYT on fans’ effectiveness in causing road teams to shoot worse from the free throw line in college basketball.

In the piece, Wolfers notes that players shoot better at home than on the road, but then compares “fan effectiveness” by looking at how much worse opponents shoot at a given arena vs. their home arena. I think it is pretty clear that the correct comparison is opponents’ road FT shooting, not their home shooting.

When I asked him about this, he admitted that the road vs. road effect was smaller. It looks like he just picked home vs. road because he could show a bigger “effect size.” This feels symptomatic of the larger problems you have continued to highlight on your blog.

By the way, not sure if you watch any Columbia basketball, but Maodo Lo can really, really play.

Good point (not about Maodo Lo, about whom I have no idea, but regarding the NYT article). The goal of newsworthiness can get in the way of clear communication.

Specifically, Quealy and Wolfers wrote:

On average, college basketball players are about one percentage point less likely to make a free throw when in front of a hostile crowd than when at home. . . . On average, the sixth man’s ability to distract opposing free throwers is worth about 0.2 points per game.

Hmmm, 0.2 points a game is pretty irrelevant anyway. But they get estimates of over 1 point per game for a few teams, most strikingly Arizona State and Northwestern.

Or is it just (or mostly) noise? Quealy and Wolfers write:

Some of the tremendous variation among teams may reflect statistical noise, given that we’re evaluating only five seasons’ worth of data. But that’s still enough to suggest that the overall patterns are real.

They provide no quantitative evidence for this claim. All they give is this graph:

[Graph from the article]

This graph looks consistent with a small difference attributable to home-court advantage (recall Ezekowitz’s point), but I see no evidence, from this graph alone, that the differences between stadiums are real. I just don’t know.

Quealy and Wolfers write:

There are also a handful of arenas where visiting teams have actually hit a greater share of free throws than they typically do in front of their home fans. Boston College and Notre Dame are two prominent examples. It’s unfair to suggest that these fans actually hurt their team; rather, it’s more likely that they were of little or no help, and random luck means that visitors hit a few extra free throws.

Whoa baby. Hold up right there. First, according to the graph, it’s not “a handful” of teams, it’s about 110 of them. Second, that’s fine to credit these patterns to random luck. But then shouldn’t you also be considering random luck as an explanation for the success of certain teams?

And what’s with this sort of data dredging:

Duke’s Cameron Crazies are among the most famous fan groups in any sport in the country. And to some extent, they live up to their hype. Our data ranks them as one of the more distracting teams in the nation, although they’re outside our top 10. It could be that they’re actually better than that, and that their numbers will improve with more seasons of data. Or perhaps their creativity does not match their intensity.

Here’s the bad news for Duke fans: Their main rivals, the fans in Chapel Hill, have them slightly beaten here. North Carolina’s fans help the Tar Heels to the tune of about two-thirds of a point per game, relative to a typical home crowd.

This is getting ridiculous. These guys could give a story to coin flips.

What’s really needed here is a hierarchical model. Or, simpler than that, let’s just try computing these summaries for each arena in each season, and see whether the arenas with these free-throw patterns in season 1 also show the patterns in season 2. At its simplest, if the differences between arenas are all noise, the year-to-year correlation between these results will be essentially zero. The next step is to fit a hierarchical model with arena effects and arena*year interactions.
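To see what the pure-noise scenario looks like, here is a quick simulation sketch of my own (the attempt counts and shooting percentage are made-up round numbers, not taken from the article): every arena has the same true opponent free-throw percentage, and we check the year-to-year correlation of the observed arena summaries.

```python
import numpy as np

rng = np.random.default_rng(42)

n_arenas, n_seasons = 350, 2
attempts = 400   # opponent free-throw attempts per arena per season (assumed)
p = 0.70         # identical true FT% for every arena: a pure-noise world

# Observed opponent FT% per arena per season, with no real arena effects
makes = rng.binomial(attempts, p, size=(n_arenas, n_seasons))
ft_pct = makes / attempts

# If between-arena differences are all noise, season 1 and season 2
# summaries should be essentially uncorrelated
r = np.corrcoef(ft_pct[:, 0], ft_pct[:, 1])[0, 1]
print(f"year-to-year correlation under pure noise: {r:.3f}")
```

A correlation near zero in the real data would support the noise explanation; a substantial positive correlation would suggest real arena effects worth modeling hierarchically.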

On deck this week

Mon: Stop screaming already: Exaggeration of effects of fan distraction in NCAA basketball

Tues: My job here is done

Wed: The tabloids strike again

Thurs: Econometrics: Instrument locally, extrapolate globally

Fri: I wish Napoleon Bonaparte had never been born

Sat: This is a workshop you can’t miss: DataMeetsViz

Sun: You won’t believe these stunning transformations: How to parameterize hyperpriors in hierarchical models?

2 new thoughts on Cauchy priors for logistic regression coefficients

Aki noticed this paper, On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression, by Joyee Ghosh, Yingbo Li, and Robin Mitra, which begins:

In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove two theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. For full Bayesian inference, we develop a Gibbs sampler based on Polya-Gamma data augmentation . . .

It’s good to see research on this. Statistics is the science of defaults, and an important part of statistical theory at its best is the study of how defaults work on a range of problems. It’s a good idea to study the frequency properties of statistical methods—any methods, including Bayesian methods.

I have not read through the paper, but based on the above abstract I have two quick comments:

1. We no longer recommend Cauchy as our first-choice default. Cauchy can be fine as a weakly informative prior, but in the recent applications I’ve seen, I’m not really expecting to get huge coefficients, and so a stronger prior such as normal(0,1) can often make sense. See, for example, section 3 of this recent paper. I guess I’m saying that, even for default priors, I recommend a bit of thought into the expected scale of the parameters.
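As a rough illustration of why a weakly informative prior handles separation gracefully, here is a sketch with my own toy data (not from the paper): the predictor perfectly separates the outcomes, so the maximum likelihood estimate is infinite, but the posterior mode under a normal(0,1) prior is finite and moderate.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with complete separation: x < 0 -> y = 0, x > 0 -> y = 1,
# so the MLE of the slope does not exist (it runs off to infinity).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def neg_log_posterior(beta, prior_sd=1.0):
    eta = beta[0] + beta[1] * x
    # logistic regression log-likelihood, written stably
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    log_prior = -0.5 * np.sum(beta**2) / prior_sd**2  # normal(0, prior_sd) prior
    return -(log_lik + log_prior)

fit = minimize(neg_log_posterior, x0=np.zeros(2), method="BFGS")
print("posterior mode under normal(0,1) prior:", fit.x)
```

The slope lands at a modest finite value, which is the point: if you do not expect huge coefficients, the prior says so and the estimate behaves accordingly.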

2. I assume that any of the computations can be done in Stan, no need for all these Gibbs samplers. I’m actually surprised that anyone is writing Gibbs samplers anymore in 2015!

The Rachel Tanur Memorial Prize for Visual Sociology


Judy Tanur writes:

The Rachel Tanur Memorial Prize for Visual Sociology recognizes students in the social sciences who incorporate visual analysis in their work. The contest is open worldwide to undergraduate and graduate students (majoring in any social science). It is named for Rachel Dorothy Tanur (1958–2002), an urban planner and lawyer who cared deeply about people and their lives and was an acute observer of living conditions and human relationships.

The 2016 competition for the Rachel Tanur Memorial Prize for Visual Sociology is now accepting applications, with a deadline of January 25, 2016. Entries will be judged by members of the Visual Sociology Group (WG03) of the International Sociological Association (ISA). Up to three prizes will be awarded at the Third ISA Forum of Sociology, The Futures We Want: Global Sociology and the Struggles for a Better World, to be held in Vienna, Austria on July 10-14, 2016. Attendance at the forum is not a requirement but is encouraged. First prize is $2,500; second prize is $1,500; and third prize is $500.

For more information and to apply, go here.

Are you ready for some smashmouth FOOTBALL?



This story started for me three years ago with a pre-election article by Tyler Cowen and Kevin Grier entitled, “Will Ohio State’s football team decide who wins the White House?” Cowen and Grier wrote:

Economists Andrew Healy, Neil Malhotra, and Cecilia Mo . . . examined whether the outcomes of college football games on the eve of elections for presidents, senators, and governors affected the choices voters made. They found that a win by the local team, in the week before an election, raises the vote going to the incumbent by around 1.5 percentage points. When it comes to the 20 highest attendance teams—big athletic programs like the University of Michigan, Oklahoma, and Southern Cal—a victory on the eve of an election pushes the vote for the incumbent up by 3 percentage points.

Hey, that’s a big deal (and here’s the research paper with the evidence). As Cowen and Grier put it:

That’s a lot of votes, certainly more than the margin of victory in a tight race.

Upon careful examination, though, I concluded:

There are multiple games in multiple weeks in several states, each of which, according to the analysis, operates on the county level and would have at most a 0.2% effect in any state. So there’s no reason to believe that any single game would have a big effect, and any effects there are would be averaged over many games.

So I wasn’t so disturbed about the legitimacy of the democratic process. That said, it still seemed a little bit bothersome that football games were affecting election outcomes at all.
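To spell out the dilution arithmetic (the county share below is an assumed number, purely for illustration, not from the paper):

```python
# Illustrative arithmetic (assumed numbers, not from Healy et al.):
# a 1.5-percentage-point swing among a team's local supporters matters
# statewide only in proportion to their share of the state electorate.
county_effect = 1.5   # percentage points, among the team's local fans
county_share = 0.10   # assumed share of the state electorate near the team

statewide_effect = county_effect * county_share
print(f"statewide effect of one game: {statewide_effect:.2f} percentage points")
# And with several teams in a state, wins and losses partly cancel,
# shrinking the net effect further.
```

That is how a headline-grabbing 1.5 points per county shrinks to a fraction of a point per state, before any averaging over games.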

Hard tackle

The next chapter in the story came a couple days ago, when Anthony Fowler pointed me to a paper he wrote with B. Pablo Montagnes, arguing that this college football effect was nothing but a meaningless data pattern, the sort of thing that might end up getting published in a rag like Psychological Science:

We reassess the evidence and conclude that there is likely no such effect, despite the fact that Healy et al. followed the best practices in social science and used a credible research design. Multiple independent sources of evidence suggest that the original finding was spurious—reflecting bad luck for researchers rather than a shortcoming of American voters.

We might worry that this surprising result is a false positive, arising from some combination of multiple testing (within and across research teams), specification searching, and bad luck.

I discussed this all yesterday in the sister blog, concluding that voters aren’t as “irrational and emotional” as is sometimes claimed.

Short passes

Neil Malhotra, one of the authors of the original paper in question, pointed to two replies that he and his colleagues wrote (here and here).

And Anthony Fowler pointed to two responses (here and here) that he and his colleague wrote to the above-mentioned replies.

And in an email to Malhotra, I wrote: My quick thought is that the framework of “is it or is it not a false positive” is not so helpful, and I prefer thinking of all these effects as existing, but with lots of variation, so that one has to be careful about drawing general conclusions from a particular study.

Three yards and a crowd of dust

Where do the two parties stand now?

Malhotra writes:

I think the findings of the Fowler/Montagnes paper are incorrectly interpreted by the authors. Instead of a framing of “the original effect was a false positive,” I think a more accurate/appropriate framing is:

1. Some independent, separate tests conducted by Fowler/Montagnes are not consistent with the original study. Therefore, in a meta-analytic sense, the overall literature produces mixed results. Fowler/Montagnes’ results in no way mean that the original study was “wrong.” However, we argue that these independent tests are not appropriate to test the hypothesis that mood affects voting. For example, it’s not surprising to us that NFL football outcomes do not influence elections, since NFL teams are located in large metropolises where there are many competing sources of entertainment. On the other hand, the fate of the Oklahoma Sooners in Norman, OK, is the main event in the town. Further, single early-season games in the NFL are less important than later regular-season games in NCAA football. So the dosage of good/bad mood is much lower in the NFL study. Now readers may agree/disagree with me about the validity of the NFL test. But the important thing to realize is that the NFL study is a different, separate test. It doesn’t tell us anything about whether the original study is a “false positive” or is incorrect.

2. Some tests on sub-samples of the original dataset show that there is heterogeneity in the effect. For example, Fowler/Montagnes show that the effect does not seem to be there when voters appear to have more information (e.g., in open-seat races, and when there is partisan competition). This is very theoretically interesting (and cool) heterogeneity, and it definitely changes our interpretation of the original findings. However, these are not “replications” of the original result, and do not speak to whether the original results are false positives.

In sum, I am very open to criticisms of my research. I think this new paper definitely changes my opinion of the scope of the original findings. However, I do not think it is accurate to say that the original findings are incorrect or that the findings were obtained by “bad luck.” There are some auxiliary tests conducted by Fowler/Montagnes that either support our results (e.g., that geographically proximate locations outside the home county also respond to the team’s wins/losses) or don’t make much sense (e.g., Texas is a minor college football team?), but we will let interested readers weigh the evidence.

And here’s Fowler:

Of course, we wouldn’t conclude that the effect of college football games on elections is exactly zero, but our independent tests suggest that most likely the effect is substantively very small and Healy et al.’s original results were significant overestimates.

If their purported effects are genuine, we would expect them to vary in particular ways, but none of these independent tests are consistent with the notion that football games and subsequent mood influence elections. So by examining treatment effect heterogeneity in theoretically motivated ways, we reassess the credibility of the original result.


We’re getting closer. I’ll invoke the Edlin effect and say that I think the originally published estimates are indeed probably too high (and, as noted near the beginning of this post, even the effects as reported would have a much much more minor effect on elections than you might think based on a naive interpretation of the numerical estimates of direct causal effects). Based on my own statistical tastes, I’d prefer not to “test the hypothesis that mood affects voting” but instead to think about variation, and I like the part of Malhotra’s note that discusses this.

When it comes to the effects of mood on voting, I think that the point of studying things like football games is not that their political effects are large (that is, let’s ignore that original Slate article) but rather that sporting events have a big random component, so these games can be treated as a sort of natural experiment. As Fowler, Malhotra, and their colleagues all recognize, such analyses can be challenging, not so much because of the traditional “identification” problem familiar to statisticians, econometricians, and political scientists (although, yes, one does have to worry about such concerns), but rather because of the garden of forking paths, all the possible ways that one could chop up these data and put them back together again.

Two-minute warning

There is no two-minute warning here. There is no game that is about to end. Research on these topics will continue. I agree with both Malhotra and Fowler that the way to think about these problems is by wrestling with the details. And I do think these discussions are good, even if they can take on a slightly adversarial flavor. I’d like the mood-and-politics literature to stay grounded, to not end up like the Psychological-Science-style evolutionary psychology literature, where effects are all huge and where each new paper reports a new interaction. Now that we know that a subfield can be spun out of nothing, we should be careful to put research findings into context—something that both Malhotra and Fowler are trying to do, each in their own way.

To put it another way, they’re disagreeing about effect sizes and strength of evidence, but they both accept the larger principle that research studies don’t stand alone, and Healy et al. are open and accepting of criticism. Which one would think would be a given in science, but it’s not, so let’s appreciate how this is going.

And to get back to football for a moment: As a political scientist this is all important to me because, to the extent that things like sporting events sway people’s votes, that’s a problem—but, to the extent that such irrationalities are magnified by statistical studies and hyped by the news media, that’s a problem too, in that this can be used to disparage the democratic process.


Fowler saw the above and added:

Some of Neil’s comments reflect a misunderstanding of our paper. For example, we did not show that “the effect does not seem to be there when voters appear to have more information” and we never wrote anything along those lines. In the test he’s alluding to (Table 1, “By incumbent running”), we find that the purported effect of football games on incumbent party support is no greater when the incumbent actually runs for reelection. One of our predictions is that if football games meaningfully influence incumbent party support by affecting voter mood, we should expect a bigger effect when the incumbent is actually running. However, the interactive point estimate is actually negative and statistically insignificant, meaning that we don’t find variation in the direction one would expect if the effect is genuine, and if anything, the variation goes in the wrong direction.

Neil’s comments also sound to us like ex-post rationalization. He argues that we shouldn’t expect NFL games to influence local mood in the same way that college football games do, but the local television ratings suggest the opposite. In another interview (here), Neil justified the notion that football games influence mood by citing Card and Dahl who find that football games influence domestic violence. But Card and Dahl analyze NFL games, and they provide arguments as to why the NFL provides the best opportunity to estimate the effects of changes in local mood. In our view, Healy et al. contradict themselves by rationalizing the null effect in the NFL while at the same time citing an NFL study as evidence that football games influence mood.

“How does peer review shape science?”

In a paper subtitled, “A simulation study of editors, reviewers, and the scientific publication process,” political scientist Justin Esarey writes:

Under any system I study, a majority of accepted papers will be evaluated by the average reader as not meeting the standards of the journal. Moreover, all systems allow random chance to play a strong role in the acceptance decision. Heterogeneous reviewer and reader standards for scientific quality drive both results.

He concludes:

A peer review system with an active editor (who uses desk rejection before review and does not rely strictly on reviewer votes to make decisions) can mitigate some of these effects.

This seems reasonable to me. As a reviewer, I give my recommendation but I recognize that the decision is up to the editor. This takes the pressure off me: I feel that all I have to do is provide useful information, not to make the decision.
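Esarey’s simulations are more elaborate, but a toy version of my own (parameters invented for illustration, not his) shows the basic mechanism: heterogeneous reviewer standards plus noisy readings mean that a nontrivial share of accepted papers falls below the average reader’s standard.

```python
import numpy as np

rng = np.random.default_rng(7)

n_papers, n_reviewers = 20000, 3
quality = rng.normal(0.0, 1.0, n_papers)   # true paper quality

# Each reviewer has their own acceptance standard and a noisy reading
standards = rng.normal(0.5, 0.5, size=(n_papers, n_reviewers))
perceived = quality[:, None] + rng.normal(0.0, 0.5, size=(n_papers, n_reviewers))
votes = perceived > standards

# Accept on a majority vote of the three reviewers
accepted = votes.sum(axis=1) >= 2

# The "average reader" applies the average standard to the true quality
reader_standard = 0.5
frac_below = np.mean(quality[accepted] < reader_standard)
print(f"accepted papers the average reader judges below standard: {frac_below:.2f}")
```

Whether that share reaches a majority, as in Esarey’s systems, depends on the noise and heterogeneity settings; the point of the sketch is only that random chance plays a visible role in acceptance.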

Esarey’s paper also includes some graphs which are pretty good, but I won’t include them here because I’m so bummed that he doesn’t label the lines directly—I can’t stand having to go back and forth between the lines and the legend. Also I don’t like graphs where the y-axis represents probability but the axis goes below 0 and above 1. It’s all there, though, at the link.

I like the paper. I haven’t read the details so I can’t comment on Esarey’s specific models, but the general features seem to make sense, so it seems like a good start in any case.

You won’t be able to stop staring at this original Hot Hand preprint


To continue with our basketball theme, here’s the preprint of the original hot hand paper, “Misperception of Chance Processes in Basketball,” by Amos Tversky, Robert Vallone, and Thomas Gilovich, from 1985 or so. I remember when it was floating around and everybody was talking about it. When discussing the hot hand with Josh Miller the other day, I remembered I had this preprint in my filing cabinet.

[First page of the preprint]

Here it is. Cool, huh? Even if they did make a mistake in their estimation and then, thirty years later, doubled down and tried to minimize the extent of their error.
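The estimation mistake alluded to here is, presumably, the streak-selection bias that Miller and Sanjurjo later identified: computing the proportion of hits immediately following a hit, within finite sequences, is biased downward even for an i.i.d. shooter. A quick simulation sketch of my own shows the effect:

```python
import numpy as np

rng = np.random.default_rng(1)

n_seq, n_shots, p = 100_000, 10, 0.5   # many short sequences of 50% shooting

props = []
for _ in range(n_seq):
    shots = rng.random(n_shots) < p
    after_hit = shots[1:][shots[:-1]]   # outcomes immediately following a hit
    if after_hit.size:                  # condition on at least one opportunity
        props.append(after_hit.mean())

bias_est = np.mean(props)
print(f"average P(hit | previous hit) across sequences: {bias_est:.3f}")
# systematically below the true 0.5, despite no hot hand in the data
```

So a finding of “no elevation after a hit” in data like these can mask a real hot hand, which is the sense in which the original estimation was off.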

Josh Miller’s thoughts

Josh writes:

Question: who coined “hot hand fallacy”?

GVT call it a cognitive illusion in their 1985 paper, and in their two follow-up papers in Chance in 1989.

If you notice in their 1989 “Cold Facts” Chance paper, they are careful to say their result applies exclusively to basketball, because that is the domain in which they asked about beliefs. On the one hand, this makes sense, but really? Imagine finding momentum in human performance in all other sports except basketball; what do you say then?

While basketball season is just beginning, baseball is in its final throes. Don’t forget Stephen Jay Gould had something to say after reading GVT, and also wrote something up for Chance in 1989.

Here is Gould’s quote: “Everybody knows about hot hands. The only problem is that no such phenomenon exists. Stanford psychologist Amos Tversky studied…”

He goes on with an attempt to grant Joe DiMaggio an exception, but gives up on it in the end.

Here is Larry Summers, in 2013, in a similar spirit, admonishing the Harvard Basketball team for believing in the hot hand:

Summers was probably just trying to connect with them, but he has cited GVT’s paper (link here if you want: )

Hopefully things will shift, and people don’t have to be embarrassed anymore when talking about the hot hand (though they *sometimes* should be!).

Perhaps people will be more interested in conducting research on momentum in human performance, and not just looking at streak patterns? What a thought! A streak is a data pattern, but the “Hot hand” is a concept, it is a temporary elevation in ability/talent/prob. of success, it should have a mechanism(s), and there must be other things to measure besides, as you call it, “weak” binary data. These two things were entirely confounded in the original GVT study, because the focus was on making light of athletes who said things like: “the basket seems to be so wide. No matter what you do, you know the ball is going to go in” (Purvis Short). Now it may be intellectually respectable to take a look.

Interesting. I like the idea of studying performance more directly rather than obsessing over the data patterns.

Hi-tech hoops: Characterizing the spatial structure of defensive skill in professional basketball

[Figure from the paper]

Joshua Vogelstein points me to this article by Alexander Franks, Andrew Miller, Luke Bornn, and Kirk Goldsberry and writes:

For some reason, I feel like you’d care about this article, and the resulting discussion on your blog would be fun.

[Figure from the paper]

Hey—label your lines directly!

[Figure from the paper]


[Figure from the paper]

Ummm . . . no.

[Figure from the paper]


[Figure from the paper]

Really, really, really, really no. “−25,474.93.” What were they thinking???

I have nothing to offer on the substance of the paper because, hey, I know next to nothing about basketball! One thing that interested me, though, is that the claims of the paper are entirely presented in basketball terms. I guess that’s a difference between stat and econ/management. A stats paper about sports can just be about sports. An econ or management paper about sports will make the claim of relevance based on general principles of motivation, organization, training, or whatever.

I’m not saying one way or the other is better, it’s just interesting how the two fields differ.

Super-topical NBA post!!!

Paul Alper writes:

Now that his team has won the NBA Championship, I am surprised that you have not commented on Curry and his mouthguard. The link is from May 8, 2015. Notice that mouthguard out is mouthguard chewed! From the article:

Curry says his mouthguard routines are completely random, but apparently he’s now well aware that he shoots slightly better when chewing on it like a cigar.

Not from the article:

Sample    X    N   Sample p
1       198  214   0.925234
2       110  123   0.894309

Difference = p (1) – p (2)
Estimate for difference: 0.0309247
95% CI for difference: (-0.0338346, 0.0956840)
Test for difference = 0 (vs ≠ 0): Z = 0.94 P-Value = 0.349

Fisher’s exact test: P-Value = 0.420

Which more or less implies that the mouthguard’s position is immaterial. Yet the WSJ article on June 15, 2015 claims that the 3.1 percentage points “is a substantial increase in free-throw accuracy.” And it further states, “wouldn’t it be smart to do more of what has worked consistently in the past [mouthguard out].” Clearly, the implication of a very large p-value is not something WSJ financial writers fully comprehend.

More important is how in the world did Big Data acquire the 337 foul shots? Is the next step on the mouthguard issue then breaking it down into subgroups such as first half vs. second half, home vs. away, weekend vs. during the week, etc.? All in order to get to .05?

Ooooh, “Fisher’s exact test”—I hate that! And what’s with all those numbers after the decimal place???

But, yeah, I agree with Alper on this.
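Alper’s Minitab numbers are easy to check. Here is a sketch reproducing them with scipy, using the unpooled standard error (which is what Minitab’s output above reflects):

```python
import numpy as np
from scipy import stats

# Curry's free throws, from the numbers quoted above:
# mouthguard position 1: 198 of 214; position 2: 110 of 123
x1, n1 = 198, 214
x2, n2 = 110, 123
p1, p2 = x1 / n1, x2 / n2

# Two-proportion z-test with unpooled standard error
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(abs(z))
print(f"diff = {p1 - p2:.4f}, z = {z:.2f}, p = {p_value:.3f}")

# Fisher's exact test on the 2x2 table of makes and misses
_, p_fisher = stats.fisher_exact([[x1, n1 - x1], [x2, n2 - x2]])
print(f"Fisher's exact p = {p_fisher:.3f}")
```

Both tests land nowhere near conventional significance, which is Alper’s point: a 3.1-percentage-point difference on 337 free throws is well within noise.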

Unfortunately this came after the NBA playoffs were over. So in a desperate attempt at topicality, I’ve delayed this post until now, the start of the new season.

Workshop on replication in economics

Jan Hoffler sends along this information:

YSI workshop with Richard Ball, Johannes Pfeifer, Edward Miguel, Jan H. Höffler, and Thomas Herndon

January 6 – 7, San Francisco, Mozilla Science Lab

The workshop will take place right after the Annual Meeting of the American Social Sciences Associations, which includes the Annual Meeting of the American Economic Association (AEA) . . .

The workshop will consist of mini-courses covering research transparency in empirical research and macro models that are neglected in the conventional economics curriculum. For young scholars it can be very useful to orient themselves by looking at how established researchers do their studies. By now there is a lot of material available, but it is often frustrating to try to see how an analysis was done, only to find that it cannot easily be redone. This workshop intends to help young scholars find out how to replicate others’ studies and how to archive their own research for future use and for others.

The workshop also will feature student presentation sessions, which will give Ph.D. candidates the opportunity to present and discuss their research in a collaborative environment. Applicants shall enter the title and abstract in the registration form (deadline is November 1) and submit the complete version to the Institute no later than December 1, 2015. Moreover, during the joint lunch and dinner there will be ample time for social interaction with students and teachers.

P.S. The name Edward Miguel rings a bell . . . oh yeah, here he is. His website remains impressive but it no longer says that he’s from New Jersey. I wonder what happened with that.

P.P.S. A commenter reminds us that Miguel is also involved in the Worm Wars which we discussed recently in this space.

Don’t miss this one: “Modern Physics from an Elementary Point of View”

I was googling *back of the envelope* for a recent post and I came across these lecture notes by Victor Weisskopf from 1969.

I can no longer really follow this sort of thing—I really really wish this had been my textbook back when I was studying physics. If they’d taught us this stuff, I might’ve never left that field.

Anyway, here’s one of the more accessible bits, from pages 8-11 of the document, where he derives that a mountain must be less than 30 km high to be supported by the rock at its base:

[Scanned pages: Weisskopf’s derivation of the maximum height of a mountain]
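The flavor of the argument, with my own round numbers rather than Weisskopf’s: a mountain can grow only until the gravitational energy per molecule at its base is comparable to the energy needed to make the base rock flow, i.e., its latent heat of melting.

```python
# Order-of-magnitude estimate in the spirit of Weisskopf's derivation
# (assumed round numbers, not his).
N_A = 6.022e23          # Avogadro's number, 1/mol
g = 9.8                 # m/s^2
m_SiO2 = 0.060 / N_A    # mass of one SiO2 unit, kg (60 g/mol)
E_melt = 9.0e3 / N_A    # latent heat of fusion of quartz per molecule, J
                        # (~9 kJ/mol, an assumed round value)

# Height at which m*g*h reaches the melting energy per molecule
h_max = E_melt / (m_SiO2 * g)
print(f"maximum mountain height ~ {h_max / 1000:.0f} km")
```

This lands within a factor of two of Weisskopf’s 30 km bound, which is the kind of agreement a back-of-the-envelope estimate is supposed to deliver.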

In the next section, Weisskopf derives the number of atoms in a liquid from its surface tension and latent heat of evaporation:

[Scanned pages: Weisskopf’s derivation of the number of atoms in a liquid]
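A rough version of that second derivation, for water, with my own assumed numbers and a crude bonding factor (not Weisskopf’s exact treatment): a surface molecule is missing roughly one sixth of its bonds, so the surface tension is about one sixth of the evaporation energy per molecule, spread over one molecular area d², with d³ = V_molar/N. Solving for N:

```python
# Back-of-the-envelope count of molecules per mole from surface tension
# and latent heat of evaporation of water (assumed values).
sigma = 0.072   # surface tension of water, N/m
U = 4.07e4      # latent heat of evaporation, J/mol
V_m = 1.8e-5    # molar volume of water, m^3/mol

# sigma ~ U / (6 * N * d^2) with d^3 = V_m / N
# => N = (U / (6 * sigma * V_m**(2/3)))**3
N = (U / (6 * sigma * V_m ** (2.0 / 3.0))) ** 3
print(f"estimated molecules per mole: {N:.1e}")
# lands within a couple orders of magnitude of Avogadro's 6.0e23
```

Given how crude the 1/6 factor is, getting within a few orders of magnitude of Avogadro’s number from two tabletop quantities is the charm of the method.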

I’m really annoyed that physics wasn’t taught this way to me. That said, it’s a challenge to write a statistics book of this sort. I’m working on it; I have various examples that have this flavor.

On deck this week

Lotsa hoops this week:

Mon: Don’t miss this one: “Modern Physics from an Elementary Point of View”

Tues: Super-topical NBA post!!!

Wed: Hi-tech hoops: Characterizing the spatial structure of defensive skill in professional basketball

Thurs: You won’t be able to stop staring at this original Hot Hand preprint

Fri: Stop screaming already: Exaggeration of effects of fan distraction in NCAA basketball

Sat: You’ll never guess what I say when I have nothing to say

Sun: 2 new thoughts on Cauchy priors for logistic regression coefficients

Top 5 movies about scientists


In this NYT interview, Philip “Stanford Prison Experiment” Zimbardo gives his list:

1. “Madame Curie,” 1943

2. “The Seven-Per-Cent Solution,” 1976

3. “Awakenings,” 1990

4. “The Insider,” 1999

5. “The Imitation Game,” 2014.

Not a very impressive list. But that’s the point, I guess: there haven’t been many good movies about scientists. I was racking my brains and the only obvious omission from this list was Young Frankenstein. Also A Beautiful Mind, but maybe that wasn’t very good, I don’t know. Actually, of the five movies on the above list, the only one I actually saw was The Seven-Per-Cent Solution, back when it came out. It was ok, but nothing special. Certainly not half as good as The Bad News Bears, which came out in that same year. On the other hand, maybe it’s more watchable than Young Frankenstein, which may not have aged well.

Anyway, back to movies about scientists: Are we missing any good ones? Or even any OK ones? Does Moneyball count?

Hmmm, there was The R. A. Fisher Story. Overall not a very significant film, but I remember the stirring climax, that courtroom scene where the great statistician and biologist stands up and gives a stirring speech to the jury, explaining once and for all why cigarette smoking does not cause cancer.

Or maybe that was the Donald Rubin Story? Or the Joseph Fleiss Story? Or the Ingram Olkin Story? Or the Arnold Zellner Story? Or . . .?

Anyway, that’s all we have until “Second Chance U” and “The New Dirty Dozen” come out. They’ll rocket up to #1 and 2 on anybody’s list of top science flicks.

P.S. One commenter mentions Dr. Strangelove. I’d say that’s the winner.

And then I was thinking about The Man With Two Brains, which I’m pretty sure is way better than any of the other scientist movies mentioned above.

Meet the 1 doctor in America who has no opinion on whether cigarette smoking contributes to lung cancer in human beings.

Paul Alper writes:

In your blog today you once again criticize Tol’s putative results regarding global warming: “At no point did Tol apologize or thank the people who pointed out his errors; instead he lashed out, over and over again. Irresponsible indeed.”

Well, here is something far more irresponsible and depressing. Read Susan Perry:

Why would physicians testify that smoking isn’t addictive? Money. Lots of money.

Here’s a finding, however, that may make that willingness seem less shocking: Tobacco companies pay the doctors handsomely for their testimony — up to $100,000 per physician per case.

Perry is referring to this paper by Robert Jackler, “Testimony by otolaryngologists in defense of tobacco companies 2009–2014.”

In response, I pointed Alper to my article from a few years back on statisticians and other hired guns for the cigarette industry. Among other things, it caused me to ratchet my respect for cigarette shill Darrell Huff down several notches. But most amazing was this exchange involving Kenneth Ludmerer, a professor of history and medicine at Washington University in St. Louis:

Q: Doctor, is it your opinion that cigarette smoking contributes to the development of lung cancer in human beings?

A: I have no opinion on that.

The guy has an M.D. and he teaches at one of the nation’s top medical schools. All I can say is, I wouldn’t want this guy as my family doctor!

Kenneth Ludmerer

After reading my article, Alper wrote:

Did Rubin really receive over $2 million? When I was in graduate school at Madison in the late 1950s and early 1960s, I attended Fisher’s lecture where he asserted smoking does not cause cancer. His then son-in-law, George Box, headed the stat department and was duly embarrassed.

P.S. Dr. Ludmerer is the “Mabel Dorn Reeder Distinguished Professor in the History of Medicine.” I googled Mabel Dorn Reeder and came across this obituary. I was curious if she’d died of a cigarette-related illness but it doesn’t say. I see here that she attended graduate school at Columbia!

And here’s the news article, “Ludmerer named distinguished professor in history of medicine.”

In that article, they forgot to mention that Ludmerer is one of the 5 doctors in America who has no opinion on whether cigarette smoking contributes to lung cancer in human beings.

It’s gotta be hard to find a doctor who holds that opinion. Such a man is very special and certainly deserves a distinguished chair. Perhaps he can share it with that doctor who believes that vaccines cause autism, and that other doctor who thinks that diseases are caused by demonic possession.

There are lots of witches running around in St. Louis. Something must be done!

P.P.S. I’m one to talk, given that I get research funds from Novartis and the U.S. military (for basic research, but still, the military must at some level find it useful to their purposes). Still, no matter how much they pay me, no matter how many awards they give me, I don’t think I’d ever say something as dumb as that I had no opinion on whether cigarette smoking contributes to the development of lung cancer in human beings.


3 reasons why you can’t always use predictive performance to choose among models


A couple years ago Wei and I published a paper, Difficulty of selecting among multilevel models using predictive accuracy, in which we . . . well, we discussed the difficulty of selecting among multilevel models using predictive accuracy.

The paper happened as follows. We’d been fitting hierarchical logistic regressions of poll data and I had this idea to use cross-validated predictive accuracy to see how the models were doing. The idea was that we could have a graph with predictive error on the y-axis and number of parameters on the x-axis, and see how adding new parameters in a Bayesian model increased predictive accuracy (unlike in a classical non-regularized regression, where if you add too many parameters you get overfitting and suboptimal predictive performance).

But my idea failed. Failed failed failed failed failed. It didn’t work. We had a model where we added important predictors—definitely improving the model—but that improvement didn’t show up in the cross-validated predictive error. And this happened over and over again.

Our finding: Predictive loss was not a great guide to model choice

Here’s an example, where we fit simple multilevel logistic regressions of survey outcomes given respondents’ income and state of residence. We fit the model separately to each of 71 different responses from the Cooperative Congressional Election Survey (a convenient dataset because the data are all publicly available). And here’s what we find, for each outcome plotting average cross-validated log loss comparing no pooling, complete pooling, and partial pooling (Bayesian) regressions:


OK, partial pooling does perform the best, as it should. But it’s surprising how small the difference is compared to the crappy complete pooling model. (No pooling is horrible but that’s cos of the noisy predictions in small states where the survey has few respondents.)
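To see how this can happen, here’s a toy simulation (my own sketch in Python, not the paper’s actual CCES analysis) comparing the three estimators. The partial-pooling estimator here is a simple shrinkage rule with an arbitrary prior weight, standing in for the full hierarchical model:

```python
import math
import random

random.seed(1)

# Toy setup: binary responses grouped by "state," with a few big states
# and many small ones (sample sizes are made up).
states = range(20)
true_p = {s: random.uniform(0.3, 0.7) for s in states}
n = {s: 1000 if s < 5 else 10 for s in states}
successes = {s: sum(random.random() < true_p[s] for _ in range(n[s]))
             for s in states}

overall = sum(successes.values()) / sum(n.values())  # complete-pooling estimate
prior_n = 10  # arbitrary shrinkage weight standing in for the hierarchical prior

no_pool = {s: successes[s] / n[s] for s in states}
partial = {s: (successes[s] + prior_n * overall) / (n[s] + prior_n)
           for s in states}
complete = {s: overall for s in states}

def avg_log_loss(est):
    # Average expected log loss against the true proportions.
    total = 0.0
    for s in states:
        q = min(max(est[s], 1e-3), 1 - 1e-3)  # guard against log(0) under no pooling
        total += -(true_p[s] * math.log(q) + (1 - true_p[s]) * math.log(1 - q))
    return total / len(states)

for name, est in [("no pooling", no_pool), ("complete pooling", complete),
                  ("partial pooling", partial)]:
    print(name, round(avg_log_loss(est), 4))
```

With a few well-sampled states and many tiny ones, partial pooling typically comes out best, but only by a small amount of log loss per observation, which is the phenomenon described above.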

The intuition behind our finding

It took us a while to understand why a model—partial pooling—did not perform much better, given that it was evidently superior to the alternatives.

But we eventually understood, using the following simple example:

What sorts of improvements in terms of expected predictive loss can we expect to find from improved models applied to public opinion questions? We can perform a back-of-the-envelope calculation. Consider one cell with true proportion 0.4 and three fitted models, a relatively good one that gives a posterior estimate of 0.41 and two poorer models that give estimates of 0.44 and 0.38. The predictive log loss is −[0.4 log(0.41) + 0.6 log(0.59)] = 0.6732 under the good model and −[0.4 log(0.44) + 0.6 log(0.56)] = 0.6763 and −[0.4 log(0.38) + 0.6 log(0.62)] = 0.6739 under the others.

In this example, the improvement in predictive loss by switching to the better model is between 0.0006 and 0.003 per observation. The lower bound is given by −[0.4 log(0.4) + 0.6 log(0.6)] = 0.6730, so the potential gain from moving to the best possible model in this case is only 0.0002.

Got that? 0.0002 per observation. That’s gonna be hard to detect.

We continue:

These differences in expected prediction error are tiny, implying that they would hardly be noticed in a cross-validation calculation unless the number of observations in the cell were huge (in which case, no doubt the analysis would be more finely grained and there would not be so many data points per cell). At the same time, a change in prediction from 0.38 to 0.41, or from 0.41 to 0.44, can be meaningful in a political context. For example, Mitt Romney in 2012 won 38% of the two-party vote in Massachusetts, 41% in New Jersey, and 44% in Oregon; these differences are not huge but they are politically relevant, and we would like a model to identify such differences if it is possible from data.

The above calculations are idealized but they give a sense of the way in which real differences can correspond to extremely small changes in predictive loss for binary data.
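Those numbers are easy to verify; here is the arithmetic as a few lines of Python (just a check of the back-of-the-envelope calculation, not code from the paper):

```python
import math

def log_loss(p_true, q):
    # Expected per-observation log loss when predicting probability q
    # for a binary outcome with true proportion p_true.
    return -(p_true * math.log(q) + (1 - p_true) * math.log(1 - q))

p = 0.4
# Good model, two poorer models, then the best possible (q equal to the truth):
for q in (0.41, 0.44, 0.38, 0.40):
    print(q, round(log_loss(p, q), 4))
```

The gap between the good model’s 0.6732 and the floor of 0.6730 is the 0.0002 per observation mentioned above.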

3 postdoc opportunities you can’t miss—here in our group at Columbia! Apply NOW, don’t miss out!


Hey, just once, the Buzzfeed-style hype is appropriate. We have 3 amazing postdoc opportunities here, and you need to apply NOW.

Here’s the deal: we’re working on some amazing projects. You know about Stan and associated exciting projects in computational statistics. There’s the virtual database query, which is the way I like to describe our generic MRP and deep-interactions research, which we’re applying to so many different problems, ranging from voting and turnout to scores on medical board exams and surveys of at-risk births. We’re working on differential equation models in pharmacometrics and soil science. Physics and astronomy. Ornithology, for chrissake! You name it. We’re doing some of the most exciting stuff around in causal inference, with links to some amazing causal-inference researchers at Columbia and elsewhere. Treatment interactions, anyone? These models are fundamental to a modern view of applied statistics and policy analysis. Weakly informative priors. Important in themselves, also we have some ideas of how they can help resolve the replication crisis. That’s right, I’m not just whining about replication on the blog, I’m doing research on how to do better. I think I’m forgetting about 80 other projects we’re working on. Oh yeah, we’re evaluating the Millennium Village project and we’re working with some economists on general ideas for hierarchical models for multiple outcomes. That’s right, we want to tackle the Edlin factor. And penumbras. And, did I mention that we’re working on the hot hand? We’re aiming big here, we’re doing the most advanced computation, working on every possible problem you can think of, with an amazing interdisciplinary team. And we’re not too proud to get help from our friends like Jennifer Hill, Guido Imbens, Dave Blei, Havard Rue, and so on. Developing the best tools to do the best work, that’s what we’re all about. And you—that’s right, you—can be part of it. You just have to apply.

2-year postdoctoral research position on statistics and education research

This postdoc is part of an Institute for Education Sciences training grant, operated jointly by Andrew Gelman (Columbia), Jennifer Hill (NYU), and Marc Scott (NYU). We are working on a range of different research problems, including causal inference, computation for hierarchical modeling, and various applications in education research. The candidate should have strength in statistics, interest in education research, and ideally be a strong programmer as well. This work will take place in an interdisciplinary research environment, and the postdoc will have the opportunity to collaborate on multiple projects in statistics and applied research. The candidate must be a U.S. citizen or permanent resident. If you are interested, please email me your CV, a letter of application, some of your papers, and three letters of recommendation. We’ve already started collecting applications, so apply right away.

Also, our colleague Sophia Rabe-Hesketh at the UC Berkeley school of education has a 15-month postdoctoral opening to work on quantitative education research. We’re working together on Stan, so if you’re interested in contributing to Stan in a useful, fun, and exciting way, either from NYC or Berkeley, this could be a great opportunity for you.

2-year postdoctoral research position on informative priors for Bayesian inference

This postdoc, supervised by Andrew Gelman, is funded by the Office of Naval Research to perform research on the use of informative priors that add a small amount of information to stabilize a statistical analysis without overwhelming the information in data. Much of the work will be done in the context of particular applications in the social, behavioral, and natural sciences, and the ideal candidate will have a deep understanding of Bayesian modeling, an interest in applied statistics, and excellent computation skills. This work will take place in an interdisciplinary research environment, and the postdoc will have the opportunity to collaborate on multiple projects in statistics and applied research. The candidate is not required to be a U.S. citizen or permanent resident. If you are interested, please email me your CV, a letter of application, some of your papers, and three letters of recommendation. We’ve already started collecting applications, so apply right away.

2-year Earth Institute postdoctoral fellowship

The Earth Institute at Columbia brings in several postdocs each year (it’s a two-year gig), and some of them have been statisticians (recently, Kenny Shirley, Leontine Alkema, and Shira Mitchell). We’re particularly interested in statisticians who have research interests in development and public health. It’s fine—not just fine, but ideal—if you are interested in statistical methods also. The EI postdoc can be a place to do interesting work and begin a research career. Details here. If you’re a statistician who’s interested in this fellowship, feel free to contact me—you have to apply to the Earth Institute directly (see link above), but I’m happy to give you advice about whether your goals fit into our program. It’s important to me, and to others in the EI, to have statisticians involved in our research. Deadline for applications is 30 Oct, so it’s time to prepare your application NOW!

The answer to my previous question


What’s the probability that Daniel Murphy hits a home run tonight?


20%, that’s my quick empirical estimate. (Update: 15%; see the P.S. below.)

Where do I get this? I googled *most home runs hit in consecutive games* and found this list of players who’ve hit home runs in at least six consecutive games. There are 20 such cases; 14 of these streaks ended at six games, and 6 of these streaks continued. So I’d estimate Pr(homer in 7th consecutive game | homered in 6 consecutive games) as 6/20 = 0.30.

Ok, but Murphy’s streak so far is only 5 games, not 6. So we really want Pr(homer in 6th consecutive game | homered in 5 consecutive games). How many streaks of length 5? According to this page, there have been 39 streaks of 5 or longer since 1997. Going back to that earlier page, we see that 7 of the streaks of length 6+ happened since 1997. So this gives us an empirical probability of 7/39 = 0.18.

So my empirical estimate is 18%, which is hyper-precise. I round it to 20%.
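For concreteness, here are the two conditional probabilities as a quick calculation (counts copied from the pages linked above):

```python
# All-time list of players who homered in at least 6 consecutive games:
streaks_of_6_plus = 20   # streaks reaching 6+ games
continued_to_7 = 6       # of those, streaks that reached a 7th game
p_7th_given_6 = continued_to_7 / streaks_of_6_plus  # 6/20 = 0.30

# Since 1997: streaks of length 5+ vs. those that reached a 6th game.
streaks_of_5_plus = 39
reached_6 = 7
p_6th_given_5 = reached_6 / streaks_of_5_plus  # 7/39, about 0.18

print(round(p_7th_given_6, 2), round(p_6th_given_5, 2))
```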

P.S. See discussion in comments. After partial pooling to Murphy-specific data, my new estimate is 15%.

P.P.S. My commenters know much more baseball than I do and are trying to push my probability down to 10%. I’ll let all of you make the call on this one.

P.P.P.S. The results of a more precise calculation appear in the next post.

It’s all about the denominator: Rajiv Sethi and Sendhil Mullainathan in a statistical debate on racial bias in police killings


Rajiv Sethi points me to this column by Sendhil Mullainathan, who writes:

Tamir Rice. Eric Garner. Walter Scott. Michael Brown. Each killing raises a disturbing question: Would any of these people have been killed by police officers if they had been white? . . .

There is ample statistical evidence of large and persistent racial bias in other areas — from labor markets to online retail markets. So I [Mullainathan] expected that police prejudice would be a major factor in accounting for the killings of African-Americans. But when I looked at the numbers, that’s not exactly what I found. . . . what the data does suggest is that eliminating the biases of all police officers would do little to materially reduce the total number of African-American killings.

Then come the numbers:

According to the F.B.I.’s Supplementary Homicide Report, 31.8 percent of people shot by the police were African-American, a proportion more than two and a half times the 13.2 percent of African-Americans in the general population. . . .

But this data does not prove that biased police officers are more likely to shoot blacks in any given encounter.

Instead, there is another possibility: It is simply that — for reasons that may well include police bias — African-Americans have a very large number of encounters with police officers. . . . Arrest data lets us measure this possibility.

At this point I just have to interject that every time I see “data” used as a singular noun, it’s like fingernails on a blackboard to me. Yes, yes, I know that in modern English, “data” is acceptable as a singular or plural noun. So I’m not saying that Mullainathan (or the New York Times style guide) is wrong. It just bothers me. I’m not used to it.

Anyway, to continue from Mullainathan:

For the entire country, 28.9 percent of arrestees were African-American. This number is not very different from the 31.8 percent of police-shooting victims who were African-Americans. If police discrimination were a big factor in the actual killings, we would have expected a larger gap between the arrest rate and the police-killing rate.

This in turn suggests that removing police racial bias will have little effect on the killing rate. . . .

He continues with some sentences that explain the basic idea which should be unexceptional to the readers of this blog.

My Columbia University colleague Sethi was not happy with this reasoning. Sethi writes:

Sendhil Mullainathan is one of the most thoughtful people in the economics profession, but he has a recent piece in the New York Times with which I [Sethi] really must take issue. . . .

A key assumption underlying this argument is that encounters involving genuine (as opposed to perceived) threats to officer safety arise with equal frequency across groups. . . .

Sethi argues that it does seem that police officers often behave more violently toward black suspects. But then, he asks,

How, then, can one account for the rough parity between arrest rates and the rate of shooting deaths at the hands of law enforcement? If officers frequently behave differently in encounters with black civilians, shouldn’t one see a higher rate of killing per encounter?

Sethi answers his own question:

Not necessarily. . . . If the very high incidence of encounters between police and black men is due, in part, to encounters that ought not to have occurred at all, then a disproportionate share of these will be safe, and one ought to expect fewer killings per encounter in the absence of bias. Observing parity would then be suggestive of bias, and eliminating bias would surely result in fewer killings.
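Sethi’s point can be made concrete with some made-up numbers (mine, purely illustrative, not from either column):

```python
# Hypothetical numbers: two groups with the same number of genuinely
# threatening encounters, but group B has many extra safe encounters
# due to over-policing.
encounters = {"A": 100, "B": 300}
threatening = {"A": 20, "B": 20}

# If police were unbiased (shooting only in threatening encounters, at the
# same rate for both groups), kills per encounter would differ:
unbiased_rate = 0.5
kills = {g: unbiased_rate * threatening[g] for g in encounters}
per_encounter = {g: kills[g] / encounters[g] for g in encounters}
print(per_encounter)  # A roughly 0.1, B roughly 0.033: parity is NOT the no-bias prediction

# Conversely, if we *observe* parity (say 0.1 kills per encounter for both),
# the implied rate per genuinely threatening encounter is much higher for B:
observed = 0.1
implied = {g: observed * encounters[g] / threatening[g] for g in encounters}
print(implied)  # roughly A: 0.5, B: 1.5
```

Equal numbers of genuine threats plus many extra safe encounters for group B means that observed parity in kills per encounter implies a much higher shooting rate per genuine threat for B: this is why Sethi says observing parity would itself be suggestive of bias.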

The discussion continues

Sethi updates:

This post by Jacob Dink is worth reading. Jacob shows that the likelihood of being shot by police conditional on being unarmed is twice as high for blacks relative to whites. The likelihood is also higher conditional on being armed, but the difference is smaller:


[Damn that’s an ugly y-axis. And boy is it ugly to label the two lines with a legend. — ed.]

Sethi summarizes:

This, together with the fact that rates of arrest and killing are roughly equal across groups, implies that blacks are less likely to be armed than whites, conditional on an encounter. In the absence of bias, therefore, the rate of killing per encounter should be lower for blacks, not equal across groups. So we can’t conclude that “removing police racial bias will have little effect on the killing rate.” That was the point I was trying to make in this post.

Some interesting discussion in comments to Sethi’s post. I have no idea where the error bars are coming from in Dink’s post—I assume the data are some total count, not from a sample.

What’s my take on all this? First off, data are good (or, as the kids today say, data is good). So I appreciate Mullainathan’s numbers and also Dink’s (even if I can’t be quite clear on what Dink is actually plotting). I’m also sympathetic to Sethi’s general argument. Suppose there’s a sliding scale of police aggression, starting with arresting, moving to violence, and culminating in killing a suspect. I could imagine #killed/#arrested not varying much by race, while at the same time the cops are disproportionately arresting, beating, and killing African Americans.

To put it another way, this is an argument over the denominator. Mullainathan is using arrests as the denominator, but it’s not clear this is appropriate. These are tough questions. In our stop-and-frisk paper we used previous year’s validated arrests as a baseline. But that’s not perfect.

It’s interesting that Mullainathan uses the example of Tamir Rice and then recommends using arrests as a reference point. I assume that had the officer gotten close enough to Rice to see what was happening, he wouldn’t have arrested Rice in any case, right?

Also it is notable that it is two economists having this discussion, given that the topic is not actually economics! I say this not because I think economists should be discouraged from studying such topics; it just seems surprising to me. I suppose what’s going on is that there are a lot more academic economists out there than there are sociologists or criminologists or even political scientists. Also it’s my impression that quantitative political scientists are discouraged from working on this sort of policy research, but I might be wrong about that.

Ta-Nehisi Coates, David Brooks, and the “Street Code” of Journalism

In my latest Daily Beast column, I decide to be charitable to the factually-challenged NYT columnist:

From our perspective, Brooks’s refusal to admit error makes him look like a buffoon. But maybe we’re just judging him based on the norms of another culture. . . .

From our perspective, Brooks spreading anti-Semitic false statistics in the pages of The New York Times is actually much worse than Eric Garner selling loose cigarettes on the streets of Staten Island.

But that’s just our perspective. Maybe our judgmental attitude toward sloppy journalism is as clueless as Brooks’s attitude toward street crime. We don’t know.

As the saying goes, read the whole thing. I have sympathy for the attitude of regular blog readers who skip any post whenever the words “David Brooks” appear. But I think this particular column of mine is relevant to larger issues of people being unwilling to own their errors.

Here are my previous Daily Beast columns:

What’s So Fun About Fake Data?

Don’t Mistake Genetics For Fate

The Truth About Post-Ferguson Gun Deaths

There Are Infinite Types of Drunk People

Could Google Rig the 2016 Election? Don’t Believe the Hype.

And here are the columns by my collaborator, Kaiser Fung:

Why Consumers Should Care About Apple’s War on Big Data

Banks Want Robots to Do Their Hiring

How the Media #Fails Basic Math

Debunking the Great ‘Selfies Are More Deadly Than Shark Attacks’ Myth

The two of us share the column and take joint responsibility; most of the time we send the column back and forth so that we’re both contributing at some level each time.