
Scientific explanation of Panther defeat!


Roy’s comment on our recent post inspires me to reveal the true explanation underlying the Carolina team’s shocking Super Bowl loss.

The Panthers were primed during the previous week with elderly-themed words such as “bingo” and “Manning.” As well-established research has demonstrated, this caused Cam and the gang to move more slowly, hence all the sacks and difficulty scoring.

Your reaction to this explanation may be disbelief. The idea you should focus on, however, is that disbelief is not an option. You have no choice but to accept that the major conclusions are true.

Stan’s Super Bowl prediction: Broncos 24, Panthers 13


We ran the data through our model, not just the data from the past season but from the past 17 seasons (that’s what we could easily access) with a Gaussian process model to allow team abilities to vary over time. Because we’re modeling individual game outcomes, our model automatically controls for imbalances such as Carolina’s notoriously easy schedule. And we don’t just model win/loss or even score differential, we model points for each team, which allows us to estimate offense and defense numbers for each team. Also we model separate scores (TD, FG, etc) so that we can get some shot at predicting the actual scores.

Our model isn’t perfect; there’s a lot more information out there we’re not using. No play-level data or even player-level data. Still, it’s what our model predicts: Broncos 24, Panthers 13.
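As a rough illustration of the "model separate scores" idea, here is a minimal sketch in Python. It is not the Stan model described above: it drops the Gaussian process over time, and the ability values, the baseline rates (`base_td`, `base_fg`), and the rule that every touchdown is worth 7 points are all assumptions made up for the example.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_scores(offense, opp_defense, n_sims, rng, base_td=2.5, base_fg=1.6):
    """Simulate final scores for one team against one opponent.

    offense and opp_defense are hypothetical log-scale abilities: scoring
    goes up with the team's offense and down with the opponent's defense.
    """
    rate = math.exp(offense - opp_defense)
    return [
        7 * poisson(rng, base_td * rate) + 3 * poisson(rng, base_fg * rate)
        for _ in range(n_sims)
    ]

rng = random.Random(1)
n_sims = 10000
# Made-up abilities: team A has offense 0.10 and defense 0.25,
# team B has offense 0.15 and defense 0.05.
team_a = simulate_scores(0.10, 0.05, n_sims, rng)
team_b = simulate_scores(0.15, 0.25, n_sims, rng)

p_a_wins = sum(a > b for a, b in zip(team_a, team_b)) / n_sims
p_b_wins = sum(b > a for a, b in zip(team_a, team_b)) / n_sims
print("P(A wins):", p_a_wins, " P(B wins):", p_b_wins)
```

Because touchdowns and field goals are simulated separately, the simulated totals land on realistic football scores (multiples and sums of 7 and 3) rather than on arbitrary real numbers.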

P.S. (9 Feb) Hey, the game’s over. What actually happened? Broncos 24, Panthers 10. Pretty good! Actually better than we might expect—we got lucky. But we’ll take it.

Go Stan!

P.P.S. Damn! We forgot to preregister. But you can take our word for it that this is the only analysis we did with these data.

P.P.P.S. To all you vexatious replication bullies who keep buggin me about the data: I’ll release my Excel files when I damn well please. We spent a year working on this paper, sweating out every number, sweating out over what we were doing, and then to see people blogging about it in real time—that’s not the way science really gets done. And so it’s a little hard for us to respond to all of the blog posts that are coming out.

PhD positions in Probabilistic Machine Learning at #AaltoPML group Finland


There are PhD positions in our Probabilistic Machine Learning group at Aalto, Finland, and altogether 15 positions in Helsinki ICT network. Apply here

The most interesting topic in the call is supervised by Prof. Samuel Kaski at AaltoPML (and you may collaborate with me too :)

We are looking for PhD candidates interested in probabilistic modeling and machine learning, both theory and applications. Main keywords include Bayesian inference and multiple data sources. Strong application areas with excellent collaboration opportunities are: personalized medicine, bioinformatics, user interaction, brain signal analysis, information visualization and intelligent information access. The group has several excellent postdocs who participate in supervision. We belong to the Finnish Center of Excellence in Computational Inference Research COIN.

Although this description doesn’t mention it, the research may also be related to Stan.

And before Andrew comments, I’ll just say that right now in the winter, south Finland is warmer than New York or Iceland!

Primed to lose


David Hogg points me to a recent paper, “A Social Priming Data Set With Troubling Oddities” by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, which begins:

Chatterjee, Rose, and Sinha (2013) presented results from three experiments investigating social priming—specifically, priming effects induced by incidental exposure to concepts relating to cash or credit cards. They reported that exposing people to cash concepts made them less generous with their time and money, whereas exposing them to credit card concepts made them more generous.

The effects reported in the Chatterjee et al. paper were large—suspiciously large.

Last year, I wrote about a study whose results were stunningly large. It was only after I learned the data had been faked—it was the notorious Lacour and Green voter canvassing paper—that I ruefully wrote that, sometimes a claim that is too good to be true, isn’t.

Pashler et al. skipped my first step and went straight to the data. After some statistical detective work, they conclude:

We are not in a position to determine exactly what series of actions and events could have resulted in this pattern of seemingly corrupted data. In our view, given the results just described, possibilities that would need to be considered would include (a) human error, (b) computer error, and (c) deliberate data fabrication.


In our opinion based solely on the analyses just described, the findings do seem potentially consistent with the disturbing third possibility: that the data records that contributed most to the priming effect were injected into the data set by means of copy-and-paste steps followed by some alteration of the pasted strings in order to mask the abnormal provenance of these data records that were driving the key effect.


No coincidence that we see fraud (or extreme sloppiness) in priming studies

How did we get to this point?

Do you think Chatterjee et al. wanted to fabricate data (if that’s what they did) or do incredibly sloppy data processing (if that’s what happened)? Do you think that, when Chatterjee, Rose, and Sinha were in grad school studying psychology or organizational behavior or whatever, they thought, When I grow up I want to be running my data through the washing machine?

No, of course not.


They were driven to cheat, or to show disrespect for their data, because there was nothing there for them to find (or, to be precise, that any effects that were there, were too small and too variable for them to have any chance of detecting; click on above kangaroo image for a fuller explanation of this point).

Nobody wants to starve. If there’s no fruit on the trees, people will forage through the weeds looking for vegetables. If there’s nothing there, they’ll start to eat dirt. The low quality of research in these subfields of social psychology is a direct consequence of there being nothing there to study. Or, to be precise, it’s a direct consequence of effects being small and highly variable across people and situations.

I’m sure these researchers would’ve loved to secure business-school teaching positions by studying large and real effects. But, to continue my analogy, they got stuck in a barren patch of the forest, eating dirt and tree bark in a desperate attempt to stay viable. It’s not a pretty sight. But I can see how it can happen. I blame them, sure (just as I blame myself for the sloppiness that led to my two erroneous published papers). But I also blame the system, the advisors and peers and journal editors and Ted talk impresarios who misled them into thinking that they were working in a productive area of science, when they weren’t. They were blindfolded and taken into some area of the outback that had nothing to eat.

Outback, huh? I just realized what I wrote. It was unintentional, and I think I was primed by the kangaroo picture.

In all seriousness, I have no doubt that priming occurs—I see it all the time in my own life. My skepticism is with the claim of huge indirect priming effects. As Wagenmakers et al. put it, quoting Hal Pashler, “disbelief does in fact remain an option.” Especially because, as discussed in the present post, if these effects were really present, they’d be interfering with each other all over the place, and these sorts of crude experiments wouldn’t work anyway.

It’s all about the incentives

So . . . you take a research area with small and highly variable effects, but where this is not well understood so you can get publications in top journals with statistically significant results . . . this creates very little incentive to do careful research. I mean, what’s the point? If there’s essentially nothing going on and you’re gonna have to p-hack your data anyway, why not just jump straight to the finish line. Chatterjee et al. could’ve spent 3 years collecting data on 1000 people, they still probably would’ve had to twist the data to get what they needed for publication.

And that’s the other side of the coin. Very little incentive to do careful research, but a very big incentive to cheat or to be so sloppy with your data that maybe you can happen upon a statistically significant finding.

Bad bad incentives + Researchers in a tough position with their careers = Bad situation.

Forking paths vs. six quick regression tips


Bill Harris writes:

I know you’re on a blog delay, but I’d like to vote to raise the odds that my question in a comment gets discussed, in case it’s not in your queue.

It’s likely just my simple misunderstanding, but I’ve sensed two bits of contradictory advice in your writing: fit one complete model all at once, and fit models incrementally, starting with the overly small.

For those of us who are working in industry and trying to stay abreast of good, current practice and thinking, this is important.

I realize it may also not be a simple question.  Maybe both positions are correct, and we don’t yet have a unifying concept to bring them together.

I am open to a sound compromise.  For example, I could imagine the need to start with EDA and small models but hold out a test set for one comprehensive model.  I recall you once wrote to me that you don’t worry much about holding out data for testing, since your field produces new data with regularity.  Others of us aren’t quite so lucky, either because data is produced parsimoniously or the data we need to use is produced parsimoniously.

Still, building the one big model, even after the discussions on sparsity and on horseshoe priors, can sound a bit like, and, although I recognize that regularization can make a big difference.


My reply:

I have so many things I really really must do, but am too lazy to do.  Things to figure out, data to study, books to write.  Every once in a while I do some work and it feels soooo good.  Like programming the first version of the GMO algorithm, or doing that simulation the other day that made it clear how the simple Markov model massively underestimates the magnitude of the hot hand (sorry, GVT!), or even buckling down and preparing R and Stan code for my classes.  But most of the time I avoid working, and during those times, blogging keeps me sane.  It’s now May in blog time, and I’m 1/4 of the way toward being Jones.

So, sure, Bill, I’ll take next Monday’s scheduled post (“Happy talk, meet the Edlin factor”) and bump it to 11 May, to make space for this one.

And now, to get to the topic at hand:  Yes, it does seem that I give two sorts of advice but I hope they are complementary, not contradictory.

On one hand, let’s aim for hierarchical models where we study many patterns at once.  My model here is Aki’s birthday model (the one with graphs on cover of BDA3) where, instead of analyzing just Valentine’s Day and Halloween, we looked at all 366 days at once, also adjusting for day of week in a way that allows that adjustment to change over time.

On the other hand, we can never quite get to where we want to be, so let’s start simple and build our models up.  This happens both within a project—start simple, build up, keep going until you don’t see any benefit from complexifying your model further—and across projects, where we (statistical researchers and practitioners) gradually get comfortable with methods and can go further.

This is related to the general idea we discussed  a few years ago (wow—it was only a year ago, blogtime flies!), that statistical analysis recapitulates the development of statistical methods.

In the old days, many decades ago, one might start by computing correlation measures and then move to regression, adding predictors one at a time.  Now we might start with a (multiple) regression, then allow intercepts to vary, then move to varying slopes.  In a few years, we may internalize multilevel models (both in our understanding and in our computation) so that they can be our starting point, and once we’ve chunked that, we can walk in what briefly will feel like seven-league boots.
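The progression just described can be written out as three nested models. This is a sketch in standard multilevel notation, not taken from the post: $j[i]$ denotes the group containing observation $i$, and the normal distributions on the group-level parameters are the usual default assumption.

```latex
\begin{align*}
\text{regression:}\quad
  & y_i = \alpha + \beta x_i + \epsilon_i \\
\text{varying intercepts:}\quad
  & y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i,
  \qquad \alpha_j \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2) \\
\text{varying slopes:}\quad
  & y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \epsilon_i,
  \qquad
  \begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix}
  \sim \mathrm{N}\!\left(
    \begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \Sigma
  \right)
\end{align*}
```

Each line contains the previous one as a special case (set $\sigma_\alpha = 0$, or set the slope variance in $\Sigma$ to zero), which is one way to see why building up and fitting the big model need not conflict.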

Does that help?

On deck this week


Mon: Forking paths vs. six quick regression tips

Tues: Primed to lose

Wed: Point summary of posterior simulations?

Thurs: In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.

Fri: “Priming Effects Replicate Just Fine, Thanks”

Sat: Pooling is relative to the model

Sun: Hierarchical models for phylogeny: Here’s what everyone’s talking about

The above image is so great I didn’t want you to have to wait till Tues and Fri to see it.

You’ll never guess what I say when I have nothing to say

A reporter writes:

I’m a reporter working on a story . . . and I was wondering if you could help me out by taking a quick look at the stats in the paper it’s based on.

The paper is about paedophiles being more likely to have minor facial abnormalities, suggesting that paedophilia is a neurodevelopment disorder that starts in the womb. We’re a bit concerned that the stats look weak though – small sample size, no comparison to healthy controls, large SD, etc.

If you have time, could you take a quick look and let me know if the statistics seem to be strong enough to back up their conclusions? The paper is here:

I replied: Yes, I agree, I don’t find this convincing, also it’s hard to know what to do with this. It doesn’t seem newsworthy to me. That said, I’m not an expert on this topic.

What’s the difference between randomness and uncertainty?

Julia Galef mentioned “meta-uncertainty,” and how to characterize the difference between a 50% credence about a coin flip coming up heads, vs. a 50% credence about something like advanced AI being invented this century.

I wrote: Yes, I’ve written about this probability thing. The way to distinguish these two scenarios is to embed each of them in a larger setting. The question is, how would each probability change as additional information becomes available. The coin flip is “random” to the extent that intermediate information is not available that would change the probability. Indeed, the flip becomes less “random” to the extent that it is flipped. In other settings such as the outcome of an uncertain sports competition, intermediate information could be available (for example, maybe some key participants are sick or injured) hence it makes sense to speak of “uncertainty” as well as randomness.

It’s an interesting example because people have sometimes considered this to be merely a question of “philosophy” or interpretation, but the distinction between different sources of uncertainty can in fact be encoded in the mathematics of conditional probability.
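One way to write this down, as a sketch with made-up numbers: let $z$ stand for intermediate information that could become available before the outcome, for example $z = 1$ if a key player is injured (assumed here to have probability $1/3$).

```latex
\begin{align*}
\text{coin flip:}\quad
  & \Pr(\text{heads}) = 0.5,
  \qquad \Pr(\text{heads} \mid z) = 0.5 \ \text{for every available } z \\
\text{game:}\quad
  & \Pr(\text{win})
  = \Pr(\text{win} \mid z = 1)\Pr(z = 1) + \Pr(\text{win} \mid z = 0)\Pr(z = 0) \\
  & \phantom{\Pr(\text{win})}
  = 0.3 \cdot \tfrac{1}{3} + 0.6 \cdot \tfrac{2}{3} = 0.5
\end{align*}
```

Both marginal probabilities are 0.5, but only the second one moves when $z$ is observed. The two cases differ in their conditional distributions, not in the single number 0.5.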

The bejeezus


Tova Perlmutter writes of a recent online exchange:
Continue reading ‘The bejeezus’ »

Stat Podcast Plan


In my course on Statistical Communication and Graphics, each class had a special guest star who would answer questions on his or her area of expertise. These were not “guest lectures”—there were specific things I wanted the students to learn in this course, it wasn’t the kind of seminar where they just kick back each week and listen—rather, they were discussions, typically around 20 minutes long, facilitated by the outside expert.

One thing that struck me about these discussions was how fun they were, and how various interesting and unexpected things came up in our conversations.

And that made me think—Hey, we should do a podcast! I can be the host and have conversations with these guests, one at a time, and then release these as (free) 15-minute podcasts. How awesome! The only challenge is to keep them lively. Without a roomful of students, a recorded conversation between two people could get stilted.

Also we need a title for the series. “Statistics Podcast” is pretty boring. “Statcast”? The topics we’ve had so far have been focused on statistical communication, but once we go through that, we could cover other statistical areas as well.

And then there’s the technical details: how to actually set up a podcast, also maybe it needs to be edited a bit?

So here’s what I’m needing from you:

– A title for the podcast series.

– Advice on production and distribution.

Our starting lineup

Here are some of the visitors we’ve had in our course so far. I’d plan to start with them, since I’ve already had good conversations with them.

I list the topic corresponding to each visitor, but the actual conversations ranged widely.

    Thomas Basbøll, Writing Consultant, Copenhagen Business School (topic: Telling a story)

    Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners (topic: Principles of statistical graphics, although the actual discussion ended up being all about educational testing, because that’s what the students’ questions were about)

    Deborah Nolan, Professor of Statistics, University of California (topic: Student activities and projects)

    Jessica Watkins, Department of Education, Tufts University (topic: Facilitating class participation)

    Justin Phillips, Professor of Political Science, Columbia University (topic: Classroom teaching)

    Beth Chance, Professor of Statistics, California Polytechnic State University (topic: Preparing and evaluating a class)

    Amanda Cox, Graphics Editor, New York Times (topic: Graphing data: what to do)

    Jessica Hullman, Assistant Professor of Information Visualization, University of Washington (topic: Graphing data: what works)

    Kaiser Fung, Senior Data Advisor, Vimeo (topic: Statistical reporting)

    Elke Weber, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Eric Johnson, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Cynthia Rudin, Associate Professor of Statistics, MIT (topic: Understanding fitted models)

    Kenny Shirley, Principal Inventive Scientist, Statistics Research Department, AT&T Laboratories (topic: Understanding fitted models)

    Tom Wood, Assistant Professor of Political Science, Ohio State University (topic: Displaying fitted models)

    Elizabeth Tipton, Assistant Professor of Applied Statistics, Teachers College, Columbia University (topic: Displaying fitted models)

    Brad Paley, Principal, Digital Image Design Incorporated (topic: Giving a presentation)

    Jared Lander, statistical consultant and author of R for Everyone (topic: Teaching in a non-academic environment)

    Jonah Gabry, Researcher, Department of Statistics, Department of Political Science, and Population Research Center, Columbia University (topic: Dynamic graphics)

    Martin Wattenberg, Data Visualization, Google (topic: Dynamic graphics)

    Hadley Wickham, Chief Scientist, RStudio (topic: Dynamic graphics)

    David Rindskopf, Professor of Educational Psychology, City University of New York (topic: Consulting)

    Shira Mitchell, Postdoctoral Researcher, Earth Institute, Columbia University (topic: Collaboration)

    Katherine Button, Lecturer, Department of Psychology, University of Bath (topic: Communication and its impact on science)

    Jenny Davidson, Professor of English, Columbia University (topic: Writing for a technical audience)

    Rachel Schutt, Senior Vice President of Data Science, News Corporation (topic: Communication with a non-technical audience)

    Leslie McCall, Professor of Sociology, Northwestern University (topic: Social research and policy)

    Yair Ghitza, Senior Scientist, Catalist (topic: Data processing)

    Bob Carpenter, Research Scientist, Department of Statistics, Columbia University (topic: Programming)

P.S. Lots of suggested titles in comments. My favorite title so far: Learning from Numbers.

P.P.S. I asked Sharad if he could come up with any names for the podcast and he sent me these:

White Noise
In the Noise
The Signal
Random Samples

I’ll have to nix the first suggestion as it’s a bit too accurate a description of the ethnic composition of myself and our guest stars. The third suggestion is pretty good but it’s almost a bit too slick. After all, we’re not the signal, we’re just a signal. I’m still leaning toward Learning from Numbers.

The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Alain Content writes:

I am a psycholinguist who teaches statistics (and also sometimes publishes in Psych Sci).

I am writing because as I am preparing for some future lessons, I fall back on a very basic question which has been worrying me for some time, related to the reasoning underlying NHST [null hypothesis significance testing].

Put simply, what is the rational justification for considering the probability of the test statistic and any more extreme value of it?

I know of course that the point value probability cannot be used, but I can’t figure the reasoning behind the choice of any more extreme value. I mean, wouldn’t it be as valid (or invalid) to consider for instance the probability of some (conventionally) fixed interval around the observed value? (My null hypothesis is that there is no difference between Belgians and Americans in chocolate consumption. I find a mean difference of say 3 kgs. I decide to reject H0 based on the probability of [2.9-3.1].)

My reply: There are 2 things going on:

1. The logic of NHST. To get this out of the way, I don’t like it. As we’ve discussed from time to time, NHST is all about rejecting straw-man hypothesis B and then using this to claim support for the researcher’s desired hypothesis A. The trouble is that both models are false, and typically the desired hypothesis A is not even clearly specified.

In your example, the true answer is easy: different people consume different amounts of chocolate. And the averages for two countries will differ. The average also differs from year to year, so a more relevant question might be how large are the differences between countries, compared to the variation over time, the variation across states within a country, the variation across age groups, etc.

2. The use of tail-area probabilities as a measure of model fit. This has been controversial. I don’t have much to say on this. On one hand, if a p-value is extreme, it does seem like we learn something about model fit. If you’re seeing p=.00001, that does seem notable. On the other hand, maybe there are other ways to see this sort of lack of fit. In my 1996 paper with Meng and Stern on posterior predictive checks, we did some p-values, but now I’m much more likely to perform a graphical model check.

In any case, you really can’t use p-values to compare model fits or to compare datasets. This example illustrates the failure of the common approach of using p-value as a data summary.

My main message is to use model checks (tail area probabilities, graphical diagnostics, whatever) to probe flaws in the model you want to fit—not as a way to reject null hypotheses.
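To make the tail-area versus fixed-interval comparison from the question concrete, here is a minimal sketch in Python. It assumes the mean difference is normally distributed around zero under the null, and the standard error of 1.5 kg is a made-up number; neither assumption comes from the correspondence above.

```python
from statistics import NormalDist

def tail_area(observed, se):
    """Two-sided p-value: P(|X| >= |observed|) under X ~ Normal(0, se)."""
    null = NormalDist(mu=0.0, sigma=se)
    return 2 * (1 - null.cdf(abs(observed)))

def interval_probability(observed, se, half_width=0.1):
    """P(observed - half_width < X < observed + half_width) under the null."""
    null = NormalDist(mu=0.0, sigma=se)
    return null.cdf(observed + half_width) - null.cdf(observed - half_width)

se = 1.5  # assumed standard error of the mean difference, in kg
for observed in (0.0, 3.0):
    print(
        "observed =", observed,
        " tail area =", round(tail_area(observed, se), 4),
        " interval prob =", round(interval_probability(observed, se), 4),
    )
```

With these numbers the interval probability is small at both observed values, including zero, where the data agree perfectly with the null. It mostly reflects the arbitrary interval width, so on its own it cannot signal misfit; the tail area at least equals 1 when the observed difference is exactly zero.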

“Chatting with the Tea Party”

I got an email last month offering two free tickets to the preview of a new play, Chatting with the Tea Party, described as “a documentary-style play about a New York playwright’s year attending Tea Party meetings around the country and interviewing local leaders. Nothing the Tea Party people in the play say has been made up.”

I asked if they could give me 3 tickets and they did, and I went with two family members.

I won’t be spoiling much if I share the plot: self-described liberal playwright talks with liberal friends during the rise of the conservative Tea Party movements, realizes he doesn’t know any Tea Party activists himself, so during his random travels around the country (as a playwright, he’s always going to some performance or workshop or another), he arranges meetings with Tea Party activists in different places. Some of these people say reasonable things, some of them say rude things, many have interesting personal stories. No issue attitudes get changed, but issues get explored.

The play, directed by Lynnette Barkley, had four actors; one played the role of the playwright, the others did the voices of the people he met. They did the different voices pretty well: each time it seemed like a new person. If Anna Deavere Smith or Mel Blanc had been there to do all the voices, it would’ve been amazing, but these actors did the job. And the playwright, Rich Orloff, did a good job compressing so many hours of interviews to yield some intense conversations.

There were two things that struck me during the watching of the play.

First, it would’ve been also interesting to see the converse: a conservative counterpart of the reasonable, pragmatic Orloff interviewing liberal activists. I could imagine a play that cut back and forth between the two sets of scenes. The play did have some scenes with Orloff’s know-nothing liberal NYC friends, but I think it would’ve worked better for them to be confronting an actual conservative, rather than just standing there expressing their biases.

Second, I was struck by how different the concerns of 2009-2010 were, compared to the live political issues now. Back then, it was all about the national debt, there were 3 trillion dollars being released into the economy, everything was gonna crash. Now the concerns seem more to do with national security and various long-term economic issues, but nothing like this spending-is-out-of-control thing. I guess this makes sense: with a Republican-controlled congress, there’s less concern that spending will get out of control. In any case, the central issues have changed. There’s still polarization, though, and still space for literary explorations of the topic. As a person who has great difficulty remembering exact dialogue myself, I’m impressed with a play that can capture all these different voices.

Where the fat people at?


Pearly Dhingra points me to this article, “The Geographic Distribution of Obesity in the US and the Potential Regional Differences in Misreporting of Obesity,” by Anh Le, Suzanne Judd, David Allison, Reena Oza-Frank, Olivia Affuso, Monika Safford, Virginia Howard, and George Howard, who write:

Data from BRFSS [the behavioral risk factor surveillance system] suggest that the highest prevalence of obesity is in the East South Central Census division; however, direct measures suggest higher prevalence in the West North Central and East North Central Census divisions. The regions’ relative ranking of obesity prevalence differs substantially between self-reported and directly measured height and weight.

And they conclude:

Geographic patterns in the prevalence of obesity based on self-reported height and weight may be misleading, and have implications for current policy proposals.

Interesting. Measurement error is important.

But, hey, what’s with this graph:

[Figure: the article’s parallel coordinates plot]

Who made this monstrosity? Ed Wegman?

I can’t imagine a clearer case for a scatterplot. Ummmm, OK, here it is:


Hmmm, I don’t see the claimed pattern between region of the country and discrepancy between the measures.

Maybe things will be clearer if we remove outlying Massachusetts:


Maryland’s a judgment call; I count my home state as northeastern but the cited report places it in the south. In any case, I think the scatterplot is about a zillion times clearer than the parallel coordinates plot (which, among other things, throws away information by reducing all the numbers to ranks).
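Here is a small sketch in Python of what "throws away information by reducing all the numbers to ranks" means. The four rates are invented for illustration and are not from the article.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

# Hypothetical obesity rates (percent) for four states under two measures.
self_reported = [24.0, 24.1, 24.2, 31.0]
measured = [25.0, 33.0, 33.1, 33.2]

print("ranks, self-reported:", ranks(self_reported))
print("ranks, measured:     ", ranks(measured))
print("gaps, measured minus self-reported:",
      [round(m - s, 1) for s, m in zip(self_reported, measured)])
```

On a rank-based plot these two measures would look identical, since every state keeps its position, even though the measured rates differ from the self-reports by anywhere from 1 to about 9 points. A scatterplot of the raw numbers keeps that information.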

P.S. Chris in comments suggests redoing the graphs with same scale on the two axes. Here they are:


It’s a tough call. These new graphs make the differences between the two assessments more clear, but then it’s harder to compare the regions. It’s fine to show both, I guess.

Hey—go to Iceland and work on glaciers!


Egil Ferkingstad and Birgir Hrafnkelsson write:

We have an exciting PhD position here at the University of Iceland on developing Bayesian hierarchical spatio-temporal models to the field of glaciology. Havard Rue at NTNU, Trondheim and Chris Wikle at the University of Missouri will also be part of the project.

The Department of Mathematics at the University of Iceland (UI) seeks applicants for a fully funded 3 year PhD position for the project Statistical Models for Glaciology.

The student will develop Bayesian hierarchical spatio-temporal models to the field of glaciology, working with a consortium of experts at the University of Iceland, the University of Missouri and the Norwegian University of Science and Technology. The key people in the consortium are Prof. Birgir Hrafnkelsson at UI, Prof. Chris Wikle, and Prof. Håvard Rue, experts in spatial statistics and Bayesian computation. Another key person is Prof. Gudfinna Adalgeirsdottir at UI, an expert in glaciology. The Glaciology group at UI possesses extensive data and knowledge about the Icelandic glaciers.

The application deadline is February 29, 2016.

Detailed project description:

Job ad with information on how to apply:

It’s a good day for cold research positions.

Summer internship positions for undergraduate students with Aki

There are a couple of cool summer internship positions for undergraduate students (BSc level) in Probabilistic Machine Learning group at Aalto (Finland) with me (Aki) and Samuel Kaski. Possible research topics are related to Bayesian inference, machine learning, Stan, disease risk prediction, personalised medicine, computational biology, contextual information retrieval, information visualization, etc. Application deadline 18 February. See more here.

Stunning breakthrough: Using Stan to map cancer screening!


Paul Alper points me to this article, Breast Cancer Screening, Incidence, and Mortality Across US Counties, by Charles Harding, Francesco Pompei, Dmitriy Burmistrov, Gilbert Welch, Rediet Abebe, and Richard Wilson.

Their substantive conclusion is there’s too much screening going on, but here I want to focus on their statistical methods:

Spline methods were used to model smooth, curving associations between screening and cancer rates. We believed it would be inappropriate to assume that associations were linear, especially since nonlinear associations often arise in ecological data. In detail, univariate thin-plate regression splines (negative binomial model to accommodate overdispersion, log link, and person-years as offset) were specified in the framework of generalized additive models and fitted via restricted maximum likelihood, as implemented in the mgcv package in R. . . .

To summarize cross-sectional changes in incidence and mortality, we evaluated the mean rate differences and geometric mean relative rates (RRs) associated with a 10–percentage point increase in the extent of screening across the range of data (39%-78% screening). The 95% CIs were calculated by directly simulating from the posterior distribution of the model coefficients (50 000 replicates conditional on smoothing parameters).

Can someone get these data and re-fit in Stan? I have no reason to think the published analysis by Harding et al. has any problems; I just think it would make sense to do it all in Stan, as this would be a cleaner workflow and easier to apply to new problems.
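For readers unfamiliar with the "simulate from the posterior distribution of the model coefficients" step, here is a generic sketch in Python. It is neither Harding et al.'s code nor mgcv: it replaces the spline with a two-coefficient log-linear model, treats the posterior as bivariate normal, and uses invented means, standard deviations, and correlation.

```python
import math
import random

def simulate_rate_difference(mean, sd, corr, x_low, x_high, n_draws, seed=0):
    """95% interval for rate(x_high) - rate(x_low) under a log-link model.

    Assumes log(rate) = b0 + b1 * x with (b0, b1) approximately bivariate
    normal; mean, sd, and corr describe that hypothetical posterior.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_draws):
        # Build correlated normal draws from two independent standard normals.
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        b0 = mean[0] + sd[0] * z0
        b1 = mean[1] + sd[1] * (corr * z0 + math.sqrt(1 - corr ** 2) * z1)
        diffs.append(math.exp(b0 + b1 * x_high) - math.exp(b0 + b1 * x_low))
    diffs.sort()
    lower = diffs[int(0.025 * n_draws)]
    upper = diffs[int(0.975 * n_draws) - 1]
    return lower, upper

# Invented values: log rate per 100,000 as a function of percent screened.
lower, upper = simulate_rate_difference(
    mean=(4.0, 0.015), sd=(0.10, 0.002), corr=-0.9,
    x_low=50, x_high=60, n_draws=50000,
)
print("95% interval for the rate difference:", round(lower, 1), "to", round(upper, 1))
```

The appeal of the simulation approach is that the same draws give an interval for any function of the coefficients, here a difference of exponentials, without deriving a new standard-error formula each time. A full Bayesian fit in Stan would produce such draws directly.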

P.S. See comments for some discussions by Charles Harding, author of the study in question.

When does peer review make no damn sense?

Disclaimer: This post is not peer reviewed in the traditional sense of being vetted for publication by three people with backgrounds similar to mine. Instead, thousands of commenters, many of whom are not my peers—in the useful sense that, not being my peers, your perspectives are different from mine, and you might catch big conceptual errors or omissions that I never even noticed—have the opportunity to point out errors and gaps in my reasoning, to ask questions, and to draw out various implications of what I wrote. Not “peer reviewed”; actually peer reviewed and more; better than peer reviewed.


Last week we discussed Simmons and Simonsohn’s survey of some of the literature on the so-called power pose, where they wrote:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.


Even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it.

The first response of one of the power-pose researchers was:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. . . . I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

This response was pleasant enough but I found it unsatisfactory because it did not even consider the possibility that her original finding was spurious.

After Kaiser Fung and I publicized Simmons and Simonsohn’s work in Slate, the power-pose author responded more forcefully:

The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture.

Amy Cuddy, the author of this response, did not at any place explain how Simmons and Simonsohn were “flat-out wrong,” nor did she list even one of the mistakes with which their analyses were “riddled.”

Peer review

The part of the above quote I want to focus on, though, is the phrase “non-peer-reviewed.” Peer-reviewed papers have errors, of course (does the name “Daryl Bem” ring a bell?). Two of my own published peer-reviewed articles had errors so severe as to destroy their conclusions! But that’s ok, nobody’s claiming perfection. The claim, I think, is that peer-reviewed articles are much less likely to contain errors, as compared to non-peer-reviewed articles (or non-peer-reviewed blog posts). And the claim behind that, I think, is that peer review is likely to catch errors.

And this brings up the question I want to address today: What sort of errors can we expect peer review to catch?

I’m well placed to answer this question as I’ve published hundreds of peer-reviewed papers and written thousands of referee reports for journals. And of course I’ve also done a bit of post-publication review in recent years.

To jump to the punch line: the problem with peer review is with the peers.

In short, if an entire group of peers has a misconception, peer review can simply perpetuate error. We’ve seen this a lot in recent years. For example, that paper on ovulation and voting was reviewed by peers who didn’t realize the implausibility of 20-percentage-point vote swings during the campaign, peers who also didn’t know about the garden of forking paths. That paper on beauty and sex ratio was reviewed by peers who didn’t know much about the determinants of sex ratio and didn’t know much about the difficulties of estimating tiny effects from small sample sizes.

OK, let’s step back for a minute. What is peer review good for? Peer reviewers can catch typos, they can catch certain logical flaws in an argument, they can notice the absence of references to the relevant literature—that is, the literature that the peers are familiar with. That’s how the peer reviewers for that psychology paper on ovulation and voting didn’t catch the error of claiming that days 6-14 were the most fertile days of the cycle: these reviewers were peers of the people who made the mistake in the first place!

Peer review has its place. But peer reviewers have blind spots. If you want to really review a paper, you need peer reviewers who can tell you if you’re missing something within the literature—and you need outside reviewers who can rescue you from groupthink. If you’re writing a paper on himmicanes and hurricanes, you want a peer reviewer who can connect you to other literature on psychological biases, and you also want an outside reviewer—someone without a personal and intellectual stake in you being right—who can point out all the flaws in your analysis and can maybe talk you out of trying to publish it.

Peer review is subject to groupthink, and peer review is subject to incentives to publish things that the reviewers are already working on.

This is not to say that a peer-reviewed paper is necessarily bad—I stand by over 99% of my own peer-reviewed publications!—rather, my point is that there are circumstances in which peer review doesn’t give you much.

To return to the example of power pose: There are lots of papers in this literature and there’s a group of scientists who believe that power pose is real, that it’s detectable, and indeed that it can help millions of people. There’s also a group of scientists who believe that any effects of power pose are small, highly variable, and not detectable by the methods used in the leading papers in this literature.

Fine. Scientific disagreements exist. Replication studies have been performed on various power-pose experiments (indeed, it’s the null result from one of these replications that got this discussion going), and the debate can continue.

But my point here is that peer review doesn’t get you much. The peers of the power-pose researchers are . . . other power-pose researchers. Or researchers on embodied cognition, or on other debatable claims in experimental psychology. Or maybe other scientists who don’t work in this area but have heard good things about it and want to be supportive of this work.

And sometimes a paper will get unsupportive reviews. The peer review process is no guarantee. But then authors can try again until they get those three magic positive reviews. And peer review—review by true peers of the authors—can be a problem, if the reviewers are trapped in the same set of misconceptions, the same wrong framework.

To put it another way, peer review is conditional. Papers in the Journal of Freudian Studies will give you a good sense of what Freudians believe, papers in the Journal of Marxian Studies will give you a good sense of what Marxians believe, and so forth. This can serve a useful role. If you’re already working in one of these frameworks, or if you’re interested in how these fields operate, it can make sense to get the inside view. I’ve published (and reviewed papers for) the journal Bayesian Analysis. If you’re anti-Bayesian (not so many of these anymore), you’ll probably think all these papers are a crock of poop and you can ignore them, and that’s fine.

(Parts of) the journals Psychological Science and PPNAS have been the house organs for a certain variety of social psychology that a lot of people (not just me!) don’t really trust. Publication in these journals is conditional on the peers who believe the following equation:

“p less than .05” + a plausible-sounding theory = science.

Lots of papers in recent years by Uri Simonsohn, Brian Nosek, John Ioannidis, Katherine Button, etc etc etc., have explored why the above equation is incorrect.

But there are some peers that haven’t got the message yet. Not that they would endorse the above statement when written as crudely as in that equation, but I think this is how they’re operating.

And, perhaps more to the point, many of the papers being discussed are several years or even decades old, dating back to a time when almost nobody (myself included) realized how wrong the above equation is.

Back to power pose

And now back to the power pose paper by Carney et al. It has many garden-of-forking-paths issues (see here for a few of them). Or, as Simonsohn would say, many researcher degrees of freedom.

But this paper was published in 2010! Who knew about the garden of forking paths in 2010? Not the peers of the authors of this paper. Maybe not me either, had it been sent to me to review.

What we really needed (and, luckily, we can get) is post-publication review: not peer reviews, but outside reviews, in this case reviews by people who are outside of the original paper both in research area and in time.

And also this, from another blog comment:

It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble the reported chi-square is 3.86 which has an associated p-value of .04945.

Of course, this reported chi-square value does not seem to match the data because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with this data is 3.7667 and not 3.86 as reported in the paper. The associated p-value is .052 – not less than .05.

You can’t expect peer reviewers to check these sorts of calculations—it’s not like you could require authors to supply their data and an R or Stata script to replicate the analyses, ha ha ha. The real problem is that the peer reviewers were sitting there, ready to wave past the finish line a result with p less than .05, which provides an obvious incentive for the authors to get p less than .05, one way or another.
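For what it's worth, the commenter's recalculation takes only a few lines to check. Here is a minimal sketch in Python, assuming the cell counts quoted above (19 and 3 in the high power-pose condition, 12 and 8 in the low one) and assuming the test is a plain Pearson chi-square with no continuity correction, which the comment does not state. The function names are mine, not anything from the paper or the comment.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    # With 1 df, chi2 is the square of a standard normal, so the tail area
    # is the two-sided normal tail: erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Counts as quoted in the comment (assumed to match page 4 of the paper):
# high power pose: 19 gambled, 3 did not; low power pose: 12 gambled, 8 did not.
chi2 = chi_square_2x2(19, 3, 12, 8)
p = p_value_1df(chi2)
p_reported = p_value_1df(3.86)  # p-value implied by the chi-square reported in the paper

print(f"recomputed chi-square = {chi2:.4f}, p = {p:.3f}")
print(f"reported chi-square = 3.86, implied p = {p_reported:.5f}")
```

Under those assumptions this should give back the commenter's numbers: a chi-square of about 3.7667 with p of about .052, versus p of about .04945 for the reported 3.86. A continuity-corrected test would give a different (larger) p-value, so the choice of test matters here.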

Commenters also pointed out an earlier paper by one of the same authors, this time on stereotypes of the elderly, from 2005, that had a bunch more garden-of-forking-paths issues and also misreported two t statistics: the actual values were something like 1.79 and 3.34; the reported values were 5.03 and 11.14! Again, you can’t expect peer reviewers to catch these problems (nobody was thinking about forking paths in 2005, and who’d think to recalculate a t statistic?), but outsiders can find them, and did.

At this point one might say that this doesn’t matter, that the weight of the evidence, one way or another, can’t depend on whether a particular comparison in one paper was or was not statistically significant—but if you really believe this, what does it say about the value of the peer-reviewed publication?

Again, I’m not saying that peer review is useless. In particular, peers of the authors should be able to have a good sense of how the storytelling theorizing in the article fits in with the rest of the literature. Just don’t expect peers to do any assessment of the evidence.

Linking as peer review

Now let’s consider the Simmons and Simonsohn blog post. It’s not peer reviewed—except it kinda is! Kaiser Fung and I chose to cite Simmons and Simonsohn in our article. We peer reviewed the Simmons and Simonsohn post.

This is not to say that Kaiser and I are certain that Simmons and Simonsohn made no mistakes in that post; peer review never claims that sort of perfection.

But I’d argue that our willingness to cite Simmons and Simonsohn is a stronger peer review than whatever was done for those two articles cited above. I say this not just because those papers had demonstrable errors which affect their conclusions (and, yes, in the argot of psychology papers, if a p-value shifts from one side of .05 to the other, it does affect the conclusions).

I say this also because of the process. When Kaiser and I cite Simmons and Simonsohn in the way that we do, we’re putting a little bit of our reputation on the line. If Simmons and Simonsohn made consequential errors—and, hey, maybe they did, I didn’t check their math, any more than the peer reviewers of the power pose papers checked their math—that rebounds negatively on us, that we trusted something untrustworthy. In contrast, the peer reviewers of those two papers are anonymous. The peer review that they did was much less costly, reputationally speaking, than ours. We have skin in the game, they do not.

Beyond this, Simmons and Simonsohn say exactly what they did, so you can work it out yourself. I trust this more than the opinions of 3 peers of the authors in 2010, or 3 other peers in 2005.


Peer review can serve some useful purposes. But to the extent the reviewers are actually peers of the authors, they can easily have the same blind spots. I think outside review can serve a useful purpose as well.

If the authors of many of these PPNAS or Psychological Science-type papers really don’t know what they’re doing (as seems to be the case), then it’s no surprise that peer review will fail. They’re part of a whole peer group that doesn’t understand statistics. So, from that perspective, perhaps we should trust “peer review” less than we should trust “outside review.”

I am hoping that peer review in this area will improve, given the widespread discussion of researcher degrees of freedom and garden of forking paths. Even so, though, we’ll continue to have a “legacy” problem of previously published papers with all sorts of problems, up to and including t statistics misreported by factors of 3. Perhaps we’ll have to speak of “post-2015 peer-reviewed articles” and “pre-2015 peer-reviewed articles” as different things?

On deck this week

Mon: When does peer review make no damn sense?

Tues: Stunning breakthrough: Using Stan to map cancer screening!

Wed: Where the fat people at?

Thurs: The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Fri: What’s the difference between randomness and uncertainty?

Sat: You’ll never guess what I say when I have nothing to say

Sun: I refuse to blog about this one

I don’t know about you, but I love these blog titles. Each week I put together this “on deck” post and I get interested all over again in these topics. I wrote most of these so many months ago, I have no idea what’s in them. I’m looking forward to these posts almost as much as you are!

What a great way to start the work week.

Ted Cruz angling for a position in the Stanford poli sci department

In an amusing alignment of political and academic scandals, presidential candidate Ted Cruz was blasted for sending prospective voters in the Iowa Caucus this misleading mailer:


Which reminds me of the uproar two years ago when a couple of Stanford political science professors sent prospective Montana voters this misleading mailer:

[Image: screenshot of the Montana mailer]

I don’t know which is worse: having a “voting violation” in Iowa or being almost as far left as Barack Obama in Montana.

There is well known research in political science suggesting that shaming people can motivate them to come out and vote, so I can understand how Cruz can describe this sort of tactic as “routine.”

It’s interesting, though: In 2014, some political scientists got into trouble by using campaign-style tactics in a nonpartisan election (and also for misleading potential voters by sending them material with the Montana state seal). In 2016, a political candidate is getting into trouble by using political-science-research tactics in a partisan election (and also for misleading potential voters with a “VOTING VIOLATION” note).

What’s the difference between Ted Cruz and a Stanford political scientist?

Some people wrote to me questioning the link I’m drawing above between Cruz and the Stanford political scientists. So let me emphasize that I know of no connections here. I don’t even know if Cruz has any political scientists on his staff, and I’m certainly not trying to suggest that the Stanford profs in question are working for Cruz or for any other presidential candidate. I have no idea. Nor would I think it a problem if they are. I was merely drawing attention to the similarities between Cruz’s item and the Montana mailer from a couple years back.

I do think what Cruz did is comparable to what the political scientists did.

There are some differences:

1. Different goals: Cruz wants to win an election, the political scientists wanted to do research.

2. Different time frames: Cruz is in a hurry and got sloppy, the political scientists had more time and could be more careful with the information on their mailers.

But I see two big similarities:

1. Research-based manipulation of voters: Cruz is working off of the effects of social pressure on turnout, the political scientists were working off the effects of perceived ideology on turnout.

2. Misleading information: Cruz is implying that people have some sort of obligation to vote, the political scientists were implying that their mailer was coming from the State of Montana.

Postdoc opportunity with Sophia Rabe-Hesketh and me in Berkeley!

Sophia writes:

Mark Wilson, Zach Pardos and I are looking for a postdoc to work with us on a range of projects related to educational assessment and statistical modeling, such as Bayesian modeling in Stan (joint with Andrew Gelman).

See here for more details.

We will accept applications until February 26.

The position is for 15 months, starting this Spring. To be eligible, applicants must be U.S. citizens or permanent residents.