
The never-back-down syndrome and the fundamental attribution error


David Allison told me about a frustrating episode in which he published a discussion where he pointed out problems with a published paper, and the authors replied with . . . not even a grudging response, they didn’t give an inch, really ungracious behavior. No “Thank you for finding our errors”; instead they wrote:

We apologize for the delay in the reply to Dr Allison’s letter of November 2014, this was probably due to the fact that it was inadvertently discarded.

Which would be kind of a sick burn except that they’re in the wrong here.

Anyway, I wrote this to Allison:

Yeah, it would be too much for them to consider the possibility they might have made a mistake!

It’s funny, in medical research, it’s accepted that a researcher can be brilliant, creative, well-intentioned, and come up with a theory that happens to be wrong. You can even have an entire career as a well-respected medical researcher and pursue a number of dead ends, and we accept that; it’s just life, there’s uncertainty, the low-hanging fruit have all been picked, and we know that the attempted new cancer cure is just a hope.

And researchers in other fields know this too, presumably. We like to build big exciting theories, but big exciting theories can be wrong.

But . . . in any individual case, researchers never want to admit error. A paper can be criticized and criticized and criticized, and the pattern is to not even consider the possibility of a serious mistake. Even the authors of that ovulation-and-clothing paper, or the beauty-and-sex-ratio paper, or the himmicanes paper, never gave up.

It makes no sense. Do these researchers think that only “other people” make errors?

And Allison replied:

The phenomenon you note seems like a variant on what psychologists call the Fundamental Attribution Error.

Interesting point. I know about the fundamental attribution error and I think a lot about refusal to admit mistakes, but I’d never made the connection. More should be done on this. I’m envisioning a study with 24 undergrads and 100 Mechanical Turk participants that we can publish in Psych Sci or PPNAS if they don’t have any ESP or himmicane studies lined up.

No, really, I do think the connection is interesting and I would like to see it studied further. I love the idea of trying to understand the stubborn anti-data attitudes of so many scientists. Rather than merely bemoaning these attitudes (as I do) or cynically accepting them (as Steven Levitt has), we could try to systematically learn about them. I mean, sure, people have incentives to lie, exaggerate, cheat, hide negative evidence, etc.—but it’s situational. I doubt that researchers typically think they’re doing all these things.

It’s not about the snobbery, it’s all about reality: At last, I finally understand hatred of “middlebrow”

I remember reading Dwight Macdonald and others slamming “middlebrows” and thinking, what’s the point? The classic argument from the 1940s onward was to say that true art (James Joyce, etc.) was ok, and true mass culture (Mickey Mouse and detective stories) was cool, but anything in the middle (John Marquand, say) was middlebrow and deserved mockery and disdain. The worst of the middlebrow was the stuff that mainstream newspaper critics thought was serious and uplifting.

When I’d read this, I’d always rebel a bit. I had no particular reason to doubt most of the judgments of Macdonald etc. (although I have to admit to being a Marquand fan), but something about the whole highbrow/middlebrow/lowbrow thing bugged me: If lowbrow art could have virtues (and I have no doubt that it can), then why can’t middlebrow art also have these positive qualities?

What I also couldn’t understand was the almost visceral dislike that Macdonald and other critics felt for the middlebrow. So what if some suburbanites were patting themselves on the back for their sophistication in reading John Updike? Why deprive them of that simple pleasure, and why hold that against Updike?

But then I had the same feeling myself, the same fury against the middlebrow, and I think I understand where Macdonald etc. were coming from.

It came up after the recent “air rage” story, in which a piece of PPNAS-tagged junk science got the royal treatment at the Economist, NPR, Science magazine, etc. etc.

This is “middlebrow science.” It goes about in the trappings of real science, is treated as such by respected journalists, but it’s trash.

To continue the analogy: true science is fine, and true mass culture (for example, silly news items about Elvis sightings and the Loch Ness monster) is fine too, in that nobody is taking it for real science. But the Gladwell/Easterbrook/PPNAS/PsychScience/NPR axis . . . this is the middlebrow stuff I can’t stand. It lacks the rigor of real science, yet it is not treated by journalists with the disrespect it deserves.

And I think that’s how Macdonald felt about middlebrow literature: bad stuff is out there, but seeing bad stuff taken so seriously by opinion-makers, that’s just painful.

P.S. Let me clarify based on some things that came up in comments. I don’t think middlebrow is necessarily bad. I’m a big fan of Marquand and Updike, for example. Similarly, when it comes to popular science, there’s lots of stuff that I like that also gets publicity in places such as NPR. Simplification is fine too. The point, I think, is that work has to be judged on its own merits, that the trappings of seriousness should not be used as an excuse to abdicate critical responsibility.

Astroturf “patient advocacy” group pushes to keep drug prices high


Susan Perry tells the story:

Patients Rising, [reporter Trudy Lieberman] reports, was founded by Jonathan Wilcox, a corporate communications and public relations consultant and adjunct professor at USC’s Annenberg School of Communications, and his wife, Terry, a producer of oncology videos. . . .

Both Wilcox and his wife had worked with Vital Options International, another patient advocacy group with a special mission of generating global cancer conversations. She is a former executive director. A search of [Vital Options International’s] website showed that drug industry heavy hitters, such as Genentech, Eli Lilly, and Bristol-Myers Squibb, had in the past sponsored some of the group’s major activities . . .

Patients Rising is pushing back particularly strongly against Dr. Peter Bach, an epidemiologist at New York City’s Memorial Sloan Kettering Cancer Center, who has been outspoken about the high cost of cancer drugs.

Pretty horrible. Political advocacy is fine, and it could well be that there are good reasons for drug prices to remain high. But faking a patient advocacy organization, that’s not cool.

I will say, though, that artificial turf is a lot more pleasant than it used to be. 20 years ago, it felt like concrete; now it feels a lot more like grass. Except on really hot days when the turf feels like hot tar.

Full disclosure: I am working with colleagues at Novartis and getting paid for it.

Handy Statistical Lexicon — in Japanese!


So, one day I get this email from Kentaro Matsuura:

Dear Professor Andrew Gelman,

I’m a Japanese Stan user and write a blog to promote Stan.

I believe your post on “Handy statistical lexicon” is so great that I’d like to translate and spread the post in my blog. Could I do that?


Wow, how cool is that? Of course I said yes, please do it.

A week later Kentaro wrote to ask for a favor:

Could I change some terms slightly so that Japanese could have more familiarity? For example, there is no “self-cleaning oven” in Japan, but there is “self-cleaning air conditioner”.

I had no idea.

And here it is!

Don’t trust Rasmussen polls!


Political scientist Alan Abramowitz brings us some news about the notorious pollster:

In the past 12 months, according to Real Clear Politics, there have been 72 national polls matching Clinton with Trump—16 polls conducted by Fox News or Rasmussen and 56 polls conducted by other polling organizations. Here are the results:

Trump has led or been tied with Clinton in 44 percent (7 of 16) of Fox and Rasmussen Polls: 3 of 5 Rasmussen Polls and 4 of 11 Fox News Polls.

Trump has led or been tied with Clinton in 7 percent (4 of 56) of polls conducted by other polling organizations.

To put it another way, Fox and Rasmussen together have accounted for 22 percent of all national polls in the past year but they have accounted for 64 percent of the polls in which Trump has been leading or tied with Clinton.

Using Pollster’s tool that allows you to calculate polling averages with different types of polls and polling organizations excluded:

Current Pollster average: Clinton +2.7
Removing Rasmussen and Fox News: Clinton +7.7
Live Interview polls only: Clinton +8.8
Live interview polls without Fox News: Clinton +9.2

I find it remarkable that simply removing Rasmussen and Fox changes the average by 5 points.
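
Those exclusion numbers are just subsetting and averaging. Here's the shape of the calculation in R, with invented margins rather than the actual Pollster database, plus a crude house-effect number for a single pollster:

```r
# Toy calculation (invented Clinton-minus-Trump margins, not the actual
# Pollster database), showing how exclusions and a crude "house effect"
# for one pollster would be computed.
polls <- data.frame(
  pollster = c("Rasmussen", "Fox News", "A", "B", "C", "D", "E", "F"),
  margin   = c(-2, 0, 6, 8, 5, 9, 7, 6)   # invented margins, in points
)

mean(polls$margin)                                           # average over all polls
with(subset(polls, !pollster %in% c("Rasmussen", "Fox News")),
     mean(margin))                                           # excluding those two

# Crude house effect for one pollster: its average margin minus everyone else's
mean(polls$margin[polls$pollster == "Rasmussen"]) -
  mean(polls$margin[polls$pollster != "Rasmussen"])
```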

Hey—I remember Rasmussen! They’re a bunch of clowns.


Here are a couple of old posts about Rasmussen.

From 2010:

Rasmussen polls are consistently to the right of other polls, and this is often explained in terms of legitimate differences in methodological minutiae. But there seems to be evidence that Rasmussen’s house effect is much larger when Republicans are behind, and that it appears and disappears quickly at different points in the election cycle.

From 2008:

I was looking up the governors’ popularity numbers on the web, and came across this page from Rasmussen Reports which shows Sarah Palin as the 3rd-most-popular governor. But then I looked more carefully. Janet Napolitano of Arizona is viewed as Excellent by 28% of respondents, Good by 27%, Fair by 26%, and Poor by 27%. That adds up to 108%! What’s going on? I’d think they would have a computer program to pipe the survey results directly into the spreadsheet. But I guess not, someone must be entering these numbers by hand. Weird.

I just checked that page again and it’s still wrong:

[Screenshot of the Rasmussen Reports page, taken May 20, 2016]

What ever happened to good old American quality control?

But, hey, it’s a living. Produce crap numbers that disagree with everyone else and you’re gonna get headlines.

You’d think news organizations would eventually twig to this particular scam and stop reporting Rasmussen numbers as if they’re actually data, but I guess polls are the journalistic equivalent of crack cocaine.

Given that major news organizations are reporting whatever joke study gets released in PPNAS, I guess we shouldn’t be surprised they’ll fall for Rasmussen, time and time again. It’s inducing stat rage in me nonetheless.

If only science reporters and political reporters had the standards of sports reporters. We can only dream.

Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit


People point me to things on the internet that they’re sure I’ll hate. I read one of these a while ago—unfortunately I can’t remember who wrote it or where it appeared, but it raised a criticism, not specifically of me, I believe, but more generally of skeptics such as Uri Simonsohn and myself who keep bringing up p-hacking and the garden of forking paths.

The criticism that I read is wrong, I think, but it has a superficial appeal, so I thought it would be worth addressing here.

The criticism went like this: People slam classical null-hypothesis-significance-testing (NHST) reasoning (the stuff I hate, all those “p less than .05” papers in Psychological Science, PPNAS, etc., on ESP, himmicanes, power pose, air rage, . . .) on two grounds: first, that NHST makes no sense, and second, that published p-values are wrong because of selection, p-hacking, forking paths, etc. But, the criticism continues, this anti-NHST attitude is itself self-contradictory: if you don’t like p-values anyway, why care that they’re being done wrong?

The author of this post (that I now can’t find) characterized anti-NHST critics (like me!) as being like the diner in that famous Borscht Belt routine, who complains about the food being so bad. And such small portions!

And, indeed, if we think NHST is such a bad idea, why do we then turn around and say that p-values are being computed wrong? Are we in the position of the atheist who goes into church in order to criticize the minister on his theology?

No, and here’s why. Suppose I come across some piece of published junk science, like the ovulation-and-clothing study. I can (and do) criticize it on a number of grounds. Suppose I lay off the p-values, saying that I wouldn’t compute a p-value here in any case so who cares. Then a defender of the study could easily stand up and say, Hey, who cares about these criticisms? The study has p less than .05, this will only happen 5% of the time if the null hypothesis is true, thus this is good evidence that the null hypothesis is false and there’s something going on! Or, a paper reports 9 different studies, each of which is statistically significant at the 5% level. Under the null hypothesis, the probability of this happening is (1/20)^9, thus the null hypothesis can be clearly rejected. In these cases, I think it’s very helpful to be able to go back and say, No, because of p-hacking and forking paths, the probability you’ll find something statistically significant in each of these experiments is quite high.
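
Here's a little simulation of that point (toy numbers, not tied to any particular paper): if each study allows many reasonable analyses and any one of them crossing p < .05 counts as a success, then the chance that all 9 studies "work" under the null is nowhere near (1/20)^9.

```r
# Simulation: how forking paths inflate the chance of finding "p < .05"
# when all true effects are zero. Each simulated study has 20 plausible
# analyses (outcomes, subgroups, covariate choices, etc.); the researcher
# reports a finding if any one of them crosses the threshold.
# (Analyses treated as independent for simplicity; real forking paths
# reuse the same data, but the qualitative point is the same.)
set.seed(123)
n_sims     <- 2000   # number of simulated studies
n_analyses <- 20     # plausible analyses per study
n_per_arm  <- 50     # sample size per group in each comparison

one_study <- function() {
  pvals <- replicate(n_analyses, {
    y1 <- rnorm(n_per_arm)   # "control" group, true effect is zero
    y2 <- rnorm(n_per_arm)   # "treatment" group, true effect is zero
    t.test(y1, y2)$p.value
  })
  min(pvals) < 0.05          # did at least one analysis come out "significant"?
}

prob_one_study <- mean(replicate(n_sims, one_study()))
prob_one_study     # roughly 1 - 0.95^20, i.e., around 0.64
prob_one_study^9   # chance that 9 studies in a row each "work": about 2 percent,
                   # nothing like the advertised (1/20)^9
```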

The reason I’m pointing out the problems with published p-values is not because I think researchers should be doing p-values “right,” whatever that means. I’m doing it in order to reduce the cognitive dissonance. So, when a paper such as power pose fails to replicate, my reaction is not, What a surprise: this statistically significant finding did not replicate!, but rather, No surprise: this noisy result looks different in a replication.

It’s similar to my attitude toward preregistration. It’s not that I think that studies should be preregistered; it’s that when a study is not preregistered, it’s hard to take p-values, and the reasoning that usually flows from them, at face value.

NPR’s gonna NPR


I was gonna give this post the title, Stat Rage More Severe in the Presence of First-Class Journals, but then I thought I’d keep it simple.

Chapter 1. Background

OK, here’s what happened. A couple weeks ago someone pointed me to a low-quality paper that appeared in PPNAS (the prestigious Proceedings of the National Academy of Sciences), edited by the same person who approved the notorious himmicanes paper and who had earlier published a paper with one of the authors of the notorious power pose paper. This new article was an attempt to understand the sources of “air rage” but unfortunately the actual analysis was a big uninterpretable multiple regression on some observational data.

Just to get the criticisms out of the way, here’s what I wrote, explaining why I couldn’t believe any of the claims in that article:

The interpretation of zillions of regression coefficients, each one controlling for all the others. For example, “As predicted, front boarding of planes predicted 2.18-times greater odds of an economy cabin incident than middle boarding (P = 0.005; model 2), an effect equivalent to an additional 5-h and 58-min flight delay (0.7772 front boarding/0.1305 delay hours).” What does it all mean? Who cares!

Story time: “We argue that exposure to both physical and situational inequality can result in antisocial behavior. . . . even temporary exposure to physical inequality—being literally placed in one’s “class” (economy class) for the duration of a flight—relates to antisocial behavior . . .”

A charming reference in the abstract to testing of predictions, even though no predictions were supplied before the data were analyzed.

They report a rate of incidents of 1.58 per thousand flights in economy seats on flights with first class, .14 per thousand flights in economy seats with no first class, and .31 per thousand flights in first class.

It seems like these numbers are per flight, not per passenger, but that can’t be the right way to make the comparison: lots more people are in economy class than in first class, and flights with first class seats tend to be in bigger planes than flights with no first class seats. This isn’t as bad as the himmicanes analysis but it displays a similar incoherence.

I didn’t explain all these points in detail—this was a blog post, not a textbook or even a referee report—but it was all there. To spell it out: My first quoted paragraph above addressed the problem of the regression coefficients, which is the same problem you noted in your comment. My second quoted paragraph is relevant in that the paper makes claims about human behavior which are not supported by their data. My third quoted paragraph pointed out the non-preregistered nature of the analysis. As always, preregistration is not required, but when this sort of completely open-ended study is not preregistered, this calls into question all claims of prediction accuracy and p-values. My fourth and fifth paragraphs address the point that the direct comparisons presented in the paper are uninterpretable.
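
To make the per-flight vs. per-passenger point concrete, here's a toy calculation: the incident rates are the ones quoted above, but the cabin sizes are invented, since the actual data aren't available.

```r
# Toy calculation of the per-flight vs. per-passenger problem.
# Incident rates (per thousand flights) are the ones quoted from the paper;
# the cabin sizes are invented for illustration.
econ_rate   <- 1.58   # economy incidents per 1000 flights, planes with first class
first_rate  <- 0.31   # first-class incidents per 1000 flights
econ_seats  <- 150    # assumed economy passengers per flight
first_seats <- 12     # assumed first-class passengers per flight

econ_rate / first_rate                                 # per flight: economy looks ~5x worse
(econ_rate / econ_seats) / (first_rate / first_seats)  # per passenger: ratio drops to ~0.4
# The economy rate is spread over many more people, so without per-passenger
# denominators (and comparable planes) the comparison is uninterpretable.
```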

Finally, the data are unavailable so it is impossible for an outsider to evaluate any of these claims. If the data were public, I’d recommend publication under a much lower standard, because once the data are out there, others could do their own analyses.

For another take on data problems with the “air rage” paper, see this post by John Walton.

Chapter 2. First mention of NPR

When posting on this study, I threw gratuitous shade at one of America’s most trusted news sources in my “tl;dr summary”:

NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.

The next day I posted a roundup of media outlets that’d fallen for this story, including CNN, the LA Times, and ABC News, along with respected tech sources Science and BoingBoing. I discussed the selection bias that occurs when the best science reporters realize this study is empty and don’t report it, while everyday journalists just follow the PPNAS label and don’t even think there could be a problem. All jokes about “stat rage” aside, this is a big problem in that consumers of the news see only the sucker takes, never the knowledgeable ones.

But NPR wasn’t included in that media roundup, hence I wrote “I was unfair to NPR,” and commenter Sepp asked,

Why the dig at NPR? And why the implication that NPR listeners cannot distinguish good scientific articles from bad ones that agree with listeners’ values? On that note, why the implicit indictment of said values (i.e. the desire to reduce inequality, etc.)? I find these statements saddening and confusing.

Chapter 3. The return of NPR

But then, after all that, NPR bit on the story—multiple times!

I’m sad to say that our public radio network lived up to its reputation. And I really am sad. I’d be much happier to report that they showed admirable skepticism and restraint. But they didn’t:

– Wait Wait Don’t Tell Me (or so I’ve been told; I haven’t seen the transcript so maybe they were actually mocking the study; I can only hope.)

Planet Money:

[Screenshot of Planet Money’s post on the air rage study]

– And finally, this from Alva Noë:

[Screenshot of Alva Noë’s NPR piece on the study]

The nation’s finest news source, indeed. It could’ve been worse: it could’ve been mentioned on NPR’s evening news show, but still, it’s disappointing.

Chapter 4. Summary

I return to my original statement:

NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.

NPR was not the only prestige outlet to be fooled.

Science magazine fell for this story (“Air rage? Blame the first-class cabin”). No surprise they were duped, I guess: tabloids gotta stick together.

But I was disappointed to see the usually-skeptical Economist take the bait too (“Resentment of first-class passengers can be a cause of air rage”). The Economist is no great enemy of inequality so they had no particular political reason to like this one. I guess they got conned by the PPNAS label.

Good job, PPNAS: another short-term win for the PPNAS publicity machine, another long-term loss for your reputation. Or, should I say, medium-term, as I still have hope that you will clean up your act someday.

Let me conclude with this, from a commenter on the Economist article who understands this better than the Economist’s own reporter:

[Screenshot of the reader’s comment on the Economist article]

I think NPR and these other news outlets can do better. And really, who goes into journalism to reheat press releases, anyway?

And, hey, to the authors of the paper: It’s ok. Everyone makes mistakes. Statistics is hard. I’m sure you were intending to do good science here. I’m not quite sure what to recommend for you. For this particular line of study: Sure, I’d recommend you cut your losses and release a short statement recognizing that the data don’t support your claims. After that? Maybe bring in a couple collaborators, one who knows about airlines and one who knows about causal inference. Sure, that takes work, but it’s kind of a necessity if (a) you want to learn anything useful from these data, and (b) you’re not willing or able to just make the data public.

For future research on other topics: Hmmm, I guess my quick suggestion for any future paper is to present it at seminars in a few econ departments. Economists are pretty tough. They don’t catch every error but they might’ve caught these. And, for God’s sake, don’t submit to PPNAS anymore. Look what happened last time.

In all sincerity, I wish you the best, and I urge you to reject the lure of the quick publication. PPNAS isn’t doing you any favors by publishing work like this (except, I guess, indirectly, in that now you’re getting some free advice from me). It’s a dysfunctional relationship we have here, between journals that seek publicity, news organizations that are all too willing to essentially run press releases, and researchers who often just don’t know better, and are led to believe that anything with “p less than .05” that’s published in a journal is good for them. Time to jump off the merry-go-round.

P.S. I wrote this post several months ago—this blog’s on a lag—and it just happened to appear today.

Don’t move Penn Station

I agree 100% with Henry Grabar on this one. Ever since I heard many years ago about the plan to blow a few billion dollars moving NYC’s Penn Station to a prettier but less convenient location, I’ve grimaced. Big shots really love to spend our money on fancy architecture, don’t they?

As I wrote a few years ago, my guess is that the new Penn Station will be a lot more like an airport. Bright and airy, some top-end stores, it would look beautiful if it weren’t filled with thousands of people trying to get on and off the train in rush hour.

Here’s hoping some of our elected representatives can derail this project, as it were, or that the Amtrak management can convince them to spend these zillions on better train signals or tunnel repairs or something that’s actually useful.

P.S. This came up before and a bunch of commenters disagreed with me. I still think I’m right.

He wants to get started on Bayes

Mathew Mercuri writes:

I am interested in learning how to work in a Bayesian world. I have training in a frequentist approach, specifically from an applied health scientist/epidemiologist approach. However, while I teach courses in applied statistics, I am not particularly savvy with heavy statistical mathematics, so I am a bit worried about how to enter into this topic.

Can you recommend a book on Bayesian statistics for a frequentist trying to make the conversion! I am interested in your book, but I am not sure if it is the correct entry point. Note: I have read Howson and Urbach’s book on bayesian reasoning, so I am not a complete novice.

My reply: I recommend Richard McElreath’s book as a start.

“Find the best algorithm (program) for your dataset.”

Piero Foscari writes:

Maybe you know about this already, but I found it amazingly brutal; while looking for some reproducible research resources I stumbled onto the following at (which would be nice if done properly, at least as a standardization attempt):

Find the best algorithm (program) for your dataset.
Upload your dataset and run existing programs on it to see which one works best.
No mention of proper procedures in the FAQ summary, but I did not dig deep.

In the financial community data snooping has been a well-known problem for at least 20 years, exacerbated by automated model searches, so it’s amusing that something like that can still run that openly in the related and supposedly less naive ML community.

My reply: Some people do seem to think that there’s some sort of magic to a procedure that minimizes cross-validation error.
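
Here's a small simulation of the general concern (nothing to do with that particular site, just the generic point): fit many candidate models, pick the winner by cross-validation on one dataset, and the winning CV score will look better than the same model does on fresh data.

```r
# Data-snooping with model selection by cross-validation, simulated.
# Pure noise predictors, so no model is actually better than the mean;
# the "best" CV score is nonetheless optimistic.
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                        # outcome unrelated to every predictor

cv_mse <- function(j, k = 10) {      # k-fold CV error of lm(y ~ x[, j])
  folds <- sample(rep(1:k, length.out = n))
  errs <- sapply(1:k, function(f) {
    fit  <- lm(y[folds != f] ~ x[folds != f, j])
    pred <- cbind(1, x[folds == f, j]) %*% coef(fit)
    mean((y[folds == f] - pred)^2)
  })
  mean(errs)
}

scores <- sapply(1:p, cv_mse)        # "run existing programs on your dataset"
best   <- which.min(scores)
scores[best]                         # winning CV error: looks better than...

# ...the same model's error on genuinely new data:
x_new <- matrix(rnorm(n * p), n, p); y_new <- rnorm(n)
fit <- lm(y ~ x[, best])
mean((y_new - cbind(1, x_new[, best]) %*% coef(fit))^2)
```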

I refuse to blog about this one

Shravan points me to this article, Twitter Language Use Reflects Psychological Differences between Democrats and Republicans, which begins with the following self-parody of an abstract:

Previous research has shown that political leanings correlate with various psychological factors. While surveys and experiments provide a rich source of information for political psychology, data from social networks can offer more naturalistic and robust material for analysis. This research investigates psychological differences between individuals of different political orientations on a social networking platform, Twitter. Based on previous findings, we hypothesized that the language used by liberals emphasizes their perception of uniqueness, contains more swear words, more anxiety-related words and more feeling-related words than conservatives’ language. Conversely, we predicted that the language of conservatives emphasizes group membership and contains more references to achievement and religion than liberals’ language. We analysed Twitter timelines of 5,373 followers of three Twitter accounts of the American Democratic and 5,386 followers of three accounts of the Republican parties’ Congressional Organizations. The results support most of the predictions and previous findings, confirming that Twitter behaviour offers valid insights to offline behaviour.

and also this delightful figure:

[Figure 1 from the paper]

The pie-chart machine must’ve been on the fritz that day.

I can’t actually complain about this article because it appeared in Plos-one. I have the horrible feeling that, with another gimmick or two, it could’ve become a featured article in PPNAS or Science or Nature.

Anyway, I replied to Shravan:

Stop me before I barf . . .

To which Shravan replied:

Let it all out on your blog.

But no, I don’t think this is worth blogging. There must be some football items that are more newsworthy.

A book on RStan in Japanese: Bayesian Statistical Modeling Using Stan and R (Wonderful R, Volume 2)

[Book cover: Bayesian Statistical Modeling Using Stan and R]
Wonderful, indeed, to have an RStan book in Japanese:

Here’s what Google Translate makes of the description posted on Amazon Japan (linked from the title above):

In recent years, understanding of the phenomenon by fitting a mathematical model using a probability distribution on data and prompts the prediction “statistical modeling” has attracted attention. Advantage when compared with the existing approach is both of the goodness of the interpretation of the ease and predictability. Since interpretation is likely to easily connect to the next action after estimating the values ​​in the model. It is rated as very effective technique for data analysis Therefore reality.

In the background, the improvement of the calculation speed of the computer, that the large scale of data becomes readily available, there are advances in stochastic programming language to very simple trial and error of modeling. From among these languages, in this document to introduce Stan is a free software. Stan is a package which is advancing rapidly the development equipped with a superior algorithm, it can easily be used from R because the package for R RStan has been published in parallel. Descriptive power of Stan is high, the hierarchical model and state space model can be written in as little as 30 lines, estimated calculation is also carried out automatically. Further tailor-made extensions according to the analyst of the problem is the easily possible.

In general, dealing with the Bayesian statistics books or not to remain in rudimentary content, what is often difficult application to esoteric formulas many real problem. However, this book is a clear distinction between these books, and finished to a very practical content put the reality of the data analysis in mind. The concept of statistical modeling was wearing through the Stan and R in this document, even if the change is grammar of Stan, even when dealing with other statistical modeling tools, I’m sure a great help.

I’d be happy to replace this with a proper translation if there’s a Japanese speaker out there with some free time (Masanao Yajima translated the citation for us).

Big in Japan?

I’d like to say Stan’s big in Japan, but that idiom implies it’s not so big elsewhere. I can say there’s a very active Twitter community tweeting about Stan in Japanese, which we follow occasionally using Google Translate.
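
For readers who haven't tried RStan, here's roughly what the workflow looks like: write the model in Stan, hand it data from R, and let the sampler run. (A minimal sketch of my own, not an example from the book.)

```r
# Minimal RStan sketch: a simple linear regression written in Stan,
# fit from R on fake data, just to show the workflow.
library(rstan)

model_code <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
"

# Fake data for illustration
N <- 100
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)

fit <- stan(model_code = model_code, data = list(N = N, x = x, y = y),
            chains = 4, iter = 1000)
print(fit)
```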

Looking at the polls: Time to get down and dirty with the data


Poll aggregation is great, but one thing that we’ve been saying a lot recently (see also here) is that we can also learn a lot by breaking open a survey and looking at the numbers crawling around inside.

Here’s a new example. It comes from Alan Abramowitz, who writes:

Very strange results of new ABC/WP poll for nonwhite voters

There’s something very odd going on here.

See the table below provided by ABC News. [I’ll put the table at the end of the post.—ed.] They show that Clinton leads Trump by 89-2 among African-Americans and by 68-19 among Hispanics. But then they report that she only leads by 69-19 among all nonwhites. That makes no sense. Trump would have to have a huge lead among the other groups of nonwhite voters, mainly Asian-Americans, to produce that overall result among nonwhites.

Let’s assume that nonwhites are 28 percent of likely voters. And let’s assume that blacks are 12 percent, Hispanics are 11 percent and Asian/other are 5 percent.

According to my calculations, among the nonwhite 28 percent of the electorate, they have Clinton leading Trump by 19.3 to 5.3, a net advantage of 14 percentage points. Among the African-American 12 percent of the electorate, they have Clinton leading Trump by 10.7 to 0.2. And among the Hispanic 11 percent of the electorate, they have Clinton leading 7.5 to 2.1. Adding up the numbers for African-Americans and Hispanics, for that combined 23 percent of the electorate they have Clinton leading 18.2 to 2.3 for a lead of 15.9 percentage points. But remember, they only have Clinton leading by a net 14 percentage points among nonwhites. So in order to get to that result, Clinton must be down by a net 1.9 points among the remaining nonwhite voters. That means she would be LOSING to Trump among those other nonwhite voters by a landslide margin, something like 60 to 20!

Now my assumptions about the African-American, Hispanic and other nonwhite shares of the overall nonwhite electorate could be off a little, but probably not by much. And even if you modify those assumptions somewhat, you are still going to be left with the conclusion that Trump is far ahead of Clinton among nonwhites other than African-Americans and Hispanics.

If we flip the results for nonwhites other than African-Americans and Hispanics, giving Clinton a 60-20 lead rather than a 60-20 deficit, which would certainly be more realistic, this would make a noticeable difference in the overall results of the poll, moving the numbers from a 2 point Clinton lead among all likely voters to closer to a 5-6 point overall lead.

I responded:

What do you think happened? Maybe they used different adjustments for toplines and crosstabs?

Abramowitz said that, given the information that was currently available to him, “I have no idea what they did but I can’t come up with any way that these numbers add up.”

Just to be clear: I’m not saying these pollsters did anything wrong. I have no idea. I’ve not seen the raw data either, and I didn’t even go through all of Abramowitz’s comments in detail. My point here is just that, if we want to use and understand polls, sometimes we have to get down and dirty and try to figure out exactly what’s going on. Mike Spagat knows this, David Rothschild knows this, and so should we.
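
Abramowitz's arithmetic is easy to check. Here's the calculation in R, using his assumed group shares and the poll percentages as reported above:

```r
# Back-of-the-envelope check of Abramowitz's numbers (his assumed group
# shares; topline and crosstab percentages as reported for the ABC/WP poll).
share   <- c(black = 0.12, hispanic = 0.11, other = 0.05)  # shares of likely voters
clinton <- c(black = 0.89, hispanic = 0.68)                # crosstab support
trump   <- c(black = 0.02, hispanic = 0.19)

# Reported topline among all nonwhites (28% of voters): Clinton 69, Trump 19
clinton_nonwhite <- 0.28 * 0.69   # 19.3 points of the electorate
trump_nonwhite   <- 0.28 * 0.19   #  5.3 points

# Contributions from African-Americans and Hispanics alone
clinton_bh <- sum(share[c("black", "hispanic")] * clinton)  # 18.2
trump_bh   <- sum(share[c("black", "hispanic")] * trump)    #  2.3

# Whatever is left over is implied for the remaining 5% of "other" nonwhites
clinton_other <- as.numeric((clinton_nonwhite - clinton_bh) / share["other"])
trump_other   <- as.numeric((trump_nonwhite   - trump_bh)   / share["other"])
round(100 * c(clinton = clinton_other, trump = trump_other))
# roughly 23 vs. 60: the implied Trump landslide among the remaining
# nonwhite voters that makes the crosstabs hard to believe
```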

And here’s that table:

No statistically significant differences for get up and go

Politics and chance

After the New Hampshire primary Nadia Hassan wrote:

Some have noted how minor differences in how the candidates come out in these primaries can make a huge difference in the media coverage. For example, only a few thousand voters separate third and fifth and it really impacts how pundits talk about a candidate’s performance. Chance events can have a huge impact in politics and many areas. Candidates can win because of weather, or something they said, or a new news revelation. Nevertheless, I wonder if there’s a better way to handle this kind of thing when we are talking about close results in these primaries.

I replied:

Yes, but one reassuring perspective is that there’s arbitrariness in any case, as there are dozens of well qualified candidates for president and only one winner. Rather than arbitrariness, I’m more worried about systematic factors such as congress filling up with millionaires because these are the people who have the connections to allow them to run for office.

Cracks in the thin blue line


When people screw up or cheat in their research, what do their collaborators say?

The simplest case is when coauthors admit their error, as Cexun Jeffrey Cai and I did when it turned out that we’d miscoded a key variable in an analysis, invalidating the empirical claims of our award-winning paper.

On the other extreme, coauthors can hold a united front, as Neil Anderson and Deniz Ones did after some outside researchers found a data-coding error in their paper. Instead of admitting it and simply recognizing that some portion of their research was in error, Anderson and Ones destroyed their reputation by refusing to admit anything. This particular case continues to bother me because there’s no good reason for them not to want to get the right answer. Weggy was accused of plagiarism, which is serious academic misconduct, so it makes sense for him to stonewall and run out the clock until retirement. But Anderson and Ones simply made an error: Is admitting a mistake so painful as all that?

In other cases, researchers mount a vigorous defense in a more reasonable way. For example, after that Excel error was found, Reinhart and Rogoff admitted they made a mistake, and the remaining discussion turned on (a) the implications of the error for their substantive conclusions, and (b) the practice of data sharing. I think both sides had reasonable points in this discussion; in particular, yes, the data were public and always available, but the particular data file used by Reinhart and Rogoff was not accessible to outsiders. The resulting discussion moved forward in a useful way, toward a position that researchers who publish data should make their scripts and datasets available, even when they are working with public data. Here’s an example.

But what I want to talk about today is when coauthors do not take a completely united front.

In the case of disgraced primatologist Marc Hauser, collaborator Noam Chomsky escalated with: “Marc Hauser is a fine scientist with an outstanding record of accomplishment. His resignation is a serious loss for Harvard, and given the nature of the attack on him, for science generally.” On the upside, I don’t think Chomsky actually defended Hauser’s practice of trying to tell his research assistants how to code his monkey data. I’m assuming that Chomsky kept his distance from the controversial research studies, allowing him to engage in an aggressive defense on principle alone.

Another option is to just keep quiet. The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05. This last sort of error has also been found in at least one other paper of Cuddy. Upon publication of the non-replication, all three of Carney, Cuddy, and Yap responded in a defensive way that implied a lack of understanding of the basic statistical principles of statistical significance and replication. But after that, Carney and Yap appear to have kept quiet. [Not quite; see P.P.S. below.] Cuddy issued loud attacks on her critics but her coauthors perhaps have decided to stay out of the limelight. I’m glad they’re not going on the attack but I’m disappointed that they seem to want to hold on to their discredited claims. But that’s one strategy to follow when your work is found lacking: just stay silent and hope the storm blows over.

A final option, and the one I find most interesting, is when a researcher commits fraud or gross incompetence and does not admit it, but his or her coauthor will not sit still and accept this.

The most famous recent example was the gay-marriage-persuasion study of Michael Lacour and Don Green. When outsiders found out that the data were faked, Lacour denied it but Green pulled the plug. He told the scientific journal and the press that he had no trust in the data. Green did the right thing.

Another example is biologist Robert Trivers, who found out about problems in a paper he had coauthored—one coauthor had faked the data and another was defending the fraud. It took years until Trivers could get the journal to retract it.

My final example, which motivated me to write this post, came today in a blog comment from Randall Rose, a coauthor, with Promothesh Chatterjee and Jayati Sinha, of a social psychology study that was utterly destroyed by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, to the extent that Pashler et al. concluded that the data could not have happened as claimed in the paper and were consistent with fraud. Chatterjee and Sinha wrote horrible, Richard Tol-like defenses of their work (here’s a sample: “Although 8 coding errors were discovered in Study 3 data and this particular study has been retracted from that article, as I show in this article, the arguments being put forth by the critics are untenable”), but Rose did not join in:

I have ceased trying to defend the data in this paper, particularly Study 3, a long time ago. I am not certain what happened to generate the odd results (other than clear sloppiness in study execution, data coding, and reporting) but I am certain that the data in Study 3 should not be relied on . . .

I appreciate that. Instead of the usual the-best-defense-is-a-good-offense attitude, Rose openly admits that he did not handle the data himself and that he has no reason to vouch for the data quality or claim that the results still stand.

Wouldn’t it be great if everyone could do that?

It’s not an easy position, to be a coauthor in a study that has been found wanting, either through fraud, serious data errors, or simply a subtle statistical misunderstanding (such as that which led Satoshi Kanazawa to think that he could possibly learn anything about variation in sex ratios from a sample of size 3000). I find the behavior of Trivers, Green, and Rose in this setting to be exemplary, but I recognize the personal and professional difficulties here.

For someone like Carney or Yap, it’s a tough call. On one hand, to distance themselves from this work and abandon their claims would represent a serious hit on their careers, not to mention the pain involved in having to reassess their understanding of psychology. On the other hand, the work really is wrong, the experiment really wasn’t replicated, the data really are too noisy to learn what they were hoping to learn, and unlike Cuddy they’ve kept a lower profile so it doesn’t seem too late for them to admit error, accept the sunk cost, and move on.

P.S. See here for a discussion of a similar situation.

P.P.S. Commenter Bernoulli writes:

It is not true that Carney has remained totally silent.


Trump +1 in Florida; or, a quick comment on that “5 groups analyze the same poll” exercise

Nate Cohn at the New York Times arranged a comparative study on a recent Florida pre-election poll. He sent the raw data to four groups (Charles Franklin; Patrick Ruffini; Margie Omero, Robert Green, and Adam Rosenblatt; and Sam Corbett-Davies, David Rothschild, and me) and asked each of us to analyze the data however we liked, to estimate the margin of support for Hillary Clinton vs. Donald Trump in the state. And then he compared this to the New York Times pollster’s estimate.

Here’s what everyone estimated:

Franklin: Clinton +3 percentage points
Ruffini: Clinton +1
Omero, Green, Rosenblatt: Clinton +4
Us: Trump +1
NYT, Siena College: Clinton +1

We did Mister P, and the big reason our estimate was different from everyone else’s was that one of the variables we adjusted for was party registration, and this particular sample of 867 respondents had more registered Democrats than you’d expect, compared to Florida voters from 2012 with an adjustment for anticipated changes in the electorate for the upcoming election.

In previous efforts we’d adjusted on stated party identification, but in this case the survey was conducted based on registered voter lists, so we knew the party registration of the respondents. It was simpler to adjust for registration than stated party ID because we know the poststratification distribution for registration (based on the earlier election), whereas if we wanted to poststratify on party ID we’d need to take the extra step of estimating that distribution.
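
For readers who haven't seen the mechanics: the poststratification step itself is just a weighted average of cell-level estimates, with weights coming from the assumed electorate rather than from the sample. Here's a minimal sketch with invented numbers (the real analysis uses multilevel regression over many more cells):

```r
# Minimal sketch of poststratification on party registration.
# All numbers are invented for illustration; this is not the Florida poll.
cells <- data.frame(
  registration = c("Democrat", "Republican", "Other/None"),
  clinton_hat  = c(0.91, 0.08, 0.47),  # estimated two-party Clinton share in each cell
  electorate   = c(0.40, 0.39, 0.21)   # assumed share of likely voters in each cell
)

# Weight the cell estimates by electorate shares, not by how many
# respondents happened to land in each cell.
clinton_total <- with(cells, sum(clinton_hat * electorate))
margin <- 2 * clinton_total - 1        # Clinton minus Trump, two-party
round(100 * c(clinton = clinton_total, margin = margin), 1)
```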

Anyway, the exercise was fun and instructive. As Cohn put it:

We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results. . . . How so? Because pollsters make a series of decisions when designing their survey, from determining likely voters to adjusting their respondents to match the demographics of the electorate. These decisions are hard. They usually take place behind the scenes, and they can make a huge difference.

And he also provided some demographics on the different adjusted estimates:


I like this. You can evaluate our estimates not just based on our headline numbers but also based on the distributions we matched to.

But the differences weren’t as large as they look

Just one thing. At first I was actually surprised the results varied by so much. 5 percentage points seems like a lot!

But, come to think of it, the variation wasn’t so much. The estimates had a range of 5 percentage points, but that corresponds to a sd of about 2 percentage points. And that’s the sd on the gap between Clinton and Trump, hence the sd for either candidate’s total is more like 1 percentage point. Put it that way, and it’s not so much variation at all.
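
Here's that arithmetic, for the record:

```r
# Clinton-minus-Trump margins from the five analyses above
ests <- c(franklin = 3, ruffini = 1, omero_et_al = 4, us = -1, nyt_siena = 1)
diff(range(ests))  # range: 5 points
sd(ests)           # sd: about 2 points
```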

Andrew Gelman is not the plagiarism police because there is no such thing as the plagiarism police.


The title of this post is a line that Thomas Basbøll wrote a couple years ago.

Before I go on, let me say that the fact that I have not investigated this case in detail is not meant to imply that it’s not important or that it’s not worth investigating. It’s just not something that I had the energy to look into. Remember, people can be defined by what ticks them off.

And now here’s the story. I got the following email from someone called Summer Madison:

I think this might interest you:

I replied:

Multicollinearity causing risk and uncertainty

Alexia Gaudeul writes:

Maybe you will find this interesting / amusing / frightening, but the Journal of Risk and Uncertainty recently published a paper with a rather obvious multicollinearity problem.

The issue does not come up that often in the published literature, so I thought you might find it interesting for your blog.

The paper is:

Rohde, I. M., & Rohde, K. I. (2015). Managing social risks–tradeoffs between risks and inequalities. Journal of Risk and Uncertainty, 51(2), 103-124.

The authors report very nicely all the elements that would normally indicate to a reviewer that there is something wrong. I got the data from the authors to run my own tests, which I [Gaudeul] report here.

I haven’t looked into this in detail but I thought I’d post it because there’s this scandal in econ that’s somehow all about process and little about substance, and tomorrow I have a post on that, so I thought it was worth preceding it with an example that’s all about substance, not process.
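
For readers who haven't run into the issue, here's a toy simulation of the symptom (simulated data, nothing to do with the Rohde and Rohde dataset): two nearly identical predictors, individually unstable coefficients, and a huge variance inflation factor.

```r
# Toy illustration of what multicollinearity does to regression estimates.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)        # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)           # x2 has no effect of its own

fit <- lm(y ~ x1 + x2)
summary(fit)$coefficients              # huge standard errors on x1 and x2
cor(x1, x2)                            # correlation near 1

# Variance inflation factor for x1, computed by hand:
r2 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2)                           # VIF in the hundreds
```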

Why is the scientific replication crisis centered on psychology?

The replication crisis is a big deal. But it’s a problem in lots of scientific fields. Why is so much of the discussion about psychology research?

Why not economics, which is more controversial and gets more space in the news media? Or medicine, which has higher stakes and a regular flow of well-publicized scandals?

Here are some relevant factors that I see, within the field of psychology:

1. Sophistication: Psychology’s discourse on validity, reliability, and latent constructs is much more sophisticated than the usual treatment of measurement in statistics, economics, biology, etc. So you see Paul Meehl raising serious questions as early as the 1960s, at a time when in other fields we were just getting naive happy talk about how all problems would be solved with randomized experiments.

2. Overconfidence deriving from research designs: When we talk about the replication crisis in psychology, we’re mostly talking about lab experiments and surveys. Either way, you get clean identification of comparisons, hence there’s an assumption that simple textbook methods can’t go wrong. We’ve seen similar problems in economics (for example, that notorious paper on air pollution in China which was based on a naive trust in regression discontinuity analysis, not recognizing that, when you come down to it, what they had was an observational study), but lab experiments and surveys in psychology are typically so clean that researchers sometimes can’t seem to imagine that there could be any problems with their p-values.

3. Openness. This one hurts: psychology’s bad press is in part a consequence of its open culture, which manifests in various ways. To start with, psychology is _institutionally_ open. Sure, there are some bad actors who refuse to share their data or who try to suppress dissent. Overall, though, psychology offers many channels of communication, even including the involvement of outsiders such as myself. One can compare to economics, which is notoriously resistant to ideas coming from other fields.

And, compared to medicine, psychology is much less restricted by financial and legal considerations. Biology and medicine are big business, and there are huge financial incentives for suppressing negative results, silencing critics, and flat-out cheating. In psychology, it’s relatively easy to get your hands on the data or at least to find mistakes in published work.

4. Involvement of some of the most prominent academics. Research controversies in other fields typically seem to involve fringe elements in their professions, and when discussing science publication failures, you might just say that Andrew Wakefield had an axe to grind and the editor of the Lancet is a sucker for political controversy, or that Richard Tol has an impressive talent for getting bad work published in good journals. In the rare cases when a big shot is involved (for example, Reinhart and Rogoff) it is indeed big news. But, in psychology, the replication crisis has engulfed Susan Fiske, Roy Baumeister, John Bargh, Carol Dweck, . . . these are leaders in their field. So there’s a legitimate feeling that the replication crisis strikes at the heart of psychology, or at least social psychology; it’s hard to dismiss it as a series of isolated incidents. It was well over half a century ago that Popper took Freud to task regarding unfalsifiable theory, and that remains a concern today.

5. Finally, psychology research is often of general interest (hence all the press coverage, Ted talks, and so on) and accessible, both in its subject matter and its methods. Biomedicine is all about development and DNA and all sorts of actual science; to understand empirical economics you need to know about regression models; but the ideas and methods of psychology are right out in the open for all to see. At the same time, most of psychology is not politically controversial. If an economist makes a dramatic claim, journalists can call up experts on the left and the right and present a nuanced view. At least until recently, reporting about psychology followed the “scientist as bold discoverer” template, from Gladwell on down.

What do you get when you put it together?

The strengths and weaknesses of the field of research psychology seem to have combined to (a) encourage the publication and dissemination of lots of low-quality, unreplicable research, while (b) creating the conditions for this problem to be recognized, exposed, and discussed openly.

It makes sense for psychology researchers to be embarrassed that those papers on power pose, ESP, himmicanes, etc. were published in their top journals and promoted by leaders in their field. Just to be clear: I’m not saying there’s anything embarrassing or illegitimate about studying and publishing papers on power pose, ESP, or himmicanes. Speculation and data exploration are fine with me; indeed, they’re a necessary part of science. My problem with those papers is that they presented speculation as mature theory, that they presented data exploration as confirmatory evidence, and that they were not part of research programmes that could accommodate criticism. That’s bad news for psychology or any other field.

But psychologists can express legitimate pride in the methodological sophistication that has given them avenues to understand the replication crisis, in the openness that has allowed prominent work to be criticized, and in the collaborative culture that has facilitated replication projects. Let’s not let the breakthrough-of-the-week hype and the Ted-talking hawkers and the “replication rate is statistically indistinguishable from 100%” blowhards distract us from all the good work that has showed us how to think more seriously about statistical evidence and scientific replication.