
Transparency, replications, and publication

Bob Reed responded to my recent Retraction Watch article (where I argued that corrections and retractions are not a feasible solution to the problem of flawed science, because there are so many fatally flawed papers out there and retraction or correction is such a long, drawn-out process) with a post on openness, data transparency, and replication. He writes:

A recent survey article by Duvendack et al. reports that, of 333 journals categorized as “economics journals” by Thomson Reuters’ Journal Citation Reports, 27, or a little more than 8 percent, regularly published data and code to accompany empirical research studies. As some of these journals are exclusively theory journals, the effective rate is somewhat higher.

Noteworthy is that many of these journals only recently instituted a policy of publishing data and code. So while one can argue whether the glass is, say, 20 percent full or 80 percent empty, the fact is that the glass used to contain virtually nothing. That is progress.

But making data more “open” does not, by itself, address the problem of scientific unreliability. Researchers have to be motivated to go through these data, examine them carefully, and determine if they are sufficient to support the claims of the original study. Further, they need to have an avenue to publicize their findings in a way that informs the literature. . . .

Without an outlet to publish their findings, researchers will be unmotivated to spend substantial effort re-analysing other researchers’ data. Or to put it differently, the open science/data sharing movement only addresses the supply side of the scientific market. Unless the demand side is addressed, these efforts are unlikely to be successful in providing a solution to the problem of scientific unreliability.

Reed concludes:

The irony is this: The problem has been identified. There is a solution. The pieces are all there. But in the end, the gatekeepers of scientific findings, the journals, need to open up space to allow science to be self-correcting.

Could be. It could also be that the journals become less important. In statistics and political science, I see journals as still being very important in determining academic careers, but not so important any more for the dissemination of knowledge. There’s just so much being published and so many places to look for things. Econ might be different because there are still a few journals that are recognized to be the top. From the other direction, the CS publication system is completely broken: the conferences are flooded with papers, and it’s just endless layers of hype. CS research itself, on the other hand, is doing just fine. At this point the publication process just seems to be going in parallel with the research process, not really doing anything very useful.

Mister P can solve problems with survey weighting

It’s tough being a blogger who’s expected to respond immediately to topics in his area of expertise.

For example, here’s Scott “fraac” Adams posting on 8 Oct 2016, post titled “Why Does This Happen on My Vacation? (The Trump Tapes).” After some careful reflection, Adams wrote, “My prediction of a 98% chance of Trump winning stays the same.” And that was all before the second presidential debate, which “Trump won bigly. This one wasn’t close.” I don’t know what Trump’s chance of winning is now. Maybe 99%. Or 108%.

That’s fine. When Gregg Easterbrook made silly political prognostications, I was annoyed, because he purported to be a legitimate political writer. Adams has never claimed to be anything but an entertainer and, by golly, he continues to perform well in that regard. So, yes, I laugh at Adams, but I don’t see why he’d mind that. He is a humorist.

What interested me about Adams’s post of 8 Oct was not so much his particular opinions—Adams’s judgments on electoral politics are about as well-founded as my takes on cartooning—but rather his apparent attitude that he had a duty to his readers to share his thoughts, right away. The whole thing had a pleasantly retro feeling; it brought me back to the golden age of blogging, back around 2002 and the “warbloggers” who, whatever their qualifications, expressed such feelings of urgency about each political and military issue as it arose.

Anyway, that’s all background, and I thought of it all only because a similar thing happened to me today.

The real post starts here

Regular readers know that I’ve been taking a break from blogging—wow, it’s been over two months now—except for the occasional topical item that just can’t wait. And today something came that just couldn’t wait.

Several people pointed me to this news article by Nate Cohn with the delightful title, “How One 19-Year-Old Illinois Man Is Distorting National Polling Averages”:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election. . . .

He’s a panelist on the U.S.C. Dornsife/Los Angeles Times Daybreak poll, which has emerged as the biggest polling outlier of the presidential campaign. Despite falling behind by double digits in some national surveys, Mr. Trump has generally led in the U.S.C./LAT poll. . . .

Our Trump-supporting friend in Illinois is a surprisingly big part of the reason. In some polls, he’s weighted as much as 30 times more than the average respondent . . . Alone, he has been enough to put Mr. Trump in double digits of support among black voters. . . .

Cohn gives a solid exposition of how this happens: When you do a survey, the sample won’t quite match the population, and survey researchers use weighting adjustments to correct for known differences between sample and population. In particular, young black men tend to be underrepresented in surveys, compared to the general population, hence the few respondents in this demographic group need to be correspondingly upweighted. If there’s just one guy in the cell, he might have to get a really big weight. Cohn identifies this as a key problem in the adjustment: the survey is using weighting cells that are too small, hence the adjustments are very noisy. In this case, the noise manifests itself as big swings in this USC/LAT poll depending on whether or not this one man is in the sample.
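To see the mechanics, here's a minimal sketch of classical cell weighting with made-up numbers (this is not the USC/LAT poll's actual procedure, just a toy calculation): one respondent sitting alone in a rare cell inherits the whole cell's population share as his weight.

```python
# Toy cell-weighting calculation with invented numbers (not the USC/LAT
# poll's actual procedure): weight = population share / sample share.

# (cell name, population share, number of respondents in a 1000-person sample)
cells = [
    ("young black men", 0.02,   1),   # one guy alone in the cell
    ("everyone else",   0.98, 999),
]

n_sample = sum(n for _, _, n in cells)

for name, pop_share, n_respondents in cells:
    sample_share = n_respondents / n_sample
    weight = pop_share / sample_share
    print(f"{name}: weight = {weight:.2f}")

# Prints:
#   young black men: weight = 20.00
#   everyone else: weight = 0.98
# The lone respondent counts as much as 20 average respondents, so his
# candidate preference moves the weighted subgroup estimate (and, to a
# smaller degree, the topline) whenever he happens to be in the sample.
```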

There’s also an issue of adjusting for recalled vote in the previous presidential election but I’ll set that aside for now.

Here’s Cohn on the problems with the big survey weights:

In general, the choice in “trimming” weights [or using coarser weighting cells] is between bias and variance in the results of the poll. If you trim the weights [or use coarser weighting cells], your sample will be biased — it might not include enough of the voters who tend to be underrepresented. If you don’t trim the weights, a few heavily weighted respondents could have the power to sway the survey. . . .

By design, the U.S.C./LAT poll is stuck with the respondents it has. If it had a slightly too Republican sample from the start — and it seems it did, regardless of weighting — there was little it could do about it.

This is fine for what it is, conditional on the assumption that survey researchers are required to only use classical weighting methods. But there is no such requirement! We can now use Mister P.

Here’s a recent article in the International Journal of Forecasting describing how we used MRP for the Xbox poll. Here’s a longer article in the American Journal of Political Science with more technical details. Here’s MRP in the New York Times back in 2009! And here’s MRP in a Nate Cohn article last month in the Times.

Mister P is not magic; of course if your survey has too many Clinton supporters or too many Trump supporters, compared to what you’d expect based on their demographics, then you’ll get the wrong answer. No way around that. But MRP will automatically give the appropriate weight to single observations.

Two issues arise. First, there’s setting up the regression model. The usual plan would be logistic regression with predictors for sex*ethnicity and age*education. We don’t usually see sex*ethnicity*age. This one guy in the survey would influence all these coefficients—but, again, it’s just one survey respondent so the influence shouldn’t be large, especially assuming you use some sort of informative prior to avoid the blow-up you’d get if you had zero African-American Trump supporters in your sample. Second, poststratification. There you’ll need some estimate of the demographic composition of the electorate. But you’d need such an estimate to do weighting, too. I assume the survey organization’s already on top of this one.
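Here's a minimal sketch of the MRP idea in code. The partial pooling below is a crude conjugate-style stand-in for the multilevel logistic regression (which you'd actually fit in Stan or similar), and all the cell counts are invented; the point is just to show how a lone respondent's cell gets shrunk toward the overall rate before the poststratification step weights each cell by its population size.

```python
# Minimal MRP-flavored sketch: partially pool noisy cell estimates, then
# poststratify by population cell sizes. The pooling here is a crude
# pseudo-count stand-in for a real multilevel logistic regression; all
# numbers are invented for illustration.

cells = [
    # (cell label, respondents, respondents supporting candidate A,
    #  population count in this cell)
    ("white men 18-29",   40, 18, 2_000_000),
    ("white women 18-29", 45, 20, 2_050_000),
    ("black men 18-29",    1,  1,   250_000),  # the lone, noisy cell
    ("black women 18-29",  6,  1,   270_000),
]

n_total = sum(n for _, n, _, _ in cells)
y_total = sum(y for _, _, y, _ in cells)
p_overall = y_total / n_total

prior_strength = 10  # pseudo-respondents pulling each cell toward the overall rate

pop_total = sum(pop for _, _, _, pop in cells)
estimate = 0.0
for label, n, y, pop in cells:
    raw = y / n
    pooled = (y + prior_strength * p_overall) / (n + prior_strength)
    estimate += (pop / pop_total) * pooled
    print(f"{label}: raw = {raw:.2f}, partially pooled = {pooled:.2f}")

print(f"poststratified estimate of support: {estimate:.3f}")

# The lone respondent's cell gets pulled most of the way toward the overall
# rate (because n = 1), so a single person can no longer move the subgroup
# or topline estimate by anything like a weight-of-30 amount.
```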

So, yeah, we pretty much already know how to handle these problems. That said, there’s some research to be done in easing the transition from classical survey weighting to a modern MRP approach. I addressed some of these challenges in my 2007 paper, Struggles with Survey Weighting and Regression Modeling, but I think a clearer roadmap is needed. We’re working on it.

P.S. Someone forwarded me some comments on a listserv, posted by Arie Kapteyn, Director, USC Dornsife Center for Economic and Social Research:

When designing our USC/LAT poll we have strived for maximal transparency so that indeed anyone who has registered to use our data can verify every step we have taken.

The weights we use to make sure our sample is representative of the U.S. population do result in underrepresented individuals in the sample with a higher weight than those who are in overrepresented groups. In general, one has to make a decision whether to trim weights so that the factor for any individual will not exceed a certain value. However, trimming weights comes with a trade-off, in that it may not be possible to adequately balance the overall sample after trimming. In this poll, we made the decision that we would not trim the weights to ensure that our overall sample would be representative of, for example, young people and African Americans. The result is that a few individuals from groups such as those who are less represented in polling samples and thus have higher weighting factors, can shift the subgroup graphs when they participate. However, they contribute to an unbiased (but possibly noisier) estimate of the outcomes for the overall population.

Our confidence intervals (the grey zones) take into account the effect of weights. So if someone with a big weight participates the confidence interval tends to go up. One can see this very clearly in the graph for African Americans. Essentially, whenever the line for Trump improved, the grey band widened substantially. More generally, the grey band implies a confidence interval of some 30 percentage points so we really should not base any firm conclusion on the changes in the graphs. Admittedly, the weight given to this one individual is very large, nevertheless excluding this individual would move the estimate of the popular vote by less than one percent. Admittedly a lot, but not something that fundamentally changes our forecast. And indeed a movement that falls well within the estimated confidence interval.

So the bottom line is: one should not over-interpret movements if confidence bands are wide.

OK, sure, don’t overinterpret movements if confidence bands are wide, but . . . (1) One concern expressed by Cohn was not just movements but also the estimate itself being consistently too high for the Republican candidate, and (2) With MRP, you can do better! No need to take these horrible noisy estimates and just throw up your hands. Using basic principles of statistics you can get better estimates.

It’s not about trimming the weights or not trimming the weights, it’s about getting a better estimate of national public opinion. The weights—or, more generally, the statistical adjustment—is a means to an end. And you have to keep that end in mind. Don’t get fixated on weighting.

P.P.S. Also, I guess I should also clarify this one point: The classical weighting estimate is not actually unbiased. Kapteyn was incorrect in that claim of unbiasedness.

Should you abandon that low-salt diet? (uh oh, it’s the Lancet!)


Russ Lyons sends along this news article by Ian Johnston, who writes:

The prestigious medical journal The Lancet has been attacked for publishing an academic paper that claimed eating too little salt could increase the chance of dying from a heart attack or stroke.

Johnston summarizes the study:

Researchers from the Population Health Research Institute in Canada, studied more than 130,000 people from 49 different countries on six continents and concluded people should consume salt “in moderation”, rather than trying to reduce it in accordance with government guidelines across the world. . . .

The paper compared the health of people who tests showed had consumed low levels of sodium (up to three grams a day), medium amounts (four or five grams) and high levels (seven grams or more).

“Those participants with four to five grams of sodium excretion had the lowest risk” of death or suffering a “major cardiovascular disease event”, the researchers reported.

Among people who had high blood pressure, eating high and low levels of salt “were both associated with increased risk”. And for people without high blood pressure, consuming less than three grams a day was “associated with a significantly increased risk” – 11 per cent – of death or a serious cardiovascular event.

But there are critics:

Professor Francesco Cappuccio, head of the World Health Organization’s Collaborating Centre for Nutrition, attacked both the methods used in the study and the journal for agreeing to publish it.

“It is with disbelief that we should read such bad science published in The Lancet,” he said.

Professor Cappuccio said the article contained “a re-publication of data” used in another paper.

“The flaws that were extensively noted in their previous accounts are maintained and criticisms ignored,” he said.

The measurement of salt intake used by the researchers, he said, was “flawed” because it was done by testing urine samples given in the morning and then “extrapolated to 24-hour excretion” using an “inadequate” equation.

Professor Cappuccio also said the participants were “almost exclusively from clinical trials of sick people that have a very high risk of dying and are taking several medications”.

Now I don’t know what to think. I really don’t. I haven’t looked at the paper or the criticisms. If the study really is so flawed, though, I can’t say I’m surprised that it was published in the Lancet, a journal that’s famous for producing headline-grabbing papers that are later refuted, such as that Iraq survey, or that article claiming that gun laws could reduce firearm deaths by 200% or whatever, or, most notoriously, that paper by Andrew Wakefield [no link needed]. The Lancet may well publish some high-quality work but they do seem to have a weakness for bold claims and publicity.

I suppose the Lancet will publish a letter by Cappuccio or someone else with the criticisms? Perhaps a reader can keep us up to date here.

P.S. I’m sure I eat too much salt for my own good. I have a big jar of pretzels just sitting here in my office!

Gray graphs look pretty

Swupnil made this graph for a research meeting we had today:


It looks so cool. I think it’s the gray colors.

So here’s my advice to you: If you want to make your graphs look cool, use lots of gray.

My online talk this Friday noon for the Political Methods Colloquium: The Statistical Crisis in Science

Justin Esarey writes:

This Friday, October 14th at noon Eastern time, the International Methods Colloquium will inaugurate its Fall 2016 series of talks with a presentation by Andrew Gelman of Columbia University. Professor Gelman’s presentation is titled “The Statistical Crisis in Science.” The presentation will draw on these two papers:

“Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors”

“Disagreements about the Strength of Evidence”

To tune in to the presentation and participate in the discussion after the talk, visit the International Methods Colloquium website and click “Watch Now!” on the day of the talk. You can also register for the talk in advance.

The IMC uses Zoom, which is free to use for listeners and works on PCs, Macs, and iOS and Android tablets and phones. You can be a part of the talk from anywhere around the world with access to the Internet. The presentation and Q&A will last for a total of one hour.

A webinar isn’t as much fun as a live talk, but you can feel free to check this one out. According to Justin, it’s open to all and there will be lots of time for questions.

No, I don’t think the Super Bowl is lowering birth weights


In a news article entitled, “Inequality might start before we’re even born,” Carolyn Johnson reports:

Another study, forthcoming in the Journal of Human Resources, analyzed birth outcomes in counties where the home team goes to the Super Bowl. . . . The researchers found that women in their first trimester whose home team played in the Super Bowl had measurably different birth outcomes than pregnant women whose teams did not go to the championship. There was a small, 4 percent increase in the probability of having a baby with low birth weight when the team won.

Garden. Of. Forking. Paths.

And this:

The magnitude of the change was tiny, but what was striking to Mansour [one of the authors of the study] was that it was detectable at all, in studying Super Bowl history from 1969 to 2004.

On the contrary, I’m not surprised at all. Given that researchers have detected ESP, and the effects of himmicanes, and power pose, and beauty and sex ratio, etc etc etc., I’m not surprised they can detect the effect of the Super Bowl. That’s the point of researcher degrees of freedom: you can detect anything.

As a special bonus, we get the difference between significant and non-significant:

The chances of having a low birth weight baby were a bit higher when the team won in an upset, suggesting that surprise may have helped fuel the effect. There was little effect when the team lost.

Really no end to the paths in this garden.

To her credit, Johnson does express some skepticism:

There’s a huge caveat to interpreting these studies. . . . That means researchers have to use natural experiments and existing data sets to explore their hypothesis. That leads to imaginative studies — like the Super Bowl one — but also means that they can’t be certain that it’s the prenatal experiences and not some other factor that explains the result.

But not nearly enough skepticism, as far as I’m concerned. To say “they can’t be certain that . . .” is to way overstate the evidence. If someone shows a blurry photo that purports to show the Loch Ness Monster, the appropriate response is not “they can’t be certain that it’s Nessie and not some other factor that explains the result.”

Sure, you can come up with a story in which the Super Bowl adds stress that increases the risk of low birth weight. Or a story in which the Super Bowl adds positive feelings that decrease that risk. Or a story about the relevance of any other sporting event, or any other publicized event, maybe a major TV special or an election or the report of shark attacks or prisoners on the loose or whatever else is happening to you this trimester. Talk is cheap, and so is “p less than .05.”
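Here's a rough simulation sketch of how cheap “p less than .05” is under forking paths: pure-noise data, a dozen arbitrary ways to slice it (subgroup, outcome coding, “upset win” definition, and so on, here just represented as random splits of the same data), and the analyst reports whichever comparison clears the threshold. The numbers are arbitrary; the point is the gap between the nominal 5 percent and what you actually get.

```python
import random, math

random.seed(1)

def zscore(xs, ys):
    """z statistic for the difference in means of two samples (normal approx)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

n_sims, n_forks, n_per_study = 2000, 12, 200
hits = 0
for _ in range(n_sims):
    # Pure noise: the outcome has nothing to do with any "exposure".
    outcome = [random.gauss(0, 1) for _ in range(n_per_study)]
    found = False
    for _ in range(n_forks):
        # Each fork: a different arbitrary split of the same data
        # (subgroup, outcome coding, "upset win" definition, etc.).
        exposed = [random.random() < 0.5 for _ in range(n_per_study)]
        xs = [o for o, e in zip(outcome, exposed) if e]
        ys = [o for o, e in zip(outcome, exposed) if not e]
        if abs(zscore(xs, ys)) > 1.96:
            found = True
            break
    hits += found

print(f"share of pure-noise studies with a 'significant' fork: {hits / n_sims:.2f}")
# Typically around 0.4-0.5 here, versus the nominal 0.05.
```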

P.S. One more thing. I just noticed that the news headline was “Inequality might start before we’re even born.” What do they mean, “might”? Of course inequality starts before we’re even born. You don’t have to be George W. Bush or Edward M. Kennedy to know that! It’s fine to be concerned about inequality; no need to try to use shaky science to back up a claim that’s evident from lots of direct observations and a huge literature on social mobility.


Paul Alper writes:

Heimlich, who is 96, was in the news lately, saving a woman, 87 years old, using the technique he invented.

So, off to Wikipedia:

Henry Judah Heimlich (born February 3, 1920) is an American thoracic surgeon widely credited as the inventor of the Heimlich maneuver, a technique of abdominal …

where I [Alper] discovered some strange things about his wife:

Heimlich’s wife coauthored a book on homeopathy with Maesimund B. Panos called “Homeopathic Medicine at Home.”[6] She also wrote What Your Doctor Won’t Tell You, which advocates chelation therapy and other alternative therapies.

and kids:

Heimlich and his wife had four children: Phil Heimlich, a former Cincinnati elected official turned conservative Christian radio talk-show host; Peter Heimlich, whose website describes what he alleges to be his father’s “wide-ranging, unseen 50-year history of fraud”

and this:

From the early 1980s, Heimlich advocated malariotherapy, the deliberate infection of a person with benign malaria in order to treat ailments such as cancer, Lyme disease and (more recently) HIV. As of 2009 the treatments were unsuccessful, and attracted criticism as both scientifically unsound and dangerous.[24] The United States Food and Drug Administration and Centers for Disease Control and Prevention have rejected malariotherapy and, along with health professionals and advocates for human rights, consider the practice “atrocious”

and this:

Joseph P. Ornato MD, Medical College of Virginia:
“Dr. Heimlich continues to distort, misquote, fabricate, and mislead his peers and the public regarding the scientific ‘evidence’ supporting the safety and efficacy of his (drowning) theory. Dr. Heimlich’s ‘evidence’ consists of unsubstantiated, poorly documented anecdotes. He cites letters to the editor (published in the Journal of the American Medical Association) as though they represented rigorous scientific study.” (August 1992 letter to the American Red Cross as quoted in the Cincinnati Enquirer, May 10, 1993)

Searching to see if you have ever expounded on Heimlich and his research, I found this:

Maksim Gelman, butcher of Brighton Beach: ‘I killed 6 more that no ……/Maksim-Gelman-butcher-Brighton-B…
Jan 16, 2012 – Maksim Gelman, 24, who faces sentencing on Wednesday for murdering … communications and intelligence devices; Dr. Henry Heimlich, 96, …

but for the life of me, Heimlich is missing on that page. Is there any Gelman-Heimlich connection?

My brother Alan is a doctor and he saved a kid from choking once using the Heimlich maneuver. It really happened—I was there! The kid was choking on a piece of meat.

That’s all I’ve got for ya. But the material here is pretty amazing! It doesn’t seem like Heimlich ever let the truth get in the way of self-promotion!

Note to journalists: If there’s no report you can read, there’s no study

Blogger Echidne caught a study by the British organization Demos which was reported in Newsweek as “Half of Misogyny on Twitter Comes From Women.” But, as Echidne points out, there’s no study to report:

I [Echidne] then e-mailed Demos to ask for the url of the study. The rapid response I received (thanks, Demos!) told me — and here comes the fun part! — that there IS NO WRITTEN REPORT THAT PEOPLE COULD ANALYZE.

That is bullshit. Absolute bullshit. . . .

That there is no report does not imply that the results are incorrect, only that we cannot tell if they are correct or incorrect. But a written report is very important. The reason that researchers write their studies up is so that others can see what they did, how they did it, and also so that others can judge whether the study was done right or not.

I agree. I’m reminded of the gay gene tabloid hype, where results presented in a 10-minute conference talk were promoted all around the world, without there being any paper describing the data and methods.

Or the Wall Street Journal article that reported on a claimed survey of the super-rich for which no documentation was provided and which we have no reason to trust.

Hey, journalists: Don’t get fooled. Demand to see the study! I think it would work.

Next time someone sends you a press release and you’re thinking of running the story, first contact the organization and ask to see the written report. If they say they don’t have a report, it’s simple: Either don’t run the story, or run a report that is appropriately dripping with skepticism, including the phrase “for which the organization refused to supply a written report” as many times as possible.

“The Prose Factory: Literary Life in England Since 1918” and “The Windsor Faction”

It’s been D. J. Taylor week here. I expect that something like 0% of you (rounding to the nearest percentage point) have heard of D. J. Taylor, and that’s ok. He’s an English literary critic. Several years ago I picked up a copy of his book, A Vain Conceit: British Fiction in the 1980s, and I’ve been a fan ever since. Then last week I bought The Prose Factory: Literary Life in England Since 1918, and read most of it on the plane back from London. It’s been a long time since I’ve put my work aside and just relaxed and read like this, and I hadn’t remembered how long it can take to read a book. 7 hours on the plane and I still wasn’t quite done. I’d also picked up a copy of Taylor’s most recent novel, The Windsor Faction, and reading it has been a revealing experience too. I don’t always want to read a novel by a literary critic, but the novel was ok and it also gave me insight into the criticism.

All of this is a pretty obscure hobby so most of you can just stop reading at this point. I have no statistical insights coming. And, just to be clear, I don’t think there’s anything particularly special or admirable about enjoying literary journalism; it just happens to be an interest of mine. If you don’t care about the topic, again, you can skip reading, just as Shravan skips the baseball posts and other people skip anything on politics.

I hate to even have to say all this but there’s so much reverse snobbery out there that I feel like I have to apologize for writing about literary journalism.

The Prose Factory

Anyway, I enjoyed Taylor’s book of literary history, but there was something a bit, umm, off about it. I wasn’t quite sure what it was about. I understood that it’s not primarily a work of literary criticism, so there’s not so much discussion of individual stories or novels or nonfiction prose. It’s more about what it was like to be a writer or critic during this period. But there’s next to nothing about large classes of professional prose writers, including writers of genre fiction (no Agatha Christie or John Le Carre, also nothing on the many less-successful writers in their fields), plain old newspaper writers, playwrights, etc. Nothing on Michael Frayn, for example, who did various of these things. Next to nothing on writers of popular or scholarly books on history. That’s fine—the topics Taylor does write on are mostly interesting—I’d just’ve appreciated some discussion of what he felt the book was really about, along with some consideration of all the prose he’d decided not to write about.

I did a quick search and found this review by Stefan Collini that covers Taylor’s book well. So if you’re interested I suggest you start with Collini’s review.

Also Terry Eagleton wrote this review which I really hated. Actually much of Eagleton’s review is excellent: he knows a lot and has all sorts of interesting thoughts and reflections. Actually, I recommend you read it. But it still irritated me because it seemed to be all about taking sides. Eagleton kept picking fights from nowhere. For example:

The Bloomsbury group, he [Taylor] admits, were a jealously exclusive elite, but so what? ‘It was their club: why should they be expected to let non-members through the door?’ he protests. Does this extend to banning Jews from golf clubs?

Where did that come from?

Or this:

Lord David Cecil’s ‘gentlemanly and rather old-fashioned scholarship’ is duly noted, but ‘this is not to disparage Lord David’s accomplishments, either as critic or biographer.’ Why not?

Why not? It’s right here, dude! Taylor’s very next phrase: “his life of Max Beerbohm (1964) is still the standard account.”

That’s good enough for me: if you write a book of biography and criticism that is still the standard account—over fifty years later!—that’s an accomplishment. At the end of this sentence, Taylor characterizes Cecil as “at best backward-looking and at worst deeply reactionary.”

As I said, Eagleton has a lot of interesting things to say. I just am so sick of his attitude: he hates this David Cecil so much that he can’t accept that maybe the guy wrote a good book once!

The Windsor Faction

The Windsor Faction is Taylor’s most recent novel. I enjoyed it. The plot was fine, the characters were . . . well, they were ok. They didn’t really come alive, I don’t think they had any agency. Where the book really stood out was in its atmosphere. It took place in London in 1939-1940 and it seemed so real, much more so than in many other historical novels I’ve read. Not just the scenery and decor, also the way the characters wrote and talked, how they used their language. Here I can really see the influence of Taylor’s immersion in the English literature and journalism of that period.

(Just to interject: I recently read Expo 58 by Jonathan Coe. Now he’s a real writer. It’s fine for craftsmen such as Taylor to write fiction—if nothing else, it’s gotta make him a better critic—but it’s good to be reminded by reading Coe what a real novelist can do. I assume (hope) Taylor would agree with me on this one.)

Anyway, to return to The Windsor Faction . . . Lots of other influences too: There was a certain clever trick that Taylor described in his nonfiction book, something that this novelist from the 1920s or 30s did—I can’t remember the name of the author or his books—this author would describe certain actions occurring in the background of the scene which would subtly and humorously advance the plot, a sort of cinematic trick. Anyway, Taylor does this in his own novel. It works fine but it was a bit disconcerting to know exactly where it came from. Also the book had lots of observation of social class that would fit in with Orwell’s writing. Taylor doesn’t quite have the bucket-of-water-falling-on-the-hapless-hero’s-head style of Orwell or Jonathan Coe, but the early scene of the guy working in the junk shop and ripping off his boss had a bit of that Aspidistra feeling. There were also some scenes that mix broad comedy with deep discomfort—in particular I’m thinking of a scene of an awkward party where one of the guests simultaneously attempts suicide and floods the bathroom—that remind me very much of Kingsley Amis, another favorite subject of Taylor. And, finally, in its general air of foreboding, the entire book could be taken as a gloss on the unforgettable final three paragraphs of Homage to Catalonia.

And I think Taylor would have no problem at all with us closing out this review with those paragraphs from Orwell’s classic:

I think we stayed three days in Banyuls. It was a strangely restless time. In this quiet fishing-town, remote from bombs, machine-guns, food-queues, propaganda, and intrigue, we ought to have felt profoundly relieved and thankful. We felt nothing of the kind. The things we had seen in Spain did not recede and fall into proportion now that we were away from them; instead they rushed back upon us and were far more vivid than before. We thought, talked, dreamed incessantly of Spain. For months past we had been telling ourselves that ‘when we get out of Spain’ we would go somewhere beside the Mediterranean and be quiet for a little while and perhaps do a little fishing, but now that we were here it was merely a bore and a disappointment. It was chilly weather, a persistent wind blew off the sea, the water was dull and choppy, round the harbour’s edge a scum of ashes, corks, and fish-guts bobbed against the stones. It sounds like lunacy, but the thing that both of us wanted was to be back in Spain. Though it could have done no good to anybody, might indeed have done serious harm, both of us wished that we had stayed to be imprisoned along with the others. I suppose I have failed to convey more than a little of what those months in Spain meant to me. I have recorded some of the outward events, but I cannot record the feeling they have left me with. It is all mixed up with sights, smells, and sounds that cannot be conveyed in writing: the smell of the trenches, the mountain dawns stretching away into inconceivable distances, the frosty crackle of bullets, the roar and glare of bombs; the clear cold light of the Barcelona mornings, and the stamp of boots in the barrack yard, back in December when people still believed in the revolution; and the food-queues and the red and black flags and the faces of Spanish militiamen; above all the faces of militiamen—men whom I knew in the line and who are now scattered Lord knows where, some killed in battle, some maimed, some in prison—most of them, I hope, still safe and sound. Good luck to them all; I hope they win their war and drive all the foreigners out of Spain, Germans, Russians, and Italians alike. This war, in which I played so ineffectual a part, has left me with memories that are mostly evil, and yet I do not wish that I had missed it. When you have had a glimpse of such a disaster as this—and however it ends the Spanish war will turn out to have been an appalling disaster, quite apart from the slaughter and physical suffering—the result is not necessarily disillusionment and cynicism. Curiously enough the whole experience has left me with not less but more belief in the decency of human beings. And I hope the account I have given is not too misleading. I believe that on such an issue as this no one is or can be completely truthful. It is difficult to be certain about anything except what you have seen with your own eyes, and consciously or unconsciously everyone writes as a partisan. In case I have not said this somewhere earlier in the book I will say it now: beware of my partisanship, my mistakes of fact, and the distortion inevitably caused by my having seen only one corner of events. And beware of exactly the same things when you read any other book on this period of the Spanish war.

Because of the feeling that we ought to be doing something, though actually there was nothing we could do, we left Banyuls earlier than we had intended. With every mile that you went northward France grew greener and softer. Away from the mountain and the vine, back to the meadow and the elm. When I had passed through Paris on my way to Spain it had seemed to me decayed and gloomy, very different from the Paris I had known eight years earlier, when living was cheap and Hitler was not heard of. Half the cafés I used to know were shut for lack of custom, and everyone was obsessed with the high cost of living and the fear of war. Now, after poor Spain, even Paris seemed gay and prosperous. And the Exhibition was in full swing, though we managed to avoid visiting it.

And then England—southern England, probably the sleekest landscape in the world. It is difficult when you pass that way, especially when you are peacefully recovering from sea-sickness with the plush cushions of a boat-train carriage under your bum, to believe that anything is really happening anywhere. Earthquakes in Japan, famines in China, revolutions in Mexico? Don’t worry, the milk will be on the doorstep tomorrow morning, the New Statesman will come out on Friday. The industrial towns were far away, a smudge of smoke and misery hidden by the curve of the earth’s surface. Down here it was still the England I had known in my childhood: the railway-cuttings smothered in wild flowers, the deep meadows where the great shining horses browse and meditate, the slow-moving streams bordered by willows, the green bosoms of the elms, the larkspurs in the cottage gardens; and then the huge peaceful wilderness of outer London, the barges on the miry river, the familiar streets, the posters telling of cricket matches and Royal weddings, the men in bowler hats, the pigeons in Trafalgar Square, the red buses, the blue policemen—all sleeping the deep, deep sleep of England, from which I sometimes fear that we shall never wake till we are jerked out of it by the roar of bombs.


It’s ok to criticize


I got a little bit of pushback on my recent post, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant: Education edition”—some commenters felt I was being too hard on the research paper I was discussing, because the research wasn’t all that bad, and the conclusions weren’t clearly wrong, and the authors didn’t hype up their claims.

What I’d like to say is that it is OK to criticize a paper, even if it isn’t horrible.

We’ve talked on this blog about some papers that are just terrible (himmicanes) or that are well-intentioned but obviously wrong (air pollution in China) or with analyses that are so bad as to be uninterpretable (air rage) or which have sample sizes too small and data too noisy to possibly support any useful conclusions (beautiful parents, ovulation and voting) or which are hyped out of proportion to whatever they might be finding (gay gene tabloid hype) or which are nothing but a garden of forking paths (Bible Code, ESP). And it’s fine to blow these papers out of the water.

But it’s also fine to present measured criticisms of research papers that have some value but also have some flaws. And that’s what I did in that earlier post. As I wrote at the time:

Just to be clear, I’m not trying to “shoot down” this research article nor am I trying to “debunk” the news report. I think it’s great for people to do this sort of study, and to report on it. It’s because I care about the topic that I’m particularly bothered when they start overinterpreting the data and drawing strong conclusions from noise.

If my goal were to make a series of airtight cases, destroying published paper after published paper, then, yes, it would make sense to concentrate my fire on the worst of the worst. But that’s not what it’s all about. “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant” is a real error, it’s an important error, and it’s a frequent error—and I think it’s valuable to point out this error in the context of a paper that’s not trash, reported on in a newspaper that’s not prone to sensationalism.
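To spell the error out with made-up numbers (and assuming the two estimates are independent): study A reports an estimate of 25 with standard error 10, study B reports 10 with standard error 10. A is “significant,” B is not, yet the difference between them is nowhere near significant:

```python
import math

def two_sided_p(z):
    """Two-sided normal-approximation p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up estimates and standard errors for two independent studies.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

print(f"A: z = {est_a / se_a:.1f}, p = {two_sided_p(est_a / se_a):.3f}")  # significant
print(f"B: z = {est_b / se_b:.1f}, p = {two_sided_p(est_b / se_b):.3f}")  # not significant

diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
print(f"A - B: z = {diff / se_diff:.1f}, p = {two_sided_p(diff / se_diff):.3f}")
# The difference itself is nowhere near statistically significant, which is
# why "one was significant and the other wasn't" is weak evidence of a real
# difference between the two conditions.
```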

So, you commenters who told me I was being too harsh: What you’re really saying is that methods criticisms should be reserved for papers that are terrible. But I disagree. I say it can be helpful to criticize some of the reasoning of a paper on methodological grounds, even while other aspects of the paper are fine.

Criticism and review are all about advancing science, not about making airtight cases against particular papers or particular bodies of work.

The real dogs and the stat-dogs


One of the earliest examples of simulation-based model checking in statistics comes from the 1954 book by Robert Bush and Frederick Mosteller, Stochastic Models for Learning.

They fit a probability model to data on dogs being shocked in a research lab (yeah, I know, not an experiment that would be done today). Then they simulate “stat-dogs” from their fitted model and compare them to the real dogs.

It’s a posterior predictive check! OK, only approximately, as they use a point estimate of the parameters, not the full posterior distribution. But their model has only 2 parameters and they’re estimated pretty well, so that’s no big deal. Also, they only simulate a single random replicated dataset, but that works ok here because the data have internal replication (an independent series on each of 30 dogs).

The responses of real dogs, that is, the observed data, y, are shown above.

Here are the “stat-dogs,” the replicated data, y_rep:


And here are some comparisons:



Jennifer and I pick up this example in chapter 24 of our book and consider how to make the graphical comparison more effective. We also talk about some other models and about how the variation in the data can be explained by some combination of variation between dogs and positive-feedback learning (the idea that a dog learns more from an avoidance than from a shock). With such a small dataset it’s hard to untangle these two explanations.
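For the curious, here's a minimal sketch of what simulating stat-dogs looks like, using a two-parameter learning-curve model of the sort Bush and Mosteller fit: the probability of shock starts at 1 and is multiplied by one factor for each previous avoidance and another for each previous shock. The parameter values below are round illustrative numbers, not their actual estimates.

```python
import random

random.seed(123)

# Illustrative parameters (not the fitted values from the book): each prior
# avoidance multiplies the shock probability by alpha, each prior shock
# multiplies it by beta, starting from probability 1 on the first trial.
alpha, beta = 0.80, 0.92
n_dogs, n_trials = 30, 25

def simulate_stat_dog():
    """One replicated dog: 1 = shocked, 0 = avoided, for each trial."""
    avoidances = shocks = 0
    outcomes = []
    for _ in range(n_trials):
        p_shock = (alpha ** avoidances) * (beta ** shocks)
        shocked = random.random() < p_shock
        outcomes.append(int(shocked))
        if shocked:
            shocks += 1
        else:
            avoidances += 1
    return outcomes

stat_dogs = [simulate_stat_dog() for _ in range(n_dogs)]

# One summary to compare with the real data: the proportion of dogs still
# getting shocked at each trial.
for trial in range(0, n_trials, 5):
    frac = sum(dog[trial] for dog in stat_dogs) / n_dogs
    print(f"trial {trial + 1:2d}: fraction shocked = {frac:.2f}")
```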

Michael Lacour vs John Bargh and Amy Cuddy

In our discussion of the Bargh, Chen, and Burrows priming-with-elderly-related-words-makes-people-walk-slowly paper (the study which famously failed in a preregistered replication), commenter Lois wrote:

Curious as to what people think of this comment on the Bargh et al. (1996) paper from Pubpeer: (see below).

In Experiment 3, the experimenter rated participants on irritability, hostility, anger, and uncooperativeness on 10-point scales. Independently, two blind coders rated the same participants’ facial expressions (from video-tapes) on a 5-point scale of hostility.
Figure 3 in the paper displays the mean hostility ratings from the experimenter and the blind coders for participants in the two conditions (Caucasian prime and African-American prime). The means from the experimenter and from the blind coders appear almost identical (they are indistinguishable to my eye).
How likely is it that the experimenter and the blind coders provided identical ratings, using different scales, in both conditions, for ratings of something as subjective as expressed hostility?
Are the bars shown in Figure 3 an error?

Did Bargh et al. cheat? Did their research assistant cheat? Did they make up data? Did they make an Excel error? Did they do everything correctly and we have nothing here but a suspicious PubPeer commenter?

I don’t know. But from a scientific perspective, it doesn’t really matter.

In our discussion of the Carney, Cuddy, and Yap paper on power pose (another one with a failed replication), commenters here and here noticed some irregularities in this and an earlier paper of Cuddy: test statistics were miscalculated, often in ways that moved a result from p greater than .05 to p less than .05.

Did Cuddy et al. cheat? Did they enter the wrong values in their calculator? Were the calculations performed by an unscrupulous research assistant who was paid by the star?

I don’t know. But from a scientific perspective, it doesn’t really matter.

Why doesn’t it matter? Because these studies are, from a quantitative perspective, nothing but the manipulation of noise. Any effects are too variable to be detected by these experiments. Indeed, this variability is implicitly recognized in the literature, where every new study reports the discovery of a new interaction.

So, whether these p-values less than .05 come from cheating, sloppiness, the garden of forking paths, or some combination of the above (for example, opportunistic rounding might fall somewhere in the borderline between cheating and sloppiness, and Daryl Bem’s thicket of forking paths is so tangled that in my opinion it counts as sloppiness at the very least), it just doesn’t matter. These quantitative results are useless anyway.

Enter Lacour

In contrast, consider the fraud conducted by Michael Lacour in his paper with Don Green on voter persuasion. This one, if his numbers had been true, really would’ve represented strong evidence in favor of his hypothesis. He had a clean study, a large sample, and a large and persistent effect. Indeed, when I first saw that paper, my first thought was forking paths but upon a quick read it was clear that there was something real there (conditional on the data being true, of course). No way these results could’ve been manufactured simply by excluding some awkward data points or playing around with predictors in the regression.

So in the Lacour case, the revelations of fraud were necessary. The forensics were a key part of the story. For Cuddy, etc., sure, maybe they engaged in unethical research practices, or maybe they were just sloppy, but it’s not really worth our while to try to find out. Cos even if they did everything they did with pure heart, their numbers are not offering real support for their claims.

One might say that Lacour was too clever by half. He could’ve got that paper in Science without any faking of data at all! Just do the survey, gather the data, then use the garden of forking paths to get a few statistically significant p-values. Here are some of the steps he could’ve taken:

1. If he gathers real data, lots of the data won’t fit his story: there will be people who got the treatment who remained hostile to gay rights and people who got the control who still end up supporting gay marriage. Start out by cleaning your data, removing as many of these aberrant cases as you can get away with. It’s not so hard: just go through, case by case, and see what you can find to disqualify these people. Maybe some of their responses to other survey questions were inconsistent, maybe the interview was conducted outside the official window for the study (which you can of course alter at will, and it looks so innocuous in the paper: “We conducted X interviews between May 1 and June 30, 2011” or whatever). If the aggregate results for any canvasser look particularly bad, you could figure out a way to discard that whole batch of data—for example, maybe this was the one canvasser who was over 40 years old, or the one canvasser who did not live in the L.A. area, or whatever. If there’s a particular precinct where the results did not turn out as they should, find a way to exclude it: perhaps the vote in the previous election there was more than 2 standard deviations away from the mean. Whatever.

2. Now comes the analysis stage. Subsets, comparisons, interactions. Run regressions including whatever you want. If you find no main effect, do a path analysis—that was enough to convince a renowned political scientist that a subliminally flashing smiley face could cause a large change in people’s attitudes toward immigration. The smiley-face conclusion was a stretch (to say the least); by comparison, Lacour and Green’s claims, while large, were somewhat plausible.

3. The writeup. Obviously you report any statistically significant thing you happen to find. But you can do more. You can report any non-significant comparisons as evidence against alternative explanations of your data. And if you’re really ballsy you can report p-values such as .08 as evidence for weak effects. What you’re doing is spinning a story using all these p-values as raw material. This is important: don’t think of statistical significance as the culmination of your study; feel confident that, using all your researcher degrees of freedom, you can get as many statistically significant p-values as you want, and look ahead to the next step of telling your tale. You already have statistical significance; no question about that. To get the win—the publication in Nature, Science, or PPNAS—you need that pow! that connects your empirical findings to your theory.

4. Finally, once you’ve created your statistical significance, you can use that as backing to make as many graphs as you want. So Lacour still gets to use that skill of his.

Isn’t it funny? He could’ve got all the success he wanted—NPR coverage, Ted Talk, tenure-track job, a series of papers in top journals—without faking the data at all!

For decades the statistical community completely missed the point. We obsessed about stopping rules and publication bias, pretty much ignoring that the real action is in data exclusion, subsetting, and degrees of freedom in analysis.

P.S. I wrote this post months ago; it just happened to come up today, in the midst of all the discussion of the replication crisis.

Did Colombia really vote no in that peace referendum?

Mike Spagat and Neil Johnson write:

The official line is that the “no” vote won the referendum in Colombia. The internationally lauded peace treaty with the FARC guerillas was rejected . . .

But did “no” actually win?

The numbers divide four ways, rather than just two “no” and “yes” answers: 6,431,376 against the treaty, 6,377,482 in favour, 86,243 unmarked ballots, and 170,946 nullified ballots.

The referendum process itself was without doubt transparent and fair . . . But there were nonetheless several inevitable sources of statistical error in the counting process that could have swamped the razor-thin victory margin of 53,894.

Spagat and Johnson first point out errors in ballot counting:

As a large research literature has made clear, we can reasonably assume that even well-rested people would have made mistakes with between 0.5% and 1% of the ballots. On this estimate, about 65,000-130,000 votes would have been unintentionally misclassified. It means the number of innocent counting errors could easily be substantially larger than the 53,894 yes-no difference.

But this would only make a difference if the miscounts were biased, not if the mistakes were happening at random.
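To see why the random-vs-biased distinction matters, here's a rough simulation sketch using the vote totals above and purely illustrative error rates (the “tilted” scenario is hypothetical, not evidence of any actual bias):

```python
import random, math

random.seed(7)

no_votes, yes_votes = 6_431_376, 6_377_482
margin = no_votes - yes_votes            # 53,894

def n_flipped(n_ballots, p_flip):
    """Approximate Binomial(n, p) draw via its normal approximation."""
    return random.gauss(n_ballots * p_flip,
                        math.sqrt(n_ballots * p_flip * (1 - p_flip)))

def margin_change(p_no_to_yes, p_yes_to_no):
    """Change in the no-minus-yes margin when some ballots are read as the
    opposite choice (each such flip swings the margin by 2 votes)."""
    return 2 * (n_flipped(yes_votes, p_yes_to_no) - n_flipped(no_votes, p_no_to_yes))

def summarize(label, p_ny, p_yn, n_sims=10_000):
    changes = [margin_change(p_ny, p_yn) for _ in range(n_sims)]
    mean = sum(changes) / n_sims
    sd = math.sqrt(sum((c - mean) ** 2 for c in changes) / n_sims)
    print(f"{label}: margin change = {mean:+.0f} +/- {sd:.0f} votes")

# 1% of ballots misread, equally likely in either direction: random error.
summarize("random 1% errors", 0.005, 0.005)
# Same 1% total error rate, but hypothetically tilted 0.7% vs 0.3%.
summarize("tilted 1% errors", 0.007, 0.003)

# Random errors move the 53,894-vote margin by only a few hundred votes,
# while even a modestly tilted error process moves it by tens of thousands,
# enough to be decisive.
```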

Spagat and Johnson then write:

Second, there were 170,946 nullified votes. As the photo below shows, the ballots were so simple that it’s hard to imagine how there could be so many invalid ones.


Of course, it is correct to toss out any ballot with both “yes” and “no” marked, but it seems surprising that so many ballots were apparently spoiled this badly. Again, the consistency of the counters’ decisions could have been decisive.


Then there are the blank or nullified ballots. It’s quite possible that many of the 86,243 unmarked ballots were simply marked very lightly, meaning their votes did not register in the tired eyes of well-meaning volunteer counters – and the nearly 270,000 voters whose ballots were rejected as blank or null must have had some voting intention that they somehow failed to express.

So these ballots represent 270,000 voting failures, more than four times the victory margin. Even if many of these ballots were reasonably well-classified, this figure is an enormous red flag.

What about biases? Spagat and Johnson write:

We know of no evidence of cheating, and Colombia is to be lauded for the seriousness of its referendum process, but the distinction between intentional and unintentional misclassification by individual counters can occasionally become blurred in practice.

As a study that asked voters to judge “ambiguous” ballots demonstrated, those doing the counting can be driven by unconscious biases. . . .

In total, therefore, the result presents as many as 400,000 opportunities for classification mistakes. That’s before counting any systematic human behaviours not listed above. This represents a numerical uncertainty that swamps the victory margin of 53,894.

They summarize:

None of the above analysis proves that most voters on Sunday supported the peace treaty, but there’s an immense difference between declaring that “no” won and declaring the result inconclusive. This referendum has momentous implications, not just for the Colombian people but also for the many national governments and the United Nations that supported the peace deal. It is very sad that a country’s future may be dictated by an inaccurate declaration (one readily amplified by international media).

Interesting. I don’t know anything about Colombia myself, so all I can say is there’s a lot to think about here.

StanCon: now accepting registrations and submissions


As we announced here a few weeks ago, the first Stan conference will be Saturday, January 21, 2017 at Columbia University in New York. We are now accepting both conference registrations and submissions. Full details are available at the StanCon page on the Stan website. If you have any questions, please let us know, and we hope to see you in NYC this January!

Anyone using or interested in Stan is welcome to register for the conference. To register for StanCon please visit the StanCon registration page.


StanCon’s version of conference proceedings will be a collection of contributed talks based on interactive, self-contained notebooks (e.g., knitr, R Markdown, Jupyter, etc.). Submissions will be peer reviewed by the StanCon organizers and all accepted notebooks will be published in an official StanCon repository. If your submission is accepted we may also ask you to present during one of the StanCon sessions.

For details on submissions please visit the StanCon submissions page.

P.S. Stay tuned for an announcement about several Stan and Bayesian inference courses we will be offering in the days leading up to the conference.

The never-back-down syndrome and the fundamental attribution error


David Allison told me about a frustrating episode in which he published a discussion where he pointed out problems with a published paper, and the authors replied with . . . not even a grudging response, they didn’t give an inch, really ungracious behavior. No “Thank you for finding our errors”; instead they wrote:

We apologize for the delay in the reply to Dr Allison’s letter of November 2014, this was probably due to the fact that it was inadvertently discarded.

Which would be kind of a sick burn except that they’re in the wrong here.

Anyway, I wrote this to Allison:

Yeah, it would be too much for them to consider the possibility they might have made a mistake!

It’s funny, in medical research, it’s accepted that a researcher can be brilliant, creative, well-intentioned, and come up with a theory that happens to be wrong. You can even have an entire career as a well-respected medical researcher and pursue a number of dead ends, and we accept that; it’s just life, there’s uncertainty, the low-hanging fruit have all been picked, and we know that the attempted new cancer cure is just a hope.

And researchers in other fields know this too, presumably. We like to build big exciting theories, but big exciting theories can be wrong.

But . . . in any individual case, researchers never want to admit error. A paper can be criticized and criticized and criticized, and the pattern is to not even consider the possibility of a serious mistake. Even the authors of that ovulation-and-clothing paper, or the beauty-and-sex-ratio paper, or the himmicanes paper, never gave up.

It makes no sense. Do these researchers think that only “other people” make errors?

And Allison replied:

The phenomenon you note seems like a variant on what psychologists call the Fundamental Attribution Error.

Interesting point. I know about the fundamental attribution error and I think a lot about refusal to admit mistakes, but I’d never made the connection. More should be done on this. I’m envisioning a study with 24 undergrads and 100 Mechanical Turk participants that we can publish in Psych Sci or PPNAS if they don’t have any ESP or himmicane studies lined up.

No, really, I do think the connection is interesting and I would like to see it studied further. I love the idea of trying to understand the stubborn anti-data attitudes of so many scientists. Rather than merely bemoaning these attitudes (as I do) or cynically accepting them (as Steven Levitt has), we could try to systematically learn about them. I mean, sure, people have incentives to lie, exaggerate, cheat, hide negative evidence, etc.—but it’s situational. I doubt that researchers typically think they’re doing all these things.

It’s not about the snobbery, it’s all about reality: At last, I finally understand hatred of “middlebrow”

I remember reading Dwight Macdonald and others slamming “middlebrows” and thinking, what’s the point? The classic argument from the 1940s onward was to say that true art (James Joyce etc.) was ok, and true mass culture (Mickey Mouse and detective stories) was cool, but anything in the middle (John Marquand, say) was middlebrow and deserved mockery and disdain. The worst of the middlebrow was the stuff that mainstream newspaper critics thought was serious and uplifting.

When I’d read this, I’d always rebel a bit. I had no particular reason to doubt most of the judgments of Macdonald etc. (although I have to admit to being a Marquand fan), but something about the whole highbrow/middlebrow/lowbrow thing bugged me: If lowbrow art could have virtues (and I have no doubt that it can), then why can’t middlebrow art also have these positive qualities?

What I also couldn’t understand was the almost visceral dislike that Macdonald and other critics felt for the middlebrow. So what if some suburbanites were patting themselves on the back for their sophistication in reading John Updike? Why deprive them of that simple pleasure, and why hold that against Updike?

But then I had the same feeling myself, the same fury against the middlebrow, and I think I understand where Macdonald etc. were coming from.

It came up after the recent “air rage” story, in which a piece of PPNAS-tagged junk science got the royal treatment at the Economist, NPR, Science magazine, etc. etc.

This is “middlebrow science.” It goes about in the trappings of real science, is treated as such by respected journalists, but it’s trash.

To continue the analogy: true science is fine, and true mass culture (for example, silly news items about Elvis sightings and the Loch Ness monster) is fine too, in that nobody is taking it for real science. But the Gladwell/Easterbrook/PPNAS/PsychScience/NPR axis . . . this is the middlebrow stuff I can’t stand. It lacks the rigor of real science, yet it is not treated by journalists with the disrespect it deserves.

And I think that’s how Macdonald felt about middlebrow literature: bad stuff is out there, but seeing bad stuff taken so seriously by opinion-makers, that’s just painful.

P.S. Let me clarify based on some things that came up in comments. I don’t think middlebrow is necessarily bad. I’m a big fan of Marquand and Updike, for example. Similarly, when it comes to popular science, there’s lots of stuff that I like that also gets publicity in places such as NPR. Simplification is fine too. The point, I think, is that work has to be judged on its own merits, that the trappings of seriousness should not be used as an excuse to abdicate critical responsibility.

Astroturf “patient advocacy” group pushes to keep drug prices high


Susan Perry tells the story:

Patients Rising, [reporter Trudy Lieberman] reports, was founded by Jonathan Wilcox, a corporate communications and public relations consultant and adjunct professor at USC’s Annenberg School of Communications, and his wife, Terry, a producer of oncology videos. . . .

Both Wilcox and his wife had worked with Vital Options International, another patient advocacy group with a special mission of generating global cancer conversations. She is a former executive director. A search of [Vital Options International’s] website showed that drug industry heavy hitters, such as Genentech, Eli Lilly, and Bristol-Myers Squibb, had in the past sponsored some of the group’s major activities . . .

Patients Rising is pushing back particularly strongly against Dr. Peter Bach, an epidemiologist at New York City’s Memorial Sloan Kettering Cancer Center, who has been outspoken about the high cost of cancer drugs.

Pretty horrible. Political advocacy is fine, and it could well be that there are good reasons for drug prices to remain high. But faking a patient advocacy organization, that’s not cool.

I will say, though, that artificial turf is a lot more pleasant than it used to be. Twenty years ago, it felt like concrete; now it feels a lot more like grass. Except on really hot days when the turf feels like hot tar.

Full disclosure: I am working with colleagues at Novartis and getting paid for it.

Handy Statistical Lexicon — in Japanese!


So, one day I get this email from Kentaro Matsuura:

Dear Professor Andrew Gelman,

I’m a Japanese Stan user and write a blog to promote Stan (and translator of . . .).

I believe your post on “Handy statistical lexicon” is so great that I’d like to translate and spread the post in my blog. Could I do that?


Wow, how cool is that? Of course I said yes, please do it.

A week later Kentaro wrote to ask for a favor:

Could I change some terms slightly so that Japanese could have more familiarity? For example, there is no “self-cleaning oven” in Japan, but there is “self-cleaning air conditioner”.

I had no idea.

And here it is!

Don’t trust Rasmussen polls!


Political scientist Alan Abramowitz brings us some news about the notorious pollster:

In the past 12 months, according to Real Clear Politics, there have been 72 national polls matching Clinton with Trump—16 polls conducted by Fox News or Rasmussen and 56 polls conducted by other polling organizations. Here are the results:

Trump has led or been tied with Clinton in 44 percent (7 of 16) of Fox and Rasmussen Polls: 3 of 5 Rasmussen Polls and 4 of 11 Fox News Polls.

Trump has led or been tied with Clinton in 7 percent (4 of 56) of polls conducted by other polling organizations.

To put it another way, Fox and Rasmussen together have accounted for 22 percent of all national polls in the past year but they have accounted for 64 percent of the polls in which Trump has been leading or tied with Clinton.

Using Pollster’s tool that allows you to calculate polling averages with different types of polls and polling organizations excluded:

Current Pollster average: Clinton +2.7
Removing Rasmussen and Fox News: Clinton +7.7
Live Interview polls only: Clinton +8.8
Live interview polls without Fox News: Clinton +9.2

I find it remarkable that simply removing Rasmussen and Fox changes the average by 5 points.
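
Just as a sanity check on the arithmetic (a toy sketch of my own, not anything Abramowitz ran), the quoted percentages follow directly from the counts above:

fox_ras_polls, other_polls = 16, 56            # 72 national polls in the past year
trump_ahead_fox_ras, trump_ahead_other = 7, 4  # polls with Trump leading or tied

total_polls = fox_ras_polls + other_polls
trump_ahead_total = trump_ahead_fox_ras + trump_ahead_other

print(f"Trump ahead or tied, Fox+Rasmussen:       {trump_ahead_fox_ras / fox_ras_polls:.0%}")     # 44%
print(f"Trump ahead or tied, other pollsters:     {trump_ahead_other / other_polls:.0%}")         # 7%
print(f"Fox+Rasmussen share of all polls:         {fox_ras_polls / total_polls:.0%}")             # 22%
print(f"Fox+Rasmussen share of Trump-ahead polls: {trump_ahead_fox_ras / trump_ahead_total:.0%}") # 64%

The numbers check out; the house effect is right there in the counts.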

Hey—I remember Rasmussen! They’re a bunch of clowns.


Here are a couple of old posts about Rasmussen.

From 2010:

Rasmussen polls are consistently to the right of other polls, and this is often explained in terms of legitimate differences in methodological minutiae. But there seems to be evidence that Rasmussen’s house effect is much larger when Republicans are behind, and that it appears and disappears quickly at different points in the election cycle.

From 2008:

I was looking up the governors’ popularity numbers on the web, and came across this page from Rasmussen Reports which shows Sarah Palin as the 3rd-most-popular governor. But then I looked more carefully. Janet Napolitano of Arizona is viewed as Excellent by 28% of respondents, Good by 27%, Fair by 26%, and Poor by 27%. That adds up to 108%! What’s going on? I’d think they would have a computer program to pipe the survey results directly into the spreadsheet. But I guess not, someone must be entering these numbers by hand. Weird.

I just checked that page again and it’s still wrong:

[Screenshot of the Rasmussen Reports page, taken 20 May 2016]

What ever happened to good old American quality control?

But, hey, it’s a living. Produce crap numbers that disagree with everyone else and you’re gonna get headlines.

You’d think news organizations would eventually twig to this particular scam and stop reporting Rasmussen numbers as if they’re actually data, but I guess polls are the journalistic equivalent of crack cocaine.

Given that major news organizations are reporting whatever joke study gets released in PPNAS, I guess we shouldn’t be surprised they’ll fall for Rasmussen, time and time again. It’s inducing stat rage in me nonetheless.

If only science reporters and political reporters had the standards of sports reporters. We can only dream.

Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit


People point me to things on the internet that they’re sure I’ll hate. I read one of these a while ago—unfortunately I can’t remember who wrote it or where it appeared, but it raised a criticism, not specifically of me, I believe, but more generally of skeptics such as Uri Simonsohn and myself who keep bringing up p-hacking and the garden of forking paths.

The criticism that I read is wrong, I think, but it has a superficial appeal, so I thought it would be worth addressing it here.

The criticism went like this: People slam classical null-hypothesis-significance-testing (NHST) reasoning (the stuff I hate, all those “p less than .05” papers in Psychological Science, PPNAS, etc., on ESP, himmicanes, power pose, air rage, . . .) on two grounds: first, that NHST makes no sense, and second, that published p-values are wrong because of selection, p-hacking, forking paths, etc. But, the criticism continues, this anti-NHST attitude is itself self-contradictory: if you don’t like p-values anyway, why care that they’re being done wrong?

The author of this post (that I now can’t find) characterized anti-NHST critics (like me!) as being like the diner in that famous Borscht Belt routine, who complains about the food being so bad. And such small portions!

And, indeed, if we think NHST is such a bad idea, why do we then turn around and say that p-values are being computed wrong? Are we in the position of the atheist who goes into church in order to criticize the minister on his theology?

No, and here’s why. Suppose I come across some piece of published junk science, like the ovulation-and-clothing study. I can (and do) criticize it on a number of grounds. Suppose I lay off the p-values, saying that I wouldn’t compute a p-value here in any case so who cares. Then a defender of the study could easily stand up and say, Hey, who cares about these criticisms? The study has p less than .05, this will only happen 5% of the time if the null hypothesis is true, thus this is good evidence that the null hypothesis is false and there’s something going on! Or, a paper reports 9 different studies, each of which is statistically significant at the 5% level. Under the null hypothesis, the probability of this happening is (1/20)^9, thus the null hypothesis can be clearly rejected. In these cases, I think it’s very helpful to be able to go back and say, No, because of p-hacking and forking paths, the probability you’ll find something statistically significant in each of these experiments is quite high.

The reason I’m pointing out the problems with published p-values is not that I think researchers should be doing p-values “right,” whatever that means. I’m doing it in order to reduce the cognitive dissonance. So, when a paper such as the power-pose study fails to replicate, my reaction is not, What a surprise: this statistically significant finding did not replicate!, but rather, No surprise: this noisy result looks different in a replication.

It’s similar to my attitude toward preregistration. It’s not that I think that studies should be preregistered; it’s that when a study is not preregistered, it’s hard to take p-values, and the reasoning that usually flows from them, at face value.