You’ll never guess what I say when I have nothing to say

A reporter writes:

I’m a reporter working on a story . . . and I was wondering if you could help me out by taking a quick look at the stats in the paper it’s based on.

The paper is about paedophiles being more likely to have minor facial abnormalities, suggesting that paedophilia is a neurodevelopment disorder that starts in the womb. We’re a bit concerned that the stats look weak though – small sample size, no comparison to healthy controls, large SD, etc.

If you have time, could you take a quick look and let me know if the statistics seem to be strong enough to back up their conclusions? The paper is here:

I replied: Yes, I agree. I don’t find this convincing, and it’s hard to know what to do with it. It doesn’t seem newsworthy to me. That said, I’m not an expert on this topic.

What’s the difference between randomness and uncertainty?

Julia Galef mentioned “meta-uncertainty,” and how to characterize the difference between a 50% credence about a coin flip coming up heads, vs. a 50% credence about something like advanced AI being invented this century.

I wrote: Yes, I’ve written about this probability thing. The way to distinguish these two scenarios is to embed each of them in a larger setting. The question is, how would each probability change as additional information becomes available. The coin flip is “random” to the extent that intermediate information is not available that would change the probability. Indeed, the flip becomes less “random” to the extent that it is flipped. In other settings such as the outcome of an uncertain sports competition, intermediate information could be available (for example, maybe some key participants are sick or injured) hence it makes sense to speak of “uncertainty” as well as randomness.

It’s an interesting example because people have sometimes considered this to be merely a question of “philosophy” or interpretation, but the distinction between different sources of uncertainty can in fact be encoded in the mathematics of conditional probability.
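Here is a minimal sketch of how that distinction can be written down with conditional probabilities. All the numbers are hypothetical, chosen only so that both events have the same 50% marginal probability; the injury report stands in for any intermediate information that might become available.

```python
# Hypothetical numbers: both events have a marginal probability of 0.5,
# but only one of them responds to intermediate information.

P_HEADS = 0.5            # coin flip
P_INJURED = 0.5          # chance that a key player turns out to be injured
P_WIN_GIVEN = {          # chance the team wins, given the injury report
    "injured": 0.3,
    "healthy": 0.7,
}


def marginal_win():
    """P(win), averaging over the injury report we have not seen yet."""
    return (P_INJURED * P_WIN_GIVEN["injured"]
            + (1 - P_INJURED) * P_WIN_GIVEN["healthy"])


def win_given(report):
    """P(win | injury report): moves once the information arrives."""
    return P_WIN_GIVEN[report]


def heads_given(report):
    """P(heads | injury report): the coin is independent of the report,
    so conditioning changes nothing."""
    return P_HEADS


if __name__ == "__main__":
    print("before any information:", P_HEADS, marginal_win())
    for report in ("injured", "healthy"):
        print(report, "-> coin:", heads_given(report),
              "game:", win_given(report))
```

Both credences start at 50%, but the game's probability shifts when the report arrives and the coin's does not. In this sketch that difference is the whole distinction between "uncertainty" and pure "randomness."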

The bejeezus


Tova Perlmutter writes of a recent online exchange:

Stat Podcast Plan


In my course on Statistical Communication and Graphics, each class had a special guest star who would answer questions on his or her area of expertise. These were not “guest lectures”—there were specific things I wanted the students to learn in this course, it wasn’t the kind of seminar where they just kick back each week and listen—rather, they were discussions, typically around 20 minutes long, facilitated by the outside expert.

One thing that struck me about these discussions was how fun they were, and how various interesting and unexpected things came up in our conversations.

And that made me think—Hey, we should do a podcast! I can be the host and have conversations with these guests, one at a time, and then release these as (free) 15-minute podcasts. How awesome! The only challenge is to keep them lively. Without a roomful of students, a recorded conversation between two people could get stilted.

Also we need a title for the series. “Statistics Podcast” is pretty boring. “Statcast”? The topics we’ve had so far have been focused on statistical communication, but once we go through that, we could cover other statistical areas as well.

And then there are the technical details: how to actually set up a podcast, and whether it needs to be edited a bit.

So here’s what I need from you:

– A title for the podcast series.

– Advice on production and distribution.

Our starting lineup

Here are some of the visitors we’ve had in our course so far. I’d plan to start with them, since I’ve already had good conversations with them.

I list the topic corresponding to each visitor, but the actual conversations ranged widely.

    Thomas Basbøll, Writing Consultant, Copenhagen Business School (topic: Telling a story)

    Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners (topic: Principles of statistical graphics, although the actual discussion ended up being all about educational testing, because that’s what the students’ questions were about)

    Deborah Nolan, Professor of Statistics, University of California (topic: Student activities and projects)

    Jessica Watkins, Department of Education, Tufts University (topic: Facilitating class participation)

    Justin Phillips, Professor of Political Science, Columbia University (topic: Classroom teaching)

    Beth Chance, Professor of Statistics, California Polytechnic State University (topic: Preparing and evaluating a class)

    Amanda Cox, Graphics Editor, New York Times (topic: Graphing data: what to do)

    Jessica Hullman, Assistant Professor of Information Visualization, University of Washington (topic: Graphing data: what works)

    Kaiser Fung, Senior Data Advisor, Vimeo (topic: Statistical reporting)

    Elke Weber, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Eric Johnson, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Cynthia Rudin, Associate Professor of Statistics, MIT (topic: Understanding fitted models)

    Kenny Shirley, Principal Inventive Scientist, Statistics Research Department, AT&T Laboratories (topic: Understanding fitted models)

    Tom Wood, Assistant Professor of Political Science, Ohio State University (topic: Displaying fitted models)

    Elizabeth Tipton, Assistant Professor of Applied Statistics, Teachers College, Columbia University (topic: Displaying fitted models)

    Brad Paley, Principal, Digital Image Design Incorporated (topic: Giving a presentation)

    Jared Lander, statistical consultant and author of R for Everyone (topic: Teaching in a non-academic environment)

    Jonah Gabry, Researcher, Department of Statistics, Department of Political Science, and Population Research Center, Columbia University (topic: Dynamic graphics)

    Martin Wattenberg, Data Visualization, Google (topic: Dynamic graphics)

    Hadley Wickham, Chief Scientist, RStudio (topic: Dynamic graphics)

    David Rindskopf, Professor of Educational Psychology, City University of New York (topic: Consulting)

    Shira Mitchell, Postdoctoral Researcher, Earth Institute, Columbia University (topic: Collaboration)

    Katherine Button, Lecturer, Department of Psychology, University of Bath (topic: Communication and its impact on science)

    Jenny Davidson, Professor of English, Columbia University (topic: Writing for a technical audience)

    Rachel Schutt, Senior Vice President of Data Science, News Corporation (topic: Communication with a non-technical audience)

    Leslie McCall, Professor of Sociology, Northwestern University (topic: Social research and policy)

    Yair Ghitza, Senior Scientist, Catalist (topic: Data processing)

    Bob Carpenter, Research Scientist, Department of Statistics, Columbia University (topic: Programming)

P.S. Lots of suggested titles in comments. My favorite title so far: Learning from Numbers.

P.P.S. I asked Sharad if he could come up with any names for the podcast and he sent me these:

White Noise
In the Noise
The Signal
Random Samples

I’ll have to nix the first suggestion as it’s a bit too accurate a description of the ethnic composition of myself and our guest stars. The third suggestion is pretty good but it’s almost a bit too slick. After all, we’re not the signal, we’re just a signal. I’m still leaning toward Learning from Numbers.

The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Alain Content writes:

I am a psycholinguist who teaches statistics (and also sometimes publishes in Psych Sci).

I am writing because as I am preparing for some future lessons, I fall back on a very basic question which has been worrying me for some time, related to the reasoning underlying NHST [null hypothesis significance testing].

Put simply, what is the rational justification for considering the probability of the test statistic and any more extreme value of it?

I know of course that the point value probability cannot be used, but I can’t figure out the reasoning behind the choice of any more extreme value. I mean, wouldn’t it be as valid (or invalid) to consider for instance the probability of some (conventionally) fixed interval around the observed value? (My null hypothesis is that there is no difference between Belgians and Americans in chocolate consumption. If I find a mean difference of, say, 3 kg, I decide to reject H0 based on the probability of [2.9-3.1].)

My reply: There are 2 things going on:

1. The logic of NHST. To get this out of the way, I don’t like it. As we’ve discussed from time to time, NHST is all about rejecting straw-man hypothesis B and then using this to claim support for the researcher’s desired hypothesis A. The trouble is that both models are false, and typically the desired hypothesis A is not even clearly specified.

In your example, the true answer is easy: different people consume different amounts of chocolate. And the averages for two countries will differ. The average also differs from year to year, so a more relevant question might be how large are the differences between countries, compared to the variation over time, the variation across states within a country, the variation across age groups, etc.

2. The use of tail-area probabilities as a measure of model fit. This has been controversial. I don’t have much to say on this. On one hand, if a p-value is extreme, it does seem like we learn something about model fit. If you’re seeing p=.00001, that does seem notable. On the other hand, maybe there are other ways to see this sort of lack of fit. In my 1996 paper with Meng and Stern on posterior predictive checks, we did some p-values, but now I’m much more likely to perform a graphical model check.

In any case, you really can’t use p-values to compare model fits or to compare datasets. This example illustrates the failure of the common approach of using the p-value as a data summary.

My main message is to use model checks (tail area probabilities, graphical diagnostics, whatever) to probe flaws in the model you want to fit—not as a way to reject null hypotheses.
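For readers who want to see the two quantities in the chocolate question side by side, here is a rough sketch. It assumes a normal sampling distribution for the mean difference under the null, with a hypothetical standard error of 1.5 kg; neither the standard error nor the function names come from the email. It is an illustration of the question, not a recommended procedure.

```python
import math

SE = 1.5  # hypothetical standard error of the mean difference, in kg


def normal_cdf(x, sd=1.0):
    """CDF of a normal distribution centered at zero (the null)."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def two_sided_tail_area(observed, se=SE):
    """Probability under the null of a difference at least this extreme."""
    return 2.0 * (1.0 - normal_cdf(abs(observed), sd=se))


def interval_probability(center, half_width=0.1, se=SE):
    """Probability under the null of landing in a fixed-width interval
    around `center`, e.g. [2.9, 3.1] for an observed 3 kg."""
    return (normal_cdf(center + half_width, sd=se)
            - normal_cdf(center - half_width, sd=se))


if __name__ == "__main__":
    print("tail area at 3 kg:      ", two_sided_tail_area(3.0))
    print("P([2.9, 3.1]) under H0: ", interval_probability(3.0))
    print("P([-0.1, 0.1]) under H0:", interval_probability(0.0))
```

Under these assumptions the interval probability is small even when the observed difference is exactly zero, because it mostly reflects the arbitrary width of the interval. The tail area at least has a fixed reference scale, which is one reason it ends up being the conventional choice.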

“Chatting with the Tea Party”

I got an email last month offering two free tickets to the preview of a new play, Chatting with the Tea Party, described as “a documentary-style play about a New York playwright’s year attending Tea Party meetings around the country and interviewing local leaders. Nothing the Tea Party people in the play say has been made up.”

I asked if they could give me 3 tickets and they did, and I went with two family members.

I won’t be spoiling much if I share the plot: self-described liberal playwright talks with liberal friends during the rise of the conservative Tea Party movements, realizes he doesn’t know any Tea Party activists himself, so during his random travels around the country (as a playwright, he’s always going to some performance or workshop or another), he arranges meetings with Tea Party activists in different places. Some of these people say reasonable things, some of them say rude things, many have interesting personal stories. No issue attitudes get changed, but issues get explored.

The play, directed by Lynnette Barkley, had four actors; one played the role of the playwright, the others did the voices of the people he met. They did the different voices pretty well: each time it seemed like a new person. If Anna Deavere Smith or Mel Blanc had been there to do all the voices, it would’ve been amazing, but these actors did the job. And the playwright, Rich Orloff, did a good job compressing so many hours of interviews to yield some intense conversations.

There were two things that struck me during the watching of the play.

First, it also would’ve been interesting to see the converse: a conservative counterpart of the reasonable, pragmatic Orloff interviewing liberal activists. I could imagine a play that cut back and forth between the two sets of scenes. The play did have some scenes with Orloff’s know-nothing liberal NYC friends, but I think it would’ve worked better for them to be confronting an actual conservative, rather than just standing there expressing their biases.

Second, I was struck by how different the concerns of 2009-2010 were, compared to the live political issues now. Back then, it was all about the national debt, there were 3 trillion dollars being released into the economy, everything was gonna crash. Now the concerns seem more to do with national security and various long-term economic issues, but nothing like this spending-is-out-of-control thing. I guess this makes sense: with a Republican-controlled Congress, there’s less concern that spending will get out of control. In any case, the central issues have changed. There’s still polarization, though, and still space for literary explorations of the topic. As a person who has great difficulty remembering exact dialogue myself, I’m impressed with a play that can capture all these different voices.

Where the fat people at?


Pearly Dhingra points me to this article, “The Geographic Distribution of Obesity in the US and the Potential Regional Differences in Misreporting of Obesity,” by Anh Le, Suzanne Judd, David Allison, Reena Oza-Frank, Olivia Affuso, Monika Safford, Virginia Howard, and George Howard, who write:

Data from BRFSS [the Behavioral Risk Factor Surveillance System] suggest that the highest prevalence of obesity is in the East South Central Census division; however, direct measures suggest higher prevalence in the West North Central and East North Central Census divisions. The regions’ relative ranking of obesity prevalence differs substantially between self-reported and directly measured height and weight.

And they conclude:

Geographic patterns in the prevalence of obesity based on self-reported height and weight may be misleading, and have implications for current policy proposals.

Interesting. Measurement error is important.

But, hey, what’s with this graph:

[Image: the article’s parallel coordinates plot]

Who made this monstrosity? Ed Wegman?

I can’t imagine a clearer case for a scatterplot. Ummmm, OK, here it is:


Hmmm, I don’t see the claimed pattern between region of the country and discrepancy between the measures.

Maybe things will be clearer if we remove outlying Massachusetts:


Maryland’s a judgment call; I count my home state as northeastern but the cited report places it in the south. In any case, I think the scatterplot is about a zillion times clearer than the parallel coordinates plot (which, among other things, throws away information by reducing all the numbers to ranks).

P.S. Chris in comments suggests redoing the graphs with same scale on the two axes. Here they are:


It’s a tough call. These new graphs make the differences between the two assessments more clear, but then it’s harder to compare the regions. It’s fine to show both, I guess.

Hey—go to Iceland and work on glaciers!


Egil Ferkingstad and Birgir Hrafnkelsson write:

We have an exciting PhD position here at the University of Iceland on developing Bayesian hierarchical spatio-temporal models to the field of glaciology. Havard Rue at NTNU, Trondheim and Chris Wikle at the University of Missouri will also be part of the project.

The Department of Mathematics at the University of Iceland (UI) seeks applicants for a fully funded 3 year PhD position for the project Statistical Models for Glaciology.

The student will develop Bayesian hierarchical spatio-temporal models to the field of glaciology, working with a consortium of experts at the University of Iceland, the University of Missouri and the Norwegian University of Science and Technology. The key people in the consortium are Prof. Birgir Hrafnkelsson at UI, Prof. Chris Wikle, and Prof. Håvard Rue, experts in spatial statistics and Bayesian computation. Another key person is Prof. Gudfinna Adalgeirsdottir at UI, an expert in glaciology. The Glaciology group at UI possesses extensive data and knowledge about the Icelandic glaciers.

The application deadline is February 29, 2016.

Detailed project description:

Job ad with information on how to apply:

It’s a good day for cold research positions.

Summer internship positions for undergraduate students with Aki

There are a couple of cool summer internship positions for undergraduate students (BSc level) in the Probabilistic Machine Learning group at Aalto (Finland) with me (Aki) and Samuel Kaski. Possible research topics are related to Bayesian inference, machine learning, Stan, disease risk prediction, personalised medicine, computational biology, contextual information retrieval, information visualization, etc. Application deadline 18 February. See more here.

Stunning breakthrough: Using Stan to map cancer screening!

Paul Alper points me to this article, Breast Cancer Screening, Incidence, and Mortality Across US Counties, by Charles Harding, Francesco Pompei, Dmitriy Burmistrov, Gilbert Welch, Rediet Abebe, and Richard Wilson.

Their substantive conclusion is there’s too much screening going on, but here I want to focus on their statistical methods:

Spline methods were used to model smooth, curving associations between screening and cancer rates. We believed it would be inappropriate to assume that associations were linear, especially since nonlinear associations often arise in ecological data. In detail, univariate thin-plate regression splines (negative binomial model to accommodate overdispersion, log link, and person-years as offset) were specified in the framework of generalized additive models and fitted via restricted maximum likelihood, as implemented in the mgcv package in R. . . .

To summarize cross-sectional changes in incidence and mortality, we evaluated the mean rate differences and geometric mean relative rates (RRs) associated with a 10–percentage point increase in the extent of screening across the range of data (39%-78% screening). The 95% CIs were calculated by directly simulating from the posterior distribution of the model coefficients (50 000 replicates conditional on smoothing parameters).

Can someone get these data and re-fit in Stan? I have no reason to think the published analysis by Harding et al. has any problems; I just think it would make sense to do it all in Stan, as this would be a cleaner workflow and easier to apply to new problems.

P.S. See comments for some discussions by Charles Harding, author of the study in question.

When does peer review make no damn sense?

Disclaimer: This post is not peer reviewed in the traditional sense of being vetted for publication by three people with backgrounds similar to mine. Instead, thousands of commenters, many of whom are not my peers—in the useful sense that, not being my peers, your perspectives are different from mine, and you might catch big conceptual errors or omissions that I never even noticed—have the opportunity to point out errors and gaps in my reasoning, to ask questions, and to draw out various implications of what I wrote. Not “peer reviewed”; actually peer reviewed and more; better than peer reviewed.


Last week we discussed Simmons and Simonsohn’s survey of some of the literature on the so-called power pose, where they wrote:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.


Even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it.

The first response of one of the power-pose researchers was:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. . . . I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

This response was pleasant enough but I found it unsatisfactory because it did not even consider the possibility that her original finding was spurious.

After Kaiser Fung and I publicized Simmons and Simonsohn’s work in Slate, the power-pose author responded more forcefully:

The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture.

Amy Cuddy, the author of this response, did not at any place explain how Simmons and Simonsohn were “flat-out wrong,” nor did she list even one of the mistakes with which their analyses were “riddled.”

Peer review

The part of the above quote I want to focus on, though, is the phrase “non-peer-reviewed.” Peer reviewed papers have errors, of course (does the name “Daryl Bem” ring a bell?). Two of my own published peer-reviewed articles had errors so severe as to destroy their conclusions! But that’s ok, nobody’s claiming perfection. The claim, I think, is that peer-reviewed articles are much less likely to contain errors, as compared to non-peer-reviewed articles (or non-peer-reviewed blog posts). And the claim behind that, I think, is that peer review is likely to catch errors.

And this brings up the question I want to address today: What sort of errors can we expect peer review to catch?

I’m well placed to answer this question as I’ve published hundreds of peer-reviewed papers and written thousands of referee reports for journals. And of course I’ve also done a bit of post-publication review in recent years.

To jump to the punch line: the problem with peer review is with the peers.

In short, if an entire group of peers has a misconception, peer review can simply perpetuate error. We’ve seen this a lot in recent years. For example, that paper on ovulation and voting was reviewed by peers who didn’t realize the implausibility of 20-percentage-point vote swings during the campaign, peers who also didn’t know about the garden of forking paths. That paper on beauty and sex ratio was reviewed by peers who didn’t know much about the determinants of sex ratio and didn’t know much about the difficulties of estimating tiny effects from small sample sizes.

OK, let’s step back for a minute. What is peer review good for? Peer reviewers can catch typos, they can catch certain logical flaws in an argument, they can notice the absence of references to the relevant literature—that is, the literature that the peers are familiar with. That’s how the peer reviewers for that psychology paper on ovulation and voting didn’t catch the error of claiming that days 6-14 were the most fertile days of the cycle: these reviewers were peers of the people who made the mistake in the first place!

Peer review has its place. But peer reviewers have blind spots. If you want to really review a paper, you need peer reviewers who can tell you if you’re missing something within the literature—and you need outside reviewers who can rescue you from groupthink. If you’re writing a paper on himmicanes and hurricanes, you want a peer reviewer who can connect you to other literature on psychological biases, and you also want an outside reviewer—someone without a personal and intellectual stake in you being right—who can point out all the flaws in your analysis and can maybe talk you out of trying to publish it.

Peer review is subject to groupthink, and peer review is subject to incentives to publish things that the reviewers are already working on.

This is not to say that a peer-reviewed paper is necessarily bad—I stand by over 99% of my own peer-reviewed publications!—rather, my point is that there are circumstances in which peer review doesn’t give you much.

To return to the example of power pose: There are lots of papers in this literature and there’s a group of scientists who believe that power pose is real, that it’s detectable, and indeed that it can help millions of people. There’s also a group of scientists who believe that any effects of power pose are small, highly variable, and not detectable by the methods used in the leading papers in this literature.

Fine. Scientific disagreements exist. Replication studies have been performed on various power-pose experiments (indeed, it’s the null result from one of these replications that got this discussion going), and the debate can continue.

But, my point here is that peer review doesn’t get you much. The peers of the power-pose researchers are . . . other power-pose researchers. Or researchers on embodied cognition, or on other debatable claims in experimental psychology. Or maybe other scientists who don’t work in this area but have heard good things about it and want to be supportive of this work.

And sometimes a paper will get unsupportive reviews. The peer review process is no guarantee. But then authors can try again until they get those three magic positive reviews. And peer review—review by true peers of the authors—can be a problem, if the reviewers are trapped in the same set of misconceptions, the same wrong framework.

To put it another way, peer review is conditional. Papers in the Journal of Freudian Studies will give you a good sense of what Freudians believe, papers in the Journal of Marxian Studies will give you a good sense of what Marxians believe, and so forth. This can serve a useful role. If you’re already working in one of these frameworks, or if you’re interested in how these fields operate, it can make sense to get the inside view. I’ve published (and reviewed papers for) the journal Bayesian Analysis. If you’re anti-Bayesian (not so many of these anymore), you’ll probably think all these papers are a crock of poop and you can ignore them, and that’s fine.

(Parts of) the journals Psychological Science and PPNAS have been the house organs for a certain variety of social psychology that a lot of people (not just me!) don’t really trust. Publication in these journals is conditional on the peers who believe the following equation:

“p less than .05” + a plausible-sounding theory = science.

Lots of papers in recent years by Uri Simonsohn, Brian Nosek, John Ioannidis, Katherine Button, etc etc etc., have explored why the above equation is incorrect.

But there are some peers that haven’t got the message yet. Not that they would endorse the above statement when written as crudely as in that equation, but I think this is how they’re operating.

And, perhaps more to the point, many of the papers being discussed are several years or even decades old, dating back to a time when almost nobody (myself included) realized how wrong the above equation is.

Back to power pose

And now back to the power pose paper by Carney et al. It has many garden-of-forking-paths issues (see here for a few of them). Or, as Simonsohn would say, many researcher degrees of freedom.

But this paper was published in 2010! Who knew about the garden of forking paths in 2010? Not the peers of the authors of this paper. Maybe not me either, had it been sent to me to review.

What we really needed (and, luckily, we can get) is post-publication review: not peer reviews, but outside reviews, in this case reviews by people who are outside of the original paper both in research area and in time.

And also this, from another blog comment:

It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble the reported chi-square is 3.86 which has an associated p-value of .04945.

Of course, this reported chi-square value does not seem to match the data because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with this data is 3.7667 and not 3.86 as reported in the paper. The associated p-value is .052 – not less than .05.
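The commenter’s arithmetic is easy to check. Here is a minimal sketch using only the Python standard library: it computes the Pearson chi-square for the 2 x 2 table of counts quoted above (no continuity correction, which is an assumption about what the commenter did) and converts it to a p-value on 1 degree of freedom. The function names are mine.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],
         [c, d]]
    using the shortcut n * (ad - bc)^2 / (product of the four margins)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square with 1 degree of freedom.
    With 1 df, chi2 is a squared standard normal, so the tail area
    is erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts as quoted in the comment:
# high power pose: 19 gambled, 3 did not; low power pose: 12 gambled, 8 did not.
recomputed_chi2 = chi_square_2x2(19, 3, 12, 8)
recomputed_p = p_value_df1(recomputed_chi2)

# The value reported in the paper, per the comment.
reported_p = p_value_df1(3.86)

if __name__ == "__main__":
    print("recomputed chi-square:", round(recomputed_chi2, 4))
    print("recomputed p-value:   ", round(recomputed_p, 4))
    print("p-value for 3.86:     ", round(reported_p, 5))
```

The recomputed statistic lands just on the other side of .05 from the reported one, which is the commenter’s point.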

You can’t expect peer reviewers to check these sorts of calculations—it’s not like you could require authors to supply their data and an R or Stata script to replicate the analyses, ha ha ha. The real problem is that the peer reviewers were sitting there, ready to wave past the finish line a result with p less than .05, which provides an obvious incentive for the authors to get p less than .05, one way or another.

Commenters also pointed out an earlier paper by one of the same authors, this time on stereotypes of the elderly, from 2005, that had a bunch more garden-of-forking-paths issues and also misreported two t statistics: the actual values were something like 1.79 and 3.34; the reported values were 5.03 and 11.14! Again, you can’t expect peer reviewers to catch these problems (nobody was thinking about forking paths in 2005, and who’d think to recalculate a t statistic?), but outsiders can find them, and did.

At this point one might say that this doesn’t matter, that the weight of the evidence, one way or another, can’t depend on whether a particular comparison in one paper was or was not statistically significant—but if you really believe this, what does it say about the value of the peer-reviewed publication?

Again, I’m not saying that peer review is useless. In particular, peers of the authors should be able to have a good sense of how the storytelling theorizing in the article fits in with the rest of the literature. Just don’t expect peers to do any assessment of the evidence.

Linking as peer review

Now let’s consider the Simmons and Simonsohn blog post. It’s not peer reviewed—except it kinda is! Kaiser Fung and I chose to cite Simmons and Simonsohn in our article. We peer reviewed the Simmons and Simonsohn post.

This is not to say that Kaiser and I are certain that Simmons and Simonsohn made no mistakes in that post; peer review never claims that sort of perfection.

But I’d argue that our willingness to cite Simmons and Simonsohn is a stronger peer review than whatever was done for those two articles cited above. I say this not just because those papers had demonstrable errors which affect their conclusions (and, yes, in the argot of psychology papers, if a p-value shifts from one side of .05 to the other, it does affect the conclusions).

I say this also because of the process. When Kaiser and I cite Simmons and Simonsohn in the way that we do, we’re putting a little bit of our reputation on the line. If Simmons and Simonsohn made consequential errors—and, hey, maybe they did, I didn’t check their math, any more than the peer reviewers of the power pose papers checked their math—that rebounds negatively on us, that we trusted something untrustworthy. In contrast, the peer reviewers of those two papers are anonymous. The peer review that they did was much less costly, reputationally speaking, than ours. We have skin in the game, they do not.

Beyond this, Simmons and Simonsohn say exactly what they did, so you can work it out yourself. I trust this more than the opinions of 3 peers of the authors in 2010, or 3 other peers in 2005.


Peer review can serve some useful purposes. But to the extent the reviewers are actually peers of the authors, they can easily have the same blind spots. I think outside review can serve a useful purpose as well.

If the authors of many of these PPNAS or Psychological Science-type papers really don’t know what they’re doing (as seems to be the case), then it’s no surprise that peer review will fail. They’re part of a whole peer group that doesn’t understand statistics. So, from that perspective, perhaps we should trust “peer review” less than we should trust “outside review.”

I am hoping that peer review in this area will improve, given the widespread discussion of researcher degrees of freedom and garden of forking paths. Even so, though, we’ll continue to have a “legacy” problem of previously published papers with all sorts of problems, up to and including t statistics misreported by factors of 3. Perhaps we’ll have to speak of “post-2015 peer-reviewed articles” and “pre-2015 peer-reviewed articles” as different things?

On deck this week

Mon: When does peer review make no damn sense?

Tues: Stunning breakthrough: Using Stan to map cancer screening!

Wed: Where the fat people at?

Thurs: The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Fri: What’s the difference between randomness and uncertainty?

Sat: You’ll never guess what I say when I have nothing to say

Sun: I refuse to blog about this one

I don’t know about you, but I love these blog titles. Each week I put together this “on deck” post and I get interested all over again in these topics. I wrote most of these so many months ago, I have no idea what’s in them. I’m looking forward to these posts almost as much as you are!

What a great way to start the work week.

Ted Cruz angling for a position in the Stanford poli sci department

In an amusing alignment of political and academic scandals, presidential candidate Ted Cruz was blasted for sending prospective voters in the Iowa Caucus this misleading mailer:


Which reminds me of the uproar two years ago when a couple of Stanford political science professors sent prospective Montana voters this misleading mailer:


I don’t know which is worse: having a “voting violation” in Iowa or being almost as far left as Barack Obama in Montana.

There is well known research in political science suggesting that shaming people can motivate them to come out and vote, so I can understand how Cruz can describe this sort of tactic as “routine.”

It’s interesting, though: In 2014, some political scientists got into trouble by using campaign-style tactics in a nonpartisan election (and also for misleading potential voters by sending them material with the Montana state seal). In 2016, a political candidate is getting into trouble by using political-science-research tactics in a partisan election (and also for misleading potential voters with a “VOTING VIOLATION” note).

What’s the difference between Ted Cruz and a Stanford political scientist?

Some people wrote to me questioning the link I’m drawing above between Cruz and the Stanford political scientists. So let me emphasize that I know of no connections here. I don’t even know if Cruz has any political scientists on his staff, and I’m certainly not trying to suggest that the Stanford profs in question are working for Cruz or for any other presidential candidate. I have no idea. Nor would I think it a problem if they are. I was merely drawing attention to the similarities between Cruz’s item and the Montana mailer from a couple years back.

I do think what Cruz did is comparable to what the political scientists did.

There are some differences:

1. Different goals: Cruz wants to win an election, the political scientists wanted to do research.

2. Different time frames: Cruz is in a hurry and got sloppy, the political scientists had more time and could be more careful with the information on their mailers.

But I see two big similarities:

1. Research-based manipulation of voters: Cruz is working off of the effects of social pressure on turnout, the political scientists were working off the effects of perceived ideology on turnout.

2. Misleading information: Cruz is implying that people have some sort of obligation to vote, the political scientists were implying that their mailer was coming from the State of Montana.

Postdoc opportunity with Sophia Rabe-Hesketh and me in Berkeley!

Sophia writes:

Mark Wilson, Zach Pardos and I are looking for a postdoc to work with us on a range of projects related to educational assessment and statistical modeling, such as Bayesian modeling in Stan (joint with Andrew Gelman).

See here for more details.

We will accept applications until February 26.

The position is for 15 months, starting this Spring. To be eligible, applicants must be U.S. citizens or permanent residents.

Empirical violation of Arrow’s theorem!


Regular blog readers know about Arrow’s theorem, which is that any result can be published no more than five times.

Well . . . I happened to be checking out Retraction Watch the other day and came across this:

“Exactly the same clinical study” published six times

Here’s the retraction notice in the journal Inflammation:

This article has been retracted at the request of the Editor-in-Chief.

The authors have published results from exactly the same clinical study and patient population in 6 separate articles, without referencing the publications in any of the later articles:

1. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Olmesartan/amlodipine combination versus olmesartan or amlodipine monotherapies on blood pressure and insulin resistance in a sample of hypertensive patients. Clinical and Experimental Hypertension 35: 301–307. doi:10.3109/10641963.2012.721841.

2. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Effects of an olmesartan/amlodipine fixed dose on blood pressure control, some adipocytokines and interleukins levels compared with olmesartan or amlodipine monotherapies. Journal of Clinical Pharmacy and Therapeutics 38: 48–55. doi:10.1111/jcpt.12021.

3. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Variation of some inflammatory markers in hypertensive patients after 1 year of olmesartan/amlodipine single-pill combination compared with olmesartan or amlodipine monotherapies. Journal of the American Society of Hypertension 7: 32–39. doi:10.1016/j.jash.2012.11.006.

4. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Evaluation of safety and efficacy of a fixed olmesartan/amlodipine combination therapy compared to single monotherapies. Expert Opinion on Drug Safety 12: 621–629. doi:10.1517/14740338.2013.816674.

5. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2014. Different aspects of sartan + calcium antagonist association compared to the single therapy on inflammation and metabolic parameters in hypertensive patients. Inflammation 37: 154–162. doi:10.1007/s10753-013-9724-x.

6. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2014. Results from a 12 months, randomized, clinical trial comparing an olmesartan/amlodipine single pill combination to olmesartan and amlodipine monotherapies on blood pressure and inflammation. European Journal of Pharmaceutical Sciences 51: 26–33. doi:10.1016/j.ejps.2013.08.031.

In addition, the article in Inflammation contains results published especially in articles 2 and 6, which is the main reason for retraction of the article in Inflammation.

The publisher apologizes for the inconvenience caused.

From my perspective, though, it’s all worth it to see a counterexample to a longstanding theorem. Bruno Frey must be soooooo jealous right now.

P.S. I don’t think it’s so horrible to publish similar material in different places. Not everyone reads every article and so it can be good to reach different audiences. But if you have multiple versions of an article, you should make that clear. Otherwise you’re poisoning the meta-analytic well.

TOP SECRET: Newly declassified documents on evaluating models based on predictive accuracy

We recently had an email discussion among the Stan team regarding the use of predictive accuracy in evaluating computing algorithms. I thought this could be of general interest so I’m sharing it here.

It started when Bob said he’d been at a meeting on probabilistic programming where there was confusion on evaluation. In particular, some of the people at the meeting had the naive view that you could just compare everything on cross-validated proportion-predicted-correct for binary data.

But this won’t work, for three reasons:

1. With binary data, cross-validation is noisy. Model B can be much better than model A but the difference might barely show up in the empirical cross-validation, even for a large data set. Wei Wang and I discuss that point in our article, Difficulty of selecting among multilevel models using predictive accuracy.

2. 0-1 loss is not in general a good measure. You can see this by supposing you’re predicting a rare disease. Upping the estimated probability from 1 in a million to 1 in a thousand will have zero effect on your 0-1 loss (your best point prediction is 0 in either case) but it can be a big real-world improvement.

3. And, of course, a corpus is just a corpus. What predicts well in one corpus might not generalize. That’s one reason we like to understand our predictive models if possible.
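Point 2 can be made concrete with a small calculation. The following is a minimal sketch, assuming a hypothetical true prevalence of 1 in a thousand and a point prediction of "disease" only when the estimated probability exceeds 0.5; the function names are illustrative and not from any library.

```python
import math

def expected_zero_one_loss(p_pred, true_rate):
    """Expected 0-1 loss when we predict 1 only if p_pred > 0.5."""
    predict_positive = p_pred > 0.5
    # Predicting 0 is wrong exactly when the event occurs, and vice versa.
    return (1 - true_rate) if predict_positive else true_rate

def expected_log_loss(p_pred, true_rate):
    """Expected negative log probability assigned to the realized outcome."""
    return -(true_rate * math.log(p_pred)
             + (1 - true_rate) * math.log(1 - p_pred))

true_rate = 1e-3  # assumed true prevalence (hypothetical)
for p in (1e-6, 1e-3):
    print(p, expected_zero_one_loss(p, true_rate), expected_log_loss(p, true_rate))
```

Under these assumptions the 0-1 loss is identical for the two estimates, since both lead to the same point prediction, while the expected log loss is lower for the estimate that matches the assumed prevalence.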

Bob in particular felt strongly about point 1 above. He wrote:

Given that everyone (except maybe those SVM folks) is doing *probabilistic* programming, why not use log loss? That’s the metric that most of the Kaggle competitions moved to. It tests how well calibrated the probability statements of a model are in a way that 0/1 loss, squared error, and ROC curve metrics like mean precision don’t.

My own story dealing with this involved a machine learning researcher trying to predict industrial failures who built a logistic regression where the highest likelihood of a component failure was 0.2 or so. They were confused because the model didn’t seem to predict any failures at all, which seemed wrong. That’s just a failure to think in terms of expectations (20 parts with a 20% chance of failure each would lead to 4 expected failures). I also tried explaining that the model may be well calibrated and there may not be a part that has more than a 20% chance of failure. But they wound up doing what PPAML’s about to do for the image tagging task, namely compute some kind of ROC curve evaluation based on varying thresholds, which of course, doesn’t measure how well calibrated the probabilities are, because it’s only sensitive to ranking.
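Both of Bob's points here can be checked on toy numbers. This is a rough sketch with made-up data (ten hypothetical parts, two of which fail); `auc` and `log_loss` are hand-rolled helpers written for this illustration, not library calls. Dividing every probability by 10 keeps the ranking intact, so the area under the ROC curve cannot change, but the log loss does.

```python
import math

def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(probs, labels):
    """Average negative log probability assigned to the observed outcomes."""
    return -sum(math.log(p if y == 1 else 1 - p)
                for p, y in zip(probs, labels)) / len(labels)

# Expected number of failures among 20 parts, each with a 20% failure chance.
expected_failures = sum([0.2] * 20)

# Made-up predictions and outcomes for ten hypothetical parts.
probs = [0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.05, 0.05, 0.04, 0.03]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Same ranking, very different probability statements.
shrunk = [p / 10 for p in probs]

print(expected_failures)
print(auc(probs, labels), auc(shrunk, labels))
print(log_loss(probs, labels), log_loss(shrunk, labels))
```

The ranking metric treats the two sets of predictions as equivalent; the log loss penalizes the shrunken probabilities for understating how often failures occur.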

Tom Dietterich concurred:

Regarding holdout likelihood, yes, this is an excellent suggestion. We have evaluated on hold-out likelihood on some of our previous challenge problems. In CP6, we focused on the other metrics (mAP and balanced error rate) because that is what the competing “machine learning” methods employed.

Within the machine learning/computer vision/natural language processing communities, there is a wide-spread belief that fitting to optimize metrics related to the specific decision problem in the application is a superior approach. It would be interesting to study that question more deeply.

To which Bob elaborated:

I completely agree, which is why I don’t like things like mean average precision (MAP), balanced 0/1 loss, and balanced F measure, none of which relate to any relevant decision problem.

It’s also why I don’t like 0/1 loss (either straight up, through balanced F measures, through macro-averaged F measure, etc.), because that’s never the operating point anyone wants. At least in 10 years working in industrial machine learning, it was never the decision problem anyone wanted. Customers almost always had asymmetric utility for false positives and false negatives (think epidemiology, suggesting search spelling corrections, speech recognition in an online dialogue system for airplane reservations, etc.) and wanted to operate at either very high precision (positive predictive accuracy) or very high recall (sensitivity). No customer or application I’ve ever seen other than writing NIPS or Computational Linguistics papers ever cared about balanced F measure in a large data set in an application.

The advantage of log loss is that it is a better measure for generic decision making than area under the curve because it measures how well calibrated the probabilistic inferences are. Well-calibrated inferences are optimal for all decision operating points assuming you want to make Bayes-optimal decisions to maximize expected utility while minimizing risk. There’s a ton of theory around this, starting with Berger’s influential book on Bayesian decision theory from the 1980s. And it doesn’t just apply to Bayesian models, though almost everything in the machine learning world can be viewed as an approximate Bayesian technique.

Being Bayesian, the log loss isn’t a simple log likelihood with point estimated parameters plugged in (a popular approximate technique in the machine learning world), but a true posterior predictive estimate as I described in my paper. Of course, if your computing power isn’t up to it, you can approximate with point estimates and log loss by treating your posterior as a delta function around its mean (or even mode if you can’t even do variational inference).

Sometimes ranking is enough of a proxy for decision making, which is why mean average precision (truncated to high precision, say average precision at 5) is relevant for some search apps, such as Google’s, and mean average precision (truncated to high recall) is relevant to other search apps, such as that of a biology post-doc or an intelligence analyst. I used to do a lot of work with DoD and DARPA and they were quite keen to have very very high recall — the intelligence analysts really didn’t like systems that had 90% recall so that 10% of the data were missed! At some points, I think they kept us in the evaluations because we provided an exact boolean search that had 100% recall, so they could look at the data, type in a phrase, and be guaranteed to find it. That doesn’t work with first-pass first-best analyses.
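Bob's claim that calibrated probabilities serve every operating point can be illustrated with the standard expected-cost rule: flag a case when the expected cost of a miss exceeds the expected cost of a false alarm, which works out to a probability threshold of cost_fp / (cost_fp + cost_fn). The sketch below assumes made-up costs and hypothetical function names; it is an illustration of the idea, not anyone's production rule.

```python
def bayes_threshold(cost_fp, cost_fn):
    """Probability above which flagging has lower expected cost than ignoring."""
    return cost_fp / (cost_fp + cost_fn)

def expected_costs(p, cost_fp, cost_fn):
    """Expected cost of flagging vs. ignoring a case with event probability p."""
    return {"flag": (1 - p) * cost_fp, "ignore": p * cost_fn}

def decide(p, cost_fp, cost_fn):
    """Flag the case only if p exceeds the cost-derived threshold."""
    return p > bayes_threshold(cost_fp, cost_fn)

p = 0.2  # a calibrated failure probability, as in the anecdote above

print(decide(p, cost_fp=1, cost_fn=1))   # symmetric costs (assumed)
print(decide(p, cost_fp=1, cost_fn=10))  # a miss costs 10x a false alarm (assumed)
print(expected_costs(p, cost_fp=1, cost_fn=10))
```

The same probability of 0.2 leads to different actions as the costs change, which is why a single fixed threshold such as 0.5 rarely matches the decision problem a customer actually has.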

I suggested to Bob that he blog this but then we decided it would be more time-efficient for me to do it. The only thing is, then it won’t appear till October.

P.S. Here are Bob’s slides from that conference. He spoke on Stan.

Placebo effect shocker: After reading this, you won’t know what to believe.

Martha Smith writes:

Yesterday’s BBC News Magazine featured an article by William Kremer entitled, “Why are placebos getting more effective?”, which looks like a possibility for a blog post discussing how people treat surprising effects. The article asserts that the placebo effect has been increasing, especially in the U.S.

The author asks, “Why? What could it be about Americans that might make them particularly susceptible to the placebo effect?” then gives lots of speculation. This might be characterized as “I believe the effect is real, so I’ll look for possible causes.”

However, applying the skeptical maxim, “If an effect is surprising, it’s probably false or overestimated,” I quickly came up with two plausible reasons why the “increasing effect of placebos” might be apparent rather than real:

1. The statistical significance filter could operate indirectly: One reason a study comparing treatment with placebo might get through the statistical significance filter is because it happens to have an uncharacteristically small placebo effect. Thus small placebo effects are likely to be overrepresented in published studies; a later replication of such a study is likely to show a larger (but more typical) placebo effect.

2. If early studies are not blinded but later studies are, the earlier studies would be expected to show deflated effects for placebo but inflated effects for treatment.
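The first of these reasons can be checked with a quick simulation. This is a minimal sketch under made-up assumptions: every study has the same true placebo and treatment responses, arm means are drawn directly from their sampling distributions with a known standard deviation, and a study counts as "published" only if the treatment-minus-placebo z-score exceeds 1.96. All parameter values are hypothetical.

```python
import random
import statistics

def simulate(n_studies=20000, n_per_arm=25, mu_placebo=1.0, mu_treat=1.3,
             sd=1.0, seed=0):
    """Return mean observed placebo response: all studies vs. 'published' ones."""
    rng = random.Random(seed)
    se_arm = sd / n_per_arm ** 0.5   # standard error of each arm's mean
    se_diff = se_arm * 2 ** 0.5      # standard error of the difference
    all_placebo, published_placebo = [], []
    for _ in range(n_studies):
        placebo = rng.gauss(mu_placebo, se_arm)
        treat = rng.gauss(mu_treat, se_arm)
        all_placebo.append(placebo)
        # Significance filter: keep only studies with z > 1.96.
        if (treat - placebo) / se_diff > 1.96:
            published_placebo.append(placebo)
    return statistics.mean(all_placebo), statistics.mean(published_placebo)

mean_all, mean_published = simulate()
print(mean_all, mean_published)
```

In this setup the published studies show a noticeably smaller average placebo response than the full set, even though the true placebo response never changes, so an unfiltered replication would tend to show a larger one.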

My reply: There’s something about this placebo thing that just keeps confusing me. So I’ll stay out of this one, except to post the above note to give everyone something to think about today.

One thing I like about hierarchical modeling is that it is not just about criticism. It’s a way to improve inferences, not just a way to adjust p-values.

In an email exchange regarding the difficulty many researchers have in engaging with statistical criticism (see here for a recent example), a colleague of mine opined:

Nowadays, promotion requires more publications, and in an academic environment, researchers are asked to do more than they can. So many researchers just work like workers in a product line without critical thinking. Quality becomes a tradeoff of quantity.

I replied:

I think that many (maybe not all) researchers are interested in critical thinking, but they don’t always have a good framework for integrating critical thinking into their research. Criticism is, if anything, too easy: once you’ve criticized, what do you do about it (short of “50 shades of gray” self-replication, which really is a lot of work)? One thing I like about hierarchical modeling is that it is not just about criticism. It’s a way to improve inferences, not just a way to adjust p-values.

The point is that in this way criticism can be a step forward.

When we go through the literature (or even all the papers by a particular author) and list all the different data-coding, data-exclusion, and data-analysis rules that were done (see comment thread from above link for a long list of examples of data excluded or included, outcomes treated separately or averaged, variables controlled for or not, different p-value thresholds, etc.), it’s not just about listing multiple comparisons and criticizing p-values (which ultimately only gets you so far, because even correct p-values bear only a very indirect relation to any inferences of interest), it’s also about learning more from data, constructing a fuller model that includes all the possibilities corresponding to the different theories. Or even just recognizing that a particular dataset with a particular small sample and noisy, variable measurements, is too weak to learn what you want to learn. That can be good to know too: if it’s a topic you really care about, you can devote some effort to more careful measurement, or at least know the limitations of your data. All good—the point is to make the link to reality rather than to try to compute some correct p-value, which has little to do with anything.

Is a 60% risk reduction really no big deal?

Paul Alper writes:

Here’s something really important.

Notice how meaningless the numbers can be. Referring to a 60% risk reduction in flu due to the flu vaccine:

As for the magical “60?” Dr. Tom Jefferson didn’t mince words: “Sorry I have no idea where the 60% comes from – it’s either pure propaganda or bandied about by people who do not understand epidemiology. In both cases they should not be making policy as they do not know what they are talking about,” he said, insisting that I quote him.

Or, you could look here:

Researchers reported in the New England Journal of Medicine (August 14, 2014) that a high dose flu vaccine was more effective than the standard flu vaccine for seniors. The vaccine is called Fluzone High Dose vaccine. Of course, the media jumped on this report. In the Healthday article, the chief medical officer for Sanofi-Pasteur—the Big Pharma company who funded the study—stated, “The study demonstrated a 24 percent reduction [emphasis added] in influenza illness among the participants who received the high-dose vaccine compared to those who received the standard dose.” . . .

1.4% of the seniors who received the high dose vaccine became ill with the flu and 1.9% of the seniors who received the standard flu vaccine developed the flu. (I hope you are not laughing as I did when I read that.) How in the world did they report a 24% lowered incidence of the flu with the use of the high-dose vaccine? Simply dividing 1.4% by 1.9% gives the relative risk reduction of 24%. However, this is a relative risk reduction—a useless number to use when deciding whether a therapy is good for any patient.

Dr. Brownstein, the source of the above quote, is an advocate of holistic medicine and supplements, so he says:

Folks, don’t be fooled here. This study was another failed flu vaccine study. The flu vaccine has never been shown to protect the elderly from getting the flu, dying from the flu, or developing complications from the flu. The elderly would be better served by eating a better diet, maintaining hydration and taking vitamin C.[!!]

I replied: I’m confused. Setting aside bias, sampling error, etc., a reduction from 1.9% to 1.4% is pretty good, no? It’s not a reduction all the way to 0, but why would this be called a failure? Also the diet, hydration, vitamin C thing seems irrelevant to the vaccine question, in that you could do all these things and also take the vaccine. What am I missing?

Alper wrote:

The actual 1.9% to 1.4% is far less impressive (and possibly not reported) than the 24% reduction touted in a press release. Relative risk is in general a misleading number. Absolute risk should always be stated as well. According to the links I listed, relative risk should never stand by itself without asking, “relative to what?”

Different example: suppose a disease is very rare, 2 in a million and a treatment reduces the incidence to 1 in a million. Huge relative risk reduction but hardly any effect on absolute risk. And don’t forget that vaccinations are not without “harms” so any benefits should also be compared to problems due to the treatment. And then there is the cost of the treatment which is often not reported.

I included that vitamin C quotation of Brownstein to indicate his “holistic” bias. Personally, I don’t trust Big Pharma at all, but the alternative medicine advocates are often so kooky that conventional medicine can look good in comparison. Besides, even great scientists (Pauling) lose all their marbles when it comes to the benefits of vitamin C.
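The distinction between relative and absolute risk is easy to compute directly. Here is a minimal sketch using the two examples from the discussion above; `risk_summary` is a hypothetical helper, and the inputs are the rounded percentages as quoted. With those rounded figures the relative reduction comes out near 26%, so the reported 24% presumably reflects the unrounded rates.

```python
def risk_summary(risk_control, risk_treated):
    """Absolute and relative risk reduction, plus number needed to treat."""
    absolute = risk_control - risk_treated
    relative = absolute / risk_control
    nnt = 1 / absolute  # people treated per one case avoided
    return {"absolute": absolute, "relative": relative, "nnt": nnt}

# High-dose vs. standard-dose flu vaccine, rounded rates as quoted.
flu = risk_summary(risk_control=0.019, risk_treated=0.014)

# Rare disease: 2 in a million reduced to 1 in a million.
rare = risk_summary(risk_control=2e-6, risk_treated=1e-6)

for name, result in (("flu", flu), ("rare", rare)):
    print(name, result)
```

The rare-disease example has the larger relative reduction (a half) but requires treating on the order of a million people to avoid one case, compared with roughly two hundred in the flu example. Neither number says anything about how serious each avoided case is, or about harms and costs.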

I replied:

Sure, that I understand, but it also depends on seriousness. Suppose for example that 1.9% of seniors _died_ of the flu in a given year. That would be a huge number, a large proportion of total deaths, and a reduction to 1.4% would likewise be a big deal. Indeed, framing it as 1.9% vs 1.4% would be a bit of a minimization.


We agree that researchers should always report relative and absolute risks as well as costs of treatment in addition to harms due to treatment. Too often, the hyping of a treatment omits costs and side effects, emphasizing relative risk reduction which always sounds more impressive. This particular case refers to high dose vaccine (for people like me, an octogenarian) vs. ordinary dose (for you youngsters). I believe the percentages refer to infection rather than death. And for no good reason other than ignorance at the time, I selected the high dose. The price was the same, i.e., zero for those on Medicare.

Certainly, with a large population the extra .5% would save a lot of people dying from a lethal infection but even then we need to know the cost of treatment. Indeed, in my very first link on this subject, a physician offers up:

What I long for—and I haven’t seen it yet—is for media coverage this season to start reporting on absolute differences related to the flu vaccine. I’d like to see how the “1-3% effectiveness of the vaccine” floats around in the public’s thought bubbles. How does that compare with something as simple as staying home and not infecting other people or washing your hands more frequently?

“Why IT Fumbles Analytics Projects”


Someone pointed me to this Harvard Business Review article by Donald Marchand and Joe Peppard, “Why IT Fumbles Analytics,” which begins as follows:

In their quest to extract insights from the massive amounts of data now available from internal and external sources, many companies are spending heavily on IT tools and hiring data scientists. Yet most are struggling to achieve a worthwhile return. That’s because they treat their big data and analytics projects the same way they treat all IT projects, not realizing that the two are completely different animals.

Interesting! I was expecting something pretty generic, but this seems to be leading in an unusual direction. Marchand and Peppard continue:

The conventional approach to an IT project, such as the installation of an ERP or a CRM system, focuses on building and deploying the technology on time, to plan, and within budget. . . . Despite the horror stories we’ve all heard, this approach works fine if the goal is to improve business processes and if companies manage the resulting organizational change effectively.

But we have seen time and again that even when such projects improve efficiency, lower costs, and increase productivity, executives are still dissatisfied. The reason: Once the system goes live, no one pays any attention to figuring out how to use the information it generates to make better decisions or gain deeper—and perhaps unanticipated—insights into key aspects of the business. . . .

Our research, which has involved studying more than 50 international organizations in a variety of industries, has identified an alternative approach to big data and analytics projects . . . rather than viewing information as a resource that resides in databases—which works well for designing and implementing conventional IT systems—it sees information as something that people themselves make valuable.

OK, I don’t know anything about their research, but I like some of their themes:

It’s crucial to understand how people create and use information. This means that project teams need members well versed in the cognitive and behavioral sciences, not just in engineering, computer science, and math.

I’m a bit miffed that they didn’t mention statistics at all here (“math”? Really??), but I’m with them in their larger point that communication is central to any serious data project. We have to move away from the idea that we do the hard stuff and then communication is just public relations. No! Communication should be “baked in” to the project, as Bob C. would say.

One more thing

One thing that Marchand and Peppard didn’t mention, but is closely related to their themes, is that people make big claims about the effect of analytics, but ironically these claims are just made up, they’re not themselves data-based. We saw this a couple years ago with a claim that “one or two patients died per week in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic.” Upon a careful look, these numbers (saving 75 people a year in a “smallish town”!) fell apart, and the person who promoted this claim has never shown up to defend it.

Hype can occur in any field, but I get particularly annoyed when someone hypes the benefits of data technology without reference to any data (or even, in this case, the name of the “smallish town”). Business books (you know, the ones you see at the airport) seem to be just full of this sort of story.