
Forking paths vs. six quick regression tips


Bill Harris writes:

I know you’re on a blog delay, but I’d like to vote to raise the odds that my question in a comment gets discussed, in case it’s not in your queue.

It’s likely just my simple misunderstanding, but I’ve sensed two bits of contradictory advice in your writing: fit one complete model all at once, and fit models incrementally, starting with the overly small.

For those of us who are working in industry and trying to stay abreast of good, current practice and thinking, this is important.

I realize it may also not be a simple question.  Maybe both positions are correct, and we don’t yet have a unifying concept to bring them together.

I am open to a sound compromise.  For example, I could imagine the need to start with EDA and small models but hold out a test set for one comprehensive model.  I recall you once wrote to me that you don’t worry much about holding out data for testing, since your field produces new data with regularity.  Others of us aren’t quite so lucky, because the data we need to use are produced parsimoniously.

Still, building the one big model, even after the discussions on sparsity and on horseshoe priors, can sound a bit like a leap of faith, although I recognize that regularization can make a big difference.


My reply:

I have so many things I really really must do, but am too lazy to do.  Things to figure out, data to study, books to write.  Every once in a while I do some work and it feels soooo good.  Like programming the first version of the GMO algorithm, or doing that simulation the other day that made it clear how the simple Markov model massively underestimates the magnitude of the hot hand (sorry, GVT!), or even buckling down and preparing R and Stan code for my classes.  But most of the time I avoid working, and during those times, blogging keeps me sane.  It’s now May in blog time, and I’m 1/4 of the way toward being Jones.

So, sure, Bill, I’ll take next Monday’s scheduled post (“Happy talk, meet the Edlin factor”) and bump it to 11 May, to make space for this one.

And now, to get to the topic at hand:  Yes, it does seem that I give two sorts of advice, but I hope they are complementary, not contradictory.

On one hand, let’s aim for hierarchical models where we study many patterns at once.  My model here is Aki’s birthday model (the one with the graphs on the cover of BDA3) where, instead of analyzing just Valentine’s Day and Halloween, we looked at all 366 days at once, also adjusting for day of week in a way that allows that adjustment to change over time.

On the other hand, we can never quite get to where we want to be, so let’s start simple and build our models up.  This happens both within a project—start simple, build up, keep going until you don’t see any benefit from complexifying your model further—and across projects, where we (statistical researchers and practitioners) gradually get comfortable with methods and can go further.

This is related to the general idea we discussed a while back (wow—it was only a year ago, blogtime flies!), that statistical analysis recapitulates the development of statistical methods.

In the old days, many decades ago, one might start by computing correlation measures and then move to regression, adding predictors one at a time.  Now we might start with a (multiple) regression, then allow intercepts to vary, then move to varying slopes.  In a few years, we may internalize multilevel models (both in our understanding and in our computation) so that they can be our starting point, and once we’ve chunked that, we can walk in what briefly will feel like seven-league boots.
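That ladder can be made concrete. Here's a minimal sketch in Python, using statsmodels and fully simulated data (the variable names, the simulated dataset, and the specific coefficient values are all invented for illustration): a correlation, then an ordinary regression, then the same regression with intercepts allowed to vary by group.

```python
# The model-building ladder: correlation -> regression -> varying intercepts.
# All data below are simulated purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 10, 20
g = np.repeat(np.arange(n_groups), n_per)
group_effect = rng.normal(0, 1, n_groups)[g]   # group-level shifts
x = rng.normal(size=n_groups * n_per)
y = 1.0 + 2.0 * x + group_effect + rng.normal(0, 0.5, n_groups * n_per)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# Step 1: the old starting point, a correlation measure
r = np.corrcoef(df.x, df.y)[0, 1]

# Step 2: ordinary (multiple) regression
ols = smf.ols("y ~ x", data=df).fit()

# Step 3: let the intercept vary by group -- a simple multilevel model
mlm = smf.mixedlm("y ~ x", data=df, groups=df.g).fit()

print(r, ols.params["x"], mlm.fe_params["x"])
```

The next rung, varying slopes, is the same call with a `re_formula="~x"` argument; the point is just that each step reuses and extends the previous one.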

Does that help?

On deck this week


Mon: Forking paths vs. six quick regression tips

Tues: Primed to lose

Wed: Point summary of posterior simulations?

Thurs: In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.

Fri: “Priming Effects Replicate Just Fine, Thanks”

Sat: Pooling is relative to the model

Sun: Hierarchical models for phylogeny: Here’s what everyone’s talking about

The above image is so great I didn’t want you to have to wait till Tues and Fri to see it.

You’ll never guess what I say when I have nothing to say

A reporter writes:

I’m a reporter working on a story . . . and I was wondering if you could help me out by taking a quick look at the stats in the paper it’s based on.

The paper is about paedophiles being more likely to have minor facial abnormalities, suggesting that paedophilia is a neurodevelopmental disorder that starts in the womb. We’re a bit concerned that the stats look weak though – small sample size, no comparison to healthy controls, large SD, etc.

If you have time, could you take a quick look and let me know if the statistics seem to be strong enough to back up their conclusions? The paper is here:

I replied: Yes, I agree, I don’t find this convincing, also it’s hard to know what to do with this. It doesn’t seem newsworthy to me. That said, I’m not an expert on this topic.

What’s the difference between randomness and uncertainty?

Julia Galef mentioned “meta-uncertainty,” and how to characterize the difference between a 50% credence about a coin flip coming up heads, vs. a 50% credence about something like advanced AI being invented this century.

I wrote: Yes, I’ve written about this probability thing. The way to distinguish these two scenarios is to embed each of them in a larger setting. The question is: how would each probability change as additional information becomes available? The coin flip is “random” to the extent that intermediate information is not available that would change the probability; indeed, the flip becomes less “random” to the extent that such information becomes available. In other settings such as the outcome of an uncertain sports competition, intermediate information could be available (for example, maybe some key participants are sick or injured), hence it makes sense to speak of “uncertainty” as well as randomness.

It’s an interesting example because people have sometimes considered this to be merely a question of “philosophy” or interpretation, but the distinction between different sources of uncertainty can in fact be encoded in the mathematics of conditional probability.
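To make that encoding concrete, here is a toy calculation (all the numbers are invented for illustration). Both events have marginal probability 0.5, but only one of them has an intermediate variable whose observation would move the probability:

```python
# Two events with the same marginal probability 0.5 that respond
# differently to intermediate information. Numbers are made up.

# Sports outcome: whether the star player is healthy is observable
# before the game, and conditioning on it changes the probability.
p_healthy = 0.5
p_win_given_healthy = 0.7
p_win_given_injured = 0.3
p_win = (p_healthy * p_win_given_healthy
         + (1 - p_healthy) * p_win_given_injured)   # marginal = 0.5

# Coin flip: no intermediate variable is available to us, so the
# conditional probability equals the marginal no matter what we learn.
p_heads = 0.5

print(p_win, p_heads)   # both 0.5, but for different reasons
```

The distinction lives entirely in the conditional structure: for the coin, every available conditioning event leaves the probability at 0.5; for the game, learning the player's status moves it to 0.7 or 0.3.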

The bejeezus


Tova Perlmutter writes of a recent online exchange:

Stat Podcast Plan


In my course on Statistical Communication and Graphics, each class had a special guest star who would answer questions on his or her area of expertise. These were not “guest lectures”—there were specific things I wanted the students to learn in this course, it wasn’t the kind of seminar where they just kick back each week and listen—rather, they were discussions, typically around 20 minutes long, facilitated by the outside expert.

One thing that struck me about these discussions was how fun they were, and how various interesting and unexpected things came up in our conversations.

And that made me think—Hey, we should do a podcast! I can be the host and have conversations with these guests, one at a time, and then release these as (free) 15-minute podcasts. How awesome! The only challenge is to keep them lively. Without a roomful of students, a recorded conversation between two people could get stilted.

Also we need a title for the series. “Statistics Podcast” is pretty boring. “Statcast”? The topics we’ve had so far have been focused on statistical communication, but once we go through that, we could cover other statistical areas as well.

And then there’s the technical details: how to actually set up a podcast, also maybe it needs to be edited a bit?

So here’s what I’m needing from you:

– A title for the podcast series.

– Advice on production and distribution.

Our starting lineup

Here are some of the visitors we’ve had in our course so far. I’d plan to start with them, since I’ve already had good conversations with them.

I list the topic corresponding to each visitor, but the actual conversations ranged widely.

    Thomas Basbøll, Writing Consultant, Copenhagen Business School (topic: Telling a story)

    Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners (topic: Principles of statistical graphics, although the actual discussion ended up being all about educational testing, because that’s what the students’ questions were about)

    Deborah Nolan, Professor of Statistics, University of California (topic: Student activities and projects)

    Jessica Watkins, Department of Education, Tufts University (topic: Facilitating class participation)

    Justin Phillips, Professor of Political Science, Columbia University (topic: Classroom teaching)

    Beth Chance, Professor of Statistics, California Polytechnic State University (topic: Preparing and evaluating a class)

    Amanda Cox, Graphics Editor, New York Times (topic: Graphing data: what to do)

    Jessica Hullman, Assistant Professor of Information Visualization, University of Washington (topic: Graphing data: what works)

    Kaiser Fung, Senior Data Advisor, Vimeo (topic: Statistical reporting)

    Elke Weber, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Eric Johnson, Professor of Psychology and Management, Columbia University (topic: Communicating variation and uncertainty)

    Cynthia Rudin, Associate Professor of Statistics, MIT (topic: Understanding fitted models)

    Kenny Shirley, Principal Inventive Scientist, Statistics Research Department, AT&T Laboratories (topic: Understanding fitted models)

    Tom Wood, Assistant Professor of Political Science, Ohio State University (topic: Displaying fitted models)

    Elizabeth Tipton, Assistant Professor of Applied Statistics, Teachers College, Columbia University (topic: Displaying fitted models)

    Brad Paley, Principal, Digital Image Design Incorporated (topic: Giving a presentation)

    Jared Lander, statistical consultant and author of R for Everyone (topic: Teaching in a non-academic environment)

    Jonah Gabry, Researcher, Department of Statistics, Department of Political Science, and Population Research Center, Columbia University (topic: Dynamic graphics)

    Martin Wattenberg, Data Visualization, Google (topic: Dynamic graphics)

    Hadley Wickham, Chief Scientist, RStudio (topic: Dynamic graphics)

    David Rindskopf, Professor of Educational Psychology, City University of New York (topic: Consulting)

    Shira Mitchell, Postdoctoral Researcher, Earth Institute, Columbia University (topic: Collaboration)

    Katherine Button, Lecturer, Department of Psychology, University of Bath (topic: Communication and its impact on science)

    Jenny Davidson, Professor of English, Columbia University (topic: Writing for a technical audience)

    Rachel Schutt, Senior Vice President of Data Science, News Corporation (topic: Communication with a non-technical audience)

    Leslie McCall, Professor of Sociology, Northwestern University (topic: Social research and policy)

    Yair Ghitza, Senior Scientist, Catalist (topic: Data processing)

    Bob Carpenter, Research Scientist, Department of Statistics, Columbia University (topic: Programming)

P.S. Lots of suggested titles in comments. My favorite title so far: Learning from Numbers.

P.P.S. I asked Sharad if he could come up with any names for the podcast and he sent me these:

White Noise
In the Noise
The Signal
Random Samples

I’ll have to nix the first suggestion as it’s a bit too accurate a description of the ethnic composition of myself and our guest stars. The third suggestion is pretty good but it’s almost a bit too slick. After all, we’re not the signal, we’re just a signal. I’m still leaning toward Learning from Numbers.

The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Alain Content writes:

I am a psycholinguist who teaches statistics (and also sometimes publishes in Psych Sci).

I am writing because as I am preparing for some future lessons, I fall back on a very basic question which has been worrying me for some time, related to the reasoning underlying NHST [null hypothesis significance testing].

Put simply, what is the rational justification for considering the probability of the test statistic and any more extreme value of it?

I know of course that the point value probability cannot be used, but I can’t figure out the reasoning behind the choice of any more extreme value. I mean, wouldn’t it be as valid (or invalid) to consider for instance the probability of some (conventionally) fixed interval around the observed value? (My null hypothesis is that there is no difference between Belgians and Americans in chocolate consumption. If I find a mean difference of, say, 3 kg, I decide to reject H0 based on the probability of [2.9-3.1].)

My reply: There are 2 things going on:

1. The logic of NHST. To get this out of the way, I don’t like it. As we’ve discussed from time to time, NHST is all about rejecting straw-man hypothesis B and then using this to claim support for the researcher’s desired hypothesis A. The trouble is that both models are false, and typically the desired hypothesis A is not even clearly specified.

In your example, the true answer is easy: different people consume different amounts of chocolate. And the averages for two countries will differ. The average also differs from year to year, so a more relevant question might be how large are the differences between countries, compared to the variation over time, the variation across states within a country, the variation across age groups, etc.

2. The use of tail-area probabilities as a measure of model fit. This has been controversial. I don’t have much to say on this. On one hand, if a p-value is extreme, it does seem like we learn something about model fit. If you’re seeing p=.00001, that does seem notable. On the other hand, maybe there are other ways to see this sort of lack of fit. In my 1996 paper with Meng and Stern on posterior predictive checks, we did some p-values, but now I’m much more likely to perform a graphical model check.
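For readers who haven't seen a posterior predictive check, here is a minimal sketch in the spirit of that 1996 approach (the data, the known-variance normal model, and the choice of test statistic are all illustrative assumptions, not the paper's example): simulate replicate datasets from the posterior and compare a test statistic to its observed value.

```python
# A minimal posterior predictive check: draw mu from its posterior,
# simulate replicate datasets, and compare a tail-sensitive statistic.
# Illustrative setup: model is Normal(mu, 1) with a flat prior on mu,
# while the "observed" data are simulated from a heavier-tailed t(3).
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_t(df=3, size=50)
n = len(y)

# With a flat prior, the posterior for mu is Normal(ybar, 1/sqrt(n))
mu_draws = rng.normal(y.mean(), 1 / np.sqrt(n), size=4000)

# Test statistic: the largest absolute observation
t_obs = np.abs(y).max()
t_rep = np.array([np.abs(rng.normal(m, 1, n)).max() for m in mu_draws])

ppp = (t_rep >= t_obs).mean()   # posterior predictive p-value
print(round(ppp, 3))
```

A graphical version of the same check would simply plot a histogram of `t_rep` with a vertical line at `t_obs`, which is typically more informative than the single number.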

In any case, you really can’t use p-values to compare model fits or to compare datasets. This example illustrates the failure of the common approach of using p-value as a data summary.

My main message is to use model checks (tail area probabilities, graphical diagnostics, whatever) to probe flaws in the model you want to fit—not as a way to reject null hypotheses.

“Chatting with the Tea Party”

I got an email last month offering two free tickets to the preview of a new play, Chatting with the Tea Party, described as “a documentary-style play about a New York playwright’s year attending Tea Party meetings around the country and interviewing local leaders. Nothing the Tea Party people in the play say has been made up.”

I asked if they could give me 3 tickets and they did, and I went with two family members.

I won’t be spoiling much if I share the plot: self-described liberal playwright talks with liberal friends during the rise of the conservative Tea Party movement, realizes he doesn’t know any Tea Party activists himself, so during his random travels around the country (as a playwright, he’s always going to some performance or workshop or another), he arranges meetings with Tea Party activists in different places. Some of these people say reasonable things, some of them say rude things, many have interesting personal stories. No issue attitudes get changed, but issues get explored.

The play, directed by Lynnette Barkley, had four actors; one played the role of the playwright, the others did the voices of the people he met. They did the different voices pretty well: each time it seemed like a new person. If Anna Deavere Smith or Mel Blanc had been there to do all the voices, it would’ve been amazing, but these actors did the job. And the playwright, Rich Orloff, did a good job compressing so many hours of interviews to yield some intense conversations.

There were two things that struck me while watching the play.

First, it would’ve been also interesting to see the converse: a conservative counterpart of the reasonable, pragmatic Orloff interviewing liberal activists. I could imagine a play that cut back and forth between the two sets of scenes. The play did have some scenes with Orloff’s know-nothing liberal NYC friends, but I think it would’ve worked better for them to be confronting an actual conservative, rather than just standing there expressing their biases.

Second, I was struck by how different the concerns of 2009-2010 were, compared to the live political issues now.  Back then, it was all about the national debt: 3 trillion dollars were being released into the economy, and everything was gonna crash.  Now the concerns seem more to do with national security and various long-term economic issues, but nothing like this spending-is-out-of-control thing.  I guess this makes sense: with a Republican-controlled Congress, there’s less concern that spending will get out of control.  In any case, the central issues have changed.  There’s still polarization, though, and still space for literary explorations of the topic.  As a person who has great difficulty remembering exact dialogue myself, I’m impressed with a play that can capture all these different voices.

Where the fat people at?


Pearly Dhingra points me to this article, “The Geographic Distribution of Obesity in the US and the Potential Regional Differences in Misreporting of Obesity,” by Anh Le, Suzanne Judd, David Allison, Reena Oza-Frank, Olivia Affuso, Monika Safford, Virginia Howard, and George Howard, who write:

Data from BRFSS [the behavioral risk factor surveillance system] suggest that the highest prevalence of obesity is in the East South Central Census division; however, direct measures suggest higher prevalence in the West North Central and East North Central Census divisions. The regions’ relative ranking of obesity prevalence differs substantially between self-reported and directly measured height and weight.

And they conclude:

Geographic patterns in the prevalence of obesity based on self-reported height and weight may be misleading, and have implications for current policy proposals.

Interesting. Measurement error is important.

But, hey, what’s with this graph:


Who made this monstrosity? Ed Wegman?

I can’t imagine a clearer case for a scatterplot. Ummmm, OK, here it is:


Hmmm, I don’t see the claimed pattern between region of the country and discrepancy between the measures.

Maybe things will be clearer if we remove outlying Massachusetts:


Maryland’s a judgment call; I count my home state as northeastern but the cited report places it in the south. In any case, I think the scatterplot is about a zillion times clearer than the parallel coordinates plot (which, among other things, throws away information by reducing all the numbers to ranks).

P.S. Chris in comments suggests redoing the graphs with same scale on the two axes. Here they are:


It’s a tough call. These new graphs make the differences between the two assessments more clear, but then it’s harder to compare the regions. It’s fine to show both, I guess.
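For anyone who wants to reproduce the equal-axes version, here's a minimal matplotlib sketch. The numbers are invented stand-ins for the state-level estimates (I don't reproduce the actual BRFSS data here), and the axis labels are my guesses at the quantities being plotted:

```python
# Scatterplot with the same scale on both axes and a y = x reference
# line, as suggested in comments. Data are made up for illustration.
import matplotlib
matplotlib.use("Agg")   # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
self_report = rng.uniform(22, 33, 48)            # hypothetical % obese, self-reported
measured = self_report + rng.uniform(1, 6, 48)   # direct measures tend to run higher

fig, ax = plt.subplots()
ax.scatter(self_report, measured)

# Force identical limits on both axes and draw the y = x line,
# so the vertical distance from the line shows the discrepancy.
lims = [min(self_report.min(), measured.min()) - 1,
        max(self_report.max(), measured.max()) + 1]
ax.plot(lims, lims)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel("Self-reported obesity (%)")
ax.set_ylabel("Directly measured obesity (%)")
fig.savefig("obesity_scatter.png")
```

The design choice is exactly the trade-off described above: equal axes make the self-report-versus-measurement gap visible at a glance, at the cost of compressing the between-region comparisons.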

Hey—go to Iceland and work on glaciers!


Egil Ferkingstad and Birgir Hrafnkelsson write:

We have an exciting PhD position here at the University of Iceland on developing Bayesian hierarchical spatio-temporal models for the field of glaciology. Havard Rue at NTNU, Trondheim and Chris Wikle at the University of Missouri will also be part of the project.

The Department of Mathematics at the University of Iceland (UI) seeks applicants for a fully funded 3-year PhD position for the project Statistical Models for Glaciology.

The student will develop Bayesian hierarchical spatio-temporal models for the field of glaciology, working with a consortium of experts at the University of Iceland, the University of Missouri and the Norwegian University of Science and Technology. The key people in the consortium are Prof. Birgir Hrafnkelsson at UI, Prof. Chris Wikle, and Prof. Håvard Rue, experts in spatial statistics and Bayesian computation. Another key person is Prof. Gudfinna Adalgeirsdottir at UI, an expert in glaciology. The Glaciology group at UI possesses extensive data and knowledge about the Icelandic glaciers.

The application deadline is February 29, 2016.

Detailed project description:

Job ad with information on how to apply:

It’s a good day for cold research positions.

Summer internship positions for undergraduate students with Aki

There are a couple of cool summer internship positions for undergraduate students (BSc level) in the Probabilistic Machine Learning group at Aalto (Finland) with me (Aki) and Samuel Kaski. Possible research topics are related to Bayesian inference, machine learning, Stan, disease risk prediction, personalised medicine, computational biology, contextual information retrieval, information visualization, etc. Application deadline: 18 February. See more here.

Stunning breakthrough: Using Stan to map cancer screening!


Paul Alper points me to this article, Breast Cancer Screening, Incidence, and Mortality Across US Counties, by Charles Harding, Francesco Pompei, Dmitriy Burmistrov, Gilbert Welch, Rediet Abebe, and Richard Wilson.

Their substantive conclusion is there’s too much screening going on, but here I want to focus on their statistical methods:

Spline methods were used to model smooth, curving associations between screening and cancer rates. We believed it would be inappropriate to assume that associations were linear, especially since nonlinear associations often arise in ecological data. In detail, univariate thin-plate regression splines (negative binomial model to accommodate overdispersion, log link, and person-years as offset) were specified in the framework of generalized additive models and fitted via restricted maximum likelihood, as implemented in the mgcv package in R. . . .

To summarize cross-sectional changes in incidence and mortality, we evaluated the mean rate differences and geometric mean relative rates (RRs) associated with a 10–percentage point increase in the extent of screening across the range of data (39%-78% screening). The 95% CIs were calculated by directly simulating from the posterior distribution of the model coefficients (50 000 replicates conditional on smoothing parameters).

Can someone get these data and re-fit in Stan? I have no reason to think the published analysis by Harding et al. has any problems; I just think it would make sense to do it all in Stan, as this would be a cleaner workflow and easier to apply to new problems.
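The CI procedure the authors describe, simulating from the posterior of the coefficients, is generic enough to sketch on its own. In this sketch the coefficient vector and its covariance matrix are invented stand-ins (the real ones would come from the fitted mgcv model), and the "10-percentage-point screening effect" is just an example of a derived quantity:

```python
# Simulation-based CIs: draw many replicates of the model coefficients
# from their approximate posterior and read off percentiles of any
# derived quantity. Coefficients and covariance here are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
beta_hat = np.array([0.10, -0.02])       # hypothetical fitted coefficients
cov = np.array([[0.0004, 0.0001],
                [0.0001, 0.0002]])       # hypothetical coefficient covariance

# 50,000 replicates, as in the quoted methods section
draws = rng.multivariate_normal(beta_hat, cov, size=50_000)

# Derived quantity: e.g., effect of a 10-point increase in screening
effect = 10 * draws[:, 1]
lo, hi = np.percentile(effect, [2.5, 97.5])
print(lo, hi)   # 95% interval for the derived effect
```

One appeal of doing this in Stan instead is that the draws come from the actual posterior rather than a normal approximation around the point estimate, and the same workflow extends to any derived quantity.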

P.S. See comments for some discussions by Charles Harding, author of the study in question.

When does peer review make no damn sense?

Disclaimer: This post is not peer reviewed in the traditional sense of being vetted for publication by three people with backgrounds similar to mine. Instead, thousands of commenters, many of whom are not my peers—in the useful sense that, not being my peers, your perspectives are different from mine, and you might catch big conceptual errors or omissions that I never even noticed—have the opportunity to point out errors and gaps in my reasoning, to ask questions, and to draw out various implications of what I wrote. Not “peer reviewed”; actually peer reviewed and more; better than peer reviewed.


Last week we discussed Simmons and Simonsohn’s survey of some of the literature on the so-called power pose, where they wrote:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.


Even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it.

The first response of one of the power-pose researchers was:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. . . . I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

This response was pleasant enough but I found it unsatisfactory because it did not even consider the possibility that her original finding was spurious.

After Kaiser Fung and I publicized Simmons and Simonsohn’s work in Slate, the power-pose author responded more forcefully:

The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture.

Amy Cuddy, the author of this response, did not at any place explain how Simmons and Simonsohn were “flat-out wrong,” nor did she list even one of the mistakes with which their analyses were “riddled.”

Peer review

The part of the above quote I want to focus on, though, is the phrase “non-peer-reviewed.”  Peer-reviewed papers have errors, of course (does the name “Daryl Bem” ring a bell?).  Two of my own published peer-reviewed articles had errors so severe as to destroy their conclusions!  But that’s ok, nobody’s claiming perfection.  The claim, I think, is that peer-reviewed articles are much less likely to contain errors, as compared to non-peer-reviewed articles (or non-peer-reviewed blog posts).  And the claim behind that, I think, is that peer review is likely to catch errors.

And this brings up the question I want to address today: What sort of errors can we expect peer review to catch?

I’m well placed to answer this question as I’ve published hundreds of peer-reviewed papers and written thousands of referee reports for journals. And of course I’ve also done a bit of post-publication review in recent years.

To jump to the punch line: the problem with peer review is with the peers.

In short, if an entire group of peers has a misconception, peer review can simply perpetuate error. We’ve seen this a lot in recent years, for example that paper on ovulation and voting was reviewed by peers who didn’t realize the implausibility of 20-percentage-point vote swings during the campaign, peers who also didn’t know about the garden of forking paths. That paper on beauty and sex ratio was reviewed by peers who didn’t know much about the determinants of sex ratio and didn’t know much about the difficulties of estimating tiny effects from small sample sizes.

OK, let’s step back for a minute.  What is peer review good for?  Peer reviewers can catch typos, they can catch certain logical flaws in an argument, they can notice the absence of references to the relevant literature—that is, the literature that the peers are familiar with.  That’s why the peer reviewers for that psychology paper on ovulation and voting didn’t catch the error of claiming that days 6-14 were the most fertile days of the cycle: these reviewers were peers of the people who made the mistake in the first place!

Peer review has its place. But peer reviewers have blind spots. If you want to really review a paper, you need peer reviewers who can tell you if you’re missing something within the literature—and you need outside reviewers who can rescue you from groupthink. If you’re writing a paper on himmicanes and hurricanes, you want a peer reviewer who can connect you to other literature on psychological biases, and you also want an outside reviewer—someone without a personal and intellectual stake in you being right—who can point out all the flaws in your analysis and can maybe talk you out of trying to publish it.

Peer review is subject to groupthink, and peer review is subject to incentives to publishing things that the reviewers are already working on.

This is not to say that a peer-reviewed paper is necessarily bad—I stand by over 99% of my own peer-reviewed publications!—rather, my point is that there are circumstances in which peer review doesn’t give you much.

To return to the example of power pose: There are lots of papers in this literature and there’s a group of scientists who believe that power pose is real, that it’s detectable, and indeed that it can help millions of people. There’s also a group of scientists who believe that any effects of power pose are small, highly variable, and not detectable by the methods used in the leading papers in this literature.

Fine. Scientific disagreements exist. Replication studies have been performed on various power-pose experiments (indeed, it’s the null result from one of these replications that got this discussion going), and the debate can continue.

But, my point here is that peer review doesn’t get you much.  The peers of the power-pose researchers are . . . other power-pose researchers.  Or researchers on embodied cognition, or on other debatable claims in experimental psychology.  Or maybe other scientists who don’t work in this area but have heard good things about it and want to be supportive of this work.

And sometimes a paper will get unsupportive reviews. The peer review process is no guarantee. But then authors can try again until they get those three magic positive reviews. And peer review—review by true peers of the authors—can be a problem, if the reviewers are trapped in the same set of misconceptions, the same wrong framework.

To put it another way, peer review is conditional. Papers in the Journal of Freudian Studies will give you a good sense of what Freudians believe, papers in the Journal of Marxian Studies will give you a good sense of what Marxians believe, and so forth. This can serve a useful role. If you’re already working in one of these frameworks, or if you’re interested in how these fields operate, it can make sense to get the inside view. I’ve published (and reviewed papers for) the journal Bayesian Analysis. If you’re anti-Bayesian (not so many of these anymore), you’ll probably think all these papers are a crock of poop and you can ignore them, and that’s fine.

(Parts of) the journals Psychological Science and PPNAS have been the house organs for a certain variety of social psychology that a lot of people (not just me!) don’t really trust. Publication in these journals is conditional on the peers who believe the following equation:

“p less than .05” + a plausible-sounding theory = science.

Lots of papers in recent years by Uri Simonsohn, Brian Nosek, John Ioannidis, Katherine Button, etc etc etc., have explored why the above equation is incorrect.

But there are some peers that haven’t got the message yet. Not that they would endorse the above statement when written as crudely as in that equation, but I think this is how they’re operating.

And, perhaps more to the point, many of the papers being discussed are several years or even decades old, dating back to a time when almost nobody (myself included) realized how wrong the above equation is.

Back to power pose

And now back to the power pose paper by Carney et al. It has many garden-of-forking-paths issues (see here for a few of them). Or, as Simonsohn would say, many researcher degrees of freedom.

But this paper was published in 2010! Who knew about the garden of forking paths in 2010? Not the peers of the authors of this paper. Maybe not me either, had it been sent to me to review.

What we really needed (and, luckily, we can get) is post-publication review: not peer reviews, but outside reviews, in this case reviews by people who are outside of the original paper both in research area and in time.

And also this, from another blog comment:

It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble, the reported chi-square is 3.86, which has an associated p-value of .04945.

Of course, this reported chi-square value does not seem to match the data because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with this data is 3.7667 and not 3.86 as reported in the paper. The associated p-value is .052 – not less than .05.

You can’t expect peer reviewers to check these sorts of calculations—it’s not like you could require authors to supply their data and an R or Stata script to replicate the analyses, ha ha ha. The real problem is that the peer reviewers were sitting there, ready to wave past the finish line a result with p less than .05, which provides an obvious incentive for the authors to get p less than .05, one way or another.
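That said, the commenter's recalculation takes only a few lines to reproduce. Here's a quick sketch (my code, not the commenter's or the authors'), using the cell counts quoted above and the Pearson statistic without continuity correction:

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p

# high power-pose: 19 took the gamble, 3 did not; low power-pose: 12 did, 8 did not
stat, p = pearson_chi2_2x2(19, 3, 12, 8)
print(round(stat, 4), round(p, 3))  # 3.7667 0.052 -- not less than .05
```

This matches the commenter's 3.7667 and p = .052, not the 3.86 reported in the paper.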

Commenters also pointed out an earlier paper by one of the same authors, this time on stereotypes of the elderly, from 2005, that had a bunch more garden-of-forking-paths issues and also misreported two t statistics: the actual values were something like 1.79 and 3.34; the reported values were 5.03 and 11.14! Again, you can’t expect peer reviewers to catch these problems (nobody was thinking about forking paths in 2005, and who’d think to recalculate a t statistic?), but outsiders can find them, and did.

At this point one might say that this doesn’t matter, that the weight of the evidence, one way or another, can’t depend on whether a particular comparison in one paper was or was not statistically significant—but if you really believe this, what does it say about the value of the peer-reviewed publication?

Again, I’m not saying that peer review is useless. In particular, peers of the authors should be able to have a good sense of how the storytelling and theorizing in the article fit in with the rest of the literature. Just don’t expect peers to do any assessment of the evidence.

Linking as peer review

Now let’s consider the Simmons and Simonsohn blog post. It’s not peer reviewed—except it kinda is! Kaiser Fung and I chose to cite Simmons and Simonsohn in our article. We peer reviewed the Simmons and Simonsohn post.

This is not to say that Kaiser and I are certain that Simmons and Simonsohn made no mistakes in that post; peer review never claims that sort of perfection.

But I’d argue that our willingness to cite Simmons and Simonsohn is a stronger peer review than whatever was done for those two articles cited above. I say this not just because those papers had demonstrable errors which affect their conclusions (and, yes, in the argot of psychology papers, if a p-value shifts from one side of .05 to the other, it does affect the conclusions).

I say this also because of the process. When Kaiser and I cite Simmons and Simonsohn in the way that we do, we’re putting a little bit of our reputation on the line. If Simmons and Simonsohn made consequential errors—and, hey, maybe they did, I didn’t check their math, any more than the peer reviewers of the power pose papers checked their math—that rebounds negatively on us, that we trusted something untrustworthy. In contrast, the peer reviewers of those two papers are anonymous. The peer review that they did was much less costly, reputationally speaking, than ours. We have skin in the game, they do not.

Beyond this, Simmons and Simonsohn say exactly what they did, so you can work it out yourself. I trust this more than the opinions of 3 peers of the authors in 2010, or 3 other peers in 2005.


Peer review can serve some useful purposes. But to the extent the reviewers are actually peers of the authors, they can easily have the same blind spots. I think outside review can serve a useful purpose as well.

If the authors of many of these PPNAS or Psychological Science-type papers really don’t know what they’re doing (as seems to be the case), then it’s no surprise that peer review will fail. They’re part of a whole peer group that doesn’t understand statistics. So, from that perspective, perhaps we should trust “peer review” less than we should trust “outside review.”

I am hoping that peer review in this area will improve, given the widespread discussion of researcher degrees of freedom and garden of forking paths. Even so, though, we’ll continue to have a “legacy” problem of previously published papers with all sorts of problems, up to and including t statistics misreported by factors of 3. Perhaps we’ll have to speak of “post-2015 peer-reviewed articles” and “pre-2015 peer-reviewed articles” as different things?

On deck this week

Mon: When does peer review make no damn sense?

Tues: Stunning breakthrough: Using Stan to map cancer screening!

Wed: Where the fat people at?

Thurs: The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Fri: What’s the difference between randomness and uncertainty?

Sat: You’ll never guess what I say when I have nothing to say

Sun: I refuse to blog about this one

I don’t know about you, but I love these blog titles. Each week I put together this “on deck” post and I get interested all over again in these topics. I wrote most of these so many months ago, I have no idea what’s in them. I’m looking forward to these posts almost as much as you are!

What a great way to start the work week.

Ted Cruz angling for a position in the Stanford poli sci department

In an amusing alignment of political and academic scandals, presidential candidate Ted Cruz was blasted for sending prospective voters in the Iowa Caucus this misleading mailer:


Which reminds me of the uproar two years ago when a couple of Stanford political science professors sent prospective Montana voters this misleading mailer:


I don’t know which is worse: having a “voting violation” in Iowa or being almost as far left as Barack Obama in Montana.

There is well-known research in political science suggesting that shaming people can motivate them to come out and vote, so I can understand how Cruz can describe this sort of tactic as “routine.”

It’s interesting, though: In 2014, some political scientists got into trouble by using campaign-style tactics in a nonpartisan election (and also for misleading potential voters by sending them material with the Montana state seal). In 2016, a political candidate is getting into trouble by using political-science-research tactics in a partisan election (and also for misleading potential voters with a “VOTING VIOLATION” note).

What’s the difference between Ted Cruz and a Stanford political scientist?

Some people wrote to me questioning the link I’m drawing above between Cruz and the Stanford political scientists. So let me emphasize that I know of no connections here. I don’t even know if Cruz has any political scientists on his staff, and I’m certainly not trying to suggest that the Stanford profs in question are working for Cruz or for any other presidential candidate. I have no idea. Nor would I think it a problem if they are. I was merely drawing attention to the similarities between Cruz’s item and the Montana mailer from a couple years back.

I do think what Cruz did is comparable to what the political scientists did.

There are some differences:

1. Different goals: Cruz wants to win an election, the political scientists wanted to do research.

2. Different time frames: Cruz is in a hurry and got sloppy, the political scientists had more time and could be more careful with the information on their mailers.

But I see two big similarities:

1. Research-based manipulation of voters: Cruz is working off of the effects of social pressure on turnout, the political scientists were working off the effects of perceived ideology on turnout.

2. Misleading information: Cruz is implying that people have some sort of obligation to vote, the political scientists were implying that their mailer was coming from the State of Montana.

Postdoc opportunity with Sophia Rabe-Hesketh and me in Berkeley!

Sophia writes:

Mark Wilson, Zach Pardos and I are looking for a postdoc to work with us on a range of projects related to educational assessment and statistical modeling, such as Bayesian modeling in Stan (joint with Andrew Gelman).

See here for more details.

We will accept applications until February 26.

The position is for 15 months, starting this Spring. To be eligible, applicants must be U.S. citizens or permanent residents.

Empirical violation of Arrow’s theorem!


Regular blog readers know about Arrow’s theorem, which is that any result can be published no more than five times.

Well . . . I happened to be checking out Retraction Watch the other day and came across this:

“Exactly the same clinical study” published six times

Here’s the retraction notice in the journal Inflammation:

This article has been retracted at the request of the Editor-in-Chief.

The authors have published results from exactly the same clinical study and patient population in 6 separate articles, without referencing the publications in any of the later articles:

1. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Olmesartan/amlodipine combination versus olmesartan or amlodipine monotherapies on blood pressure and insulin resistance in a sample of hypertensive patients. Clinical and Experimental Hypertension 35: 301–307. doi:10.3109/10641963.2012.721841.

2. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Effects of an olmesartan/amlodipine fixed dose on blood pressure control, some adipocytokines and interleukins levels compared with olmesartan or amlodipine monotherapies. Journal of Clinical Pharmacy and Therapeutics 38: 48–55. doi:10.1111/jcpt.12021.

3. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Variation of some inflammatory markers in hypertensive patients after 1 year of olmesartan/amlodipine single-pill combination compared with olmesartan or amlodipine monotherapies. Journal of the American Society of Hypertension 7: 32–39. doi:10.1016/j.jash.2012.11.006.

4. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2013. Evaluation of safety and efficacy of a fixed olmesartan/amlodipine combination therapy compared to single monotherapies. Expert Opinion on Drug Safety 12: 621–629. doi:10.1517/14740338.2013.816674.

5. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2014. Different aspects of sartan + calcium antagonist association compared to the single therapy on inflammation and metabolic parameters in hypertensive patients. Inflammation 37: 154–162. doi:10.1007/s10753-013-9724-x.

6. Derosa, G., Cicero, A.F.G., Carbone, A., Querci, F., Fogari, E., D’Angelo, A., Maffioli, P. 2014. Results from a 12 months, randomized, clinical trial comparing an olmesartan/amlodipine single pill combination to olmesartan and amlodipine monotherapies on blood pressure and inflammation. European Journal of Pharmaceutical Sciences 51: 26–33. doi:10.1016/j.ejps.2013.08.031.

In addition, the article in Inflammation contains results published especially in articles 2 and 6, which is the main reason for retraction of the article in Inflammation.

The publisher apologizes for the inconvenience caused.

From my perspective, though, it’s all worth it to see a counterexample to a longstanding theorem. Bruno Frey must be soooooo jealous right now.

P.S. I don’t think it’s so horrible to publish similar material in different places. Not everyone reads every article and so it can be good to reach different audiences. But if you have multiple versions of an article, you should make that clear. Otherwise you’re poisoning the meta-analytic well.

TOP SECRET: Newly declassified documents on evaluating models based on predictive accuracy

We recently had an email discussion among the Stan team regarding the use of predictive accuracy in evaluating computing algorithms. I thought this could be of general interest so I’m sharing it here.

It started when Bob said he’d been at a meeting on probabilistic programming where there was confusion about evaluation. In particular, some of the people at the meeting had the naive view that you could just compare everything on cross-validated proportion-predicted-correct for binary data.

But this won’t work, for three reasons:

1. With binary data, cross-validation is noisy. Model B can be much better than model A but the difference might barely show up in the empirical cross-validation, even for a large data set. Wei Wang and I discuss that point in our article, Difficulty of selecting among multilevel models using predictive accuracy.

2. 0-1 loss is not in general a good measure. You can see this by supposing you’re predicting a rare disease. Upping the estimated probability from 1 in a million to 1 in a thousand will have zero effect on your 0-1 loss (your best point prediction is 0 in either case) but it can be a big real-world improvement.

3. And, of course, a corpus is just a corpus. What predicts well in one corpus might not generalize. That’s one reason we like to understand our predictive models if possible.
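Point 2 can be made concrete in a few lines. Here's a toy sketch (the probabilities are invented for illustration), scoring one patient who actually has the rare disease under two models with very different estimated risks:

```python
import math

def zero_one_loss(p, y):
    """Loss for the point prediction 'predict 1 iff p >= 0.5'."""
    return int((p >= 0.5) != bool(y))

def log_loss(p, y):
    """Negative log predictive probability of the observed outcome y."""
    return -math.log(p) if y else -math.log(1 - p)

# one patient who actually has the disease (y = 1), under two models:
# estimated risk 1 in a million vs. 1 in a thousand
for p in (1e-6, 1e-3):
    print(zero_one_loss(p, 1), round(log_loss(p, 1), 1))
# prints: 1 13.8, then 1 6.9
```

The 0-1 loss is 1 in both cases (the point prediction is 0 either way), while the log loss rewards the thousandfold improvement in estimated risk.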

Bob in particular felt strongly about point 1 above. He wrote:

Given that everyone (except maybe those SVM folks) is doing *probabilistic* programming, why not use log loss? That’s the metric that most of the Kaggle competitions moved to. It tests how well calibrated the probability statements of a model are, in a way that 0/1 loss, squared error, and ROC curve metrics like mean precision don’t.

My own story dealing with this involved a machine learning researcher trying to predict industrial failures who built a logistic regression where the highest predicted probability of a component failure was 0.2 or so. They were confused because the model didn’t seem to predict any failures at all, which seemed wrong. That’s just a failure to think in terms of expectations (20 parts with a 20% chance of failure each would lead to 4 expected failures). I also tried explaining that the model may be well calibrated and there may not be a part that has more than a 20% chance of failure. But they wound up doing what PPAML’s about to do for the image tagging task, namely compute some kind of ROC curve evaluation based on varying thresholds, which of course doesn’t measure how well calibrated the probabilities are, because it’s only sensitive to ranking.

Tom Dietterich concurred:

Regarding holdout likelihood, yes, this is an excellent suggestion. We have evaluated on hold-out likelihood on some of our previous challenge problems. In CP6, we focused on the other metrics (mAP and balanced error rate) because that is what the competing “machine learning” methods employed.

Within the machine learning/computer vision/natural language processing communities, there is a widespread belief that fitting to optimize metrics related to the specific decision problem in the application is a superior approach. It would be interesting to study that question more deeply.

To which Bob elaborated:

I completely agree, which is why I don’t like things like mean average precision (MAP), balanced 0/1 loss, and balanced F measure, none of which relate to any relevant decision problem.

It’s also why I don’t like 0/1 loss (either straight up or through balanced F measures, macro-averaged F measure, etc.), because that’s never the operating point anyone wants. At least in 10 years working in industrial machine learning, it was never the decision problem anyone wanted. Customers almost always had asymmetric utility for false positives and false negatives (think epidemiology, suggesting search spelling corrections, speech recognition in an online dialogue system for airplane reservations, etc.) and wanted to operate at either very high precision (positive predictive accuracy) or very high recall (sensitivity). No customer or application I’ve ever seen, other than writing NIPS or Computational Linguistics papers, ever cared about balanced F measure in a large data set in an application.

The advantage of log loss is that it’s a better measure for generic decision making than area under the curve, because it measures how well calibrated the probabilistic inferences are. Well-calibrated inferences are optimal for all decision operating points, assuming you want to make Bayes-optimal decisions to maximize expected utility while minimizing risk. There’s a ton of theory around this, starting with Berger’s influential book on Bayesian decision theory from the 1980s. And it doesn’t just apply to Bayesian models, though almost everything in the machine learning world can be viewed as an approximate Bayesian technique.

Being Bayesian, the log loss isn’t a simple log likelihood with point estimated parameters plugged in (a popular approximate technique in the machine learning world), but a true posterior predictive estimate as I described in my paper. Of course, if your computing power isn’t up to it, you can approximate with point estimates and log loss by treating your posterior as a delta function around its mean (or even its mode, if you can’t even do variational inference).

Sometimes ranking is enough of a proxy for decision making, which is why mean average precision (truncated to high precision, say average precision at 5) is relevant for some search apps, such as Google’s, and mean average precision (truncated to high recall) is relevant to other search apps, such as that of a biology post-doc or an intelligence analyst. I used to do a lot of work with DoD and DARPA and they were quite keen to have very very high recall — the intelligence analysts really didn’t like systems that had 90% recall so that 10% of the data were missed! At some points, I think they kept us in the evaluations because we provided an exact boolean search that had 100% recall, so they could look at the data, type in a phrase, and be guaranteed to find it. That doesn’t work with first-pass first-best analyses.
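The calibration property Bob leans on, that log loss is a proper scoring rule, is easy to check numerically. A toy sketch (my numbers, purely for illustration): the expected log loss of reporting probability p, when the event actually occurs with probability q, is minimized by reporting p = q.

```python
import math

def expected_log_loss(p, q):
    """Expected log loss for reporting probability p when the event
    actually occurs with probability q."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

q = 0.3  # assumed true event probability
candidates = (0.1, 0.3, 0.5, 0.9)
losses = {p: expected_log_loss(p, q) for p in candidates}
print(min(losses, key=losses.get))  # 0.3: reporting the truth minimizes the loss
```

Under 0/1 loss, by contrast, any reported p on the correct side of 0.5 scores the same, so there is no incentive to report calibrated probabilities.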

I suggested to Bob that he blog this but then we decided it would be more time-efficient for me to do it. The only thing is, then it won’t appear till October.

P.S. Here are Bob’s slides from that conference. He spoke on Stan.

Placebo effect shocker: After reading this, you won’t know what to believe.

Martha Smith writes:

Yesterday’s BBC News Magazine featured an article by William Kremer entitled “Why are placebos getting more effective?”, which looks like a possibility for a blog post discussing how people treat surprising effects. The article asserts that the placebo effect has been increasing, especially in the U.S.

The author asks, “Why? What could it be about Americans that might make them particularly susceptible to the placebo effect?” and then gives lots of speculation. This might be characterized as “I believe the effect is real, so I’ll look for possible causes.”

However, applying the skeptical maxim, “If an effect is surprising, it’s probably false or overestimated,” I quickly came up with two plausible reasons why the “increasing effect of placebos” might be apparent rather than real:

1. The statistical significance filter could operate indirectly: One reason a study comparing treatment with placebo might get through the statistical significance filter is because it happens to have an uncharacteristically small placebo effect. Thus small placebo effects are likely to be overrepresented in published studies; a later replication of such a study is likely to show a larger (but more typical) placebo effect.

2. If early studies are not blinded but later studies are, the earlier studies would be expected to show deflated effects for placebo but inflated effects for treatment.
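Her first point, the significance filter, can be illustrated with a quick simulation. This is a toy sketch with invented numbers (the fixed threshold below is a crude stand-in for a significance test, not a model of any actual trial):

```python
import random
import statistics

random.seed(1)

# hypothetical truth: mean placebo response 0.5, mean drug response 0.7,
# each arm measured with sampling noise in every study
all_placebo, published_placebo = [], []
for _ in range(100_000):
    placebo = random.gauss(0.5, 0.1)
    drug = random.gauss(0.7, 0.1)
    all_placebo.append(placebo)
    if drug - placebo > 0.28:  # only "significant" studies pass the filter
        published_placebo.append(placebo)

print(round(statistics.mean(all_placebo), 2))   # about 0.50
print(round(statistics.mean(published_placebo), 2))  # noticeably below 0.50
```

The filter preferentially selects studies whose placebo arm happened to come out low, so replications of published studies will tend to show a larger (more typical) placebo response, which looks like a placebo effect that is growing over time.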

My reply: There’s something about this placebo thing that just keeps confusing me. So I’ll stay out of this one, except to post the above note to give everyone something to think about today.

One thing I like about hierarchical modeling is that it’s not just about criticism. It’s a way to improve inferences, not just a way to adjust p-values.

In an email exchange regarding the difficulty many researchers have in engaging with statistical criticism (see here for a recent example), a colleague of mine opined:

Nowadays, promotion requires more publications, and in an academic environment, researchers are asked to do more than they can. So many researchers just work like workers in a product line without critical thinking. Quality becomes a tradeoff of quantity.

I replied:

I think that many (maybe not all) researchers are interested in critical thinking, but they don’t always have a good framework for integrating critical thinking into their research. Criticism is, if anything, too easy: once you’ve criticized, what do you do about it (short of “50 shades of gray” self-replication, which really is a lot of work)? One thing I like about hierarchical modeling is that it’s not just about criticism. It’s a way to improve inferences, not just a way to adjust p-values.

The point is that in this way criticism can be a step forward.

When we go through the literature (or even all the papers by a particular author) and list all the different data-coding, data-exclusion, and data-analysis rules that were used (see the comment thread from the above link for a long list of examples of data excluded or included, outcomes treated separately or averaged, variables controlled for or not, different p-value thresholds, etc.), it’s not just about listing multiple comparisons and criticizing p-values (which ultimately only gets you so far, because even correct p-values bear only a very indirect relation to any inferences of interest); it’s also about learning more from data, constructing a fuller model that includes all the possibilities corresponding to the different theories. Or even just recognizing that a particular dataset, with a particular small sample and noisy, variable measurements, is too weak to learn what you want to learn. That can be good to know too: if it’s a topic you really care about, you can devote some effort to more careful measurement, or at least know the limitations of your data. All good—the point is to make the link to reality rather than to try to compute some correct p-value, which has little to do with anything.