
The time-reversal heuristic—a new way to think about a published finding that is followed up by a large, preregistered replication (in context of Amy Cuddy’s claims about power pose)


[Note to busy readers: If you’re sick of power pose, there’s still something of general interest in this post; scroll down to the section on the time-reversal heuristic. I really like that idea.]

Someone pointed me to this discussion on Facebook in which Amy Cuddy expresses displeasure with my recent criticism (with Kaiser Fung) of her claims regarding the “power pose” research of Cuddy, Carney, and Yap (see also this post from yesterday). Here’s Cuddy:

This is sickening and, ironically, such an extreme overreach. First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). So yes, I did respond to the peer-reviewed paper. The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture. I’ve been quiet and polite long enough.

There’s a difference between having your ideas challenged in constructive way, which is how it used in to be in academia, and attacked in a destructive way. My “popularity” is not relevant. I’m tired of being bullied, and yes, that’s what it is. If you could see what goes on behind the scenes, you’d be sickened.

I will respond here but first let me get a couple things out of the way:

1. Just about nobody likes to be criticized. As Kaiser and I noted in our article, Cuddy’s been getting lots of positive press but she’s had some serious criticisms too, and not just from us. Most notably, Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber published a paper last year in which they tried and failed to replicate the results of Cuddy, Carney, and Yap, concluding “we found no significant effect of power posing on hormonal levels or in any of the three behavioral tasks.” Shortly after, the respected psychology researchers Joe Simmons and Uri Simonsohn published on their blog an evaluation and literature review, writing that “either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it” and concluding:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.

OK, so I get this. You work hard on your research, you find something statistically significant, you get it published in a top journal, you want to draw a line under it and move on. For outsiders to go and question your claim . . . that would be like someone arguing a call in last year’s Super Bowl. The game’s over, man! Time to move on.

So I see how Cuddy can find this criticism frustrating, especially given her success with the Ted talk, the CBS story, the book publication, and so forth.

2. Cuddy writes, “If you could see what goes on behind the scenes, you’d be sickened.” That might be so. I have no idea what goes on behind the scenes.

OK, now on to the discussion

The short story to me is that Cuddy, Carney, and Yap found statistical significance in a small sample, non-preregistered study with a flexible hypothesis (that is, a scientific hypothesis that posture could affect performance, which can map on to many many different data patterns). We already know to watch out for such claims, and in this case a large follow-up study by an outside team did not find a positive effect. Meanwhile, Simmons and Simonsohn analyzed some of the published literature on power pose and found it to be consistent with no effect.

At this point, a natural conclusion is that the existing study by Cuddy et al. was too noisy to reveal much of anything about whatever effects there might be of posture on performance.

This is not the only conclusion one might draw, though. Cuddy draws a different conclusion, which is that her study did find a real effect and that the replication by Ranehill et al. was done under different, less favorable conditions, for which the effect disappeared.

This could be. As Kaiser and I wrote, “This is not to say that the power pose effect can’t be real. It could be real and it could go in either direction.” We question on statistical grounds the strength of the evidence offered by Cuddy et al. And there is also the question of whether a lab result in this area, if it were real, would generalize to the real world.

What frustrates me is that Cuddy in all her responses doesn’t seem to even consider the possibility that the statistically significant pattern they found might mean nothing at all, that it might be an artifact of a noisy sample. It’s happened before: remember Daryl Bem? Remember Satoshi Kanazawa? Remember the ovulation-and-voting researchers? The embodied cognition experiment? The 50 shades of gray? It happens all the time! How can Cuddy be so sure it hasn’t happened to her? I’d say this even before the unsuccessful replication from Ranehill et al.

Response to some specific points

“Sickening,” huh? So, according to Cuddy, her publication is so strong it’s worth a book and promotion in NYT, NPR, CBS, TED, etc. But Ranehill et al.’s paper, that somehow has a lower status, I guess because it was published later? So it’s “sickening” for us to express doubt about Cuddy’s claim, but not “sickening” for her to question the relevance of the work by Ranehill et al.? And Simmons and Simonsohn’s blog, that’s no good because it’s a blog, not a peer reviewed publication. Where does this put Daryl Bem’s work on ESP or that “bible code” paper from a couple decades ago? Maybe we shouldn’t be criticizing them, either?

It’s not clear to me how Simmons, Simonsohn, and I are “bullying” Cuddy. Is it bullying to say that we aren’t convinced by her paper? Are Ranehill, Dreber, etc. “bullying” her too, by reporting a non-replication? Or is that not bullying because it’s in a peer-reviewed journal?

When a published researcher such as Cuddy equates “I don’t believe your claims” with “bullying,” that to me is a problem. And, yes, the popularity of Cuddy’s work is indeed relevant. There’s lots of shaky research that gets published every year and we don’t have time to look into all of it. But when something is so popular and is promoted so heavily, then, yes, it’s worth a look.

Also, Cuddy writes that “somehow people missed that they STILL replicated the effects on feelings of power.” But people did not miss this at all! Here’s Simmons and Simonsohn:

In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels. The key point of the TED Talk, that power poses “can significantly change the outcomes of your life”, was not supported.

In any case, it’s amusing that someone who’s based an entire book on an experiment that was not successfully replicated is writing about “extreme overreach.” As I’ve written several times now, I’m open to the possibility that power pose works, but skepticism seems to me to be eminently reasonable, given the evidence currently available.

In the meantime, no, I don’t think that referring to a non-peer-reviewed blog is “the worst form of scientific overreach.” I plan to continue to read and refer to the blog of Simonsohn and his colleagues. I think they do careful work. I don’t agree with everything they write—but, then again, I don’t agree with everything that is published in Psychological Science, either. Simonsohn et al. explain their reasoning carefully and they give their sources.

I have no interest in getting into a fight with Amy Cuddy. She’s making a scientific claim and I don’t think the evidence is as strong as she’s claiming. I’m also interested in how certain media outlets take her claims on faith. That’s all. Nothing sickening, no extreme overreach, just a claim on my part that, once again, a researcher is being misled by the process in which statistical significance, followed by publication in a major journal, is taken as an assurance of truth.

The time-reversal heuristic

One helpful (I think) way to think about this episode is to turn things around. Suppose the Ranehill et al. experiment, with its null finding, had come first. A large study finding no effect. And then Cuddy et al. had run a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any power-pose effect is fragile.

From this point of view, what Cuddy et al.’s research has going for it is that (a) they found statistical significance, (b) their paper was published in a peer-reviewed journal, and (c) their paper came before, rather than after, the Ranehill et al. paper. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take Cuddy et al. as the starting point in our discussion, just because it was published first.
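To see why I keep banging on about noise, here’s a minimal simulation sketch in R. The numbers (effect size, noise level, sample size) are made up for illustration and are not taken from any of the power-pose studies; the point is just what small noisy experiments do when you select on statistical significance.

    # Made-up numbers: a small true effect, lots of person-to-person noise,
    # and a small two-group experiment.
    set.seed(123)
    n_sims    <- 10000
    n_per_arm <- 21    # roughly the scale of a small lab study
    true_diff <- 0.1   # assumed true difference between conditions
    sd_noise  <- 1     # assumed outcome noise

    one_study <- function() {
      treat   <- rnorm(n_per_arm, true_diff, sd_noise)
      control <- rnorm(n_per_arm, 0, sd_noise)
      c(est = mean(treat) - mean(control),
        p   = t.test(treat, control)$p.value)
    }

    sims <- replicate(n_sims, one_study())
    sig  <- sims["p", ] < 0.05

    mean(sig)                                # how often p < .05: rarely
    mean(abs(sims["est", sig])) / true_diff  # exaggeration of the "significant" estimates

With numbers like these, the study rarely reaches p < .05, and when it does, the estimate is several times larger than the assumed true effect. That is the sense in which statistical significance, on its own, doesn’t tell you much.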

What next?

I don’t see any of this changing Cuddy’s mind. And I have no idea what Carney and Yap think of all this; they’re coauthors of the original paper but don’t seem to have come up much in the subsequent discussion. I certainly don’t think of Cuddy as any more of an authority on this topic than are Eva Ranehill, Anna Dreber, etc.

And I’m guessing it would take a lot to shake the certainty expressed on the matter by team TED. But maybe people will think twice when the next such study makes its way through the publicity mill?

And, for those of you who can’t get enough of power pose, I just learned that the journal Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose! Publication is expected in fall 2016. So you can expect some more blogging on this topic in a few months.

The potential power of self-help

What about the customers of power pose, the people who might buy Amy Cuddy’s book, follow its advice, and change their life? Maybe Cuddy’s advice is just fine, in which case I hope it helps lots of people. It’s perfectly reasonable to give solid, useful advice without any direct empirical backing. I give advice all the time without there being any scientific study behind it. I recommend writing this way, and teaching that way, and making this and that sort of graph, typically basing my advice on nothing but a bunch of stories. I’m not the best one to judge whether Cuddy’s advice will be useful for its intended audience. But if it is, that’s great and I wish her book every success. The advice could be useful in any case. Even if power pose has null or even negative effects, the net effect of all the advice in the book, informed by Cuddy’s experiences teaching business students and so forth, could be positive.

As I wrote in a comment in yesterday’s thread, consider a slightly different claim: Before an interview you should act confident; you should fold in upon yourself and be coiled and powerful; you should be secure about yourself and be ready to spring into action. It would be easy to imagine an alternative world in which Cuddy et al. found an opposite effect and wrote all about the Power Pose, except that the Power Pose would be described not as an expansive posture but as coiled strength. We’d be hearing about how our best role model is not cartoon Wonder Woman but rather the Lean In of the modern corporate world. Etc. And, the funny thing is, that might be good advice too! As they say in chess, it’s important to have a plan. It’s not good to have no plan. It’s better to have some plan, any plan, especially if you’re willing to adapt that plan in light of events. So it could well be that either of these power pose books—Cuddy’s actual book, or the alternative book, giving the exact opposite posture advice, which might have been written had the data in the Cuddy, Carney, and Yap paper come out different—could be useful to readers.

So I want to separate three issues: (1) the general scientific claim that some manipulation of posture will have some effects, (2) the specific claim that the particular poses recommended by Cuddy et al. will have the specific effects claimed in their paper, and (3) possible social benefits from Cuddy’s Ted talk and book. Claim (1) is uncontroversial, claim (2) is suspect (both from the failed replication and from consideration of statistical noise in the original study), and item (3) is a different issue entirely, which is why I wouldn’t want to argue with claims that the talk and the book have helped people.

P.P.S. You might also want to take a look at this post by Uri Simonsohn, who goes into detail on a different example of a published and much-cited result from psychology that did not replicate. Long story short: forking paths mean that it’s possible to get statistical significance from noise, and also that you can keep finding confirmation by doing new studies and postulating new interactions to explain whatever you find. When an independent replication fails, it doesn’t necessarily mean that the original study found something and the replication didn’t; it can mean that the original study was capitalizing on noise. Again, consider the time-reversal heuristic: pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find a statistically significant interaction somewhere.

P.P.P.S. More here from Ranehill and Dreber. I don’t know if Cuddy would consider this as bullying. On one hand, it’s a blog comment, so it’s not like it has been subject to the stringent peer review of Psych Science, PPNAS, etc, ha ha; on the other hand, Ranehill and Dreber do point to some published work:

Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary to Carney et al., published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean change in cortisol reported by Carney et al. in the text, and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

From the standpoint of studying hormones and behavior, this is all interesting and potentially important. Or we can just think of this generically as some more forks in the path.

Ted Versus Powerpose and the Moneygoround, Part One


So. I was reading the newspaper the other day and came across a credulous review of the recent book by Amy “Power Pose” Cuddy. The review, by Heather Havrilesky, expressed some overall wariness regarding the self-help genre, but I was disappointed to see no skepticism regarding Cuddy’s scientific claims. And then I did a web search and found a completely credulous CBS News report: “Believe it or not, her studies show that if you stand like a superhero privately before going into a stressful situation, there will actually be hormonal changes in your body chemistry that cause you to be more confident and in-command . . . make no mistake, Cuddy’s work is grounded in science.”

Actually Cuddy’s claims were iffy from the start and suffer a credibility gap, given the failure of a large-scale replication of her key experiment, as discussed a few months ago here under the clever title “Low-power pose” and in careful detail by Joe Simmons and Uri Simonsohn on their blog.

This all inspired me to write, with Kaiser Fung, an article for Slate exploring the mismatch between what one might call external and internal views of science:

– For outsiders, people who read the New York Times or Slate or Malcolm Gladwell or Freakonomics or who tune into Ted talks, science is a string of stunning findings by heroic scientists, daring to think outside the box.

– But when insiders see hyped findings about himmicanes or college men with fat arms or ESP or sex ratios of beautiful parents or wobbly stools or embodied cognition or power pose, we laugh or we sigh (depending on our mood), knowing that one more bit of junk science got through the filter.

This is not to say that none of the effects being talked about are real, just that the studies tend to be too noisy to tell us anything useful, and we know by now the problems that come with this sort of creative, flexible analysis.

The insider-outsider distinction is not always so clear: Daryl Bem and Ellen Langer are Ivy League professors, after all, and even the much-mocked Satoshi Kanazawa teaches at the respected London School of Economics. But all three of these researchers are outsiders when it comes to facing the statistical crisis in science.

Anyway, the focus of our Slate article was the yawning gap, as we put it, between the news media, science celebrities, and publicists on one side, and the general scientific community on the other. Various exceptions aside, it’s my impression that most scientists are a bit embarrassed by headline-grabbing claims on gay genes or ovulation and voting or whatever: We know that Science and Nature and PPNAS sometimes like publishing such papers, and that they can get lots of press, but we don’t take it seriously.

Meanwhile, your ordinary civilian Gladwell-readers can get the impression that these flashy findings are what science is all about.

But . . . then I read the comments on our Slate piece. And what struck me is that nobody came to the defense of the power-pose researchers. But it wasn’t even that. Even more striking was that none of the Slate commenters seemed to take that study seriously in the first place. It wasn’t like: Hey, this is interesting, that much-touted power pose study was in error. It was like: Yeah, what a joke, who’d ever think that that could make sense.

This is good news: Despite all the influence of the New York Times, CBS News, NPR (yes, of course, NPR too), Amy Cuddy’s publisher, and Harvard Business School, still, after all that, 100 Slate readers assume it’s all a scam. That’s good to hear. All the king’s horses etc.

Anyway, I’m not sure what to make of this division between the gullibles and the skeptics. On one side you have the NYT, CBS News, NPR, Science, Nature, PPNAS, Malcolm Gladwell, a major book publisher, the publicity department of the Harvard Business School, and the TED organization (whoever they are). On the other side, Eva Ranehill, Anna Dreber, Chris Chabris, Kaiser Fung, Uri Simonsohn, E. J. Wagenmakers, Arina K. Bones, me, . . . and several dozen random people who write in the comments section of Slate.

P.S. Just to be clear: I don’t think this is a debate about personalities and I’m not trying to personalize this. I’ve never met Amy Cuddy or her coauthors, or, for that matter, Eva Ranehill or any of her coauthors on the paper that reported the non-replication of the power-pose finding. I’ve never met Daryl Bem or Ellen Langer or Satoshi Kanazawa or Malcolm Gladwell either. It’s not about good guys and bad guys. It’s about different experiences and different perspectives. In this case, I was interested to see that these Slate readers had an ambient level of skepticism which actually in this case gave them a clearer perspective than that of NPR editors etc. (I can’t really speak to the sophistication of the Ted talk organizers because maybe they know this is iffy science but they’re hooked on the clicks, I have no idea.)

On deck this week

Mon: Ted Versus Powerpose and the Moneygoround, Part One

Tues: “Null hypothesis” = “A specific random number generator”

Wed: “Why IT Fumbles Analytics Projects”

Thurs: Is a 60% risk reduction really no big deal?

Fri: Placebo effect shocker: After reading this, you won’t know what to believe.

Sat: TOP SECRET: Newly declassified documents on evaluating models based on predictive accuracy

Sun: Empirical violation of Arrow’s theorem!

2 new reasons not to trust published p-values: You won’t believe what this rogue economist has to say.

Political scientist Anselm Rink points me to this paper by economist Alwyn Young which is entitled, “Channelling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results,” and begins,

I [Young] follow R.A. Fisher’s The Design of Experiments, using randomization statistical inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003 regressions in 53 experimental papers drawn from the journals of the American Economic Association. Randomization tests reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect whatsoever. Bootstrap methods support and confirm these results.

This is all fine, I’m supportive of the general point, but it seems to me that the title of this paper is slightly misleading, as the real difference here comes not from the randomization but from the careful treatment of the multiple comparisons problem. All this stuff about randomization and bootstrap is kind of irrelevant. I mean, sure, it’s fine if you have nothing better to do, if that’s what it takes to convince you and you don’t take it too seriously, but that’s not where the real juice is coming from. So maybe Young could take the same paper and replace “randomization tests” and “randomization inference” by “multiple comparisons corrections” throughout.
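For readers who want to see the two ingredients side by side, here’s a minimal sketch in R using simulated data with no true treatment effect (not Young’s data or code): a randomization test for one outcome, and then a standard multiple-comparisons correction across several outcomes, which is where I think the real action is.

    set.seed(42)
    n <- 100
    treatment <- rbinom(n, 1, 0.5)          # randomized treatment assignment
    outcomes  <- replicate(5, rnorm(n))     # five outcomes, all pure noise

    # Randomization (permutation) test for a single outcome: compare the
    # observed difference in means to its distribution under re-randomization.
    diff_means <- function(y, z) mean(y[z == 1]) - mean(y[z == 0])
    perm_test <- function(y, z, n_perm = 5000) {
      observed <- diff_means(y, z)
      permuted <- replicate(n_perm, diff_means(y, sample(z)))
      mean(abs(permuted) >= abs(observed))
    }

    p_values <- apply(outcomes, 2, perm_test, z = treatment)

    # Multiple-comparisons corrections across the five outcomes:
    round(cbind(raw        = p_values,
                bonferroni = p.adjust(p_values, "bonferroni"),
                bh         = p.adjust(p_values, "BH")), 3)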

This graph is so ugly—and you’ll never guess where it appeared

Raghu Parthasarathy writes:

I know you’re sick of seeing / being pointed to awful figures, but this one is an abomination of a sort I’ve never seen before:

[Figure 1 of the Cell paper: a pie chart combined with a word cloud]

It’s a pie chart *and* a word cloud. In an actual research paper! Messy, illegible, and generally pointless.

It’s Figure 1 of this paper (in Cell — super high-impact. Sigh…)

I agree. This is one bad graph.

One quick tip for building trust in missing-data imputations?

Peter Liberman writes:

I’m working on a paper that, in the absence of a single survey that measured the required combination of variables, analyzes data collected by separate, uncoordinated Knowledge Networks surveys in 2003. My co-author (a social psychologist who commissioned one of the surveys) and I obtained from KN unique id numbers for all of the respondents in the surveys, and found that 363 participated in both of the main ones.

The resultant dataset resembles a panel survey with ~90% attrition and sample refreshment. Rather than just analyze the overlap between the samples, to improve statistical power I used some of the wave-nonresponse cases and imputed the missing data using MI. (Using them all results in far too much missing data for an MI imputation model to converge.) I’m not a methodologist and did not use rigorous criteria, much less state-of-the-art ones, in choosing how many and which wave-nonresponse cases to add and in dealing with selection effects (survey acquiescence having some impact on appearing in the overlapping samples).

Might you be able to suggest publications, besides this paper by Gelman, King, and Liu and this paper by Si, Reiter, and Hillygus, that might provide useful guidance for analyzing this type of data? Would you be interested in providing advice on the attached paper, or even collaborating on it, or can you think of someone with the relevant expertise who might be?

The goal would not just be to provide more rigorous analysis of the research questions in this paper, but to provide more methodologically sound direction for other researchers wanting to use this novel type of data. I say “novel” because I have not yet found a previous example of its use, and execs at Gfk/KN and a YouGov/Polimetrix project director told me that nobody has requested such data before. I can imagine researchers often wanting to conduct secondary analysis of variables measured only in separate surveys. Sample size presents an obvious limitation, and my particular study benefitted from surveys that had unusually large original samples (each with Ns >3,000). But usable overlap might be quite common among surveys using specialized sampling frames (e.g., political science surveys being fielded to online respondent panelists from whom political data already has been collected). Given the accumulation of data sitting in online survey companies’ archives, this could represent a significant untapped resource for testing post-hoc hypotheses specific to certain time periods.

My reply: I have no great answers here. I think the problem of building trust in imputations is important, and I’ve written two papers on the topic, one with Kobi Abayomi and Marc Levy, and one with Yu-Sung Su, Jennifer Hill, and Masanao Yajima. But much more needs to be done. Our original plan with our multiple imputation package mi (available on CRAN for use in R) was to include all sorts of diagnostics by default. We do have a few diagnostics in mi (see the above-linked paper by Su et al.) but we have not really integrated them into our workflow.
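To give a flavor of the sort of diagnostic I have in mind, here’s a minimal sketch in plain R rather than through the mi package’s own interface; the data and the deliberately crude regression imputation are made up for illustration.

    set.seed(7)
    n <- 500
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)
    y[sample(n, 150)] <- NA    # knock out 30% of y, at random here

    # A deliberately simple imputation: regression prediction plus noise.
    fit   <- lm(y ~ x)
    miss  <- is.na(y)
    y_imp <- y
    y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss])) +
      rnorm(sum(miss), 0, sigma(fit))

    # Diagnostic: plot observed and imputed values against x on the same
    # scale. If the imputation model is badly off, the imputed points will
    # sit in regions where no observed data live.
    plot(x[!miss], y[!miss], pch = 16, col = "grey50",
         xlab = "x", ylab = "y", main = "Observed (grey) vs. imputed (red)")
    points(x[miss], y_imp[miss], pch = 1, col = "red")

With multiple imputation you’d make this sort of plot for each completed dataset (and for each variable with missingness), which is the kind of routine check we’d like to build into the workflow.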

P.S. In case you’re interested, here’s the abstract to the research paper by Liberman and Linda Skitka:

This paper examines the role of revenge in U.S. public support for the Iraq War. Citizens who mistakenly blamed Iraq for 9/11 felt relatively strongly that it would satisfy a desire for revenge, and such feelings significantly predicted war support after controlling for security incentives, beliefs about the costs of war, and political orientations. But many of those who said Iraq was not involved also expected war would satisfy a desire for revenge, which we interpret as a foreign policy analogue of displaced aggression. This research helps us understand how the Bush Administration was able to bring the nation to war against a country having nothing to do with 9/11, testifies to the roles of emotion and moral motivation in public opinion, and demonstrates the feasibility of utilizing independently conducted online surveys in secondary data analysis.

Kéry and Schaub’s Bayesian Population Analysis Translated to Stan

Hiroki ITÔ (pictured) has done everyone a service in translating to Stan the example models [update: only chapters 3–8, not the whole book; the rest are in the works] from Kéry and Schaub’s Bayesian Population Analysis.

You can find the code in our example-models repository on GitHub.

This greatly expands on the ecological models we previously had available and should make a great jumping-off point for people looking to fit models for ecology. Hiroki did a fantastic job translating everything, and as an added bonus, he included the data and the R code to fit the models as part of the repository.

If anyone else has books they’d like to translate and publish as part of our example models suite, let us know. We’re more than happy to help with the modeling issues and provide feedback.

P.S. Ecologists have the best images! Probably because nature’s a big part of their job—Hiroki ITÔ is a forestry researcher.

If you’re using Stata and you want to do Bayes, you should be using StataStan

Robert Grant, Daniel Furr, Bob Carpenter, and I write:

Stata users have access to two easy-to-use implementations of Bayesian inference: Stata’s native bayesmh function and StataStan, which calls the general Bayesian engine Stan. We compare these on two models that are important for education research: the Rasch model and the hierarchical Rasch model. Stan (as called from Stata) fits a more general range of models than can be fit by bayesmh and is also more scalable, in that it could easily fit models with at least ten times more parameters than could be fit using Stata’s native Bayesian implementation. In addition, Stan runs between two and ten times faster than bayesmh as measured in effective sample size per second: that is, compared to Stan, it takes Stata’s built-in Bayesian engine twice to ten times as long to get inferences with equivalent precision. We attribute Stan’s advantage in flexibility to its general modeling language, and its advantages in scalability and speed to an efficient sampling algorithm: Hamiltonian Monte Carlo using the no-U-turn sampler. In order to further investigate scalability, we also compared to the package Jags, which performed better than Stata’s native Bayesian engine but not as well as StataStan.

Here’s the punchline:

[Graphs comparing effective sample size per second for StataStan, Stata’s bayesmh, and JAGS on the Rasch and hierarchical Rasch models]

This is no surprise; still, it’s reassuring to see. (The lines in the graphs look a little jagged because we did just one simulation, from which the results are clear enough.)

Stan’s real advantage comes not just from speed but from flexibility—Stan can fit any continuous parameter model for which you can write the log-density—and from scalability: you can fit bigger models to bigger datasets. We’re moving closer to a one-size-fits-most data analysis tool where we don’t have to jump from program to program as our data and modeling needs increase.
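If you want to compute the comparison metric yourself, here’s a rough sketch from the R side; the file name rasch_model.stan and the data list rasch_data are placeholders for illustration, not files from our paper.

    library(rstan)

    # Time the fit and normalize effective sample size by seconds of
    # computation, which is the metric used in the paper.
    runtime <- system.time(
      fit <- stan(file = "rasch_model.stan", data = rasch_data,
                  iter = 2000, chains = 4)
    )["elapsed"]

    n_eff <- summary(fit)$summary[, "n_eff"]
    ess_per_second <- n_eff / runtime
    head(sort(ess_per_second))   # the slowest-mixing parameters are the bottleneck

The same timing-and-n_eff calculation applies whatever program produced the draws, which is what makes it a reasonable basis for comparing samplers.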

Irritating pseudo-populism, backed up by false statistics and implausible speculations

I was preparing my lecture for tomorrow and happened to come across this post from five years ago. And now I’m irritated by Matt Ridley all over again! I wonder if he’s still bashing “rich whites” and driving that 1975 Gremlin? Grrrr…

Rogue sociologist can’t stop roguin’

Mark Palko points me to two posts by Paul Campos (here and here) on this fascinating train wreck of a story.

What happens next? It was ok that George Orwell and A. J. Liebling and David Sedaris made stuff up because they’re such good writers. And journalists make up quotes all the time. But who’s gonna want to read the next Alice Goffman book? I dunno. Weggy’s retired by now I guess, so he can just keep quiet and not bother anyone. But it’s not quite clear to me what Goffman is qualified to do, once she’s lost the trust of her audience. Work with public documents, perhaps? She could be the new I. F. Stone.

P.S. More from Campos in the comment thread. I hadn’t realized that Goffman may have done a Michael Lacour and fabricated survey data too.

P.P.S. There was lots of discussion in the comments, so let me clarify some things. I’ve never met Alice Goffman or Paul Campos or any of the other people involved in this story, and I have no idea if Goffman really had the conversations she wrote about in her book. To my reading, Campos makes a convincing case that it’s highly implausible that these things could’ve happened the way Goffman said, but I don’t really know.

Here’s the key point: Alice Goffman’s success comes from telling stories that are both surprising and plausible, at least on a casual reading (no, this is not a contradiction to my earlier statement, as Campos’s point is that yes these stories could be plausible in a general sense but not in their details), stories that are socially relevant and highly detailed, which are presented as (and may actually be) true, but which are not documented.

So Goffman’s success, and the reputation of her work, depend crucially on the trust of her audience. Once that trust is gone, I think it’s very hard to get it back. I think she’ll have to move into an arena in which she can document her work, or else move into some field such as advocacy in which documented truth is not required.

Did Goffman make things up, was she misled by her charming sources, did she perhaps make up some things but not others, did it all happen just exactly as she said despite the seeming implausibility of the details, is it all basically correct except for some relatively minor exaggerations which she is now too embarrassed to admit? All these are possible. But, whatever it is, the trust is gone. If you got no documentation, it’s not enough that your stories could be true.

Jim Albert’s Baseball Blog

Jim Albert has a baseball blog:

I sent a link internally to people I knew were into baseball, to which Andrew replied, “I agree that it’s cool that he doesn’t just talk, he has code.” (No kidding—the latest post as of writing this was on an R package to compute value above replacement players (VAR).)

You may know me from…

You may know Jim Albert from the “Albert and Chib” approach to Gibbs sampling for probit regression. I first learned about him through his fantastic book, Curve Ball, which I recommend at every opportunity (the physical book’s inexpensive and I’m stunned Springer’s selling an inexpensive PDF with no DRM—no reason not to get it). It’s not only very insightful about baseball, it’s a wonderful introduction to statistics via simulation. It starts out analyzing All-Star Baseball, a game based on spinners. This book went a long way in helping me understand statistics, but at a level I could share with friends and family, not just math geeks. It then took Gelman and Hill’s regression book and working through the BUGS examples before I could make sense of BDA.

In the same vein, Albert has a solo book aimed at undergraduates or their professors—Teaching Statistics Using Baseball. And I just saw from his home page, a book on Analyzing Baseball Data with R.

Little Professor Baseball

I first wrote to Jim Albert way back before I was working with Andrew on Stan. I’d just read Curve Ball and had just created my very simple baseball simulation, Little Professor Baseball. I was very pleased with how I’d made it simple like All-Star Baseball, but included pitching and batting, like Strat-o-Matic Baseball (a more “serious” baseball simulation game). My only contribution was figuring out how to allow both players (offense/defense) to roll dice, with the result being read from the card of the highest roller. I had to solve a quadratic equation to adjust for the bias of taking the highest roller and further adjusting to deal with the Strat-o-Matic-style correction for only reading the results off a player’s card half the time (here are the derivations with a statistical discussion on getting the expectations right). I analyze the 1970 Major League Baseball season (same one used by Efron and Morris, by the way). I even name-drop Andrew’s hero, Earl Weaver, in the writeup.

My talk Fri 1pm at the University of Chicago

It’s the Data Science and Public Policy colloquium, and they asked me to give my talk, Little Data: How Traditional Statistical Ideas Remain Relevant in a Big-Data World. Here’s the abstract:

“Big Data” is more than a slogan; it is our modern world in which we learn by combining information from diverse sources of varying quality. But traditional statistical questions—how to generalize from sample to population, how to compare groups that differ, and whether a given data pattern can be explained by noise—continue to arise. Often a big-data study will be summarized by a little p-value. Recent developments in psychology and elsewhere make it clear that our usual statistical prescriptions, adapted as they were to a simpler world of agricultural experiments and random-sample surveys, fail badly and repeatedly in the modern world in which millions of research papers are published each year. Can Bayesian inference help us out of this mess? Maybe, but much research will be needed to get to that point.

It’s for the Data Science for Social Good program, so I suppose I’ll alter my talk a bit to discuss how data science can be used for social bad. The talk should be fun, but I do want to touch on some open research questions. Remember, theoretical statistics is the theory of applied statistics, and we have a lot of applied statistics to do, so we have a lot of theoretical statistics to do too.

Stan Talk in NYC: Macroeconomic Forecasting using Analogy Weighting

This post is by Eric.

The next Stan meetup is coming up in February. It will be hosted by the New York Bayesian Data Analysis Meetup group and International Securities Exchange. The BDA group was formerly called Stan Users – NYC. We will still be focusing on Stan, but we would also like to open it up to a broader Bayesian community and hold more regular meetups.

P.S. What is Analogy Weighting you ask? I have no idea, but I am sure Jim Savage will tell us.

Middle-aged white death trends update: It’s all about women in the south

Jonathan Auerbach and I wrote up some of the age-adjustment stuff we discussed on this blog a couple months ago. Here’s our article, a shorter version of which will appear as a letter in PPNAS.

And here’s the new analysis we did showing age-adjusted death rates for 45-54-year-old non-Hispanic white men and women:

[Graph: age-adjusted death rates for 45-54-year-old non-Hispanic white men and women]

Wow!! Remember that increasing death rate among middle-aged non-Hispanic whites? It’s all about women in the south (and, to a lesser extent, women in the midwest). Amazing what can be learned just by slicing data.

I don’t have any explanations for this. As I told a reporter the other day, I believe in the division of labor: I try to figure out what’s happening, and I’ll let other people explain why.

I’m sure you can come up with lots of stories on your own, though. When performing your reverse causal inference, remember that people move, and, as we’ve discussed before, the cohorts are changing. 45-54-year-olds in 1999 aren’t the same people as 45-54-year-olds in 2013. We adjust for changing age distributions (ya gotta do that) but we’re still talking about different cohorts.
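For concreteness, here’s a minimal sketch of the direct age adjustment we’re talking about, with invented numbers (the real analysis uses CDC death counts and population estimates by single year of age).

    # Invented single-year-of-age data for the 45-54 group in one year.
    ages       <- 45:54
    deaths     <- c(400, 410, 425, 440, 460, 480, 500, 525, 550, 580)
    population <- c(2.2, 2.2, 2.1, 2.1, 2.0, 2.0, 1.9, 1.9, 1.8, 1.8) * 1e6

    # Crude rate: total deaths over total population. This creeps upward if
    # the group's age composition shifts older, even with fixed age-specific rates.
    crude_rate <- sum(deaths) / sum(population)

    # Age-adjusted rate: weight the age-specific rates by a fixed reference
    # age distribution (here simply uniform over the ten ages).
    age_specific  <- deaths / population
    ref_weights   <- rep(1 / length(ages), length(ages))
    adjusted_rate <- sum(age_specific * ref_weights)

    round(c(crude = crude_rate, adjusted = adjusted_rate) * 1e5, 1)  # per 100,000

Do that for each year with the same reference weights and you get a trend that isn’t contaminated by the changing age composition of the 45-54 bin.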

Here’s how our paper begins:

In a recent article in PNAS, Case and Deaton show a figure illustrating “a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013.” The authors state that their numbers “are not age-adjusted within the 10-y 45-54 age group.” They calculated the mortality rate each year by dividing the total number of deaths for the age group by the population of the age group.

We suspected an aggregation bias and examined whether much of the increase in aggregate mortality rates for this age group could be due to the changing composition of the 45–54 year old age group over the 1990 to 2013 time period. If this were the case, the change in the group mortality rate over time may not reflect a change in age-specific mortality rates. Adjusting for age confirmed this suspicion. Contrary to Case and Deaton’s figure, we find there is no longer a steady increase in mortality rates for this age group. Instead there is an increasing trend from 1999–2005 and a constant trend thereafter. Moreover, stratifying age-adjusted mortality rates by sex shows a marked increase only for women and not men, contrary to the article’s headline.

And here’s the age-adjustment story in pictures:

[Graphs illustrating the age-adjustment story]

For some reason, the NYT ran a story on this the other day and didn’t age adjust, which was a mistake. Nor did they break down the data by region of the country. Too bad. Lots more people read the NYT than read this blog or even PPNAS.

My namesake doesn’t seem to understand the principles of decision analysis


It says “Never miss another deadline.” But if you really could never miss your deadlines, you’d just set your deadlines earlier, no? It’s statics vs. dynamics all over again.

That said, this advice seems reasonable:

The author has also developed a foolproof method of structuring your writing, so that you make effective use of your time. It’s based on the easy-to-remember three-step formula: Pre-write, Free-write, Re-write. Pre-write refers to researching the necessary information. Free-write refers to getting the information onto the computer screen. Re-write refers to the essential task of editing the writing into clear readable text. This technique allows writers to become the editors of their own writing, thereby dramatically improving its quality.

I haven’t actually read or even seen this book, but maybe I should take a look, as it is important to me that my students learn how to write effectively. A bit odd to choose a book based on the author’s last name, but that’s serendipity for you.

On deck this week

Mon: My namesake doesn’t seem to understand the principles of decision analysis

Tues: Middle-aged white death trends update: It’s all about women in the south

Wed: My talk Fri 1pm at the University of Chicago

Thurs: If you’re using Stata and you want to do Bayes, you should be using StataStan

Fri: One quick tip for building trust in missing-data imputations?

Sat: This graph is so ugly—and you’ll never guess where it appeared

Sun: 2 new reasons not to trust published p-values: You won’t believe what this rogue economist has to say.

P.S. If you just can’t wait till Tues to learn about the death trends, the paper is here.

And if you just can’t wait till Thurs to learn about why StataStan is the way to go for Bayes in Stata, that paper is here.

Both are already up on Arxiv so interested readers may have already encountered them.

It’s funny to think that I know what’s up all the way through April (modulo topical insertions), but you don’t!

Grizzly Adams is an object of the class Weekend at Bernies


It just came to me when I saw his obit.

The devil really is in the details; or, You’ll be able to guess who I think are the good guys and who I think are the bad guys in this story, but I think it’s still worth telling because it provides some insight into how (some) scientists view statistics

I noticed this on Retraction Watch:

“Scientists clearly cannot rely on the traditional avenues for correcting problems in the literature.” PubPeer responds to an editorial slamming the site.

I’ve never actually read anything on PubPeer but I understand it’s a post-publication review site, and I like post-publication review.

So I’m heading into this one on the side of PubPeer, and let me deflate any suspense right here by telling you that, having followed the links and read the discussion, my position hasn’t changed.

So, no news and no expectation that this new story should change your beliefs, if you happen to be on the Evilicious side of this particular debate.

So, if I’m not trying to convince anybody, why am my writing this post? Actually, I’m usually not trying to convince anyone when I write; rather, I use writing as a way to explore my thoughts and to integrate the discordant information I see into coherent stories (with one sort of coherent story being of the form, “I don’t yet understand what’s going on, the evidence seems to be contradictory, and I can’t form a coherent story”).

In that sense, writing is a form of posterior predictive check, or perhaps I should just say posterior inference, a way of working out the implications of my implicit models of the world in the context of available data.

They say Code Never Lies and they’re right, but writing has its own logic that can be helpful to follow.

Hence, I blog.

Now back to the item at hand. The above link goes to a post on PubPeer that begins as follows:

In an editorial entitled “Vigilante science”, the editor-in-chief of Plant Physiology, Michael Blatt, makes the hyperbolic claim that anonymous post-publication peer review by the PubPeer community represents the most serious threat to the scientific process today.

We obviously disagree. We believe a greater problem, which PubPeer can help to address, is the flood of low-quality, overinterpreted and ultimately unreliable research being experienced in many scientific fields . . .

I then clicked to see what Michael Blatt had to say in the journal Plant Physiology.

Since its launch in October 2012, PubPeer has sought to facilitate community-wide, postpublication critique of scientific articles. The Web site has also attracted much controversy . . . .

PubPeer operates as a blog on which anyone can post comments, either to a published article or to comments posted by other participants, and authors may respond. It is a bit like an extended journal club; not a bad idea to promote communication among scientists, you might think, so why the controversy?

Why, indeed? Blatt explains:

The problems arising are twofold . . . First, most individuals posting on PubPeer—let’s use the euphemism commenters for now—take advantage of the anonymity afforded by the site in full knowledge that their posts will be available to the public at large.

I don’t understand why “commenters” is considered a euphemism. That’s the problem with entering a debate in the middle—sometimes you can’t figure out what people are talking about.

Anyway:

Second, the vast majority of comments that are posted focus on image data (gels, blots, and micrographs) that contribute to the development of scientific ideas but are not ideas in themselves. With few exceptions, commenters on PubPeer do no more than flag perceived faults and query the associated content.

But, wait, what’s wrong with commenting on image data? And “flagging perceived faults”—that’s really important, no? We should all be aware of faults in published papers.

Of course, I say this as someone who’s published a paper that was invalidated by a data error, so I personally would benefit from outsiders checking my work and letting me know when they see something fishy.

So what’s the problem, then? Blatt tells us:

My overriding concern with PubPeer is the lack of transparency that arises from concealing the identities of both commenters and moderators.

This is so wrong I hardly know where to start. No, actually, I do know where to start, which is to point out that articles are published based on anonymous peer review.

Who were the reviewers who made the mistake of recommending publication of those papers by Daryl Bem or Satoshi Kanazawa or those ovulation-and-voting people? We’ll never know. For the himmicanes and hurricanes people, we do know that Susan Fiske was the editor who recommended publication, and she can be rightly criticized for her poor judgment on this one (nothing personal, I make lots of poor judgments myself, feel free to call me out on them), but we don’t know who were the external referees who failed to set her straight. Or, to go back 20 years, we don’t know who were the statistical referees who made the foolish, foolish decision to recommend that Statistical Science publish that horrible Bible Code paper. I do know the journal’s editor at the time, but he was in a difficult position if he was faced with positive referee reports.

So, according to Blatt: Anonymous pre-publication review, good. Anonymous post-publication review, bad. Got it.

Indeed, Blatt is insistent on this point:

I accept that there is a case for anonymity as part of the peer-review process. However, the argument for anonymity in postpublication discussion fallaciously equates such discussion with prepublication peer review. . . . In short, anonymity makes sense when reviews are offered in confidence to be assessed and moderated by an editor, someone whose identity is known and who takes responsibility for the decision informed by the reviews. Obviously, this same situation does not apply postpublication, not when the commenters enter into a discussion anonymously and the moderators are also unknown.

Oh no god no no no no no. Here’s the difference between pre-publication reviews, as usually conducted, and post-publication reviews:

Pre-publication reviews are secret. Not just the author of the review, also the actual content. Only very rarely are pre-publication reviews published in any form. Post-publication reviews, by their very nature, are public.

As Stephen King says, it’s the tale, not he who tells it. Post-publication reviews don’t need to be signed; we actually have the damn review. Given the review, the identity of the reviewer supplies very little information.

The other difference is that pre-publication reviews tend to be much more negative than post-publication reviews. I find it laughable when Blatt writes that post-publication reviews are “one-sided,” “petty,” “missing . . . courtesy and common sense,” “negative and occasionally malicious,” and “about policing, not discussion.” All these descriptions apply even more for pre-publication reviews.

Why do I care?

At this point, you might be asking yourself why I post this at all. Neither you nor I have ever heard of the journal Plant Physiology before, and we’ll likely never hear of it again. So who cares that the editor of an obscure journal emits a last-gasp rant against PubPeer, a site which represents the future in the same way that editor-as-gatekeeper Michael Blatt represents the past?

Who indeed? I don’t care what the editor of Plant Physiology thinks about post-publication review. What I do care about is that we’re not there yet. Any dramatic claim with “p less than .05” that appears in Science or Nature or PPNAS or Psychological Science still has a shot of getting massive publicity. That himmicanes-and-hurricanes study was just last year. And this year we’ve seen a few more.

P.S. Incidentally, it seems that journals vary greatly in the power they afford to their editors. I can’t imagine the editor of Biometrics or the Journal of the American Statistical Association being able to publish an opinion piece like this in the journal. I don’t know the general pattern here, but I have the vague impression that biomedical journals feature more editorializing, compared to journals in the physical and social sciences.

P.P.S. Two commenters pointed out small mistakes in this post, which I’ve fixed. Another point in favor of post-publication review!

Scientists Not Behaving Badly


Andrea Panizza writes:

I just read about psychologist Uri Simonsohn debunking research by colleagues Raphael Silberzahn & Eric Uhlmann on the positive effects of noble-sounding German surnames on people’s careers (!!!). The story is mentioned here.

I think that the interesting part (apart, of course, from the general weirdness of Silberzahn & Uhlmann’s research hypothesis) is that Silberzahn & Uhlmann gave Simonsohn full access to their data, and apparently he debunked their results thanks to a better analytical approach.

My reply: Yes, this is an admirable reaction. I had seen that paper when it came out, and what struck me was that, if there is such a correlation, there could be lots of reasons not involving a causal effect of the name. In any case, it’s good to see people willing to recognize their errors: “Despite our public statements in the media weeks earlier, we had to acknowledge that Simonsohn’s technique showing no effect was more accurate.”

More generally, this sort of joint work is great, even if it isn’t always possible. Stand-alone criticism is useful, and collaborative criticism such as this is good too.

In a way it’s a sad state of affairs that we have to congratulate a researcher for acting constructively in response to criticism, but that’s where we’re at. Forward motion, I hope.

McElreath’s Statistical Rethinking: A Bayesian Course with Examples in R and Stan


We’re not even halfway through with January, but the new year’s already rung in a new book with lots of Stan content: McElreath’s Statistical Rethinking.

This one got a thumbs up from the Stan team members who’ve read it, and Rasmus Bååth has called it “a pedagogical masterpiece.”

The book’s web site has two sample chapters, video tutorials, and the code.

The book is based on McElreath’s R package rethinking, which is available from GitHub with a nice README on the landing page.

If the cover looks familiar, that’s because it’s in the same series as Gelman et al.’s Bayesian Data Analysis.