
Sudden Money

Anne Pier Salverda writes:

I’m not sure if you’re keeping track of published failures to replicate the power posing effect, but this article came out earlier this month:

“Embodied power, testosterone, and overconfidence as a causal pathway to risk-taking”

From the abstract:

We were unable to replicate the findings of the original study and subsequently found no evidence for our extended hypotheses.

Gotta love that last sentence of the abstract:

As our replication attempt was conducted in the Netherlands, we discuss the possibility that cultural differences may play a moderating role in determining the physiological and psychological effects of power posing.

I’d like to stop here but maybe I should explain further. I think the effects of power pose do vary by country. They also vary from year to year, they’re different on weekday and weekend, different in work and home environments, they differ by outdoor temperature (the clothing you wear will affect the comfort or awkwardness of the pose), of course it varies by sex, and hormone level, and the time of the month, and your marital/relationship status, and the socioeconomic status of your parents, and the number of older siblings you have, and every other damn factor that’s ever been considered as an interaction in a social psychology study. The effects can also be moderated by subliminal smiley faces and priming with elderly-related words and shark attacks and college football games and whether your age ends in a 9 and gay genes and ESP and . . . hmmm, did I forget anything? I’m too lazy to supply links but you can search this blog for all the above phrases for more.

The point is, in a world where everything’s affecting everything else, the idea of “the effect” of power pose is pretty much meaningless. I mean, sure, it could have a huge and consistent effect. But the experiments that have been conducted don’t find that, and this is no surprise. Trying to come up with explanations for patterns in noise (as in the “Netherlands” comment above), that’s a mug’s game. You might as well just cut out the middleman, go to Vegas, and gamble away your reputation on the craps table. (See item 75 here.) In which case you’ll have to support yourself by writing things like The Book of Virtues, and who wants to do that?

P.S. We’re making slow but steady progress going through these Westlake-inspired post titles.

I respond to E. J.’s response to our response to his comment on our paper responding to his paper

In response to my response and X’s response to his comment on our paper responding to his paper, E. J. writes:

Empirical claims often concern the presence of a phenomenon. In such situations, any reasonable skeptic will remain unconvinced when the data fail to discredit the point-null. . . . When your goal is to convince a skeptic, you cannot ignore the point-null, as the point-null is a statistical representation of the skeptic’s opinion. Refusing to discredit the point-null means refusing to take seriously the opinion of a skeptic. In academia, this will not fly.

I don’t know why E. J. is so sure about what will or will not fly in academia, given that I’ve published a few zillion applied papers in academic journals while only very occasionally doing significance tests.

But, setting aside claims about things not flying, I agree with the general point that hypothesis tests can be valuable at times. See, for example, page 70 of this paper. Indeed, in our paper, we wrote, “We have no desire to ‘ban’ p-values. . . . in practice, the p-value can be demoted from its threshold screening role and instead be considered as just one among many pieces of evidence.” I think E. J.’s principle about respecting skeptics is consistent with what we wrote, that p-values can be part of a statistical analysis.

P.S. Also E. J. promises to blog on chess. Cool. We need more statistician chessbloggers. Maybe Chrissy will start a blog too. After all, there’s lots of great material he could copy. There’d be no need to plagiarize: Chrissy could just read the relevant material, not check it for accuracy, and then rewrite it in his own words, slap his name on it, and be careful not to give credit to the people who went to the trouble to compile the material themselves.

P.P.S. I happened to have just come across this relevant passage from Regression and Other Stories:

We have essentially no interest in using hypothesis tests for regression because we almost never encounter problems where it would make sense to think of coefficients as being exactly zero. Thus, rejection of null hypotheses is irrelevant, since this just amounts to rejecting something we never took seriously in the first place. In the real world, with enough data, any hypothesis can be rejected.

That said, uncertainty in estimation is real, and we do respect the deeper issue being addressed by hypothesis testing, which is assessing when an estimate is overwhelmed by noise, so that some particular coefficient or set of coefficients could just as well be zero, as far as the data are concerned. We recommend addressing such issues by looking at standard errors as well as parameter estimates, and by using Bayesian inference when estimates are noisy, as the use of prior information should stabilize estimates and predictions.
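
The claim that “with enough data, any hypothesis can be rejected” can be illustrated with a short sketch. The numbers here are hypothetical and are not from the book: assume a true coefficient of 0.01 and a residual standard deviation of 1, and look at the z-score you would expect against a point null of zero as the sample size grows.

```python
import math

# ASSUMPTION: illustrative values only, not taken from the source.
TRUE_EFFECT = 0.01   # tiny but nonzero
SIGMA = 1.0          # residual standard deviation

def expected_z(n, effect=TRUE_EFFECT, sigma=SIGMA):
    """Expected z-score of a sample mean against a point null of zero."""
    standard_error = sigma / math.sqrt(n)
    return effect / standard_error

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,d}   expected z = {expected_z(n):.2f}")
```

The effect never changes; only the standard error shrinks. At some sample size the test rejects a null that nobody believed in the first place, which is why the passage recommends looking at estimates and standard errors instead.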

“Why bioRxiv can’t be the Central Service”

I followed this link to Jordan Anaya’s page and there to this post on biology preprint servers.

Anyway, as a fan of preprint servers I appreciate Anaya’s point-by-point discussion of why one particular server, bioRxiv (which I’d never heard of before but I guess is popular in biology), can’t do what some people want it to do.

The whole thing is also one more demonstration of why twitter sucks (except this one time), in that Anaya is responding to some ignorance coming from that platform. On the other hand, one could say that twitter is valuable in this case as having brought a widespread misconception to the surface.

P.S. Lots and lots of biology papers get written and cited. Just for example I was looking up my colleague John Carlin on Google Scholar. He has an h-index of 95! He works in biostatistics. Another friend from grad school, Chris Schmid, his h-index is 85. h-index is just one thing, it’s no big deal, it’s just interesting to see how that works. Some fields get lots of citation because people are publishing tons of papers there. In biology there are a zillion postdocs all publishing papers, and every paper has about 30 authors. I imagine there will soon be a similar explosion of citations in computer science—if it hasn’t happened already—because every Ph.D. student and postdoc in CS is submitting multiple papers to all the major conferences. If conference papers are getting indexed, this is gonna blow all the citation counts through the roof. Actually this sort of hyperinflation might be a net positive in that it would devalue the whole citation-count thing.

P.P.S. Anaya’s post has a place for comments but it’s on this site called Medium where if you want to comment, you need to sign in, and then you start getting mail in your inbox from Medium, and if you want to cancel your Medium account, it tells you that if you do so, it will delete all your posted comments. That ain’t cool.

Stan Roundup, 6 October 2017

I missed last week and almost forgot to add this week’s.

  • Jonah Gabry returned from teaching a one-week course for a special EU research institute in Spain.

  • Mitzi Morris has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higher-order functions.

  • Michael Betancourt with help from Sean Talts spent last week teaching an intro course to physicists about Stan. Charles Margossian attended and said it went really well.

  • Ben Goodrich, in addition to handling a slew of RStan issues has been diving into the math library to define derivatives for Bessel functions.

  • Aki Vehtari has put us in touch with the MxNet developers at Amazon UK and Berlin and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now).

  • Aki is also working on revising the EP as a way of life paper and finalizing other Stan-related papers.

  • Bob Carpenter and Andrew Gelman have recruited Advait Rajagopal to help us with the Coursera specialization we’re going to offer (contingent on coming to an agreement with Columbia). The plan’s to have four courses: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and other stories (Andrew).

  • Ben Bales finished the revised pull request for vectorized RNGs. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and what not turn up.

  • Daniel Lee is getting ready for ACoP, which Bill Gillespie and Charles Margossian will also be presenting at.

  • Steven Bronder and Rok Češnovar, with some help from Daniel Lee, are going to merge the ViennaCL library for GPU matrix ops with their own specializations for derivatives in Stan into the math library. This is getting close to being real for users.

  • Sean Talts when he wasn’t teaching or learning physics has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.

  • Breck Baldwin is taking over dealing with StanCon. Our goal is to get up to 150 registrations.

  • Breck Baldwin has also been working with Andrew Gelman and Jonathan Auerbach on non-conventional statistics training (like at Maker Faires)—they have the beginnings of a paper. Breck’s highly recommending the math museum in NY to see how this kind of thing’s done.

  • Bob Carpenter published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.

  • Imad Ali went to the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.

  • Ben Goodrich fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).

  • Advait Rajagopal has been working with Imad Ali on adding ARMA and ARIMA time-series functions to rstanarm.

  • Aki Vehtari is working to enhance the loo package with automated code for K-fold cross validation for (g)lmer models.

  • Lizzie Wolkovich visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).

  • Krzysztof Sakrejda has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.

  • Michael Andreae has kicked off his meta-analysis and graphics project at Penn State with Jonah Gabry and Ben Goodrich chipping in.

  • Ben Goodrich also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.

  • Yuling Yao gave us the rundown on where ADVI testing stands. It may falsely report convergence when it’s not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and Andrew Gelman are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues and preconditioning along with natural gradients may solve the problem. It’s nice to see grad students sink their teeth into a problem! It’d be great if we could come up with a more robust ADVI implementation that had diagnostic warnings if the approximation wasn’t reliable.

I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief

Tyler Cowen writes:

I am frustrated by the lack of Bayesianism in most of the religious belief I observe. I’ve never met a believer who asserted: “I’m really not sure here. But I think Lutheranism is true with p = .018, and the next strongest contender comes in only at .014, so call me Lutheran.” The religious people I’ve known rebel against that manner of framing, even though during times of conversion they may act on such a basis.

I think Cowen’s missing the point here when it comes to Bayesianism. Indeed, as an applied Bayesian statistician, I’m not even “Bayesian” in Cowen’s sense when it comes to statistical inference! Suppose I fit some data using logistic regression (my go-to default when modeling survey data with Mister P). I don’t say “logistic regression is true with p = .018, and the next strongest contender comes in only at .014, so call me logistic.” What I say is that I use logistic regression because it works for the problems I work on, and if it has problems, I’ll change the model. I also might want to try some other models as a robustness check. But Bayesian reasoning doesn’t at all require that I assign probabilities to my models.

Or, in a different direction, we can resolve Cowen’s problem by thinking of religious belief as analogous to nationality. Being an American doesn’t mean that I say that Pr(Americanism is true) = .018 or whatever. It’s just an aspect of who I am. This framing becomes particularly clear if you think of interactions between religion and nationality, such as Irish Catholic or Indian Muslim or whatever. And then there are Episcopalians, which from a doctrinal perspective are very close to Roman Catholics but are just part of a different organization. There’s a lot of overlap between religion and nationality. Another way to put it is that sticking with your own nationality, or your own religion, is the default. You can switch religions if there’s another religion you really like, or because you have some other reason (for example, liking the community at one of the local churches), but that’s not a statement about which religion is “true.”

To loop back to statistics, I suppose someone might talk Bayesianly about the probability that a particular religion is best for him/herself, but that’s not at all the same as the probability that the doctrine is true.

Cowen is frustrated by what he sees as “lack of Bayesianism” in religious beliefs that he observes, but I think that if he had a fuller view of Bayesianism this would all make sense to him. In my recent paper with Hennig we talk about “falsificationist Bayesianism.” The idea is that a falsificationist Bayesian performs inference conditional on a model—that is, treats the model as if it were true—and then uses these inferences to make decisions while keeping an eye out for implications of the model that conflict with data or don’t make sense. From a Bayesian perspective, if a prediction “doesn’t make sense,” this implies that it’s in contradiction with some piece of prior information that may not yet have been included in the model. As we move forward in this way, we continue to update and revise our model, occasionally revamping or even discarding the model entirely if it is continuing to offer predictions that make no sense. This sort of Bayesianism does not seem so far off from many forms of non-fundamentalist religious belief.

P.S. to all the wiseguys who will joke that Bayesianism is a religion: Sure, whatever. The same principle applies to statistical methods and frameworks: we use them to solve problems and then alter or abandon them when they no longer seem to be working for us.

What am I missing and what will this paper likely lead researchers to think and do?

This post is by Keith.

In a previous post Ken Rice brought our attention to a recent paper he had published with Julian Higgins and  Thomas Lumley (RHL). After I obtained access and read the paper, I made some critical comments regarding RHL which ended with “Or maybe I missed something.”

This post will try to discern what I might have missed by my recasting some of the arguments I discerned as being given in the paper. I do still think,  “It is the avoidance of informative priors [for effect variation] that drives the desperate holy grail quest to make sense of varying effects as fixed”. However, given for argument’s sake that one must for some vague reason avoid informative priors for effect variation at all cost, I will try to discern if RHL’s paper outlined a scientifically profitable approach.

However, I should point out their implied priors seem to be a point prior of zero for there being any effect variation due to varying study quality and a point prior of one that the default fixed effect estimate can be reasonably generalized to a population of real scientific interest. In addition, as I think the statistical discipline needs to take more responsibility for the habits of inference it instils in others, I am very concerned about what various research groups will most likely think and do given an accurate reading of RHL.

Succinctly (as it’s a long post) what I mostly don’t like about RHL’s paper is that they seem to suggest their specific weighted averaging to a population estimand – which annihilates the between study variation – will be of scientific relevance and from which one can sensibly generalize to a target population of interest. Furthermore it is suggested as being widely applicable and often only involves the use of default inverse variance weights. Appropriate situations will exist but I think they will be very rare. Perhaps most importantly, I believe RHL need to set out how this will be credibly assessed to be the case in application. RHL does mention limitations, but I believe these are of a rather vague sort of don’t use these methods when they are not appropriate.

That is, there is seemingly little or no advice on when (or how to check whether) one should use the publication-interesting narrow intervals or the publication-uninteresting wide intervals.


I’m not on twitter

This blog auto-posts. But I’m not on twitter. You can tweet at me all you want; I won’t hear it (unless someone happens to tell me about it).

So if there’s anything buggin ya, put it in a blog comment.

Should we worry about rigged priors? A long discussion.

Today’s discussion starts with Stuart Buck, who came across a post by John Cook linking to my post, “Bayesian statistics: What’s it all about?”. Cook wrote about the benefit of prior distributions in making assumptions explicit.

Buck shared Cook’s post with Jon Baron, who wrote:

My concern is that if researchers are systematically too optimistic (or even self-deluded) about the prior evidence—which I think is usually the case—then using prior distributions as the basis for their new study can lead to too much statistical confidence in the study’s results. And so could compound the problem.

Stuart Buck asked what would I say to this, and I replied:

My response to Jon is that I think all aspects of a model should be justified. Sometimes I speak of there being a “paper trail” of all modeling and data-analysis decisions. My concern here is not so much about p-hacking etc. but rather that people can get wrong answers because they just use conventional modeling choices. For example, in those papers on beauty and sex ratios, the exciting but wrong claims can be traced to the use of a noninformative uniform prior on the effects, even though there’s a huge literature showing that sex ratios vary by very little. Similarly in that ovulation-and-clothing paper: for the data to have been informative, any real effect would have had to be huge, and this just makes no sense. John Carlin and I discuss this in our 2014 paper.

To address Jon’s concern more directly: Suppose a researcher does an experiment and he says that his prior is that the new treatment will be effective, for example his prior dist on the effect size is normal with mean 0.2 and sd 0.1, even before he has any data. Fine, he can say this, but he needs to justify this choice. Just as, when he supplies a data model, it’s not enough for him just to supply a vector of “data,” he also needs to describe his experiment so we know where his data came from. What’s his empirical reasoning for his prior? Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size. That’s a big claim to make, and I see no reason why a journal would accept this or why a policymaker would believe it, if no good evidence is given.
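
To see how strong a claim that prior is, here is a minimal sketch that just asks the N(0.2, 0.1) distribution from the paragraph above what it asserts before any data arrive. It uses only Python’s standard library; nothing here is specific to any particular study.

```python
from statistics import NormalDist

# The prior from the text: normal with mean 0.2 and sd 0.1.
prior = NormalDist(mu=0.2, sigma=0.1)

# Prior probability that the treatment effect is positive.
p_positive = 1 - prior.cdf(0.0)

# Central 95% prior interval for the effect size.
lower, upper = prior.inv_cdf(0.025), prior.inv_cdf(0.975)

print(f"Pr(effect > 0) = {p_positive:.3f}")
print(f"central 95% prior interval = [{lower:.3f}, {upper:.3f}]")
```

In other words, the researcher is saying, before collecting any data, that the treatment almost certainly works. That is the statement that needs an empirical justification.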

Stuart responded to me:

“Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size.”

Aha, I think that’s just the rub – what are “real” effects as opposed to the effects found in prior studies? Due to publication bias, researcher biases, etc., effects found in prior studies may be highly inflated, right? So anyone studying a particular social program (say, an educational intervention, a teen pregnancy program, a drug addiction program, etc.) might be able to point to several prior studies finding huge effects. But does that mean the effects are real? I’d say no. Likely the effects are inflated.

So if the prior effects are inflated, how would that affect a Bayesian analysis of a new study on the same type of program?

I replied: Yes, exactly. Any model has to be justified. For example, in that horrible paper purporting to estimate the effects of air pollution in China (see figure 1 here), the authors should have felt a need to justify that high-degree polynomial—actually, the problem is not so much with a high-degree curve but with the unregularized least-squares fit. It’s not enough just to pick a conventional model and start interpreting coefficients. Picking a prior distribution based on biased point estimates from the published literature, that’s not a good justification. One of the advantages of requiring a paper trail is that then you can see the information that people are using to make their modeling decisions.

Stuart followed up:

Take a simpler question (as my colleague primarily funds RCTs) — a randomized trial of a program intended to raise high school graduation rates. 1,000 kids are randomized to get the program, 1,000 are randomized into the control, and we follow up 3 years later to see which group graduated more often.

The simplest frequentist way to analyze that would be a t-test of the means, right? Or just a simple regression — Y (grad rate) = alpha + Beta * [treatment] + error.

If you analyzed the RCT using Bayesian stats instead, would your ultimate conclusion about the success of the program be affected by your choice of prior, and if so, how much? My colleague has the impression that a researcher who is strongly biased in favor of that program would somehow use Bayesian stats in order to “stack the deck” to show the program really works, but I’m not sure that makes sense.

I replied: The short story is that, yes, the Bayesian analysis depends on assumptions, and so does the classical analysis. I think it’s best for the assumps to be clear.

Let’s start with the classical analysis. A t-test is a t-test, and a regression is a regression, no assumptions required, these are just data operations. The assumptions come in when you try to interpret the results. For example, you do the t-test and the result is 2.2 standard errors away from 0, and you take that as evidence that the treatment “works.” That conclusion is based on some big assumptions, as John Carlin and I discuss in our paper. In particular, the leap from “statistical significance” to “the treatment works” is only valid when type M and type S errors are low—and any statement about these errors requires assumptions about effect size.
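
Here is a small sketch of the “just data operations” point, using simulated data for an RCT like the one Stuart describes. The graduation rates (70% and 74%) and the random seed are made up for illustration. The code computes the difference in means with its standard error, then fits the regression of the outcome on a treatment indicator by least squares.

```python
import math
import random

random.seed(1)

# ASSUMPTION: hypothetical graduation probabilities, for illustration only.
N_PER_ARM = 1000
control = [1 if random.random() < 0.70 else 0 for _ in range(N_PER_ARM)]
treated = [1 if random.random() < 0.74 else 0 for _ in range(N_PER_ARM)]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Difference in means, its (unpooled) standard error, and the t-statistic.
diff = mean(treated) - mean(control)
se = math.sqrt(sample_var(treated) / N_PER_ARM + sample_var(control) / N_PER_ARM)
t_stat = diff / se

# Least-squares fit of y = alpha + beta * treatment.
y = control + treated
x = [0] * N_PER_ARM + [1] * N_PER_ARM
x_bar, y_bar = mean(x), mean(y)
beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha = y_bar - beta * x_bar

print(f"difference in means = {diff:.4f}, t = {t_stat:.2f}")
print(f"regression: alpha = {alpha:.4f}, beta = {beta:.4f}")
```

The regression slope is the difference in means and the intercept is the control-group mean, so the two analyses are the same arithmetic. None of it says whether the treatment “works”; that step is where the assumptions come in.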

Let’s take an example that I’ve discussed a few times on the blog. Gertler et al. ran a randomized experiment of an early childhood intervention in Jamaica and found that the treatment raised earnings by 42% (the kids in the study were followed up until they were young adults and then their incomes were compared). The result was statistically significant so for simplicity let’s say the 95% conf interval is [2%, 82%]. Based on the classical analysis, what conclusions are taken from this study? (1) The treatment works and has a positive effect. (2) The estimated treatment effect is 42%. Both these conclusions are iffy: (a) Given the prior literature (see, for example, the Charles Murray quote here), it’s hard to believe the true effect is anything near 42%, which suggests that Type M and Type S errors in this study could be huge, implying that statistical significance doesn’t tell us much; (b) The Gertler et al. paper has forking-path issues so it would not be difficult for them to find a statistically significant comparison even in the absence of any consistent true effect; (c) in any case, the 42% is surely an overestimate: Would the authors or anyone else really be willing to bet that a replication would achieve such a large effect?
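
To put rough numbers on the Type M and Type S concern, here is a sketch under stated assumptions. The standard error is backed out of the simplified interval [2%, 82%] above. The true effect of 10% is a hypothetical value chosen for illustration, not an estimate from any study. The helper `design_errors` is my own name, and it uses closed-form truncated-normal formulas under a normal sampling distribution, which is not necessarily how any published calculation does it.

```python
from statistics import NormalDist

def design_errors(true_effect, se, alpha=0.05):
    """Power, Type S error rate, and exaggeration ratio (Type M) for a
    normally distributed estimate, given an assumed positive true effect."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    a = z - true_effect / se       # standardized upper significance threshold
    b = -z - true_effect / se      # standardized lower significance threshold
    p_hi = 1 - std.cdf(a)          # significant, correct sign
    p_lo = std.cdf(b)              # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E|estimate| over the significant region, from truncated-normal means.
    abs_sum = true_effect * (p_hi - p_lo) + se * (std.pdf(a) + std.pdf(b))
    exaggeration = abs_sum / power / true_effect
    return power, type_s, exaggeration

# Standard error implied by the simplified 95% interval [2%, 82%].
se = (82 - 2) / (2 * NormalDist().inv_cdf(0.975))

# ASSUMPTION: a hypothetical true effect of 10%, for illustration only.
power, type_s, exaggeration = design_errors(true_effect=10.0, se=se)

print(f"standard error      = {se:.1f}")
print(f"power               = {power:.3f}")
print(f"Type S error rate   = {type_s:.3f}")
print(f"exaggeration ratio  = {exaggeration:.1f}")
```

Under these assumptions the study has low power, a statistically significant estimate has a non-trivial chance of having the wrong sign, and a significant estimate overstates the true effect several-fold on average. Change the assumed true effect and the numbers change, which is the point: any statement about these errors requires an assumption about effect size.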

So my point is that the classical inferences—the conclusion that the treatment works and the point estimate of the effect—are strongly based on assumptions which, in conventional reporting, are completely hidden. Indeed I doubt that Gertler et al. themselves are aware of the assumptions underlying their conclusions. They correctly recognize that the mathematical operations they apply to their data—the t-test and the regression—are assumption-free (or, I should say, rely on very few assumptions). But they don’t recognize that the implications they draw from their statistical significance depend very strongly on assumptions which, in their example, are difficult to justify. If they were required to justify their assumptions (to make a paper trail, as I put it), they might see the problem. They might recognize that the strong claims they draw from their study are only justifiable conditional on already believing the treatment has a very large and positive effect.

OK, now on to the Bayesian analysis. You can start with the flat-prior analysis. Under the flat prior, a statistically significant difference gives a probability of greater than 97.5% that the true effect is in the observed direction. For example in that Gertler et al. study you’d be 97.5%+ sure that the treatment effect is positive, and you’d be willing to bet at even odds that the true effect is bigger or smaller than 42%. Indeed, you’d say that the effect is as likely to be 82% as 2%. That of course is ridiculous: a 2% or even a 0% effect is quite plausible, whereas an 82% effect, even if it might exist in this population for some unlikely historical reason, is not plausible in any larger context. But that’s fine, this tells us that we have prior information that’s not included in our model. A more plausible prior might have a mean of 0 and a standard deviation of 10%, or maybe some longer-tailed distribution such as a t with low degrees of freedom with center 0 and scale 10%. I’m not sure what’s best here, but one could make some prior based on the literature. The point is that it would have to be justified.

Now suppose some wise guy wants to stack the deck by, for example, giving the effect size a prior that’s normal with mean 20% and sd 10%. Well, the first thing is that he’d have to justify that prior, and I think it would be hard to justify. If it did get accepted by the journal reviewers, that’s fine, but then anyone who reads the paper would see this right there in the methods section: “We assumed a normal prior with mean 20% and sd 10%.” Such a statement would be vulnerable to criticism. People know about priors. Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here! The other funny thing is, in this case, such a prior is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior.
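
The comparison of priors can be sketched with the usual normal-normal updating formula. This treats the estimate as normal with the standard error implied by the simplified [2%, 82%] interval, which is an approximation adopted here for illustration. The three priors are the ones mentioned in the text: flat, normal(0, 10%), and the wise guy’s normal(20%, 10%).

```python
import math
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Posterior mean and sd for a normal prior and a normal likelihood."""
    prior_prec = 1 / prior_sd**2
    data_prec = 1 / se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

estimate = 42.0
se = (82 - 2) / (2 * NormalDist().inv_cdf(0.975))

# Flat prior: the posterior is just normal(estimate, se).
flat = NormalDist(estimate, se)
p_positive_flat = 1 - flat.cdf(0)

skeptical = normal_posterior(0, 10, estimate, se)   # prior centered at zero
stacked = normal_posterior(20, 10, estimate, se)    # the "stacked deck" prior

print(f"flat prior:       mean = {estimate:.1f}, Pr(effect > 0) = {p_positive_flat:.3f}")
print(f"normal(0, 10):    mean = {skeptical[0]:.1f}, sd = {skeptical[1]:.1f}")
print(f"normal(20, 10):   mean = {stacked[0]:.1f}, sd = {stacked[1]:.1f}")
```

Because the data are noisy relative to either informative prior, both posterior means land much closer to the prior mean than to 42%. Even the deck-stacking prior pulls the estimate down from the flat-prior answer, which is the funny thing noted above.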

So I think my position here is clear. Sure, people can stack the deck. Any stacking should be done openly, and then readers can judge the evidence for themselves. That would be much preferable to the current situation in which inappropriate inferences are made without recognition of the assumptions that justify them.

At this point Jon Baron jumped back in. First, where I wrote above “Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here!”, Baron wrote:

I’m not sure it wouldn’t fly under the radar just like the other assumptions in Gertler’s study that make its findings unreliable—I think the Heckmans and many other wishful thinkers on early childhood programs would say that the assumption about priors is fully justified.

I replied: Sure, maybe they’d say so, but I’d like to see that claim in black and white in the paper: then I could debate it directly! As it is, the authors can implicitly rely on such a claim and then withdraw it later. That’s the problem I have with these point estimates: the point estimate is used as advertising but then if you question it, the authors retreat to saying it’s just proof of an effect.

That happened with that horrible ovulation-and-clothing paper: my colleague and I asked how anyone could possibly believe that women are 3 times as likely to wear red on certain days of the month, and then the authors and their defenders pretty much completely declined to defend that factor of 3. I have this amazing email exchange with a psych prof who was angry at me for dissing that study: I asked him several times whether he thought that women were actually 3 times more likely to wear red on these days, and he just refused to respond on that point.

So, yeah, I think it would be a big step forward for these sorts of quantitative claims to be out in the open.

Second, Baron followed up my statement that “such a prior [normal with mean 20% and sd 10%] is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior,” by asking:

What about the not-unrealistic situation where the wishful thinker says the prior effect size is 30% (based on Perry Preschool and Abecedarian etc.) and his new study comes in with an effect size of, say, 25%. Would the Bayesian approach be more likely to find a statistically significant effect than the classical approach in this situation?

My reply: Changing the prior will change the point estimate and also change the uncertainty interval. In your example, if the wishful thinker says 30% and the new study estimate says 25%, then, yes, the wiseguy will feel confirmed. But it’s the role of the research community to point out that an appropriate analysis of Perry, Abecedarian, etc., does not lead to a 30% estimate!

BREAKING . . . . . . . PNAS updates its slogan!

I’m so happy about this, no joke.

Here’s the story. For a while I’ve been getting annoyed by the junk science papers (for example, here, here, and here) that have been published by the Proceedings of the National Academy of Sciences under the editorship of Susan T. Fiske. I’ve taken to calling it PPNAS (“Prestigious proceedings . . .”) because so many news outlets seem to think the journal is so damn prestigious. Indeed, if PNAS just published those articles and nobody listened, it would be fine. I have a blog where I can publish any old things that I want; Susan T. Fiske has a journal where she can publish articles by her friends and other papers that she personally thinks are interesting and important. The problem is that, to many in the outside world, publication in PPNAS is a signal of quality, and organs such as NPR will report PPNAS articles without appropriate skepticism.

One thing that bugged me about PPNAS was this self-description on their website:

So I contacted someone at the National Academy of Sciences, asking if they could do something about that false statement on its webpage: “PNAS publishes only the highest quality scientific research.”

No journal is perfect; it’s no slam on PNAS to say that they publish some low quality papers. But that statement seems weird in that it puts the National Academy of Sciences in the position of defending some extremely bad papers that the journal has happened to mistakenly publish.

And . . . they fixed it. Here’s the new version:

They strive to publish only the highest quality scientific research. That’s exactly right! I’m so glad they fixed that. I’m not being ironic here. I really mean it.

PNAS no longer has a demonstrably false statement on their webpage. Progress happens one step at a time, and I welcome this step. Good on ya, PNAS!

P.S. Just to be clear, let me emphasize that the message of this post is positive positive positive. PNAS is a journal that publishes lots of excellent papers. It publishes some duds, but that’s unavoidable if you publish 3000 papers a year. Editors are busy, peer reviewers are unpaid and don’t always know what they’re doing, and it can be hard for everyone involved to keep up with the latest scientific developments. And PNAS sometimes publishes papers that are outside its areas of core competence (PNAS publishing on baseball makes about as much sense as Bill James writing on physics), but, even here, I can see the virtue of stepping out on occasion, living on the edge. Little is lost by such experiments and there’s always the potential for unexpected connections. Striving to publish only the highest quality scientific research is an excellent aim, and I’m glad that’s what the National Academy of Sciences is doing.

When considering proposals for redefining or abandoning statistical significance, remember that their effects on science will only be indirect!

John Schwenkler organized a discussion on this hot topic, featuring posts by
– Dan Benjamin, Jim Berger, Magnus Johannesson, Valen Johnson, Brian Nosek, and E. J. Wagenmakers
– Felipe De Brigard
– Kenny Easwaran
– Andrew Gelman and Blake McShane
– Kiley Hamlin
– Edouard Machery
– Deborah Mayo
– “Neuroskeptic”
– Michael Strevens
– Kevin Zollman.

Many of the commenters have interesting things to say, and I recommend you read the entire discussion.

The one point that I think many of the discussants are missing, though, is the importance of design and measurement. For example, Benjamin et al. write, “Compared to using the old 0.05 threshold, maintaining the same level of statistical power requires increasing sample sizes by about 70%.” I’m not disputing the math, but I think that sort of statement paints much too optimistic a picture. Existing junk science such as himmicanes and air rage, or ovulation and voting and clothing, or the various fMRI and gay-gene studies that appear regularly in the news, will not be saved by increasing sample size by 70% or 700%. Larger sample size might enable researchers to more easily reach those otherwise elusive low p-values but I don’t see this increasing our reproducible scientific knowledge. Along those lines, Kiley Hamlin recommends going straight to full replications, which would have the advantage of giving researchers a predictive target to aim at. I like the idea of replication, rather than p-values, being a goal. On the other hand, again, p-values are noisy, and none of this is worth anything if measurements are no good.
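For readers who want to see where the 70% figure comes from, here is a rough sketch of the arithmetic. It assumes a two-sided z-test, a fixed true effect, and 80% power (the power level and the function name are my assumptions, not something taken from Benjamin et al.), and it uses the usual approximation that the required sample size is proportional to the squared sum of the two normal quantiles.

```python
from statistics import NormalDist

def relative_sample_size(alpha_old=0.05, alpha_new=0.005, power=0.80):
    """Ratio of sample sizes needed to keep the same power when alpha changes,
    for a two-sided z-test of a fixed true effect (standard approximation that
    ignores the tiny chance of significance in the wrong direction)."""
    z = NormalDist().inv_cdf
    needed_old = (z(1 - alpha_old / 2) + z(power)) ** 2
    needed_new = (z(1 - alpha_new / 2) + z(power)) ** 2
    return needed_new / needed_old

ratio = relative_sample_size()
print(f"sample size multiplier: {ratio:.2f}")
```

The multiplier comes out at about 1.7, consistent with the quoted claim. The calculation takes the measurements and the assumed effect size as given, which is exactly the part that is in doubt for the studies listed above.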

So one thing I wish more of the discussants had talked about is that, when applied to junk science (and all of this discussion is in large part the result of the cancerous growth of junk science within the scientific enterprise), the effect of new rules on p-values etc. will be indirect. Requiring p less than 0.005, or requiring Bayes factors, abandoning statistical significance entirely, or anything in between: none of these policies will turn work such as power pose or beauty-and-sex-ratio or the work of the Cornell University Food and Brand Lab into reproducible science. All it will do is possibly (a) make such work harder to publish as is, and (b) as a consequence of that first point, motivate researchers to do better science, to design more targeted studies with better measurements so as to be able to succeed in the future.

It’s a good goal to aim for (a) and (b), so I’m glad of all this discussion. But I think it’s important to emphasize that all the statistical analysis and statistical rule-giving in the world can’t transform bad data into good science. So I’m a bit concerned about messages implying that, with a mere increase of sample size by a factor of 1.7 or 2, reproducibility problems will be solved. At some point, good science requires good design and measurement.

There’s an analogy to approaches to education reform that push toward high standards, toward not letting students graduate unless their test scores reach some high threshold. Ending “social promotion” from grade to grade in school might be a good idea in itself, and in the right environment it might motivate students to try harder at learning and schools to try harder at teaching—but, by themselves, standards are just an indirect tool. At some point the learning has to happen. This analogy is not perfect—for one thing, a p-value is not a measure of effect size, and null hypothesis significance testing addresses an uninteresting model of zero effect and zero systematic error, kind of like if an educational test did not even attempt to measure mastery, instead merely trying to demonstrate that the amount learned was not exactly zero—but my point in the present post is to emphasize the essentially indirect nature of any procedural solutions to research problems.

Again we can consider that hypothetical study attempting to measure the speed of light by using a kitchen scale to weigh an object before and after it is burned: it doesn’t matter what p-value is required, this experiment will never allow us to measure the speed of light. The best we can do with rules is to make it more difficult and awkward to claim that such a study can give definitive results, and thus dis-incentivize people from trying to perform, publish, and promote such work. Substitute ESP or power pose or fat arms and voting or himmicanes etc. in the above sentences and you’ll get the picture.

As Blake and I wrote in the conclusion of our contribution to the above-linked discussion:

Looking forward, we think more work is needed in designing experiments and taking measurements that are more precise and more closely tied to theory/constructs, doing within-person comparisons as much as possible, and using models that harness prior information, that feature varying treatment effects, and that are multilevel or meta-analytic in nature, and—of course—tying this to realism in experimental conditions.

See here and here for more on this topic which we are blogging to death. I appreciate the comments we’ve had here from people who disagree with me on these issues: a blog comment thread is a great place to have a discussion back and forth involving multiple viewpoints.

Alan Sokal’s comments on “Abandon Statistical Significance”

The physicist and science critic writes:

I just came across your paper “Abandon statistical significance”. I basically agree with your point of view, but I think you could have done more to *distinguish* clearly between several different issues:

1) In most problems in the biomedical and social sciences, the possible hypotheses are parametrized by a continuous variable (or vector of variables), or at least one that can be reasonably approximated as continuous. So it is conceptually wrong to discretize or dichotomize the possible hypotheses. [The same goes for the data: usually it is continuous, or at least discrete with a large number of possible values, and it is silly to artificially dichotomize or trichotomize it.]

Now, in such a situation, the sharp point null hypothesis is almost certainly false: as you say, two treatments are *always* different, even if the difference is tiny.

So here the solution should be to report, not the p value for the sharp point null hypothesis, but the complete likelihood function — or if it can be reasonably approximated by a Gaussian, then the mean and standard deviation (or mean vector and covariance matrix).

2) The difference between the two treatments — especially if it is small — might be due, *not* to an actual difference between the two treatments, but to a systematic error in the experiment (e.g. a small failure of double-blinding, or a correlation between measurement errors and the treatment).

This is not a statistical issue, but rather an experimental and interpretive one: every experimenter must strive to reduce systematic errors to the smallest level possible AND to estimate honestly whatever systematic errors might remain; and an observed effect, even if it is statistically established beyond a reasonable doubt, can be considered “real” only if it is much larger than any plausible systematic error.

3) The likelihood function does not contain the whole story (from a Bayesian point of view), because the prior matters too. After all, even people who are not die-hard Bayesians can understand that “extraordinary claims require extraordinary evidence”. So one must try to understand — at least at the level of orders of magnitude — the prior likelihood of various alternative hypotheses. If only 1 out of 1000 drugs (or social interventions) have an effect anywhere near as large as the likelihood function seems to indicate, then probably the result is a false positive.

4) When practical decisions are involved (e.g. whether or not to approve a drug, whether or not to start or terminate a social program), the loss function matters too. There may be a huge difference in the losses from failing to approve a useful drug and approving a useless or harmful one — and I could imagine that in some cases those huge differences might go one way, and in other cases the other way. So the decision-makers have to analyze explicitly the loss function, and take it into account in the final decision. (But they should also always keep this analysis — which is basically economic — separate from the analysis of issues #1,2,3, which are basically “scientific”.)
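The arithmetic behind point 3 can be sketched in a few lines. This uses the discrete effect/no-effect framing of the letter; the 1-in-1000 prior is Sokal's, while the 80% power, the 0.05 threshold, and the function name are assumed values chosen only for illustration.

```python
def prob_real_given_significant(prior, power, alpha):
    """Probability that an effect is real given a significant result,
    in a simple two-state (real effect vs. no effect) model."""
    true_hits = prior * power          # real effects that reach significance
    false_hits = (1 - prior) * alpha   # null effects that reach significance
    return true_hits / (true_hits + false_hits)

# Prior of 1 in 1000 from the letter; power and alpha are assumptions.
p_real = prob_real_given_significant(prior=0.001, power=0.80, alpha=0.05)
print(f"P(real effect | significant) = {p_real:.3f}")
```

Under these assumptions fewer than 2 in 100 significant results correspond to real effects, which is the sense of “probably the result is a false positive.”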

My reply:

I agree with you on most of these points; see for example here.

Regarding your statement about the likelihood function: that’s fine but more generally I like to say that researchers should display all comparisons of interest and not select based on statistical significance. The likelihood function is a summary based on some particular model but in a lot of applied statistics there is no clear model, hence I give the more general recommendation to display all comparisons.

Regarding your point 2: yes on the relevance of systematic error, which is why we refer on page 1 of our paper to the “sharp point null hypothesis of zero effect and zero systematic error”! Along similar lines, see the last paragraph of this post.

Regarding your point 3, I prefer to avoid the term “false positive” in most statistical contexts because of the association of the typically nonsensical model of zero effect and zero systematic error; see here.

Regarding your point 4, yes, as we say in our paper, “For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds.”

Alan responded:

I think we are basically in agreement. My suggestion was simply to *distinguish* more clearly these 4 different issues (possibly by making them explicit and numbering them), because they are really *very* different in nature.

2 quick calls

Kevin Lewis asks what I think of these:

Study 1:
Using footage from body-worn cameras, we analyze the respectfulness of police officer language toward white and black community members during routine traffic stops. We develop computational linguistic methods that extract levels of respect automatically from transcripts, informed by a thin-slicing study of participant ratings of officer utterances. We find that officers speak with consistently less respect toward black versus white community members, even after controlling for the race of the officer, the severity of the infraction, the location of the stop, and the outcome of the stop. Such disparities in common, everyday interactions between police and the communities they serve have important implications for procedural justice and the building of police-community trust.

Study 2:
Exposure to parental separation or divorce during childhood has been associated with an increased risk for physical morbidity during adulthood. Here we tested the hypothesis that this association is primarily attributable to separated parents who do not communicate with each other. We also examined whether early exposure to separated parents in conflict is associated with greater viral-induced inflammatory response in adulthood and in turn with increased susceptibility to viral-induced upper respiratory disease. After assessment of their parents’ relationship during their childhood, 201 healthy volunteers, age 18-55 y, were quarantined, experimentally exposed to a virus that causes a common cold, and monitored for 5 d for the development of a respiratory illness. Monitoring included daily assessments of viral-specific infection, objective markers of illness, and local production of proinflammatory cytokines. Adults whose parents lived apart and never spoke during their childhood were more than three times as likely to develop a cold when exposed to the upper respiratory virus than adults from intact families. Conversely, individuals whose parents were separated but communicated with each other showed no increase in risk compared with those from intact families. These differences persisted in analyses adjusted for potentially confounding variables (demographics, current socioeconomic status, body mass index, season, baseline immunity to the challenge virus, affectivity, and childhood socioeconomic status). Mediation analyses were consistent with the hypothesis that greater susceptibility to respiratory infectious illness among the offspring of noncommunicating parents was attributable to a greater local proinflammatory response to infection.

My reply:

1. I’d run this by a computational linguist who doesn’t have a stake in this example. I’m skeptical in any case because this kind of “respect” thing is contextual. I mean, sure, I believe that building of trust is important; I just don’t know if much is gained by the “extract levels of respect automatically” thing.

2. I’ll believe this one after it appears in an independent preregistered replication, not before.

“Do statistical methods have an expiration date?” My talk at the University of Texas this Friday 2pm

Fri 6 Oct at the Seay Auditorium (room SEA 4.244):

Do statistical methods have an expiration date?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

There is a statistical crisis in science, particularly in psychology where many celebrated findings have failed to replicate, and where careful analysis has revealed that many celebrated research projects were dead on arrival in the sense of never having sufficiently accurate data to answer the questions they were attempting to resolve. The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be? It turns out that when effects are small and highly variable, the classical approach of black-box inference from randomized experiments or observational studies no longer works as advertised. We discuss the conceptual barriers that have allowed researchers to avoid confronting these issues, which arise not just in psychology but also in policy research, public health, and other fields. To do better, we recommend three steps: (a) designing studies based on a perspective of realism rather than gambling or hope, (b) higher quality data collection, and (c) data analysis that combines multiple sources of information.

Some of the material in the talk appears in our recent papers, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it” and “Some natural solutions to the p-value communication problem—and why they won’t work.”

The talk will be in the psychology department but should be of interest to statisticians and quantitative researchers more generally. I was invited to come by the psychology Ph.D. students—that’s so cool!

Response to some comments on “Abandon Statistical Significance”

The other day, Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote a paper, Abandon Statistical Significance, that began:

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration—often scant—given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

Since then we’ve received some feedback that we’d like to share and address.

1. Sander Greenland commented that maybe we shouldn’t label as “radical” our approach of removing statistical significance from its gatekeeper role, given that prominent statisticians and applied researchers have recommended this approach (abandoning statistical significance as a decision rule) for a long time.

Here are two quotes from David Cox et al. from a 1977 paper, “The role of significance tests”:

Here’s Cox from 1982 implicitly endorsing the idea of type S errors:

And here he is, explaining (a) the selection bias involved in any system in which statistical significance is a decision rule, and (b) the importance of measurement, a crucial issue in statistics that is obscured by statistical significance:

Hey! He even pointed out that the difference between “significant” and “non-significant” is not itself statistically significant:

In this paper, Cox also brings up the crucial point that the “null hypothesis” is not just the assumption of zero effect (which is typically uninteresting) but also the assumption of zero systematic error (which is typically ridiculous).

And he says what we say, that the p-value tells us very little on its own:

There are also more recent papers that say what McShane et al. and I say; for example, Valentin Amrhein, Fränzi Korner-Nievergelt, and Tobias Roth wrote:

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process. We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. . . . Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. . . . We further discuss potential arguments against removing significance thresholds, such as ‘we need more stringent decision rules’, ‘sample sizes will decrease’ or ‘we need to get rid of p-values’. We conclude that, whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Damn! I liked that paper when it came out, but now that I see it again, I realize how similar our points are to theirs.

Also this recent letter by Valentin Amrhein and Sander Greenland, “Remove, rather than redefine, statistical significance” which, again, has a very similar perspective to ours.

2. In the park today I ran into a friend who said that he’d read our recent article. He expressed the opinion that our plan might be good in some ideal sense but it can’t work in the real world because it requires more time-consuming and complex analyses than researchers are willing or able to do. If we get rid of p-values, what would we replace them with?

I replied: No, our plan is eminently realistic! First off, we don’t recommend getting rid of p-values; we recommend treating them as one piece of evidence. Yes, it can be useful to see that a given data pattern could or could not plausibly have arisen purely by chance. But, no, we don’t think that publication of a result, or further research in an area, should require a low p-value. Depending on the context, it can be completely reasonable to report and follow up on a result that is interesting and important, even if the data are weak enough that the pattern could’ve been obtained by chance: that just tells us we need better data. Report the p-value and the confidence interval and other summaries; don’t use them to decide what to report. And definitely don’t use them to partition results into “significant” and “non-significant” groups.

I also remarked that it’s not like the current system is so automatic. Statistical significance is, in most cases, a requirement for publication, but journals still have to decide what to do with the zillions of “p less than 0.05” papers that get sent to them every month. So we’re just saying that, as a start, journals can use whatever rules they’re currently using to decide which of these papers to publish.

Then I launched into another argument . . . but at this point my friend gave me a funny look and started to back away. I think he’d just mentioned my article and his reaction as a way to say hi, and he wasn’t really asking for a harangue in the middle of the park on a nice day.

But I’m pretty sure that most of you reading this blog are sitting in your parents’ basement eating Cheetos, with one finger on the TV remote and the other on the Twitter “like” button. So I can feel free to rant away.

3. There’s a paper, “Redefine statistical significance,” by Daniel Benjamin et al., who recognize that the p=0.05 threshold has lots of problems (I don’t think they mention air rage, himmicanes, ages ending in 9, fat arms and political attitudes, ovulation and clothing, ovulation and voting, power pose, embodied cognition, and the collected works of Satoshi Kanazawa and Brian Wansink, but they could have) and promote a revised p-value threshold of 0.005. As we wrote in our article (which was in part a response to Benjamin et al.):

We believe this proposal is insufficient to overcome current difficulties with replication . . . In the short term, a more stringent threshold could reduce the flow of low quality work that is currently polluting even top journals. In the medium term, it could motivate researchers to perform higher-quality work that is more likely to crack the 0.005 barrier. On the other hand, a steeper cutoff could lead to even more overconfidence in results that do get published as well as greater exaggeration of the effect sizes associated with such results. It could also lead to the discounting of important findings that happen not to reach it. In sum, we have no idea whether implementation of the proposed 0.005 threshold would improve or degrade the state of science as we can envision both positive and negative outcomes resulting from it. Ultimately, while this question may be interesting if difficult to answer, we view it as outside our purview because we believe that p-value thresholds (as well as those based on other statistical measures) are a bad idea in general.

4. And then yet another article, this one by Lakens et al., “Justify your alpha.” Their view is closer to ours in that they do not want to use any fixed p-value threshold, but they still seem to recommend that statistical significance be used for decision rules: “researchers justify their choice for an alpha level before collecting the data, instead of adopting a new uniform standard.” We agree with most of what Lakens et al. write, especially things like, “Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a theory” and their call to researchers to provide “justifications of key choices in research design and statistical practice.”

We just don’t see any good reason to make design, analysis, publication, and decision choices based on “alpha” or significance levels. As we write:

Various features of contemporary biomedical and social sciences—small and variable effects, noisy measurements, a publication process that screens for statistical significance, and research practices—make null hypothesis significance testing and in particular the sharp point null hypothesis of zero effect and zero systematic error particularly poorly suited for these domains. . . .

Proposals such as changing the default p-value threshold for statistical significance, employing confidence intervals with a focus on whether or not they contain zero, or employing Bayes factors along with conventional classifications for evaluating the strength of evidence suffer from the same or similar issues as the current use of p-values with the 0.05 threshold. In particular, each implicitly or explicitly categorizes evidence based on thresholds relative to the generally uninteresting and implausible null hypothesis of zero effect and zero systematic error.

5. E. J. Wagenmakers, one of the authors of the Benjamin et al. paper that motivated a lot of this recent discussion, wrote a post on his new blog (E. J. has a blog now! Cool. Will he start posting on chess?), along with Quentin Gronau, responding to our recent article.

E. J. and Quentin begin their post with five places where they agree with us. Then, in true blog fashion, they spend most of the post elaborating on three places where they disagree with us. Fair enough.

I’ll go through them one at a time:

E. J. and Quentin’s disagreement 1. E. J. says that our general advice “(studying and reporting the totality of their data and relevant results) is eminently sensible, but it is not sufficiently explicit to replace anything. Rightly or wrongly, the p-value offers a concrete and unambiguous guideline for making key claims; the Abandoners [that’s us!] wish to replace it with something that can be summarized as ‘transparency and common sense.’”

I disagree!

First, the p-value does not offer “a concrete and unambiguous guideline for making key claims.” Thousands of experiments are performed every month (maybe every day!) with “p less than 0.05” results, but only a very small fraction of these make their way into JPSP, Psych Science, PPNAS, etc. P-value thresholds supply an illusion of rigor, and maybe in some settings that’s a good idea, by analogy to “the consent of the governed” in politics, but there’s nothing concrete or unambiguous about their use.

Second, yes I too support “transparency and common sense,” but that’s not all we’re recommending. Not at all! Recall my recent paper, Transparency and honesty are not enough. All the transparency and common sense in the world—even with preregistered replication—won’t get you very far in the absence of accurate and relevant measurement. Hence the last paragraph of this post.

E. J. and Quentin’s disagreement 2. I’ll let my coauthor Christian Robert respond to this one. And he did!

E. J. and Quentin’s disagreement 3. They write, “One of the Abandoners’ favorite arguments is that the point-null hypothesis is usually neither true nor interesting. So why test it? This echoes the opinion of researchers like Meehl and Cohen. We believe, however, that Meehl and Cohen were overstating their case.”

E. J. and Quentin begin with an example of a hypothetical researcher comparing the efficacies of unblended or blended whisky as a treatment of snake bites. I agree that in this case the point null hypothesis is worth studying. This sort of example has come up in some recent comment threads so I’ll repeat what I said there:

I don’t think that point hypotheses are never true; I just don’t find them interesting or appropriate in the problems in social and environmental science that I work on and which we spend a lot of time discussing on this blog.

There are some problems where discrete models make sense. One commenter gave the example of a physical law; other examples are spell checking (where, at least most of the time, a person was intending to write some particular word) and genetics (to some reasonable approximation). In such problems I recommend fitting a Bayesian model for the different possibilities. I still don’t recommend hypothesis testing as a decision rule, in part because in the examples I’ve seen, the null hypothesis also bundles in a bunch of other assumptions about measurement error etc. which are not so sharply defined.

I’m happy to (roughly) discretely divide the world into discrete and continuous problems, and to use discrete methods when studying the effects of snakebites, and ESP, and spell checking, and certain problems in genetics, and various other problems of this sort; and to use continuous methods when studying the effects of educational interventions, and patterns of voting and opinion, and the effects of air pollution on health, and sex ratios and hurricanes and behavior on airplanes and posture and differences between gay and straight people and all sorts of other topics that come up all the time. And I’m also happy to use mixture models with some discrete components; for example, in some settings in drug development I expect it makes sense to allow for the possibility that a particular compound has approximately no effect (I’ve heard this line of research is popular at UC Irvine right now). I don’t want to take a hard line, nothing-is-ever-approximately-zero position. But I do think that comparisons to a null model of absolutely zero effect and zero systematic error are rarely relevant.

E. J. and Quentin also point out that if an effect is very small compared to measurement/estimation error, then it doesn’t matter, from the standpoint of null hypothesis significance testing, whether the effect is exactly zero. True. But we don’t particularly care about null hypothesis significance testing! For example, consider “embodied cognition.” Embodied cognition is a joke, and it’s been featured in lots of junk science, but I don’t think that masked messages have zero or even necessarily tiny effects. I think that any effects will vary a lot by person and by context. And, more to the point, if someone wants to do research in this topic, I don’t think that a null hypothesis significance test should be a screener for what results are considered worth looking at, and I think that it’s a mistake to use a noisy data summary to select a limited subset of results to report.


We’re in agreement with just about all the people in this discussion on the following key point: We’re unhappy with the current system in which “p less than 0.05” is used as the first step in a lexicographic decision rule in deciding which results in a study should be presented, which studies should be published, and which lines of research should be pursued.

Beyond this, here are the different takes:

Benjamin et al. recommend replacing 0.05 by 0.005, not because they think a significance-testing-based lexicographic decision rule is a good idea, but, as I understand them, because they think that 0.005 is a stringent enough cutoff that it will essentially break the current system. Assuming there is a move to reduce uncorrected researcher degrees of freedom and forking paths, it will become very difficult for researchers to reach the 0.005 threshold with noisy, useless studies. Thus, the new threshold, if applied well, will suddenly cause the stream of easy papers to dry up. Bad news for Ted, NPR, and Susan Fiske, but good news for science, as lots of journals will either have to get a lot thinner or will need to find some interesting papers outside the usual patterns. In the longer term, the stringent threshold (if tied to control of forking paths) could motivate researchers to do higher-quality studies with more serious measurement tied more carefully to theory.

Lakens et al. recommend using p-value thresholds but with different thresholds for different problems. This has the plus of moving away from automatic rules but has the minus of asking people to “justify their alpha.” I’d rather have scientists justifying their substantive conditions by delineating reasonable ranges of effect sizes (see, for example, section 2.1 of this paper) rather than having them justify a scientifically meaningless threshold, and I’d prefer that statisticians and methodologists evaluate frequency properties of type M and type S errors rather than p-values. But, again, we agree with Lakens et al., and with Benjamin et al., on the key point that what we need is better measurement and better science.
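To make the type M / type S idea concrete, here is a minimal sketch of what evaluating those frequency properties can look like. It assumes the estimate is normally distributed around a hypothesized true effect with a known standard error (a simplification; a t distribution could be swapped in), and the function name and the example numbers are made up for illustration, not taken from any of the papers discussed here.

```python
from statistics import NormalDist


def sign_and_magnitude_errors(true_effect, se, alpha=0.05):
    """Return (power, type S error rate, exaggeration ratio).

    Assumes estimate ~ Normal(true_effect, se) and a two-sided test at
    level alpha. true_effect is a hypothesized value, taken as positive.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha / 2) * se  # |estimate| needed for "significance"
    a = (cutoff - true_effect) / se           # standardized upper cutoff
    b = (-cutoff - true_effect) / se          # standardized lower cutoff
    p_upper = 1 - std.cdf(a)                  # significant with the right sign
    p_lower = std.cdf(b)                      # significant with the wrong sign
    power = p_upper + p_lower
    type_s = p_lower / power
    # E[|estimate|] restricted to significant results, via the standard
    # partial expectations of a normal distribution.
    upper_part = true_effect * p_upper + se * std.pdf(a)
    lower_part = -true_effect * p_lower + se * std.pdf(b)
    exaggeration = (upper_part + lower_part) / power / true_effect
    return power, type_s, exaggeration


# Hypothetical noisy study: true effect small relative to the standard error.
power, type_s, exaggeration = sign_and_magnitude_errors(true_effect=0.1, se=3.28)
print(f"power={power:.3f}  type S={type_s:.2f}  exaggeration={exaggeration:.0f}x")
```

The inputs that have to be justified here are a plausible range for the true effect and the standard error of the design, which is a substantive argument rather than a choice of alpha.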

Finally, our perspective, shared with Amrhein, Korner-Nievergelt, and Roth, as well as Amrhein and Greenland, is that it’s better to just remove null hypothesis significance testing from its gatekeeper role. That is, instead of trying to tinker with the current system (Lakens et al.) or to change the threshold so much that the system will break (Benjamin et al.), let’s just discretize less and display more.

We have some disagreements regarding the relevance of significance tests and null hypotheses but we’re all roughly on the same page as Cox, Meehl, and other predecessors.

“5 minutes? Really?”

Bob writes:

Daniel says this issue

is an easy 5-minute fix.

In my ongoing role as wet blanket, let’s be realistic. It’s
sort of like saying it’s an hour from here to Detroit because
that’s how long the plane’s in the air.

Nothing is a 5 minute fix (door to door) for Stan and I really
don’t want to give people the impression that it should be. It
then just makes them feel bad when it takes longer than 5 minutes,
because they feel like they’ve wasted the time this will really take.
Or it makes people angry who suggest other “5 minute fixes” that
we don’t get around to doing because they’re really more involved.

This can’t be five minutes (certainly not net to the project)
when you need to create a branch, fix the issue, run the tests,
run cpplint, commit, push, create a pull request,
nag someone else to review it (then they have to fill out the
code-review form), then you might have to make fixes (and perhaps
get another sign off from the reviewer), then Jenkins and Travis
may need to be kicked, (then someone has to decide to merge),
then we get to do it again in the upstream with changes to
the interfaces.

Easy once you’re used to the process, but not 5 minutes!

The mythical man-minute.

“From ‘What If?’ To ‘What Next?’ : Causal Inference and Machine Learning for Intelligent Decision Making”

Panos Toulis writes in to announce this conference:

NIPS 2017 Workshop on Causal Inference and Machine Learning (WhatIF2017)

“From ‘What If?’ To ‘What Next?’ : Causal Inference and Machine Learning for Intelligent Decision Making” — December 8th 2017, Long Beach, USA.

Submission deadline for abstracts and papers: October 31, 2017
Acceptance decisions: November 7, 2017

In recent years machine learning and causal inference have both seen important advances, especially through a dramatic expansion of their theoretical and practical domains. This workshop is aimed at facilitating more interactions between researchers in machine learning, causal inference, and application domains that use both for intelligent decision making. To this effect, the 2017 ‘What If?’ To ‘What Next?’ workshop welcomes contributions from a variety of perspectives from machine learning, statistics, economics and social sciences, among others. This includes, but it is not limited to, the following topics:
– Combining experimental control and observational data
– Bandit algorithms and reinforcement learning with explicit links to causal inference and counterfactual reasoning
– Interfaces of agent-based systems and causal inference
– Handling selection bias
– Large-scale algorithms
– Applications in online systems (e.g. search, recommendation, ad placement)
– Applications in complex systems (e.g. cell biology, smart cities, computational social sciences)
– Interactive experimental control vs. counterfactual estimation from logged experiments
– Discriminative learning vs. generative modeling in counterfactual settings
We invite contributions both in the form of extended abstracts and full papers. At the discretion of the organizers, some contributions will be assigned slots as short contributed talks and others will be presented as posters.

Submission length: 2 page extended abstracts or up to 8 page full paper. At least one author of each accepted paper must be available to present the paper at the workshop.

I’m pretty sure that, in these settings, there’s not much reason to be interested in the model of zero causal effects and zero systematic error, so I hope people at this conference don’t waste any time on null hypothesis significance testing except when they are talking about how to do better.

The “fish MRI” of international relations studies.

Kevin Lewis pointed me to this paper by Stephen Chaudoin, Jude Hays and Raymond Hicks, “Do We Really Know the WTO Cures Cancer?”, which begins:

This article uses a replication experiment of ninety-four specifications from sixteen different studies to show the severity of the problem of selection on unobservables. Using a variety of approaches, it shows that membership in the General Agreement on Tariffs and Trade/World Trade Organization has a significant effect on a surprisingly high number of dependent variables (34 per cent) that have little or no theoretical relationship to the WTO. To make the exercise even more conservative, the study demonstrates that membership in a low-impact environmental treaty, the Convention on Trade in Endangered Species, yields similarly high false positive rates. The authors advocate theoretically informed sensitivity analysis, showing how prior theoretical knowledge conditions the crucial choice of covariates for sensitivity tests. While the current study focuses on international institutions, the arguments also apply to other subfields and applications.

My reply: I’m not a fan of the “false positive” framework, but the general attitude expressed in the paper makes sense to me and I’m guessing this paper will be a very useful contribution to the literature in its field. It’s the “fish MRI” of international relations studies.

Apply for the Earth Institute Postdoc at Columbia and work with us!

The Earth Institute at Columbia brings in several postdocs each year—it’s a two-year gig—and some of them have been statisticians (recently, Kenny Shirley, Leontine Alkema, Shira Mitchell, and Milad Kharratzadeh). We’re particularly interested in statisticians who have research interests in development and public health. It’s fine—not just fine, but ideal—if you are interested in statistical methods also. The EI postdoc can be a place to do interesting work and begin a research career. Details here. If you’re a statistician who’s interested in this fellowship, feel free to contact me—you have to apply to the Earth Institute directly (see link above), but I’m happy to give you advice about whether your goals fit into our program. It’s important to me, and to others in the EI, to have statisticians involved in our research. Deadline for applications is 31 Oct, so it’s time to prepare your application NOW!

For mortality rate junkies

Paul Ginsparg and I were discussing that mortality rate adjustment example. I pointed him to this old tutorial that laid out the age adjustment step by step, and he sent along this:

For mortality rate junkies, here’s another example [by Steven Martin and Laudan Aron] of bundled stats leading to misinterpretation, in this case not correcting for the black cohort having a slightly younger average age plus a higher percentage of women.

As Martin and Aron point out:

Why were these differences by race not apparent in the CDC figure? Because for adults ages 65 and older, blacks and whites looked very different in 2015. The average age of blacks was 73.7 compared with 74.6 for whites. Also, a higher share of blacks ages 65 and older were women: 61 percent compared with 57 percent for whites. Because younger people have lower death rates than older people and women have lower death rates than men, comparing a younger and more female population of blacks with an older and more male population of whites offsets the underlying race differences in death rates.
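The offsetting that Martin and Aron describe is easy to see in a toy calculation. The sketch below uses entirely hypothetical death rates and age shares (they are not Martin and Aron's numbers, and sex is left out to keep it short): group A has the higher death rate in every age band but is younger, so the crude rates nearly coincide until both groups are weighted by the same reference age distribution.

```python
def weighted_rate(rates, shares):
    """Average stratum-specific rates using the given population shares."""
    total = sum(shares.values())
    return sum(rates[band] * shares[band] for band in rates) / total


# Hypothetical annual death rates by age band (illustration only).
rates_a = {"65-74": 0.022, "75-84": 0.055, "85+": 0.140}
rates_b = {"65-74": 0.018, "75-84": 0.048, "85+": 0.130}

# Hypothetical age composition of each group: A is younger than B.
shares_a = {"65-74": 0.62, "75-84": 0.28, "85+": 0.10}
shares_b = {"65-74": 0.55, "75-84": 0.31, "85+": 0.14}

# Crude rates: each group weighted by its own age distribution.
crude_a = weighted_rate(rates_a, shares_a)
crude_b = weighted_rate(rates_b, shares_b)

# Direct standardization: both groups weighted by one reference
# distribution (here, arbitrarily, group B's).
adjusted_a = weighted_rate(rates_a, shares_b)
adjusted_b = weighted_rate(rates_b, shares_b)

print(f"crude:    A={crude_a:.4f}  B={crude_b:.4f}")
print(f"adjusted: A={adjusted_a:.4f}  B={adjusted_b:.4f}")
```

The choice of reference distribution is itself a judgment call; the point is only that the comparison should use the same one for both groups.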

Contribute to this pubpeer discussion!

Alex Gamma writes:

I’d love to get feedback from you and / or the commenters on a behavioral economics / social neuroscience study from my university (Zürich). This would fit perfectly with yesterday’s “how to evaluate a paper” post. In fact, let’s have a little journal club, one with a twist!

The twist is that I’ve already posted a critique of the study on PubPeer and just now got a response by the authors. I’m preparing a response to the response, and here’s where I could use y’all’s input.

I don’t have the energy to read the paper or the discussions but I wanted to post it here as an example of this sort of post-publication review.

Gamma continues:

Here’s the abstract to the paper:

Goal-directed human behaviors are driven by motives. Motives are, however, purely mental constructs that are not directly observable. Here, we show that the brain’s functional network architecture captures information that predicts different motives behind the same altruistic act with high accuracy. In contrast, mere activity in these regions contains no information about motives. Empathy-based altruism is primarily characterized by a positive connectivity from the anterior cingulate cortex (ACC) to the anterior insula (AI), whereas reciprocity-based altruism additionally invokes strong positive connectivity from the AI to the ACC and even stronger positive connectivity from the AI to the ventral striatum. Moreover, predominantly selfish individuals show distinct functional architectures compared to altruists, and they only increase altruistic behavior in response to empathy inductions, but not reciprocity inductions.

The exchange on PubPeer so far is here, the paper is here (from first author’s ResearchGate page), the supplementary material here. (Email me at if you have trouble with any of these links.)

The basic set-up of the study is: 

• fMRI
• N=34 female subjects
• 3 conditions
• baseline (N=34)
• induced motive “empathy” (N=17; between-subject)
• induced motive “reciprocity” (N=17;  between-subject)
• ML prediction / classification of the two motives using SVM
• accuracy ~ 70%, stat. sign.

In my comment, the main criticism was that their presentation of the results was misleading by suggesting that the motives in question had been “read off” the brain directly without already knowing them by other means. I’ve since realized that it is mainly the title that suggests so and thereby creates a context within which one interprets the rest of the paper. Without the title, the paper would be more or less OK in this regard. In any case, to say in the title that brain data “reveals” human motives suggests (clearly, to me) that these motives were not previously known. That they were “hidden” and then uncovered by examining the brain. But obviously, the prediction algorithm had to be trained on prior knowledge of the motives, so that’s not at all what happens. This is one thing I intend to argue in my response. 

But there’s more.

In the comment, I’ve also raised issues about the prediction/machine learning aspects and I want to bring up more in my response to their response. These issues concern the purpose of prediction, the relationship between prediction and causal inference, generalizability, overfitting and the scope for forking paths. So lots of interesting stuff! And since I’m not an expert (not a statistician, but with not-too-technical exposure to ML), I’d love to get input from the knowledgeable crowd here on the blog.

Before I separate the issues into chunks, I’ll outline what I gathered they did with their data. As far as neuroimaging studies go, they used quite sophisticated modeling. Below, the dotted lines (—) are loosely used to indicate “is input to” or “leads to” or “produces as output”, or simply “is followed by”.

  1. fMRI (“brain activity”) — GLM (empathy vs reciprocity vs baseline) —  diff between two motives n.s., but diff betw. motives and baseline stat. sign. in a “network” of 3 brain areas — use of DCM (“Dynamic causal models”) to get “functional connectivity” in this network, separately for the 3 conditions
  2. DCM: uses time-series of fMRI activations to infer connectivity in the network of 3 brain areas — start w/ 28 plausible initial models, each a different combination of 7 network components (see Fig. 2A, p.1075, and Fig S2., p.12 of the supplement) —  use Bayesian model averaging to estimate parameters of the 7 components (components = strengths/direction of connections and external inputs) — end up with 14 “DCM-parameters” per subject, 7 per motive condition, 7 per baseline

  3. Prediction: compute diff between DCM-parameters of each motive vs baseline (1: emp – base; 2: rec – base) = dDCM parameters — input these into SVM to classify empathy vs reciprocity — LOOCV — classification weights for 7 dDCM params (Fig. 2B, p.1075)

  4. “Mechanistic models”: start again with 28 initial models from 2. — random-effect Bayesian model selection — average best models for each condition (emp – rec – base; Fig. 3, p.1076)

The paper is a mix of talk about the prediction aspect and the mechanistic insight into the neural basis of the two motives that supposedly can be gleaned from the data. There seems to be some confusion on the part of the authors as to how these two aspects are related. Which leads to the first issue.

I. Purpose of prediction

In my comment, I questioned the usefulness of their prediction exercise (I called it a “predictive circus act”). I thought the causal modeling part (DCM) is OK because it could contribute to an understanding of what, and eventually how, brain processes generate mental states. However, I didn’t think the predictive part added anything to that. (And I couldn’t help noticing that the predictive part would allow them to advertise their findings as “the brain revealing motives” instead of just “here’s what’s going on in the brain while we experience some motives”.)

What’s your take? Does the prediction per se have a role to play in such a context? 

II. Relationship between prediction and causal modeling/mechanistic insights

The authors claim that the predictive part supports or even furnishes the mechanistic (causal?) insight the data supposedly deliver, although that is not stated as the official purpose of the predictive part. They write: 

“We obtain these mechanistic insights because the inputs into the support vector machine are not merely brain activations but small brain models of how relevant brain regions interact with each other (i.e., functional neural architectures)…. And it is these models that deliver the mechanistic insights into brain function…”

The last sentence of the paper then reads: 

“Our study, therefore, also demonstrates how “mere prediction” and “insights into the mechanisms” that underlie psychological concepts (such as motives) can be simultaneously achieved if functional neural architectures are the inputs for the prediction.”

But if my outline of their analytic chain is correct, these statements are confused. As a matter of fact, they do *not* derive their mechanistic models (i.e. the specific connectivity parameters of the network of 3 brain areas, see Fig. 3 p.1076) from the predictive model. The mechanistic models are the result of a different analytic path than the predictive model. This can already be seen from the fact that the predictive model is based on *differences* between motive and baseline conditions, while the mechanistic models they discuss at length in the paper exist for each of these conditions separately. 

If all this is right, the authors misunderstand their own analysis. 
(They also have *this* sentence, which I consider a tautology: “Thus, by correctly predicting the induced motives, we simultaneously determine those mechanistic models of brain interaction that best predict the motives.”)

I would be happy, however, if someone found this interesting enough to check whether my understanding of the modeling procedure is correct. 

III. Generalizability

The authors make much of their use of LOOCV: 

“We predicted each subject’s induced motive with a classifier whose parameters were not influenced by that subject’s brain data… Instead, the parameters of the classifier were solely informed by other subjects’ brain data. This means that the motive-specific brain connectivity patterns are generalizable across subjects. The distinct and across-subject–generalizable neural representation of the different motives thus provides evidence for a distinct neurophysiological existence of motives.”

They do not address at all, however, the issue of generalizability to new samples (all the more important for a single-sex sample).  I thought the emphasis is completely wrong here. My understanding was and is that achieving a decent in-sample classification accuracy is only the smallest part of finding a robust classifier. The real test is the performance in new samples from new populations. Also, I felt that something was wrong with their particular emphasis on how cool it is that LOOCV leads to a classifier that generalizes within the sample. 

I wrote that “the authors’ appeal to generalizability is misleading. They emphasize that their predictive analysis is conducted using a particular technique (called leave-one-out cross-validation or LOOCV) to make sure the resulting classifier is “generalizable across subjects”. But that is rather trivial. LOOCV and its congeners are a standard feature of predictive models, and achieving a decent performance within a sample is nothing special.”

In their response, they challenged this: 

“Well, if it is so easy to achieve a decent predictive performance, why do the behavioral changes in altruistic behavior induced by the empathy and the reciprocity motive enable only a very poor predictability of the underlying motives? On the basis of the motive-induced behavioral changes the classification accuracy of the support vector machine is only 41%, i.e., worse than chance. And if achieving decent predictive performance is so easy, why is it then impossible to predict better than chance based on brain activity levels for those network nodes for which brain connectivity is highly predictive (motive classification accuracy based on the level of brain activity = 55.2%, P = 0.3). The very fact that we show that brain activity levels are not predictive of the underlying motives means that we show – in our context – the limits of traditional classification analyses which predominantly feed the statistical machine with brain activity data.”

What they say certainly shows that you can’t get a good classifier out of just any features you have in the data. So in that sense my statement would be false. But what I had in mind was more along the lines that to find *some* good predictors among many features is nothing special. But is this true? And is it true for their particular study? This will come up again under forking paths later.
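A rough way to probe the “*some* good predictors among many features” question is to simulate it. The sketch below is an illustration under loud assumptions, not a reconstruction of the study: the data are pure noise (34 simulated subjects, 17 per label, 7 features), a simple nearest-centroid classifier stands in for the SVM, and every function name is made up. It compares the LOOCV accuracy of the full feature set with the best LOOCV accuracy found by searching over all feature subsets, which is one stylized version of a forking path. It says nothing about whether any such search took place in the paper.

```python
import random
from itertools import combinations


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))

    return min(sorted(centroids), key=lambda label: dist2(centroids[label]))


def loocv_accuracy(xs, ys, features):
    """Leave-one-out accuracy using only the listed feature indices."""
    hits = 0
    for i in range(len(xs)):
        train_x = [[row[f] for f in features]
                   for j, row in enumerate(xs) if j != i]
        train_y = [y for j, y in enumerate(ys) if j != i]
        held_out = [xs[i][f] for f in features]
        hits += nearest_centroid_predict(train_x, train_y, held_out) == ys[i]
    return hits / len(xs)


def best_subset_accuracy(xs, ys, n_features):
    """Best LOOCV accuracy over every non-empty feature subset."""
    subsets = [c for k in range(1, n_features + 1)
               for c in combinations(range(n_features), k)]
    return max(loocv_accuracy(xs, ys, s) for s in subsets)


rng = random.Random(0)
ys = [0] * 17 + [1] * 17                                   # two arbitrary labels
xs = [[rng.gauss(0, 1) for _ in range(7)] for _ in ys]     # features carry no signal

full = loocv_accuracy(xs, ys, tuple(range(7)))
best = best_subset_accuracy(xs, ys, 7)
print(f"all 7 features: {full:.2f}   best of 127 subsets: {best:.2f}")
```

Changing the seed and rerunning gives a sense of how far the best-of-many accuracy can drift above the single prespecified one when there is nothing to find. An accuracy picked after looking at the LOOCV scores is no longer an honest out-of-sample estimate, which is why the number of choices explored matters as much as the cross-validation itself.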

To get back to the bigger issue, was I right to assume that getting a decent classifier in a small sample is not even half the rent if you want to say something general about human beings?

(To be fair to the authors, they state in their response that it would be desirable, even “very exciting”, to be able to predict subjects’ motives out-of-sample.)

IV. Overfitting and forking paths

Finally, what is the scope for overfitting, noise mining and forking paths in this study? I would love to get some expert opinion on that. They had 17 subjects per motive condition. They first searched for stat. sign. differences in brain activity between the 3 conditions. What shows up is a network of 3 brain regions. They attached to it 7 connectivity parameters and tested 28 combinations of them (“models”). Bayesian model averaging yielded averages for the 7 parameters, per condition. Subtract baseline from motive parameters, feed the differences  into an SVM. 

Can you believe anything coming from such an analysis?

I hope and believe that this could also bring the familiar insights and excitement of a journal club that so many of you have professed your love for. And last but not least, maybe the scientific audience of PubPeer could learn something, too.

I have no idea, but, again, I wanted to share this as an example of a post-publication review.