Nature magazine just published a short feature on statistics and the replication crisis, featuring the following five op-ed-sized bits:

Jeff Leek: Adjust for human cognition

Blake McShane, Andrew Gelman, David Gal, Christian Robert, and Jennifer Tackett: Abandon statistical significance

David Colquhoun: State false-positive risk, too

Michele Nuijten: Share analysis plans and results

Steven Goodman: Change norms from within

Our segment was listed as by Blake and me, but that’s because Nature would not allow us to include more than two authors. Our full article is here; see also this response to comments, which also includes links to relevant papers by Amrhein, Korner-Nievergelt, and Roth, and Amrhein and Greenland.

Regarding the five short articles above: You can read them yourself, but my quick take is that these discussions all seem reasonable, except for the one by Colquhoun, which to my taste is unhelpful in that it sticks with the false-positive, false-negative framework which Blake and I find problematic for reasons discussed in our paper and elsewhere.

Also, I agree with just about everything in Leek’s article except for this statement: “It’s also impractical to say that statistical metrics such as P values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful.” Yes, decisions need to be made, but to suggest that p-values be used to make editorial or funding decisions: that’s just horrible. That’s what’s landed us in the current mess. As my colleagues and I have discussed, we strongly feel that editorial and funding decisions should be based on theory, statistical evidence, and cost-benefit analyses, not on a noisy measure such as a p-value. Remember that if you’re in a setting where the true effect is two standard errors away from zero, the p-value could easily be anywhere between 0.00006 and 1. That is, in such a setting, the 95% predictive interval for the z-score is (0, 4), which corresponds to a 95% predictive interval for the p-value of (1.0, 0.00006). That’s how noisy the p-value is. So, no, don’t use it to make editorial and funding decisions. Please.
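For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the observed z-score is normally distributed around the true value of 2 with standard deviation 1, and that the p-value is the usual two-sided one; the function and variable names are just for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1


def two_sided_p(z):
    """Two-sided p-value for an observed z-score under a null of zero effect."""
    return 2 * (1 - std_normal.cdf(abs(z)))


true_z = 2.0  # true effect, measured in standard errors (the setting in the text)
half_width = std_normal.inv_cdf(0.975)  # roughly 1.96

# 95% predictive interval for the observed z-score
z_low = true_z - half_width
z_high = true_z + half_width

print(f"95% predictive interval for z: ({z_low:.2f}, {z_high:.2f})")
print(f"p-value at z = 0: {two_sided_p(0.0):.5f}")
print(f"p-value at z = 4: {two_sided_p(4.0):.5f}")
```

The exact interval endpoints are a hair inside 0 and 4; the text rounds them, which is why the p-value range comes out as roughly (1.0, 0.00006).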

And I disagree with this, from the sub-headline: “The problem is not our maths, but ourselves.” Some of the problem is in “our maths” in the sense that people are using procedures with poor statistical properties.

Overall, though, I like the reasonable message being sent by the various authors here. As Goodman writes, “No single approach will address problems in all fields.”

Number 6: Change the system so academics are rewarded for running a high quality study, over which we have almost full control, instead of for getting a sexy outcome, over which we have almost no control. Under the current system, academic success is either a crapshoot or a scam.

+1

This does happen in some fields, albeit imperfectly – the one I’m familiar with is academic clinical research, where large-scale randomised trials are just about the only way to get REF 4-star papers. For anyone not in the UK, the REF is our own particular academic nightmare – but 4 star means “top quality” so there is an incentive for academics to get these papers. And in general RCTs have quite a lot of emphasis on things like data quality, randomising properly, avoiding bias and so on. Having said that, there are still a whole load of problems, many of them statistical (see Frank Harrell’s blog), including a lot of artificial dichotomising of results and interpretations based on p-values. But at least the link between sexy results and high-impact publication has been weakened.

Steven Goodman also shares that view; see my comment: http://andrewgelman.com/2017/11/30/oooh-hate-talk-false-positive-false-negative-false-discovery-etc/#comment-617386

“Change the system so academics are rewarded for running a high quality study, over which we have almost full control, instead of for getting a sexy outcome, over which we have almost no control. Under the current system, academic success is either a crapshoot or a scam.”

I am most familiar with (recent events in) Psychological Science. Drawing from different sources and ideas, I have tried to come up with a format that I reasoned could possibly be useful for Psychological Science. If nothing else, it could possibly be an interesting experiment. Regardless, it is the best I could do trying to help improve Psychological Science. Here’s the idea:

1) Small groups of, let’s say, 5 researchers all working on the same theory/topic/construct each perform a pilot/exploratory study and at some point make it clear to themselves and the other members of the group that they want their work rigorously tested.

2) These 5 studies will then all be pre-registered and prospectively replicated in a round robin fashion.

3) You would hereby end up with 5 (what perhaps often can be seen as “conceptual” replications depending on how far you want to go to consider something a “conceptual” replication) studies, that will all have been “directly” replicated 4 times (+ 1 version via the original researcher, which makes a total of 5).

4) All results will be published no matter the outcome in a single paper: for instance “Ego-depletion: Round 1”

5) All members of the team of 5 researchers would then come up with their own follow-up study, possibly (partly) related to the results of the “first round”. The process repeats itself as long as deemed fruitful.
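To make the round robin in steps 2 and 3 concrete, here is a rough sketch of the bookkeeping. The researcher labels and function names are made up for illustration, and it assumes every member replicates every other member's study exactly once.

```python
def round_robin_schedule(researchers):
    """Map each study (named after its originator) to the labs that run it."""
    schedule = {}
    for originator in researchers:
        # Everyone except the originator runs a direct replication.
        replicators = [r for r in researchers if r != originator]
        schedule[originator] = {
            "original": originator,
            "replications": replicators,
        }
    return schedule


def runs_per_researcher(schedule):
    """Count how many data collections each researcher carries out."""
    counts = {}
    for study in schedule.values():
        for lab in [study["original"], *study["replications"]]:
            counts[lab] = counts.get(lab, 0) + 1
    return counts


team = ["A", "B", "C", "D", "E"]  # hypothetical group of 5 researchers
schedule = round_robin_schedule(team)

for originator, study in schedule.items():
    total_versions = 1 + len(study["replications"])
    print(f"Study by {originator}: run {total_versions} times "
          f"(replicated by {', '.join(study['replications'])})")

print(runs_per_researcher(schedule))
```

With 5 members, each study ends up with 5 versions (1 original plus 4 replications), and each researcher runs 5 data collections per round.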

Additional thoughts related to this format which might be interesting regarding recent discussions and events in psychological science:

1) Possibly think how this format could influence the discussions about “creativity”, “science being messy” and the acceptance of “null-results”.

Researchers using this format could each come up with their own ideas for each “round” (creativity), there would be a clear demarcation between pilot-studies/exploratory studies and testing it in a confirmatory fashion, and this could also contribute to publishing and “doing something” with possible null-results concerning inferences and conclusions.

2) Possibly think about how this format could influence the discussion about how there may be too much information (i.c. Simonsohn’s “let’s publish fewer papers”).

I am not experienced in performing research over a year (I only performed 1 study for my thesis), but let’s say it’s reasonable that researchers can try and run 5 studies a year (2 years?) given time and resources (50-100 pp per study per individual researcher). That would mean that a group of researchers using this format could publish a single paper every 1 or 2 years (“let’s publish fewer papers”), but this paper would be highly informational given that it would be relatively highly-powered (5 x 50-100 pp = 250-500 pp per study), and would contain both “conceptual” and “direct” replications.

3) Possibly think about how this format could influence the discussion about “expertise” and “reverse p-hacking/deliberately wanting to find a null-result” concerning replications.

Perhaps every member of these small groups would be inclined to 1) “put forward” their “best” experiment they want to rigorously test using this format, and 2) execute the replication part of the format (i.c. the replications of the other members’ study) with great attention and effort because they would be incentivized to do so. This is because “optimally” gathered information coming from this format (e.g. both significant and non-significant findings) would be directly helpful to them for coming up with study-proposals for the next round.

4) The overall process of this format would entail a clear distinction between post-hoc theorizing and theory testing (cf. Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012), “rounds” of theory building, testing, and reformulation (cf. Wallander, 1992), and could be viewed as a systematic manner of data collection (cf. Chow, 2002).

5) Finally, it might also be interesting to note that this format could lead to interesting meta-scientific information as well. For instance, perhaps the findings of a later “round” turn out to be more replicable due to enhanced accurate knowledge about a specific theory or phenomenon. Or perhaps it will show that the devastating typical process of research into psychological phenomena and theories described by Meehl (1978) will be cut-off sooner, or will follow a different path.

“Change the system so academics are rewarded for running a high quality study, over which we have almost full control, instead of for getting a sexy outcome, over which we have almost no control. Under the current system, academic success is either a crapshoot or a scam.”

For me, incompetence is understandable, unwillingness however is not.

I did the best I could, in every way I could think of.

Thank you prof. Gelman, and the others on this forum, for giving folks like me (i.c. not in the academic “in-crowd”) the opportunity to try and help improve matters.

Concerning at least 1 thing I thought was important, you’ve all helped me get my point across to the people that have the tools to improve matters.

For that, I am very thankful.

Isn’t the noisiness in p-values just a consequence of the noisiness in the data (in the context of the model, etc)? Of course this has to be taken into account, but you could just as well have written:

>I strongly feel that editorial and funding decisions should be based on theory, statistical evidence, and cost-benefit analyses—not on a noisy measure such as the observed data. Remember that if you’re in a setting where the true effect is two standard errors away from zero, that the observed data could easily be anywhere between 0 and 4 times the standard error. That is, in such a setting, the 95% predictive interval for the z-score is (0, 4). That’s how noisy the observed data is. So, no, don’t use it to make editorial and funding decisions. Please.

To be more constructive, I’ll add that in my opinion it would have been better to say something like

“I strongly feel that editorial and funding decisions should be based on theory, statistical evidence, and cost-benefit analyses—not just on a measure of statistical evidence such as a p-value.”

and leave it at that.

Carlos:

Of course it makes sense to use observed data to make editorial and funding decisions. There’s a lot more to observed data than the z-score and the p-value. My reason for making this particular point about the p-value is that people just don’t seem to understand how variable the p-value is. The idea that a p-value of 0.00006 and a p-value of 1.0 could come from two different replications of the exact same experiment . . . people just don’t get it. Just consider that sometimes there’s a study and a replication, where the original study has a p-value of, say, 0.03, and the replication has a p-value of, say, 0.20, and people tie themselves into knots trying to explain why the results are so different, not recognizing that you’d expect this sort of difference just from sampling variation alone. This is, I believe, a technical point worth emphasizing in this context.
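Here is a rough sketch of that 0.03 versus 0.20 comparison. It assumes both p-values are two-sided, both estimates point in the same direction, and both studies have the same standard error, so that the difference between the two z-scores has standard deviation sqrt(2) if the true effect is the same in both. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1


def z_from_two_sided_p(p):
    """Absolute z-score that corresponds to a two-sided p-value."""
    return std_normal.inv_cdf(1 - p / 2)


def p_for_difference(p_original, p_replication):
    """Two-sided p-value for the difference between two study results.

    Assumes same-sign estimates, equal standard errors, and independent
    studies, so the difference in z-scores has standard deviation sqrt(2).
    """
    z_original = z_from_two_sided_p(p_original)
    z_replication = z_from_two_sided_p(p_replication)
    z_diff = (z_original - z_replication) / sqrt(2)
    return 2 * (1 - std_normal.cdf(abs(z_diff)))


p_diff = p_for_difference(0.03, 0.20)
print(f"p-value for the difference between the two studies: {p_diff:.2f}")
```

Under these assumptions the two z-scores are about 2.2 and 1.3, and the difference between them is well within what sampling variation alone would produce.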

I think in this case the issue is not the p-value, but the measurement error (another of your favourite subjects). The noise is a property of the experiment; even if your (prior-free) statistical analysis is using p-values there is no way to work around the fact that if the true effect is twice the standard error the observed effect may be anywhere between zero and twice the true effect. Introducing priors is another discussion, unrelated to the multiple flaws of p-values.

Blaming the use of p-values, and not the design of the experiment or lack of enough data, seems a bit misleading to me. If people don’t understand that p-values are subject to (in fact are based on!) sampling variation, it’s not just that they shouldn’t be using p-values. They shouldn’t be doing any statistical analysis at all until they improve their understanding.

> even if your (prior-free) statistical analysis is using p-values

Of course I meant “is NOT using p-values”.

It’s entirely possible to do decision analysis involving discrete decisions that does not use p-values in any way at all. Lin, Gelman, Price, and Krantz illustrated this using decisions related to indoor radon as an example. You have different amounts of information about different areas of the country; households have different makeup and perhaps different risk tolerance; and the risk of radon exposure per unit concentration is extremely uncertain; but you still have to make discrete decisions (make a measurement or not; remediate the house or not). I know there has been a lot of work in this area, especially since it has been 18 years since we published that paper!, but I still think we did a good job.
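As a toy illustration of the general idea (this is not the model from the radon paper, and every number below is hypothetical), a discrete decision can be made by averaging costs over uncertainty in the quantity of interest and picking the cheaper action, with no p-value anywhere:

```python
def expected_costs(concentration_draws, remediation_cost, harm_per_unit):
    """Expected cost of each action, averaged over uncertain concentration.

    concentration_draws: hypothetical draws representing uncertainty about
        a house's radon concentration (e.g. from a posterior distribution).
    remediation_cost: hypothetical cost of remediating, assumed here to
        remove the exposure entirely.
    harm_per_unit: hypothetical cost of health risk per unit concentration.
    """
    n = len(concentration_draws)
    cost_do_nothing = sum(harm_per_unit * c for c in concentration_draws) / n
    return {"do nothing": cost_do_nothing, "remediate": remediation_cost}


def choose_action(concentration_draws, remediation_cost, harm_per_unit):
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(concentration_draws, remediation_cost, harm_per_unit)
    return min(costs, key=costs.get)


# Two hypothetical houses with different information about concentration.
high_risk_house = [1.0, 2.0, 8.0, 3.0]
low_risk_house = [0.5, 1.0, 1.5, 1.0]

print(choose_action(high_risk_house, remediation_cost=2000, harm_per_unit=1000))
print(choose_action(low_risk_house, remediation_cost=2000, harm_per_unit=1000))
```

The same structure extends to the "measure or not" decision by adding an action whose cost includes the measurement and the expected cost of the follow-up decision.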

I’d like to say a word in appreciation of David Colquhoun’s piece (I initially wrote “support”, but “appreciation” might be more accurate).

I’m an applied researcher, trained in the hypothesis testing framework. I didn’t choose that framework; it chose me (it was the only framework presented to me in college and grad school). To the extent that I knew anything about Bayesian methods at all, my impression was not positive. The definition of probability as the “intensity of personal belief” and examples involving sports betting made the Bayesian framework seem extremely unscientific. Appropriate perhaps for trivial pursuits like betting on sports, but not appropriate for science. In short, Bayesian methods seemed the domain of soft-headed hippies, glib academics, and gamblers (I’m being just a tad undiplomatic here, but hopefully you’ll forgive me, given where I end up).

Having started at that point, I found Colquhoun’s writing to be quite liberating and a very helpful, non-threatening, first step away from the hypothesis testing framework. Having read Colquhoun, I was better able to hear and understand the messages being communicated by the hypothetico-deductive Bayesians, including on this blog.

For me, there were three main advantages to starting with the false positive rate:

1. it highlights the need for prior information in order to make probabilistic statements regarding true effects

2. it demonstrates that probability need not be interpreted as the “intensity of personal belief” in order to use prior information and Bayes’ Rule

3. the more I fiddled with calculating the false positive rate, the more I realized that the dichotomization of effects doesn’t really make sense in many (perhaps most) contexts
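For readers who haven’t done this calculation, here is a minimal sketch of the simplest version, the one that conditions on “significant at level alpha”. It assumes effects are either real or exactly null, which is the dichotomy questioned in point 3, and the prior and power values are illustrative inputs, not numbers from any of the articles.

```python
def false_positive_risk(prior, power, alpha=0.05):
    """Probability that a 'significant' result comes from a true null.

    prior: assumed probability that a tested effect is real.
    power: assumed probability of a significant result when the effect is real.
    alpha: significance threshold, i.e. the chance of a significant result
        when the effect is null.
    """
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


# Illustrative inputs: the answer depends heavily on the prior (point 1).
for prior in (0.5, 0.1):
    risk = false_positive_risk(prior=prior, power=0.8)
    print(f"prior = {prior}: false positive risk = {risk:.2f}")
```

With power 0.8 and alpha 0.05, a prior of 0.5 gives a risk of about 0.06, while a prior of 0.1 gives 0.36, so the calculation cannot be done without committing to prior information.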

Having used the false positive rate as a bridge to the Bayesian framework, I’m now inclined to discard the hypothesis testing framework entirely (though for practical/institutional reasons I can’t entirely do that just yet). And so in that sense, I agree with professor Gelman — the false positive rate is not the best solution, and I suspect that within a few years I won’t be using it at all.

But I also very much appreciate Colquhoun’s writing and thinking on this issue — it has been very helpful to me. So while I have crossed the bridge, I’m disinclined to burn it down, especially since many others still need to cross.

In conclusion, I’ll just say “thank you!” to both Colquhoun and Gelman — I’ve learned much from both of your writings.

I agree with John, except I first encountered this type of argument in Ioannidis’ famous paper. In my view, Ioannidis and Colquhoun are making the same basic point that Andrew is: small p-values, in and of themselves, are much weaker evidence than you might expect.

Andrew – in your article you argue that domain experts should consider the prior plausibility of a study’s conclusion in addition to its p-value when evaluating a new research contribution. I strongly agree and have come over to your position entirely over the past couple years of reading your blog.

Now, practically speaking, if the existing pre-publication review system continues to dominate for the foreseeable future (as I unfortunately expect it will), does this task fall to peer reviewers?

Also, the headline really rubs me the wrong way. “ways to fix statistics”??? More like “ways to do our job as scientists and stop pretending that statistics alone does it for us”

Paul:

I don’t think peer review is so great at finding errors—the trouble is that the sorts of people who are peers of a paper’s author are often themselves committed to the proposition that the conclusions in question are correct. Peer review is good for catching missing references.

Sadly, all too often true. Somehow, pre-publication review needs to get better and post-publication review needs to increase.

“Peer review is good for catching missing references.”

And catching mistakes regarding APA and/or individual journals’ formatting and style guides!!

It always amazes me that social scientists start talking about how possible improvements regarding their science “put them in chains” and/or “stifle creativity” (https://www.timeshighereducation.com/comment/opinion/pre-registration-would-put-science-in-chains/2005954.article).

1) But they seem to have no problem following a 200+ page APA style book regarding rules about when to use a colon or semi-colon.

2) And are perfectly fine with individual journals having their own rules for no apparent reason whatsoever, so possible changes have to be made to the manuscript when trying to publish it in a different journal.

+1