Comments on: We have a ways to go in communicating the replication crisis

By: Anonymous

Anonymous — Fri, 21 Oct 2016 19:32:01 +0000

In reply to Matt.

Probably not the kind of mathematical argument you are looking for, but in case you don’t know it already:

https://www.xkcd.com/882/

By: Paul

Paul — Fri, 21 Oct 2016 18:42:24 +0000

In reply to Matt. Matt, as a statistical consultant for scientists I encounter similar situations far too often. Therefore, I got used to the following strategy: I try to convince my clients that analyzing the effects of 30 covariates on 20 outcomes is ok as long as it is declared as exploratory data analysis and the results are not used for any inferential statements. My main argument is about reproducibility: Results obtained by such an analysis are most often not reproducible, and thus practically worthless for publication. Instead, I encourage them to use the information of this analysis to plan a more detailed study about the effect of interest which – if it is carried out correctly – most often yields reproducible results. If this argumentation does not work and the client insists on publishing the results as they were carried out by a planned experiment I usually end the cooperation with this particular client at this stage and demand that my name must not be brought into connection with any result of their analysis. Over the years I lost some clients because of this strategy. However, I have some regular clients so that I'm able to survive without those black sheep. What really annoys me is that they manage to survive as well, although they perform crappy research.

By: Anoneuoid

Anoneuoid — Fri, 21 Oct 2016 14:46:47 +0000

In reply to Bill Jefferys.

>”Jim’s virologists and mine agreed that a zero effect was probable (maybe > .5).”

Yes, I am seeing now why there has been so little progress on the HIV vaccine front. The subject area experts are apparently extremely hubristic and uncreative. In the meantime I was looking for more info and found this paper:

“Because in HIV vaccine efficacy trials the null hypothesis (of no efficacy) is scientifically plausible, the Bayesian analysis assigns a prior probability Pr(VE = 0%) to this hypothesis. An obvious choice is Pr(VE = 0%) = .5, so that there is an even chance of zero efficacy and of nonzero efficacy.”
http://jid.oxfordjournals.org/content/203/7/969.short

I would like to see more about how the zero effect was deemed to be not only scientifically plausible, but the by far most likely outcome. I highly doubt that would stand up to scrutiny, there are simply so many routes by which a vaccine could have an effect. I bet they only considered one favorite mechanism during the discussion and improperly conflated that with the statistical hypothesis. In other words, the usual NHST error.

By: Keith O'Rourke

Keith O'Rourke — Fri, 21 Oct 2016 14:32:11 +0000

In reply to Bill Jefferys.

Anoneuoid:

I am not a virologist – but Jim’s virologists and mine agreed that a zero effect was probable (maybe > .5).

The effect was defined as protection from currently circulating HIV not extinct versions.

> included any antibody tests in the pipeline
The vaccine development pipeline was very different in this case, in most vaccines doing an RCT without being fairly sure of an effect would be extremely unlikely.

Desperate researchers often waste resources and risk high false positive claims.

By: Anoneuoid

Anoneuoid — Fri, 21 Oct 2016 13:25:26 +0000

In reply to Bill Jefferys.

>”the virologists seemed quite convinced the effect could be zero – intuitively it trained the body to recognize a now extinct version of HIV (they go extinct very quickly).”

I don’t see why the strain would need to be non-“extinct”. Off the top of my head:

Vaccine -> Immune Response (eg Fever) -> Reduced libido for a few days -> lower HIV incidence

Or if they included any antibody tests in the pipeline, you could get a cross reaction with the vaccine peptide, which will affect diagnosis rates.

By: Andrew

Andrew — Fri, 21 Oct 2016 12:06:17 +0000

In reply to Keith O’Rourke.

Keith:

Yes, I was surprised too. I wouldn’t’ve expected King to understand much about how p-values and confidence intervals work, as this is kinda technical and lots of applied researchers and even textbook writers get confused on this point—but I was surprised to see him dismiss the value of replications. Here’s where I think he made the mistake of trusting Gilbert on the substance, and then conversely Gilbert naively trusted King on the stats. I have no idea what either King or Gilbert thinks about this now, but my guess is that King may have realized that he screwed up on this one, but he’s not sure whether to publicly admit his error or just quietly move on and hope that people forget this whole episode.

By: Keith O'Rourke

Keith O'Rourke — Fri, 21 Oct 2016 11:52:34 +0000

In reply to Andrew. Thanks - but I would have thought King would be better at disaster recognition and recovery given he was one of the few working on research on reproducible/replicatable science in the early 2000,s (when almost no one was successful at getting funding for it).

By: Keith O'Rourke

Keith O'Rourke — Fri, 21 Oct 2016 11:44:21 +0000

In reply to Bill Jefferys.

Bill: Jim also did an analysis of an HIV vaccine where the virologists seemed quite convinced the effect could be zero – intuitively it trained the body to recognize a now extinct version of HIV (they go extinct very quickly).

Jim had put a non-zero probability on zero effect, which I complained about and had to back down given the virologists…

Of course one needs to avoid sure things, putting probability 1 on zero effect because RCTs are blind to mechanism of effect and one may always be wrong about that (i.e. the extra CO2 in your example and Herman’s use of the adjective virtually.)

By: Remi

Remi — Fri, 21 Oct 2016 10:21:08 +0000

The famous “Not even wrong!” comes to mind!

By: Anoneuoid

Anoneuoid — Fri, 21 Oct 2016 03:06:09 +0000

In reply to MichaelK.

Thanks, it is encouraging to see it getting across to some people. The “fooling” going on is due to:

1) People wanting to think they are making progress without doing the necessary hard work of figuring out the premises and deducing precise predictions from their speculations. In fields like medical research the vast majority study extremely dynamic systems without any need for tools like calculus. That alone should be a huge red flag.

2) Overreliance on argument from authority and consensus heuristics. These are necessary tools, but when they fail it can be quite spectacular.

3) The extreme cognitive dissonance that results amongst those who have spent a lot of time/effort/money on NHST when they take this realization to its logical conclusion. It took me a few years to really accept it and I realized the problem relatively early on.

By: Bill Jefferys

Bill Jefferys — Fri, 21 Oct 2016 00:52:28 +0000

In reply to Bill Jefferys. I have to say that the Berger example comes from Jim Berger's paper with Mohan Delampady. I should have given credit to both of them.

By: Martha (Smith)

Martha (Smith) — Thu, 20 Oct 2016 22:34:35 +0000

In reply to Matt.

Matt,

Different people may need different types of explanation to help them “get it”. You may find some of the explanations in the slides (under Course Notes) at http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html to be helpful — those for Day 2 and Day 4 are probably most relevant. Also, the link further down to Jerry Dallal’s Simulation of Multiple Testing (and the two items following it) can be helpful to some people.

By: Andrew

Andrew — Thu, 20 Oct 2016 21:52:42 +0000

In reply to Bill Jefferys. Bill: Yes, indeed. And, beyond that, treatment effects can vary, they can be positive in some scenarios and negative in others.

By: Bill Jefferys

Bill Jefferys — Thu, 20 Oct 2016 21:40:08 +0000

In reply to Anoneuoid. Herman Rubin always said that he didn't need any data at all to reject such a null hypothesis, since it is virtually certain that it is false. Even null hypotheses that you might be sure could be true (Jim Berger gives the example, "My plants grow better if I talk to them") might not be exactly true (e.g., if your breath gives them extra CO2 which helps them to grow better).

By: Andrew

Andrew — Thu, 20 Oct 2016 21:26:50 +0000

In reply to Marko. Marko: I'm not quite sure what went wrong there either. But you have to remember that King works well when collaborating with people who do know statistics. I have no idea exactly what went wrong with that Gilbert, King, Pettigrew, and Wilson paper, but it's possible that: (a) Gilbert deferred to King under the impression that King was a statistics expert, and (b) King deferred to Gilbert under the impression that Gilbert was a subject-matter expert. This sometimes can happen with collaborations, that with multiple people involved, there's no one to ultimately take responsibility for the conclusions. I'd like to think that either King or Gilbert acting alone would not have made these mistakes: King would not have been emboldened by Gilbert to take such a strong and mistaken position regarding psychology's replication crisis, and Gilbert would not have been emboldened by King to make such strong and mistaken statistical claims. The whole episode was a disaster.

By: Marko

Marko — Thu, 20 Oct 2016 20:35:39 +0000

In reply to Lauren. I am asking myself the same question since I first read the Gilbert et al. comment. I really admire most of King's work and don't understand what went wrong there...

By: Geof

Geof — Thu, 20 Oct 2016 20:24:38 +0000

In reply to Lauren. That was my initial reaction too. I'm not sure if Andrew was teasing his former collaborator or if they've had a falling out....

By: Keith O'Rourke

Keith O'Rourke — Thu, 20 Oct 2016 19:41:45 +0000

In reply to Matt.

More generally “it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of what human inquiry ought to be to connect us best with reality (that’s not directly accessible).”

Currently re-reading how Russel, Wittgenstein and Ramsay struggled with this issue – of course Peirce figured it out but wrote multiple faulty drafts and the final draft was not clearly marked ;-)

One thing that does seem to clear is that you can’t evaluate the value of a method of inquiry in any single instance or group of instances – which is what your doing now. Rather you need to evaluate in an inexhaustible set of inquiries – here the statistically significant outcome will vary and not always be the same one.

It applies to Bayesian analyses as well, with a reasonable/credible/responsible prior and data generating model (an adequate representation of underlying reality we hope to be connected with for the current purpose) you might get an unlucky data set. More likely you will have an inadequate representation of underlying reality and not notice it in a given data set (this time or in the first n times).

What is baked into the definition of the p-value for the purposes that it is often put to in many disciplines is the amplification of how bad the evaluation of it is in an inexhaustible set of inquiries (from the naive unadjusted p_value.)

By: Jonathan (another one)

Jonathan (another one) — Thu, 20 Oct 2016 19:38:46 +0000

In reply to Andrew.

Another quote, this one from Richard Royall: “The experimenter who is primarily interested in studying the distribution of X… will ordinarily make observations on other random variables, Y, Z, etc. in the same survey….To the researcher who conceived the study in the first place to obtain evidence about the distribution of X, who finds a mean that differs from zero by 2.5 standard errors, and who is now told that this observation is not statistically significant, something seems wrong….The problem…is that the significance level is being used in two roles, only one of which is valid. Whether or not the researcher performs tests of other hypotheses affects his overall probability of committing at least one Type I error, but it does not change his evidence about X or how that evidence should be interpreted.” (My note: this assumes that the other observations give no information about the distribution of X, in which case Andrew’s hierarchical approach actually does tell you more.)

By: Carol

Carol — Thu, 20 Oct 2016 19:31:12 +0000

In reply to Andrew. Everyone a comedian ....

By: Andrew

Andrew — Thu, 20 Oct 2016 19:13:00 +0000

In reply to Anoneuoid. Anon: The Meehl link is excellent. Also relevant is my discussion with Deborah Mayo on confirmationist and falsificationist paradigms of science.

By: MichaelK

MichaelK — Thu, 20 Oct 2016 19:04:22 +0000

In reply to Anoneuoid. Thank you. That's a great explanation. It's funny -- stripped of its technicalities, the conceptual mistake really is a simple one. Yet it fools most of science.

By: Andrew

Andrew — Thu, 20 Oct 2016 18:57:14 +0000

In reply to Shravan. Shravan: No, we need 200% intervals. Better safe than sorry.

By: Andrew

Andrew — Thu, 20 Oct 2016 18:55:55 +0000

In reply to Bob Carpenter. Bob: And you can have large estimated effects which are statistically significant but which are not real!

By: Bob Carpenter

Bob Carpenter — Thu, 20 Oct 2016 18:46:51 +0000

In reply to BenK. The p-value isn't the estimated size of the effect. You can have small estimated effects (0.01%, say) which are significant and large estimated effects (10%, say) which are not significant.

By: Anoneuoid

Anoneuoid — Thu, 20 Oct 2016 18:17:45 +0000

In reply to Matt.

The dead paradigm is testing a default null hypothesis of zero difference between groups, or zero correlation between some parameters, and then trying to link rejection of this hypothesis to a theory or model of interest. The only time it makes sense is when your model predicts the zero difference/correlation.

What happened is that mathematicians/statisticians turned the logic of science on its head. You are supposed to compare the predictions of your hypothesis to observation, not some other hypothesis. All the other issues follow from that initial error. Here is a good write up: http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

By: Lauren

Lauren — Thu, 20 Oct 2016 18:13:03 +0000

> those guys don’t know a lot of statistics …

I’m no expert, but I thought Gary King was?

By: Terry

Terry — Thu, 20 Oct 2016 18:08:12 +0000

More fundamentally, does it even make sense to talk about being “statistically indistinguishable from 100%”?

If the null hypothesis is that 100% of replication studies replicate the original findings, then A SINGLE INSTANCE of a failed replication demonstrates with 100% certainty that the replication rate is not 100%. Statistics is not even necessary here, simple logic will do.

By: Matt

Matt — Thu, 20 Oct 2016 18:01:31 +0000

In reply to Anoneuoid.

I don’t care about statistical significance vs. effect size, this was just an example. I’m an economist so most empirical work I do will be grounded in theory. However it is definitely interesting/disconcerting to read this stuff. What is the dead paradigm – frequentist statistics? Sorry about being such a noob on this stuff.

By: Daniel Lakeland

Daniel Lakeland — Thu, 20 Oct 2016 17:41:29 +0000

In reply to Matt. Note that in a Bayesian analysis you would take away the same thing from the 1 vs 20 experiment. It's only the fact that frequentist statistics tells you "how often would x occur" that means that doing X many times or even potentially having a bunch of different Xes you could do and choosing one through some path that biases you towards p < 0.05 is a problem. The bayesian analysis answers a different question, not "how often would X occur" but "how much information does my model and my data give me about Y" where Y is some unknown unobserved thing that leads to X.

By: Andrew

Andrew — Thu, 20 Oct 2016 17:38:04 +0000

In reply to Matt. Matt: First off, it's fine to be confused about this. As shown by the link in my post above, a Harvard professor of psychology and a Harvard professor of political science have difficulty with these concepts too, so they're not simple. Indeed, the convoluted logic of hypothesis testing has confused many prominent researchers. To get to your example in your second paragraph there: When we get new information, our inferences change. I don't know enough about blood pressure to comment on your specific example, but in general we understand effects better when we consider multiple outcomes. Different outcomes are related to each other, and it makes perfect sense to me that learning about 19 other outcomes will affect my inferences about effects on blood pressure. In your third paragraph, you write, "it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value." I agree, this is odd, but unfortunately this is baked into the definition of the p-value. You have data y, a test statistic T(y), and the p-value is Pr(T(y_rep) >= T(y)) where y_rep is sampled from the null model. The point is that in this definition, it is necessary to define T(y_rep) as a function of y_rep, which means that to define a p-value, you need to make some assumption about what test statistic would be reported, for any y_rep. This assumption is absolutely necessary for the p-value to have any definition at all.

By: Matt

Matt — Thu, 20 Oct 2016 17:05:28 +0000

In reply to Andrew.

Thanks for getting back to me. I agree with all this (except option 3, which I would have to read about because my understanding is quite limited there).

This is still slightly fuzzy though. To keep with my example, suppose the first researcher preregisters and gets the effect on blood pressure. Then, suppose we can rewind time, and the same researcher gets to do the experiment over, but this time he looks at 20 outcomes – but again finds that blood pressure is significant (because data is identical). Why should I not take away the same information from both of these hypothetical experiments?

This is a hard concept to communicate I think, even your explanation is still conceptual – it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value. That is odd to me, perhaps that is not odd to a statistician. Anyways, just thinking out loud at this point. I guess the replication crisis is evidence that people are behaving according to option 2. However, I was trying to figure out the other day what an acceptable replication rate should be…and obviously that depends on what the “true” effects were in these experiments…so I’m not sure what we are even comparing this seemingly abysmal replication rate to (i.e. what rate should we expect?).

By: Keith O'Rourke

Keith O'Rourke — Thu, 20 Oct 2016 17:04:49 +0000

In reply to Andrew.

Madam/Mr President – I concur with 3 if there is a reasonable/credible/responsible prior and data generating model (representation of underlying reality we hope to be connected with) whilst otherwise 2.

(Additionally, your new adviser’s point that 1 is just 3 with a point prior – is seldom in practice helpful.)

By: Daniel Lakeland

Daniel Lakeland — Thu, 20 Oct 2016 16:53:35 +0000

In reply to Corey. Not quite yet ;-)

By: Anoneuoid

Anoneuoid — Thu, 20 Oct 2016 16:46:44 +0000

In reply to Matt.

>”how do I explain to someone that they should put less weight on a drug study that looked at 20 outcomes and found that the drug has a statistically significant effect on blood pressure versus a pre-registered study that finds a statistically significant effect on blood pressure (and that is the only outcome looked at).”

If you still care about statistical significance rather than estimating the size of the effect (or better, figuring out a model that can reproduce the functional relationship between the two parameters, here a dose response), then I don’t think you get his point.

He can correct me if he disagrees, but the multiple comparisons issue is just more problems on top of an already dead paradigm, you shouldn’t be doing those tests anyway… The main problem with what you mention is that it leads to a literature filled with hugely overestimated effect sizes. For example, here:

“I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures—which select at an even higher threshold—make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. ”
http://statmodeling.stat.columbia.edu/2011/09/10/the-statistical-significance-filter/

If you want the fleshed out, more mathematical argument, the best way is to run your own monte carlo simulations.

By: Andrew

Andrew — Thu, 20 Oct 2016 16:36:21 +0000

In reply to Matt.

Matt:

Suppose a study has data matrix and J possible data summaries (which might be comparisons, regression coefficients, whatever), T_j(y), for j=1,…,J.

Consider three possible scenarios:

1. J=1. One can perform a test on T_1(y), comparing it to its distribution T(y_rep) under a null model and perform a hypothesis test.

2. J=20 and the researcher picks the best result (this could be via “p-hacking” in which all 20 tests are computed and the best one is chosen, or less formally through a “garden of forking paths” in which the data are set up opportunistically and tested in a way that makes sense, conditional on the values actually observed. In either case, the test being used is T(y) = max_j T_j(y), and if you want to perform a hypothesis test you need to figure out the distribution of T(y_rep) = max_j T_j(y_rep) under the null hypothesis. The way this works is that the T_j that’s picked will depend on the data.

3. J=20 and the researcher looks at all comparisons together. I’d suggest doing this using a hierarchical model.

The following paper is relevant to option 2 above:
http://www.stat.columbia.edu/~gelman/research/published/multiverse_published.pdf

And this paper is relevant to option 3 (my preferred approach):
http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf

By: Corey

Corey — Thu, 20 Oct 2016 15:56:34 +0000

In reply to Daniel Lakeland. Madam President

By: Matt

Matt — Thu, 20 Oct 2016 15:54:58 +0000

Hi, grad student in econ here; I have been reading many of your posts and a few of your published articles concerning the replication crisis and p-hacking, garden of forking paths etc, I agree completely with the conceptual problem of p-hacking and the garden of forking paths. I think I understand the meaning of a p-value, and that for it to have literal meaning the implication is that, given a different sample from the same population, the same research decisions would have been made and the same hypothesis test would have been run. However, given recent arguments I have had with people, I have realized that I do not have the mathematical rigor to back up this claim. Taking a simple example, how do I explain to someone that they should put less weight on a drug study that looked at 20 outcomes and found that the drug has a statistically significant effect on blood pressure versus a pre-registered study that finds a statistically significant effect on blood pressure (and that is the only outcome looked at). I can given all sorts of conceptual arguments for this, but I would like to hear anyone’s more mathematical argument for this claim.

By: Daniel Lakeland

Daniel Lakeland — Thu, 20 Oct 2016 15:48:17 +0000

In reply to Shravan. Mr President, as you know you have tasked RISLAG (Really Important Six Letter Acronym Group) with determining how devastating the release of STB (Some Thing Bad) into the environment would be for the human population of earth. You have asked us to consider all the possibilities, and to analyze as many scenarios as we were able to on our BGC (Big Gigantic Computer). You have asked us to produce a summary of the effect which includes 100% of the possibilities. We have spent the last 3 years and over $100 Billion Dollars (say that in the voice of Dr. Evil while raising pinky finger to cheek). We are now able to give you the results of our study, which is that we are 100% sure that the effect in question is between negative infinity and positive infinity.

By: Shravan

Shravan — Thu, 20 Oct 2016 15:03:18 +0000

On a related note, I’ve been wondering why we don’t plot 100% confidence intervals. After all, complete certainty is the ultimate goal.

By: BenK

BenK — Thu, 20 Oct 2016 14:34:36 +0000

To tie lightning rods together, this is also what bothers me about the anti-anti-vaxxers.
If you tell someone that the risk of debilitating chronic disease, say, is indistinguishable from
zero – and then say ‘because it is less than 0.05 (1 in 20)’ – they will be justifiably concerned
that you have NO idea how risk management works.

So, how many bad studies do you need in a field before it becomes a crisis of credibility?
How many from an individual researcher? How bad do they need to be?

By: CJ

CJ — Thu, 20 Oct 2016 14:02:06 +0000

If it had been on purpose, this would have been a clever and ironic response to the problems with hypothesis testing. The response to your post may be: “Well, maybe social psychology doesn’t have a 100% replication rate overall, but I have identified some subfields where I cannot reject 100% replication (p<.05) which confirms my theory.