Over the years I lost some clients because of this strategy. However, I have some regular clients so that I’m able to survive without those black sheep. What really annoys me is that they manage to survive as well, although they perform crappy research.

]]>Yes, I am seeing now why there has been so little progress on the HIV vaccine front. The subject area experts are apparently extremely hubristic and uncreative. In the meantime I was looking for more info and found this paper:

“Because in HIV vaccine efficacy trials the null hypothesis (of no efficacy) is scientifically plausible, the Bayesian analysis assigns a prior probability Pr(VE = 0%) to this hypothesis. An obvious choice is Pr(VE = 0%) = .5, so that there is an even chance of zero efficacy and of nonzero efficacy.”

http://jid.oxfordjournals.org/content/203/7/969.short

I would like to see more about how the zero effect was deemed to be not only scientifically plausible, but by far the most likely outcome. I highly doubt that would stand up to scrutiny; there are simply so many routes by which a vaccine could have an effect. I bet they only considered one favorite mechanism during the discussion and improperly conflated that with the statistical hypothesis. In other words, the usual NHST error.

]]>I am not a virologist – but Jim’s virologists and mine agreed that a zero effect was probable (maybe > .5).

The effect was defined as protection from currently circulating HIV, not extinct versions.

> included any antibody tests in the pipeline

The vaccine development pipeline was very different in this case. For most vaccines, doing an RCT without being fairly sure of an effect would be extremely unlikely.

Desperate researchers often waste resources and risk high false positive claims.

]]>I don’t see why the strain would need to be non-“extinct”. Off the top of my head:

Vaccine -> Immune Response (eg Fever) -> Reduced libido for a few days -> lower HIV incidence

Or if they included any antibody tests in the pipeline, you could get a cross reaction with the vaccine peptide, which will affect diagnosis rates.

]]>Yes, I was surprised too. I wouldn’t’ve expected King to understand much about how p-values and confidence intervals work, as this is kinda technical and lots of applied researchers and even textbook writers get confused on this point—but I was surprised to see him dismiss the value of replications. Here’s where I think he made the mistake of trusting Gilbert on the substance, and then conversely Gilbert naively trusted King on the stats. I have no idea what either King or Gilbert thinks about this now, but my guess is that King may have realized that he screwed up on this one, but he’s not sure whether to publicly admit his error or just quietly move on and hope that people forget this whole episode.

]]>Jim had put a non-zero probability on zero effect, which I complained about and had to back down given the virologists…

Of course one needs to avoid sure things, such as putting probability 1 on zero effect, because RCTs are blind to the mechanism of effect and one may always be wrong about that (i.e. the extra CO2 in your example and Herman’s use of the adjective “virtually”).

]]>1) People wanting to think they are making progress without doing the necessary hard work of figuring out the premises and deducing precise predictions from their speculations. In fields like medical research the vast majority study extremely dynamic systems without any need for tools like calculus. That alone should be a huge red flag.

2) Overreliance on argument from authority and consensus heuristics. These are necessary tools, but when they fail it can be quite spectacular.

3) The extreme cognitive dissonance that results amongst those who have spent a lot of time/effort/money on NHST when they take this realization to its logical conclusion. It took me a few years to really accept it and I realized the problem relatively early on.

]]>Different people may need different types of explanation to help them “get it”. You may find some of the explanations in the slides (under Course Notes) at http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html to be helpful — those for Day 2 and Day 4 are probably most relevant. Also, the link further down to Jerry Dallal’s Simulation of Multiple Testing (and the two items following it) can be helpful to some people.

]]>Yes, indeed. And, beyond that, treatment effects can vary, they can be positive in some scenarios and negative in others.

]]>I’m not quite sure what went wrong there either. But you have to remember that King works well when collaborating with people who *do* know statistics. I have no idea exactly what went wrong with that Gilbert, King, Pettigrew, and Wilson paper, but it’s possible that: (a) Gilbert deferred to King under the impression that King was a statistics expert, and (b) King deferred to Gilbert under the impression that Gilbert was a subject-matter expert. This can sometimes happen in collaborations: with multiple people involved, there’s no one to ultimately take responsibility for the conclusions. I’d like to think that either King or Gilbert acting alone would not have made these mistakes: King would not have been emboldened by Gilbert to take such a strong and mistaken position regarding psychology’s replication crisis, and Gilbert would not have been emboldened by King to make such strong and mistaken statistical claims. The whole episode was a disaster.

Currently re-reading how Russell, Wittgenstein and Ramsey struggled with this issue – of course Peirce figured it out but wrote multiple faulty drafts and the final draft was not clearly marked ;-)

One thing that does seem to be clear is that you can’t evaluate the value of a method of inquiry in any single instance or group of instances – which is what you’re doing now. Rather, you need to evaluate it in an inexhaustible set of inquiries – here the statistically significant outcome will vary and not always be the same one.

It applies to Bayesian analyses as well: with a reasonable/credible/responsible prior and data generating model (an adequate representation of underlying reality we hope to be connected with for the current purpose) you might get an unlucky data set. More likely you will have an inadequate representation of underlying reality and not notice it in a given data set (this time or in the first n times).

What is baked into the definition of the p-value for the purposes that it is often put to in many disciplines is the amplification of how bad the evaluation of it is in an inexhaustible set of inquiries (from the naive unadjusted p-value).

]]>The Meehl link is excellent. Also relevant is my discussion with Deborah Mayo on confirmationist and falsificationist paradigms of science.

]]>No, we need 200% intervals. Better safe than sorry.

]]>And you can have large estimated effects which are statistically significant but which are not real!

]]>What happened is that mathematicians/statisticians turned the logic of science on its head. You are supposed to compare the predictions of your hypothesis to observation, not some other hypothesis. All the other issues follow from that initial error. Here is a good write up: http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

]]>I’m no expert, but I thought Gary King was?

]]>If the null hypothesis is that 100% of replication studies replicate the original findings, then A SINGLE INSTANCE of a failed replication demonstrates with 100% certainty that the replication rate is not 100%. Statistics is not even necessary here, simple logic will do.

]]>The Bayesian analysis answers a different question, not "how often would X occur" but "how much information does my model and my data give me about Y" where Y is some unknown unobserved thing that leads to X.

]]>First off, it’s fine to be confused about this. As shown by the link in my post above, a Harvard professor of psychology and a Harvard professor of political science have difficulty with these concepts too, so they’re not simple. Indeed, the convoluted logic of hypothesis testing has confused many prominent researchers.

To get to your example in your second paragraph there: When we get new information, our inferences change. I don’t know enough about blood pressure to comment on your specific example, but in general we understand effects better when we consider multiple outcomes. Different outcomes are related to each other, and it makes perfect sense to me that learning about 19 other outcomes will affect my inferences about effects on blood pressure.

In your third paragraph, you write, “it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value.” I agree, this is odd, but unfortunately this is baked into the definition of the p-value. You have data y, a test statistic T(y), and the p-value is Pr(T(y_rep) >= T(y)) where y_rep is sampled from the null model. The point is that in this definition, it is *necessary* to define T(y_rep) as a function of y_rep, which means that to define a p-value, you need to make some assumption about what test statistic would be reported, for any y_rep. This assumption is absolutely necessary for the p-value to have any definition at all.
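A minimal sketch of that definition, for anyone who finds it easier to read as a simulation. It assumes a standard normal null model and uses the absolute sample mean as T; the function names and the data are hypothetical, chosen only for illustration. The point to notice is that `test_stat` has to be a rule that can be applied to every y_rep, not just to the data that were actually observed.

```python
import random

def simulated_p_value(y, test_stat, draw_null, n_rep=5000, seed=0):
    """Estimate Pr(T(y_rep) >= T(y)), with y_rep drawn from the null model."""
    rng = random.Random(seed)
    t_obs = test_stat(y)
    # T must be computable for every replicated data set y_rep.
    hits = sum(test_stat(draw_null(rng, len(y))) >= t_obs for _ in range(n_rep))
    return hits / n_rep

def abs_mean(y):
    """Hypothetical test statistic: absolute value of the sample mean."""
    return abs(sum(y) / len(y))

def draw_standard_normal(rng, n):
    """Hypothetical null model: n independent N(0, 1) observations."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Made-up data: 25 observations with sample mean 0.5.
y = [0.1, 0.9] * 12 + [0.5]
p = simulated_p_value(y, abs_mean, draw_standard_normal)
print(f"simulated p-value: {p:.4f}")
```

If the researcher would have reported a different statistic for a different y_rep, then `abs_mean` is the wrong function to pass in, and the number printed here no longer matches the definition.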

This is still slightly fuzzy though. To keep with my example, suppose the first researcher preregisters and gets the effect on blood pressure. Then, suppose we can rewind time, and the same researcher gets to do the experiment over, but this time he looks at 20 outcomes – but again finds that blood pressure is significant (because data is identical). Why should I not take away the same information from both of these hypothetical experiments?

This is a hard concept to communicate, I think. Even your explanation is still conceptual – it is difficult to wrap one’s head around the fact that the researcher’s intentions impact the interpretation of a p-value. That is odd to me; perhaps it is not odd to a statistician. Anyways, just thinking out loud at this point. I guess the replication crisis is evidence that people are behaving according to option 2. However, I was trying to figure out the other day what an acceptable replication rate should be…and obviously that depends on what the “true” effects were in these experiments…so I’m not sure what we are even comparing this seemingly abysmal replication rate to (i.e. what rate should we expect?).

]]>(Additionally, your new adviser’s point that 1 is just 3 with a point prior – is seldom in practice helpful.)

]]>If you still care about statistical significance rather than estimating the size of the effect (or better, figuring out a model that can reproduce the functional relationship between the two parameters, here a dose response), then I don’t think you get his point.

He can correct me if he disagrees, but the multiple comparisons issue is just more problems on top of an already dead paradigm, you shouldn’t be doing those tests anyway… The main problem with what you mention is that it leads to a literature filled with hugely overestimated effect sizes. For example, here:

“I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures—which select at an even higher threshold—make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate.”

http://andrewgelman.com/2011/09/10/the-statistical-significance-filter/

If you want the fleshed-out, more mathematical argument, the best way is to run your own Monte Carlo simulations.
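As a starting point, here is a rough sketch of such a simulation for the significance filter. Everything in it is an assumption picked for illustration: a true effect of 0.5, a standard error of 1, normally distributed estimates, and a cutoff of 1.96 standard errors. It keeps only the "significant" estimates and compares their average magnitude to the true effect.

```python
import random

def significance_filter(true_effect=0.5, se=1.0, n_sims=20000, seed=1):
    """Simulate many studies of the same effect and keep the significant ones."""
    rng = random.Random(seed)
    # Each study's estimate is the true effect plus normal sampling noise.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    # The filter: only estimates more than 1.96 standard errors from zero.
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    mean_abs_sig = sum(abs(e) for e in significant) / len(significant)
    return {
        "share_significant": len(significant) / n_sims,
        "mean_all": sum(estimates) / n_sims,
        # Type M error: how much the surviving estimates exaggerate the truth.
        "exaggeration": mean_abs_sig / true_effect,
    }

result = significance_filter()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions every surviving estimate is at least 1.96 in magnitude while the true effect is 0.5, so the exaggeration ratio cannot be below about 3.9. Raising the threshold, as a multiple comparisons correction would, pushes the ratio higher.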

]]>Suppose a study has data matrix y and J possible data summaries (which might be comparisons, regression coefficients, whatever), T_j(y), for j=1,…,J.

Consider three possible scenarios:

1. J=1. One can compare T_1(y) to its distribution T_1(y_rep) under a null model and perform a hypothesis test.

2. J=20 and the researcher picks the best result (this could be via “p-hacking” in which all 20 tests are computed and the best one is chosen, or less formally through a “garden of forking paths” in which the data are set up opportunistically and tested in a way that makes sense, conditional on the values actually observed). In either case, the test being used is T(y) = max_j T_j(y), and if you want to perform a hypothesis test you need to figure out the distribution of T(y_rep) = max_j T_j(y_rep) under the null hypothesis. The way this works is that the T_j that’s picked will depend on the data.

3. J=20 and the researcher looks at all comparisons together. I’d suggest doing this using a hierarchical model.

The following paper is relevant to option 2 above:

http://www.stat.columbia.edu/~gelman/research/published/multiverse_published.pdf

And this paper is relevant to option 3 (my preferred approach):

http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf
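Scenario 2 can be sketched with a short simulation. This is only an illustration under strong assumptions that are mine, not part of the setup above: the 20 summaries are taken to be independent, each T_j is the absolute value of a standard normal z-score under the null, and the names are hypothetical. Real outcomes are usually correlated, which changes the numbers but not the logic.

```python
import random

def max_stat_null(n_outcomes=20, n_rep=10000, seed=2):
    """Draw T(y_rep) = max_j T_j(y_rep) under the null, with T_j = |z_j|."""
    rng = random.Random(seed)
    return [
        max(abs(rng.gauss(0.0, 1.0)) for _ in range(n_outcomes))
        for _ in range(n_rep)
    ]

maxes = max_stat_null()

# How often the best of 20 null results clears the usual single-test cutoff.
naive_rate = sum(m > 1.96 for m in maxes) / len(maxes)

# Empirical 95th percentile of the max: the cutoff the reported test
# would need in order to have a 5% false positive rate.
cutoff = sorted(maxes)[int(0.95 * len(maxes))]

print(f"share of null data sets with max |z| > 1.96: {naive_rate:.3f}")
print(f"simulated 5% cutoff for max |z|: {cutoff:.2f}")
```

With independent outcomes the first number should land near 1 - 0.95^20, roughly 0.64, which is why comparing the selected T_j to the J=1 reference distribution is misleading.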

If you tell someone that the risk of debilitating chronic disease, say, is indistinguishable from zero – and then say ‘because it is less than 0.05 (1 in 20)’ – they will be justifiably concerned that you have NO idea how risk management works.

So, how many bad studies do you need in a field before it becomes a crisis of credibility? How many from an individual researcher? How bad do they need to be?