Macartan Humphreys sent the following question to David Madigan and me:
I am working on a piece on the registration of research designs (to prevent snooping). As part of it we want to give some estimates for the “scope for snooping” and how this can be affected by different registration requirements.
So we want to answer questions of the form:
“Say in truth there is no relation between x and y, and you were willing to mess about with models until you found a significant relation between them; what are the chances that you would succeed if:
1. You were free to choose the indicators for x and y
2. You were free to choose h control variables from some group of k possible ones
3. You were free to divide up the sample in k ways to examine heterogeneous treatment effects
4. You were free to select from some set of k reasonable models”
People have thought a lot about the first problem, choosing your indicators; we have done a set of simulations to answer the other questions, and we find, for example, that freedom to add control variables gives you a lot of latitude for small datasets, but this decreases quickly as datasets become larger; freedom to focus on subpopulations gives huge latitude; and freedom to select models (e.g., linear, logit, probit) doesn’t do much.
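The control-variables version of this snooping is easy to sketch in a quick simulation (a hypothetical illustration, not Macartan’s actual code; the function names and parameter values below are my own inventions): generate pure-noise data, let the “researcher” try each of k candidate controls in turn, and count a success if any specification makes x significant.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_stat_for_x(y, X):
    """t-statistic for the coefficient on x (column 1 of the design X)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

def snoop_rate(n, k, trials=500, crit=1.96):
    """Fraction of pure-noise datasets where at least one of k candidate
    controls yields |t| > crit for x.  crit=1.96 is a normal approximation."""
    hits = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)              # true relation is zero
        controls = rng.normal(size=(n, k))  # k irrelevant candidate controls
        ones = np.ones(n)
        for j in range(k):
            X = np.column_stack([ones, x, controls[:, j]])
            if abs(t_stat_for_x(y, X)) > crit:
                hits += 1
                break                       # one "significant" spec suffices
    return hits / trials

rate_small = snoop_rate(n=20, k=10)
rate_large = snoop_rate(n=2000, k=10)
print(rate_small, rate_large)
```

In runs like this the small-sample success rate typically lands well above the nominal 5%, while with large n the k specifications give nearly the same t-statistic, so the latitude shrinks toward the nominal rate — consistent with the pattern Macartan describes.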
The question is: are there analytic results on these things already? Or is there already a literature assessing these different approaches to snooping?
I’ve been involved in a large-scale drug safety signal detection project for the last two or three years (http://omop.fnih.org). We have shown empirically that for any given safety issue, by judicious choice of observational database (we looked at 10 big ones), method (we looked at about a dozen), and method setup, you can get *any* answer you want – a big positive and highly significant RR, a big negative and highly significant RR, and everything in between. Generally I don’t think there is any way to say definitively that any one of these analyses is a priori obviously stupid (although “experts” will happily concoct an attack on any approach that does not produce the result they like!). The medical journals are full of conflicting analyses, and I’ve come to the belief that, at least in the medical arena, the idea that human experts *know* the *right* analysis for a particular estimand is false.
I’m all for registration of observational studies with pre-specified protocols. The bit I’m not so sure about is whether such a process will necessarily produce better answers…
To which Macartan replied:
That’s very interesting, if a bit depressing; the following result is easy: for any variables x, y and coefficient b there is a z such that a regression of y on x, controlling for z, yields b, with as low a p-value as you like; simply define z = y – bx. Of course, whether you can find that z is another question…
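Macartan’s construction is easy to verify numerically (a minimal sketch; the variable names and values are mine): with z = y − bx, the regression of y on x and z recovers the coefficient b exactly, with zero residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 100, 2.5
x = rng.normal(size=n)
y = rng.normal(size=n)            # no true relation between x and y

z = y - b * x                     # the constructed "control"
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[1])                    # coefficient on x: b, up to float error
```

Since y = b·x + z holds identically, y lies exactly in the column space of the design, so least squares returns (0, b, 1) with zero residual — hence the p-value can be driven as low as you like.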
I saw this today on registration: Mathieu et al., “Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials,” which suggests that only 31% of articles that registered actually registered properly and stuck to their plans; among those that changed their plans there were a lot of positive results. However, at least with registration you can go back and see which effects are due to the changed plans.
Meanwhile, I wrote the following reply to the original question:
The short answer is that I think a determined researcher can find all sorts of things. My solution to this snooping problem is not to forbid analyses but rather the opposite: to set up the data so people can do all possible analyses. If everything is noise, that should show up in the distribution of findings.
To which Macartan replied:
Sounds very democratic, but it requires that different people are happy to find all sorts of things, which, if they have correlated stakes in the answers, they might not.
But there is something else here which I find harder to put my finger on: say I propose model A ex ante and find wonderful effects, and then you come along and run models B, C, D, E, … and find no effects; should I infer that my result was spurious? No, not unless I thought that B, C, D, … are just as plausible tests of whatever my claim is. But of course if I did find them just as plausible, then I would have been happy to include them in my initial statement of the test to be run. In other words, the extra analyses that you would admit would only matter to me if they are the ones that I wouldn’t have forbidden in the first place. What precommitting does, then, is just move forward the conversation about what the family of plausible models is, to a point where it is not influenced by results.
Then I wrote:
It’s not a matter of the original finding being spurious, it’s about putting it in a larger context. Consider the 8-schools example in chapter 5 of Bayesian Data Analysis. Inference for each individual school is informed by the data from the others. See also this presentation (which includes that example):
To which Macartan replied:
I don’t see how that addresses the issue. It is still the case that for whatever model you settle on (including a multilevel Bayesian model that uses data from all schools), someone can muck about with features of the model to get results they like.
To which I replied:
Yes, that’s always true in any case. But a multilevel model will handle many of the issues of concern.
At this point it seemed worth posting the discussion for all of you.