What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Simon Gates writes:

I thought you might be interested in a recently published clinical trial, for potential blog material. It picks up some themes that have cropped up in recent months. Also, it is important for the way statistical methods influence what can be life-or-death decisions.

The OPPTIMUM trial (http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(16)00350-0/abstract) evaluated use of vaginal progesterone for prevention of preterm delivery. The trial specified three primary outcomes: 1. fetal death or delivery at less than 34 weeks’ gestation; 2. neonatal death or serious morbidity; 3. cognitive score at two years. Because there were three primary outcomes, they applied a Bonferroni/Holm correction to “control type 1 error rate”.

Results were:

Outcome | Progesterone | Placebo | Risk ratio / difference (95% CI) | P unadjusted | P adjusted
Fetal death or delivery < 34 weeks | 96/600 | 108/597 | 0.86 (0.64, 1.17) | 0.34 | 0.67
Neonatal death or serious morbidity | 39/589 | 60/587 | 0.62 (0.41, 0.94) | 0.02 | 0.072
Cognitive score at 2 years | 97.3 (n=430) | 97.7 (n=439) | -0.48 (-2.77, 1.81) | 0.68 | 0.68

The conclusion was (from the abstract) “Vaginal progesterone was not associated with reduced risk of preterm birth or composite neonatal adverse outcomes, and had no long-term benefit or harm on outcomes in children at 2 years of age.”

A few comments (there is lots more that could be said):

  1. There is a general desire to collapse results of a clinical trial into a dichotomous outcome: the treatment works or it doesn’t. This seems to stem from a desire for the trial to provide a decision as to whether the treatment should be used or not.  However, those decisions often come later and are largely based on other information as well (often cost-effectiveness).
  2. The approach taken here (adjustment of p-values for multiple testing) implies that a conclusion of treatment effectiveness would be made if any of the three “primary outcomes” had p<0.05, after adjustment.
  3. “Statistical significance” is taken to mean treatment effectiveness (as it usually is by clinicians, researchers and even statisticians).
  4. There were a lot of patients missing for the assessment of cognition at 2 years, so there has to be a question mark over that result.  It is certainly possible that bias has been introduced.
  5. The Bonferroni adjustment moves the neonatal death/morbidity outcome from “significance” to “non-significance”, but the data still support a reduction in this outcome more than no effect or an increase (as the posterior distribution would no doubt show).
  6. It seems to me that the statistical methods used here have really not helped to understand what the effects of this treatment are.

I’d be interested in your take (and those of commenters) on the analysis and conclusions.

Interest declaration: I [Gates] was tangentially involved in this trial (as a member of the Data Monitoring Committee) and I know several of the authors of the paper.

My reply: Yes, I agree that the analysis does not seem appropriate for the goals of the study. Even setting aside the multiple comparisons issues, “not statistically significant” is not the same as “no effect.” Also the multiple comparisons correction bothers me for the usual reason that it doesn’t make sense to me that including more information should weaken one’s conclusion.
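To make that concrete, here is a minimal sketch of the posterior calculation Gates alludes to in his point 5, using the neonatal death/morbidity counts from the table above and assuming independent uniform Beta(1,1) priors on each arm’s event probability. It is an illustration of the general idea, not a reanalysis of the trial:

```python
import numpy as np

rng = np.random.default_rng(1)

# Neonatal death or serious morbidity, counts from the table above
events_prog, n_prog = 39, 589   # progesterone arm
events_plac, n_plac = 60, 587   # placebo arm

# Beta(1,1) priors give Beta posteriors for each arm's event probability
p_prog = rng.beta(1 + events_prog, 1 + n_prog - events_prog, size=100_000)
p_plac = rng.beta(1 + events_plac, 1 + n_plac - events_plac, size=100_000)

rr = p_prog / p_plac
print("posterior median risk ratio:", round(float(np.median(rr)), 2))
print("Pr(risk ratio < 1):", round(float(np.mean(rr < 1)), 3))
```

The output echoes Gates’s point 5: most of the posterior mass sits on a reduction in this outcome, whatever one decides to do with the adjusted p-value.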

I wonder what would happen if they were to use a point system where death counts as 2 points and the other negative outcomes count as 1 point, or some other sort of graded scale? At this point maybe there would be a concern of fishing through the data, but for the next study of this sort maybe they can think ahead of time about a reasonable combined outcome measure.

The other issue is decision making: what to do when a study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best). In some ways the existing paper does well here, in that it presents the results right out there for people to look at. I’d like to see a formal decision analysis for whether, or when, to recommend this treatment for new patients.

Also, one more thing: the measurement of cognitive score has another selection bias problem which is that it is conditional on the baby being born. So it might be that they’d want to fit some sort of selection modeling to handle this. I could imagine a scenario in which a treatment reduced deaths and ended up also reducing cognitive scores if it saved the lives of babies who later had problems. Or maybe not, I’m not sure; it just seems like one more thing to think about.

86 thoughts on “What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)”

  1. That loss-to-follow-up data looks a bit odd to me, but maybe deaths did not differ much while early delivery and serious morbidity did. And then maybe, as a result of the worse outcomes early on, the placebo group had better follow-up?

  2. > Even setting aside the multiple comparisons issues, “not statistically significant” is not the same as “no effect.”

    They don’t claim “no effect” (at least if you read beyond the one-line summary). In the discussion part of the article they claim that “OPPTIMUM strongly suggests that the efficacy of progesterone in improving outcomes is either non-existent or weak.” I guess “weak” has to be understood relative to a clinically meaningful effect; according to the trial protocol, the power to detect a relevant effect was over 80% for each of the primary outcomes.

    > Also, one more thing: the measurement of cognitive score has another selection bias problem which is that it is conditional on the baby being born. So it might be that they’d want to fit some sort of selection modeling to handle this.

    I’m sure you’ll be pleased to know that they gave some thought to these issues before launching a long-term study enrolling several hundred patients:
    “The primary childhood outcome is the Bayley III score, a continuous measure. This outcome will, by definition, not be available on babies who have died. Thus deaths need to be incorporated into the analysis, since the number of deaths may be sufficiently large as not to be negligible, and/or there may be a difference in the number of deaths between the two randomised groups. We will therefore use a two-stage statistical model that jointly models the treatment effect in both deaths and survivors [24], with deaths modelled using a binomial test and survivors modelled using a generalised linear model. The two parts are then combined to form the appropriate test statistic. Secondary analyses that adjust the estimated treatment effect for covariates felt to be of importance will be used as appropriate. Note that we will not be adjusting for gestational age in our analysis of childhood outcomes. The hypothesised mechanism of action of progesterone is to increase gestational age by reducing the proportion of women giving birth prematurely. To adjust for a post randomisation covariate (gestational age) which is a direct measure of the treatment effect, in a model that is estimating the consequence of that treatment effect (in terms of developmental outcome) is not statistically sound.”
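    For readers who want to see the shape of such an analysis, here is a schematic two-part sketch with made-up numbers: deaths are compared as binomial proportions, survivors’ scores with a t-test, and the two pieces are combined into a single chi-square statistic. That combination rule is one common choice for two-part tests; the actual method of reference [24] in the protocol may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data (placeholders, not the OPPTIMUM numbers)
deaths_trt, n_trt = 12, 300
deaths_ctl, n_ctl = 20, 300
scores_trt = rng.normal(97.5, 15, n_trt - deaths_trt)   # survivors, treatment arm
scores_ctl = rng.normal(97.0, 15, n_ctl - deaths_ctl)   # survivors, control arm

# Part 1: difference in death proportions (normal approximation to the binomial)
p1, p2 = deaths_trt / n_trt, deaths_ctl / n_ctl
p_pool = (deaths_trt + deaths_ctl) / (n_trt + n_ctl)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_trt + 1 / n_ctl))
z_deaths = (p1 - p2) / se

# Part 2: difference in mean cognitive score among survivors
t_scores = stats.ttest_ind(scores_trt, scores_ctl, equal_var=False).statistic

# Combine: treating both statistics as roughly standard normal under the null,
# their sum of squares is approximately chi-square with 2 degrees of freedom
chi2 = z_deaths ** 2 + t_scores ** 2
p_combined = stats.chi2.sf(chi2, df=2)
print(f"z (deaths) = {z_deaths:.2f}, t (scores) = {t_scores:.2f}, combined p = {p_combined:.3f}")
```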

    Regarding point 5 in Gates’ message, if we are willing to ignore the significance/non-significance distinction the point estimate is also in the direction of benefit in the primary obstetric outcome, not only in the neonatal outcome. And the point estimate shows a deterioration in the cognitive score (even if it’s statistically not significant and clinically negligible). I’m not saying that the “trends” in the different measures cannot be interesting (and they are also discussed in the article).

  3. The report of the trial notes that cognitive scores were imputed for dead children, but does not provide information on the imputation method. I cannot think of any appealing method to impute cognitive scores for dead children. Even if they had applied the two stage model, what is the interpretation of this estimate?

    My impression from the protocol is that the study was powered to detect effects of the magnitude they anticipated observing, not the minimum clinically relevant effect size. Given that the outcomes were very sick or dead babies, I assume even quite tiny effects would be clinically relevant. http://bmcpregnancychildbirth.biomedcentral.com/articles/10.1186/1471-2393-12-79

    This kind of ambiguous result is extremely common, and the love affair with the p<.05 criterion gives misleading conclusions. Perhaps there is no easy solution, because the rigid trial protocols and interpretation rules seem intended to protect against nefarious pressures to overinterpret trial results when there is a potential financial benefit to doing so. But surely the plethora of ambiguous findings on important treatment questions indicates that we need to make trials faster, easier, and more common.

  4. So, the point score thing I think is the way to go; we’ve had all kinds of discussions recently about Bayesian decision theory where I’ve expressed strong opinions. Part of the reason is that I’m actually working on a risk model for a certain activity where there are some “precursor” type symptoms as well as more serious issues, and the whole goal is to let people participate in the activity while keeping the lifetime “cost” low. So, anyway, for each of say 7 outcomes that are progressively more serious you can associate a “cost”…

    outcome : “Cost”

    1 : 1
    2 : 4
    3 : 8
    4 : 500
    5 : 5000
    6 : 100000
    7 : 10000000

    Where events 1,2,3 are “sub-clinical events that are precursors to clinical events” and 4 is the first “clinically relevant event”, 5 would require treatment but normally resolve, and 6, 7 could be very serious or life threatening.

    It seems to me you want to do a similar thing for these outcomes.

    fetal death : 1000 (spontaneous abortion is not totally uncommon so it could be caused by various things)

    neonatal death : 3000 (making it to term and then dying seems worse than a spontaneous earlier abortion to me, if nothing else it probably places the mother at higher risk)

    cognitive score at 2 : (1-score / maxscore) * 100 (obviously we want high cognitive scores… there’s no reason to necessarily use a linear function though, get some experts to think about it).

    Then the question becomes, does using this drug lower the expected “badness” of outcomes? Why we have the FDA doing anything BUT this is … frustrating but not actually baffling. Still, Wald’s paper was 1947!!
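    A minimal sketch of that comparison: assign a “badness” to each outcome, put Beta(1,1) posteriors on the event probabilities in each arm, and ask how often the treatment arm has the lower expected badness. The costs and counts below are illustrative placeholders, and the cognitive-score term is left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative "badness" scores and hypothetical (events, n) counts per arm
costs = {"fetal death": 1000.0, "neonatal death": 3000.0}
counts = {
    "fetal death":    {"treatment": (10, 600), "placebo": (14, 600)},
    "neonatal death": {"treatment": (5, 600),  "placebo": (9, 600)},
}

def expected_cost(arm, n_draws=50_000):
    """Posterior draws of expected badness per pregnancy for one arm (Beta(1,1) priors)."""
    total = np.zeros(n_draws)
    for outcome, cost in costs.items():
        events, n = counts[outcome][arm]
        p = rng.beta(1 + events, 1 + n - events, size=n_draws)
        total += cost * p
    return total

diff = expected_cost("treatment") - expected_cost("placebo")
print("Pr(treatment lowers expected badness):", round(float(np.mean(diff < 0)), 3))
```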

    • This is great when motivations are aligned & legitimate. In adversarial settings this just leads to a different form of fishing, doesn’t it?

      We choose scores that best color the model outcome to support FDA approving our drug & then fish for experts and narratives (“making it to term and then dying seems worse than a spontaneous earlier abortion”) that justify those prior choices. The great part is that there’s just so many appealing qualitative explanations available to justify almost any choice.

      I’m not saying it’s a bad approach. Just saying that it isn’t as easy & foolproof as you make it sound. It’s vulnerable to other kinds of abuse.

      • Well, I didn’t mean for it to seem like it was simple, but in theory, you’d require all these outcome badness scores to be pre-specified in the approval plan, and the FDA, other experts, and the drug manufacturer would have to hash this all out ahead of time. So it wouldn’t be fishing for scores after getting the data, it’d be evaluating whether the data supports improvements based on the pre-specified scores. With drugs I think that this kind of pre-trial plan is already being required, except it’s being required in terms of “95% chance of non-inferiority vs existing drug X using NHST test foo” or whatever.

        • Yes, but before you submit your “official” plan to the FDA it isn’t like you have *no* prior data or intuition.

          So long as you have the leeway to choose your outcome costs you might as well make the approval plan to minimize the apparent impact of whatever events you expect will damage your chance the most.

          Basically choosing your cost matrix (and perhaps other “priors”) is a giant degree of freedom. The filing company is more motivated to trawl for a cost matrix that furthers its case than to choose one most aligned to reality, whatever that means.

          Give a large enough degree of freedom and there exists some spot in parameter space that makes your drug look good.

        • The FDA itself would have input, the FDA could go out and get input from independent third parties, and the existing competitor companies, and/or actually estimate outcome costs from data related to hospitalization records, etc.

          Remember, companies ALREADY do all the trawling etc. that you talk about, except they’re working with a system that specifies NHST-based outcomes that are often really irrelevant to actual outcomes in the world. For example, my wife has talked a little with people at Merck, which has a drug for osteoporosis in trials, and it’s conceivable that, the way things are today, it may not get approved even if it does better than the existing drugs, simply because the thresholds for approval are different than they were back when the previous drugs were approved. How is that good for society? Most likely, once the older drugs were approved, lobbyists for those companies lobbied the FDA to raise the bar for new competitors, right?

          As long as the cost functions and so forth increase and decrease in the correct directions (i.e. more heart attacks = more bad, and more side effects = more bad), I think it would be a step up from what we’ve got, pretty much no matter what.

        • Yes, the trick lies in *not* letting the filing company choose the cost function nor any priors it wants to use.

          Not sure how feasible that is but if a third party panel of experts appointed by FDA is asked to come up with a cost function for various adverse events without being shown anything about the actual data, I think that is a good solution.

          The point I find iffy is tweaking priors after seeing the model results.

        • Daniel:

          You refer to “NHST test foo,” which would be some particular test that someone might use.

          There’s also “NHST test fu,” which is the ability to come up with the test that will give you p < .05 with the minimal amount of effort and the least appearance of p-hacking.

          NHST test fu is very important. Some people have built entire careers off it!

        • I agree with this, and I think it’s a serious problem for drug approval, much easier to think up a special test that seems logically relevant but you know is going to give you the magic p < 0.05 sauce than to meet some actual practically and clinically relevant level of effectiveness. This seems to be especially true in cancer drugs, we’ve got tons of cancer drugs that if you balanced how bad they make you feel (REALLY BAD) vs how much extra life you get out of them (four weeks or something) they’d probably be a net loss for society, but we’re paying BILLIONS for them through insurance and giving them to people who have some kind of basic expectation that “it has really significant effects” means that maybe they’re going to feel fine for a year or two at least. but it really just means that p < .05 that they’ll live some amount longer than if they didn’t take it.

        • In NZ, we have a centralised purchasing agency, PHARMAC. The largely US drug companies have lobbied for years to crush it and the disease it spreads. Its existence threatened NZ’s participation in the Trans-Pacific Trade Deal. This malevolent organisation does something similar to what you suggest, Daniel. It looks at cost and benefits, but mostly at the evidence for effectiveness for patients. This malevolent organisation makes publicly funded medicine, with few real gaps relative to world-class treatment, available for all. I suspect many of the cancer drugs you allude to do not get funded, in favour of innovative treatments that prevent illness or have evidence of QOL benefits. Just a thought. https://www.pharmac.govt.nz/

        • Another point here is that I totally agree with your bit about “you’d require all these outcome badness scores to be pre-specified in the approval plan”.

          To me the analog of this for academic studies is pre-registration. And that’s why I’m puzzled by the antipathy I sense from Andrew etc. to pre-registration.

        • Andrew:

          Ok, apologies about misinterpreting your attitudes towards pre-registration.

          I assumed you didn’t think pre-registration was very useful because:

          (a) I was under the impression that till very recently you never pre-registered your studies (I may be wrong!) If you thought it added value to your studies I guessed you would have pre-registered.

          & (b) You wrote stuff like “Preregistration is fine but it doesn’t solve your problem if studies are sloppy, variation is high, and effects are small” which didn’t seem too encouraging.

          I mean, your own studies aren’t sloppy etc. so why not make them better by pre-registering them?

        • Rahul:

          I haven’t done more preregistration because it takes work, also because it can reduce readability of an article or book if you go through all the steps you do in an analysis. When I’ve tried to be more careful about writing up exactly what I’ve done at each step, journals often ask me to cut.

          To put it another way, in my research, preregistration wouldn’t look like, “We have hypothesis X, so we do test Y.” It would look like, “We had idea X, so we did analysis Y to see what would come up; then we were disappointed not to see Z, so we tried Analysis A, . . .” This could still be a good thing, though, I agree.

          Perhaps with software such as knitR and Markdown and Python notebooks, in which one is empowered to annotate work as it happens, this will be the norm, for data analysts to post their entire “lab notebooks” online to accompany their research reports. Then every fork in the path would be documented.

        • @Andrew:

          So at the end of all those analyses don’t you conclude something concrete?

          Whatever it is that you conclude, why cannot / should not that conclusion (“hypothesis”) be put to test by a subsequent study, which can then be pre-registered explicitly?

        • Andrew:

          So in my naive view I distinguish between exploratory & confirmatory studies. Your description of studies sounded like exploratory studies. There it is totally fine to skip pre-registration.

          OTOH, for the confirmatory variety, I think pre-registration adds value.

          I don’t use “confirmatory” in the sense of “cast in stone” but just to mean that one started out verifying a very particular model or hypothesis rather than just exploring the data and seeing what makes sense.

          I think exploration is great & even very essential but good science needs a mix of both kinds of studies.

          What I find iffy is that in going away from the NHST framework some advice seems to almost eschew any kind of non-exploratory study.

        • I think pre-registration of NHST stuff doesn’t really solve the problem that you’re using NHST… whereas if you’re going to go full Bayesian Decision Theory on a problem, I bet Andrew would get behind pre-registration of a cost function to make actual substantive decisions in the world.

          Though, consider in the case where you do a bunch of testing of your drug, and you use the pre-specified cost function, but you discover that there’s an issue where your drug does some really beneficial thing that no-one expected or included in the cost function… or has some really bad side effect that also wasn’t considered. I think ultimately you will need to revise cost functions to include dimensions that were not thought of in the original analysis. But again, here, a non-financially-interested third party should probably be used in helping to construct this adjustment.

        • To clarify: When you say pre-registration in the Bayesian context, it includes pre-specifying the priors you are going to use, correct?

        • I haven’t thought too carefully about it. It’s not actually a statistical inference issue as much as a game theory issue. The thing is that when the model is simplistic you could pre-specify everything. But when things are complicated that’s pretty hard. If you discover, for example, that some subgroup maybe has a predisposition to certain side effects… you should change your model, and then that throws your preregistration out the window.

          In the end I think preregistration is one way to provide transparency and convince people of the quality of an inference but it’s not the only way, and blind following of a registered plan isn’t guaranteed to produce better results.

        • I’ve been fitting Bayesian models only since 2013 now, but in my experience, for standard models and for standard situations using hierarchical linear modeling in factorial designs, even for un-preregistered studies there is no need to change the priors. I do carry out a sensitivity analysis when I have sparse data (e.g. in meta-analyses where we don’t have a lot of data to work with).

        • @Daniel

          But how do we tell apart a company that honestly settled on its model from one that changed the model 1000x internally till it found one that shows off its drug in a good light for FDA approval?

          We only get to see the one model or one prior that they *choose* to show the FDA. Not the 999 models that were blah for their application.

          That’s why I don’t like the flexibility for ad hoc model changes & prior changes subsequent to having seen the data.

        • @Shravan

          Sure, you have no need to change priors. But imagine you are a not-as-ethical-as-Shravan Pharma firm that has filed an FDA application for a new drug that could make it a billion dollars.

          Now, if the model-results based on the prior you had initially chosen don’t look good for your FDA approval, can we depend on the company to not keep trying new priors till they find a combination that makes their model look great?

          That’s why I feel we need pre-registration for FDA-approvals.

        • Rahul: so long as the cost function is expressing something that relevant people outside the company can get on board with, and so long as the model is expressing some logic that relevant people outside the company can get on board with… who cares how hard they looked? Bayesian inference doesn’t depend on how many things you looked at, it depends on the quality of the plausibility values you wind up assigning to the one problem of interest.

          That’s the nice thing about Cox’s axioms, they say basically that probability calculations generalize logic to real-valued instead of binary plausibility. So the question is what’s that logic look like?

          For example: “If a person eats ‘Captain Fortified’ cereal they will become the next president” vs “If a person eats ‘Captain Fortified’ they will get the full RDA of 7 vitamins”

          You’re not going to let anyone get away with inference using the first model, because it’s based on false premises, whereas it doesn’t matter how hard you looked using high performance liquid chromatography for the micro-nutrients mentioned in the second model, as long as it’s true… it’s true.

          Checking that the model is robust to prior specification, or has used sufficiently broad priors to cover all the possibilities that a third party survey of experts would reveal is probably one of the things you need to do to decide whether you accept the inference, but knowing “how many other things you looked at” ISN’T. That’s a feature of Frequentist statistics, it matters how many things you looked at for “the frequency with which false hypotheses are accepted” etc… but that’s just nonsense. It doesn’t matter how many things you looked at, if ‘Captain Fortified’ DOES have the 7 vitamins… then it’s a true statement.

          So, to be clear, it’s not that Bayesian inference magically fixes the “gaming the system” problem, but it DOES fix the “how hard did you look to find your effect” problem, because you only need to evaluate the goodness of the one final claim being actually made.

        • Sure, if you can manage to create One Bayesian Model to Rule Them All (OBMTRTA), then it will simultaneously give you estimates for any estimand (or “parameter”) of interest, and there’s no issue of fishing. But that’s clearly not what people actually do. For instance, say the model is essentially hierarchical, normal, homoskedastic: a distribution (with mean and variance) for subgroup means. As you’re exploring the data, you make choices as to which covariates to use for creating your subgroups. The “one model to rule them all” would basically be some kind of spike-and-slab prior over how to divvy up the overall variance between the different covariates/dimensions of grouping: you think most of the covariates get zero variance, and a few get to have a share. But fitting that meta-model is hard, so what you actually do is some trial-and-error search over models where each covariate is predeclared as being used or not used.

          Basically, it seems to me that we’d like a way to keep some amount of flexibility in model-building, but have a clearly-defined price to pay for it. Something like BIC: you want another parameter? Fine, no problem; here’s what it costs you. Want to do a subgroup analysis based on birth order? Fine, this is what it costs. Just as BIC is “asymptotically equivalent to the Bayes factor”, this hypothetical statistic would approximate the OBMTRTA somehow.

          In order for that to be possible, you don’t have to fit the OBMTRTA, but you do have to specify, or at least sketch, it. “The OBMTRTA states that any given distribution with support over all reals is going to be a Normal with probability 0.5, a student t with probability 0.2, a skew-normal with probability 0.15, etc.” Then whenever you realize that data outliers mess up a Normal assumption, you just pay the given Bayes factor tax to buy a t distribution and move on. Basically, instead of a garden of forking paths, you now have a garden of forking toll-roads.
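          A tiny sketch of the toll-road idea, comparing a Normal model to a Student-t model for the same (simulated, heavy-tailed) data via BIC, which charges log(n) per extra parameter. This is just the standard BIC formula, not the hypothetical OBMTRTA-approximating statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = stats.t.rvs(df=3, loc=0, scale=1, size=200, random_state=rng)  # heavy-tailed data
n = len(y)

# Normal model: 2 free parameters (mean, sd)
mu, sd = np.mean(y), np.std(y)
ll_normal = np.sum(stats.norm.logpdf(y, mu, sd))
bic_normal = 2 * np.log(n) - 2 * ll_normal

# Student-t model: 3 free parameters (df, loc, scale), fit by maximum likelihood
df_hat, loc_hat, scale_hat = stats.t.fit(y)
ll_t = np.sum(stats.t.logpdf(y, df_hat, loc_hat, scale_hat))
bic_t = 3 * np.log(n) - 2 * ll_t

print(f"BIC normal: {bic_normal:.1f}   BIC t: {bic_t:.1f}   (lower is better)")
```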

        • @Daniel

          Even if you are right, do you see any downside to pre-declaring (in a pre-registration) the priors you will use before having seen the data?

          e.g. If I know you had decided that you were going to use “cognitive score at 2 = (1-score / maxscore) * 100” even before you saw any data somehow it makes me trust your model more than if I suspect you settled upon this particular cost function after playing around with many others after you looked at how they affected your model’s results.

          Perhaps I am irrational about this?

        • Rahul, see above where I say “In the end I think preregistration is one way to provide transparency and convince people of the quality of an inference but it’s not the only way”

          Also, I certainly think that in a medical regulatory setting people should be required to pre-specify as much as they can as a form of transparency, but I think that altering your model and evaluation of the model after the fact is an inevitable part of discovery, so I think it’s a good thing to have the pre-specified plan to show how much alteration was done and in what directions and why. What I don’t agree with is something like “your pre-specified analysis plan for cyclobenzaprine shows that you were going to test its effectiveness as an anti-depression medication, and by those standards it’s a terrible depression medication because it has a heavy sedative effect and makes people completely non functional… therefore we will never approve your drug”

          Well, look, it may well be that cyclobenzaprine is a terrible depression medication, but we discovered that it works well as a muscle relaxant to treat injuries, especially back injuries, and for this purpose it can be taken at night when you’re going to sleep anyway, and it is non-addictive, so now if we compare it to previous muscle relaxants for injuries such as Soma (CARISOPRODOL) it wins in every respect, but because we preregistered to try cyclobenzaprine as a depression medication, we can never ever be trusted to re-analyze the data based on usage for back injuries, so everyone has to keep using this addictive inferior drug instead…

          So, basically I agree with the goal of trying to improve transparency by demanding that people make plain pre-specified statements about their expectations for a drug going into trials, (or any experiment, such as the lithium in drinking water, or changing the policies on certain traffic control measures, or whatever) but I think in the end we should ask “How good is this Bayesian inference regarding the final situation we have settled on” and use the pre-specified record of what people thought and the information about what we finally wound up with, as parts of the evaluation process, not as some kind of strict yes-no decision.

          Also, as to Quinn’s idea regarding BIC etc, no I don’t think that’s the way to go either. In an adversary system where someone’s motives for choosing the model are suspect, it makes more sense for regulators to use input from third parties to request the inclusion of several other models into a meta Bayesian analysis. Then you have something like:

          p(Data | Model1, params) p(params | Model1) p(Model1) + p(Data | Model2,params2) p(params2 | Model2) p(Model2) + …

          where there are say N models each being considered, one from the financially interested party, and several from various third parties. Either one of them will dominate, in which case after seeing the data the others will automatically be ignored by the Bayesian machinery, or all of them remain plausible, in which case we can do our decision theory based on the big mixture model.
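          A minimal sketch of that mixture, for a single response probability with one optimistic (sponsor) prior and one skeptical (third-party) prior, reweighted by their Beta-binomial marginal likelihoods. All numbers are illustrative placeholders:

```python
import numpy as np
from scipy.special import betaln, gammaln

k, n = 30, 100  # hypothetical number of responders out of n patients

def log_marginal(k, n, a, b):
    """Log marginal likelihood of k successes in n trials under a Beta(a, b)
    prior on the response probability (binomial likelihood integrated over the prior)."""
    log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return log_choose + betaln(a + k, b + n - k) - betaln(a, b)

# Two competing priors, given equal prior weight 0.5 each
models = {
    "sponsor (optimistic, prior mean 0.8)":    (8.0, 2.0),
    "third party (skeptical, prior mean 0.2)": (2.0, 8.0),
}

log_post = {name: np.log(0.5) + log_marginal(k, n, a, b) for name, (a, b) in models.items()}
norm = np.logaddexp(*log_post.values())
for name, lp in log_post.items():
    print(f"{name}: posterior model probability = {np.exp(lp - norm):.3f}")
```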

        • @Daniel

          The company can always refile for cyclobenzaprine as a muscle relaxant, submit a new pre-registration, & conduct an independent trial to confirm that outcome.

          You don’t have to throw away the drug. Conduct a trial with that end use in mind. If there was a reason to think it would be a good muscle relaxant in the first place the company could have always added that outcome as a second goal to the evaluation metrics of the original trial.

          OTOH, if your anti-depressive is terrible but serendipitously shows what looks like a totally unexpected (say) anti-acne effect, I think it is fair to ask the company to demonstrate that the effect persists & isn’t an artifact via a separate, pre-registered trial that explicitly mentions the anti-acne goal in its pre-registration.

        • Rahul, suppose there was some reason that information was collected about injuries, so that effectively a “new trial” has already been run. That is, all the information you would collect in a new trial has already been collected in the old trial, but no one bothered to file in the trial plan that they thought the drug would be helpful for injuries. For example, suppose it was a trial of anti-depressant effect in soldiers who had received various skeletal injuries. The skeletal injury data is separate, collected in medical records for other purposes. You give the drug, it works terribly as an anti-depressant, but we notice the soldiers saying that they are experiencing less skeletal muscle pain, and so forth; we go back and get the data from their physical therapists, who are blind to their assignment of the drug, and lo and behold the data is there, and when analyzed with a Bayesian model it looks good.

          The question is, can we trust the inference from a Bayesian model on data that was collected under the same type of circumstances as a new trial would have, or simply because we didn’t write down that we’d like to test this, do we have to throw this information out?

          Note, if you believe in “the frequency with which you are misled by a statistical hypothesis test” as the main issue in statistics, then very likely you need to re-do the trial. But if what you care about is “the plausibility that Data would be the outcome of the process if the parameters were Foo, and the plausibility that the parameters are Foo before you see any data,” then p(Data|Foo)p(Foo)/p(Data) winds up being the same number whether or not you write down “we’ll also see if this drug works for skeletal injuries.” The question is, is the likelihood p(Data|Foo) a reasonable explanation for how Data comes about or not?

          If the functional form of p(Data | Foo) contains a lot of special-purpose bits that you could only know to put into the equations after seeing the data, then you might well argue that this is an inappropriate model to use for the inference: it is too custom-fit to the accidental issues in the data, and if you want to use such a specific likelihood, you should re-run a new trial. But if the likelihood is composed basically of formulas similar to the ones you’d expect people to use before having seen data, then you are in a different situation. It seems stupid to require a new trial under those circumstances.

          This highlights how in Bayesian statistics, there is a CHOICE of likelihood, to represent how much information you have about how outcomes will occur, whereas in Frequentist statistics, in theory the likelihood is a PROPERTY OF THE UNIVERSE (the “true” long term frequency distribution of the data).

        • > The question is, can we trust the inference from a Bayesian model on data that was collected under the same type of circumstances as a new trial would have, or simply because we didn’t write down that we’d like to test this, do we have to throw this information out?

          Daniel: maybe *you* can trust the inference from a Bayesian model. But maybe a regulator shouldn’t trust *you* when you present your inference, because you may be presenting a subset of the information in a misleading way according to your interests. It is in the interest of the regulator to ask for a new trial, to be sure you’re not playing any games. I think this is not so much about bayesian vs. frequentist inference, and more about your inference vs. an inference that all the parties can agree on.

        • Carlos said:
          “I think this is not so much about bayesian vs. frequentist inference, and more about your inference vs. an inference that all the parties can agree on.”

          I see it as heavily about bayesian vs frequentist inference — although it is also heavily about whether or not all parties can agree on the inference. And whether or not one can agree on the inference depends in part on whether or not one accepts only frequentist, only Bayesian, or both as viable methods of inference. Both can be used to “play games” — in particular, the garden of forking paths applies to both. An important part of deciding whether or not an inference is (or should be) acceptable is whether it is consistent with the type of inference (frequentist or Bayesian) that is used. I see Daniel’s comments as trying to point out how what use of data is acceptable depends on whether one uses frequentist or Bayesian inference.

        • Carlos: that’s exactly my point though. The question comes down to “does the model make sense?” or am I “presenting a subset of the information in a misleading way according to [my] interests”

          So, since all the data collected in a clinical trial is in theory available to the regulator we can be certain that I’m not presenting a “subset”, and since the Bayesian model should be presented as runnable code, there’s no “misleading”.

          The question then comes down to Bayesian vs Frequentist because when a regulator sits down to evaluate whether the model code when run on the full complete data make sense, the question of whether the model makes sense has two interpretations.

          1) Bayesian case: The probabilistic model makes sense to someone who hasn’t seen the data yet based on a reasonable explanation of some scientific concepts, and the priors over parameters are not overly specific compared to what someone who hadn’t seen the data would think. The likelihood is thought of as a representation of knowledge, not as a frequency of occurrence. A Bayesian interpretation of this model means that the inference is totally legit. It just DOESN’T matter how many other hypotheses you might have looked at. Just like it doesn’t matter if I have a whole book full of “if we give Benadryl it causes muscle relaxation… if we give ibuprofen it causes muscle relaxation, if we give salt pills it causes muscle relaxation, if we give ground up horse hoof it causes muscle relaxation…” type rules with lines through all of them, until I got to the final one “if we give cyclobenzaprine it causes muscle relaxation”. Either the last one is true or it isn’t independent of how many other ones I looked at.

          vs

          2) Frequentist case: The test you are performing is based on having looked at a bunch of other hypotheses secretly, which may have all been rejected (all my “if I give…. then….” multiple comparisons), or on a wide variety of plausible-sounding hypotheses being available to you which you could quickly choose from after seeing the data (“you could have very quickly come up with muscle relaxation, or arthritis pain, or autoimmune, or a variety of other possible outcomes to check”… this is the garden of forking paths theory), so that the *frequency with which a p < 0.05 result would occur when doing SOME selected test* is much higher than 0.05, and hence we can’t trust the test to be a good filter, because you might have just fished for this result, and/or thought it up post hoc after seeing the data, or whatever.

          Note, I fully agree with the need to carefully scrutinize the legitimacy of the model, and whether the model is fit to specific sub-groups etc. which you probably only would have thought to do AFTER seeing the data, even in the Bayesian case… But if the model doesn’t have this sort of “overfitting” of the likelihood, and doesn’t have “hyper-precision” of the priors… it’s a legitimate inference independent of how many other models you might have fit to other outcomes etc.

        • Martha, although I kind of agree that the concept of the “forking paths” still has some merit in Bayesian inference, the main way in which it has merit is to point out that since your choice of Likelihood is in fact *a choice* (ie. there is no “one true a-priori likelihood which is a property of the universe”) you need to evaluate whether that likelihood makes sense in the context of the fact that it may have been chosen after-the-data. When there is no strong prior knowledge about the right form for the likelihood, choosing one likelihood after seeing the data may be a poor approximation to the “bigger” model of specifying a mixture between several options, or whatever. In essence, you may be putting a point-mass prior on your one choice of likelihood that is not justified. The more plausible sounding options there are in choice of likelihood, the less good is the approximation of picking just one and ignoring all the others, ie putting a strong prior p(Model_0) = 1 and p(Model_i)=0 for all other i.

        • @Martha:

          >>> the garden of forking paths applies to both [Bayesian & frequentist inference].<<<

          Thanks. I don't see this fact acknowledged much.

        • > there’s no “misleading”.

          Will you tell the regulator about all the combinations of outcomes and subgroups that you tried ? Either actual attempts to “get something” or the implied bias in doing the bayesian analysis only on what looks promising after looking at the results.

          > it’s a legitimate inference independent of how many other models you might have fit to other outcomes etc.

          Your inference is a posterior distribution that might assign more probability to some effect but will still assign some probability to no effect. Saying that you have “proved” the effect because the “no effect” region has low probability may look like a legitimate inference to you, but maybe it is not as legitimate for the regulator, because his previous knowledge (K if you will) includes the fact that he only gets to see “cherry-picked” analyses and the posterior might be different for him. I’m sure you might find a way to systematise this shrinkage.

        • @Anoneuoid

          Those steps are fine. But can you adapt them to include how we should go about decision-making? The sort that the FDA must do.

          At the end of it all is finally a YES / NO decision, right?

          I guess what we are agonizing about here is the optimal, robust set of rules to convert the continuum of observations, priors & model-choices into a binary decision.

        • @Rahul:
          “At the end of it all is finally a YES / NO decision, right? ”

          I’d say yes and no: Yes, there is a yes/no decision about whether or not to approve the drug. But FDA approval, in my understanding, is for a specific use. Complicating matters, once approval has been given, physicians have the right to prescribe the drug for other purposes (“off label”). So, logically and ethically, there are lots of factors to consider. Given these circumstances, I don’t think any pre-specified yes/no decision criterion is ethically justified; benefits and possible harms (including those that might result from “off-label” use) need to be weighed by considering them individually, which can only be done *after* their existence is known and their probabilities are estimated.

        • @Daniel

          In my naive, simplistic viewpoint pre-registration is all about having the rules in place before you run the race.

          Otherwise, there’s just too much temptation to shift the goalposts after you see the data.

        • @Daniel

          You write:

          “can we trust the inference from a Bayesian model on data that was collected under the same type of circumstances as a new trial would have, or simply because we didn’t write down that we’d like to test this”

          Doesn’t this boil down to the question of whether a certain dataset (i.e. a particular sample drawn from the large population) can or cannot have artifacts which are not actually features of the larger population?

          If you pre-declare the specific pattern you are looking for then the chance of the same pattern appearing in the specific sample you draw as an artifact is much lower than if I allow you to trawl the sample & report any particular pattern you might post hoc observe as a “feature” of the population.

          If you try hard enough you can make *any* data sing. You are almost guaranteed to discover *some* artifact!

        • Right, but to discover an “artifact” will require that you adjust your likelihood to fit the pattern: “for males between 30 and 60, with BMI over…and a history of rheumatoid arthritis, who never had chickenpox, and have more than one child… the drug is 9x more effective!” (n=2) :-)

          In the absence of that kind of “post data” information being put into the likelihood, you can still trust the inference. ie. “For males, the drug was effective for outcome X” is likely to mean it really was, because splitting out by sex is the kind of thing you’d have probably done even *before* seeing the data. The fact that you checked it for effectiveness on 37 other outcomes before finding outcome X is irrelevant to a Bayesian analysis.

          The question is not “can you make the data sing?” but more like “did you FORCE the data to sing?” :-)

          I’m happy to agree that if you have a highly specific sub-grouping, and a bunch of unusual covariates for which there is no a-priori reason to expect differences, you should re-do the testing to see if the results persist, but I still think you can combine the old and new data rather than being forced to start from scratch.

          Also, remember, the more you torture the data to define sub-groups etc, the less informative the data will be about the sub-groups as the sample size gets smaller. So in a Bayesian analysis, you’d expect that it’d be harder to find a practically valuable effect that is convincing if you do that kind of data torturing, provided you are forced to justify your priors to an external group.

        • @Daniel

          >>>The question is not “can you make the data sing?” but more like “did you FORCE the data to sing?”<<<

          Right, but knowing the background of FDA approvals, and what's at stake isn't it a fair assumption that the Pharma companies will force it? They will try whatever it takes to make the data sing.

        • What I can’t seem to express clearly enough is that in the Bayesian case the evidence for forcing the data to sing is there in the likelihood. The number of alternative things you looked at in secret is irrelevant.

          In the Bayesian case the regulator SHOULD BE scrutinizing the form of the model and looking for evidence of torturing the data, and should be developing potential alternative models to compare to, but should not be worrying about whether many other outcomes and models were tested in secret.

        • Daniel: “The number of alternative things you looked at in secret is irrelevant.”

          This is a false statement that many Bayesians trip over (assuming that you finally present the model that has the greatest evidence for the claim you want to make). The key is understanding that if you take a set of random variables that all have individual expected value equal to some theta, then a new random variable defined as the *maximum* of that set no longer has expected value theta (assuming positive variance), but rather an expected value strictly greater than theta. It doesn’t take too much mental exercise to apply this concept to Bayesian multiple hypothesis testing and realize that you have a problem if you cherry-pick your model. If you had cherry-picked your hypotheses for reasons *independent* of the high posterior probability of alternatives, then you should not be so worried about multiple comparisons.
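          A quick simulation of that point (my illustration, with Normal draws around a true theta of zero): each individual estimate is unbiased, but the maximum of several of them is biased upward.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sd, n_estimates, n_sims = 0.0, 1.0, 10, 100_000

# Each column is an unbiased estimate of theta; the row-wise maximum is not
draws = rng.normal(theta, sd, size=(n_sims, n_estimates))
print("mean of a single estimate:", round(float(draws[:, 0].mean()), 3))
print("mean of the maximum of 10:", round(float(draws.max(axis=1).mean()), 3))
```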

          I’ve got half a mind to write up a post with a clear explanation and demonstration at some point. If you’re interested in reading (and tearing apart) the post, I’ll let you know when it’s up.

        • Cliff AB: would be happy to read your post. But I think you’re mistaking my position.

          I’m not saying it’s OK to do the following:

          “I want to claim that my drug is good at curing baldness, so I’ll try one after another a variety of different special purpose likelihoods defining how my drug affects baldness, and then I’ll pick the one that makes my inference on baldness look as good for my drug as possible”. This is clearly non-bayesian and I think approximates what you’re talking about. If there are several truly plausible models for how things work, they need priors over the models, and Bayesian model selection. Only if the priors over the models are not unreasonably skewed for pre-data priors, and the final result is basically a delta-function around the chosen model will the cherry-picking case equal the actual Bayesian calculation. (In reality, this is basically never). THIS IS THE CASE where the regulator should re-analyze with a different state of knowledge, or force the Bayesian to re-do the experiment to show that the predictions of the convoluted model hold stable. The evidence that this happened is that the likelihood will be a complicated slicing and dicing of the data showing up differences between groups that *prior to seeing the data* you never would have thought to slice and dice. The model choice is now conditional on having seen the data, it is not valid Bayesian inference. In essence you’re doing p(Data | Data_Dredging) where you should be doing p(Data | what_you_knew_before_seeing_the_data). A regulator who sees a fairly complex likelihood should be asking questions like “what is the justification for all these modeling choices…” and then maybe demand “I also want you to run the model where you make some much more simplified modeling choices” and soforth.

          but I *am* claiming that the following is true:

          “I want to find *something* that my drug is good at, and since I measured 400 different outcomes in my sample of 2000 people, I will look for which of those 400 outcomes it has a positive effect on, using a likelihood/model that is the kind of thing that you the regulator would think up on your own without having seen the specifics of my data. ”

          The fact that the second case turns up “your drug is pretty good at curing baldness” has nothing to do with whether or not I looked at the other 399 different outcomes (toenail fungus, depression, chickenpox, improved fashion sense, agoraphobia…etc).

          Whether something is “the kind of model the regulators would think up on their own without having seen specifics of the dataset” is a subjective question, so it needs to be evaluated in various ways and by various people. One way would be to literally go out and find a 3rd party who knows nothing much about the trial, but does know Bayesian modeling and some general pharma principles, and discuss how they might write up a model.

          Bayesian models are conditional on a “state of information” K. In an adversarial system, if your adversary uses a highly specific set of assumptions to go into their model which you yourself are not convinced are good ideas, you’re not obligated to believe their inference.

          On the other hand, in the Frequentist conception, there is in theory basically *no choice* in the matter of Likelihood. In theory you should be using “the true frequency distribution of outcomes in repeated sampling”. To the extent that you have a choice it’s just to choose some approximation to the “true” distribution that is a good enough approximation.

          This highlights very important differences between Bayesian inference and Frequentist inference. You are free to reject the Bayesian model as not representing a valid state of knowledge. But if you agree that it does represent a valid state of knowledge, then you are not free to claim that the validity of the inference depends on how many other unrelated questions you asked of the same data.

        • @Daniel

          >>>In essence you’re doing p(Data | Data_Dredging) where you should be doing p(Data | what_you_knew_before_seeing_the_data). <<<

          Yes, and that's why I'm saying the best way to ensure the sanctity of "what_you_knew_before_seeing_the_data" is to force you to declare it in pre-registration *before you actually saw any data*.

          You seem to agree that in the case-1 you describe the regulator must “force the Bayesian to re-do the experiment”.

          My point is that, in practice, the regulator will have to behave as if all cases are potentially similar to case-1, because I don’t think there’s any feasible way for the regulator, post-hoc, to tell apart your OK & not-OK scenarios.

        • Also, Cliff, you’re right that the other hypotheses need to be independent. For example, knowing that your drug doesn’t help reduce eosinophil activity might tell you something about how well you’d expect it to work for eczema, or allergic rhinitis, etc. So if you tested a bunch of stuff that is related to your final outcome and found it wasn’t very effective, and then you test one final outcome and it is effective, then… there are two possibilities: one is that you’ve discovered new science, for example perhaps a new mechanism by which you can improve allergic rhinitis that’s unrelated to certain types of immune cells…. and the second is that you found some accident of the data which won’t persist.

          So, it’s not that you can go around testing just ANYTHING and then ignore all the other stuff to get your final inference, but you can go around testing stuff that your current state of knowledge suggests are independent questions (toenail fungus, and agoraphobia maybe)

          So, this may be where Rahul and others see potential for abuse, assuming that if you’re going to do a trial for something, say anti-depression activity, then if you don’t find effectiveness there, you might start testing for anti-psychotic, or anti-anxiety, or whatever, and you keep finding negative results, and then you test for say difficulty sleeping, and all those other negative results might inform you about how well it should work on sleeping, but they’re hidden and private to the Pharma company, and that’s a legit concern.

        • Rahul says: “My point is that, in practice, the regulator will have to behave as if all cases are potentially similar to case-1, because I don’t think there’s any feasible way for the regulator, post-hoc, to tell apart your OK & not-OK scenarios.”

          I disagree with the strong version of this, that is, treating them as if they are “definitely similar.” Whereas “potentially similar”? Yes, fine. You should check for signs of not-ok stuff going on. But acting as if every case is case-1 is going to be expensive and inefficient and impede the progress of medical advancement. It’s actually pretty common for a drug to be thought up for one thing, and then turn out to be more useful for something else. So this isn’t just a theoretical case. Cyclobenzaprine is a real world example; I’m guessing Pharma people will be able to come up with a pretty big list of others.

          I think that to anyone capable of evaluating a statistical model, the evidence for significant data dredging is often pretty obvious. Does the effectiveness metric rely on a likelihood that uses all kinds of seemingly ad-hoc covariate structure? Like the “for males between 30 and 60, with BMI over…and a history of rheumatoid arthritis, who never had chickenpox, and have more than one child… ” example? This is pretty obvious data dredging, because, why would you even think to look at chickenpox status, or how many children, or rheumatoid arthritis???

          Unless those things come to mind right away when you’re thinking about the condition you’re treating, like maybe it’s a strange immune system related disease with a genetic component that sometimes causes infertility and obesity but sometimes doesn’t and is somehow known to be related to exposure to Varicella virus…

          but if we’re talking about a cure for toenail fungus and it has nothing to do with the immune system or fertility or chickenpox, then it is suspicious to have all these special-purpose components to the likelihood.

          Basically I think that a Bayesian model is a little like a mathematical proof, it’s an argument for making some assumptions, and then finding out what some data together with those assumptions tells you about what you should think about the world. The assumptions SHOULD be questioned, but for the most part, they should just be the assumptions that are right there in the Stan / BUGS / JAGS / whatever code. If you do question the assumptions, then by all means, simplify them, alter them, re-analyze the same data, and if you get contradictory results under different models, put them all into one big model and do Bayesian updating across multiple models. If you still get contradictory results, then definitely ask for more data, ask for trials that would differentiate between the models, etc.

          But, false discovery and multiple testing and forking paths… those are really about *frequencies* with which hypotheses will be rejected etc. That’s just NOT what the Bayesian mathematics is about, and it doesn’t apply in that way.

          We’ve identified some good things to think about here that DO apply to Bayesian inference. It’s definitely not the case that you can say “Bayes = always good inferences” or something like that, but the questions you need to ask are questions about the Stan code being submitted for approval.

          The question can be answered (partially) by pre-registration, but it can also be answered by introspection, by interviewing outside third parties, and by various other means. But it’s important to remember what the real question is: “How well does this Stan code represent a plausible model for how the world really works?” It is not a question about how frequently you’ll make various types of errors if you let people do this kind of stuff repeatedly across hundreds of different regulatory applications, which is the legit concern under Frequentist inference.

        • Daniel, could you point out in which of the following cases (if any) you would think I may be misleading you, so that you shouldn’t take my Bayesian inference at face value (assuming the priors and likelihoods are OK as far as you are concerned)?

          A) I do 400 studies, on groups of 2000 people, looking every time at the same outcome. I send you the result I like better, without telling you about the others.
          B) I do 400 studies, on groups of 2000 people, looking each time at a different outcome. I send you the result I like better, without telling you about the others.
          C) I do 400 studies, with common data taken from a single group of 2000 people, looking each time at a different outcome. I send you the result I like better, without telling you about the others.

        • Daniel, Rahul et al,

          I think you have been misled by the idea that there is much to be learned by discovering the presence of an “effect”. This is simply not very useful information, no matter what the circumstances of the data collection and analysis. When you take into account the real-life issues that researchers have to deal with (many of which have been mentioned here), it is outright meaningless.

          At a minimum you need to: 1) get some kind of time course, dose-response curve, or similar; 2) come up with a model to explain the properties of that curve; 3) test the model on new data; 4) fix your model to address the deviations and repeat on further new data; 5) go to 3.
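
          To make steps 1-3 concrete, here is a rough sketch in Python, assuming a hypothetical Emax-style dose-response model and entirely made-up data (none of these numbers come from any real trial):

          ```python
          import numpy as np
          from scipy.optimize import curve_fit

          # Step 1: hypothetical dose-response data (made up for illustration only)
          dose = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
          response = np.array([0.05, 0.12, 0.22, 0.45, 0.60, 0.72, 0.78])

          # Step 2: a candidate model for the curve -- an Emax model is one common choice
          def emax(d, e0, emax_, ed50):
              return e0 + emax_ * d / (ed50 + d)

          params, _ = curve_fit(emax, dose, response, p0=[0.05, 0.8, 5.0])

          # Step 3: check the fitted model against new data it has never seen
          new_dose = np.array([3.0, 15.0, 40.0])
          new_response = np.array([0.33, 0.66, 0.75])   # also made up
          print("prediction error on new data:", new_response - emax(new_dose, *params))
          # Steps 4-5: if the deviations are systematic, revise the model and repeat.
          ```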

        • Carlos:

          In A you are hiding **data** from me; in fact, you’re hiding 99.75% of the data from me. So I strongly disagree with that being acceptable, any more than it would be acceptable to do one study on 800,000 people and then hand-select 2000 of them to send to me for review.

          In B you are hiding data from me as well, but it’s not data where your outcome measurement was (supposedly) relevant to the outcome of interest. If you send me all of it and tell me “here’s the data on 399 other trials for other purposes”, then this is acceptable and we should make our inference on the outcome you are interested in. However, I may insist that we use the 399 other trials as background information to help us set priors, or for purposes of *safety* rather than efficacy.

          In C I see all the data, and you are free to point out what you think is the most important of the outcomes and what you think is the right model. The 399 other models you ran don’t bother me; I’ll look at your submitted model and try to attack it to see if you’re doing something unkind to the data.

          So:
          A = NO definitely
          B = OK if you send me the data so I can use it in my background info and safety evals.
          C = Fine, but still needs scrutiny as does any submission

          At least that’s my current thinking.

        • I think the general statement of my position is something like this:

          It’s OK to hide your private *thoughts* from the regulator, everything else should be public, including data, and the code you used for the public model you’re submitting.

          Other Bayesian models that you ran are included in your private thoughts (Bayesian models are more or less mathematical formalizations of certain kinds of thoughts).

          In the end, the regulator should be coming to their own conclusions based on a knowledge base; the point of the submission is to affect that knowledge base (by pointing out which knowledge base you adopted in your Bayesian model submission), but the regulator is free to criticize that knowledge base (i.e., look for it being overly specific compared to what they think it should be).

          In the end, it behooves a Pharma company to adopt relatively neutral priors and relatively easily justifiable likelihoods because then there will be little to criticize. Of course, if that shows things are not effective… then you should probably expect bad behavior on the part of a highly financially interested party.

        • Daniel: thanks for your reply. If you want to play another round:

          It seems you have a problem with my hiding of data. Imagine a slightly modified set of scenarios where, instead of me doing the 400 analyses, we are 400 independent analysts and we send our “best” model without sharing any additional info among ourselves. So I wouldn’t be hiding anything except for the fact that 399 additional analyses were carried out but were not submitted, a bit of information that I understand you find irrelevant. How would that change your answers?

          I also have a question about your answer “C = Fine, but still needs scrutiny as does any submission.” Just to be sure: the scrutiny needed will be the same for my submission, when you know (or suspect) that I had checked the 400 possible submissions, and for someone else who preregistered a study, intending to analyse a single outcome, and then submitted the results. Correct? Assuming the priors, likelihood and data in both submissions are identical, of course.

        • “Imagine a slightly modified set of scenarios where, instead of me doing the 400 analyses, we are 400 independent analysts and we send our “best” model without sharing any additional info among ourselves.”

          OK, first let’s be clear, because it’s not explicit here: there is still only the one set of 2000 patients, and I get to see all the data, right? Whenever I’m not seeing all the data I cry FOUL. So, assuming I’m seeing all the data, the fact that 399 other people all analyzed the same outcome and you sent me the “best” model is OK. But it’s still basically your “case C” and I still think the model needs scrutiny. You need to convince me not only that your model “shows an effect” but also that I should believe in the model, in other words, that it’s based on a knowledge base that makes sense to someone who hasn’t seen all the data and combed through it. I need to comb through your ONE submitted Stan code and see the justification for all the model choices you made. And that’s true whether you run 1 model or 400.

          Anoneuoid: I’m assuming a Bayesian decision theory framing here, so “shows an effect” means that what’s being evaluated is the net “goodness” or “utility” of whatever the time course, dose-response properties, etc. turn out to be. I agree with you scientifically, but from a regulatory perspective we need not so much an answer for how the science works as a measurement of the expected goodness of outcomes. Let’s for the moment assume that the “goodness” function has been previously agreed upon through some kind of negotiations and voting and discussions among many parties, because that question adds a level of complexity that isn’t needed for this discussion.

        • Maybe I should have written out the new scenarios explicitly. I’ll do that now:

          A2) 400 different analysts do studies, on groups of 2000 people, all of them looking at the same outcome. My study happens to have the “more interesting result” and I send it to you. I’m not hiding any data from you, because I don’t know anything about the data in the other studies (but I’m not telling you that they existed and were “less interesting”).
          B2) 400 different analysts do studies, on groups of 2000 people, each one looking at a different outcome. My study happens to have the “more interesting result” and I send it to you. I’m not hiding any data from you, because I don’t know anything about the data in the other studies (but I’m not telling you that they existed and were “less interesting”).
          C2) 400 different analysts do studies, with common data taken from a single group of 2000 people, each one looking at a different outcome. My study happens to have the “more interesting result” and I send it to you. I’m not hiding any data from you, because the same data was used in all of them (but I’m not telling you that they existed and were “less interesting”).

          As for the second question, I understand that your conclusion after looking at my data, my choices, and my code would be exactly the same as the one you would reach if the same choices and code had been preregistered and I had sent you an identical analysis once the data was available.

        • Carlos, as to your question “the scrutiny needed will be the same for my submission, when you know (or suspect) that I had checked the 400 possible submissions, and for someone else who preregistered a study, intending to analyse a single outcome, and then submitted the results”

          Yes, the preregistration allows this scrutiny to be done ahead of time, and most likely makes that scrutiny go faster. Furthermore, it then allows the back-end analysis to go more quickly because the scrutiny already happened, and this may have some benefits for everyone. But in the end, the regulator needs to see the Stan code (or whatever) and be convinced that all the modeling choices are ones which can be justified by a state of knowledge that doesn’t already condition on knowing the data. We’re trying to get

          p(Data | Params, KnowledgeThatMakesSensePreData)

          not

          p(Data | Params, PseudoKnowledgeThatMakesMoneyForBigPharma(Data))

          I disagree with the idea that we can never evaluate this after the fact, and therefore that all decisions must come from pre-registered plans. But, I DO agree that we need to evaluate it.

        • Carlos, I think your three explicit scenarios don’t quite hold together, because it’s not possible to know which study had “the more interesting result” without sharing information about the results. Whoever submits the application needs to have seen all the results, and hence they ARE hiding data from me.

        • Here are some other thoughts I’ve had:

          Worst case: Do a study on 800,000 people, choose the 2000 individuals who present your favorite result, put them in a small dataset with your favorite model, and submit it as if the other stuff doesn’t exist. This is BAD JUJU and should probably be punishable by jail time. Why is this bad? Because the magnitude of the bias you can induce is something like quantile(result, 0.9975) – quantile(result, 0.5), which could be really big.

          Less bad case, still bad: Do 400 studies, each on 2000 randomly assigned people, all with the same outcome in mind. Choose the study that shows your favorite result. Submit it. Why is this less bad? Because now the magnitude of the bias that you can induce is proportional to 1/sqrt(2000), so you can make your result look maybe 2% better by selecting which batch of 2000 to send (a rough simulation of these two magnitudes appears at the end of this comment). Also, if there are rare but severe side effects, you may be hiding those, and those could have significant effects on the net utility. Again, HIDING DATA = BAD. But the fact that you’re working with a sample of 2000 that is all one block, not hand-picked individuals, reduces your control over the outcome.

          Even less bad case, still a bit bad: Do 400 studies, each on 2000 randomly assigned people, all looking at different and probably unrelated outcomes; choose the one that seems most favorable, and do a submission claiming to treat that outcome. Why is this bad? Again, hiding data. The data on the other outcomes may be considered “unrelated” to the main outcome, but it is probably still relevant for things like side effects and safety. Hiding data = BAD, but at least you’re not hiding relevant outcome data.

          Not bad, just has to be assumed to be going on anyway: Do one study on 2000 people, slice and dice the data 400 ways, submit your favorite model, hope I don’t question you too hard about it. It’s my job as the regulator to question you hard about all your modeling choices. If I find that you’re dicking me around, I will come down hard on it, but if I find that your model choices made sense, whether you looked at all those other 399 models doesn’t really matter, just like it doesn’t really matter whether you privately go to confession and tell your priest that “really I don’t think this drug is effective”. The question is, IS it effective, not “what do you privately think?”

          Easiest case: Preregister your model, do your study on 2000 people, do the analysis; everything is smooth. This is good because it lets me as the regulator see your model ahead of time and determine whether it makes sense, in just the way I would in the above scenario. Most likely, if you don’t yet have that much data, you will not be able to swing the results that much, so your likelihood/code will not be too weird, and things may go quicker. But note, I have to assume you’ve got hidden data from lab animals etc. that you’re using to inform your model anyway, so I do still have to scrutinize your model. The main advantage here is that the overall time and cost involved in the scrutiny is reduced.

          So, I totally agree that preregistration is a good idea in this regulatory scenario, because I think it reduces economic costs and makes things more transparent. But in a scenario where, say, a secondary outcome is measured and turns out to be the most interesting one, analyzing that outcome post-data using a model that makes few assumptions still gives you valid inference, regardless of how many other ways you sliced and diced the data privately (or what you told your priest).
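
          To put rough numbers on the “worst case” and “less bad case” above, here is a small simulation sketch; the outcome is a made-up benefit score with standard deviation sigma, and nothing here comes from the actual trial:

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          sigma = 1.0   # per-person outcome standard deviation (illustrative)

          # "Worst case": measure 800,000 people, hand-pick the 2000 most favorable.
          pool = rng.normal(0.0, sigma, size=800_000)
          cherry_picked_mean = np.sort(pool)[-2000:].mean()

          # "Less bad case": run 400 honest studies of 2000 people, submit the best one.
          study_means = rng.normal(0.0, sigma, size=(400, 2000)).mean(axis=1)
          best_study_mean = study_means.max()

          print(f"bias from hand-picking individuals:     {cherry_picked_mean:.3f} sigma")
          print(f"bias from picking the best of 400 runs: {best_study_mean:.3f} sigma")
          # The first is on the order of 3 sigma; the second is only a few percent of
          # sigma (roughly 3 standard errors of the mean, i.e. about 3/sqrt(2000)).
          ```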

        • Also, Carlos, consider the following scenario:

          400 different groups all do studies of 2000 people on the same drug, looking for good outcomes in a variety of things. One of them gets a result they like and submits the drug without consulting with anyone else.

          Now, they really AREN’T hiding anything; they’re totally unaware of the existence of those other 399. Of course they submit their results, having gotten something good. I think we HAVE to accept this result as valid, because there’s just no way our inference can depend on what might have happened somewhere else that none of us know about. I mean, with billions of galaxies in the universe, there are probably 400 other planets where who knows what kind of drugs are being tested on who knows what kind of life-forms. But, since Frequentist inference does apparently depend on how many different tests were run, the existence of those other galaxies actually invalidates all of Frequentist inference ;-)

        • Daniel, obviously I don’t need to share the data to decide which analysis to submit, only some kind of score that measures how interesting the result is. I wrote at 12:39pm: “I wouldn’t be hiding anything except for the fact that 399 additional analyses were carried out but were not submitted, a bit of information that I understand you find irrelevant.”

          Yesterday you wrote “The fact that you checked it for effectiveness on 37 other outcomes before finding outcome X is irrelevant to a Bayesian analysis.” Now you tell me: “Whoever submits the application needs to have seen all the results, and hence they ARE hiding data from me.” Does this observation apply to all the scenarios?

          Remember that you *were* claiming that the following is true:

          “I want to find *something* that my drug is good at, and since I measured 400 different outcomes in my sample of 2000 people, I will look for which of those 400 outcomes it has a positive effect on, using a likelihood/model that is the kind of thing that you the regulator would think up on your own without having seen the specifics of my data. “

          ARE you hiding data from the regulator when you don’t tell him that you have seen the results for the analysis of all the 400 outcomes trying to find *something* that your drug is good at?

        • Daniel, regarding the scenario you proposed at 4:01 pm: this is the point I was trying to get to. I completely understand that a Bayesian analysis is based only on the model (likelihood), the priors, and the data. However, I don’t think you can naively trust an inference made with data supplied by a third party at their discretion.

          Even if the likelihood and the priors are fine (imagine you are the regulator and you have given very detailed guidance specifying the acceptable priors and likelihoods), you can’t trust results (i.e., data) which are self-reported. Even if people are not actually hiding data when they submit their results, their decision whether or not to submit is outside your control and introduces a bias.

          Letting people submit their results only when they feel like it is equivalent to letting them run the same experiment several times until they get the result they want. Somehow you think the latter is wrong, because it is hiding data, but you seem to be fine with the former.

        • Carlos, I think we’re in complete agreement regarding people selecting the *data* they submit, and this has been my consistent view throughout this discussion. Whether they select data by looking at all of it, or by just looking at a summary from a model, or by attempting to compartmentalize things into different analysis groups, etc., if in the end they collect N data points and submit n < N, then they are doing something wrong.

          On the other hand, if they submit all N data points and one model, and they have 400 private models run on the same dataset, I don’t think this by itself invalidates their model in ANY way. The only thing that invalidates their model is if you look at the actual computer code (Stan etc.) and see that the model makes invalid assumptions. One way that could happen is if they search for assumptions which will give them a favorable result. Another way is if they’re just not very good at modeling. Whichever way it happens, you have to check the content of the model as expressed right there in the Stan code.

          This, however, does not seem to be true for Frequentist inference (and by this I mean explicitly testing at p < alpha, not likelihood-based inference). The theory there is that the frequency with which null hypotheses will generate p < alpha is somehow small enough that it will filter out most of the “wrong” hypotheses. However, all you need is a large enough number of plausible hypotheses and then N x alpha > 1, i.e. the expected number of hypotheses clearing the threshold by chance exceeds one. So knowing p < 0.05 for the hypothesis submitted doesn’t by itself tell you enough; you also need to know “N”, the number of private hypotheses being tested, and by Andrew’s account you actually need to know N*, “the number of plausible hypotheses you *could have easily* generated post-data” (garden of forking paths).
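
          As a quick illustration of the N x alpha point, here is a made-up simulation with 400 candidate hypotheses that are all, in fact, null:

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          n_hyp, n, alpha = 400, 2000, 0.05

          # 400 null hypotheses, each tested on its own sample of pure noise
          sample_means = rng.normal(size=(n_hyp, n)).mean(axis=1)
          z = sample_means * np.sqrt(n)            # z-scores under the null
          p = 2 * stats.norm.sf(np.abs(z))         # two-sided p-values

          print((p < alpha).sum(), "of", n_hyp, "null hypotheses reach p <", alpha)
          # The expected count is N * alpha = 400 * 0.05 = 20, so finding
          # "p < 0.05 somewhere" is nearly guaranteed once enough hypotheses are in play.
          ```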

          I personally just reject inferences based on p values except in very specific circumstances.

          Private thoughts don’t matter, private DATA matters.

        • “ARE you hiding data from the regulator when you don’t tell him that you have seen the results for the analysis of all the 400 outcomes trying to find *something* that your drug is good at?”

          Are you hiding data from the regulator when you fail to tell him what you told your priest in confession? Let’s just limit the definition of “Data” to “measurements taken on experimental units”.

          If you collect a matrix of 2000 rows (one for each person) and 400 columns (one for each outcome) and you submit this matrix to the regulator, you are not hiding data by this definition, even if you only submit a model analyzing the kth column.

          If you collect a matrix of 2000 rows and 1 column, and then submit this data, and one model, even though you have run 400 other models, this is not hiding *data*, the only question is whether the regulator should believe your likelihood, and that question can be answered by looking at the likelihood and discussing the issue with third parties who can offer information about whether the choices made are based on science known prior to the trial.

          If, on the other hand, you collect a matrix of 800,000 rows and 1 column, give blocks of 2000 rows to different analysts, each of whom comes back with a “score”, and you submit the block of 2000 rows that got the best score from the compartmentalized analysts… pretty obviously there are 798,000 rows missing. You’re hiding that data. Furthermore, you’ve altered the data collection process, and hence the likelihood needs to take this into account; hiding the existence of that data and the filtering process hides the fact that the likelihood needs to be altered. The fact that you yourself haven’t actually printed out and read through the full data matrix doesn’t mean you aren’t hiding data, and if you didn’t mention you were planning to do this compartmentalized process in your trial plan, then as far as I’m concerned you’re in for jail time.

          If SOMEONE ELSE who neither of us know about collected the other 798,000 rows, and you’ve never seen them, you’ve never seen scores derived from them, you have absolutely NO information about them, and you submit your 2000 rows with 1 column, then you are not hiding data, and your data selection process is not altered, and so no likelihood changes are needed, and so the inference is valid.

          If p(Data | Params) is a likelihood whose content you agree with, Data is an unfiltered dataset (it contains all the measurements that either of us knows about, collected according to the method described in the trial plan), and p(Params) is a prior whose content you agree with, then p(Data | Params) p(Params) is proportional to a posterior that you have to agree with. Agreeing with the likelihood and agreeing with the prior means you have to agree with the posterior.
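
          As a toy version of that claim (prior, likelihood, and data all hypothetical): if we both accept a Beta(2, 2) prior on a response rate and a binomial likelihood for the full, unfiltered data, the posterior is fixed by the algebra, whatever either of us privately hoped for:

          ```python
          from scipy import stats

          # Agreed prior and likelihood; numbers are illustrative only
          a_prior, b_prior = 2, 2            # Beta(2, 2) prior on the response rate
          successes, trials = 130, 400       # the full, unfiltered dataset

          # Conjugacy: posterior is Beta(a + successes, b + failures); no further choices
          posterior = stats.beta(a_prior + successes, b_prior + (trials - successes))
          print("posterior mean:", round(posterior.mean(), 3))
          print("central 95% interval:", posterior.ppf([0.025, 0.975]).round(3))
          ```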

          In a Bayesian analysis, the likelihood represents a model for the **data collection process**; it is not taken as a god-given fact about the world, so any deception about the data collection process is just that… deception. If you LIKE the wasted money in the “compartmentalization” approach, we could go with that, but then we’re going to have to model it into the likelihood, and it’s going to dilute your results HEAVILY, and so you’re going to be wasting VAST sums of money.
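
          Here is a sketch of what “modeling the filtering into the likelihood” could look like, under the deliberately simple assumption that a study mean is only forwarded when its z-score clears a threshold (a truncated-normal likelihood; every number is hypothetical):

          ```python
          import numpy as np
          from scipy import stats

          def reported_loglik(theta, ybar, se, z_threshold=1.96):
              """Log-likelihood of a reported study mean ybar, given that studies
              are only forwarded to the regulator when ybar / se > z_threshold."""
              loglik_unselected = stats.norm.logpdf(ybar, loc=theta, scale=se)
              # Probability that a study with true effect theta clears the filter
              log_p_selected = stats.norm.logsf(z_threshold * se, loc=theta, scale=se)
              return loglik_unselected - log_p_selected

          ybar, se = 0.05, 0.022   # made-up reported mean and standard error
          for theta in [0.0, 0.02, 0.05]:
              naive = stats.norm.logpdf(ybar, loc=theta, scale=se)
              adjusted = reported_loglik(theta, ybar, se)
              print(f"theta={theta:.2f}  naive loglik={naive:.2f}  adjusted={adjusted:.2f}")
          # Relative to the naive likelihood, the adjusted one shifts support toward
          # small effects: a borderline result carries much less evidence once the
          # reporting filter is part of the model.
          ```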

        • > If SOMEONE ELSE who neither of us know about collected the other 798,000 rows, and you’ve never seen them, you’ve never seen scores derived from them, you have absolutely NO information about them, and you submit your 2000 rows with 1 column, then you are not hiding data, and your data selection process is not altered, and so no likelihood changes are needed, and so the inference is valid.

          Selecting whether I submit the data or not is a way of selecting what data to submit. The point is that I won’t be submitting my results unless I get the result I want. And each of the other 399 researchers doing the same experiment won’t submit their results unless they get the result they want. And according to you I won’t be hiding data by not submitting my “failed” model and whoever submits their results won’t be hiding data either.

          To wrap up, if I’m not misunderstanding you, you think that if I do an experiment several times and send you my results only when I find an effect, the inference is not valid (because I’m hiding data). But if I ask several researchers to do the same experiment once, with instructions to send you their results if they find an effect, the inference is valid (imagine I never get to see any of those results, so you cannot accuse me of hiding anything). The outcomes of the experiments will be the same, and the package sent to you (model, priors, data, assumptions, everything) will be the same. And they really AREN’T hiding anything; they’re totally unaware of the existence of those other 399. And you HAVE to accept the result in the second case as valid.

        • Carlos: so long as you correctly describe the data collection process in your trial plan as: “I will ask 400 groups of researchers to collect 2000 data points, and tell them to only send you data if they get a result at least as good as x” and the final likelihood we analyze with has this data collection process included in it… YEP I’m ok with you wasting all that money.

        • The point of my glib reply being that you can’t lie about the data collection process, and I sure as heck *can* accuse you of hiding data when you give instructions to your researchers to “not send the data unless you find an effect”. Just because you personally didn’t stand in front of the shredder doesn’t mean you don’t have criminal liability for telling someone else to do the shredding :-)

        • Daniel,

          Why not just model the decision as a function of these parameters and plot out the “cost-benefit” at various parameter values? You can show it in a graph, holding all the other parameters constant and varying each one independently, and then show all the permutations from sets of potential parameter values that spread across the plausible ranges of each.

          I guess I just mean: why bother saying “these are the specific parameter values we should use for our decision analysis”? Instead, just say “here are the potential cost-benefit calculations” and let people argue about what the world is really like (or what they think is important or valuable). Decision analysis yes, but we don’t need a single yes/no from this analysis, do we?

        • You’re right, you don’t need a *single* decision, but if you’re going to do this, you should probably average across the estimated values that are not under your control. So, for example, suppose you’re making a decision about the “badness” that will occur if you weigh W, take dosage D, are male, and are 33 years old… Well, you still don’t know exactly how much of some effectiveness variable “G” the drug will have (for example), or the side-effect values S1, S2, but you have some data and a posterior distribution over that effectiveness and those side effects, so you can plot the expectation:

          E[ badness(G, S1, S2, Weight, Dosage, Age, Sex) ] under the posterior p(G, S1, S2 | Data, Model)

          for all kinds of weights, dosages, and ages, and for each sex, but averaged over what you know about the effectiveness G and the side effects S1, S2. Then people could look at, say, two drugs, Foo and Bar, look at the plot, look up their weight, age, and sex, find the most effective dose, compare the badness outcomes, and say “gee, for me, Foo looks better”. Of course, to do this, you’d have to be looking at drugs that have similar purposes and therefore use the same agreed-upon “badness” function.
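
          A minimal sketch of that averaging step, with a made-up badness function and fake posterior draws standing in for p(G, S1, S2 | Data, Model) (in practice the draws would come from the fitted Stan model):

          ```python
          import numpy as np

          rng = np.random.default_rng(2)

          # Fake posterior draws for effectiveness G and side-effect rates S1, S2
          G = rng.normal(0.30, 0.05, size=4000)
          S1 = rng.beta(2, 50, size=4000)
          S2 = rng.beta(1, 200, size=4000)

          def badness(G, S1, S2, weight, dose, age, sex):
              """Made-up utility: less relief and more side effects are worse;
              the dose/weight scaling is purely illustrative."""
              relief = G * dose / weight
              return -relief + 5.0 * S1 + 20.0 * S2 + 0.001 * age * (sex == "M")

          # Average over the posterior draws for one patient profile across a dose grid
          for dose in np.linspace(10, 100, 10):
              expected = badness(G, S1, S2, weight=70, dose=dose, age=33, sex="M").mean()
              print(f"dose {dose:5.1f}: expected badness {expected:+.3f}")
          ```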

          So I’d be ALL OVER that kind of thing. It just hurts me a lot to think that “drugs.com” could be like that but isn’t. :-)

    • I agree. I think part of the reluctance to do this is a nervousness about assigning weights to different medical events (how do you value a baby’s life versus a mother having complications? I agree it’s not easy), plus a large dose of tradition in the way trials have evolved to have a “primary outcome” and deal in significance and non-significance, which are treated as if they were the same as treatment effectiveness and non-effectiveness. The sort of result seen in OPPTIMUM is common (“non-significant” but close) and is usually interpreted as “non-effectiveness”. The concern is that potentially useful treatments are inappropriately binned. Here they have the added complication of three “primary” outcomes, and they may have Bonferronied away an important result.

      • Simon and perhaps Daniel:

        You might want to read the redefinition of morbidity in “Induction of labor as compared with serial antenatal monitoring in post-term pregnancy: a randomized controlled trial”: http://www.nejm.org/doi/pdf/10.1056/NEJM199206113262402

        During the trial, but before the unblinding of the treatment assignments, a consensus panel was convened to decide upon the relative weighting. (I was involved in that panel (literally months of hard work) and in the paper, but withdrew as an author when I pointed out that the statistical methods section another author had written (in an earlier version) was incorrect regarding a secondary outcome. They over-ruled me by convincing the senior authors that the statistical reviewer would be very unlikely to notice the error, and submitted the paper. That paper had my name on it as an author, but fortunately it was rejected. To me, one of my success stories.)

        • “they over-ruled me by convincing senior authors that the statistical reviewer would be very unlikely to notice the error and submitted the paper. ”

          Aargh!

          “That paper had my name on it as an author but fortunately it was rejected. To me one of my success stories.”

          Yeah!

          Conjecture: Most people have trouble thinking like someone who thinks and writes in nested parentheses.

        • Off topic, but when I peek at someone typing, I upgrade my priors about him when I see him type closing parenthesis & then backtrack the cursor to start filling in the parenthesized text.

        • I tend to think in nested parentheses, but when I find myself writing English sentences with them, I usually try to rewrite without them. (Sometimes I find dashes and semicolons helpful.)

          But I don’t like programming at all, partly because I got turned off by a premature exposure at a young age. (As I recall, it was something like needing to write a nine-digit number to indicate that you wanted to add two numbers; then everything needed to be “typed” on a punch card. That was back when a computer took up a whole room and “memory” was on magnetic tape in a large vertical arrangement.) And partly because, when things got easier technologically, my poor fine motor coordination and increasingly poor eyesight made it frustrating.

        • “Conjecture: Most people have trouble thinking like someone who thinks and writes in nested parentheses.”

          Reminds me of Tom Wolfe setting up Noam Chomsky: “Every language depends upon recursion – every language. Recursion was the one capability that distinguished human thought from all other forms of cognition… recursion accounted for man’s dominance among all the animals on the globe…”

          As per usual, setup is followed by takedown: “He never recanted a word. He merely subsumed the same concepts beneath a new and broader body of thought*. Gone, too astonishingly, was recursion. Recursion!”

          paywalled link: http://harpers.org/archive/2016/08/the-origins-of-speech/

          *This sentence likely to reappear on the blog next time a main effect disappears in a replication, but an interaction is found, thereby substantiating everything the author previously said about everything.

  5. Context is important. Clinical advice if taking this medication is that you should tell your Doctor if you are pregnant. If someone is taking this medication and planning to become pregnant, there are already so many things one can find to be concerned about without this adding to them. From a clinical viewpoint, patient information is increasingly being steered toward appropriate risk statements in the consent-to-treatment phase. The impression given by the statements in the article is that a clinician would be able to speak relatively clearly to the concern a patient might have about what the relative risks are, from this study, when deciding whether or not to use the medication. Relative risk, OR, or similar measures with CI ranges are more relevant in the real world. I wonder what importance p-values have in patient information provision and in decision making in clinical settings. Perhaps part of the issue here is what end purpose the research has and how interpretable the results are.

    • “Clinical advice if taking this medication is that you should tell your Doctor if you are pregnant. If someone is taking this medication and planning to become pregnant, there are already so many things one can find to be concerned about without this adding to them. ”

      Huh? The trial “evaluated use of vaginal progesterone for prevention of preterm delivery”. In other words, the trial was studying whether or not the medication might be beneficial or harmful when prescribed for the purpose of prevention of a possible less-than-optimal outcome of pregnancy. It did not compare pregnant and non-pregnant women.

      (However, it does seem strange to me that “fetal death” and “delivery at less than 34 weeks’ gestation” were lumped into one primary outcome.)

  6. It is my opinion that if there was a problem with the results of this trial, it was not on the side of the regulating agency or its policies.

    When proposing a clinical trial, the company declares the outcome of interest (after heavily defending it as a meaningful metric). If the company decided to declare 3 outcomes of interest, of course they should have to do some sort of multiple comparison method. Otherwise, I will start setting up a Vitamin C trial where I use it to treat depression, nausea, the flu, mumps, fear of airplanes, etc.

    The fact that they declared 3 outcomes and used a Bonferroni/Holm correction is somewhat indicative that they had no idea which of the outcomes might be affected. It is almost surprising how little evidence was collected for outcomes 1 and 3, and somewhat suggestive that they were quite incorrect about the results in early stages of the trial; did they really think the effect sizes of 1 and 3 were going to be large enough to be detected with this sample size? You’re welcome to make the argument “is there really something magical about p = 0.072 vs p = 0.05?” with regard to outcome 2, but I don’t think there’s any reason to decide that multiple comparisons should not be an issue in this case.

    It is possible that outcome 2 was what they were most certain of, and that they were hoping to also get outcomes 1 and 3 and so decided to take a risk. But that risk was totally unnecessary. There are already procedures (recognized by the FDA) that allow you to test secondary hypotheses without any loss of power for the primary hypotheses: gatekeeping procedures. These allow the company to say “I’m sure we’ll get A, so that’s our primary outcome of interest. It would be nice to get B for extra marketing purposes, so we will test B conditional on A being significant (if A is not significant, we automatically fail to reject B)”. Both hypotheses are tested at the alpha = 0.05 level and FWER = 0.05 is preserved. So there’s no loss to the company in testing secondary outcomes, conditional on the drug doing what they proposed it would do in the first place.
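
    To be clear about the logic, here is a minimal sketch of a fixed-sequence gatekeeping rule with hypothetical p-values (this is the general idea, not this trial’s actual analysis):

    ```python
    def fixed_sequence_gatekeeper(p_values, alpha=0.05):
        """Test hypotheses in a pre-specified order, each at the full alpha;
        stop at the first non-rejection and automatically fail to reject the rest."""
        decisions = []
        gate_open = True
        for p in p_values:
            reject = gate_open and (p < alpha)
            decisions.append(reject)
            if not reject:
                gate_open = False
        return decisions

    # Hypothetical example: primary outcome A tested first, secondary outcome B second
    print(fixed_sequence_gatekeeper([0.01, 0.04]))    # [True, True]
    print(fixed_sequence_gatekeeper([0.20, 0.001]))   # [False, False]; B is never tested
    ```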

    It’s not clear to me why we should accept this result, even though it failed the preregistered protocol, but not accept power posing (I’m assuming that result came from data dredging). I believe this would be a much smaller case of data dredging (3 hypotheses vs. say 100?), but are we going to start quantifying how much data dredging is okay?

    • I agree that regulatory agency / pharma company dynamics are not at play here, given that the trial doesn’t seem to be related to any regulatory process and it’s being funded by the UK government and not by a pharma company.

      Regarding the correction for multiple testing, it actually seems it was not part of the original protocol. At least I see no mention of it, and their statement is ambiguous: “According to the prespecified statistical analysis plan, p values were initially reported without adjustment for multiple comparisons, then adjusted using a Bonferroni-Holm procedure.”
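
      For reference, here is a minimal sketch of the Holm step-down adjustment they describe, applied to made-up unadjusted p-values for three co-primary outcomes:

      ```python
      def holm_adjust(p_values):
          """Holm step-down adjusted p-values (monotone, capped at 1)."""
          m = len(p_values)
          order = sorted(range(m), key=lambda i: p_values[i])
          adjusted = [0.0] * m
          running_max = 0.0
          for rank, i in enumerate(order):
              running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
              adjusted[i] = running_max
          return adjusted

      # Illustrative values only: the smallest p-value gets multiplied by 3, etc.
      print(holm_adjust([0.03, 0.20, 0.70]))   # approximately [0.09, 0.40, 0.70]
      ```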

  7. This comment may be naive, but it seems to me that there is a hierarchical structure involved here that ought to be taken into account in the analysis (or maybe I am naive in thinking it could be taken into account in the analysis?):

    The stated outcomes are “1. fetal death or delivery at less than 34 weeks’ gestation; 2. Neonatal death or serious morbidity; 3. Cognitive score at two years.” So it seems that there are levels as follows:

    Top level (dichotomous): Fetal death or live delivery.

    Nested within live delivery: Delivery at more or less than 34 weeks’ gestation. (Or maybe a more continuous variable: Length of gestation — which I realize is somewhat fuzzy to measure; but then, so is the “more or less than 34 weeks”)

    Next level: neonatal death, serious morbidity, or neither (nested within each category at second level)

    Next level: Available or not for cognitive testing at two years (nested in the last two of the three categories at the preceding level)

    Bottom level: Cognitive score at two years (nested within the “available” category of the preceding level.)

    • Yes! So the outcomes form a sequence (like a flow chart), and what we are interested in is really the probabilities of transition between the states in the presence or absence of the intervention. That sounds like a very sensible approach to me, but it is not something I’ve ever seen done in clinical trials (not to say it hasn’t been).
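
      A sketch of what estimating those transition probabilities could look like for one arm, using entirely hypothetical counts and a Beta(1, 1) prior at each branch of the flow chart:

      ```python
      from scipy import stats

      # Hypothetical counts for one arm, following the branching structure
      # (events, non-events at each branch; all numbers made up)
      branches = {
          "fetal death":                 (5, 595),
          "delivery before 34 weeks":    (90, 505),
          "neonatal death or morbidity": (35, 560),
      }

      for name, (events, non_events) in branches.items():
          post = stats.beta(1 + events, 1 + non_events)   # Beta(1,1) prior + binomial data
          lo, hi = post.ppf([0.025, 0.975])
          print(f"P({name}): mean {post.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
      # Doing the same for the other arm and comparing the per-branch posteriors gives
      # the transition probabilities with and without the intervention.
      ```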

      Also just wanted to say thanks to everyone who has posted comment here for a really interesting discussion.

    • This is part of what I was thinking about when I commented on the loss-to-follow-up numbers. I have a feeling that babies with no major health issues were more likely to be lost. And yes, the lumping together of deaths is very strange, at least on the surface.

  8. @Daniel:

    Here’s a thought experiment:

    Contest A) Suppose we invited people to submit Bayesian models right now for predicting US elections. Then we appoint a third-party panel to weed out all those models whose priors are unreasonable. Then we wait, and post-election we find that one model did really, really well at predicting all the results, including regional ones and their minutiae.

    Contest B) We wait until after the election results are declared and again invite people to submit post-hoc (Bayesian) models to retrospectively predict the election results. Again the expert panel throws away models with unreasonable priors. Finally we select the best model.

    Now the question is: Would you trust the winning model of both scenarios equally strongly? If you prefer one scenario, then why?

    Isn’t the pre-registration scenario analogous to Contest-A?
