Amos Elberg writes:
I’m writing to let you know about a drug trial you may find interesting from a statistical perspective.
As you may know, the relatively recent “orphan drug” laws allow (basically) companies that can prove an off-patent drug treats an otherwise untreatable illness to obtain intellectual property protection for otherwise generic or dead drugs. This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.
Charcot-Marie-Tooth (CMT) is a moderately rare genetic degenerative peripheral nerve disease with no known treatment. CMT causes the Schwann cells, which surround the peripheral nerves, to weaken and eventually die, leading to demyelination of the nerves, a loss of nerve conduction velocity, and an eventual loss of nerve efficacy.
PXT3003 is a drug currently in Phase 2 clinical testing to treat CMT. PXT3003 consists of a mixture of low doses of baclofen (an off-patent muscle relaxant), naltrexone (an off-patent medication used to treat alcoholism and opiate dependency), and sorbitol (a sugar substitute).
Pre-phase 2 results from PXT3003 are shown here.
I call your attention to Figure 2 [above], and note that in Phase 2, efficacy will be measured exclusively by the ONLS score.
My reply: 33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??
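As a quick sanity check on that arithmetic (a stdlib-only sketch; the binomial calculation assumes the 33 tests are independent, which they surely aren’t here):

```python
from math import comb

n, alpha = 33, 0.05
expected = n * alpha  # expected number of "significant" results under the global null
# P(4 or more significant by chance alone), treating the tests as independent
p_ge_4 = sum(comb(n, k) * alpha**k * (1 - alpha)**(n - k) for k in range(4, n + 1))
print(f"expected: {expected:.2f}, P(>=4 by chance): {p_ge_4:.3f}")
```

So seeing 4 of 33 clear the 0.05 bar is not all that surprising even if nothing is going on.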
In all seriousness, I’d recommend they fit a multilevel model. That said, I’ve never fit such a model for this sort of experiment. I’d like to do it (at least) once, for a live example, as I think this would help me better understand the statistical issues and then I’d be able to make more helpful recommendations.
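For what it’s worth, here is a toy sketch of the partial-pooling idea behind such a multilevel model, with simulated effect estimates and standard errors standing in for the paper’s actual numbers (everything below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 33 endpoint estimates and their standard
# errors; the trial's actual figures would replace these.
J = 33
se = rng.uniform(0.5, 2.0, J)
y = rng.normal(0.2, 0.3, J) + rng.normal(0.0, se)

# Precision-weighted grand mean and a method-of-moments estimate of the
# between-endpoint variance tau^2 (truncated at zero).
mu_hat = np.average(y, weights=1 / se**2)
tau2 = max(0.0, float(np.mean((y - mu_hat) ** 2 - se**2)))

# Partial pooling: each estimate is shrunk toward the grand mean, more
# strongly when its own standard error is large relative to tau.
if tau2 > 0:
    shrunk = (y / se**2 + mu_hat / tau2) / (1 / se**2 + 1 / tau2)
else:
    shrunk = np.full(J, mu_hat)
```

The noisiest endpoints get pulled hardest toward the overall mean, which is exactly the behavior you want in place of 33 separate significance tests.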
It’s striking that the tests/traits that have the smallest CIs show no apparent changes at any dose. So either there are really no effective changes, or the medications only improve certain tests/traits (if indeed any are actually improved) and not others. Someone who understands the condition could think about whether the second would be plausible.
I suppose that “ONLS” means “Overall Neuropathy Limitations Scale”. If you consider the improvement in ONLS vs dose (which seems to be the obvious choice for multilevel modeling), there is a small increase with dose, but the CIs are too large to tell for sure. If you just took the difference between high dose and no dose, and propagated the CIs, it looks (absent having the actual numbers to compute with) as if the difference would not have a significant p-value. This could be one of those cases where the mean, being the best unbiased estimate, could indicate something even if it’s not “statistically significant”. But there’s probably too much noise to really tell.
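The back-of-the-envelope propagation goes like this (all numbers below are made up for illustration, since I don’t have the actual figures):

```python
from math import sqrt

# Hypothetical high-dose and placebo ONLS changes (means and standard errors);
# the real values would have to come from the paper's Figure 2.
mean_hi, se_hi = -0.4, 0.35
mean_pl, se_pl = 0.0, 0.30

diff = mean_hi - mean_pl
se_diff = sqrt(se_hi**2 + se_pl**2)  # SEs of independent groups add in quadrature
z = diff / se_diff
print(f"difference {diff:.2f}, 95% CI +/- {1.96 * se_diff:.2f}, z = {z:.2f}")
```

With numbers in this ballpark, |z| falls well short of 1.96, consistent with the eyeball impression that the high-dose vs no-dose difference would not reach significance.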
To me, these results call out for another test at an even higher dose. But I suppose that won’t be in the cards.
It’s interesting because my thought process is totally different from what is found in your 2nd paragraph, but I would come to the same conclusion: they need to collect enough dose-response data to cover the entire curve. In my case, it is so that we can come up with models of what is going on that are consistent with this curve. (There are some other issues with the data, etc., but I’m ignoring those for now.)
Can you give more detail on how you link the ideas in your second paragraph to your conclusion? Specifically, I see these as your premises (there are probably unstated ones that need to be filled in):
p1) there is a small increase with dose, but the CIs are too large to tell for sure
p2) the difference between high dose and no dose…would not have a significant p-value
p3) the mean could indicate something even if it’s not “statistically significant”
And here is your conclusion:
C) these results call out for another test at an even higher dose
Basically, I don’t see how the conclusion follows from what you wrote. For comparison, here is my reasoning in the same format:
p1) The dose response curve contains useful info about what is going on when people are given this drug
p2) Having data on the entire curve (including upper plateau) will constrain the number of explanations for what is happening when people take this drug
p3) It would be useful to have explanations/models that predict the dose response curve
p4) It isn’t clear whether the highest dose tested was at the upper plateau or not
Therefore:
C) these results call out for another test at an even higher dose
Is it possible for anyone else to follow my reasoning here?
Your reasoning makes sense to me (but I haven’t read what’s in the link provided).
Thanks, it isn’t really about this in particular so I don’t think reading the paper is necessary to follow the logic. I really only glanced at it.
Anoneuoid:
I think your reasoning is very odd in that it treats all measures of dose effect as exact. A very peculiar view from someone who believes all researchers are just chasing noise.
While I understand your point (errors are of second order interest), I think Tom’s reasoning is totally fine in that he is saying that random error is so high it’s hard to pick out signal from noise. By upping the dosage, we presumably strengthen the signal (unless we’ve hit a plateau).
How so? If it reads like that there is a communication issue, because nothing could be farther from the truth regarding the thought process I attempted to describe.
Not sure why there is a strawman introduced here, but let’s ignore that for now…
I’m not sure what you mean. Perhaps by “errors” you mean false positive/negative? In that case, they are totally irrelevant to the reason I would want more data. If the ONLS score means anything, I assume there is a non-zero effect of taking this pill.
Thanks, so something like:
p1) there is a small increase with dose, but the CIs are too large to tell for sure
p2) the difference between high dose and no dose…would not have a significant p-value
p3) the mean could indicate something even if it’s not “statistically significant”
p4) upping the dosage may increase the change in ONLS score
p5) it would be useful to know whether this treatment has any effect at all on ONLS score
Conclusion:
C) these results call out for another test at an even higher dose
Does that fit? I could follow this line of reasoning, but it was non-obvious to me (no sarcasm). Of course, I reject “premise 5”, since I would assume some non-zero effect exists by default.
For those experiencing deja vu: this was posted already (with different comments from Andrew, and without mentioning the original poster) on October 16.
Carlos:
That’s funny. So many posts, it’s hard for me to keep track!
I have deja vu events that are pre-migraine. I read this post and my immediate thought was, oh crap – I’m going to get a migraine. But this time the deja vu is real. So maybe no migraine.
33 comparisons! I think the significance level should be adjusted; simply put, it should be decreased from 0.05 to 0.05/33 to avoid the p-hacking issue.
Are these 3 studies with a treatment/placebo arm or 1 study with 3 treatment levels and placebo? Typically Dunnett’s intervals and tests are used for multiple comparisons to one control group. It would be wrong to reset alpha to .05/33 (too conservative) but there are various ranking procedures used in pharma statistics to adjust p-values from multiple endpoints. Another approach is p-value combinations that obtain an overall test statistic. These methods are used in the genomics world and I find are useful when clinicians are obsessed with significance for a handful of tests.
If there are 3 separate studies with different control populations, the variability of the endpoints can be assessed and brought into the comparison. As far as multilevel modeling goes, this is all at one level of grouping and I am not sure how much low-information priors could contribute. But a phase 2 study with more dosing levels can definitely use this data for setting some type of prior assumptions.
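To illustrate why 0.05/33 is overly conservative relative even to a simple step-down procedure (toy p-values, stdlib only; Dunnett’s procedure would additionally exploit the correlation induced by the shared control arm):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: controls FWER like Bonferroni but is uniformly more powerful."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break  # once one ordered test fails, all larger p-values fail too
    return reject

# Toy p-values for 33 endpoints: plain Bonferroni demands p <= 0.05/33 ~ 0.0015
pvals = [0.0005, 0.00153, 0.02, 0.04] + [0.5] * 29
bonferroni = [p <= 0.05 / len(pvals) for p in pvals]
print(sum(bonferroni), sum(holm(pvals)))  # Holm rescues the borderline second test
```

Holm rejects the borderline second test that Bonferroni misses, and the gap only grows as procedures use more of the dependence structure.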
Agree – this seems like a standard multiple endpoint problem and 0.05/33 is too conservative. If one wants classical hypothesis testing, I think the place to start is
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Procedures+for+comparing+samples+with+multiple+endpoints&btnG=
And this is a more modern treatment, developed for genomic data with many thousands of endpoints but equally useful for only a few.
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=ROAST%3A+rotation+gene+set+tests+for+complex+microarray+experiments+&btnG=
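One classical version of the p-value-combination idea mentioned above is Fisher’s method, valid under independence of the tests; here is a minimal stdlib sketch (the chi-square survival function with even df has a closed form):

```python
from math import log, exp, factorial

def fisher_combined(pvals):
    """Fisher's method: -2*sum(ln p) ~ chi-square with 2k df under the global null."""
    stat = -2 * sum(log(p) for p in pvals)
    k = len(pvals)  # df = 2k, so the chi-square survival function is a finite sum
    return exp(-stat / 2) * sum((stat / 2) ** i / factorial(i) for i in range(k))

# Toy example: several individually modest p-values combine into stronger
# overall evidence against the global null.
print(fisher_combined([0.04, 0.10, 0.05, 0.20]))
```

Four endpoints, none of them impressive on their own, yield a combined p-value below 0.01 here, which is the sort of overall statistic that can satisfy clinicians fixated on significance.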
Thanks for correcting what I misunderstood. Absolutely agree with what you mentioned about the “multiple endpoint” issue in this topic.