Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”

From someone who would prefer to remain anonymous:

As you may know, the relatively recent “orphan drug” laws allow (basically) companies that can prove an off-patent drug treats an otherwise untreatable illness to obtain intellectual property protection for otherwise generic or dead drugs. This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.

Charcot-Marie-Tooth is a moderately rare genetic degenerative peripheral nerve disease with no known treatment. CMT causes the Schwann cells, which surround the peripheral nerves, to weaken and eventually die, leading to demyelination of the nerves, a loss of nerve conduction velocity, and an eventual loss of nerve efficacy.

PXT3003 is a drug currently in Phase 2 clinical testing to treat CMT. PXT3003 consists of a mixture of low doses of baclofen (an off-patent muscle relaxant), naltrexone (an off-patent medication used to treat alcoholism and opiate dependency), and sorbitol (a sugar substitute).

Pre-phase 2 results from PXT3003 are shown here.

I call your attention to Figure 2, and note that in Phase 2, efficacy will be measured exclusively by the ONLS score.

My reply: 33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??

I sent this exchange to a colleague, who wrote:

In a past life I did mutual fund research. One of the fun things that fund managers do is “incubate” dozens of funds with their own money. Some do very well, others do miserably. They liquidate the poorly performing funds and “open” the high-performing funds to public investment (of course, reporting the fantastic historical earnings to the fund databases). Then sit back and watch the inflows (and management fees) pour in.

8 thoughts on “Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??””

  1. It’s worth noting that the p-values are one-tailed. That choice can be defended (two of their references in fact address the issue of one-sided tests) but is not usual. Using two-sided tests, only one of the four statistically significant results remains p<0.05 (the others become p<0.10).

  2. I think (?) you are being sarcastic with your title, but maybe you should be explicit about what is wrong to avoid any misunderstanding.

    As for the incubation of the funds, is that wrong? Some funds will do better by chance, but it is also possible the best performing funds are doing better because they are more skillfully managed. It just helps to keep in mind the multiple “chances” to do well when calculating the odds that a fund’s performance is due to chance alone. If a coach “incubates” multiple players for a team and then cuts the worst ones, does anybody doubt the better-performing players are actually better? Or should I go back to my little league coach and tell him kicking me off for batting .100 when there were .300 players was a misuse of statistics?

      • Thank you, Andrew, for noting the difference between unpooled drug efficacy comparisons and partially-pooled fund performance comparisons.

        Would a kind reader please provide calculation details, to help readers like me get a better quantitative understanding of Andrew’s points?

        — How do we set up to determine 1.65 expected significant in 33 comparisons? Is this a binomial distribution calculation (success/failure)? Is there an implicit 0.05 significance level here?

        — How do we set up and calculate the analogous significance for say 33 partially-pooled mutual funds? Are more details required, e.g. size of pools and/or population of available stocks to choose from?

        Thanks very much in advance for any assistance.

        • Brad:

          Sure, with 33 independent comparisons and an expected 1.65 significant, you could use dbinom() in R to compute the probability of 0, 1, 2, 3, etc. significant results.
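
          To make that concrete, here is a minimal sketch in R, assuming 33 independent comparisons and a 0.05 threshold (generic numbers, not taken from the study):

              n <- 33        # number of comparisons
              alpha <- 0.05  # significance threshold
              n * alpha                              # expected number significant: 1.65
              dbinom(0:6, size = n, prob = alpha)    # P(exactly k significant), k = 0,...,6
              1 - pbinom(3, size = n, prob = alpha)  # P(4 or more significant), roughly 0.08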

          But I’d not recommend this as a data analysis. I’d do partial pooling on the effect sizes. My point was just that, when there are many comparisons, it’s no surprise to see some low p-values, just by chance.

          Conversely, if you have good theoretical reasons to believe these effects, or if the effects are consequential, it can make sense to act on them right away, without using statistical significance as a threshold.

          From the above post, my correspondent had written: “This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.” And this suggested to me that there was no good theoretical reason to expect these effects, in which case from a Bayesian point of view we’d want to do a lot of partial pooling toward 0, which would give us little confidence in any of these claimed effects, even if the separate p-values happened to be less than 0.05 or whatever. The large number of potential comparisons is helpful in understanding how those low p-values came up in the first place.
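
          As a minimal sketch of that shrinkage, here is the normal-normal calculation with a prior centered at 0 (made-up numbers for illustration, not the PXT3003 estimates):

              # Estimate theta_hat with standard error se, prior theta ~ N(0, tau^2):
              # the posterior mean pulls the raw estimate toward 0.
              shrink <- function(theta_hat, se, tau) {
                w <- tau^2 / (tau^2 + se^2)             # weight given to the data
                c(post_mean = w * theta_hat,
                  post_sd   = sqrt(w) * se)             # sqrt(tau^2 * se^2 / (tau^2 + se^2))
              }
              shrink(theta_hat = 2, se = 1, tau = 0.5)  # post_mean 0.4, post_sd about 0.45

          With a skeptical prior like this, an estimate two standard errors from zero ends up less than one posterior standard deviation from zero.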

          Regarding quantitative understanding, there’s this article which treats the problem from a non-Bayesian perspective. Oddly enough, I haven’t recently written any articles giving the simple Bayesian solution to such problems; I should do that.

        > How do we set up to determine 1.65 expected significant in 33 comparisons? Is this a binomial distribution calculation (success/failure)? Is there an implicit 0.05 significance level here?

          33 x 0.05 = 1.65

  3. Some time in the pre-Internet era I read about the following strategy for getting people to subscribe to a newsletter predicting football(?) results:

    You start with (say) 512 names of people you know are interested in football results. You send each of them a prediction for a particular match — half predicting a win for side A, and half for the other side.

    After the game has been played, you throw away the addresses of the people you sent the wrong prediction to, and send another prediction to the 256 survivors, again split 50/50 between the teams.

    After doing it again you have 64 people who have received three straight correct predictions, and so will wish to subscribe to your newsletter.
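
    In code, the halving is just this (a toy sketch):

        # Pure guessing, yet everyone who survives a round has seen only correct picks.
        recipients <- 512
        for (round in 1:3) {
          recipients <- recipients / 2   # keep only the half sent the winning prediction
          cat("round", round, ":", recipients, "people with a perfect record\n")
        }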

    Interestingly this would probably be difficult to pull off today — people would be very likely to compare notes on some Internet forum.
