Skip to content

When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version​ of Copaxone​ and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay​ for the non-generic​ from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to try to review the FDA decision to see if I could get any insight into the basis for ​my neurologist’s recommendation​dation.​

What appeared on first glance to be a telling criticism of the Teva submission was a reference​ by the FDA​ to “non-standard statistical criteria” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So, I looked up back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA​—both empirical Bayes tools.

​Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.

My correspondent included some relevant documentation:

The FDA docket files are available at!docketBrowser;rpp=25;po=0;dct=SR;D=FDA-2015-P-1050​

The test below is from ​ April 15, 2015 content/uploads/2016/07/Citizen_Petition_Denial_Letter_From_CDER_to_Teva_Pharmaceuticals.pdf”>FDA Denial Letter to Teva at pp. 41-42​

​Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

​The image below is from the ​Teva Petition, July 2, 2014 at p. 60


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.


  1. Z says:

    It’s hard to know what to make of this situation. Teva argues that the drug is very complex and nobody knows how it works, so it’s possible that something about the specific way that Teva manufactures the drug might contribute to its efficacy. To illustrate their point, Teva tries to demonstrate that generics on the market in other countries have different effects on the gene expression levels of mouse cells. The FDA apparently doesn’t like the way Teva performed their experiment or did their analysis. The FDA says that they reach opposite conclusions when using “industry standard practices and criteria” for one study.

    I wonder what the standard practices are. If they used Bonferroni or something to correct for multiple comparisons, then the FDA would be too conservative and potentially miss real differences between the generic and brand drugs.

    And the statistical issue is just one prong of the argument. Teva’s general point that the drug is so complex that we can’t know whether a generic is similar enough to work without testing it in an efficacy trial (which would be prohibitively expensive for most generic companies to conduct) might still hold water. This could be true even if the generic did have a similar effect on mouse cell gene expression levels. I don’t know how to assess this argument, which Teva has a strong financial incentive to make, but which sounds reasonable enough that I would be personally somewhat nervous about taking the generic.

    • Z says:

      Maybe in the future brands making the argument Teva is making should be forced to fund the efficacy trials for the generics in exchange for a slightly extended patent exclusivity period.

    • Rahul says:

      Without knowing much about the specifics of the statistical argument, Teva’s motivations are very suspect. Very likely that Teva simply wants to protect their own exclusivity and monopoly power and FDA refused to play along.

      In any case, I’d be curious to know the neurologist’s rationale in asking your correspondent not to switch to the generic. At least the neurologist does not have the same conflict of interest that Teva does.

      • Jonathan (another one) says:

        Sure, Rahul, but the neurologist has his or her own risk aversion — the extra cost isn’t borne by the neurologist. There’s a treatment that works; the alternative is at best identical and there’s a colorable argument that it might be worse. Who ought to make the cost-benefit tradeoff?

        • There’s easily an argument that the generic might be better! but people don’t focus on that.

          For example though, there’s some evidence that “Roundup” brand glyphosate based weedkillers are carcinogenic, but glyphosate itself is pretty obviously not carcinogenic, so what’s going on? It turns out the surfactants in the Roundup brand stuff are probably the reason, and a generic brand that uses different surfactants might not be carcinogenic at all.

          • mark k says:

            “Might be better” is literally true but it’s nothing like even odds, so not focusing on the long shot that it’s better it is the logical choice.

            To a (very) crude approximation, you can think of late stage drug development as sampling thousands of “similar” molecules and formulations in a tiny portion of of ‘chemical space’. Even in the worst case–where researchers know so little the sampling is essentially random–making a random change to some attribute of the drug during production is unlikely to stumble upon the best thing yet, you’d expect it to be smack in the middle of the pack.

            If the researchers do know what’s going on then the actual product is more like a local maximum, so a random change almost certainly makes it worse.

            I agree with Jonathan on the motivation for the neurologist–there is no incentive to change, especially in a complicated disease like MS, unless you’re the one paying.

          • Rahul says:

            My priors say that the most likely outcome is that the generic & the original are absolutely identical in effect (for practical purposes).

            Any variation of effect is likely to be at a level that isn’t significant (clinically, not practically).

            • Rahul says:

              ….I meant to write “clinically not statistically”

            • Georgette Asherman says:

              The standard formula is to compare potency and declare equivalence when the 90% confidence interval for the ratio falls between 80% and 125%. This is the basis of most generic approvals. So for most small molecule drugs, generics don’t make much of a difference. There are cases where the binding and fillers do matter to hypersensitive people or people on multiple medications. I will say that cheap generic antibiotics are wonderful after recent dental procedures. For biologic drugs, those that come from some fermentation process, the process is part of the production so there are not true generics but biosimilars. Think of microbrews, our favorite biologic for 3000 years.

              This drug include microarray RNA expression data to show its efficacy. A Teva wanted to complain that the generic did not have the same results. Microarrays show how much RNA is produced for each gene in the presence of a test specimen. Where the RNA is a proxy for protein production. Microarrays were the hot technology about 10 years ago but now are limited to certain niches in pharmacology development. TI was really into them at the time but was not that interested in other parts of bioinformatics. Bioconductor is a system of R-packages used in this field. The consensus method coming out of Berkeley and Stamford was to do t-tests on thousands of genes and then apply false discovery rate adjustments. So the FDA is not against empirical Bayes in general but would like to use the standard method in a claim that the generic version does not show the same expression patterns.

              The FDA has written various guidelines for use of Bayesian statistics. The success has been in medical devices and adverse event tracking. However, most clinical statistics is based on frequentist approaches. Also most of a clinical report is descriptive, where operational considerations such as future manufacturing and implementation play a big role.

        • Rahul says:

          It’s an interesting question. I always thought that a good doctor ought to help the patient make cost-benefit analyses, quality-of-life trade off decisions etc.

          Sure, the final decision is the patient’s. But the doctor ought to actively help the patient make it was my impression.

          e.g. If a $1000-per-month regimen has a very minor chance of being better than a $40-per-month option then it is the doctor’s duty to *communicate* the tradeoff involved.

      • Garnett says:

        +1 I was wondering the same thing.

      • Bill Jefferys says:

        Fact: Pharma companies court physicians with various perks, in order to affect prescription habits.

        Don’t know if this is the case here, but it happens.

        Just sayin’.

  2. Dalton says:

    I think I emailed you about this Dr. Gelman, but this also affects the environmental sciences in a major way. There at least, we have an example of (absolutely terrible) statistical methods mandated by the regulatory agency in the weighty tome of the Code of Federal Regulation (CFR). For the full details check out 40.CFR.257.93 which mandates how monitoring is to be done at coal combustion residual plants. I believe there is similar language for other facilities (like landfills) which may affect groundwater.

    I’ve copied the relevant section below, but for me the highlights are: 1) you must use some NHST method with a multiple comparison control for p-values, BUT not too much control because you comparison wise-error rate needs to be no less than 0.01; 2) you can only use normal theory or non-parametric, none of this fancy GLM crap; 3) your statistical plan must be certified by a professional engineer, because you know engineers are really, really good at statistics.

    Maybe wait to read the rest of this til Monday. You don’t want to ruin your weekend being flabbergasted.
    “The owner or operator of the CCR unit must select one of the statistical methods specified in paragraphs (f)(1) through (5) of this section to be used in evaluating groundwater monitoring data for each specified constituent. The statistical test chosen shall be conducted separately for each constituent in each monitoring well.

    (1) A parametric analysis of variance followed by multiple comparison procedures to identify statistically significant evidence of contamination. The method must include estimation and testing of the contrasts between each compliance well’s mean and the background mean levels for each constituent.

    (2) An analysis of variance based on ranks followed by multiple comparison procedures to identify statistically significant evidence of contamination. The method must include estimation and testing of the contrasts between each compliance well’s median and the background median levels for each constituent.

    (3) A tolerance or prediction interval procedure, in which an interval for each constituent is established from the distribution of the background data and the level of each constituent in each compliance well is compared to the upper tolerance or prediction limit.

    (4) A control chart approach that gives control limits for each constituent.

    (5) Another statistical test method that meets the performance standards of paragraph (g) of this section.

    (6) The owner or operator of the CCR unit must obtain a certification from a qualified professional engineer stating that the selected statistical method is appropriate for evaluating the groundwater monitoring data for the CCR management area. The certification must include a narrative description of the statistical method selected to evaluate the groundwater monitoring data.

    (g) Any statistical method chosen under paragraph (f) of this section shall comply with the following performance standards, as appropriate, based on the statistical test method used:

    (1) The statistical method used to evaluate groundwater monitoring data shall be appropriate for the distribution of constituents. Normal distributions of data values shall use parametric methods. Non-normal distributions shall use non-parametric methods. If the distribution of the constituents is shown by the owner or operator of the CCR unit to be inappropriate for a normal theory test, then the data must be transformed or a distribution-free (non-parametric) theory test must be used. If the distributions for the constituents differ, more than one statistical method may be needed.

    (2) If an individual well comparison procedure is used to compare an individual compliance well constituent concentration with background constituent concentrations or a groundwater protection standard, the test shall be done at a Type I error level no less than 0.01 for each testing period. If a multiple comparison procedure is used, the Type I experiment wise error rate for each testing period shall be no less than 0.05; however, the Type I error of no less than 0.01 for individual well comparisons must be maintained. This performance standard does not apply to tolerance intervals, prediction intervals, or control charts.

    (3) If a control chart approach is used to evaluate groundwater monitoring data, the specific type of control chart and its associated parameter values shall be such that this approach is at least as effective as any other approach in this section for evaluating groundwater data. The parameter values shall be determined after considering the number of samples in the background data base, the data distribution, and the range of the concentration values for each constituent of concern.

    (4) If a tolerance interval or a predictional interval is used to evaluate groundwater monitoring data, the levels of confidence and, for tolerance intervals, the percentage of the population that the interval must contain, shall be such that this approach is at least as effective as any other approach in this section for evaluating groundwater data. These parameters shall be determined after considering the number of samples in the background data base, the data distribution, and the range of the concentration values for each constituent of concern.

    (5) The statistical method must account for data below the limit of detection with one or more statistical procedures that shall at least as effective as any other approach in this section for evaluating groundwater data. Any practical quantitation limit that is used in the statistical method shall be the lowest concentration level that can be reliably achieved within specified limits of precision and accuracy during routine laboratory operating conditions that are available to the facility.

    (6) If necessary, the statistical method must include procedures to control or correct for seasonal and spatial variability as well as temporal correlation in the data.”

    • Jonathan (another one) says:

      Fascinating. Standardized methods economize on gov’t time assessing results, probably greatly so, with two risks: first, there is the risk of making mistakes (on either side, not approving some facilitate which does not appreciably contaminate groundwater or approving a facility which does); second, there is the risk of adjusting the facts to the test, either fraudulently or via forking paths. In particular, condition (6) seems rife for abuse. That doesn’t mean it’s wrong; allowing each operator to generate its own report using its own methods could well overwhelm both the resources and the competence of (I presume) EPA staff.

      It is clear that that these rules have accreted over time and litigation rather than being thought anew and contrasted with some loss function. But frankly, I’ve seen a lot worse.

      • Roy says:

        I think everyone who works on the interface of statistics and policy should be required to spend a few years in government. Why might standardized methods arise? The assumption in blogs like this, is the statistician(s) for the company would only bring to bear better models. But my experience is that there are far too many hacks that are willing to take the money and run, and in fact do a worse analysis. So now you submit that analysis, first some one has to review it, to see if it is adequate. If it is not, what is the government’s recourse, since methodology isn’t prescribed? Take you to court, which is long and costly and becomes a case of dueling experts. So the end result is that methods get written explicitly into regulations.

        And why must it be certified by a Professional Engineer? Professional Engineers are licensed, and if they falsely sign off that a particular analysis was done in a certain way, and in fact it wasn’t, they can lose their license. It is more a certification that was is described in the document was actually done, and done correctly, than a certification that the techniques are the best.

        I was in policy when I first started in government, and it is really instructive how difficult it is to write an effective regulation or law, particularly since if large amounts of money are involved, there are teams of lawyers or other experts looking for loopholes. There are no such things as gray areas, because they provide outs.

        • Dalton says:

          I frequently work on projects where EPA is the regulatory agency. There is a serious problem with statistical methods mandated either in regulation or by direct request from the regulator. My experience is that policy becomes prescription and as a result statistical and scientific rigor suffers.

          A prime example is the use of normality tests. This is hard coded as the default starting step in EPA’s “official” software ProUCL. I myself have been told that I can’t use t-test unless I first use a normality test (e.g Anderson-Darling).

          This is how EPA (in general, but clearly not every person in the agency) interprets the results of a normality test. If the test fails to reject the null hypothesis, then they assume that the data are normality. (Nevermind that Fisher blatantly pointed out in defining the term “null hypothesis” that is can only be rejected, not confirmed. ) If the test rejects the null hypothesis, then they assume that data are not normal and that you cannot use methods that are associated with normality, even if it needs only be approximate or close enough. So that puts you in this paradoxical situation where if you have very small sample sizes and limited power to detect non-normality, you will be directed by ProUCL (or by the agency itself if you should presume to use such “unsupported” software as R) to use normal theory test, which is probably where you should be very cautious about assuming too much. But if you have lots of data the Anderson-Darling test will be far more likely to reject and as result you will be disallowed from using methods that are robust to slight deviations from normality with large sample sizes.

          Also what about the difference between practical significance and statistical significance that was a topic on this blog recently? This is regulation built on statistical significance. What would concern you more if you had a well near one of these facilities: a finding that lead in groundwater is elevated by 100 times background but with a confidence interval of 0.5 to 500 times (meaning it is not statistically significant), or a finding that lead in groundwater is elevated by 1.1 times with a confidence interval of 1.05 to 1.15? I can tell you which one scare me more, and it isn’t the statistically significant one. But the trigger for additional monitoring and remediation is statistical significance, not practical significance. This is disincentive for the regulated to pursue powerful monitoring programs. If I’m being judged on statistical significance, I’ll be inclined to sample as little as possible to minimize the chances of rejecting the null hypothesis of zero pollution.

          • Martha (Smith) says:

            +1 Thanks

          • YES YES YES, these are rock hard examples of exactly the kind of nonsense that using an accept-reject criterion with Frequentist statistical analysis leads to as compared to Bayesian decision theory. Frequentist accept-reject decisions about this sort of stuff are basically always a bad thing to codify into law.

            This isn’t a religious war, it’s a THEOREM OF MATHEMATICS called the Complete Class Theorem. Every decision rule is either a Bayesian decision rule or is dominated by a Bayesian decision rule. In practice with an adversary who actually is actively looking for loopholes, the Bayesian rule can dominate by a LARGE AMOUNT.

            Decisions NEED to use information about how bad various outcomes are likely to be, and they need to be made on a logical basis considering the weight of evidence for all the possible outcomes. But, almost the ONLY way people use accept-reject criteria is to decide whether or not they can use the null hypothesis as if it were true. So if you get a result like 0.5 to 500 times the background lead concentration, you’ll assume that since you can’t reject the null hypothesis of 1x the background level, you should use it as your assumption. This crap happens *all the time*. It’s a regulatory disaster.

            • Bill Jefferys says:

              +1 YES YES YES YES!

              I’ve long argued that decision theory is the right way to approach these problems. NHST is a crap way to make decisions.

            • Corey says:

              Technical note: Wald’s theorem doesn’t show that every admissible estimator *is* a Bayes estimator — it shows that every admissible estimator *has the same risk function* as some Bayes estimator. In technical jargon this makes the class of Bayes estimators an “essentially complete class” rather than an unqualified complete class. Does this matter? I dunno, but mathematical statisticians appear to care about the distinction.

              • Practically speaking, I think this result means that if you START with a Bayesian decision rule with a proper and mildly informative prior, then you wind up with an admissable rule, and if you start with a non-bayesian decision rule you can transform it to a Bayesian decision rule that does as well or better, instead of strictly better.

                Practically speaking, NHST usually does VASTLY worse because it doesn’t incorporate information about how bad different outcomes could be, and it can’t share information across sites. The only way you can incorporate that aversion to really bad outcomes is to require your NHST type rule to give you a very precise answer in the realm of what’s not bad, and doing so generally will require a lot of data and expensive testing, which means there will be heavy push-back from those being regulated.

                For example, you could require that lead levels be confined by a frequentist confidence interval to be 95% confidence of less that 1.1 times a specified background level at each site separately with corrections for multiple testing across sites… You’ll wind up having to take maybe 20 water samples per month per site and send them for high-precision lab testing at $200 each, for 5 locations, 20 samples, $200/ea the testing is going to cost $20k/month (numbers all made-up for illustration purposes)

                Or you could make assumptions about the different sites having a common distribution with a hyperparameter, use 2 samples at each site with a $20 low-precision test and one sample across the 10 samples randomly assigned to a $200 high precision test…. for a total cost of $400 and use a bayesian decision rule and wind up with better outcomes at 2 order of magnitude lower cost. To make this work you specify a loss function that rises rapidly above 1.1x your baseline, such as exp((x-1)/.05) and then specify that the expected loss be below some level consistent with say a gaussian distribution of mean 1x and sd 1.05x for example…

                allow(p(x)) = {integrate(p(x)loss(x),dx)/integrate(normal(x,1,.05)loss(x),dx) < 1}

                Mostly this is for completeness and/or the rest of the audience, I’m pretty sure Corey knows how this works.

              • Corey says:

                No need for the prior to be mild — the delta prior at any point in the parameter space gives rise to a constant estimator that completely ignores the data, and this is admissible for the usual loss functions. (Its risk function is non-dominated just for the parameter value at which the probability mass is concentrated. After all, if you knew the parameter value exactly, you would indeed want to ignore the data.)

              • Sure, if you know the parameter exactly you don’t bother to “estimate” it :-) Pretty much every standard prior is admissable the bit about the mildly informative prior is just what you typically want to use in a real-world regulatory scenario, where you want to use realistic levels of information, and rule out some things (like that 50% of the molecules in your water sample are actually lead) but still include all the values that any counter-party would want to argue for in the high probability region (so for example you might not rule out 1 mol/liter lead)

                Also, we should probably make a distinction between an estimation problem, where the risk function is in terms of the error in the estimate, and a generalized decision problem where the question is “perform a task or not” based on whether the expected loss is sufficiently high.

                I think the “perform a task or not based on expected loss” is actually unusual enough in real-world scenarios that it’s worthwhile making a big deal about how much better it is than the typical threshold / if-then rules are.

                For example, if you place a reasonable loss function on automobile emissions, and make people pay the expected loss, then people will automatically put money into fixing their cars to keep the emissions reasonably low, and will not pay to reduce emissions when the cost of the repair is higher than the cost of paying for the emissions. If the emissions “tax” is accumulated, it can then be used to pay for things like fixing other sources of pollutions, etc resulting overall in a better outcome than if everyone drives around at 75% of the allowable threshold for example.

              • I guess the right way to phrase it is not “perform a task if the expected loss is high enough” but rather if you have N possible tasks, including “do nothing” then perform whichever task has the lowest expected loss. in this way if fixing the car costs more than paying the fine… you pay the fine, and if it costs less then you fix the car for example.

    • Clark says:

      Other than being restrictive to some particular methodology, I see little to criticize and much to commend in these criteria. I like that the parametric anova is backed by a rank-based anova, each with adjustment for multiple comparisons. Appropriate transformations are mentioned to cope with non-normality, along with non-parametric approaches. From a frequentist perspective, I might like to see more discussion of effect sizes. Bayesian methods are not mentioned (perhaps they will be in a later revision), but there’s nothing wrong with frequentist methodology when used appropriately. An argument could be made that a restrictive but legitimate methodology reduces the likelihood of p-hacking, and puts everyone on a level playing field.

      • “but there’s nothing wrong with frequentist methodology when used appropriately”

        This is either tautological (ie. appropriately = “when it doesn’t do something wrong”) or simply wrong.

        The fact is that both the rules of logic (Cox’s theorem) and the rules for making good decisions (admissable decision rules = rules that are never dominated by some other rule) imply that the right way to make decisions is via Bayesian methods. So, instead this codifies into law that things MUST be done ILLOGICALLY.

        Although I agree that it could be fine to codify into law a procedure, it should NOT be a procedure that is provably wrong.

        • This set of rules could easily be replaced by specifying a utility function F based on the number of people in the region and the toxicological profile of the pollutants, and a likelihood function L based on some simple measurement error model, together with a sampling protocol for when and where samples should be taken. Then a procedure would be codified for estimating the populations losses, and “the plant must pay a fine equal to the losses estimated for the local population from the specified model to an agency account which will be used to provide equipment to the local water district or individual well owners to mitigate the pollution” The plant would naturally cease operations if it couldn’t make money, and low levels of pollution would still mean that through time the account would accumulate money sufficient to eliminate the pollution.

          By the same token, it’s stupid that in CA I’m allowed to drive my car around as much as I want provided it produces less than some threshold of CO2, NOx, and HC pollution, but I can’t re-register the vehicle if it tests even 1 unit higher than that threshold on any test. Instead, paying a registration/tax based on the actual estimated pollution produced is the right way to go.

          The results of this kind of frequentist threshold based regulation is typically that the maximum pollution will occur which still allows you to pass the threshold, whereas the optimal level of pollution may be much different and the cost associated to the pollution at the below threshold level is born entirely by the population.

          bad juju all around.

        • Clark says:

          We needn’t descend into a frequentist vs. Bayesianist religious war. I think we can generally agree that both approaches are effective when used correctly, and the opposite is also true. Frequentist methods are more commonly taught and used than Bayesian, so they’re a bigger target at the moment.

          A deeper problem is that far too often people not well trained in statistical methods are performing and interpreting their own analyses, poorly; we really need to do a better job of conveying to our students when they should realize they’re out of their depth and take their data to a professional.

      • Bob says:

        Without opining on the specific tests, it seems to me that these tests are exactly the type of measurement situations for which much of classical (frequentist) statistics was designed. IIRC, Student was trying to find a good way to determine the minimum number of tested samples needed to provide reasonable assurance that a batch of stout had the right amount of alcohol.

        As for the statement, “3) your statistical plan must be certified by a professional engineer, because you know engineers are really, really good at statistics.”

        I think that in most states, professional engineers are supposed to limit their advice and analysis to topics on which they are qualified. I suspect that the chances are quite high that most civil engineering firms that take contracts to design landfills will have someone on staff who is better at statistics than are many researchers who publish in psychology journals.

        An EPA website says “the average size of a CCR landfill is slightly over 120 acres with an average depth of 40 feet.” There are clearly going to be many dollars spent on civil engineering services in designing and maintaining such a facility.


        • Dalton says:

          I don’t know. I think the multiple comparisons mandate is particularly problematic when you consider that many of theses site will have multiple groundwater monitoring wells that are sampled concurrently. This policy requires you to test each one of those wells one-by-one against some “background” (either an offsite well or some threshold number), and then to control for multiple comparisons.

          But the structure of that data just screams multi-level model to me. Each well is an observational unit with sampling events through time and potentially more than one replication per sampling event (a replication here being a sample of water from a well). Why wouldn’t you want to use partial pooling in a multi-level model here? I’m Dr. Gelman can link to the paper, but this is a really good advantage of multi-level models to avoid multiple comparisons.

          We could also discuss the appropriateness of mandating that “the statistical test chosen shall be conducted separately for each constituent in each monitoring well”. This is essentially multivariate data. Each sample of water is tested for multiple chemicals. I don’t know much about coal ash, but it seems reasonable that there is more than one chemical of concern involved. It could be that elevated lead from coal ash pollution is also associated with elevated cadmium (I don’t know, I’m just picking two random elements as an an example). That seems like something worth considering if you’re goal is to detect potential pollution and to do so quickly before it becomes problem. Seems to me you’d have more power to do that by using all of the data at once.

          • Dalton says:

            Edit. I’m actually not Dr. Gelman. I meant to say “I’m sure that Dr. Gelman can link the paper”

          • Exactly, by codifying some correction for multiple comparisons and minimal use of background information across multiple locations etc the law ensures that it’s hard to disprove the null hypothesis that everything is at background levels which is “all good” for the coal burning plant.

            This is probably an instance of regulatory capture at work rather than “best practices” or “a good idea”.

  3. Rahul says:

    Ironically, in this story, I expected Teva to be the outside challenger with the cheap generic and Sandoz to be the incumbent that originally licensed the drug. Isn’t Teva the king of cheap generics.

    Interesting to see the roles reversed in this case!

    • mark k says:

      Yes, king of generics is right, and they routinely sue companies with patents and approvals to try and nullify them on IMO quite spurious grounds, but one in a hundred due to a confused judge or jury justifies it. (FWIW I’m not necessarily impartial, working in the industry for ‘brand name’ companies, and of course companies abuse patents to over-protect as well.)

      It’s not like they are just a bunch of rent seeking lawyers though–they do research arm and AFAIK are superb at process chemistry and manufacturing.

      • Rahul says:

        That’s why it tickles me to see Teva fighting to resist Sandoz’s attempt to introduce a cheap generic.

        Total reversal of roles.

        • Carlos Ungil says:

          Actually Sandoz is a pure generics company ($9.2bn in sales in 2015). After the merger with Ciba-Geigy to form Novartis, the Sandoz name was kept for the generics business. Teva, however, has both generics ($9.5bn in 2015) and proprietary drugs ($8.3bn, including $4bn of Copaxone). The latter have unsurprisingly better margins and account for over half of the profits. The generics part will be more important when the acquision of Actavis Generics is completed any day, but still Teva is one of the largest “innovative” pharma companies around.

  4. Rahul says:

    As an aside, Copaxone seems a rather iffy drug. Here’s what Wikipedia summarizes:

    “A 2004 Cochrane review concluded that glatiramer acetate “did not show any beneficial effect on the main outcome measures in MS, i.e. disease progression, and it does not substantially affect the risk of clinical relapses.

    In its pivotal trial of 251 patients, after two years it failed to show any advantage in halting disability progression.

    A double-blind 3-year study found no effect of glatiramer acetate on Primary-Progressive Multiple Sclerosis.”

    Admittedly, I am quoting selectively, and there seems evidence of an “effect” by some other rather convoluted metrics but the overall utility seems rather marginal to my naive analysis.

    Is this a case of drowning men clutching at straws & big-pharma eagerly making available such straws?

    • Jon says:

      How does this jive with the people at George Mason who complain that the FDA’s rigid requirements concerning proof of efficacy are causing people to be denied life saving treatment?

      • jrkrideau says:

        I am not sure I have understood your question but here is something that may be relevant at the social level — no relevance as far as I can see to statistical issues as such.

        • Jon says:

          There is a body of work and school that argues that toughening of FDA criteria for drug approval in the 1960s cost lives and harm health because the harm on slowing drug development and approval of useful drugs outweighs the benefits of preventing harmful or ineffective drugs from being sold.

          I recall that Sam Peltzman was one who did studies on this. More recently I have seen this view championed by Alex Taborek. I was interested in if there have been critical examinations of the studies from this group of economists and what the range of results on this issue are.

          My observations is that some of the economists think drug development is like the Star Trek episodes in which the crew of the Enterprise find the miracle cure within the hour show and almost everyone is saved.

          • jrkrideau says:

            Well to be snarky, I have often thought that economists engage in magic thinking.

            Any change that most of those economists are American?

            approval of useful drugs outweighs the benefits of preventing harmful or ineffective drugs from being sold.

            I’m Canadian and I remember the thalidomide crisis in the late 1950’s and the early 1960’s. Those victims who survived the effects of the drug still are suffering 50+ years later. I suspect the FDA’s toughened criteria was in response to that.

            I used to know one survivor who was fine except for the fact that his arms were about 25–30 cm long and his fingers were about the size of a 2-year old. Watching him using a soldering iron was amazing. For an almost identical deformity have a look at Niko Von Glascow’s photo at

            The USA was extremely lucky that it dodges that bullet, based, apparently, on the very brave stand of one FDA staffer, Dr. Frances Kelsey.

            So while I understand the economists’ argument, at least partly, I am not all that convinced.

          • Martha (Smith) says:

            Speaking from a consumer point of view, I think that giving more care to possible harm is more important that pushing drugs to approval. Examples:

            1. “Outcome switching” seems quite common in seeking drug agency approval — e.g., the re-analysis of Paxil (paroxetine) ( pointed out that the increased risk of suicidal ideation that had been noted after approval was suppressed in the original report.

            2. Anecdotally: Recently a friend had bizarre and scary side effects of a prescription sleeping pill. These were not listed as “possible side effects” on the package insert. But when I looked the drug up on the web, I found a “harm reduction” website that listed this drug as a common “recreational drug,” and said that the side effects my friend had experienced were quite common.

  5. Thomas says:

    Here’s a pretty scathing summary of the state of the profession of science.

  6. Keith O'Rourke says:

    Turns out it may all be the fault of lawyers ;-)

Leave a Reply