Problems with randomized controlled trials (or any bounded statistical analysis) and thinking more seriously about story time

In 2010, I wrote:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.” At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

Randomized controlled trials (RCTs) have well-known problems with realism or validity (a problem that researchers try to fix using field experiments, but it’s not always possible to have a realistic field experiment either), and cost/ethics/feasibility (which pushes researchers toward smaller experiments in more artificial settings, which in turn can lead to statistical problems).

Beyond these, there is the indirect problem that RCTs are often overrated—researchers prize the internal validity of the RCT so much that they forget about problems of external validity and problems with statistical inference. We see that all the time: randomization doesn’t protect you from the garden of forking paths, but researchers, reviewers, publicists and journalists often act as if it does. I still remember a talk by a prominent economist several years ago who was using a crude estimation strategy—but, it was an RCT, so the economist expressed zero interest in using pre-test measures or any other approaches to variance reduction. There was a lack of understanding that there’s more to inference than unbiasedness.
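To make the variance-reduction point concrete, here is a minimal simulation sketch (my own construction, with made-up numbers): both estimators below are unbiased for the treatment effect, but the one that uses the pre-test measure is far more precise.

```python
# Minimal sketch (made-up numbers): why ignoring pre-test measures wastes
# precision even in a randomized experiment. Both estimators are unbiased;
# the covariate-adjusted one has much lower variance.
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, true_effect = 200, 2000, 0.2

raw_est, adj_est = [], []
for _ in range(n_sims):
    pretest = rng.normal(size=n)             # baseline (pre-test) measure
    treat = rng.binomial(1, 0.5, size=n)     # randomized assignment
    y = pretest + true_effect * treat + rng.normal(scale=0.5, size=n)

    # 1) simple difference in means (the "it's an RCT, unbiasedness is enough" analysis)
    raw_est.append(y[treat == 1].mean() - y[treat == 0].mean())

    # 2) regression adjusting for the pre-test measure (ANCOVA-style)
    X = np.column_stack([np.ones(n), treat, pretest])
    adj_est.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

print(f"difference in means: mean {np.mean(raw_est):.3f}, sd {np.std(raw_est):.3f}")
print(f"pre-test adjusted:   mean {np.mean(adj_est):.3f}, sd {np.std(adj_est):.3f}")
```

Under these settings both estimators are centered on the true effect of 0.2, but the adjusted estimator's standard deviation is less than half as large.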

From a different direction, James Heckman has criticized RCTs on the grounds that they can be, and often are, performed in a black-box manner without connection to substantive theory. And, indeed, it was black-box causal inference that was taught to me as a statistics student many years ago, and I think the fields of statistics, economics, political science, psychology, and medicine are still clouded by this idea that causal research is fundamentally unrelated to substantive theory.

In defense, proponents of randomized experiments have argued persuasively that all the problems with randomized experiments—validity, cost, etc.—arise just as much in observational studies. As Alan Gerber and Don Green put it, deciding to unilaterally disable your identification strategy does not magically connect your research to theory or remove selection bias. From this perspective, even when we are not able to perform RCTs, the randomized experiment remains a useful benchmark for what an observational analysis is trying to approximate.

Christopher Hennessy writes in with another criticism of RCTs:

In recent work, I set up parable economies illustrating that in dynamic settings measured treatment responses depart drastically and systematically from theory-implied causal effects (comparative statics) both in terms of magnitudes and signs. However, biases can be remedied and results extrapolated if randomisation advocates were to take the next step and actually estimate the underlying shock processes. That is, old-school time-series estimation is still needed if one is to make economic sense of measured treatment responses. In another line of work, I show that the econometric problems become more pernicious if the results of randomisation will inform future policy setting, as is the goal of many in Cambridge, for example. Even if an economic agent is measure zero, if he views that a randomisation is policy relevant, his behavior under observation will change since he understands the future distribution of the policy variable will also change. Essentially, if one is doing policy-relevant work, there is endogeneity bias after the fact. Or in other words, policy-relevance undermines credibility.

Rather than deal with these problems formally, there has been a tendency amongst a proper subset of empiricists to stifle their impact by lumping them into an amorphous set of “issues.” I think the field will make faster progress if we were to handle these issues with the same degree of formal rigor with which the profession deals with, say, standard errors. We should not let the good be the enemy of the best. A good place to start is to write down simple dynamic economic models that actually speak to the data generating processes being exploited. Absent such a mapping, reported econometric estimates are akin to a corporation reporting the absolute value of profits without reporting the currency or the sign. What does one learn from such a report? And how can such a report be useful in doing cost-benefit analyses of government policies? We have a long way to go. Premature claims of credibility only serve to delay confronting the issues formally and making progress.

Here are the abstracts of two of Hennessy’s papers:

Double-blind RCTs are viewed as the gold standard in eliminating placebo effects and identifying non-placebo physiological effects. Expectancy theory posits that subjects have better present health in response to better expected future health. We show that if subjects Bayesian update about efficacy based upon physiological responses during a single-stage RCT, expected placebo effects are generally unequal across treatment and control groups. Thus, the difference between mean health across treatment and control groups is a biased estimator of the mean non-placebo physiological effect. RCTs featuring low treatment probabilities are robust: Bias approaches zero as the treated group measure approaches zero.

Evidence from randomization is contaminated by ex post endogeneity if it is used to set policy endogenously in the future. Measured effects depend on objective functions into which experimental evidence is fed and prior beliefs over the distribution of parameters to be estimated. Endowed heterogeneous effects generate endogenous belief heterogeneity, making it difficult/impossible to recover causal effects. Observer effects arise even if agents are measure zero, having no incentive to change behavior to influence outcomes.
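As a toy illustration of the mechanism in the first abstract (my own construction with invented parameters, not the paper's actual model): blinded subjects Bayesian-update on their own felt response, so the expected expectancy ("placebo") component differs between arms, the treatment-control contrast is biased for the pure physiological effect, and the bias shrinks as the treatment probability goes to zero.

```python
# Toy illustration (my construction, not the paper's model): blinded subjects
# Bayesian-update on their own physiological response, so expected expectancy
# effects differ across arms and the treatment-control contrast is biased for
# the pure physiological effect. All parameter values are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
physio, placebo_gain, sigma = 1.0, 0.5, 1.0   # true effect, expectancy weight, noise

def simulate(p_treat):
    treated = rng.binomial(1, p_treat, size=n).astype(bool)
    signal = physio * treated + rng.normal(scale=sigma, size=n)  # felt response

    # Blinded subject's posterior probability of being on the active drug,
    # given the felt response (normal likelihoods, known assignment probability).
    like_t = np.exp(-0.5 * ((signal - physio) / sigma) ** 2)
    like_c = np.exp(-0.5 * (signal / sigma) ** 2)
    belief = p_treat * like_t / (p_treat * like_t + (1 - p_treat) * like_c)

    health = signal + placebo_gain * belief      # physiological + expectancy part
    est = health[treated].mean() - health[~treated].mean()
    return est - physio                          # bias relative to the pure effect

for p in (0.5, 0.2, 0.05, 0.01):
    print(f"P(treat) = {p:4.2f}  bias of treat-control contrast: {simulate(p):+.3f}")
```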

As with the earlier criticisms, the implication is not that observational studies are OK, but rather that real-world complexity (in this case, dynamics of individual beliefs and decision making) should be included in a policy analysis, even if an RCT is part of the story. Don’t expect the (real) virtues of a randomized trial to extend to the interpretation of the results.

To put it another way, Hennessy is arguing that we should be able to think more rigorously, not just about a localized causal inference, but also about what is traditionally part of story time.

Comments

  1. Yes, yes, yes, yes. I have such a problem with explaining to certain potential clients what I do. If I were to call myself a “statistician” they could put me in a box they understand: “oh, he tells us what is and isn’t statistically significant” (which is *really* more or less what outsiders think of statisticians).

    But, I’m definitely *not* a statistician. My background is applied mathematics and engineering. I use statistics in the service of understanding underlying mechanistic models. I try to use “mathematical modeling and data analysis” as a name for what I do, but people don’t have a box for that.

    The point is, though, that in almost everything I do, I try to connect *substantive information about a real world process* to *measured data about that process*.

    The typical RCT is actually an attempt to explicitly AVOID that. I read the sentence “However, biases can be remedied and results extrapolated if randomisation advocates were to take the next step and actually estimate the underlying shock processes” and I said to myself *yes* this is one of the major things that is wrong with medicine. It’s as if we look at Tiger Woods hitting a golf ball, and say that the “effect” of swinging the club is to move the ball from this location at the tee, to this location near the hole….

    But the effect of swinging the club is to exert tens of thousands of pounds of force on the ball for milliseconds, thereby imparting a velocity and a spin to the ball. You’ll never *EVER* learn about the properties of special textures on the club face or different materials in the ball core or jacket if you pretend that the system is an input-output function for positions rather than a dynamic process involving forces.

      • You are a Data Scientist!
        What does that mean? Whatever the hell you want it to mean.

        Well yes, of course. “When I use a word,” Humpty Dumpty said, in a rather scornful tone, “it means just what I choose it to mean – neither more nor less.”

        It handles the situation where no exact job title exists, and “Data Scientist” is probably as good as it gets for what Daniel does. We have not evolved a very specific term for his type and range of work.

        David Wootton, in his book The Invention of Science, points out that until the middle of the 19th century we did not have an agreed-upon term for ‘scientist’.

        “Data Scientist” may win out here or something else may replace it as a description for the general type of work it implies.

        I have joke business cards printed up calling myself a “Behavioural Economist,” which is probably not all that wrong, and an “Applied Nutritionist,” which means that if you invite me for a meal, I’ll give you feedback on it.

        Neither term is a “protected title” where I live so I can use either or both as I please. The “Applied Nutritionist” might be a bit dicey in some other provinces.

    • > The typical RCT is actually an attempt to explicitly AVOID that.
      Classical statistical methods attempt to explicitly AVOID that (or at least keep them in the backroom just among the adults)

      > call myself a “statistician” they could put me in a box they understand
      That has been the hardest challenge for me in my career because, I believe, if you do a good job you’re not doing what folks expect statisticians to do (and perhaps what most statisticians do do).

      Part of the solution seems to lie in discussing and repeatedly checking the expectations of people you have never worked with before, and trying to make clear why it’s in their best (long-term) interests that you are not doing that sort of stuff. If they really need to publish something, anything, this year – try to just move on (unless you really need the work).

      • “That has been the hardest challenge for me in my career because, I believe, if you do a good job you’re not doing what folks expect statisticians to do (and perhaps what most statisticians do do).”

        Exactly. You’re a “bad statistician” in exactly the way that a person who identifies for you a way to upgrade your car with several safety features and a rebuilt engine and make it last another 10 years at 1/5 the cost of a new car is a terrible used car salesman.

        • For example, people want me to tell them how many units they need to test in order to get a statistically significant sample (a meaningless phrase, but what they mean is “in order to get a rubber stamp of approval that we did a good job and the final result is more or less unimpeachable”).

          I tell them basically that the only meaningful way to make a decision about the number of units to sample is to use decision theory, and I explain that’s basically the same idea as a cost vs. benefit analysis: we’ll have to look at the cost of doing a test and the cost of not detecting a flaw, and try to minimize the total cost (a toy version of this calculation is sketched at the end of this thread). They come back and say “we decided we could afford to test 6 units, is that ok?”

          My conclusion is basically that the statistics community has implanted in people’s head the idea that there are magic numbers that only statisticians can tell you, and once you get a statistician to tell you the magic numbers you’re all safe, and when you return your measurements to the statistician he (or she) will perform a special SAS related ritual over them and then give you permission to call anyone who calls your results into question an ignoramus.

          Sigh.

        • RCTing is just a device that provides two equal (in distribution) realities, one of which you can intervene on, hopefully without affecting the other, in the hope of getting a not-too-non-additive effect (on some scale of measurement) that one can discern beyond background variation.

          A lot more is required to get _good_ science than just that – and it’s just a solitary study.

          C.S. Peirce quote – “I do not call the solitary studies of a single man a science … [they need to be] aiding and stimulating one another by their understanding of a particular group of studies … that I call their life a science … each should understand pretty minutely what it is that each one of the other’s work consists in …”

        • This is close to home for me as I spend my life doing clinical trials. I recognise a lot of what Daniel and Keith are saying. RCTs (in medicine at least) have developed their own subculture and their own set of rituals, and there’s a standard accepted way of doing things. In some ways this is good, but it also discourages thinking creatively about problems, and some of the traditional methodology doesn’t make a lot of sense, especially if applied by rote to every problem. There are strong incentives to maintain the status quo: RCTs, especially the big ones, are seen as prestigious and a great source of big research grants and 4* papers for the REF (sorry, UK-specific), so nobody wants to take a risk by going against the accepted traditions.

          So – we need more “bad statisticians” in trials!
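Here is a toy version of the cost-minimization framing Daniel describes above (the costs and probabilities are hypothetical placeholders, not from the comment): choose the number of test units by minimizing expected total cost rather than by asking for a “statistically significant sample.”

```python
# Sketch of the decision-theoretic framing described above (all costs and
# probabilities here are hypothetical placeholders): pick the number of test
# units by minimizing expected total cost = testing cost + expected cost of a
# flaw that slips through.
import numpy as np

cost_per_test = 200.0      # cost of testing one unit
cost_of_missed_flaw = 5e5  # downstream cost if a real flaw goes undetected
p_flaw = 0.10              # prior probability the design/lot has a flaw
p_detect_per_unit = 0.15   # chance one tested unit reveals the flaw, if present

def expected_total_cost(n):
    p_missed = p_flaw * (1 - p_detect_per_unit) ** n
    return cost_per_test * n + p_missed * cost_of_missed_flaw

ns = np.arange(0, 201)
costs = np.array([expected_total_cost(n) for n in ns])
best = ns[costs.argmin()]
print(f"cost-minimizing number of units to test: {best}")
print(f"expected cost at n={best}: {costs.min():,.0f}  vs at n=6: {expected_total_cost(6):,.0f}")
```

With these particular numbers the minimum lands around a couple dozen units; the point is only that the answer falls out of the costs, not out of a significance table.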

  2. I feel like this is a lot of beating around a simple bush. Before knowing you have something worth acting upon, you need:

    1) a theory that makes some sort of precise predictions
    2) carefully collected data to compare to the predictions

    Maybe whatever researchers do is some heuristic that yields appropriate behaviors under some conditions (it doesn’t require much data/modelling to convince yourself that jumping in front of cars is bad for your health), but I really question whether higher degrees and nationally-funded research programs should be built upon such activities.

  3. This is why the platinum standard is a randomized controlled trial with a _crossover_ design.

    Differences in subject (or experimenter) Bayesian updates should be reflected in different effects depending on whether subjects were in the treatment group pre or post crossover.

    Note that this also addresses your previous objection about between subject versus within subject.

    I’m not saying crossover is the end all be all. Just that, in practice, RCT folks do make some attempt to overcome these challenges.

    [I apologize if this duplicates someone else’s comment. I did a keyword search, but such are only modestly effective.]
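For reference, a minimal sketch of the standard two-period, two-sequence crossover contrast the comment invokes (illustrative numbers only; it ignores carryover and the belief-updating complication discussed above): each subject serves as their own control, so subject-level effects drop out.

```python
# Minimal crossover sketch (illustrative, not from the comment): subject
# effects and the common period effect cancel in the standard contrast.
import numpy as np

rng = np.random.default_rng(2)
n, true_effect, period_effect = 100, 1.0, 0.3

subj_ab = rng.normal(size=n)   # subject baselines, sequence treatment -> control
subj_ba = rng.normal(size=n)   # subject baselines, sequence control -> treatment

# sequence A->B: treatment in period 1, control in period 2
y1_ab = subj_ab + true_effect + rng.normal(scale=0.5, size=n)
y2_ab = subj_ab + period_effect + rng.normal(scale=0.5, size=n)
# sequence B->A: control in period 1, treatment in period 2
y1_ba = subj_ba + rng.normal(scale=0.5, size=n)
y2_ba = subj_ba + period_effect + true_effect + rng.normal(scale=0.5, size=n)

# Standard crossover contrast: half the between-sequence difference of the
# within-subject (period 1 - period 2) differences.
est = 0.5 * ((y1_ab - y2_ab).mean() - (y1_ba - y2_ba).mean())
print(f"estimated treatment effect: {est:.3f} (truth: {true_effect})")
```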

  4. > From a different direction, James Heckman has criticized RCTs on the grounds that they can be, and often are, performed in a black-box manner without connection to substantive theory. And, indeed, it was black-box causal inference that was taught to me as a statistics student many years ago, and I think the fields of statistics, economics, political science, psychology, and medicine are still clouded by this idea that causal research is fundamentally unrelated to substantive theory.

    Heckman has a number of papers connecting black-box causal inference methods widely used by economists (instrumental variables, matching, etc.) to economic theory (namely, structural models).

  5. Hennessy’s point about placebo-controlled trials is interesting. I would restate it as: a pill with active ingredient can be considered two simultaneous treatments–a biological treatment (b) and a psychological treatment (p). If the treatments are additive and do not interact, then a randomized trial with placebo control gives you the biological effect by difference in differences. If they do interact, we need more. We can estimate E[Y(b=1,p=1)] from the active arm of a trial, E[Y(b=0,p=0)] from a no-treatment arm, and E[Y(b=0,p=1)] from a placebo arm. But we are interested in the pure biological effect E[Y(b=1,p=0)], which is a treatment we cannot actually assign (unless we inject subjects in their sleep). (A toy numerical version of this decomposition appears at the end of this thread.)

    • >”a randomized trial with placebo control gives you the biological effect by difference in differences.”

      This is that insidious error. There is a whole iceberg of issues about how well what you measured maps to what you actually care about, etc. It is usually the case that all sorts of mundane things can explain “a difference”. The stats are a tiny, tiny part of figuring out what is going on.

      • Randomized experiments give you unbiased estimates of the causal effect of the treatment that was randomized on the outcome that was measured in whatever population the sample can be considered to be a draw from. There are of course many issues surrounding whether the treatment in the study is really the treatment of interest, the measured outcome is really the outcome of interest, and whether the implicit population is really the population of interest. Those issues are not what my comment was about.

        • >”Randomized experiments give you unbiased estimates of the causal effect of the treatment that was randomized on the outcome that was measured in whatever population the sample can be considered to be a draw from.”

          Not in principle. As an easy example, say you give “injection B” to some Ebola infected macaques but not others, and euthanize them when they are judged to be “very sick”. The injection is the treatment and “% dead after 10 weeks” is the outcome.

          Now add that the people judging whether it is time to euthanize are not blinded (perhaps no attempt was made, or perhaps the macaques that get the injection all have some “side effect” making it impossible, eg their skin turns blue, which is part of the treatment).

          If the % dead after 10 weeks is lower in the vaccinated macaques than controls, can you deduce this is because of the biological effect of the vaccine? I would say no. Instead it could be the people deciding when to euthanize treated the two groups differently.

          That example is a toy one, but if you link to a real-life paper that you think demonstrates your claim, I am pretty sure I will see it suffers from the same type of problem (not necessarily related to blinding).

        • Don Rubin (and I am sure others) pointed out that in actual clinical trials you get unbiased estimates of the causal effect of the treatment only if the causal effect is zero. In actual trials there is non-compliance, drop-out, psychological effects of being in an RCT, etc., etc., that will bias the estimate of any truly non-zero treatment effect.

    • Dear Z–If you look to our model, our point is that even if the mental effect is assumed to be additive, as we do, it will not difference out because it is not independent of the pure physical effect. That is, a rational Bayesian agent will derive different beliefs regarding efficacy from a different physical effect. In other words, the mental effect is not independent of the physical effect, quite the contrary, even if the agent is truly blinded to whether he is in the treatment or control group, and even if there is zero attrition. If I draw my physical effect from a different urn, even if it has the same mean physical payoff, the distribution of my beliefs regarding efficacy, and hence hope, will differ.
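A toy numerical version of the decomposition in Z's comment above (the mean potential outcomes are invented): with a biological-by-psychological interaction, the active-versus-placebo contrast no longer equals the pure biological effect E[Y(b=1,p=0) − Y(b=0,p=0)].

```python
# Sketch (invented numbers) of the decomposition above: mean potential
# outcomes indexed by a biological component b and a psychological component p.
# The arms we can actually run (active, placebo, no-treatment) don't identify
# the "pure biological" contrast when b and p interact.
Y = {
    (0, 0): 10.0,   # no treatment
    (0, 1): 12.0,   # placebo only
    (1, 0): 15.0,   # drug without expectancy (the arm we cannot assign)
    (1, 1): 19.0,   # active pill: biological + psychological + interaction
}

active_vs_placebo = Y[1, 1] - Y[0, 1]   # what a placebo-controlled RCT reports
pure_biological = Y[1, 0] - Y[0, 0]     # what we would like to know
interaction = Y[1, 1] - Y[1, 0] - Y[0, 1] + Y[0, 0]

print(f"active - placebo arm:   {active_vs_placebo}")
print(f"pure biological effect: {pure_biological}")
print(f"b x p interaction:      {interaction}")   # the gap between the two
```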

  6. RCTs of antidepressants are tainted by the fact that the placebos are typically inert, while the drugs often (or usually) have noticeable side-effects. Patients in the experimental arm of the study tend to say to themselves, “When I take this pill, I get an upset stomach, my head hurts, and I get sleepy. Wow, this must be a very powerful drug, and I bet I’m going to feel better soon”. When you combine this source of bias with other sources (such as the file-drawer effect), you end up with the quite plausible theory that antidepressants are nothing more than glorified placebos.

  7. Randomized controlled trials have the following problems when used to study chronic diseases:

    Indiscriminate application of a treatment. When a chronic disease is caused by an imbalance, a cure for the disease is to correct the imbalance. The correction cannot be too little or too much. If one’s cancer is caused in part by too much omega-6 fatty acids and too little omega-3 fatty acids, a correct cure must be directed at lowering the omega-6 fatty acids and increasing the omega-3 fatty acids. The corrective measure works only on those cancer patients who have this imbalance. However, if this measure is applied to patients who have a perfect omega-6 to omega-3 fatty acid ratio, it will make their conditions worse. When a treatment is evaluated in a randomized controlled trial, the treatment is indiscriminately applied to all trial subjects. It is expected that some subjects will show positive effects and others negative effects. If a trial’s outcomes are 20% positive effects and 20% negative effects, with 60% no effects, the treatment’s mean effect will be 0% due to statistical averaging (a toy version of this arithmetic is sketched after this comment). Such a result is wrong because this treatment can cure 20% of patients if it is applied only to well-matched patients.

    The treatment’s effect is too weak. In treating a subtle imbalance in biochemical processes or tissue structure, the current clinical trial cannot deal with a massive number of interfering factors. If a health property such as survival time is used as a measure, the survival time depends on a large number of factors such as age, personal health, sex, genetics, disease condition, exercise, diet, activity level, emotional condition, chronic stress, etc. If a trial tries to study a diet, the effects of the diet are interfered with by all the other factors that happen to exist randomly. Thus a clinical trial is unable to correctly determine any of those factors. Therefore, medicine has to claim “no evidence” that they can be used to cure diseases. I have proved those claims were false.

    Interfering factors ruin clinical trial outcomes. When clinical trials are used to study one single factor, a large number of other uncontrolled factors work like interfering factors. Whenever a clinical trial is used, a statistical analysis is conducted to determine if treatment effects exist. However, the trial data actually “bundle” the effects of all interfering factors into an apparent experimental error. In conducting the statistical analysis, the measured treatment effect (e.g. the statistical average) is compared to the experimental error. Only if the treatment’s effect is sufficiently larger than the experimental error does the statistical analysis affirm the treatment’s effect. If the treatment’s effect is close to or even smaller than the experimental error, the trial just “regards” the treatment’s effect as being caused by “the experimental error,” thus failing to recognize it.

    Wu, Jianqing and Zha, Ping, Randomized Clinical Trial Is Biased and Invalid In Studying Chronic Diseases, Compared with Multiple Factors Optimization Trial (November 4, 2019). Available at SSRN: https://ssrn.com/abstract=3480523 or http://dx.doi.org/10.2139/ssrn.3480523
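For what it's worth, the averaging arithmetic in the first point above is easy to make explicit (the subgroup shares and effect sizes below are hypothetical): opposite-signed subgroup effects can cancel in the overall average even when a well-matched subgroup would benefit.

```python
# Toy version of the averaging arithmetic in the comment above (shares and
# effect sizes are hypothetical): subgroup effects of opposite sign can cancel
# in the overall average treatment effect.
shares  = {"helped": 0.20, "harmed": 0.20, "unaffected": 0.60}
effects = {"helped": +1.0, "harmed": -1.0, "unaffected": 0.0}

average_effect = sum(shares[g] * effects[g] for g in shares)
print(f"average treatment effect across everyone: {average_effect:+.2f}")
print(f"effect among the well-matched subgroup:   {effects['helped']:+.2f}")
```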
