
No guarantee

From a public relations article by Karen Weintraub:

An anti-aging startup hopes to elude the U.S. Food and Drug Administration and death at the same time. The company, Elysium Health, says it will be turning chemicals that lengthen the lives of mice and worms in the laboratory into over-the-counter vitamin pills that people can take to combat aging. . . . The problem, [startup founder] Guarente says, is that it’s nearly impossible to prove, in any reasonable time frame, that drugs that extend the lifespan of animals can do the same in people; such an experiment could take decades. That’s why Guarente says he decided to take the unconventional route of packaging cutting-edge lab research as so-called nutraceuticals, which don’t require clinical trials or approval by the FDA.

So far so good. But then this:

This means there’s no guarantee that Elysium’s first product, a blue pill called Basis that is going on sale this week, will actually keep you young.

Ummm . . . do you think that if the product did have a clinical trial and had been approved by the FDA, there would be a “guarantee” it would actually keep you young?

P.S. As an MIT graduate, I’m disappointed to see this sort of press release published in Technology Review. Hype hype hype hype hype.

77 Comments

  1. Seb says:

    What Elysium is doing is nothing more than selling snake oil to the gullible, monetizing the existential fear of death.

  2. Keith O'Rourke says:

    If it “was approved by the FDA” and it was the usual submission involving a series of trials, with in the end two confirmatory trials being statistically significant – then it has survived some reasonably severe testing (using Mayo’s term) of a positive versus negative primary effect.

    There is a move underway to modernize drug regulations, for instance making them ongoing over time (e.g. a life-cycle of evidence), which would likely be much better for a product like this.

    • Jonathan (another one) says:

      I think Andrew’s point is that “a positive versus negative primary effect” for all patients is really far from a guarantee for any particular patient. Really, really far.

      • Keith O'Rourke says:

        You are right, I should have added: positive versus negative primary effect on average for patients like those recruited into the trial, treated similarly as in the clinical trials conducted in leading hospitals, …

        • Over a period, in this case, of say 60+ years. If you had started such a trial in 1900 with 10mg starch pills, you’d probably have seen a massive effect by 1960: even with random assignment, *something* was bound to be different between the two groups, and if that thing amplifies exponentially at a rate of, say, 1%/yr, then 60 years later 1.8 times as many people in one group will have X vs. the other ;-)

          You can only really do what is normally done in clinical trials because the timeframe of the trial is so short; a couple of years isn’t enough time for “all else equal” to be severely broken by background time-trends.

          • Keith O'Rourke says:

            Not everything has to be done at once and up front, and not everything requires a randomized trial (yikes, don’t quote me).

            Just the second entry to pop up in my Google search (the first was just on Japan): http://law.stanford.edu/wp-content/uploads/sites/default/files/publication/677385/doc/slspublic/Mello_New%20horizons%20in%20pharmaceutical%20regulation.pdf

            Those interested will want to do a better literature search.

          • Martha (Smith) says:

            +1 to Daniel’s comment

          • Dean Eckles says:

            Huh? This doesn’t make sense unless you think the members of each group are interacting with each other. Fisherian randomization inference (and standard statistical inference) will work just as well 60 years later.

            • Randomization *approximately* balances every unknown. But it doesn’t *perfectly* balance *anything*. If that imbalance in the unknowns results in anything that has long term growth involved, the extra time will accentuate that imbalance.

              It’s the difference between exp(0.01 × 0.1) ≈ 1.001 (an imbalance that grows at 1%/yr for about a month)

              and exp(0.01 × 60) ≈ 1.8 (an imbalance that grows at 1%/yr for 60 years).

              Two groups can diverge simply because they’re given more time to diverge, not because of any actual treatment.
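              The arithmetic in this comment can be checked in a few lines. This is just an illustrative sketch: the 1%/yr growth rate and the 0.1-year stand-in for “a month” are the commenter’s own example numbers, and the function name `amplification` is mine.

```python
import math

def amplification(rate_per_year, years):
    """Factor by which an initial imbalance grows, compounding continuously."""
    return math.exp(rate_per_year * years)

# An imbalance growing at 1%/yr is invisible on a month-like timescale...
print(round(amplification(0.01, 0.1), 3))   # 1.001
# ...but nearly doubles over a 60-year one.
print(round(amplification(0.01, 60), 1))    # 1.8
```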

            • Andrew says:

              Dean:

              I’m not a big fan of randomization inference in this or any other setting. (Readers of my review of the Imbens and Rubin book might recall this is one place where I disagreed with those experts.) For hypothesis testing, I just don’t care because nothing has zero effect anyway, and for interval estimation I don’t like randomization inference because in general it’s so dependent on all the auxiliary assumptions of the model. The latter is what Dan Lakeland was getting at in one of his memorable characterizations of the null hypothesis as “a specific random number generator.”

              Randomization inference is a cool idea, and I too was impressed when I first heard about it (from Rubin, as it happens). But the more I think about it, the less sense I think it makes in most real-world applications.

              • Dean Eckles says:

                Yeah, I figured you wouldn’t. But I pointed to randomization inference just to highlight that quite minimal assumptions are required. If you prefer using the CLT here, that would also work (though the suggested model may seem to not have moments…). As I said, independence of units is really what is required.

                Note that in the case of a randomized experiment, there is “a specific random number generator” we have in mind — the one that we used to assign units to treatment and control. And it is reasonable to wonder, did my treatment do anything at all? That allows exact tests of the sharp null of no effects.

                Whatever your views of “exact” tests etc., I don’t think the suggested argument really works at all. Daniel, maybe you can say more about the individual level model you have in mind that would generate such a problem.

              • Andrew says:

                Dean:

                But I have no interest in a test of the sharp null hypothesis of zero effects. I just don’t care! I’m pretty sure this treatment will have effects, they will just be highly variable from person to person and situation to situation.

              • Dean Eckles says:

                Though it might be a good start to reject such a model…

                Also, asymptotically, randomization inference (ie permutation tests) with appropriate test statistics also only reject under a difference in means — that is, ATE=0.

                Anyway, my point was simply to say that Daniel Lakeland’s proposed problem is not real in the absence of violations of key independence assumptions — not to say you need to use randomization inference.
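                For readers unfamiliar with randomization inference, here is a minimal sketch of the kind of permutation test being discussed: under the sharp null of no effect, group labels are arbitrary, so we reshuffle them and see how extreme the observed difference is. The function and the simulated data are my own illustration, not anything from the thread.

```python
import random

def permutation_p_value(treated, control, n_perm=2000, rng=random.Random(0)):
    """Sharp-null permutation test: reshuffle group labels and count how
    often the shuffled |difference in means| >= the observed one."""
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(treated) - mean(control))
    pooled = treated + control
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_t]) - mean(pooled[n_t:])) >= observed:
            hits += 1
    return hits / n_perm

rng = random.Random(1)
control = [rng.gauss(0, 1) for _ in range(50)]
null_treated = [rng.gauss(0, 1) for _ in range(50)]   # no true effect
shifted_treated = [x + 5 for x in null_treated]       # huge true effect

print(permutation_p_value(null_treated, control))     # typically not small
print(permutation_p_value(shifted_treated, control))  # very small
```

Note that the test’s validity rests on exactly the independence assumption Dean names: the observations must not influence one another.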

              • george says:

                +1 to Dean’s comments. For every random assignment that leads to a 1.8-fold difference between groups over 60 years, there must be many more that lead to much smaller differences.

                Also, the goal of randomization isn’t to perfectly balance all covariates – it doesn’t – it is to prevent anyone from biasing the allocation of participants to different treatments.

                For a quick summary of these points and more see e.g. this talk by Stephen Senn.

              • george: “For every random assignment that leads to a 1.8-fold difference between groups over 60 years, there must be many more that lead to much smaller differences”

                Yes, but that means that we’ll conclude the 1.8x difference is *unusual* under random re-assignment and hence “something must be going on”. And, in fact, something *is* going on, it’s just not causally linked to the thing we’re testing. See below for examples: red food coloring (the treatment effect is real, but not from the starch!), legal changes (over 60 years there are 60 opportunities for Congress to accidentally pass a law that affects one arm of the study more than the other), social changes (over 60 years there are 60 opportunities for public opinion to shift on homosexuality, immigrants from Ireland, people with dark skin, males vs. females in the workplace, or to create or destroy types of jobs, etc.), thereby accidentally affecting one arm more than the other.

                With more opportunities for either a single event or slow time evolution to systematically affect one arm vs. another, we’re opening ourselves up to “discovering” that our pill causes something when in fact something else entirely causes it.

                There are ways to design tests that make it less likely for you to discover something, but as t goes to infinity there WILL be a difference between the two arms even though it’s not caused by the pill/policy/treatment. That means that if you want to avoid false discovery of a causal effect of your pill using an NHST-type arrangement, then as t goes to infinity your test’s probability of rejecting the null needs to go to zero, and therefore the probability that you’re wasting everyone’s effort/time/money goes to 1.
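                This claim can be illustrated with a toy simulation. Everything here is hypothetical: each subject carries a latent per-year growth rate that randomization balances only approximately, and with no treatment effect at all, the spread of between-arm gaps still grows enormously with follow-up time.

```python
import math
import random
import statistics

rng = random.Random(42)

def null_trial_gap(n_per_arm, years, rate_sd=0.01):
    """Difference in mean outcomes between two arms of a trial with NO
    treatment effect; each subject's outcome compounds a latent rate."""
    rates = [rng.gauss(0, rate_sd) for _ in range(2 * n_per_arm)]
    rng.shuffle(rates)
    arm_a = [math.exp(r * years) for r in rates[:n_per_arm]]
    arm_b = [math.exp(r * years) for r in rates[n_per_arm:]]
    return statistics.mean(arm_a) - statistics.mean(arm_b)

short_gaps = [null_trial_gap(100, 1 / 6) for _ in range(300)]  # 2-month trials
long_gaps = [null_trial_gap(100, 60) for _ in range(300)]      # 60-year trials
print(statistics.stdev(short_gaps), statistics.stdev(long_gaps))
```

Under these assumptions the 60-year gaps are orders of magnitude more dispersed than the 2-month ones, which is the “either false positives or enormous intervals” dilemma in miniature.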

              • george says:

                Yes, but that means that we’ll conclude the 1.8x difference is *unusual* under random re-assignment and hence “something must be going on”.

                Um… that means the unusual difference will only happen unusually, and you’re not raising real concerns.

                Please read the Senn talk, he has some useful papers too about what randomization provides, and what it doesn’t.

            • In 1930 you take 10 year olds and assign a random subset to take 10mg starch pills. In 2000 at age 80 years old you look at longevity having tracked all of them through time. You find that fewer people taking starch died, and that the average age at death for those who did die was older for starch takers.

              It’s not that you can’t detect differences (in fact, depending on your statistical tests it will be very easy to detect differences), or that you can’t design a statistical test (randomization testing etc.) that will be very robust and therefore fail to detect statistical significance. It’s that if you do detect statistical significance you can’t assume causality of the treatment, when even tiny differences in unknowns can amplify in time. Was the effect due to the starch pill? Or was it due to the fact that by random chance we assigned 10 more people to take starch who were ineligible for the draft? Or 10 fewer people who were employed in an industry that collapsed as the economy changed through time? (A randomization test can help.) Or that subtle differences in the way we handled people in the two arms through 60 years (the relationships they formed with the monitoring staff) were different? (A randomization test doesn’t help; the treatment had an effect, but it wasn’t the pill.) Or that access to advances in health care wound up being different due to the arm they were assigned to (again the treatment had an effect, but not due to the pill)? Or that through time people in one arm or another aged differently due to inherent differences in the groups that weren’t balanced by the random number generator (we got “unlucky”)… or whatever.

              The collapse of the travel agency industry or the interaction of the starch pill treatment with the Gramm–Rudman Act are not causes we consider in a 2 month trial… but we HAVE to consider them in a 60 year trial.

              The point about randomization inference is that it might be more robust and therefore not show statistical significance for even moderately large differences…. But if it does show statistical significance, my point is that we still can’t conclude a causal effect *of the pill*; there’s just too much else that could have been the real cause through time.

              I’m not saying we can’t run long term trials, just that we’ll need to design them in a very different way from “what is usually done” in a 2 month or 1 year trial. We’ll explicitly need to take the passage of a lot of time into account, and we’ll need to take our growth in knowledge into account. In 1930 maybe no one thought to “control for smoking,” but later in time we will need to. From a Bayesian perspective, the conclusions of the study will depend on the background knowledge we accumulate during the 60 years, so that we can have very different conclusions at different time-points.

              • For example, suppose you innocently choose to color the pill red in 1930… But 30 years later you find out that the red food coloring was cancer causing! Doh. Back to the 60 year drawing board!

              • Carlos Ungil says:

                > Or that subtle differences in the way we handled people in the two arms through 60 years (the relationships they formed with the monitoring staff) were different (randomization test doesn’t help, treatment had an effect, but it wasn’t the pill),

                This is why double blinding is used: to avoid handling people in the two arms differently. But it’s true that in some cases it is hard to ignore who is in the treatment group and who is in the control group.

                > or that access to advances in health care wound up being different due to the arm they were assigned to (again treatment had an effect but not due to the pill)

                Or maybe a serial killer is targeting all the people who participated in the trial and didn’t take the starch pill.

                > For example, suppose you innocently choose to color the pill red in 1930… But 30 years later you find out that the red food coloring was cancer causing! Doh. Back to the 60 year drawing board!

                This is an example where the effect is real and due to the pill, not a spurious relationship due to a problem with RCTs.

                It seems now you want to move the goal posts to causal inference, so I guess we will agree at some point and I’m ready to concede defeat now.

              • Carlos: I don’t think of it as moving the goal posts. That was my intention initially, though perhaps not well communicated. The *point* of RCTs is causal inference, and I’m just saying that over long periods of time the causes get muddled. The red food coloring effect really is caused by the pill, but it’s not caused by the *starch*. The assumption that “color doesn’t matter” is probably true for 2 months, but not for 60 years, when the color is a mild carcinogen and its effect accumulates.

                Serial killers targeting one arm vs another was meant to be a joke, but it’s not entirely without merit, I mean suppose rather than serial killers it’s the congress? Specifically congress passes a law that happens to disproportionately affect one arm? like maybe deregulating the telephone company and there happens to be a few more telephone employees in one arm? There are then a few more suicides. We conclude what? 10mg starch pills *cause* suicide? Whoops! Over 60 years there are MANY opportunities for random events to accidentally affect one arm in a consistent way more than another, over 2 months or a year, not so much.

              • Dean Eckles says:

                All of the versions of the story you’ve offered that would actually pose a problem do involve a causal effect of the assignment to treatment or control. For example, the red food coloring. And conceptually these are just as much a problem at any time scale.

                That is, the problem Daniel Lakeland posits about long-term trials does not exist. I’m stating this so directly because I don’t want readers to be misinformed about the foundations of inference in randomized experiments.

              • Dean Eckles says:

                “Specifically congress passes a law that happens to disproportionately affect one arm? like maybe deregulating the telephone company and there happens to be a few more telephone employees in one arm? There are then a few more suicides. We conclude what? 10mg starch pills *cause* suicide? Whoops! Over 60 years there are MANY opportunities for random events to accidentally affect one arm in a consistent way more than another, over 2 months or a year, not so much.”

                This example does not work at all. All the statistical inference will still have the advertised properties unless the observations are non-independent, with dependence within groups (e.g. we put all the treatment people together and one suicide causes others).
                The other options you have described all involve causal effects of assignment to treatment.

              • Dean,

                Problems occur even if the statistical properties of the inference hold. If the confidence interval widths go to infinity due to the myriad variations available in time, then you will fail to detect all but the very largest effects… and then you are guaranteed to find out nothing about your treatment unless it has an enormous effect… and then you’re guaranteed to have wasted everyone’s time and money, but you’re also “guaranteed” to have proper coverage. It’s just that proper coverage could easily become practically useless.

                The treatment effect of red food coloring as an example of course in principle applies to short term trials as well. If you put 10mg of perfectly good allergy medicine in a pill containing cyanide without realizing cyanide is a poison you’ll accidentally have a problem. The difference is that over 60 years things that have negligible effect on 1yr time scales have potentially serious effects on the 60 year time scale, so the number of things that can make your trial go wrong is much larger.

                For example, you send off two space probes, each in opposite directions, and you have a group of people pray for probe A to go off course. Years later you find that indeed it is off course… is that the prayer? or is it micro-meteorites and launch stresses deforming the shape of the probe so that its radiation pressure isn’t isotropic causing 100 nano-newtons of asymmetric radiation pressure that accumulates over 100 years? Of course, N is too small, let’s launch 200 probes and pray for 100 of them to go off course… We discover that … sure enough given 100 years, everything goes off course… therefore we have no chance to discover whether prayer works… and by the way, it only cost 200 x 10^8 dollars!

                Every single thing that happens EVER in the whole universe is basically due to the state of the universe being S(0) at time t=0 and evolving according to the laws of physics, which are pretty much a nonlinear chaotic dynamical system. The Lyapunov exponent of every two subsystems we care to think about is greater than zero, which means that any two sub-systems with infinitesimally close initial conditions diverge exponentially in time. You can’t make that stop by using randomization tests, you can only have your randomization tests widen their accept/reject regions exponentially… So 60 years later, for example, if your confidence intervals of survival are +- 15 years for a starch pill, you can pretty much guarantee that no matter what miracle cure you tried, even if it works, you’ll fail to detect it, which means you didn’t make an inferential mistake, but you sure as hell made a budgeting mistake.

                You seem to think that I’m saying “the math of randomization doesn’t work for long term tests,” which is *not* what I’m saying. I’m saying that the attribution of the effect to the cause we have in mind (consumption of starch, for example) doesn’t necessarily work for long term tests, because either (a) things will diverge and our randomization test will detect it, and it may well be due to other small-perturbation-sized causes rather than to our treatment (i.e. it’s not the *starch* but the very mild toxicity of the coloring, whose effect is normally thought to be zero on a 2-month scale but isn’t over the long run); or (b) there will be enough noise accumulation in the diverging paths of the individuals that we’re guaranteed our confidence interval gets too big to be helpful and we wasted all our money; OR (c) we do something very particular to carefully model all the perturbations we can think of so we can rule them out. That’s a way you could deal with the divergence of the spacecraft, for example: posit a deformation mechanism and see if, calculated over 100 yrs, it would produce the observed deflection.

                I’m not saying we can NEVER get information out of 60-year RCTs, but I am saying that you had best do a scientifically well-thought-out modeling process respecting all the perturbation-sized effects you can think of, because a 2-month or 1-yr trial can use the asymptotic perturbation-effect = 0 approximation, but at some point, as the trial timescale increases, that will no longer give you the sensitivity you need, and this is a necessary consequence of basic laws of physics.

              • You schedule group A to always come to the clinic on a Tuesday, assuming that day of the week has zero effect; they all arrive at the World Trade Center on Tuesday, Sept 11, 2001 and are wiped out by a terrorist attack… is that caused by taking the starch pill? In a trial of two drugs, one group takes a pill that has a more bitter flavor, so more of them secretly use mouthwash, and it turns out that ingredient FOO in the mouthwash is very slightly carcinogenic… One group contains more people with long foreign names; every time they call the clinic they spend 5 more minutes on the phone spelling out their name, and it turns out that cell phones really do produce a very slight increase in cancer incidence after 60 years × 1 call per month × 2 extra minutes spelling their name and being put on hold and etc.…

                there are SO MANY potential very low probability per unit time events that have negligible effect on 2 month time scales whose probability of affecting the outcome goes to 1 asymptotically as time goes to infinity. Randomization inference can still work fine and yet we can be guaranteed to be wasting 60 years of people’s lives and huge quantities of money because in the end we can’t conclude anything.

              • Dean Eckles says:

                I think you’re continuing to combine two very different cases, which perhaps is what has made the argument hard to follow:

                1. There is a causal effect of random assignment, but there are construct validity problems — ie your assignment has some nuisance mechanisms by which it affects the outcome. For example, “You schedule group A to always come to the clinic on a Tuesday assuming that day of the week has zero effect, they all arrive at the world trade center Tues Sept 11 2001 and are wiped out by a terrorist attack”.

                2. There isn’t a causal effect of random assignment, such as “one group contains more people with long foreign names, every time they call the clinic they spend 5 more minutes on the phone spelling out their name, it turns out that cell phones really do have a very slight increase in cancer incidence after 60 years x 1 time per month x 2 extra minutes spelling their name and being put on hold and etc”.

                It might be good to clarify the intuition you have about why these imbalances are supposed to be larger over time compared with any actual effects of interest.

              • Dean, thanks for helping me understand how I’m confusing you. I agree at least for the moment, with your decomposition. Here are my thoughts:

                First off, we’re talking here about the context of something like a supplement that has a small effect over a lifetime; why bother running a 60-year study if it’s something that has a strong, fast effect like Benadryl or Penicillin? The relevant parameter is actually the ratio (lifetime of the person) / (time it takes to accumulate an effect of size X that we care about), or the inverse of that ratio.

                So, in the context of where this ratio is ~ 100 we can run a couple month RCT and ignore all the small things that might confound us whose ratio is say 1 or 2 or 0.5 and treat them as randomly approximately balanced and small. Even if there are a lot of them, they’re each going to contribute something small because there isn’t time for them to get big.

                But, when we’re running a trial on something like a supplement supposed to extend your lifetime QALYs by 3-5 years over 80 years we’re now talking about comparing our effects in the context of how many other effects that have similar sized effects on similar time-scales?

                Consider the function N(t*), which measures how many different things there are that might randomly induce fluctuations in the outcome of the size of interest at the timescale ratio t*. My intuition is that this goes to infinity as t* goes to 0 (there are really a LOT of things that do almost nothing over a lifetime), but as the ratio t* gets into the range of 1 (i.e. it takes a whole lifetime to get an effect) there are still a LOT of things that might randomly do this, and they vary by time and place.

                When N is big enough random assignment is going to leave some of them imbalanced in such a way that we will have effects of the size we’re concerned with due to these random background things *almost every time we run the trial* and therefore we have a severe “lack of power”. When that occurs, it’s your condition (2) and in essence we shouldn’t run the trial for the same reason we shouldn’t run 2 month trials with 1 person, you just can’t get enough information from a single person’s outcome.

                Can you get around this by running a bigger trial? Maybe, but the intuition that my function N(t*) goes toward infinity as t* goes to 0 suggests that long trials would require potentially MANY participants, and combined with the cost of long time, the cost of large n compounds to make them prohibitively expensive. For example, to see if you could on average extend people’s lives by on the order of 1 year by having them do jumping jacks for 1 minute every morning, you might need to run the trial on every single living person on the earth, randomly assigning them to yes or no jumping jacks…. The same is basically true for any treatment supposed to increase your life by 1 QALY.
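                The “lack of power” intuition can be roughed out with the textbook two-arm sample-size formula: required n grows with the square of the outcome SD, so if accumulated background variation inflates that SD over a long trial, n blows up accordingly. The SD values below are hypothetical, chosen only to show the scaling.

```python
import math

def n_per_arm(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate per-arm sample size to detect a mean difference delta,
    two-sided alpha = 0.05 and 80% power (normal-approximation formula)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Detecting a 1-year mean gain in lifespan as the background SD grows:
for sigma in (5, 15, 45):
    print(sigma, n_per_arm(sigma, 1.0))
```

Tripling the background SD multiplies the required n by roughly nine; a background SD that keeps growing with trial length pushes n toward the “every living person” regime described above.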

                In some sense, we’re talking about the “curse of dimensionality” as N(t*) is measuring the “dimensionality of the random confounding factors”. In high dimensions, every case is an outlier as Bob Carpenter recently wrote about.

                When it comes to your part (1), where there is a causal effect of treatment but it’s not related to the causes we’re interested in: suppose for example that giving people vitamin Q (a made-up substance) was hugely effective back in 1900 when diets were deficient in vitamin Q, but by 1970, when people had better diets, giving vitamin Q would have no effect. It is therefore not the effect of vitamin Q that we measure when we do a trial from 1910 to 1970, but the effect of (vitamin Q interacting with diet between 1910 and 1970). Just as “diet between 1910 and 1970” is different from “diet now,” how many other myriad things are different between then and now? How many of them interact with our process? Prior to the widespread desk-ification of the workforce, jumping jacks might have been a stupid idea, but now that we all sit in front of computers discussing important topics like long-term clinical trials… we really should get up and do the jumping jacks… The longer the trial runs, the more chance there is for an interaction, whether it’s terrorist attacks or earthquakes, lead paint in construction, leaded gasoline, changes in work leading to changes in exercise, background availability of medicines and medical care, whatever. If you can prove that a thing has an effect that takes only 2 months to occur, you can run the trial a couple of different times in different places and show consistency. If you run a 60-year trial, how do you show that the results weren’t very specific to interactions with the zeitgeist of that particular time period? Run 5 different 60-year trials? Practically speaking that’s a disaster, especially because if the thing has a small enough effect that it takes 60 years to see it, it intuitively is probably not the right place to spend money.

                So either way, the longer the trial takes the more random effects can become imbalanced, thereby leading to either “false positives” or “enormous confidence intervals” either of which is kind of useless. In addition, there will be plenty of opportunities for consistent but time-varying interacting confounding factors you will have to deal with. This means you either have an inference system which makes it nearly certain that you can’t reject a null hypothesis (such as randomization tests whose confidence intervals grow rapidly), or you make assumptions about “all else zero” as is done in short term trials, and you’re bound to find that something happens, even if it has nothing to do with your treatment, or it only works during the time period of interest due to some background confounder (such as terrorist attacks or the existence of leaded gasoline, or the kinds of work people do).

                As far as I can see, the only way to get around this is to have a specific mechanistic hypothesis and to show that the mechanistic hypothesis predicts some particular kind of time-evolution of outcome, and to also show that that time evolution of outcome really does occur. In that kind of case you can then rule out other explanations by virtue of the fact that they simply are unlikely to follow exactly the same time-evolution (whereas looking at just a couple of endpoints… not so much). In other words, Bayesian inference with causal models of the full time-series can extract information from these kinds of things. So, theoretically we can get something out of long RCTs but practically speaking it’s a potential nightmare of wasted resources.

              • Anoneuoid says:

                >”As far as I can see, the only way to get around this is to have a specific mechanistic hypothesis and to show that the mechanistic hypothesis predicts some particular kind of time-evolution of outcome, and to also show that that time evolution of outcome really does occur. In that kind of case you can then rule out other explanations by virtue of the fact that they simply are unlikely to follow exactly the same time-evolution (whereas looking at just a couple of endpoints… not so much).”

                Yes, exactly. I would also add to the things that go wrong. You can have your treatment be responsible for the desired effect but work via some mechanism totally different from what is proposed. For example:

                chemo -> nausea -> caloric restriction -> longer survival
                vs.
                chemo -> kills cancer cells -> longer survival

                In the first scenario the chemo is a totally spurious, gigantic waste of money; not so in the second. The standard methods used to analyze data are blind to issues like this. I think most statisticians just really don’t understand the nature of the troubles (what Fisher referred to as “complex entanglements of error”*) facing the scientist. Randomization is great, blinding is great, but these techniques only address relatively minor issues. That is the beginning, not the end.

                *”The Nature of Probability”. Centennial Review. 2: 261–274. 1958. pg 274.

              • Alex Gamma says:

                Daniel,
                your argument is very persuasive. But I’m thinking this: You seem to focus only on the possibility of systematic divergence over time. Consider, however, the population of chance group differences created by random assignment to treatment, and then consider the distribution of mortality effects of these group differences at follow-up. Your argument seems to assume that this distribution would be asymmetrical around zero, and indeed grow more asymmetrical with time. My intuition is that a greater chance to diverge (i.e. create a mortality difference) over time for a single group difference is paralleled by a greater chance within the population of group differences for their mortality effects to cancel out.

              • Alex,

                These issues ultimately are empirical questions requiring evaluation with observations, and not only that but they will certainly vary from time to time, treatment to treatment, and population to population. So if you do a 40 year long term study of psychiatric drug X on a population initially of 20 year old men who show signs of PTSD after participating in combat in 1990 vs 20 year old men who show signs of PTSD after participating in combat in 2010 vs anti-MS drug for women age 30+ who live in the pacific northwest starting in 1990 vs anti-MS drug for women age 30+ who live in the pacific northwest starting in 2010….

                You’re talking about vastly different processes, different needs, different drugs, different time periods, different exposures, different genders, etc… If there are systematic differences in the way combat veterans are treated from 1990 to 2010 vs 2010-2050 or systematic differences in the social environment surrounding women in the workplace, or disability treatments or background medical knowledge about MS in 1990-2010 vs 2010-2050 or whatever, you’ll see different issues in each trial.

In essence, when the time-period is short, like 2 months, you can treat all the background stuff going on as constant, whereas when the time period is 40 years you need to realize that the time period is not constant, and so time-varying external forces could be steering your results (or not!). But only by repeating the experiment several times can you see how important the time-varying stuff is. To get a decent sample size of how much the time-varying stuff matters, we should probably repeat the trial a moderate number of times, say over 5-10 non-overlapping time periods. For a 50 year study x 5 time-periods it’ll be 250 years before we know the result… discount the value of this information at a rate of 2%/yr and exp(-0.02*250) = 0.0067, so how are you going to justify all that expenditure? Can you even imagine really running a trial of say feeding thousands of people little pills containing the bark of the foobar plant repeatedly in batches lasting 50 years ever since the year 1760 AD?

                How has the longevity of humans changed even without the foobar bark when looking now compared to 1760? How much variability would there have been in the control group alone? How much consistent time trend? How much would the civil rights movement have affected our results? How about the Civil War, the Spanish American War, WWI, WWII, Korea, Vietnam, Desert Storm, Post 9/11 Endless Conflict, The McCarthy Era? How about the discovery of the microorganism theory of disease, Pasteurization, Water Sanitation Systems, Penicillin, Refrigeration, The Automobile, The Airplane, The Telephone, The 911 Emergency System, The Internet?

If you ran the bark pill experiment 5 times you’d have seen massive differences in the outcomes in each of the 50 year periods. How would you detect a 3-5 year out of 80 years life extension given a noise level of +- maybe 15 years of life expectancy in that time period? How would you even account for the kind of drop-out you’d have inevitably seen by males ages 18-25 during each of the wars? Etc etc etc. It’s maybe a statistician’s wet dream to have so much data and have so many interesting modeling issues to consider, but it’s a budgeting disaster when the end result is “your best bet is to be born around 1990 and never start smoking, you’ll probably live a lot longer than someone born before 1900”

              • Carlos Ungil says:

                > How would you detect a 3-5 year out of 80 years life extension given a noise level of +- maybe 15 years of life expectancy in that time period?

You look at the life span of the people in the treatment group and at the life span of the people in the control group. Maybe there is a difference that cannot be explained by the differences between the groups, and then you may be inclined to think the treatment has an effect. The noise should be the same in both groups, unless we are, as you said, “unlucky” (and by definition being “unlucky” is a rare event and not something that happens with high probability). If the signal is stronger than the noise, we can detect it. And of course if the signal gets lost in the noise, we cannot detect it.
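For a sense of scale, the standard two-sample power approximation puts numbers on the signal-vs-noise question (a sketch assuming independent, normally distributed lifespans, which is itself in dispute in this thread):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Two-sample z-approximation: subjects per group needed to detect
    a mean difference delta against individual-level sd sigma."""
    z = NormalDist().inv_cdf
    za = z(1 - alpha / 2)   # ~1.96
    zb = z(power)           # ~0.84
    return ceil(2 * (za + zb) ** 2 * (sigma / delta) ** 2)

# Detecting a 3-year mean difference against a 15-year individual sd:
n_small = n_per_group(3, 15)   # ~393 per group, followed to death
# Detecting a 15-year difference (the "all the birds die" regime):
n_big = n_per_group(15, 15)    # ~16 per group
```

So the small effect is detectable in principle, but it takes a few hundred subjects per arm followed for their whole lives, versus a couple dozen for a large effect.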

                > How would you even account for the kind of drop-out you’d have inevitably seen by males ages 18-25 during each of the wars?

                Why would one of the groups be more likely to drop out due to the war? Unless we are “unlucky”, see above.

                On the issue of moving goal posts, your initial messages seemed to be about randomization (“even with random assignment *something* was bound to be different between the two groups and if that thing amplifies”, “Randomization *approximately* balances every unknown. But it doesn’t *perfectly* balance *anything*. If that imbalance in the unknowns results in anything that has long term growth involved, the extra time will accentuate that imbalance.”) but later you give all kinds of examples of things going wrong. I would say most of these concerns are not particular to RCTs and affect any kind of clinical trial, scientific experiment, social study, and in general any human endeavour intended to understand or manipulate the real world. (And I don’t think you can solve the problem creating a Bayesian model that takes into account all the potential wars that might or might not be fought in the future and all the potential laws that might or might not deregulate the telecom industry, etc.)

My initial post may have “seemed” to be about randomization, but it absolutely wasn’t supposed to be just about that one difficulty. It was about the inherent difficulty of measuring small effects of long-term anything given a chaotic dynamical system being actively perturbed, together with the prohibitive cumulative cost, and the discounted value of anything that only pays off far in the future, and the need to validate the results by repeated trials, and so forth. In other words, it’s a huge problem all around. As I read recently in the blog of one of the commenters here (I unfortunately forget who), the motion of a gram of mass 1 light year away from us perturbs the behavior of an ideal gas sufficiently to make it impossible to calculate accurate trajectories for more than a fraction of a second or something like that. Chaotic dynamical systems are not going to be easy to study using RCTs without any mechanistic understanding.

                And you’re ABSOLUTELY right that you can’t get anywhere with a model of all the wars and social upheavals and inventions and whatever. That wasn’t what I was suggesting either. (I seem to be striking out in the communications issues here).

What you CAN get somewhere with though is a proper physical scientific model of the biological process that you believe is going on. Suppose taking Vitamin Q affects some pathway in a certain way; your model then predicts upregulation of hormone X and downregulation of hormone Y, and over the long run this leads to reduced depositing of certain arterial plaques, which then lowers inflammation in the arteries, which alters immune system responses, which reduces incidences of asthma and heart disease, leading to fewer strokes and heart attacks, etc.

You have this model that makes very detailed predictions about what should happen, and how it interacts with diet and existing state of arterial plaque and inflammation, and exposure to air pollution and whatnot. You run a trial with vitamin Q and measure as many of these things as you can through time, you show that the result is consistent with your physical model, and then you follow people through their life and show that there’s reduced rates of survivable heart attacks, strokes, asthma, whatever… Your detailed model of the physics / chemistry / biology trumps most of these problems. Sure, you still need to track people through long periods to show that there aren’t unforeseen side effects, and so forth, but you get way farther with this kind of thing than with an RCT for 60 years looking at endpoint = age at death and fraction that survive 60 years.

                Of course, you have to have a detailed mechanistic model then, and it has to be correct, and that’s no easy task in Biology!

Also, Carlos, being “unlucky” is a rare event in low dimensions, but in higher dimensions (i.e. when there are 100,000 different things that could affect your trial by as much as you expect Vitamin Q to) being “unlucky” is absolutely the norm; you are never anything other than unlucky.
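A back-of-the-envelope calculation makes the point (the 2.5-standard-error cutoff and covariate count are arbitrary, chosen just for illustration):

```python
from math import erf, sqrt

def p_any_imbalance(k, z=2.5):
    """P(at least one of k independent background factors is imbalanced
    beyond z standard errors between two randomized groups)."""
    p_one = 1 - erf(z / sqrt(2))   # two-sided normal tail beyond z
    return 1 - (1 - p_one) ** k

# One pre-registered covariate: a big imbalance is a ~1-in-80 event.
p_single = p_any_imbalance(1)        # ~0.012
# 100,000 background factors: some large imbalance is essentially certain.
p_many = p_any_imbalance(100_000)    # ~1.0
```

With one covariate, randomization failing to balance it is rare; with a hundred thousand things that could matter, some of them are guaranteed to be badly imbalanced in any single trial.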

              • Carlos Ungil says:

                > As I read recently in the blog of one of the commenters here (I unfortunately forget who) the motion of a gram of mass 1 light year away from us perturbs the behavior of an ideal gas sufficiently to make it impossible to calculate accurate trajectories for more than a fraction of a second or something like that.

                If this is the basis for your concerns, you should consider a long-term clinical trial anything longer than one minute.

                > What you CAN get somewhere with though is a proper physical scientific model of the biological process that you believe is going on.

You CAN get somewhere as well without the proper physical model of the biological process. Clinical trials found penicillin useful even though it was not understood what it was exactly or how it worked. Maybe Fleming shouldn’t have bothered publishing his discovery given that he had no idea why it was killing bacteria; maybe he was just lucky (and the bacteria just unlucky). After all, there are so many reasons that could be killing them (cosmic rays, differences in temperature, soap residues, vibration from nearby trains, micro-organic wars) that we should not be surprised that only the bacteria in dishes treated with penicillin died.

                > Of course, you have to have a detailed mechanistic model then, and it has to be correct, and that’s no easy task in Biology!

In fact, that’s not an easy task in physics either. You don’t calculate the orbits of the asteroids or the weather forecast from the trajectories of each particle in the solar system and beyond. And don’t forget that classical mechanics is in the best case an approximation to the ultimate theory that we don’t even have.

              • Penicillin has a replicable effect on a time scale of 1 day. We can test it 15 times in a row in just 2 weeks. Try testing a lifetime of vitamin Q consumption 15 times in a row and get back to me on the consistency of the effect…

The issue is a dimensionless ratio known as the Deborah number of the experiment. If your experiment is on the trajectories of gas particles, then to keep the Deborah number large you need to do your experiment for microseconds. If your experiment is about plate tectonics it might suffice to keep the experiment down below 1000 years. If your experiment is about human lifetimes you should keep the duration below a year or so. If you don’t have a large Deborah number you can’t use an asymptotic approximation truncating away time-varying effects and you’ll need a more complex model. If penicillin took 3 years to kill a plate of bacteria you’d need to start ruling out the slow effect of cosmic rays, reflected UV light, contamination of the media supplied by lab suppliers, fluctuations in room temperature… You don’t in a 12 hr experiment because 100% of control plates are totally fine after 12 hrs.

              • Carlos Ungil says:

                If my experiment is about curing common colds it should not be longer than a couple of hours and if my experiment is about pregnancies it should not be longer than one week. Am I doing it right?

If I was administering something to a few dozen birds and 100% of them were dead in 3 years (most of them with kidney disease) while only 10% of those in the control group were dead (none of them with kidney disease), I probably won’t spend my time thinking about cosmic rays and fluctuations in room temperature and instead will decide to continue the research in the obvious direction.

              • Carlos, it seems like you’re mocking me, but I’m going to continue…

The Deborah number, broadly, is the ratio of the timescale over which dynamics in the outcomes naturally occur to the timescale of your experiment. So: 300 years ago maybe getting a “common cold” actually had a high chance of killing you, and by 40 years ago the common cold was a 7 day nuisance and today it’s still about a 7 day nuisance. This suggests the timescale for naturally occurring dynamics is maybe 100 years. So, if your common cold experiment lasts 1 to 3 years or less you can ignore the background dynamics.

If you’re studying pregnancies… about 50 years ago pregnancy was something 19-30 year old women tended to do, whereas by today we have fertility treatments and IVF and so forth, so that more women in their 40s and even up to around 50+ are choosing to have children. The background medical system in that timescale has also changed from more natural vaginal deliveries to more scheduled cesarean sections, not to mention a broad set of background changes in the drugs available and the medical knowledge. So, this suggests that if you were going to study pregnancy issues you should keep the trial duration shorter than say 5 or 6 years (Deborah number ~ 10).

                Alternatively in these cases, you can study situations with smaller Deborah numbers by explicitly taking the dynamics into account, but you’d best be doing time-series analysis and measuring lots of background changes… stuff that isn’t normally done in typical clinical trials.

                Another dimensionless ratio of interest (pun!) is the ratio of the time your experiment takes to the time it takes for the economic value of the data to decline by a factor exp(-1) (let’s call this the Friedman number). So if the prevailing interest rate is something like 2% then the timescale of interest is 50 years. If your experiment lasts much less than 50 years then the ratio is < 1 and you can pretend that the value of your experiment is not something changing in time. If on the other hand your experiment lasts 50 or 100 or 300 years (for multiple serial trials for example) then you're going to have to explicitly take into account the fact that its economic value declines exponentially.

                If you take all this stuff into account, you can get somewhere. But the "usual" stuff done in clinical trials is to ignore all this because asymptotically for short duration trials, the Deborah number is very large and the Friedman number is pretty small…
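Putting rough numbers on the two ratios (all timescales here are the illustrative ones from this thread, not measured constants):

```python
from math import exp

def deborah(background_timescale_yr, trial_yr):
    """Ratio of the background-dynamics timescale to the trial duration,
    as defined above; large => background can be treated as constant."""
    return background_timescale_yr / trial_yr

def friedman_discount(trial_yr, rate=0.02):
    """Fraction of the information's economic value left after
    exponential discounting over the trial's duration."""
    return exp(-rate * trial_yr)

d_short = deborah(100, 2)          # 2-year cold-drug trial: 50, background frozen
d_long = deborah(100, 50)          # 50-year lifetime trial: 2, background moves
v_short = friedman_discount(2)     # ~0.96, nearly full value retained
v_serial = friedman_discount(250)  # ~0.007, value mostly discounted away
```

A short trial sits in the easy asymptotic regime on both counts; a serial multi-generation trial fails on both.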

              • Also note, the Deborah number is relative to the background dynamics, which is itself time-varying. So, if you look at the 1600’s lifetimes were maybe 60 years, and child mortality was high, whereas in the 2000’s lifetimes are maybe 85 years and child mortality is low. If we look at 2200 perhaps lifetimes are still around 85 years and child mortality is still low. So, the Deborah number of a 50 year trial in 1650 is maybe low, but the Deborah number of a 50 year trial starting in 2010 could easily be much higher…

                One of the reasons to do experiments on animals grown in laboratory colonies is that all the background dynamics are pretty static. If you’re feeding wild birds at location A your concoction for 3 years and at location B regular birdseed, and there’s a big die off at location A, you’re going to need to look for things like toxic oil spills a few miles away from A and neighbors who have rat problems and suddenly put out rat poison. In a lab, not so much. The Deborah number of a lab experiment is inherently larger than one in a dynamic environment even for the same duration experiment because the background dynamics are held constant on purpose.

A final dimensionless ratio which is usually the only one taken into account in clinical trials is the ratio of the size of the effect on the outcome to some measure of the statistical range of the noise in the measurement, typically the standard deviation. So, in your case with 100% of the birds dying in 3 years vs 10% of controls, this ratio is large. In other words, you’re studying a large effect. When you’re studying a large effect, typically you can do it on a shorter time-scale. But let’s say that with your bird example they’re all fine for 3 years and then they drop dead en masse in the last month. Wow, that seems like a big effect. My prior is we’d really better check on the rat poison issues etc, but let’s say it’s in a lab, and we’re confident that kind of stuff wasn’t going on. I still am going to ask you to repeat the 3 year experiment to show that this wasn’t a fluke, and when you do, cool, you found a large effect!

Now suppose you give Vitamin Q to people daily starting age 10 and follow them for 70 years. And suppose that those you gave vitamin Q to really did live 15 years longer on average! Wow. That’s pretty big. Much bigger than our prior estimate of effect sizes for nutraceuticals (ie. that they have somewhere between -3 and +3 years life extension with 0 being the mode!). Congrats… Except, I still probably want you to do the experiment again to show that it wasn’t a fluke related to the particulars of that time period, especially since we probably didn’t keep all the people in controlled lab conditions.

So, ok, we do the experiment again. Now the experiment takes 160 years, and sure enough the next time it’s 12 years of life extension, so we have some evidence of consistency. We’ve spent *at least* say $100,000 a year for 160 years (in constant dollars), which is about $16 Million, though it’d be easy to imagine at least 10x that much, but we did in fact show that we can extend people’s lives by on the order of a decade. That’s pretty good! However, our prior for nutraceuticals was maybe 1-3 years life extension at best, with plenty of the ones we might be interested in trying potentially actually having some adverse effects.

Are you willing to commit 160 years of recordkeeping and nursing staff and doctors’ careers etc. and $16-160 Million dollars so that your great-grandchildren have a moderate (prior) probability of maybe getting a few extra life years, a small probability that maybe it’s a decade, and a moderate probability that you’re actually slowly poisoning them?

Wouldn’t it be maybe better to commit $16-160 Million dollars to reducing air pollution, or developing cars that autonomously take over driving when evasive maneuvers are required, with development lasting a decade and several trials of their effectiveness each lasting a year or two?

Though real clinical trials cost a lot more than $100k/yr: that’s pretty much just the salary of say 1 medical professional.

              • Carlos Ungil says:

                > Carlos, it seems like you’re mocking me, but I’m going to continue…

                I was only joking in part. The numbers you’re using are completely arbitrary and you could make similar arguments with completely different conclusions. There may be a reason why Deborah numbers are only used in a very particular context.

                > If penicillin took 3 years to kill a plate of bacteria you’d need to start ruling out the slow effect of cosmic rays, reflected UV light, contamination of the media supplied by lab suppliers, fluctuations in room temperature… You don’t in a 12 hr experiment because 100% of control plates are totally fine after 12 hrs

Why wouldn’t you need to rule out as well the fast effect of cosmic rays (and all the other issues that you mention) when the experiment is executed in one day? If temperature and light exposure are important, these are not factors changing only slowly over the years. They have daily cycles unless the lab conditions are well controlled. What’s the Deborah number then? You ignore those factors in the short-duration trial because you assume they affected both groups equally, and you can in principle make the same assumptions in long-duration trials. Your arguments about small differences being amplified exponentially are unconvincing. There is also variation between individuals in each group, and if those differences don’t create an exponential divergence within groups, why should it happen between groups?

                And the problem of changes in the environment affects short-duration trials exactly in the same way. If a five-year trial is useless when completed, because the environment has changed since the trial started, all the one-year trials executed more than four years ago are equally useless by now.

                > When you’re studying a large effect, typically you can do it on a shorter time-scale.

There will be a point where the trial gets long enough for the effect to be detected, and it might be beyond what you call short. Many of these long-term trials are event based. You need enough deaths to be able to detect the difference in the mortality rate between the treatment and control groups. You may plan to run the trial until you get so many deaths in total. And you can also do interim analyses at some predefined points to stop the trial early if possible (the measured effect size may overestimate the actual effect size more than if the trial had run to completion, but that’s a separate issue). For example https://en.wikipedia.org/wiki/Multicenter_Automatic_Defibrillator_Implantation_Trial

Note as well that adding more patients doesn’t always allow you to run shorter trials, because you might be actually interested in the long term outcomes (like the benefit or harm caused by cancer screening or medical treatments in general). Implanting defibrillators in one million patients and looking at their evolution for one month doesn’t really give you the same information as following one thousand patients for three years (even if you had a detailed physical model of how the device keeps patients alive and kept records of what they had for lunch).

              • Anonymous says:

                Very unimpressed with the low level of discussion in this thread. The usefulness of randomization in RCTs is fundamental/valid under very basic assumptions, and being presented with aimless 500 word responses every time someone points this out is incredibly frustrating.

                Daniel, the field of clinical trials is based on solid foundations. Please read any introductory textbook on the topic to learn more.

              • Anonymous: Clearly I’m going to need to stop replying here because no one is listening…. I never questioned whether Randomization is useful, only whether *long term clinical trials of things having small effects that take a lifetime to accumulate* is useful.

                The fundamental issues are not whether randomization works. In language perhaps statisticians can understand they are:

                1) Dramatically reduced power to conclude anything about a useful externally valid causal effect due to inevitable accumulation of noise through time, and interaction of effects with time varying backgrounds.
                2) Linearly increased cost due to duration.
                3) Increased complexity of analysis due to complex changing background.
                4) Exponentially reduced economic usefulness of the results due to discounting

                Randomization doesn’t solve any of those.

              • Rahul says:

                @Daniel

                Would it be fair to conclude we should just not waste time trying to study this sort of thing (“things having small effects”).

                Agnostic to method. Whether it be RCTs or otherwise.

                Or are you saying we ought to still study them but just that RCTs are the wrong tool?

              • Carlos:

                “Why wouldn’t you need to rule out as well the fast effect of cosmic rays (and all the other issues that you mention) when the experiment is executed in one day? If temperature and light exposure are important, these are not factors changing only slowly over the years. They have daily cycles unless the lab conditions are well controlled. What’s the Deborah number then? You ignore those factors in the short-duration trial because you assume they affected both groups equally and you can in principle make the same assumptions in long-duration trials. “

                We watch a control plate of bacteria for 24 hours and it grows exponentially just fine on a nice fresh agar surface. Great, evidently our lab has no background effects of a size similar to the exponential growth of bacteria in their initial growth phases. What would be such things? Presence of an antibiotic in the media, a bright UV sanitization light, or temperatures above 160F come to mind. The differential equation that models growth effectively when N is small is dN/dt = N, where t is a dimensionless ratio of time in seconds to some scale time, and since we’re free to choose a timescale for t, we can always make this the differential equation. For typical bacteria the timescale for t might be something like 40 minutes.

                Now, if we watch it for a couple of days, the bacteria outgrows its media. We’ve ignored this effect in our differential equation, so on longer timescales we’ll need to do:

                dN/dt = N *(Nc(Media) – N)

                And then, we can rescale this by dividing by Nc(Media) to define a new NN:

                dNN/dt = NN*(1-NN)

                Once we’ve hit near the carrying capacity of the media, growth will be changing in a dramatically slower manner. Now all the stuff we’ve ignored because it was actually

                dNN/dt = NN*(1 – NN) + eps1(t) + eps2(t) + eps3(t)

where eps1(t) is an o(1) function of time (size much less than 1), becomes relevant. Since NN is nearly 1, NN*(1-NN) is itself o(1) (growth and carrying capacity have canceled each other out almost completely), and we can change the timescale for t and make all the effects O(1) again (size of order 1).

                dNN/dt* = NN*(1-NN) + eps1′(t) + eps2′(t) + eps3′(t)

Now, because t* = 1000t and we’ve also rescaled the epsilons, all of these have equally important O(1) effects.
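The rescaling argument can be seen just by comparing the sizes of the two terms along the logistic trajectory (the eps value is illustrative):

```python
def growth_term(nn):
    """The deterministic logistic term NN*(1-NN)."""
    return nn * (1 - nn)

eps = 0.005  # a small, o(1) background perturbation (made-up size)

# Mid-growth, the deterministic term swamps the perturbation:
ratio_mid = growth_term(0.5) / eps     # 50x larger than eps
# Near carrying capacity, the roles reverse: the growth term is now
# o(1) too, and the perturbations set the dynamics.
ratio_late = growth_term(0.999) / eps  # ~0.2x
```

Exactly the same eps is negligible in one regime and dominant in the other, which is why the short experiment and the long experiment need entirely different models.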

So, on different time scales, different explanations are required taking into account different effects entirely. This is known as “asymptotic analysis” of dynamical systems. Asymptotically for small time scales all that matters is the massive exponential growth… but once you hit a day or so, you can’t ignore the carrying capacity, and once you hit 4 or 5 days… a whole bunch of other perturbation effects are important, and once you hit a couple of years (for a bacterium that’s a huge amount of time), you’ll have to start wondering if evolution of the bacterial genome to adapt to the environment of a laboratory dish is important. If your experiment requires 3 years to complete, and in that time both the control and the experimental conditions adapt their genome to the nice stable and weird laboratory environment, then the causal effect you discover when you try to apply it to say water sanitation might have totally evaporated. “External validity” of the experiment is dependent on your experiment not essentially interacting with eps_k(t), a small external function of time that represents some particular stuff that happened to be going on in your lab during your particular trial period.

For example, giving vitamin Q counteracts the perturbation effect of leaded gasoline in the air during the period 1920-1970 but has no effect on anyone after 1980 or so due to the absence of leaded gasoline in the air. Yay, we found Vitamin Q works in a trial from 1920-1970!

                well… no it *worked* but it doesn’t *continue* to work. The value of our information for future generations is very very little because they’re not spewing lead into the air, and the value of our information for the generation that paid for the experiment… also very very little because it’s too late to go back in time and give everyone Vit Q starting in 1920… Yay we discovered that “if we had given Vit Q in the past, many people would have been less damaged by the lead we used to spew into the air but don’t spew anymore”. Congratulations, that’ll be $160 Million Dollars please.

                As the size of your effect gets small and the timescale for seeing your effect gets large the *number of perturbation sized effects your experiment could interact with goes to infinity*, and this means you’re in a high dimensional space, where every random sample is “weird” (and your whole experiment is ONE random sample). In order to counteract this, you should show that your effect is independent of any transient perturbation effects in the background environment, which means *repeating your experiment several times* which means instead of taking 60 years it takes 400 years… and this is where the second dimensionless ratio comes in, exponential discounting automatically makes your experiment economically useless even if you find a totally valid time-stable effect.

                NONE of this has to do with whether or not you should use randomization. YES you should use randomization… but you should use it to study effects large enough that you can observe them in short timescales, repeat them to be sure the effect is stable in time, and still not have gone so far out in time that exponential discounting makes your result economically useless.

@Rahul. That’s a good question. I think we *should* study things having “small” effects, but RCTs may not be the right way. For example, leaded gasoline had a “small” effect. That is, for example, breathing the air didn’t kill you in a couple of weeks or even a couple of years. There’s some people who think that breathing leaded air may have made people more prone to violence and so forth. But whatever the real effect of breathing lead, it was bad over a timescale of 100 years… I’m glad someone studied this and that we banned leaded gasoline!

                So, no, we can’t just say “don’t study small effects that accumulate in time” but how should we study them? It helps a LOT to come up with a mechanistic model, even if it’s imperfect, and then show that the mechanistic model makes predictions that come true in some variety of other conditions.

We don’t build full scale dams and then show that when we induce full-scale earthquakes we get full scale disasters. We build 1 foot long scale model dams, and then scale up the forces in a centrifuge to show that under validly similar conditions to the real thing, we’ll get failure. In the same way, we study mice whose lifetime is a year or so, but we do it under scaled conditions. Scaling this sort of thing isn’t easy. But if you show that under a variety of scaled up doses N(D) milligrams of lead does accumulate into the mouse’s brain, and that in that timescale it affects the mice’s ability to learn tasks… and you then extrapolate N(D) back to a dose similar to what a child living next to the 110 freeway might get, and so forth… you can discover that there’s something consistent going on even if it’s small and even if your extrapolation back to the much longer-term effects on a much larger human child isn’t terribly accurate. Bayesian models are really needed here since extrapolation error really *isn’t* the same as sampling error.

                This is in fact exactly what toxicologists do. And it’s not perfect, and OF COURSE they use randomization. Randomization is great, it’s just not a panacea that makes other problems go away.
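                A toy version of that back-extrapolation, with invented numbers rather than real toxicology data: fit a power-law dose-accumulation relationship at the high doses where a one-year mouse experiment is feasible, then extrapolate down to a low, realistic dose. The dose and accumulation values below are purely illustrative.

```python
import math

# Hypothetical high-dose mouse data: daily lead dose (mg/day) vs.
# brain accumulation (mg). Invented numbers, roughly linear on log-log axes.
doses = [1.0, 3.0, 10.0, 30.0]
accum = [0.08, 0.26, 0.9, 2.4]

# Fit accumulation = c * dose**k by least squares on the logs.
x = [math.log(d) for d in doses]
y = [math.log(a) for a in accum]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
k = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
c = math.exp(my - k * mx)

# Extrapolate back to a dose far below anything tested, e.g. a
# (hypothetical) 0.01 mg/day for a child living near a freeway.
low_dose = 0.01
predicted = c * low_dose ** k
print(f"fitted exponent k = {k:.2f}, predicted accumulation at {low_dose} mg/day: {predicted:.5f} mg")
```

                The point is not that the point estimate is right; it is that the fitted exponent tells you whether the mechanism looks consistent across doses, and a Bayesian version would carry the extrapolation uncertainty along rather than pretending it is sampling error.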

              • Rahul says:

                @Daniel

                I disagree. In the sort of cases you describe (small effect, multiple confounders, complex interactions, high cost of sampling, etc.) I agree that RCTs won’t work well, but neither will other “mechanistic” methods.

                If someone did show me a “scaled” effect in lab mice by feeding them huge doses or something, I’m going to be very skeptical that we have any clue how to extrapolate this to humans.

                i.e. I can stand by “it’s just impossible to study these things” rather than “RCTs are a bad tool to do it.”

                A corresponding set of critiques applies to mechanistic studies, just like the ones you pointed out for RCTs.

              • @Rahul.

                Suppose we have a decent mechanistic model. It’s based on some good science, such as the biochemistry of how Vit Q interacts with some particular enzymes that help change the chemistry of lead, using some kind of quantum chemistry model and some lab experiments in test tubes. Then we perform a scaled-up experiment on mice, which is repeatable; it takes about a year. Then we scale the experiment up to, say, pigs for a year, and it’s consistent as well. Now we have extrapolation error to humans, so we get a Bayesian posterior over the effect in humans after only 2 years. It’s not perfect, it’s got significant uncertainty in it.

                What do we do? We use Bayesian decision theory to balance the uncertainty and the potential good vs bad effects as well as the economic costs of taking Vit Q and our priors over what long-term adverse effects there might be. Then we make a decision. We get the information in 2 years instead of 300 and that’s HUGELY more valuable due to exponential discounting.

                So, sure, it doesn’t give definitive answers. “There’s no guarantee,” as the title of this post says, but I think it’s the best we can do. We can’t just not study small effects, or we’d all still be breathing lead, have old-school automobile emissions, and be dying 5-11 years younger than we do now, as well as suffering more chronic illnesses, instead of making the many different changes in our environmental pollution outputs that we’ve made over the last 50 years.
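                To make the decision step concrete, here is a minimal sketch, with all numbers invented (the “Vit Q” effect, the discount rate, and the trial durations are hypothetical, not from any real study): draw posterior samples over the extrapolated effect in humans, compute discounted expected benefits, and compare “act on the noisy 2-year extrapolation” against “wait 40 years for a definitive trial.”

```python
import random

random.seed(1)

# Hypothetical posterior over added life-years per year of use of "Vit Q",
# after extrapolating from mouse and pig experiments: wide uncertainty,
# and some probability the effect is actually negative.
posterior = [random.gauss(0.5, 0.5) for _ in range(10_000)]

def discounted_value(benefit_per_year, delay_years, rate=0.03, horizon=60):
    """Present value of a yearly benefit stream that starts after delay_years."""
    return sum(benefit_per_year / (1 + rate) ** t
               for t in range(delay_years, horizon))

# Option 1: act on the 2-year extrapolation (we bear the risk of a bad effect).
act_now = sum(discounted_value(b, 2) for b in posterior) / len(posterior)

# Option 2: wait 40 years for a definitive trial, then act only if it works.
wait = sum(discounted_value(max(b, 0.0), 40) for b in posterior) / len(posterior)

print(f"expected discounted benefit, act now:   {act_now:.2f}")
print(f"expected discounted benefit, wait 40y: {wait:.2f}")
```

                With these made-up numbers the earlier, noisier decision dominates, which is the exponential-discounting point: even a posterior with substantial uncertainty can be worth acting on if the alternative is to forfeit decades of benefit.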

              • Martha (Smith) says:

                @Daniel:
                “Suppose we have a decent mechanistic model.”

                This raises the questions of what constitutes “decent” and how often we actually have a decent mechanistic model to work with.

              • Rahul says:

                Martha: +1

                If I’m going to worry about 1% exponential differences, I might as well worry (tons more) about extrapolating from quantum chemistry models.

              • @Martha and @Rahul: of course, I’m not going to disagree with the need for assessing the goodness of mechanistic models. The more specific the predictions of the models and the shorter the timescale for their predictions, the more opportunities to detect problems.

                The whole field of toxicology is built on the idea that we need to extrapolate from scaled experiments. According to Wikipedia, “Mathieu Orfila is considered the modern father of toxicology, having given the subject its first formal treatment in 1813 in his Traité des poisons, also called Toxicologie générale.”

                If we did RCTs in humans, we’d still be waiting for the definitive results of the third replication of the first ever toxicology study circa 1810…

              • I’ll see if I can in a reasonable amount of time, give some simulation examples of things I’m concerned about here and post it on my blog. Maybe simulations will help explain.

          • Carlos Ungil says:

            > Over a period, in this case, of say 60+ years. If you had started such a trial in 1900 with 10mg starch pills you’d _probably_ have seen a _massive effect_ by 1960 even with random assignment *something* was bound to be different between the two groups and if that thing amplifies exponentially at a rate of say 1%/yr 60 years later 1.8 times as many people in one group will have X vs another ;-)

            What does _probably_ mean? What is a _massive effect_? Does the variation within each group of this X of yours also diverge?
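            For what it’s worth, the purely arithmetic part of the quoted claim checks out (the harder question of what, if anything, would actually compound at 1%/yr is the one being debated):

```python
# The quoted claim: a difference compounding at 1% per year for 60 years
# gives roughly a 1.8x ratio between the two groups.
ratio = 1.01 ** 60
print(round(ratio, 2))  # ~1.82
```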

            > You can only really do what is normally done in clinical trials because the timeframe of the trial is so short, a couple of years isn’t enough time for “all else equal” to be severely broken in terms of the background time-trends.

            Some clinical trials run for quite longer than two years, and sometimes followup of the patients who participated in the trial continues more than 20 years later.

        • Jonathan (another one) says:

          But even ignoring the applicability to a wider population, think of the difference between (a) a drug that lengthens the life of 2 percent of those who take it, with no adverse effects for anyone else; (b) a drug which lengthens the life of 90 percent of the people who take it while shortening the lives of 5 percent; etc. etc. Merely having a positive primary effect, even if extrapolable, is nowhere close to any sort of guarantee, or even an indication that the drug is a good idea. The key to nutraceuticals (at least in theory) is that through benign ingredients, at worst, you lose money, sort of like case (a).

  3. zbicyclist says:

    Not my area, but I’ve heard that there are many chemicals that will, say, cure cancer in mice but fail in humans.

    Anti-aging seems much more complicated than killing cancer cells, so what are the odds it will (a) work in humans, and (b) not have long term side effects? p (a and b) seems very low to me.

    • Cliff AB says:

      >> Not my area, but I’ve heard that there are many chemical that will, say, cure cancer in mice that fail in humans.

      True, but I would have a little more faith in mice being a good model for aging than for cancer. The reason is that cancer is specifically a mutation, not a fundamental process. Certain mouse lines are bred to have incredibly high rates of cancer (or whatever disease you are studying). If the cause of your disease of interest in this line of mice does not match up with the common cause in the standard human population, then the mouse models do very poorly, for obvious reasons.

      Not to say that the aging mechanism is definitively exactly the same within mice and humans. But I would say the odds are better than for mice who are selectively bred for a specific mutation that makes them different from even healthy mice, not to mention healthy humans.

      >> Anti-aging seems much more complicated than killing cancer cells, so what are the odds it will (a) work in humans, and (b) not have long term side effects? p (a and b) seems very low to me.

      I agree that P(a and b) seems very low. I could believe that P(a) is not that low, and P(b) is also not that low, but my prior is that P(a and b) << P(a) * P(b).
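      A toy joint distribution (invented numbers) shows how that kind of negative dependence pushes P(a and b) well below the independence product P(a)*P(b):

```python
# Hypothetical joint distribution over (a = works in humans, b = no long-term
# side effects). Marginals P(a) = 0.30 and P(b) = 0.50, but negatively
# dependent: compounds strong enough to slow aging tend to have side effects.
p = {
    (True, True):   0.05,  # works and is safe
    (True, False):  0.25,  # works but has side effects
    (False, True):  0.45,
    (False, False): 0.25,
}
p_a = sum(v for (a, _), v in p.items() if a)   # P(a) = 0.30
p_b = sum(v for (_, b), v in p.items() if b)   # P(b) = 0.50
p_ab = p[(True, True)]                         # P(a and b) = 0.05
print(f"P(a and b) = {p_ab:.2f}, P(a)*P(b) = {p_a * p_b:.2f}")
```

      Here the joint probability is a third of what independence would give, which is exactly the shape of belief expressed above.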

      • Anoneuoid says:

        >”cancer is specifically a mutation, not a fundamental process.”

        I would be careful with that assumption, it could cause a lot of suffering and wasted time/resources if it is wrong.

      • mark k says:

        I would not bet a lot of money on the aging mechanisms in mice that live a few years being dominated by the same processes that impact humans. A lot of the pathways overlap, I’m sure, but the importance and impact on the whole organism is likely to be radically different. As in, irrelevant or in the opposite direction in some cases.

        The argument that cancer is “a mutation” would actually make it easier to model. In fact, we can graft human tumors on mice, which obviously have the same origin as human tumors, and it doesn’t help that much. There’s a lot going on.

    • Cliff AB says:

      Ooop, after reading the following sentence, my prior on P(a) just went down significantly:

      “The compound is believed [to] cause some effects similar to a diet that is severely short on calories—a proven way to make a mouse live longer.”

      Unless there have been some major changes in the research in the last 5 years, the restricted-calorie longevity theory is one of the biggest abuses of statistics to date, in my opinion. Early tests showed that monkeys (I think that was the animal?) who sat in a cage and ate as much as they wanted at any time did not live as long as monkeys whose food was restricted. I’ve read statements from many MDs extrapolating this out to the point where they believe there is a simple linear equation: the fewer calories you eat, the less you age. Period. That is even in spite of later studies that found no improvement between a restricted-calorie diet and a further restricted diet. I’ve even read stories of MDs who take this so much to heart that they eat not much more than a bowl of cereal a day. It’s actually pretty scary.

      • There is, however, a pretty well established mechanism whereby periodic reduction in calories induces a natural process in which the immune system kills off circulating cells, which are then renewed from stem cells once calories return. This mechanism is thought to reduce incidence of diseases related to unwanted/unneeded inflammation. It’s been shown to be helpful in conjunction with chemotherapy (where suppressing the immune system during chemo and then having it grow back rapidly once the drugs are gone keeps patients from getting sick) and diseases like irritable bowel. It may also play a role in things like heart disease and Alzheimer’s. So, linear extrapolation is a poor idea, but nuanced understanding of the causality results in a useful real scientific model.

        see for example https://news.usc.edu/82959/diet-that-mimics-fasting-appears-to-slow-aging/

        disclaimer, my wife knows people who were involved in some of these studies.

        • Cliff AB says:

          Interesting, so there may be more to it than I had read about. I do remember that there were a lot of debates about the initial findings, but perhaps they’ve put together a more nuanced model since then.

          Next time my back hurts, maybe I’ll follow up on the latest research.

        • Martha (Smith) says:

          I wonder if nuanced understanding would involve considering if there is an interaction with the tendency (if I am not mistaken) for swings in caloric intake to be connected to development of Type 2 Diabetes.

          • Well, the link above does suggest a possibility of brief periods of low calorie intake to reduce the risk of Type 2 Diabetes:

            From the article: “In a pilot human trial, three cycles of a similar diet given to 19 subjects once a month for five days decreased risk factors and biomarkers for aging, diabetes, cardiovascular disease and cancer with no major adverse side effects, according to Longo.”

            Note, this is very different from say weeks of dieting followed by binging and weight gain (Yo-yo dieting). This is relatively severe dietary restriction for a few days every month to several months. So for example 50% calorie reduction for 3 days in a row one time every month, or every 3 months etc.

            It’s the severity and the brevity that seem to be important. You can’t reduce calories that severely for very long. You need to remain relatively healthy so that everything grows back from healthy stem cells on the 4th day or so.

  4. Dzhaughn says:

    Isn’t a chemical that kills you the ultimate anti-aging agent?

  5. I hope to live a long life, but it’s PR articles like this (among other things) that remind me that mortality isn’t so bad after all.

    I see a sloppy utopianism here: “Hey, we’ve got no guarantees that you’ll live forever if you take this pill, but it’s worth a try, eh, since it’s out there? Then, if the FDA does come around and prove it (approving being equal to proving), you’ll already be in the Immortal Vanguard, which is doubleplusgood, given the intensity of ceaseless strife.”

  6. edward haynes says:

    Snake oil it is.

    The company is putting chemicals, unproven in any human trials, into slick containers shaped by marketing geniuses, and suckering the gullible out of $600/year.

    Elysium “finished” a clinical trial in August, using elderly human subjects from Canada. The goals were to examine laboratory values and functional improvement in elder populations. The trial was a mess, run part-time by Dr. Gene Wang (whose day job is with GSK), and was poorly executed. Nearly 2 months after the trial was completed, the presence of ZERO announcements and claims in the media can only mean that the trial showed NOTHING.

    Keep those $50 monthly payments coming, suckers….
