The health policy innovation center: how best to move from pilot studies to large-scale practice?

A colleague pointed me to this news article regarding evaluation of new health plans:

The Affordable Care Act would fund a new research outfit evocatively named the Innovation Center to discover how to most effectively deliver health care, with $10 billion to spend over a decade.

But now that the center has gotten started, many researchers and economists are disturbed that it is not using randomized clinical trials, the rigorous method that is widely considered the gold standard in medical and social science research. Such trials have long been required to prove the efficacy of medicines, and similarly designed studies have guided efforts to reform welfare-to-work, education and criminal justice programs.

But they have rarely been used to guide health care policy — and experts say the center is now squandering a crucial opportunity to develop the evidence needed to retool the nation’s troubled health care system in a period of rapid and fundamental change. . . .

But not all economists think that randomization is the gold standard. Here’s James Heckman, for example, criticizing “the myth that causality can only be determined by randomization, and that glorifies randomization as the ‘gold standard’ of causal inference.” I try to put some of this in perspective here; see p. 956 of that article.

Meanwhile Prabhjot Singh offers some thoughts on the health policy innovation center:

The Innovations center that is in the cross hairs of this article is the one new federal level healthcare initiative that I truly think is transformative. Over just the past 3 years, they leveraged 1 billion dollars to change the organizational behavior of a 2.7 trillion dollar industry in the specific area of service delivery and payment systems. I have sat with the executive / strategy groups of 3 of this city’s largest hospital systems (covering 80% of the city’s patients) and many across the country as they scramble to figure out how their organization should shift strategy to capture some of this money. It’s actually sorta fascinating, because 1 billion divided by all the possible hospital systems in the US, who each submit multiple grants, is a pretty small amount of money. But the innovation center process is smart – hospital systems have to demonstrate that they would do what they suggest anyway. Everybody is angling to catch a piece of this large sounding amount of cash and figure out what the innovation center really wants to see – and they game their applications accordingly and whisper intel across their institutions and the required set of regional partners they have to assemble. By the time initiatives are submitted, there is a remarkable amount of synchronization and consensus across internally fragmented organizations about the sort of payment and delivery innovations they each think has the best shot. I was surprised by how many leaders of interdependent healthcare sub-systems were meeting for the first time through this process. In my view, the sheer internal prep and intra-institutional cooperation required to submit a proposal for an innovation center challenge—repeated across the country—is worth $1b. The shared awareness of common challenges alone will promote the diffusion of successful demonstrations across institutions with synchronized priorities. This is notoriously difficult even with an iron-clad RCT in hand.

When the goal is changing system behavior, the design of the process should be a central concern. It creates a clear navigation path so the entire ecosystem can strive forward, not simply a few lucky demonstration projects that post improvements compared to controls. The demonstrations are like fruit on a tree. In a well designed process, the tree remains after the fruit is long gone. It’s easy to ignore the tree, much less its own interdependencies, when you’re entirely focused upon comparing control and treatment apples. They matter too, but an apple needs to satisfice a set of contextual requirements. Moreover, we already have a dedicated RCT/CEA funding stream for well designed healthcare delivery experiments called PCORI. The problem with forcing all innovation center demonstration projects to contain an RCT is that it massively constrains the space of participants & potential solutions, and would fundamentally compromise the systems change process I described above. But I’d hardly expect people who seem to believe that informal domain knowledge is a fundamental source of bias to be eliminated to appreciate that.

Here [Singh writes] is a much more thoughtful exploration of these issues:

Over the last twenty or so years, it has become standard to require policy makers to base their recommendations on evidence. That is now uncontroversial to the point of triviality—of course, policy should be based on the facts. But are the methods that policy makers rely on to gather and analyze evidence the right ones? In Evidence-Based Policy, Nancy Cartwright, an eminent scholar, and Jeremy Hardie, who has had a long and successful career in both business and the economy, explain that the dominant methods which are in use now—broadly speaking, methods that imitate standard practices in medicine like randomized control trials—do not work. They fail, Cartwright and Hardie contend, because they do not enhance our ability to predict if policies will be effective.

The prevailing methods fall short not just because social science, which operates within the domain of real-world politics and deals with people, differs so much from the natural science milieu of the lab. Rather, there are principled reasons why the advice for crafting and implementing policy now on offer will lead to bad results. Current guides in use tend to rank scientific methods according to the degree of trustworthiness of the evidence they produce. That is valuable in certain respects, but such approaches offer little advice about how to think about putting such evidence to use. Evidence-Based Policy focuses on showing policymakers how to effectively use evidence, explaining what types of information are most necessary for making reliable policy, and offers lessons on how to organize that information.


  1. BenK says:

    Is there a way to express the shortcomings of RCTs technically, more formally? For example, RCTs are sharply focused on clear alternatives that apply broadly across a uniform or otherwise indistinguishable patient population. Too many treatment arms or too much relevant and discernible variation in the individual cases and an RCT becomes unfeasible at best. I suppose that the implicit high dimensional nature of practical interactions with people is what Singh is describing; but then can one recommend strategies that quantify those features and include them in the study design?

    As an aside, there are reasons why certain people view informal domain knowledge with suspicion – and not all of it is because they actually believe the knowledge to be incorrect. Also, some believe the informal channels of knowledge transmission violate their moral standards with regard to various -isms (gender, race, age, etc). As a result, there is a secondary drive to discredit the informal knowledge which is perceived as propagating distasteful social structures. The narratives about the rise of science are not only about proving epicycles do not exist, but also about discrediting power structures. This ‘loads’ the science and causes a variety of maladies, including attempts to ‘capture’ the practice of science by those already holding the informal domain knowledge.

    • K? O'Rourke says:

      A very simple example of some of what is coming out here.

      A surgeon I was working with was funded for an RCT to test an intervention to decrease surgery no shows/cancellations.

      As part of setting up the RCT he did a survey asking those with currently scheduled surgeries if they would be willing to participate if the trial was already running. He was assessing likely recruitment for the trial. None of the patients phoned missed their surgery. He made the incredibly risky inference that phoning patients just before the scheduled surgery would effectively end the no shows/cancellation problem.

      He even returned the grant funding to the funding agency with an explanation!!!

      It is this binary “RCT good, other things bad” superficial education and thinking that is the problem.

    • Moreno Klaus says:

      BenK: From a mathematical point of view nothing is wrong. But the devil is in the details. One example of how hard this is in practice: Imagine you are comparing an intervention against standard care, but the intervention already exists in your health system at the time you are starting the trial. The consequence is that it becomes difficult (even unethical) to prevent the control group from using the intervention. This biases results against the intervention. There is also the famous healthy patient bias (the patients selected are healthier than the general population), and there is the question of whether the RCT is an accurate representation of reality: some interventions have a learning curve (for example, a new surgery). If the RCT is done in a major university hospital, but most patients are treated in small local hospitals, there is a question about whether the potential benefit we find can be extended to the general population. And I could go on the whole day…
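The contamination point above can be put in rough numbers. The following is only a sketch under a strong simplifying assumption that is not in the comment itself: the intervention adds the same fixed amount to the outcome of anyone who receives it. The function name `diluted_effect` and the example numbers are hypothetical.

```python
def diluted_effect(true_effect: float,
                   control_uptake: float,
                   treated_uptake: float = 1.0) -> float:
    """Expected difference in mean outcomes between the two trial arms.

    Assumes (for illustration only) a constant additive effect for every
    person who actually receives the intervention, whichever arm they
    were assigned to.
    """
    for frac in (control_uptake, treated_uptake):
        if not 0.0 <= frac <= 1.0:
            raise ValueError("uptake fractions must lie in [0, 1]")
    # Only the gap in uptake between the arms shows up in the comparison.
    return true_effect * (treated_uptake - control_uptake)


# Hypothetical true benefit of 10 units; vary how much of the control
# group obtains the intervention outside the trial.
for uptake in (0.0, 0.2, 0.5):
    print(f"control uptake {uptake:.0%}: "
          f"measured effect {diluted_effect(10.0, uptake):.1f}")
```

Under this assumption the measured benefit shrinks in proportion to the share of the control group that gets the intervention anyway, which is one way to read "biases results against the intervention".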

  2. Anon says:

    Randomization is the gold standard for determining causality. Economists are so much better at this sort of thing than Physicists.

    You should have seen the kind of subpar bronze-standard non-randomized methods physicists used to figure out that tides were caused by the moon. It’s a wonder physicists know the cause of anything.

  3. Anon says:

    Is the statement “Randomization is the gold standard” a theoretical or empirical assertion?

    • K? O'Rourke says:

      Actually simple minded muddled group think!

      In context, CS Peirce argued randomization mathematically justified (quantitative) inference, so theoretical here.

      In agriculture, repeated field yield trials (control versus control) were undertaken to get an empirical assessment.

      (In math, there are no errors but careless mistakes can happen – lots of mistakes possible in agricultural work.)

      But Fisher and Yates clearly recognised that time place variation precluded this from being an empirical assessment for plots other than those actually used in past years.

      Most would agree randomization gives a much higher standard than non-RCTs internally (for those in the study and people very much like them), at least for average outcomes, but is almost completely blind as to any mechanism and therefore of almost unknown generalizability.

      Some people try to get one from RCTs and the other (generalisation) with the help of Non-RCTs.

      One of the things I was struck by early in my career in clinical research was that people usually rushed into RCTs way too soon, missing valuable insights available from focus groups, surveys, case studies, analysis of NonRCTs and even other earlier RCTs.

      • Anon says:

        just to clarify, the statement isn’t “RCTs are better than nothing”, but “RCTs are the gold standard”.

        The results of psychologists’ RCTs are usually laughable, while biologists’ discovery of the Krebs cycle is rock solid causal science which, as far as I know, had nothing to do with RCTs or any of that crap.

        If we’re going to be empirical, then let’s extend that habit to the “gold standard”. How many of our best causal discoveries were found with RCTs?

        • question says:


          Krebs had a model capable of precise prediction to compare the results to. This is missing from most RCTs, so they use a strawman. Even though the results in the below paper consist of n=1 experiments, they are still more convincing than the usual RCT result. The difference is that in the below experiment the region of possible results consistent with the theory is very small, while usually 50% of possible results are consistent with the theory that motivated a RCT.

          “Quantitative significance of the citric acid cycle. Though the citric acid cycle may not be the only pathway through which carbohydrate is oxidised in animal tissues the quantitative data of the oxidation and resynthesis of citric acid indicate that it is the preferential pathway. The quantitative significance of the cycle depends on the rate of the slowest partial step, that is for our experimental conditions the synthesis of citric acid from oxaloacetic acid. According to the scheme one molecule of citric acid is synthesised in the course of the oxidation of one molecule of “triose”, and since the oxidation of triose requires 3 molecules O2, the rate of citric acid synthesis should be one third of the rate of O2 consumption if carbohydrate is oxidised through the citric acid cycle. We find for our conditions:
          Rate of respiration (QO2) = -20
          Rate of citric acid synthesis (Qcitrate) = +5,8
          The observed rate of the citric acid synthesis is thus a little under the expected figure (-6,6), but it is very probable that the conditions suitable for the demonstration of the synthesis (absence of oxygen) are not the optimal conditions for the intermediate formation of citric acid, and that the rate of citric acid synthesis is higher under more physiological conditions.”
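The arithmetic in the quoted passage is easy to check. This is just a worked restatement of the quoted figures, with decimal commas converted to points; the variable names are made up for illustration. The quoted expected figure of 6,6 appears to be one third of 20, truncated.

```python
# Figures taken from the passage quoted above.
q_o2 = -20.0       # rate of respiration, QO2 (negative = O2 consumed)
q_citrate = 5.8    # observed rate of citric acid synthesis, Qcitrate

# The scheme predicts one citric acid synthesised per triose oxidised,
# and oxidising one triose takes 3 molecules of O2.
expected_citrate = abs(q_o2) / 3

shortfall = expected_citrate - q_citrate
ratio = q_citrate / expected_citrate

print(f"expected {expected_citrate:.2f}, observed {q_citrate}, "
      f"observed/expected {ratio:.2f}")
```

The prediction is a single number, and the observation lands at roughly 87% of it, which is the sense in which the region of results consistent with the theory is small.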

  4. Clyde Schechter says:

    Although this is peripheral to the main point of the post, I feel compelled to point out that “… we already have a dedicated RCT/CEA funding stream for well designed healthcare delivery experiments called PCORI” is untrue. The legislation creating PCORI and the bills authorizing its funding explicitly forbid it to fund any cost-effectiveness analyses. RCT’s – yes, but CEA’s — absolutely not.

  5. Rahul says:

    External validity seems the biggest problem with RCTs in my opinion.

    It’s like concluding that walking on the left bumps you into fewer people at Heathrow, and then using that as a strategy for a lifetime of globetrotting airport trips.

  6. Elin says:

    RCTs do one thing, which is provide an arguable justification that observed differences within the study are not due to selection bias related to unmeasured variables. Since so many social interventions have turned out to have hidden unmeasured selection bias this is not a small thing.
    So internal validity is quite strong net of any social impacts related to the fact that double blind in social experiments is much more difficult than in a drug trial. The intervention deliverers know they are intervention deliverers. (In the world bank eye glasses study they did not give out 0 power glasses to children in the comparison sites for example.) So randomization by itself does not solve all internal validity problems.

    In a drug trial, typically hundreds of doctors across the country working in the context of their normal clinical practice are delivering the intervention in a blind manner. In a social RCT typically one organization in a well defined single or small number of geographic areas receives intensive intervention that changes complex things about their daily practices. The designers of the program often work hard to train the line workers and supervisory staff and to build enthusiasm or “buy in” for the program. Scale up from social RCTs almost never looks like the original program in practice. They can be highly suggestive, yes. For example there are a large number of criminal justice RCTs and they often give other agencies and jurisdictions good reasons to consider replicating, but the track record of trying identical interventions in other places is mixed at best (the mandatory arrest for domestic violence experimentation is the best known example of this).

    At the same time, untold thousands of people are running randomized AB advertising experiments on websites constantly, because Google encourages them to and gives them the tools to make the studies really blind. Is that smart and effective? Absolutely. If Andrew were running ads he’d be silly not to try red and green text color and proceed based on the results. But if red turned out to be better than green on Andrew’s site, would that mean it will be better on everyone’s site? I don’t think people would need the infrastructure that Google has built for site level experimentation if people who know about making money from advertising had concluded that context and audience didn’t matter.
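The first point in this comment, that randomization guards against selection on unmeasured variables, can be illustrated with a toy simulation. Everything here is an assumption made for the sketch, not something from the thread: a single unmeasured trait `u` that raises the outcome, a true effect of 1.0, and a self-selection rule under which people with higher `u` opt in more often. It says nothing about the external validity or blinding issues raised in the rest of the comment.

```python
import random
from statistics import mean


def simulate(n: int, randomized: bool,
             true_effect: float = 1.0, seed: int = 0) -> float:
    """Return the naive treated-minus-control difference in mean outcomes."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unmeasured trait, e.g. motivation
        if randomized:
            t = rng.random() < 0.5  # coin flip, independent of u
        else:
            t = u + rng.gauss(0.0, 1.0) > 0  # higher-u people opt in more
        # Outcome depends on treatment, on the unmeasured trait, and on noise.
        y = true_effect * t + 2.0 * u + rng.gauss(0.0, 1.0)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)


rct_estimate = simulate(20_000, randomized=True)
self_selected_estimate = simulate(20_000, randomized=False)
print(f"randomized:    {rct_estimate:.2f}")
print(f"self-selected: {self_selected_estimate:.2f}")
```

With these made-up settings the randomized comparison should land near the true effect of 1.0, while the self-selected comparison should come out several times larger, because the treated group differs on `u` before anyone is treated.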

    • Steve Sailer says:

      Or maybe red text works better in web advertising until it becomes a boring cliche that readers learn to not look at, at which point green text works better for some period of time until it too becomes a boring cliche. In marketing research, it’s expected that The Truth Wears Off.

  7. Steve Sailer says:

    “many researchers and economists are disturbed that it is not using randomized clinical trials”

    I was in a clinical trial for the very successful breakthrough anti-cancer drug rituximab 17 years ago when I had NH lymphatic cancer. They told me all participants got the rituximab trial drug — nobody got a placebo. But I was still worried they’d slip me a placebo, so when I first got this monoclonal antibody and started severe shivering, while all the doctors and nurses were very worried, I was ecstatic. This was clearly not a placebo. It was strong medicine!
