If observational studies are outlawed, then only outlaws will do observational studies

My article “Experimental reasoning in social science” begins as follows:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978): “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

In confronting these issues, we must consider some general issues in the strategy of social science research. We also take from the psychology methods literature a more nuanced perspective that considers several different aspects of research design and goes beyond the simple division into randomized experiments, observational studies, and formal theory.

It’s a chapter in the book, “Field Experiments and Their Critics,” edited by Dawn Teele and based on a symposium at Yale a few years ago featuring Don Green, Alan Gerber, Abhijit Banerjee, Esther Duflo, and several other political scientists and economists.

P.S. The original version of this post included an image from wikipedia. Photo by Larry Philpot, www.soundstagephotography.com

8 Comments

  1. Jeremy Miles says:

    It’s nice to see that a book like this is priced sensibly, and not in the close to three figures range that you often see.

  2. jrc says:

    I just skimmed your article, but I would think that the first and biggest reason we would not want to discount non-experimental empirical methods is that, in the social sciences at least, experiments are often none of the following: feasible, relevant, or ethical. By this I mean:

    Feasible: Suppose we want to know the effect of the minimum wage on employment. It is not feasible to randomly assign people a minimum wage, so the best we can hope for is the “quasi-experimental” type of analysis, which I guess you call a “natural experiment” but it isn’t really “natural” and in my opinion isn’t really an experiment – it is a policy change that we re-conceive as an experiment. This is just using the *metaphor* of experimentation in a different context.

    Relevant: Suppose you want to know how expectations about child mortality rates affect fertility decisions to more deeply understand the demographic transition. If we are trying to understand something that happened in the past, there may be very little value in experimenting, even if an experiment could get at some little piece of the puzzle.

    Ethical: Suppose we want to know how bureaucratic ordeals affect welfare take-up, we wouldn’t just make it harder for poor people to get their welfare… or suppose we wanted to know the determinants of solicited bribe amounts, we wouldn’t just go around committing crimes in front of police to see what they ask for….or suppose we wanted to know if giving tax collectors new power increased their ability to extort bribes, we wouldn’t just give some of them extra bribe-extorting powers and then see if they used them… or suppose we wanted to know the long-term effects of untreated syphilis, we wouldn’t just give it to a bunch of people who didn’t know what was going on and then see what happened to them.

    • Erin Jonaitis says:

      Ouch. I see what you did there.

    • Fernando says:

      jrc:

      Many of the issues you mentioned can be solved with a little imagination. For example, we don’t randomize people to smoking. Rather, we randomize smokers to smoking cessation.

      So take your ethical examples:

      – “Suppose we want to know how bureaucratic ordeals affect welfare take-up, we wouldn’t just make it harder for poor people to get their welfare…”

      Here we could randomize the rollout of bureaucratic streamlining in a highly bureaucratic context.

      – “the determinants of solicited bribe amounts, we wouldn’t just go around committing crimes in front of police to see what they ask for”

      Actually, experiments have been done on this related to race, education, etc., none of which involved committing a crime. Bureaucrats often ask for bribes for doing their work.

      – “or suppose we wanted to know if giving tax collectors new power increased their ability to extort bribes, we wouldn’t just give some of them extra bribe-extorting powers and then see if they used them”

      Again, you could give them less power, and see if that diminishes the ability to extort bribes. (Again assuming effects are symmetric, which they need not be.)

      – ” or suppose we wanted to know the long-term effects of untreated syphilis, we wouldn’t just give it to a bunch of people who didn’t know what was going on and then see what happened to them.”

      Given a population of people with syphilis, we might do an encouragement design, inviting people to seek early treatment, or something like that. After all, you cannot typically force people to seek treatment.

      Admittedly these all involve additional assumptions, or slightly different estimands (e.g., LATE vs. ATE), but don’t make the perfect the enemy of the good.
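      The LATE vs. ATE distinction here can be made concrete with a small simulation. The sketch below (Python, with entirely made-up numbers) mimics an encouragement design like the syphilis-treatment one above: a randomized invitation shifts treatment uptake, an unobserved confounder biases the naive treated-vs-untreated comparison, and the Wald estimator (intent-to-treat effect divided by the uptake effect) recovers the treatment effect among those whose uptake responds to the encouragement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Unobserved "health-seeking propensity" that confounds uptake and outcome.
u = rng.normal(size=n)

# Randomized encouragement: an invitation to seek early treatment.
z = rng.integers(0, 2, size=n)

# Treatment uptake: encouragement raises uptake, but uptake also depends on u,
# so a naive treated-vs-untreated comparison is confounded.
p_treat = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * z + 1.0 * u)))
d = rng.binomial(1, p_treat)

# Outcome: the true treatment effect is 2.0 (constant here, so LATE = ATE),
# but u also shifts the outcome directly.
y = 2.0 * d + 1.5 * u + rng.normal(size=n)

# Naive difference in means by treatment status (biased upward by u).
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald / IV estimator: intent-to-treat effect over the uptake effect.
itt = y[z == 1].mean() - y[z == 0].mean()
uptake = d[z == 1].mean() - d[z == 0].mean()
late = itt / uptake

print(f"naive comparison: {naive:.2f}, Wald (LATE): {late:.2f}")
```

      With heterogeneous effects the Wald ratio estimates the effect only for “compliers” (those moved by the encouragement), which is the extra assumption, or rather the different estimand, being traded for ethical feasibility.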

  3. question says:

    “suppose we wanted to know the long-term effects of untreated syphilis, we wouldn’t just give it to a bunch of people who didn’t know what was going on and then see what happened to them.”

    I think you are combining two stories there. Naturally infected but withheld treatment: https://en.wikipedia.org/wiki/Tuskegee_syphilis_experiment. Purposefully infected but attempted treatment: https://en.wikipedia.org/wiki/Guatemala_syphilis_experiment

    This guy’s story is far scarier: https://en.wikipedia.org/wiki/Albert_Stevens. They misdiagnosed him with stomach cancer, told him he was doomed to die, then while he was in the hospital they injected him with plutonium without telling him.

I’m really worried about the implied epistemological supremacy of the notion of ‘gold standard’. Isn’t it fairer to say that much of the statistical apparatus has been based on assumptions of randomness that do not obtain unless careful selection measures have been taken? And aren’t we still left with the problem of the representativeness of any initial selection upon which randomization is performed?

    Wouldn’t it be just as accurate to say that anthropological thick description is the gold standard of knowledge in the social sciences and randomised control trials are simply a way of approximating that at a large scale on an extremely limited set of social phenomena susceptible to a certain kind of measurement (either directly or by proxy)?

    Or wouldn’t we be better off saying that there is no such thing as a ‘gold standard’ but rather competing narratives of causality about the relationship of human action and social reality?

    Or would we not be even better off saying that there are population-level, small-group-level and individual-level phenomena which we should be interpreting in their own right? For instance, in many of the examples the effect sizes are too small to make a difference for any one individual but may result in a complete reshaping of the population (as with ‘nudge’).

    But having read the paper, I don’t see that you really tried to make the case for randomised experimentation. The dictum “To find out what happens when you change something, it is necessary to change it” surely works only in the simplest of cases, where there are no confounding issues of scale.

    Pedagogic reform is a good example. You can do an RCT of a particular pedagogic intervention and you will often find that it works. So you try to roll it out nationwide and any effect the intervention had disappears (which almost always happens), perhaps because you lose the level of quality control you had in the experiment. What does it mean for the quality of the knowledge? Did you learn that this particular intervention works? Perhaps. But did you also learn that it should be implemented nationwide? Perhaps not. And, more worryingly, are there perhaps many other types of interventions that did not produce an effect in an RCT but that should be implemented nationwide?

    Ultimately, even with RCT data, you have to tell a story at the level of an anecdote. And I’d suggest that it is these types of anecdotes that we should be paying attention to (see here for a longer treatment from a different perspective: http://metaphorhacker.net/2011/03/epistemology-as-ethics-decisions-and-judgments-not-methods-and-solutions-for-evidence-based-practice).
