
Could propensity score analysis fix the Harvard Nurses study?

A well-publicized example of problems with observational studies is hormone replacement therapy and heart attack risks for postmenopausal women. In brief, the observational study gave misleading answers because the “treatment” and “control” groups differed systematically. Could the method of propensity scores have found (and solved) the problem?

Hormone replacement therapy and heart attacks

The evidence from the Women’s Health Initiative, a randomized clinical trial from the 1990s, is that hormone replacement therapy increases the risk of heart attack in older women. (Here’s a summary from the American College of Obstetricians and Gynecologists, which I found at the entry for Hormone Replacement Therapy at the National Library of Medicine).

Confusion from the observational study

The above findings surprised people because observational evidence from the Harvard Nurses Health Study found that women who used hormone replacement therapy had a lower risk of heart attacks. For example, in an article in the Harvard Health Letter from October, 1997:

The latest report from the Nurses’ Health Study speaks to some of those issues. In this ongoing investigation, begun in 1976, the researchers examined the impact of long-term HRT use in more than 60,000 nurses. They looked at length and continuity of hormone use and how this affected the women’s death rates. They also studied women who used estrogen alone or in an estrogen/progesterone combination and adjusted their data to account for smoking, weight, exercise, and other lifestyle habits.

Overall, the researchers found that the death rate among HRT users was 37% lower than that of women who had never taken hormones, primarily because the hormones appeared to protect women against heart disease. Indeed, the risk of dying of cardiovascular disease was 53% lower in the HRT group.

But since then, attitudes have changed. For example, from the NIH:

Do not use estrogen plus progestin therapy to prevent heart disease. The new findings show that it doesn’t work. In fact, the therapy increases the chance of a heart attack or stroke. And it increases the risk of breast cancer and blood clots.

The Nurses Health Study seems to have struck out on that one! The women who took HRT were apparently quite a bit different, on average, from those who didn’t–even after “controlling” for background variables.

But is it possible that, if the data from the Nurses study had been analyzed using propensity scores (see here for a description of the method), more reasonable claims would have been made from the beginning?
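
To make the question concrete, here is a minimal sketch in Python of what a propensity score adjustment does. It runs on simulated data, not the Nurses data, and everything in it is an assumption made up for illustration: the single binary covariate "healthy", the probabilities, the built-in harmful effect of 0.02, and the function names. The propensity score is estimated as the fraction treated at each covariate value, and the comparison is then reweighted by inverse propensity.

```python
import random


def simulate(n=200_000, seed=0):
    """Simulate (healthy, treated, event) rows. All numbers are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        healthy = rng.random() < 0.5
        # Healthier women are assumed more likely to take the treatment.
        treated = rng.random() < (0.8 if healthy else 0.2)
        # Treatment is assumed to ADD 0.02 to the risk of an event.
        risk = (0.03 if healthy else 0.12) + (0.02 if treated else 0.0)
        event = rng.random() < risk
        rows.append((int(healthy), int(treated), int(event)))
    return rows


def naive_difference(rows):
    """Event rate among the treated minus event rate among controls, unadjusted."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def estimate_propensity(rows):
    """Estimate P(treated | covariate) as the fraction treated at each covariate value."""
    counts = {}
    for x, t, _ in rows:
        n, k = counts.get(x, (0, 0))
        counts[x] = (n + 1, k + t)
    return {x: k / n for x, (n, k) in counts.items()}


def ipw_difference(rows, propensity):
    """Risk difference using normalized inverse-propensity weights."""
    num1 = den1 = num0 = den0 = 0.0
    for x, t, y in rows:
        e = propensity[x]
        if t == 1:
            w = 1.0 / e
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - e)
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0


rows = simulate()
naive = naive_difference(rows)
adjusted = ipw_difference(rows, estimate_propensity(rows))
print(f"naive risk difference:    {naive:+.4f}")
print(f"adjusted risk difference: {adjusted:+.4f}")
```

In this simulation the naive comparison makes the treatment look protective, while the weighted comparison recovers the harmful effect that was built in. That only works because "healthy" is recorded in the data. If the relevant differences between users and non-users were never measured, the propensity score would have nothing to adjust for.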


The Nurses Study continues to operate and make headlines, so this is still a live issue.

I was writing about matching and propensity scores because these are the adjustment methods most familiar to me. The question could equally be asked about other methods, such as g-estimation (see the comments of Jamie Robins at this 2003 meeting).


  1. Eric Lim says:

    I was a student at a London University last year studying for my MSc in Biostatistics.

    A very senior statistician from the Harvard Nurses Study gave a guest lecture and presented the conflicting information from the initial observational studies and the current randomised trial.

    I asked him: 'Would propensity score analysis be useful in this regard to analyse the results from the initial observational studies?'

    The reply was terse: 'We don't do propensity score analysis, it's not part of our repertoire', and there didn't seem to be any hint of interest in exploring it either. Quite sad really.

    Eric Lim

    Cambridge, UK

  2. Andrew says:


    I heard a similar story from somebody else. Really disheartening, although I suppose it's naive to think that scientists will be much more ethical than other people.

  3. Joseph Delaney says:

    My understanding of the issue is that there was also a prevalent user problem (creating selection bias), at least partially due to time-varying risk. While this could have been found and modeled, I am unsure how propensity scores give any advantage over a thoughtfully constructed regression model, unless the study you are thinking of had a lot more power to estimate predictors of exposure than predictors of outcomes because there were very few outcomes (but I don't believe that this was the case with the Nurses' Health Study).

    I'm not saying that better statistical models shouldn't be used, but I worry about overstating the benefits of propensity score analysis. It's an extremely good technique, no question about it, and I've published on one of its variations. But I want to be very sure that we don't miss issues of study design and bias in the process.

    Issues of self-selection seriously limit all observational epidemiology. The issue is serious enough that I often wonder if we should not use observational studies to estimate medication benefits (at all). It's just too misleading.

  4. Matthew Salganiik says:

    The New York Times Magazine had an interesting article that reviews the confusion surrounding hormone replacement therapy. Here's a link:

    I find that this article is very good for teaching about correlation and causation.