Hey, we all know the answer: “correlation does not imply causation”—but of course life is more complicated than that. As philosophers, economists, statisticians, and others have repeatedly noted, most of our information about the world is observational, not experimental, yet we manage to draw causal conclusions all the time. Sure, some of these conclusions are wrong (more often than 5% of the time, I’m sure) but that’s an accepted part of life.
Challenges in this regard arise in the design of a study, in the statistical analysis, in how you write it up for a peer-reviewed journal, and finally in how you present it to the world.
School sports and life outcomes
An interesting case of all this came up recently in a post on Freakonomics that pointed to a post on Deadspin that pointed to a research article. The claim was that “sports participation [in high school] causes women to be less likely to be religious . . . more likely to have children . . . more likely to be single mothers.” And the advertised effects were huge: “a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.” These effects are huge to start with (elasticities of 50% for outcomes that apparently have nothing to do with the treatment), and they are even larger when you consider that the outcomes are binary: sports participation can’t make you secular if you were already going to be secular anyway, it can’t cause you to have a child if you were already going to have a child anyway, and so on.
But, as the authors of the paper (Phoebe Clarke and Ian Ayres) explain in their blog posts and the scholarly article, they’re not measuring the effects of sports participation directly. Here’s what they’re doing:
This paper . . . adopts an instrumental-variables method . . . in which variation in rates of boys’ athletic participation across states before the passage of Title IX is used to instrument for changes in girls’ athletic participation following its passage . . .
Here’s the summary from the published article in the Journal of Socio-Economics:
And here’s Ayres in the Freakonomics blog:
I apply the same methodology [as used earlier by economist Betsey Stevenson] to social outcomes, and find that sports participation causes women to be less religious, more likely to have children, and, if they do have children, more likely to be single mothers.
More specifically, their analysis is “comparing women in states with greater levels of 1971 male [high school] sports participation . . . to women in states with lower levels of 1971 male sports participation.” The outcomes are state-level average responses to General Social Survey questions for “respondents who completed tenth grade and who either attended high school before Title IX was passed in 1972 or after it came into full effect in 1978.” So they’re doing their best to target their analysis on the group of women who’d be affected by the treatment. Then they run individual-level regressions on binary variables (just as a minor point, I’d prefer to keep the original ordered responses; not a big deal but it can’t hurt), but the action is all coming from the state-level predictor (the measure of male athletic participation in 1971, by state).
The trouble is that instrumental variables regression is not magic. In this case, the problems are:
(a) the treatment is at the group, not the individual, level, and
(b) it’s not a clean “natural experiment.”
Think of it this way. Suppose some states were randomly selected to get the Title IX treatment and some were not. This would be the ideal scenario—but, even there, you’re measuring the effect of an aggregate policy, not the effect on individual participation. But it’s much worse than that. Actually, the treatment was applied to all the states, so all that could be studied was an interaction. It’s not even like those natural-experiment settings where a new policy is phased in during different years in different states. Finally, of course the interaction being studied is not random; there are systematic differences between states with higher and lower boys’ high school sports participation in 1971. (The highest rates are reported in North Dakota, Nebraska, Minnesota, Iowa, Kansas, Montana, Arkansas, South Dakota, Vermont, Idaho, Oregon, and Wyoming.)
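To see concretely why a non-random, group-level instrument can mislead, here is a toy simulation. All the numbers are invented; this is a sketch of the generic problem, not a reanalysis of the actual data. An unobserved state trait drives both the instrument (boys’ 1971 participation) and the outcome, and the IV estimate comes out far from the true treatment effect of zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states = 50

trait = rng.normal(0, 1, n_states)  # unobserved state characteristic (e.g., rurality)

# Instrument: boys' 1971 participation, correlated with the unobserved trait
boys_1971 = 0.5 + 0.1 * trait + rng.normal(0, 0.02, n_states)

# First stage: the instrument does predict the treatment (change in girls' participation)
girls_change = 0.3 * boys_1971 + rng.normal(0, 0.02, n_states)

# Outcome depends only on the trait: the TRUE effect of girls' participation is zero
outcome = 0.2 * trait + rng.normal(0, 0.02, n_states)

# Wald / IV estimate: cov(instrument, outcome) / cov(instrument, treatment)
iv_est = np.cov(boys_1971, outcome)[0, 1] / np.cov(boys_1971, girls_change)[0, 1]
print(f"IV estimate: {iv_est:.2f} (true effect is 0)")
```

The instrument here has a strong first stage, yet the estimate is badly biased because the exclusion restriction fails: the state trait affects the outcome through a path other than girls’ participation.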
So, where does this stand on the correlation-causation scale? Clarke and Ayres are measuring correlations and giving them a causal interpretation. That’s not always such a bad thing to do, indeed it corresponds to an implicit model in which the observed variation can be taken as random (an ignorable treatment assignment, as Rubin would say). In this case they have a bit more—but not a lot more, in my opinion, because of problems (a) and (b) above. In short, I disagree with Ayres’s claim that this instrumental variable “is about as good as they come.” They come better. That doesn’t mean I don’t think Clarke and Ayres should publish their results. I just don’t think they should jump the gun on the causal interpretations.
Kaiser Fung has written about “story time”: after researchers do the hard work of causal identification and statistical analysis, they move on to unsupported speculation, with the general idea that some of the rigor of the design and analysis should leak into the speculation. I think story time is just fine (and I think Kaiser would agree with me on this). What’s important is to draw the line at the right place, to make it clear to your readers where the data analysis ends and the speculation begins. In this case, I think the analysis ends somewhere after the state-level correlational analysis and the discussion of possible identification. The causal reasoning is speculation.
What to do?
OK, fine. What are we getting from all this, besides general “Mom and apple pie” advice not to oversell our research results (advice that would be good for me to follow sometimes with my own work, I’m sure)?
I do think we can get somewhere, taking as a starting point the implausibility of the reported point estimates. As noted above, if a ten percentage-point increase in state-level female sports participation is associated with a five percentage-point increase in the proportion of women who are mothers, there’s no way that most of this can be coming from a direct effect. The implication would be that there’s this huge group of girls who (i) will have children if they do sports, and (ii) will not have children if they do not do sports. Clearly these estimated elasticities have to be driven by big differences between states that possibly have nothing to do with high school sports. The authors do some placebo controls—applying their analysis to some other outcomes—and get a mix of statistically significant and non-significant results, and that’s fine, but maybe the next step would be to do some more systematic comparisons, looking at lots of different state-level predictors (not just boys’ 1971 high school sports participation) and lots of different state-level outcomes. Report a big grid of correlations, then see what’s there.
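As a sketch of what such a grid might look like in code, here is a minimal version with made-up predictor and outcome names and simulated data standing in for the real state-level measures:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 50

# Hypothetical names; the real analysis would use actual state-level measures
predictor_names = ["boys_sports_1971", "median_income", "pct_rural", "church_density"]
outcome_names = ["secularism", "motherhood", "single_motherhood", "college_grad"]

# Simulated placeholders for state-level averages
predictors = rng.normal(size=(n_states, len(predictor_names)))
outcomes = rng.normal(size=(n_states, len(outcome_names)))

# Grid of correlations: rows are predictors, columns are outcomes
grid = np.array([[np.corrcoef(predictors[:, i], outcomes[:, j])[0, 1]
                  for j in range(len(outcome_names))]
                 for i in range(len(predictor_names))])

for name, row in zip(predictor_names, grid):
    print(f"{name:18s}", np.round(row, 2))
```

With real data, the point of the grid is comparative: if boys’ 1971 sports participation correlates with these outcomes no more strongly than a bunch of other state-level predictors do, that’s informative about how seriously to take the headline result.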
I’d also suggest, for each outcome, to make a scatterplot of the state-level aggregate vs. boys’ sports participation in 1971. If you want to make the causal leap, go for it—but make clear that it’s a leap. In the meantime, the scatterplot (with the 50 states labeled by their convenient two-letter abbreviations) could give a lot of insight.
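A minimal matplotlib sketch of such a labeled scatterplot, using a handful of states and invented numbers in place of the real data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# A subset of states for illustration; the real plot would show all 50
states = ["ND", "NE", "MN", "IA", "KS", "MT", "AR", "SD", "VT", "ID", "OR", "WY"]
rng = np.random.default_rng(1)
boys_1971 = rng.uniform(0.3, 0.7, size=len(states))  # hypothetical participation rates
outcome = 0.2 + 0.3 * boys_1971 + rng.normal(0, 0.05, len(states))  # hypothetical averages

fig, ax = plt.subplots()
for abbr, x, y in zip(states, boys_1971, outcome):
    ax.text(x, y, abbr, ha="center", va="center")  # label each point with its abbreviation
ax.set_xlim(0.25, 0.75)
ax.set_ylim(outcome.min() - 0.05, outcome.max() + 0.05)
ax.set_xlabel("Boys' high school sports participation, 1971")
ax.set_ylabel("State-level average outcome (e.g., proportion secular)")
fig.savefig("state_scatter.png", dpi=150)
```

Plotting the two-letter abbreviations instead of dots makes regional patterns (say, the upper-Midwest cluster) jump out immediately.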
Finally, if you’re interested in the substantive questions about the effects of sports participation, I think it’s essential to make a connection to whatever is already known in this field. Sure, survey data have their limitations: as Clarke and Ayres note, kids select into sports participation. But there are ways of getting around this, various versions of natural experiments which, like the Title IX thing, are not perfect but provide some leverage. Also one can try to model the selection process. Lots of ways of doing this.
Accepting uncertainty and embracing variation
Also a minor point. The article includes the following footnote:
It is true that many successful women with professional careers, such as Sheryl Sandberg and Brandi Chastain, are married. This fact, however, is not necessarily opposed to our hypothesis. Women who participate in sports may “reject marriage” by getting divorces when they find themselves in unhappy marriages. Indeed, Sheryl Sandberg married and divorced before marrying her current husband.
This sort of case-by-case discussion can be interesting for formulating hypotheses but it looks odd to me when phrased as above. My problem is that, even if all the modeling assumptions are correct, the model’s predictions are only probabilistic. It’s not necessary to explain away every contrary example. This is not a big deal but I bring it up because one of our themes on this blog in recent months has been the love of certainty, and the desire to use statistical tools to transmute variation and uncertainty into sure things. Sometimes this works (the law of large numbers and all that) but when we get back to individual cases we should recognize the limitations of our models and our predictions.
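A quick illustration of the point: even if the hypothesized effect were real and large, the model’s predictions are only probabilistic, so plenty of individual counterexamples are expected. The 35% figure below is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
p_single = 0.35  # hypothetical probability for a sports participant; invented number
n = 1000         # imagine 1000 such women

singles = int((rng.random(n) < p_single).sum())
print(f"{singles} 'confirming' cases, {n - singles} 'contrary' cases")
```

Even under a model with a large effect, most individual cases would look like counterexamples, so there is no need to explain away each one.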
Limitations of the claims, and limitations of the criticisms
As is generally the case with these correlation-causation things, I don’t want to say that the research hypotheses are false. It may well be true that, at the individual level, “sports participation causes women to be less religious, more likely to have children, and, if they do have children, more likely to be single mothers,” even if the actual effects are an order of magnitude lower than claimed. But state-level correlations don’t tell us much about this. Recall that if we were studying state-level correlations of income and voting, we’d come to the false conclusion that poor people are more likely to vote Republican. In the present example, the Title IX story helps, but only a little.
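The income-and-voting example can be made concrete with a small simulation (all parameters invented): within every state, higher-income individuals are more likely to vote Republican, yet richer states have lower Republican vote shares, so the state-level correlation points the opposite way from the individual-level relationship:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_per_state = 50, 200

state_income = rng.normal(0, 1, n_states)
# Richer states lean less Republican (between-state relationship is negative)
state_baseline = -0.8 * state_income + rng.normal(0, 0.3, n_states)

within_corrs, mean_income, rep_share = [], [], []
for s in range(n_states):
    income = state_income[s] + rng.normal(0, 1, n_per_state)
    # Within each state, richer individuals lean MORE Republican (positive slope)
    p = 1 / (1 + np.exp(-(state_baseline[s] + 0.6 * (income - state_income[s]))))
    vote = (rng.random(n_per_state) < p).astype(float)
    within_corrs.append(np.corrcoef(income, vote)[0, 1])
    mean_income.append(income.mean())
    rep_share.append(vote.mean())

within = float(np.mean(within_corrs))                       # positive
between = float(np.corrcoef(mean_income, rep_share)[0, 1])  # negative
print(f"mean within-state corr: {within:.2f}, between-state corr: {between:.2f}")
```

Analyzing only the 50 state averages here would get the sign of the individual-level relationship exactly backwards, which is the danger of leaning on state-level correlations for individual-level causal claims.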