Forking paths plus lack of theory = No reason to believe any of this.

[image of a cat with a fork]

Kevin Lewis points us to this paper which begins:

We use a regression discontinuity design to estimate the causal effect of election to political office on natural lifespan. In contrast to previous findings of shortened lifespan among US presidents and other heads of state, we find that US governors and other political office holders live over one year longer than losers of close elections. The positive effects of election appear in the mid-1800s, and grow notably larger when we restrict the sample to later years. We also analyze heterogeneity in exposure to stress, the proposed mechanism in the previous literature. We find no evidence of a role for stress in explaining differences in life expectancy. Those who win by large margins have shorter life expectancy than either close winners or losers, a fact which may explain previous findings.

All things are possible but . . . Jesus, what a bunch of forking paths. Forking paths plus lack of theory = No reason to believe any of this.

Just to clarify: Yes, there’s some theory in the paper, kinda, but it’s the sort of theory that Jeremy Freese describes as “more vampirical than empirical—unable to be killed by mere evidence” because any of the theoretical explanations could go in either direction (in this case, being elected could be argued to increase or decrease lifespan, indeed one could easily make arguments for the effect being positive in some scenarios and negative in others): the theory makes no meaningful empirical predictions.

I’m not saying that theory is needed to do social science research: There’s a lot of value in purely descriptive work. But if you want to take this work as purely descriptive, you have to deal with the problems of selection bias and forking paths inherent in reporting demographic patterns that are the statistical equivalent of noise-mining statements such as, “The Dodgers won 9 out of their last 13 night games played on artificial turf.”

On the plus side, the above-linked article includes graphs indicating how weak and internally contradictory the evidence is. So if you go through the entire paper and look at the graphs, you should get a sense that there’s not much going on here.

But if you just read the abstract and you don’t know about all the problems with such studies, you could really get fooled.

The thing that people just don’t get is how easy it is to get “p less than .01” using uncontrolled comparisons. Uri Simonsohn explains in his post, “P-hacked Hypotheses Are Deceivingly Robust,” along with a story of odd numbers and the horoscope.
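To see just how easy, here’s a toy simulation (not the paper’s data; every number below is invented for illustration): generate lifespans that are pure noise with no winner/loser effect at all, let the analyst test enough subgroups, and a “p less than .01” result shows up in a substantial fraction of datasets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise: "lifespans" for 400 winners and 400 losers, no true effect.
n, n_subgroups, n_sims = 400, 20, 1000
hits = 0
for _ in range(n_sims):
    winners = rng.normal(72, 10, size=n)
    losers = rng.normal(72, 10, size=n)
    # An analyst free to slice the data can test many subgroups
    # (by era, office, margin, ...) and report the best-looking one.
    pvals = []
    for _ in range(n_subgroups):
        idx = rng.choice(n, size=n // 4, replace=False)
        _, p = stats.ttest_ind(winners[idx], losers[idx])
        pvals.append(p)
    if min(pvals) < 0.01:
        hits += 1

print(f"share of pure-noise datasets with at least one p < .01: {hits / n_sims:.2f}")
```

With 20 looks at the data, roughly one in six pure-noise datasets will clear the .01 threshold somewhere, even though nothing is going on.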

It’s our fault

Statistics educators, including myself, have to take much of the blame for this sad state of affairs.

We go around sending the message that it’s possible to get solid causal inference from experimental or observational data, as long as you have a large enough sample size and a good identification strategy.

People such as the authors of the above article then take us at our word, gather large datasets, find identification strategies, and declare victory. The thing we didn’t say in our textbooks was that this approach doesn’t work so well in the absence of clean data and strong theory. In the example discussed above, the data are noisy—lots and lots of things affect lifespan in much more important ways than whether you win or lose an election—and, as already noted, the theory is weak and doesn’t even predict a direction of the effect.

The issue is not that “p less than .01” is useless—there are times when “p less than .01” represents strong evidence—but rather that this p-value says very little on its own.

I suspect it would be hard to convince the authors of the above paper that this is all a problem, as they’ve already invested a lot of work in this project. But I hope that future researchers will realize that, without clean data and strong theory, this sort of approach to scientific discovery doesn’t quite work as advertised.

And, again, I’m not saying the claim in that published paper is false. What I’m saying is I have no idea: it’s a claim coming out of nowhere for which no good evidence has been supplied. The claim’s opposite could just as well be true. Or, to put it more carefully, we can expect any effect to be highly variable, situation-dependent, and positive in some settings and negative in others.

P.S. I suspect this sort of criticism is demoralizing for many researchers—not just those involved in the particular article discussed above—because forking paths and weak theory are ubiquitous in social research. All I can say is: yeah, there’s a lot of work out there that won’t stand up. That’s what the replication crisis is all about! Indeed, the above paper could usefully be thought of as an example of a failed replication: if you read carefully, you’ll see that the result that was found was weak, noisy, and in the opposite direction of what was expected. In short, a failed replication. But instead of presenting it as a failed replication, the authors presented it as a discovery. After all, that’s what academic researchers are trained to do: turn those lemons into lemonade!

Anyway, yeah, sorry for the bad news, but that’s the way it goes. The point of the replication crisis is that you have to start expecting that huge swaths of the social science literature won’t replicate. Just cos a claim is published, don’t think that implies there’s any serious evidence behind it. “Hey, it could be true” + “p less than 0.05” + published in a journal (or, in this case, posted on the web) is not enough.

22 Comments

  1. Jonathan says:

    Can I add: who the heck cares and why should we? A year longer … oh, but if you win by a lot then you don’t live a year longer. Reminds me of the hot dog ‘studies’ that found a link between consumption of hot dogs and childhood leukemia – something people have a reason to care about – except the link appeared at, say, 8-12 hot dogs and disappeared at, say, 15 hot dogs. So there would have to be some hidden mechanism that both triggered leukemia and somehow made that trigger ineffective if you kept eating, the magic preventative being to eat more hot dogs than any rational person would eat. Discontinuous regression indeed.

  2. curio says:

    “it’s the sort of theory that Jeremy Freese describes as “more vampirical than empirical—unable to be killed by mere evidence” because any of the theoretical explanations could go in either direction”

    Could you link to a post where you discuss your take on “severity” and “falsifiability”?

  3. Martha (Smith) says:

    “this p-value says very little on its own”

    Yes! Better still:

    “no p-value says much on its own”

  4. Z says:

    Why aren’t the data ‘clean’ in this study? Don’t we know vote totals and life-spans pretty precisely?

    • Andrew says:

      Z:

      The sentence above is: “The thing we didn’t say in our textbooks was that this approach doesn’t work so well in the absence of clean data and strong theory.” In this case the data are clean but the theory is weak—really the theory is weaker than weak, as it makes no predictions and it is completely open-ended.

      • anon says:

        Or the partial data are clean, but the complete data are not, right?
        Meaning we have precise vote totals and life-spans, but not clean data on everything else that affects life-span.

  5. abdulh says:

    “I’m not saying that theory is needed to do social science research: There’s a lot of value in purely descriptive work.”

    not sure what this means. What would be a concrete example of purely descriptive work?

  6. Paul Alper says:

    Andrew wrote:

    “We go around sending the message that it’s possible to get solid causal inference from experimental or observational data, as long as you have a large enough sample size and a good identification strategy. People such as the authors of the above article then take us at our word, gather large datasets, find identification strategies, and declare victory. “

    What does that quotation say about data mining in general?

    • someone says:

      What does “good identification strategy” mean? Depending on what it includes isn’t the message sent correct? And data mining in general correct with “good identification strategies”?

  7. A.G.McDowell says:

    I guess what counts as a credible theory is in the eye of the beholder, but there is at least a precedent for examining public success vs lifespan – http://www.sciencemag.org/news/2001/05/oscar-winners-live-long-lives studied the life expectancy of Oscar winners. Anecdotally, see e.g. the death of https://en.wikipedia.org/wiki/Charles_Kennedy after his political career (like all others, as Enoch Powell points out) ended in failure.

    • Andrew says:

      Ag:

      The paper being discussed did refer to academic literature on electoral victory and lifespan, and it’s not impossible for there to be some sort of connection. As I wrote in the above post, I’m not saying the claim in that published paper is false. What I’m saying is I have no idea: The claim’s opposite could just as well be true (indeed, it’s the opposite claim that’s appeared in the literature). Or, to put it more carefully, we can expect any effect to be highly variable, situation-dependent, and positive in some settings and negative in others. The theory in the paper under discussion is weak because it makes no predictions and could easily go in any direction.

      • Mark Borgschulte says:

        I think it’s reasonable to ask whether this is a finding about the average effect of election in US history, or if these effects would be thought to hold more generally, e.g. in other countries and other times. The RD design is greedy with data, and people take a long time to die, so we can’t zoom in on the recent decades in the US, where people may be most interested. The closest thing we can say is that the positive effect of election is stronger in the second half of the sample.

  8. Imaging guy says:

    Although I don’t think it is applicable here, those comparing lifespans of different professions or other exposures (i.e. potential causal factors) have to consider “immortal time bias”. You can’t say generals live longer than soldiers, since people become generals in middle age and their deaths can only occur after that point, while soldiers can die at any age above 20. Similarly, those who became U.S. presidents at earlier ages tended to die younger than those who became president at older ages. These three articles discuss a published study where “immortal time bias” was not taken into account.
    1) Avoiding blunders involving ‘immortal time’
    2) Sun exposure and longevity: a blunder involving immortal time
    3) Skin cancer as a marker of sun exposure: a case of serious immortality bias
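A quick simulation makes the bias concrete (all numbers here are invented): give everyone the same lifespan distribution, promote only those who survive to the promotion age, and the naive comparison manufactures a longevity “effect” out of nothing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Everyone draws a lifespan from the same distribution, so by construction
# becoming a "general" has no causal effect on longevity.
lifespans = rng.normal(70, 15, size=100_000).clip(20, 110)

# Promotion happens at age 45; only those still alive then can be promoted.
promotion_age = 45
eligible = lifespans > promotion_age
promoted = eligible & (rng.random(lifespans.size) < 0.1)

# Naive comparison: "generals" vs. everyone else. The non-promoted group
# includes deaths before 45, which the promoted group cannot contain.
naive_gap = lifespans[promoted].mean() - lifespans[~promoted].mean()

# Fair comparison: restrict both groups to those who survived to 45,
# so both groups share the same "immortal time".
fair_gap = lifespans[promoted].mean() - lifespans[eligible & ~promoted].mean()

print(f"naive gap: {naive_gap:+.1f} years (spurious)")
print(f"fair gap:  {fair_gap:+.1f} years")
```

The naive gap comes out well over a year in favor of the “generals,” while the fair comparison is essentially zero.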

  9. Peter Thow says:

    I am sympathetic to the need for theory, but to be fair, what are the paths and where are the forks here? In the abstract I see only “US governors and other political office holders”. Maybe there’s more in the paper, but if you want to do a better job (“it’s our fault,” isn’t it?) at explaining what a forking path is, you’d better add a sentence or two about what you are thinking. I see one fork; I don’t see what’s wrong with the evidence they showed us. Not super interesting, but worthy of a second-rank polisci journal.

    • Andrew says:

      Peter:

      Here are just a few of the forking paths, just from the single paragraph quoted above:
      – choice of which office-holders to study
      – choice of restricting the sample to later years (rather than, for example, to early years, middle years, years of economic recession, years of economic expansion, years with war, years with peace, etc.)
      – decision to look at exposure to stress, and how to define stress
      – decision to look at margin of victory (which is a bit contrary to the decision to use a regression discontinuity analysis) and not, for example, partisan control of the state legislature and of the national government (both of which are very related to what a governor can do)
      – functional form used in the regression
      etc etc etc.

      Also the direction of the effect is itself a forking path, given that they say the earlier literature found an opposite effect. All this is consistent with the sorts of patterns you can find in random numbers, as has been demonstrated in the classic papers by Simmons et al. and Nosek et al., and as was demonstrated so eloquently and inadvertently by Bem in his celebrated paper.

      • jack pq says:

        I don’t think it’s so bad. The regression discontinuity design (not just plain regression) is clever and allows for identification of the effect we’re interested in, while robustness checks should take care of the forking paths you describe. I assume the authors will add more rob checks during the process of revisions (it’s still only a working paper isn’t it?).
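For readers unfamiliar with the design, here is a minimal sharp-RD sketch on simulated election data (not the paper’s specification: the true effect, the bandwidth, and the linear functional form are all hand-picked here for illustration, whereas a real analysis would use data-driven choices such as the CCT bandwidth mentioned below).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated close elections: the running variable is the candidate's vote
# margin (negative = lost), with an assumed true jump of +1 year at zero.
n = 20_000
margin = rng.uniform(-0.2, 0.2, size=n)
won = margin > 0
lifespan = 70 + 5 * margin + 1.0 * won + rng.normal(0, 5, size=n)

# Sharp RD: local linear fits on each side of the cutoff within a
# hand-picked bandwidth.
bw = 0.05
left = (margin < 0) & (margin > -bw)
right = (margin >= 0) & (margin < bw)

# np.polyfit returns [slope, intercept]; the intercepts are the fitted
# values at margin = 0, so their difference is the RD estimate.
b_left = np.polyfit(margin[left], lifespan[left], 1)
b_right = np.polyfit(margin[right], lifespan[right], 1)
rd_estimate = b_right[1] - b_left[1]

print(f"RD estimate at the cutoff: {rd_estimate:+.2f} years (true effect: +1)")
```

Note how much data the design throws away: only observations inside the bandwidth contribute, which is what “greedy with data” means in the comment above.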

  10. Mark Borgschulte says:

    Thanks for the feedback, Andrew.

    I believe we have a clear hypothesis: stress kills powerful people, with presidents as the leading example. If this is true and generalizes to any other groups, we hypothesize that governors, and other politicians, will experience it. We set out to test for an effect of election, and then look for a specific role for stress. The finding of a longer life in the pooled sample (with similar coefficients in the subsamples) is the sort of thing that happens when you go to the data.

    We were specifically motivated by concerns about forking paths when we wrote up the paper. We tried to stick closely to the original hypothesis, and in particular, follow through on the test for a role of stress despite the unexpected main effect. We also tried to report results for the most complete possible sample, even though it was quite costly to search for election losers in the 19th Century, and the positive effect of election is stronger in the 20th Century. I agree that the stress-SES gradient literature is messy, but we would have been remiss not to cite it. We did not frame the results as a test or validation of this theory.

    On the details of the analysis, there really aren’t that many choices with the RD design. We follow the latest recommendations of the CCT team on bandwidth and polynomial order, and run the usual density and covariate balance tests (in the Appendix of the current draft). We can expand on robustness checks in the revision. Also, the journal to which we’ve submitted has a data posting policy.

    • Andrew says:

      Mark:

      Thanks for the reply. I think it makes the most sense to consider your article as an attempted replication of the hypothesis that winning elections reduces lifespan, and in that sense we can focus on the following sentence of your abstract, “We find no evidence of a role for stress in explaining differences in life expectancy,” which indeed sounds like you’re saying there’s no evidence of any effect. One could also interpret the statement, “US governors and other political office holders live over one year longer than losers of close elections,” as essentially no effect, given that one year is so small and the circumstances of any study of this will be such that effects of one year of lifetime will be essentially undetectable.

      The rest of the abstract, though, looks a little iffy to me. There’s nothing wrong with performing these analyses, but it’s no surprise that if you look at enough comparisons you’ll find some that happen to be statistically significant in your data.

      Also I’m skeptical of these sentences from your conclusion: “One possibility is that losing exerts a negative effect on survival, if narrow losers experience profound disappointment or anxiety from the experience. It is also possible that health losses to winning candidates are offset by increases in prestige or income associated with service. . . . our findings suggest that prestige and other benefits of promotion to high offices more than offset any health costs, even in the most stressful times.” That all seems a bit strong to me: there are so many causes of death and so many effects of being elected to political office that I’d expect any effects on lifespan to be positive for some people, negative for others, and negligible for lots of others, so the whole idea of talking about aggregate effects doesn’t make a lot of sense to me here. I mean, sure, mathematically you can define an aggregate effect but I don’t think this is going to be a helpful way of learning about the effect of stress on health.

      I do see your point that there’s a literature out there, and in that sense your paper could represent a relevant null finding. It’s not that there’s anything wrong with the statistics, it’s just that I don’t think these data are going to allow much to be said beyond that null result.
