Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won’t bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):
– Obesity is contagious
– People’s names affect where they live, what jobs they take, etc.
– Beautiful people are more likely to have girl babies
– More attractive instructors have higher teaching evaluations
– In a basketball game, it’s better to be behind by a point at halftime than to be ahead by a point
– Praying for someone without their knowledge improves their recovery from heart attacks
– A variety of claims about ESP
How should we think about these claims? The usual approach is to evaluate the statistical evidence–in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.
The other part of the story is the prior. The less plausible the claim, the more carefully I’m inclined to check the analysis.
But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant or non-significant but as an effect size, and then to look at the implications of the estimated effect.
I’ll explore this in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).
Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.
The effect of rain on July 4th
My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:
Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .
Here was John’s reaction:
In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.
When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn’t the average 8-year-old care more about fireworks than patriotism?
But the paper does a lot of spadework and, ultimately, I was left thinking “Huh, maybe this is true.” I’m still not certain, but it was worth a blog post.
My reaction is similar to John’s but a bit more on the skeptical side.
Let’s start with effect size. One July 4th without rain increases the probability of a Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.
[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then, starting from an even split, the effect on the Republican vote share R/(D+R) is 1.24/2.24 – 1/2, or approximately 5 percentage points. So the estimate is much less extreme than I’d thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
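To make the arithmetic in that note concrete, here is a minimal sketch of the conversion from a relative increase in Republican votes to a change in the two-party share. The even starting split is my assumption for illustration, and the function name `share_shift` is made up for this example; the 4%, 6, and 12 are the figures quoted above.

```python
def share_shift(relative_increase, rep=1.0, dem=1.0):
    """Change in the Republican two-party share R/(D+R) when R grows by a relative amount."""
    before = rep / (rep + dem)
    rep_after = rep * (1.0 + relative_increase)
    after = rep_after / (rep_after + dem)
    return after - before

# Figures quoted in the post: 4% per rain-free July 4th,
# and 6 to 12 rain-free July 4ths for most respondents.
per_day = 0.04
extra_days = 12 - 6
relative_increase = per_day * extra_days

# Assumption for illustration: an even two-party split before the shift.
shift = share_shift(relative_increase)
print(f"relative increase in Republican votes: {relative_increase:.2f}")
print(f"change in two-party share: {shift:.3f}")
```

A different baseline split would give a somewhat different number, which is part of why the choice of summary scale matters.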
Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather in a few other days in July.) I could imagine an effect–but 24 percent of the vote? The number seems too high–especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because the authors don’t actually know the counties where the survey respondents lived as children. It’s hard enough to believe an effect size of 24%, but it’s really hard to believe 24% as an underestimate.
So what could’ve gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I’d like to see them automate their analysis and loop it thru all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I’m not saying they should include all 365 in a single regression–that would be a mess. Rather, I’m suggesting the simpler option of 365 analyses, each for a single date.)
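Here is a rough sketch of what that 365-analysis loop could look like. Everything in it is a stand-in: the data are synthetic with no true effect of any date, the regression is a bare one-predictor least-squares slope (no county fixed effects or state trends, unlike the paper), and names such as `simulate_dry_counts` and `ols_slope` are made up for this example.

```python
import random

random.seed(0)
N_PEOPLE, N_DAYS, CHILDHOOD_YEARS = 300, 365, 18
JULY_4 = 184  # zero-based day-of-year index in a non-leap year

def ols_slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Synthetic outcome: 1.0 if the respondent votes Republican, unrelated to weather.
votes_rep = [1.0 if random.random() < 0.5 else 0.0 for _ in range(N_PEOPLE)]

def simulate_dry_counts(day):
    """Synthetic count of rain-free childhood occurrences of one calendar date.

    With real data this would look up the weather records for `day`;
    here the draws ignore `day` because the data are made up.
    """
    return [
        sum(random.random() < 0.6 for _ in range(CHILDHOOD_YEARS))
        for _ in range(N_PEOPLE)
    ]

# One separate analysis per calendar date, not one regression with 365 predictors.
coefs = [ols_slope(simulate_dry_counts(day), votes_rep) for day in range(N_DAYS)]

july4 = coefs[JULY_4]
share_as_extreme = sum(abs(c) >= abs(july4) for c in coefs) / N_DAYS
print(f"July 4th coefficient: {july4:.4f}")
print(f"share of dates with a coefficient at least as large: {share_as_extreme:.3f}")
```

With the real data, the list `coefs` is what I would want to see graphed, with the July 4th value marked so we can see whether it stands out from the other 364.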
Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey respondents given the July 4th weather when they were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.
Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including “county fixed effects”–that is, allowing the intercept to vary by county. That’s ok but their data span a 70-year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I’m still a little concerned about systematic differences not captured in these trends.
No study is perfect, and I’m not saying these are devastating criticisms. I’m just trying to work through my thoughts here.
The effects of names on life choices
For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.
My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.
This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.
That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!
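A toy version of that Bayes calculation, as a sketch under assumptions I am making up for illustration: an estimate with only random error follows a standard normal, an estimate contaminated by systematic error follows a standard Cauchy (a convenient long-tailed choice), and the prior probability of contamination is 10%. None of these numbers comes from either study.

```python
import math

def normal_pdf(z):
    """Density of the standard normal: the assumed random-error model."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cauchy_pdf(z):
    """Density of the standard Cauchy: the assumed long-tailed systematic-error model."""
    return 1.0 / (math.pi * (1.0 + z * z))

def prob_artifact(z, prior_artifact=0.10):
    """Posterior probability that an estimate of size z reflects systematic error."""
    artifact = prior_artifact * cauchy_pdf(z)
    honest = (1.0 - prior_artifact) * normal_pdf(z)
    return artifact / (artifact + honest)

# Estimate sizes in units of the random-error standard deviation.
sizes = [1, 2, 3, 4, 6]
posteriors = [prob_artifact(z) for z in sizes]
for z, p in zip(sizes, posteriors):
    print(f"estimate = {z} sd -> Pr(artifact) = {p:.3f}")
```

Under these assumptions the probability of an artifact climbs steadily as the estimate grows past one standard deviation, which is the sense in which the biggest, most interesting estimates are also the most suspect.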
Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:
We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.
I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I’d argue than both my “100% increase is too big”, and your “pr(marry smith) up from 1% to 2% is ok.”
No easy answers. But I think that understanding effect sizes on a real scale is a start.