How do we evaluate a new and wacky claim?

Posted on July 11, 2011 9:05 AM by Andrew

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won’t bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

– Obesity is contagious
– People’s names affect where they live, what jobs they take, etc.
– Beautiful people are more likely to have girl babies
– More attractive instructors have higher teaching evaluations
– In a basketball game, it’s better to be behind by a point at halftime than to be ahead by a point
– Praying for someone without their knowledge improves their recovery from heart attacks
– A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence–in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I’m inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant/non-significant but as an effect size, and then to look at the implications of the estimated effect.

I’ll explore this in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John’s reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn’t the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking “Huh, maybe this is true.” I’m still not certain, but it was worth a blog post.

My reaction is similar to John’s but a bit more on the skeptical side.

Let’s start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lin. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then the effect on the vote share R/(D+R) is 1.24/2.24 – 1/2 or approximately 6%. So the estimate is much less extreme than I’d thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
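Here is a quick check of the arithmetic in that note, as a minimal sketch. It assumes an even two-party split to start with (R = D = 1) and that the full 24% relative increase lands on the Republican vote while the Democratic vote stays fixed; the function name is mine, not anything from the paper. Under those assumptions the shift in R/(D+R) comes to about 5.4 percentage points, close to the roughly 6% quoted above.

```python
def two_party_share_shift(relative_increase, base_r=1.0, base_d=1.0):
    """Change in R/(D+R) when the Republican vote grows by
    `relative_increase` and the Democratic vote is unchanged.

    base_r and base_d are illustrative starting vote totals (assumed equal).
    """
    before = base_r / (base_r + base_d)
    new_r = base_r * (1.0 + relative_increase)
    after = new_r / (new_r + base_d)
    return after - before


# Low end to high end of the range: 6 extra rain-free July 4ths at 4% each.
relative_increase = 6 * 0.04
shift = two_party_share_shift(relative_increase)
print(f"relative increase in R vote: {100 * relative_increase:.0f}%")
print(f"shift in two-party share:    {100 * shift:.1f} percentage points")
```

The point of writing it out is only that "24%" and "24 percentage points" are very different quantities once you put them on the two-party scale.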

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather on a few other days in July.) I could imagine an effect–but 24 percent of the vote? The number seems too high–especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because the authors don’t actually know the counties where the survey respondents lived as children. It’s hard enough to believe an effect size of 24%, but it’s really hard to believe that 24% is an underestimate.

So what could’ve gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I’d like to see them automate their analysis and loop it thru all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I’m not saying they should include all 365 in a single regression–that would be a mess. Rather, I’m suggesting the simpler option of 365 analyses, each for a single date.)
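To make the suggestion concrete, here is a rough sketch of the 365-analyses idea. Everything in it is made up for illustration: the data are simulated with no true effect of any date, the regression is a bare one-predictor least-squares fit rather than the authors' specification, and names such as rain_free and votes_rep are hypothetical. The only thing it is meant to show is the shape of the check: fit the same model once per calendar date, then see where July 4th falls among the 365 coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people, n_days, n_years = 2000, 365, 18

# rain_free[i, d]: number of childhood years (out of 18) in which calendar
# day d was rain-free for respondent i. Simulated, not real weather data.
rain_free = rng.binomial(n_years, 0.7, size=(n_people, n_days))

# Simulated outcome (1 = voted Republican) with NO true effect of any date.
votes_rep = rng.binomial(1, 0.25, size=n_people).astype(float)


def slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


# One separate analysis per calendar date.
coefs = np.array([slope(rain_free[:, d], votes_rep) for d in range(n_days)])

july4 = 184  # zero-based day of year for July 4 in a non-leap year
rank = int((np.abs(coefs) >= abs(coefs[july4])).sum())

print(f"July 4 coefficient: {coefs[july4]:+.4f}")
print(f"{rank} of {n_days} dates have a coefficient at least this large in magnitude")
```

With real data, a histogram of coefs with the July 4th value marked would be the graph I have in mind. If July 4th sits in the middle of the pack, the finding looks like noise; if it stands well apart from the other 364 dates, that is more convincing than a comparison with four neighboring days.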

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey respondents given the July 4th weather when they were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including “county fixed effects”–that is, allowing the intercept to vary by county. That’s ok but their data span a 70 year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I’m still a little concerned about systematic differences not captured in these trends.
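For readers who have not seen county fixed effects in action, here is a small simulated sketch of what they buy you. It assumes a made-up world in which drier counties also lean more Republican for unrelated reasons, and the true within-county effect of a rain-free July 4th is 1 percentage point; all the numbers and variable names are invented for illustration. Subtracting county means (equivalent to letting the intercept vary by county) removes the between-county confounding that a pooled regression picks up.

```python
import numpy as np

rng = np.random.default_rng(1)

n_counties, n_per_county = 200, 50
county = np.repeat(np.arange(n_counties), n_per_county)

# Assumed confounder: a county-level "dryness" that raises both the number
# of rain-free July 4ths and the baseline Republican vote.
dryness = rng.normal(size=n_counties)
x = 9 + dryness[county] + rng.normal(size=county.size)  # rain-free July 4ths
true_effect = 0.01
y = (0.25 + 0.05 * dryness[county] + true_effect * x
     + rng.normal(scale=0.1, size=county.size))


def slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def demean_by_group(v, g, n_groups):
    """Subtract each group's mean from its members (the 'within' transform)."""
    sums = np.bincount(g, weights=v, minlength=n_groups)
    counts = np.bincount(g, minlength=n_groups)
    return v - (sums / counts)[g]


pooled = slope(x, y)
within = slope(demean_by_group(x, county, n_counties),
               demean_by_group(y, county, n_counties))

print(f"true effect:            {true_effect:.3f}")
print(f"pooled estimate:        {pooled:.3f}")
print(f"county fixed effects:   {within:.3f}")
```

Note what the sketch does not cover: the simulated county differences are constant over time. If a county's politics drift over 70 years in a way that happens to line up with its weather, a single intercept per county will not absorb it, which is the concern raised above.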

No study is perfect, and I’m not saying these are devastating criticisms. I’m just trying to work through my thoughts here.

The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!
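A toy version of that Bayes calculation, under assumptions I am choosing purely for illustration: a reported estimate is either "real" (drawn from a normal distribution with standard deviation 1, in arbitrary units) or an artifact of some systematic error (drawn from a Cauchy distribution with scale 2, which has much longer tails), with a prior probability of 0.2 that any given estimate is an artifact. None of these numbers comes from data. The posterior probability of an artifact then follows from Bayes' rule.

```python
import math


def normal_pdf(x, sd):
    """Density of a normal(0, sd) distribution at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def cauchy_pdf(x, scale):
    """Density of a Cauchy(0, scale) distribution at x (long tails)."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))


def prob_artifact(estimate, prior_artifact=0.2, sd_real=1.0, scale_artifact=2.0):
    """Posterior probability that an estimate is a systematic-error artifact.

    All default values are illustrative assumptions, not fitted to anything.
    """
    a = prior_artifact * cauchy_pdf(estimate, scale_artifact)
    r = (1.0 - prior_artifact) * normal_pdf(estimate, sd_real)
    return a / (a + r)


estimates = [0, 1, 2, 3, 4, 6]
posteriors = [prob_artifact(e) for e in estimates]
for e, p in zip(estimates, posteriors):
    print(f"estimate = {e}: Pr(artifact) = {p:.3f}")
```

With these particular settings the probability of an artifact rises steadily with the size of the estimate and approaches 1 for the largest values. The exact numbers depend entirely on the assumed distributions; the qualitative pattern is what matters.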

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I’d argue than both my “100% increase is too big”, and your “pr(marry smith) up from 1% to 2% is ok.”

Summary

No easy answers. But I think that understanding effect sizes on a real scale is a start.

7 thoughts on “How do we evaluate a new and wacky claim?”

MW on July 11, 2011 at 7:52 am said:

Wait a minute…participation is based solely on weather? Because if it's not raining, EVERYONE participates in celebrations, and if it's raining, NOBODY participates in celebrations…? Right…

Anonymous on July 11, 2011 at 2:01 pm said:

Agreed MW. Moreover, I can't help but suspect that this study is really just picking up on a correlation between geographic region (which may have a similar climate) and political ideology.

As an afterthought, this whole study is like intellectual rubber stamp collecting (to steal Nassim Taleb's phrase). Celebrating 4th of July makes you more Republican, so what?

John Mashey on July 11, 2011 at 7:17 pm said:

If people run out of wacky claims, I recommend the Journal of Scientific Exploration. My favorite is the statistics of dog astrology.

GC on July 12, 2011 at 7:22 am said:

I'm confused by Anonymous' post. Wouldn't the inclusion of county fixed effects mean that the estimated effect size is generated by comparing cohortXcounty1 to cohortYcounty1? In that case, we would have to look for bias that would be present between cohorts, not necessarily between counties, right? But maybe I'm missing something here.

Anonymous on July 12, 2011 at 9:04 am said:

GC – you're correct. I have a bad habit of just glancing through the results tables with articles like this. Well played Madestam and Yanagizawa-Drott.

Winston Lin on July 14, 2011 at 7:25 pm said:

Andrew, the July 4 findings might not be quite so wacky:

1) I really like your idea of looking at effect sizes on a real scale, but here it should be 6%, not 24%. The finding that one rain-free July 4 before age 18 "raises the likelihood of … voting for the Republican [presidential] candidate by 4 percent" is based on an estimated coefficient of 1 percentage point (Table 4), relative to a baseline of 25% (Table 1). (Nonvoters are included in the denominators.)

2) The estimated effects on identifying and voting Republican are largest for people born in 1920-1939, and disappear for those born in 1970-1990 (Table 8). According to the paper, "Celebrations in the first half of the 20th century were political events. Local politicians were involved in planning for the occasion, as well as providing financial support to the Fourth of July festivities. They also participated actively in the parades and presented orations during the formal ceremonies. Many used the holiday to campaign or to gain visibility between campaigns by giving political speeches. In the cities, civic groups and political parties organized separate events to further their particular cause."

3) The treatment was at the community level. If you were a kid in the U.S. in the 1920s, the weather on July 4 may have affected the dose of political propaganda that you and all your peers got.

mb on July 15, 2011 at 5:09 am said:

Small issue.

From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

This would be 28% if it was 6 to 12 inclusive.
