I was reading Jenny Davidson’s blog and came upon this note on an autobiography of the eccentric (but aren’t we all?) biologist Robert Trivers. This motivated me, not to read Trivers’s book, but to do some googling which led me to this paper from PLOS ONE, “Revisiting a sample of U.S. billionaires: How sample selection and timing of maternal condition influence findings on the Trivers-Willard effect.”
This paper is really bad. It has a bunch of fatal statistical errors.
The paper is not on a particularly important topic, it seems to have had little or no scientific influence or media coverage, and it was published in a non-prestigious journal.
So this post is not about casting doubt on some Ted talk or whatever.
Rather, consider this as a case study in statistical errors. For this purpose, perhaps it’s a good thing that the paper in question is obscure. Statistical errors occur all over the place—indeed it is reasonable to suppose they are more common in obscure work.
It just happens that this particular paper is on a topic with which I’m already familiar, so it’s particularly easy for me to spot the errors. When you read a bad paper on a familiar topic, the errors just pop right out. It’s as if you were wearing 3-D glasses.
Here’s the abstract:
Based on evolutionary theory, Trivers & Willard (TW) predicted the existence of mechanisms that lead parents with high levels of resources to bias offspring sex composition to favor sons and parents with low levels of resources to favor daughters. This hypothesis has been tested in samples of wealthy individuals but with mixed results. Here, I argue that both sample selection due to a high number of missing cases and a lacking specification of the timing of wealth accumulation contribute to this equivocal pattern. This study improves on both issues: First, analyses are based on a data set of U.S. billionaires with near-complete information on the sex of offspring. Second, subgroups of billionaires are distinguished according to the timing when they acquired their wealth. Informed by recent insights on the timing of a potential TW effect in animal studies, I state two hypotheses. First, billionaires have a higher share of male offspring than the general population. Second, this effect is larger for heirs and heiresses who are wealthy at the time of conception of all of their children than for self-made billionaires who acquired their wealth during their adult lives, that is, after some or all of their children have already been conceived. Results do not support the first hypothesis for all subgroups of billionaires. But for males, results are weakly consistent with the second hypothesis: Heirs but not self-made billionaires have a higher share of male offspring than the U.S. population. Heiresses, on the other hand, have a much lower share of male offspring than the U.S. average. This hints to a possible interplay of at least two mechanisms affecting sex composition. Implications for future research that would allow disentangling the distinct mechanisms are discussed.
Set aside the theoretical problems with this work, as I’ll just be talking about the statistics.
Dead on arrival
The biggest error in the paper, the error that makes the whole thing worthless, is that the noise is so much larger than the signal.
Let’s do the math. N=1165 children in the study. The comparison with the least uncertainty would be simply to take the raw proportion of girls in this sample and compare to the known proportion in the general population. The standard error is simply .5/sqrt(1165) = .015, or 1.5 percentage points.
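Here is that arithmetic as a minimal Python sketch. The variable names are mine, and the calculation assumes a proportion of girls near 0.5, which is the worst case for the standard error of a proportion.

```python
import math

n_children = 1165  # sample size quoted in the post
p = 0.5            # assumed proportion of girls; p*(1-p) is largest at 0.5

# Standard error of a sample proportion: sqrt(p*(1-p)/n),
# which reduces to 0.5/sqrt(n) when p = 0.5.
se_proportion = math.sqrt(p * (1 - p) / n_children)

print(f"standard error = {se_proportion:.3f}")
print(f"in percentage points = {100 * se_proportion:.1f}")
```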
Now, effect sizes. The difference in the proportion of girl births, comparing billionaires to the general population, has to be much much smaller than this. Compare billionaires to other white people, and it will be even smaller. It’s really hard to imagine any “billionaire difference” being anywhere near the difference in the proportion of girl births between white and black Americans, which is around .005.
Suppose the true effect size is .002. (I actually think it’s less.) Even if it’s as large as .002, if the standard error is 0.015, that’s basically impossible to detect. We’re in kangaroo territory.
Let’s do the design analysis:
> retrodesign(.002, .015)
That’s right, if the true effect size is .002, this study has a power of 5.2% (that is, a 5.2% chance of getting a statistically significant p-value), a type S error rate of 35% (that is, a 35% chance that an estimate, if statistically significant, would be in the wrong direction), and an exaggeration factor of 17 (that is, an estimate, if statistically significant, would be on average 17 times larger than the true effect).
Or what if you wanted to make the bold, bold claim that billionaires differ from the general population in their sex ratio by the same amount by which whites differ from blacks? Run the program, and you still get a power of only 6%, a type S error rate of 17%, and an exaggeration factor of 7.
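The retrodesign() function itself isn't shown in this post, so here is a rough Python analogue for readers who want to check the numbers. It is a sketch under assumptions: the estimate is treated as normally distributed around a positive true effect, and the exaggeration factor is computed in closed form from truncated-normal means rather than by simulation, so it can differ a little in rounding from simulated values. The Python function name just mirrors the call above.

```python
from statistics import NormalDist

def retrodesign(true_effect, se, alpha=0.05):
    """Power, type S error rate, and exaggeration factor.

    Assumes estimate ~ Normal(true_effect, se) with true_effect > 0.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
    lam = true_effect / se                  # true effect in standard-error units

    p_hi = 1 - std_normal.cdf(z - lam)      # significant, correct sign
    p_lo = std_normal.cdf(-z - lam)         # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power

    # Mean of |estimate|/se among significant results (truncated-normal means)
    mean_abs_sig = (
        lam * (p_hi - p_lo) + std_normal.pdf(z - lam) + std_normal.pdf(z + lam)
    ) / power
    exaggeration = mean_abs_sig / lam
    return power, type_s, exaggeration

# The two scenarios discussed above: a true effect of .002, then .005
for effect in (0.002, 0.005):
    power, type_s, exaggeration = retrodesign(effect, 0.015)
    print(f"effect={effect}: power={power:.3f}, "
          f"type S={type_s:.2f}, exaggeration={exaggeration:.1f}")
```

The printed values should land close to the figures quoted above, and none of the three needs any data: they depend only on the ratio of the assumed effect to the standard error.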
In short, such a study is hopeless no matter what. It’s dead on arrival. It’s a wild throw of the dice to even attain statistical significance, but it’s worse than that, as any statistically significant estimate would be essentially noise.
Deader on arrival
But what about the other analyses in the paper, for example the comparisons between subgroups of billionaires? For these comparisons, the statistics are even worse!
Let’s consider a best-case scenario, comparing two groups that are (essentially) equal-sized: 582 babies in one group, 583 in the other. The difference in the proportion of girls between these groups will have standard error sqrt(.5^2/582 + .5^2/583) = .029, which is twice the standard error from above. (That’s the general pattern, that comparisons or interactions have twice the standard error of averages or main effects: you get a factor of sqrt(2) from the halving of the within-group sample size and another factor of sqrt(2) from the differencing.)
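The factor of two is easy to check numerically. A small Python sketch, again assuming a proportion of 0.5 in both groups and using the hypothetical even split described above:

```python
import math

n_a, n_b = 582, 583   # hypothetical near-equal split of the 1165 children
p = 0.5               # assumed proportion of girls in each group

# Standard error of a difference between two independent proportions
se_diff = math.sqrt(p * (1 - p) / n_a + p * (1 - p) / n_b)

# Standard error of the single pooled proportion, for comparison
se_single = math.sqrt(p * (1 - p) / (n_a + n_b))

ratio = se_diff / se_single
print(f"se of difference = {se_diff:.3f}")
print(f"se of pooled proportion = {se_single:.3f}")
print(f"ratio = {ratio:.2f}")
```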
So this aspect of the study is even more useless. Again, let’s consider a hypothetical effect size of .002:
> retrodesign(.002, .029)
Power of 5%, type S error rate of 42%, exaggeration factor of 33. You can’t get much noisier than that.
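To get a feel for what these numbers mean, you can also simulate the study many times. The sketch below is my own illustration, not the code behind the call above: it draws hypothetical estimates from a normal distribution centered at the assumed true effect, keeps the ones that reach statistical significance, and summarizes them. The function name, seed, and number of simulations are arbitrary choices.

```python
import random
from statistics import NormalDist

def simulate_design(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Monte Carlo estimate of power, type S rate, and exaggeration factor."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_sig = 0            # replications reaching statistical significance
    n_wrong_sign = 0     # significant replications with the wrong sign
    sum_abs_sig = 0.0    # running total of |estimate| among significant ones

    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)   # one hypothetical study
        if abs(estimate) > z * se:
            n_sig += 1
            sum_abs_sig += abs(estimate)
            if estimate * true_effect < 0:
                n_wrong_sign += 1

    power = n_sig / n_sims
    type_s = n_wrong_sign / n_sig
    exaggeration = (sum_abs_sig / n_sig) / abs(true_effect)
    return power, type_s, exaggeration

power, type_s, exaggeration = simulate_design(0.002, 0.029)
print(f"power={power:.3f}, type S={type_s:.2f}, exaggeration={exaggeration:.0f}")
```

Because this is a simulation, the output moves a little with the seed and the number of draws, but it should stay in the neighborhood of the figures quoted above.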
Researcher degrees of freedom
The paper has other errors, of course. It almost has to, given that statistical significance was found under such inauspicious conditions.
The most obvious problem is multiple comparisons: the researcher has many degrees of freedom in deciding what to look at, hence he can keep looking and looking until he finds something statistically significant. In the paper at hand, we see:
– Billionaires compared to the general population,
– Heirs compared to self-made billionaires,
– Comparison just of male billionaires,
– Comparison of heiresses to the general population,
– Comparison of heiresses to self-made billionaires,
– Comparison of heiresses to heirs.
The author does a multiple comparisons correction and finds no significance, which is kinda funny because then he reports the differences as if they reflect real patterns in the population.
In any case, the multiple comparisons correction understates the problem because (a) there are lots of other comparisons floating around in the data that the researcher could’ve noticed and surely would’ve reported had they been notable, and (b) there are a bunch more researcher degrees of freedom in the data-exclusion and data-classification rules (for example, the division of heiresses into those who inherited from parents and those who inherited from spouses).
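As a back-of-the-envelope illustration of why the number of comparisons matters, here is the chance of at least one statistically significant result when every comparison is pure noise. This is a simplified sketch that assumes the tests are independent, which the six overlapping comparisons listed above are not, so treat the output as a rough guide rather than the paper's actual error rate. The count of 20 is a made-up stand-in for "lots of other comparisons."

```python
def prob_any_significant(n_comparisons, alpha=0.05):
    """Chance of at least one p < alpha among independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_comparisons

p_six = prob_any_significant(6)      # the six comparisons listed above
p_twenty = prob_any_significant(20)  # hypothetical larger set of comparisons

print(f"6 comparisons:  {p_six:.2f}")
print(f"20 comparisons: {p_twenty:.2f}")
```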
Again, given the variation and sample size in the context of possible effect sizes, the study had no chance of succeeding in any case, so I don’t don’t don’t recommend anyone try a preregistered replication. The point of the above discussion of forking paths and degrees of freedom is just to explain how the researcher could’ve found statistical significance out of what is essentially pure noise.
Interpretation of results
Finally, the paper at hand also demonstrates several standard mistakes associated with p-values:
– The use of one-sided tests in a context where departures in either direction would be notable,
– The reporting of a p-value near .05 as “almost statistical significance,”
– A “robustness check” that is almost identical to the original analysis (in this case, a logistic regression instead of a comparison of proportions),
– Selected non-significant differences interpreted based on their signs as being “consistent with the stated hypothesis,”
– An observed proportion being reported as “considerably lower than that of the general population,” without noting that this difference is entirely explainable by chance,
– A non-significant difference being taken as evidence of the null hypothesis (“Given that this difference is not statistically significant, it speaks against the first hypothesis that billionaires have a higher percentage of male offspring than the general population.”),
– And, of course, comparisons between significance and non-significance.
Followed by tons of storytelling. It’s tea-leaf-reading without the tea.
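The last item on that list, comparing significance to non-significance, is worth a quick numerical illustration. The numbers below are invented for the example and have nothing to do with the billionaire data: one estimate is 25 with standard error 10, another is 10 with standard error 10, and the two are assumed independent.

```python
import math
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for a normally distributed estimate."""
    z = abs(estimate) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical estimates and standard errors for two subgroups
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

# Difference between the two, assuming the estimates are independent
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)
p_diff = two_sided_p(diff, se_diff)

print(f"group A: p = {p_a:.3f}")
print(f"group B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

One estimate clears the .05 threshold and the other doesn't, yet the difference between them is nowhere near statistically significant.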
I have no desire to pick on this particular researcher—that’s why I have not mentioned his name in this post. The name is no secret (you can find it by just clicking on the link above that has the research article), but I want to focus on the very very common statistical errors rather than on which faceless scientist happened to be making them that day.
These are real errors, and they’re avoidable errors. But you’ll make them too, over and over again, if you do statistics using the grab-some-data-and-look-for-statistical-significance approach.
The point of this post is not to pile on and criticize an obscure paper in an obscure journal by an author we’ve never heard of. The point is to help you and your colleagues avoid these same errors in your own work, errors you might well make in higher-stakes situations where you’re under pressure to find results and where you might not see the forest for the trees.
The paper discussed above is almost a laboratory setting of statistical misunderstanding, where a researcher was able to use standard statistical tools to wrap himself in a web of confusion. Again, it’s nothing personal—statistics is hard, and I’m sorry to say that we in the statistical profession often sell our methods as a way of distilling certainty from noise.
The author of this paper inadvertently made a whole bunch of errors all in one place. As discussed above, it is no coincidence that these errors occurred together. When you start with hopelessly noisy data and you add to this the practical necessity to obtain statistical significance, all hell will break loose. It’s kinda sad to have to admit that the dataset you spent so many months painfully constructing does not have enough information to answer any of your research questions, but that’s how it goes sometimes. Just too bad nobody told this guy about these issues before he started his study.
So remember these statistical errors here, in this clean setting, and watch out for them in your world.