Greg Won writes:
I manage a team tasked with, among other things, analyzing data on Air Traffic operations to identify factors that may be associated with elevated risk. I think it’s fair to characterize our work as “data mining” (e.g., using rule induction, Bayesian, and statistical methods).
One of my colleagues sent me a link to your recent article “Too Good to Be True” (Slate, July 24). Obviously, as my friend has pointed out, your article raises questions about the validity of what I’m doing.
A few thoughts/questions:
(1) I agree with your overall point, but I’m having trouble understanding the specific complaint with the “red/pink” study. In their case, if I’m understanding the authors’ rebuttal, they were not asking “what color is associated with fertility” and then mining the data to find a color…any color…which seemed to have a statistical association. They started by asking “is red/pink associated with fertility”, no? In which case, I think the point they’re making seems fair?
(2) But, your argument definitely applies to the kind of work I’m doing. In my case, I’m asking an open-ended question: “Are there any relationships?” Well, of course, you would say, the odds are that you must find relationships…even if they are not really there.
(3) So let’s take a couple of examples. There are 1,000s of economists building models to explain some economic phenomenon. All of these models are based on the same underlying data: the U.S. Income and Product Accounts. There are then 10,000s of models built, only a handful of which are publication-worthy. So, by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?
(4) Another example: one of the things that we have uncovered is that, in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?
(5) A caveat: In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage (because we don’t have the resources to address every conceivable hazard in the airspace system). Perhaps fortunately, most of the people I deal with (primarily pilots and air traffic controllers) don’t understand statistics. So, the safety case we build must be based on more than just a mechanical analysis of the data.
My reply:
(1) Whether or not the authors of the study were “mining the data,” I think their analysis was contingent on the data. They had many data-analytic choices, including rules for which cases to include or exclude and which comparisons to make, as well as what colors to study. Their protocol and analysis were not pre-registered. The point is that, even though they did an analysis that was consistent with their general research hypothesis, there are many degrees of freedom in the specifics, and these specifics can well be chosen in light of the data.
This topic is really worth an article of its own . . . and, indeed, Eric Loken and I have written that article! So, instead of replying in detail in this post, I’ll point you toward The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
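To give a rough feel for why data-analytic degrees of freedom matter, here is a minimal simulation sketch. It covers only the simplest case: an analyst has several comparisons available on pure-noise data and would report whichever one crossed the usual threshold. The forking-paths argument is subtler, since only one analysis may actually be run, but the arithmetic of the inflated error rate is similar. The function names, sample sizes, number of comparisons, and the assumption that the comparisons are independent are all illustrative choices, not anything taken from the red/pink study.

```python
import random


def z_stat(a, b):
    """Large-sample z statistic for the difference in means of two groups."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (mean_a - mean_b) / se


def share_with_a_significant_result(n_comparisons, n_per_group=50,
                                    n_sims=2000, seed=1):
    """Fraction of simulated studies in which at least one of the available
    comparisons has |z| > 1.96, even though no true difference exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        for _ in range(n_comparisons):
            # Both groups come from the same distribution: the null is true.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if abs(z_stat(a, b)) > 1.96:
                hits += 1
                break  # one "finding" is enough to write up
    return hits / n_sims


one = share_with_a_significant_result(1)   # a single pre-specified comparison
five = share_with_a_significant_result(5)  # five comparisons available
print(f"one comparison: {one:.3f}, five comparisons: {five:.3f}")
```

With one pre-specified comparison the rate should land near the nominal 5%. With five independent comparisons available, the theoretical rate is 1 - 0.95^5, or roughly 23%, and the simulation should come out close to that.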
(2) You write, “the odds are that you must find relationships . . . even if they are not really there.” I think the relationships are there but that they are typically small, and they exist in the context of high levels of variation. So the issue isn’t so much that you’re finding things that aren’t there, but rather that, if you’re not careful, you’ll think you’re finding large and consistent effects, when what’s really there are small effects of varying direction.
(3) You ask, “by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?” My response: No, I don’t think that framing statistical statements as “true” or “false” is the most helpful way to look at things. I think it’s fine for lots of people to analyze the same dataset. And, for that matter, I think it’s fine for people to use various different statistical methods. But methods have assumptions attached to them. If you’re using a Bayesian approach, it’s only fair to criticize your methods if the probability distributions don’t seem to make sense. And if you’re using p-values, then you need to consider the reference distribution over which the long-run averaging is taking place.
(4) You write: “in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?” My response is, first, I’d like to see all the comparisons that you might be making with these data. If you found one interesting pattern, there might well be others, and I wouldn’t want you to limit your conclusions to just whatever happened to be statistically significant. Second, your finding seems plausible to me but I’d guess that the long-run difference will probably be lower than what you found in your initial estimate, as there is typically a selection process by which larger differences are more likely to be noticed.
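The selection process mentioned in (4) can be illustrated with a small simulation. This is a sketch under made-up assumptions: the true effect (0.5) and standard error (1.0) are hypothetical numbers chosen to represent a small effect measured noisily, and they have nothing to do with the runway incursion data. The function name `significance_filter` is likewise just a label for this example. It draws many noisy estimates of the same true effect, keeps only the ones that reach statistical significance, and compares those to the truth.

```python
import random


def significance_filter(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate estimates ~ Normal(true_effect, std_error) and summarize
    only those that are statistically significant at the 5% level."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)

    # Share of all simulated estimates that reach significance.
    power = len(significant) / n_sims
    # Average magnitude of a significant estimate relative to the truth.
    exaggeration = (sum(abs(e) for e in significant) / len(significant)
                    / abs(true_effect))
    # Share of significant estimates that point in the wrong direction.
    wrong_sign = (sum(1 for e in significant if e * true_effect < 0)
                  / len(significant))
    return power, exaggeration, wrong_sign


power, exaggeration, wrong_sign = significance_filter(true_effect=0.5,
                                                      std_error=1.0)
print(f"power: {power:.3f}")
print(f"exaggeration ratio: {exaggeration:.1f}")
print(f"wrong-sign share: {wrong_sign:.3f}")
```

Under these assumptions every significant estimate is at least 1.96 in magnitude, so the estimates that get noticed overstate the true effect of 0.5 by a factor of about four or more, and some of them have the wrong sign. The same logic is why an initial estimate that was noticed because it was large should be expected to shrink as more data come in.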
(5) Your triage makes some sense. Also let me emphasize that it’s not generally appropriate to wait on statistical significance before making decisions.