
Publish your raw data and your speculations, then let other people do the analysis: track and field edition

There seems to be an expectation in science that the people who gather a dataset should also be the ones who analyze it. But often that doesn’t make sense: what it takes to gather relevant data has little to do with what it takes to perform a reasonable analysis. Indeed, the imperatives of analysis can even impede data-gathering, if people have confused ideas of what they can and can’t do with their data.

I’d like us to move to a world in which gathering and analysis of data are separated, in which researchers can get full credit for putting together a useful dataset, without the expectation that they perform a serious analysis. I think that could get around some research bottlenecks.

It’s my impression that this is already done in many areas of science—for example, there are public datasets on genes, and climate, and astronomy, and all sorts of areas in which many teams of researchers are studying common datasets. And in social science we have the NES, GSS, NLSY, etc. Even silly things like the Electoral Integrity Project—I don’t think these data are so great, but I appreciate the open spirit under which these data are shared.

In many smaller projects, though—including on some topics of general interest—data are collected by people who aren’t well prepared to do a serious analysis. Sometimes the problems come from conflicts of interest (as with disgraced primatologist Marc Hauser or food researcher Brian Wansink, both of whom seem to have succumbed to strong and continuing incentives to find positive results from their own data); other times it’s as simple as the challenge of using real and imperfect data to answer real and complex questions.

The above thoughts were motivated by a communication I received from Simon Franklin, a post-doc in economics at the London School of Economics, who pointed me to this paper by Stéphane Bermon and Pierre-Yves Garnier: “Serum androgen levels and their relation to performance in track and field: mass spectrometry results from 2127 observations in male and female elite athletes.”

From the abstract of the article in question:

Methods 2127 observations of competition best performances and mass spectrometry-measured serum androgen concentrations, obtained during the 2011 and 2013 International Association of Athletics Federations World Championships, were analysed in male and female elite track and field athletes. To test the influence of serum androgen levels on performance, male and female athletes were classified in tertiles according to their free testosterone (fT) concentration and the best competition results achieved in the highest and lowest fT tertiles were then compared.

Results The type of athletic event did not influence fT concentration among elite women, whereas male sprinters showed higher values for fT than male athletes in other events. Men involved in all throwing events showed significantly (p<0.05) lower testosterone and sex hormone binding globulin than men in other events. When compared with the lowest female fT tertile, women with the highest fT tertile performed significantly (p<0.05) better in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault with margins of 2.73%, 2.78%, 1.78%, 4.53%, and 2.94%, respectively. Such a pattern was not found in any of the male athletic events.
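To make the procedure in the abstract concrete, here is a minimal sketch of a tertile comparison: sort athletes by fT, take the lowest and highest thirds, and compare mean performance with a pooled-variance (Student) t statistic. This is an illustration only, not the authors' code. The function name `tertile_comparison`, the tertile-splitting rule, and the synthetic numbers are all assumptions, since the paper's data are not public.

```python
import random
import statistics
from math import sqrt


def tertile_comparison(ft, perf):
    """Compare mean performance in the lowest vs. highest fT tertile.

    ft and perf are equal-length lists, one entry per athlete.
    Assumption: each outer tertile holds len(ft) // 3 athletes.
    """
    pairs = sorted(zip(ft, perf))            # order athletes by fT
    k = len(pairs) // 3                      # size of each outer tertile
    low = [p for _, p in pairs[:k]]          # performances, lowest fT third
    high = [p for _, p in pairs[-k:]]        # performances, highest fT third

    mean_low = statistics.mean(low)
    mean_high = statistics.mean(high)

    # Pooled sample variance for two groups of equal size k.
    pooled_var = (statistics.variance(low) + statistics.variance(high)) / 2
    t_stat = (mean_high - mean_low) / sqrt(pooled_var * 2 / k)

    return {
        "mean_low": mean_low,
        "mean_high": mean_high,
        "margin_pct": 100 * (mean_high - mean_low) / mean_low,
        "t": t_stat,
        "df": 2 * k - 2,
    }


# Synthetic example (made-up numbers): 60 athletes, 400 m times in seconds,
# generated independently of fT, so any "margin" here is pure noise.
rng = random.Random(0)
ft_values = [rng.lognormvariate(0.0, 0.5) for _ in range(60)]
times = [rng.gauss(52.0, 1.0) for _ in range(60)]
result = tertile_comparison(ft_values, times)
print(result)
```

Note what the sketch discards: the middle third of athletes is ignored entirely, and the sign of the margin means different things for timed events (lower is better) and for throws or jumps (higher is better). Repeating this once per event gives one test per event, which is where the multiple-comparisons issue discussed below comes in.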

Franklin writes:

I’m sure you wouldn’t be surprised to see these kinds of mistakes in published work. What is more distressing is that this evidence is said to be a key submission in the IAAF’s upcoming case against CAS [the Court of Arbitration for Sport], since the CAS has argued that sex classification on the basis of T levels is only justified if high T confers a “significant competitive advantage”. (Of course, one might reasonably disagree with that standard, but it is the standard the IAAF, and this paper, try to meet. An IAAF official is, in fact, a co-author.) Media coverage here, here, and here.

The paper correlates testosterone levels in athletes with their performance at a recent World Championship and makes causal claims about the effects of testosterone on female performance.

There are more than a few problems with the paper, not least the fact that it makes causal claims from correlations in a highly selective sample, and the bizarre choice of comparing averages within the highest and lowest tertiles of fT levels using a Student t-test (without any other statistical tests presented).

But most problematic is the multiple hypothesis testing. The authors test for a correlation between T-levels and performance across a total of over 40 events (men and women) and find a significant correlation in 5 events, at the 5% level. They then conclude:

Female athletes with high fT levels have a significant competitive advantage over those with low fT in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault.

These are 5 events for which they found significant correlations! And we are led to believe that there is no such advantage for any of the other events.

Further, when I attempt to replicate the p-values (using the limited data available) I find only 3 out of the 5 with p<0.05, and at least three women's events with p<0.15 with signs in the opposite direction (high-T athletes perform worse), strongly suggesting that a joint test on standardized performance measures would fail to reject. Note also that this study is being done precisely because there are currently at least a few high-performing hyper-androgenic women in world athletics, and these women are (presumably) included in their sample. Now, of course, there are all sorts of endogeneity problems that could be leading to a downward bias in these estimates. And indeed I'm surprised to see such a weak correlation in so many events, given what I've read about the physiology. But the conclusion to this paper cannot possibly be justified on the basis of the evidence.
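The multiple-testing concern can be put in rough numbers. The sketch below assumes exactly 40 tests (the letter says "over 40"), each run at the 5% level, all mutually independent, and all with no true effect. None of these assumptions holds exactly for the paper, so read the output as a back-of-the-envelope benchmark rather than a reanalysis.

```python
from math import comb


def prob_at_least(k, m, alpha):
    """P(at least k of m independent tests reject) when every null is true."""
    return sum(
        comb(m, j) * alpha**j * (1 - alpha) ** (m - j)
        for j in range(k, m + 1)
    )


m = 40        # assumed number of event-level tests
alpha = 0.05  # nominal significance level used in the paper

expected_false_positives = m * alpha           # mean number of chance "findings"
p_any = prob_at_least(1, m, alpha)             # chance of at least one
p_five_or_more = prob_at_least(5, m, alpha)    # chance of five or more
bonferroni_threshold = alpha / m               # per-test level for 5% familywise error

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one):          {p_any:.3f}")
print(f"P(five or more):          {p_five_or_more:.3f}")
print(f"Bonferroni threshold:     {bonferroni_threshold:.5f}")
```

Under these assumptions about two nominally significant events are expected by chance alone, and the chance of seeing at least one is roughly 87%. A Bonferroni-style correction would ask each individual test to clear p < 0.00125, a much higher bar than the p < 0.05 reported in the abstract.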

It’s hard for me to judge this, in part because I know next to nothing about doping in sports, and in part because the statistical analysis and data processing in this paper are such a mess that I can’t really figure out what data they are working with, what exactly they are doing, or the connection between some of their analyses and their scientific goals. So, without making any comment on the substance of the matter—the analysis in that paper is so tangled and I don’t have a strong motivation to work it all out—let me just say that statistics is hard, and papers like this give me more of an appreciation for the sort of robotic-style data analyses that are sometimes recommended in biostatistics. Cookbook rules can be pretty silly, but it all gets worse when people just start mixing and matching various recipes (“Data distributions were assessed for normality using visual inspection, calculation of skewness and kurtosis, and the Kolmogorov-Smirnov test. . . . The effects of the type of athletic event were tested with a one-way analysis of variance . . . Tukey HSD (Spjotvoll/Stoline) post hoc test when appropriate . . . athletic performances and Hb concentrations of the highest and lowest fT tertiles were compared by using non-paired Student’s t-test. These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required. . . . When appropriate, a χ2 test was used. Correlations were tested by the Pearson correlation test.”) I have no reason to think these authors are cheating in any way; this just appears to be the old, old story of a data analysis that is disconnected from the underlying questions being asked.

What to recommend here? I feel like any such analysis would really have to start over from scratch: the authors of the above paper would be a good resource, but really you’ll have to look more carefully at the questions of interest and how they can be addressed by the data. But that takes effort and expertise, which is a challenge given that expertise is not so easy to come by, and the IAAF is not, I assume, an organization that has a bunch of statisticians kicking around.

Given all this, I think the way to go is for people such as Bermon and Garnier to publish their data and their speculations, and leave the data analysis to others.

33 Comments

  1. Marcos says:

    I think the fact that researchers do their own analysis is a huge problem in medical sciences, and that is mostly because of conflict of interest. I find it very impressive how much importance people give to things like blinding, randomization, involvement of the pharmaceutical industry, and yet nobody cares about researchers doing their own analysis. It is like, physicians should not know which treatment the subject is receiving so that they do not influence the outcome, but it is okay for the physician to analyze the data. Total nonsense. I think funding agencies should demand and provide funds for independent statistical analysis. At the same time we should stop with this null result thing that is ‘unpublishable’, and publish the result, whatever it is.

    • Anoneuoid says:

      At the same time we should stop with this null result thing that is ‘unpublishable’, and publish the result, whatever it is.

      This seems to be the dominant opinion, so where is this highly damaging practice coming from? In fact I am not sure I have ever met someone who advocates the current practice.

    • Nicholas says:

      To me, this has been more of a problem for the social sciences. On the medical side there is substantially greater research infrastructure such that most larger schools of medicine have a biostatistics core that handles the statistical analysis for medical researchers. It seems to me that the problem of the data collector being the analyst is far more prevalent on the social/behavioral sciences side, which I would imagine is due to a lack of funding. Nevertheless, even having an outside statistician doesn’t always prevent the ceaseless hunt for p < 0.05, as there are other institutional pressures that can exert themselves. That said, I agree that at the very least there needs to be adequate funding to create the same type of research infrastructure that is common to biomedical research.

    • Paul says:

      I’m not so sure. I think that a deep understanding of the subject leads to a more thoughtful data analysis than is possible when the analysis is farmed out to pure statisticians.

      • Carol says:

        Hi Paul,

        My observation is that it is a lot easier for a statistician to acquire the necessary substantive knowledge to do a thoughtful data analysis than it is for a substantive person (say, a psychologist or a physician) to acquire the necessary statistical knowledge.

        Carol

      • Martha (Smith) says:

        I think careful collaboration between subject matter specialist and statistician is what is needed.

  2. Ema says:

    Political scientist here. I have a bunch of cool data and a bunch of training in statistics but not as much as a statistician. I’m junior. I tend to think I’m at least moderately competent in analysis and I don’t have a working relationship with any statisticians.

    What would you like me to do? Analyse it myself? Collaborate with a statistician? Post my data online and pray for tenure?

    • Andrew says:

      Ema:

      I’m not sure what you should do right now, but in general I’d like it to be possible to publish, in a good journal, important data along with competent analysis, without the implicit requirement that the results be statistically significant or definitive or especially innovative. So you could publish your data, analyze them as well as you can with recognition that the analysis could be improved, a good journal would publish your paper because the topic is some combination of interesting and important, and then others could follow up with their own analysis, citing you. The idea is that your work would be influential and your paper would be cited by later researchers who reanalyze your data, and even later researchers who refer to others’ reanalyses and also your raw data.

  3. Thanatos Savehn says:

    Dammit Andrew I’m a lawyer not a statistician but, if I’m understanding any of this unfolding crisis, the model ought to imply the statistical analysis and not the other way around. Thus, if you were to preregister your experiment and specify the analyses to be done upon the data to be collected and what your hypothesis predicts they’ll reveal, isn’t the problem largely solved?

    As per the comment re: biostatisticians safeguarding biomedical science I’ll say this, having deposed three since 2011 when I had my first “OMG none of this means what I thought it means” moment: all three got the definitions right but two of the three got the inferences wrong in the very ways Gigerenzer and Wagenmakers have described.

    • Andrew says:

      Thanatos:

      I don’t think there’s any experiment here or really a place for preregistration. These track-and-field data are observational and had already been collected long before this particular analysis was even considered. This is commonplace in policy: the question arises and then you search for relevant data.

  4. Guido Biele says:

    It appears to me that an implicit assumption of Andrew’s proposal (division of labor between those who propose hypotheses and collect data and those who analyze data) is that one can typically collect useful data without knowing well what analysis one would perform. My impression from the research areas I have been involved in (psychology, neuroscience, mental health) is that this would not be a valid assumption. Another implicit assumption is maybe that a statistician can quickly pick up the domain knowledge required to perform a meaningful analysis. I guess this depends on the research field or problem.
    An alternative approach to “division of labor” would be to collect less data/write fewer papers and spend more time learning about statistics, or to involve a statistician from the beginning. Which alternative is better depends in part on whether it is more efficient for the domain expert to acquire the relevant statistical expertise, or for the statistician to acquire the relevant domain expertise.

    • Andrew says:

      Guido:

      In the example being discussed, the data on androgen levels and athletic performance already existed, so I think it would make sense for those researchers to gather the data, put the numbers in a convenient form, do some minimal analysis, and then publish it. Instead of what they did, which was to do a goofy analysis, push some questionable claims, and still not make the complete dataset available for others.

      • Guido Biele says:

        Andrew:
        I agree that it appears that they dropped the ball (I write “it appears” because I haven’t read the paper). I did not want to defend that paper or comment on that specific case.

        My comment originated more from the blog title and in particular this sentence:

        I’d like us to move to a world in which gathering and analysis of data are separated, in which researchers can get full credit for putting together a useful dataset, without the expectation that they perform a serious analysis.

        which I understood to advocate the division of labour I described above.

        The point I was trying to make is that I am unsure how likely it is that somebody who cannot or does not want to think about a serious analysis will be able to put together a useful dataset.

        • Andrew says:

          Guido:

          Yes, I agree with you on that. Lots of studies of the “power = .06” variety are dead on arrival because the data that have been collected just can’t answer the questions being asked. Familiar examples to the reader of this blog include the ovulation-and-clothing study, the ovulation-and-voting study, the beauty-and-sex-ratio study, the power-pose study, etc etc.

          Still, even in these horrible examples, I’d prefer if the authors had just published the data they had, along with their speculations, clearly labeled as such, and then left it to others to analyze the data cleanly (which in each of the above cases would result in a clean statement that essentially nothing could be learned).

        • Martha (Smith) says:

          I think the best approach to data collection is to start with a collaboration between subject matter specialist and statistician in order to develop a data collection plan that is likely to produce data that can be informatively analyzed (rather than, for example, data with confounds that could have been avoided by a better data collection plan).

          (Gee, wouldn’t it be great if some ambitious soul produced a book of case studies of data sets that are essentially unusable because of poor data collection — and how the data collection plans could have been modified to produce usable data. Or has this already been done?)

  5. I’m not sure researchers would be satisfied with publishing data with ‘speculations’. That sounds like it could be open field day. Let the researchers spec out their ideas, hypotheses, suggest some key variables and correlations, and then let some experimental design people develop or augment a dataset in conjunction with the researcher + others with knowledge of other successful empirical studies. Also a database expert would be useful, especially if the data will be rolled out to the public, for multipurpose analysis.

    • Andrew says:

      Ralph:

      It’s already open field day; take a look at the paper linked to in the above post, not to mention lots of papers in PPNAS and Psychological Science.

      Speculations are going to happen anyway. What I’d like is for people to get their data out there and not feel pressure to prove something they can’t really prove, or to attach an inappropriate degree of certainty to speculations.

      Your plan, in which lots of researchers with different areas of expertise get together to do a project, is fine. But there will be cases such as in the above-linked paper where a single researcher, or a small team without general expertise, has some data and perhaps some speculations to share. I’d prefer for them to do this right away—publish their data and their speculations, as speculations—rather than waiting for a grand synthesis that may never come.

      • Marcus says:

        I agree that this would be desirable but the cynic in me would note that researchers really like (and need for tenure purposes) to get cited and a paper is far more likely to get cited a lot when you’ve p-hacked your way through the garden of forking paths to a really impressive effect size estimate that you can then turn into a TED talk, NYT editorial, Huffington Post blog entry or paper in PNAS or Psych Science.

    • Keith O'Rourke says:

      Open field days may be absolutely necessary for science to actually happen in many areas (given what is reported was actually done well and honestly and fully reported).

      It’s a fairly old idea (e.g. astronomical observations in the 1800s).

      This paper discusses the motivations and advantages – https://www.researchgate.net/publication/8402267_The_Value_of_Risk-Factor_Black-Box_Epidemiology

      A couple excerpts
      “Research articles can be useful even if they only report how the study was done and what associations were observed without attempting to interpret or explain these observations in terms of methodologic, biologic, or social theories.”

      “A descriptive orientation could also encourage deferral of general explanations to more thorough and comprehensive reviews.”

      ” … policy implications should be reserved for separate articles that synthesize evidence in a balanced fashion, along with costs and benefits of proposed actions.”

      “Some epidemiologists could find such article specialization in conflict with their conception of public health research. Most researchers accept, however, that personal specialization has become necessary, because one “renaissance scientist” can no longer master all the fields …”

      Widespread adoption of such ideas, though, may send most University hiring and promotion procedures into a complete tailspin.

  6. Andrew wrote: “What I’d like is for people to get their data out there and not feel pressure to prove something they can’t really prove, or to attach an inappropriate degree of certainty to speculations.” and “I’d prefer for them to do this right away—publish their data and their speculations, as speculations—rather than waiting for a grand synthesis that may never come.”

    See http://kornel.zool.klte.hu/pub/ornis/articles/OrnisHungarica_vol25(1)_p120-146.pdf and http://kornel.zool.klte.hu/pub/ornis/articles/OrnisHungarica_vol25(1)_p147-176.pdf for some examples.

    Copy/pasted from the first url: “Ornithological studies often rely on long-term bird ringing data sets as sources of information. However, basic descriptive statistics of raw data are rarely provided. In order to fill this gap, here we present the third item of a series of exploratory analyses of migration timing and body size measurements of the most frequent Passerine species at a ringing station located in Central Hungary (1984–2016). (…) Our aim is to provide a comprehensive overview of the analysed variables. However, we do not aim to interpret the obtained results, merely to draw attention to interesting patterns that may be worth exploring in detail. Data used here are available upon request for further analyses.”

  7. Seth Green says:

    A few comments on the T & F side of this discussion:

    My impression is that the underlying issue here is not “doping in sports”, but rather whether women with unusually high, but naturally occurring levels of testosterone should be allowed to compete in women’s sports. Of course some of those women might be ingesting/injecting testosterone as well, who knows.

    When I read that high testosterone is negatively associated with performance for some events, my first thought is “measurement error.” My second is that the authors are sampling from a truncated distribution. In 2011 and 2013, the IAAF required women with more than the allowed amount of testosterone to take testosterone-suppressing hormones (we think — the whole saga is cloudy). CAS struck that down in 2015. I conjecture that the authors would have found stronger effects across a wider range of events if they ran the same tests today, when the long right tail of the distribution is competing and winning (their names are Caster Semenya, Francine Niyonsaba, and Margaret Wambui), at least in the 800. Why there is no analogue to the Caster Semenya situation in other events, I have no idea.

  8. Kyle MacDonald says:

    Much of the discussion of researchers’ incentives for noise mining seems to focus on career incentives, which are probably a big part of it. However, I think another part might be the simple desire to demonstrate your intellectual prowess, both to yourself and to your colleagues, not because it will get you a slick invitation to TED but because valuing the ability to understand stuff is what drew you into academia in the first place. In this view, attempting to make your speculations more solid than they can be given the data is about self-actualization, not just material comfort. This is consistent with the general idea that people who analyze data badly are not generally trying to be fraudulent: they’re often doing what they believe to be good science.

    In an ideal world, science, perhaps especially social science, would be about measuring the world accurately, not about finding out who’s clever. We do not live in an ideal world (maybe Leibniz did, and good for him, but I’m pretty sure that I don’t). I think that the innate bias among most intellectuals to prioritize the theorist over the experimenter, the analyst over the data gatherer, the hedgehog over the fox, the theory builder over the problem solver, and the Galahad over the Don Quixote (last one from Paul Hoffman) will make it difficult to advance much on this issue.

    • Martha (Smith) says:

      “I think that the innate bias among most intellectuals to prioritize the theorist over the experimenter, the analyst over the data gatherer, the hedgehog over the fox, the theory builder over the problem solver, and the Galahad over the Don Quixote (last one from Paul Hoffman) will make it difficult to advance much on this issue.”

      Sad, but probably at least part of the problem.

    • Andrew: “I’d like us to move to a world in which gathering and analysis of data are separated, in which researchers can get full credit for putting together a useful dataset”
      Kyle: “…I think another part might be the simple desire to demonstrate your intellectual prowess.”

      Though I appreciate the general point of this post, on the usefulness of separating data from analysis, I have to point out that if someone told me that my lab could collect data but was forbidden from analyzing it, I’d probably quit science. It’s not a question of “getting full credit,” or “demonstrating intellectual prowess,” but rather implementing the basic human curiosity that makes us do science in the first place. It’s true that there are fields in which the data collection itself is remarkable or necessary — acquiring vast collections of genomic data, or sky surveys in astronomy — but most science isn’t like this, and that’s fine.

      • Andrew says:

        Raghu:

        I don’t think anyone was saying anything about “forbidding” people from analyzing data they’d collected. I just want people to be able to publish, and be cited for, collections of interesting or important data, without the implicit requirement that the data be accompanied by an analysis that makes strong claims.

        To put it another way: in a world in which gathering and analysis of data are separated, I’d have no problem with you or your colleagues publishing two papers, one with the data and the other with the sophisticated analysis.

        • Good point, and I strongly agree that the “implicit requirement that the data be accompanied by an analysis that makes strong claims” is terrible. Coincidentally, my lab just published a paper on a pain-in-the-neck experiment from which there wasn’t much to conclude, and I’m glad to work in a field in which this is (somewhat) acceptable. I’ll try to write more about this some other time.

          About separate papers focused e.g. on data and then analysis: This reminds me that it seems very rare nowadays to have “series” of papers — for example this and this on a method and then its application, two of a series of three from 1954-55 that I was recently reading. I don’t know why this format has largely disappeared.

          But to push back a little bit: About people being able to publish just data: they’re able to now, it’s just that no one cares. But of course you mean “able” in a broader sense of this being appreciated. I agree that this would be good. But there are, among bad reasons, decent reasons why this isn’t the case: there are an infinite number of things one can collect data on. Which are worthwhile? The analysis, or the hypothesis that drives the analysis, helps us figure out which of a billion papers we might want to care about, and totally separating data from analysis might take away this cue. One could, as an alternative, publish a paper that’s just motivation + data, and I agree that this would in principle be good, but these papers will have to compete in the crowded landscape against papers that have “closure” to the narrative as well. The authors might be unsatisfied (for the reasons in my original comment), and the readers unsatisfied as well. This may not be a bad thing for science, but it may be a tough thing for science done by humans!

      • Kyle MacDonald says:

        Curiosity about the data is of course the reason for doing science! Beyond a few people who are VERY interesting to talk to at dinner parties, few people go into microbiology because they’re really excited about microscopes. I alluded to this later in the sentence you quote from: “valuing the ability to understand stuff is what drew you into academia in the first place.” My emphasis was deliberately cynical, though: we pursue the truth by demonstrating that we are clever enough to find it, a bit like how construction workers build roads and bridges by working to feed their families.

        What’s that Tukey quotation? “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” (https://en.wikiquote.org/wiki/John_Tukey) I think that, beyond curiosity, we want to believe that Tukey’s warning doesn’t apply to us, that we are clever enough to get answers from our data. (I know that I want to believe this about myself.) If, apart from economic incentives, researchers were only motivated by a desire for the truth, I think we would see fewer analyses that have no chance of succeeding than we actually see.
