
Response to Rafa: Why I don’t think ROC [receiver operating characteristic] works as a model for science

Someone pointed me to this post from a few years ago where Rafael Irizarry argues that scientific “pessimists” such as myself are, at least in some fields, “missing a critical point: that in practice, there is an inverse relationship between increasing rates of true discoveries and decreasing rates of false discoveries and that true discoveries from fields such as the biomedical sciences provide an enormous benefit to society.” So far so good—within the framework in which the goal of p-value-style science is to make “discoveries” and in which these discoveries can be characterized as “true” or “false.”

But I don’t see this framework as being such a useful description of science, or at least the sort of science for which statistical hypothesis tests, confidence intervals, etc., are used. Why do I say this? Because I see the following sorts of statistical analysis:

– Parameter estimation and inference, for example estimation of a treatment effect. The goal of a focused randomized clinical trial is not to make a discovery—any “discovery” to be had was made before the study began, in the construction of the new treatment. Rather, the goal is to estimate the treatment effect (or perhaps to demonstrate that the treatment effect is nonzero, which is a byproduct of estimation).

– Exploration. That describes much of social science. Here one could say that discoveries are possible, and even that the goal is discovery, but we’re not discovering statements that are true or false. For example, in our red-state blue-state analysis we discovered an interesting and previously unknown pattern in voting—but I don’t see the ROC framework being applicable here. It’s not like it would make sense to say that if our coefficient estimate or z-score or whatever is higher than some threshold that we declare the pattern to be real, otherwise not. Rather, we see a pattern and use statistical analysis (multilevel modeling, partial pooling, etc.), to give our best estimate of the underlying voting patterns and of our uncertainties. I don’t see the point of dichotomizing: we found an interesting pattern, we did auxiliary analyses to understand it, it can be further studied using new data on new elections, etc.

– OK, you might say that this is fine in social science, but if you’re the FDA you have to approve or not approve a new drug, and if you’re a drug company you have to decide to proceed with a candidate drug or give it up. Decisions need to be made. Sure, but here I’d prefer to use formal decision analysis with costs and benefits. If this points us toward taking more risks—for example, approving drugs whose net benefit remains very uncertain—so be it. This fits Rafael’s ROC story, but not based on any fixed p-value or posterior probability; see my paper with McShane et al.

Also, again, the discussion of “false positive rate” and “true positive rate” seems to miss the point. If you’re talking about drugs or medical treatments: well, lots of them have effects, but the effects are variable, positive for some people and negative for others.

– Finally, consider the “shotgun” sort of study in which a large number of drugs, or genes, or interactions, or whatever, are tested, and the goal is to discover which ones matter. Again, I’d prefer a decision-theoretic framework, moving away from the idea of statistical “discovery” toward mere “inference.”
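To make the contrast concrete for the "shotgun" setting, here is a minimal sketch (all numbers hypothetical, simple normal-normal model assumed) of the decision-theoretic alternative: shrink the noisy estimates toward the prior and follow up the candidates with the largest posterior expected effects, rather than declaring "discoveries" at a cutoff.

```python
import random
random.seed(2)

m = 1000     # number of candidate drugs/genes (hypothetical)
tau = 0.5    # assumed prior sd of the true effects
sigma = 1.0  # assumed measurement noise sd

# Simulate true effects and their noisy estimates.
true_effect = [random.gauss(0, tau) for _ in range(m)]
estimate = [t + random.gauss(0, sigma) for t in true_effect]

# Posterior mean under the normal-normal model: shrink each estimate toward zero.
shrink = tau**2 / (tau**2 + sigma**2)
post_mean = [shrink * e for e in estimate]

# Decision-theoretic selection: follow up the k candidates with the largest
# posterior expected effects, with no true/false "discovery" declaration anywhere.
k = 20
followup = sorted(range(m), key=lambda i: post_mean[i], reverse=True)[:k]
print(sum(true_effect[i] for i in followup) / k)  # mean true effect among selections
```

The ranking here is the same as ranking by raw estimate, but the shrunken posterior means give honest magnitudes for the follow-up decision; with unequal standard errors per candidate, the two rankings would differ.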

What are the practical implications of all this? It’s good for researchers to present their raw data, along with clean summary analyses. Report what your data show, and publish everything! But when it comes to decision making, including the decision of what lines of research to pursue further, I’d go Bayesian, incorporating prior information and making the sources and reasoning underlying that prior information clear, and laying out costs and benefits. Of course that’s all a lot of work, and I don’t usually do it myself. Look at my applied papers and you’ll see tons of point estimates and uncertainty intervals, and only a few formal decision analyses. Still, I think it makes sense to think of Bayesian decision analysis as the ideal form and to interpret inferential summaries in light of these goals. Or, even more short term than that, if people are using statistical significance to make publication decisions, we can do our best to correct for the resulting biases, as in section 2.1 of this paper.
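As a toy illustration of that kind of Bayesian decision analysis, suppose we have posterior draws for a treatment’s net effect and some assumed (entirely hypothetical) costs and benefits. The approve/reject decision then falls out of expected utility, not out of any significance threshold:

```python
import random
random.seed(1)

# Hypothetical posterior draws for a treatment's net effect; in a real analysis
# these would come from a fitted Bayesian model, not a canned normal.
posterior_draws = [random.gauss(0.3, 1.0) for _ in range(10_000)]

benefit_per_unit = 100.0  # assumed benefit per unit of effect (hypothetical units)
approval_cost = 10.0      # assumed fixed cost of approval/deployment (hypothetical)

# Expected net benefit of approving, averaged over posterior uncertainty.
expected_net = (benefit_per_unit * sum(posterior_draws) / len(posterior_draws)
                - approval_cost)

# The decision comes from expected utility, not from a p-value cutoff;
# here approval can win even though the sign of the effect is still uncertain.
decision = "approve" if expected_net > 0 else "do not approve"
print(round(expected_net, 1), decision)
```

With these made-up numbers the posterior probability of a positive effect is only about 0.6, far short of any conventional significance cutoff, yet approval maximizes expected benefit; different assumed costs would flip the decision.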


  1. BenK says:

    I’m familiar with R.I.; I even took one of his online classes because I don’t do human genomics much and was interested to see how the methods in that space might relate to other kinds of genomics. I think the kind of science he does definitely shapes his idea of what good science is and how to judge it.

    His idea of an FDR for all of science definitely doesn’t give the sense that discoveries can be both true and false at the same time. The heliocentric model was true and false. Does it even make sense to ask ‘does the universe spin around the sun or the earth?’ Well… without a fixed frame of reference… but to require all that Popper between Aristotle and Einstein to happen at once, or to deny it all as ‘false discovery,’ seems foolish.

    Using the FDA discussion – the question is badly framed as ‘the treatment effect,’ though this is the way we typically discuss it. Actually, given the degree of variation among individuals, it may not make sense to discuss ‘an effect.’ Instead – ‘personalized medicine.’ We need to redefine the clinical picture and find the classes of data that help make treatment decisions. That needs to be part of the drug discovery process – and in doing so, if the FDA, physicians, the market, etc., will only support drugs or diagnostics that fit with the current ‘standard of care’ (i.e. they can’t be bothered to refine the clinical picture data requirements) then we are near the end of our rope.

    In antimicrobials, we learned that we needed to do culture as part of the clinical picture. And then susceptibility testing. Now, we are learning to do sequencing because culture wasn’t actually getting us the data we thought it was in many cases and was misleading clinical practice.

    So we end up reframing the questions in cancer around tumor diversity; in infectious disease, around agent diversity; and we need to include human diversity.

    As you can tell, my ‘favorite’ kind of science is about posing the right questions. If you get the question right, the answer drops out as obvious. Simply knowing an answer ‘could exist’ is meaningful and impactful.

    At the same time, most ‘science’ needs to be pretty practical and can’t go overturning entire worldviews.
    The replication crisis shows clearly how drawing wildly inappropriate conclusions from poorly collected and overinterpreted data can be disastrous for science and engineering. Part of the problem is that we are expecting paradigm-shift thinking from every grant proposal; from every assistant professor. Many excellent professors never spark a paradigm shift in their entire lives. They certainly can’t from each grant. We collectively need to be very supportive of modestly conducted science. Otherwise, we end up supporting charlatans who continually make false claims of discovery.

    • Martha (Smith) says:

      “We collectively need to be very supportive of modestly conducted science. Otherwise, we end up supporting charlatans who continually make false claims of discovery.”


      • Anoneuoid says:

        I don’t think so… All you need is to require independent replication of some results before you believe them. Even just one replication is so much better than none. This whole replication crisis is just due to laziness and cheapness, trying to get away without funding/doing what’s required. It is really that simple.

        Testing theories is a whole other aspect. One that can be ignored for now, since the first step of getting “facts to be explained” has become so messed up by people who don’t know what they are doing.

  2. yyw says:

    Cost benefit analysis and decision theory should be taught in every intro to statistics class. Even in areas with no explicit policy implications, they could still inform whether to follow up/replicate existing studies.

  3. Michel Accad says:

    Important point. Sensitivity and specificity analyses assume a mechanistic model of health and disease which is untenable, except perhaps when dealing with infectious causes (and even there it’s problematic).

  4. I am quite sympathetic to this claim that there aren’t true or false effects. It’s not clear to me where theory validation falls here. I’ve had “exploration” papers as described in the post, but other papers don’t seem to fit into either category. In those papers, I take a pattern found in some other paper and try to come up with a more generic theoretical explanation that predicts that the same pattern should show up in a new dataset (because the underlying generation mechanism is shared even though the surface features differ). How do you recommend identifying whether the theoretical prediction of “same pattern” was valid?

  5. Rafa says:

    Thanks for the post, Andrew. I more or less agree with all the points in your post. However, you don’t discuss what I wanted to be the main message of my original post: we want to move the ROCs up rather than move to the left on the current ROC. For example, requiring smaller p-values or higher posterior probabilities moves us to the left on the current ROC. Getting rid of charlatans, more awareness of common data-analysis pitfalls, more data visualization, fewer coding errors, more open data, etc., move the entire ROC curve up. I agree, and I think I even say it in the OP, that the ROC curve framework is an oversimplification. I was using it as a way to demonstrate that there are more important things to do than impose more stringent cutoffs. In hindsight it was a mistake to use you as an example, since most of the changes you advocate for on your blog would move the ROC up. Sorry about that.

    Best wishes and thanks for sharing your thoughts,
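Rafa’s distinction between moving left on a given ROC and moving the whole ROC up can be illustrated with a stylized calculation (my own sketch, not from his post): null z-scores are taken as N(0, 1), real effects give z ~ N(signal, 1), and the assumed "signal" parameter stands in for study quality.

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def roc_point(threshold, signal):
    """(FPR, TPR) when null z-scores are N(0, 1) and real effects give z ~ N(signal, 1)."""
    fpr = 1 - nd.cdf(threshold)            # chance a null case clears the cutoff
    tpr = 1 - nd.cdf(threshold - signal)   # chance a real effect clears it
    return round(fpr, 3), round(tpr, 3)

# Moving LEFT on the same ROC: a stricter cutoff with the same study quality.
print(roc_point(1.96, signal=2.0))  # (0.025, 0.516)
print(roc_point(3.00, signal=2.0))  # (0.001, 0.159): fewer false AND true positives

# Moving the whole ROC UP: same cutoff, but better-run studies (bigger samples,
# cleaner data, fewer coding errors) strengthen the signal.
print(roc_point(1.96, signal=3.0))  # (0.025, 0.851): same FPR, far more true positives
```

Tightening the threshold trades true positives for false positives along one curve; improving the studies buys more true positives at the same false-positive rate.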

  6. Dimitris Rizopoulos says:

    But sometimes we do have to deal with “pure” scientific discovery. For example, a few years ago, particle physicists announced the discovery of the Higgs boson based on replicated experiments they had done at the LHC. AFAIK they use a 5-sigma threshold on p-values to decide whether they have found something or not. Could someone explain how we could do a more proper Bayesian decision analysis here? What are the costs & benefits of announcing to the world that the Higgs boson does or does not exist?

    • Andrew says:


      I don’t think it’s always, or even usually, necessary for the statistical analysis to proceed all the way to a decision analysis. It can be fine just to state one’s uncertainty. If you look at my own applied work, in most cases I did not perform any decision analysis with costs and benefits; I just stated my assumptions and my inferences.

      • Dimitris Rizopoulos says:


        First of all, I agree with you and others that dichotomies are arbitrary and for that reason not good. But on the other hand I do see the need for them. In the example I mentioned, AFAIK particle physicists do very rigorous experiments that are replicated by different teams at the LHC. But there is still a need among them for a mechanism to say that, based on the results of these experiments and accounting for the uncertainty they have, a particular particle does or doesn’t exist. Hence, they need to have a kind of threshold to decide/claim the discovery of a new particle. Of course it may be that later, based on the results of new experiments, they need to revise their original “decision/claim” of finding a particle, but this is how science progresses. How do you suggest that this be approached?

      • Keith O'Rourke says:

        A large part of the problem seems to be an _acceptance_ that individual researchers or groups of researchers can and should make such decisions.

        Such as when editors pressure them to be judge, jury, and executioner in areas where they almost surely have biases toward their own methods, ambitions, and study results.

        • Dimitris Rizopoulos says:

          AFAIK in particle physics it is not individual researchers or groups of researchers that have decided to use the 5-sigma threshold for discovery; the whole field of particle physics has settled on that particular threshold. In addition, they have agreed that all details of the experiments be made known and, most importantly, that results be replicated by different teams working on the same thing under the same conditions before the discovery of a new particle is accepted.

          What I’m trying to get at is that perhaps having a dichotomous “decision” rule for discovery is not the biggest problem, but rather being rigorous in describing how the data have been collected and analyzed (as Andrew and others already said) and, importantly, replicating results before accepting them.
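For reference, the tail probability that the 5-sigma convention corresponds to can be computed directly as a one-sided standard-normal tail:

```python
from math import erfc, sqrt

# One-sided tail probability of a 5-sigma fluctuation under a standard normal:
# the conventional particle-physics discovery threshold.
p_5sigma = 0.5 * erfc(5 / sqrt(2))
print(p_5sigma)  # about 2.9e-7, i.e. roughly 1 in 3.5 million
```

Compare this with the 0.025 one-sided tail of the usual z = 1.96 cutoff: the physics convention is stricter by more than four orders of magnitude, on top of the field’s replication requirements.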

  7. The basic concept of ROC is that person A is sending a symbol X along a noisy channel and person B is receiving a symbol Y on the other end, and trying to determine if Y is the correct symbol to have been received. In this case, there is a definite true value for symbol Y, namely Y = X.

    In pretty much any other situation ROC is a wrong way to think about how things work, particularly it’s a wrong way to think about how science works. It’s not like nature sends a sequence of symbols to us and we just need to determine if we’re receiving them correctly. That isn’t even close.
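The channel picture in the comment above can be made concrete with a small simulation (illustrative only, Gaussian noise assumed): A sends bits, noise is added, and B’s threshold receiver traces out an ROC curve as the threshold varies.

```python
import random
random.seed(0)

def run_channel(threshold, n=100_000, noise_sd=1.0):
    """Simulate B's threshold receiver; return (false positive rate, true positive rate)."""
    tp = fp = pos = neg = 0
    for _ in range(n):
        x = random.randint(0, 1)           # symbol A sends: here the truth is well defined
        y = x + random.gauss(0, noise_sd)  # symbol B receives after channel noise
        decided_one = y > threshold        # B's guess that a 1 was sent
        if x == 1:
            pos += 1
            tp += decided_one
        else:
            neg += 1
            fp += decided_one
    return fp / neg, tp / pos

# Sweeping the threshold traces out the receiver's ROC curve.
for t in (0.0, 0.5, 1.0):
    print(t, run_channel(t))
```

The framework is coherent precisely because each trial has a ground-truth bit; the post’s argument is that most scientific questions have no such hidden symbol to recover.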
