I’m negative on the expression “false positives”

After seeing a document sent to me and others regarding the crisis of spurious, statistically significant research findings in psychology, I had the following reaction:

I am unhappy with the use in the document of the phrase “false positives.” I feel that this expression is unhelpful as it frames science in terms of “true” and “false” claims, which I don’t think is particularly accurate. In particular, in most of the recent disputed Psych Science type studies (the ESP study excepted, perhaps), there is little doubt that there is _some_ underlying effect. The issue, as I see it, is that the underlying effects are much smaller, and much more variable, than mainstream researchers imagine. So what happens is that Psych Science or Nature or whatever will publish a result that is purported to be some sort of universal truth, but it is actually a pattern specific to one data set, one population, and one experimental condition. In a sense, yes, these journals are publishing false claims. But it’s not clear to me that the framing in terms of “false positive” etc. is the way to go. It’s not that I think the true effects are zero; I just think there’s not much of a relationship between what is found in a particular study and the more general phenomena of interest.

I can understand if my preferred framing in terms of Type M and Type S errors is too new and unfamiliar to use here, but I do think that a key part of this effort should be to move away from the whole Type 1, Type 2, false positive, false negative, etc. framework.
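
To make the Type M / Type S framing concrete, here is a minimal simulation sketch. The true effect, standard error, and significance threshold below are illustrative assumptions rather than numbers from any particular study; the point is that when a real but small effect is estimated noisily, the statistically significant estimates exaggerate the effect (Type M, or magnitude, error) and sometimes get its sign wrong (Type S, or sign, error), even though nothing here is a “false positive” in the usual sense.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from any particular study): a small but
# real effect, measured with the standard error of an underpowered study.
true_effect = 0.1
se = 0.5
n_sims = 100_000

estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates) > 1.96 * se        # "p < 0.05" with known SE
sig_estimates = estimates[significant]

power = significant.mean()
type_s = (np.sign(sig_estimates) != np.sign(true_effect)).mean()
type_m = np.abs(sig_estimates).mean() / true_effect  # exaggeration ratio

print(f"power: {power:.2f}")
print(f"Type S error rate (wrong sign, given significance): {type_s:.2f}")
print(f"Type M exaggeration factor (given significance): {type_m:.1f}")
```

With numbers like these, the significant estimates overstate the true effect many times over and a non-trivial fraction of them have the wrong sign, which is exactly the information that a true/false framing discards.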

37 thoughts on “I’m negative on the expression ‘false positives’”

  1. I don’t have much to add on significance testing. However, I would say that in a way this is a minor aspect of a larger underlying disease.

    Very tentatively, the way I see it, mainstream science is increasingly less about discovering truth, or even useful knowledge, and more about scientists, their employers, publishers, funders, etc. (Arguably this has always been true, but it is the changing proportions that matter.)

    IMO this is nobody’s fault but an emergent property of the set of principal-agent relations these actors define, the associated incentive system, and its degradation through external shocks (e.g. technology). Anecdotally, I think an increasing proportion of scientists are unhappy with the current system.

    Moreover, an incentive mechanism that emerged piecemeal after the Enlightenment is likely not ready for shocks like MOOCs, a potentially massive consolidation in higher education, a huge mismatch in the supply and demand for academics, changes to the cost of knowledge generation, dissemination, and storage, and so on. Yet there are also bright lights. Citizen science, the rise of independent labs, makers, biohackers, etc. survive outside the mainstream incentive scheme, and are often rewarded only if they produce useful knowledge. Even so, some see these developments as pseudo-science, and possibly dangerous to humanity.

    Much of the “crisis” in science is not new. Chapter 16 in one of my favorite textbooks offers a nice overview circa 1991, with references to the 1930s (http://books.google.com/books?id=v-walRnRxWQC&source=gbs_navlinks_s; see also sections 15.5.2 and 8.5). However, recent technological changes may have accentuated the crisis (maybe the problem is getting worse, or it has become easier to talk about it, or what not). Considering the technological changes mentioned above, matters may well get worse before they get better. No matter. I am told the Chinese character for “Crisis” also means “Opportunity”.

    • Having worked in Doug Altman’s group for a couple of years, I will agree that he had an early sense of the negative aspects emergent from significance testing and other commonly accepted research practices, and of how large the damage could be.

      On the other hand, he seemed to be extremely good at gaming other aspects of academic practice to his advantage.

      There will always be various incentives, and managerially you want people to follow them, since you can then change the accepted research practices to encourage emergent behavior that is less _wrong-headed_.

      I think the problem Andrew is pointing to in these posts is that there is no widely acceptable alternative to significance testing at present.

        • Looks like an interesting book.

          I do think there are considerably more individuals aware of the problems and actively trying to change much of the behaviour than 5 or 10 years ago, and it may be just starting to take off exponentially.

          For instance, grants for reproducible research topics were hard to come by in 2005, and reviews were often in the style of “this is unnecessary, you just have to tell research assistants to keep good notes”.

  2. I think “false positive” had its origin in medical testing. A screening test that incorrectly indicated that a person had disease X was characterized as a “false positive” indication.

    For a 19th century example of this usage see:
    http://books.google.com/books?id=BApYAAAAYAAJ&pg=PA628&dq=%22false+positive%22&hl=en&sa=X&ei=lel7UpmTHI7JsQTGnIKADg&ved=0CJsBEOgBMBE#v=onepage&q=%22false%20positive%22&f=false

    The related term “receiver operating characteristic” is usually attributed to radar engineers during WW II.

    In both cases, the underlying problem is one of decision making—should a patient be subject to further diagnostic procedures? Should we act as if the radar return is from an aircraft?

    In this context, “false positive” is a reasonable term.

    But in making inferences about natural principles, degrees of belief are a better tool for understanding than simple 0/1 rules (a small numerical illustration follows this comment).

    Observer
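
    To illustrate the degrees-of-belief point numerically (the prevalence, sensitivity, and specificity below are made-up illustrative values, not from the comment): the same binary “positive” label corresponds to very different posterior probabilities of disease depending on the base rate, which is exactly what a simple 0/1 rule throws away. A minimal sketch:

    ```python
    # Posterior probability of disease given a positive screening test,
    # computed with Bayes' rule. All numbers are illustrative assumptions.
    def prob_disease_given_positive(prevalence, sensitivity, specificity):
        p_pos_given_disease = sensitivity
        p_pos_given_healthy = 1 - specificity
        p_pos = prevalence * p_pos_given_disease + (1 - prevalence) * p_pos_given_healthy
        return prevalence * p_pos_given_disease / p_pos

    # Same test (90% sensitivity, 95% specificity), different base rates.
    for prevalence in (0.001, 0.01, 0.1):
        post = prob_disease_given_positive(prevalence, sensitivity=0.90, specificity=0.95)
        print(f"prevalence {prevalence:>5}: P(disease | positive) = {post:.3f}")
    ```

    The “positive” label is the same in all three cases; the degree of belief it should produce is not.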

  3. The idea of “false positives” has no necessary connection to claims about the world beyond the laboratory. An FP is merely a data point that is classified in a way that disagrees with the way the researcher(s) believe(s) that data point should have been classified. Declaring the data point an FP or a TP says nothing about the world beyond the laboratory–the label ascribed to the data point (FP, TP, FN, or TN) merely describes a relationship between a researcher’s observations and beliefs.

    • “False positive” is a very useful concept in the context of decision making, e.g. when you have to decide whether to approve a drug or not.

  4. I agree with Gelman. (Of course they are getting it from diagnostic screening, as someone mentioned above, and perhaps the contemporary screening in genomics and such has something to do with this–but science isn’t like that.) This is also why the computations behind the oh-so-popular repeated articles about how “most claims are wrong and science doesn’t self-correct” are erroneous.

      • I was playing around with (goofy) ideas to make the “just the low power, ma’am” issue more salient (and it is Friday afternoon).

        So you have found a Genie in a Bottle, and they have granted you no health problems before you are 100, total protection from any harm from external objects, and adequate wealth to do what you want. They were so impressed by your thoughtful wishes that they granted you one free wish they came up with. They will immediately produce for you a perfect randomized trial comparison for any empirical question you want answered in your life – but you only get the estimated effect and SE, the power is set by them at exactly x% (averaged over the true effect sizes of all the requests in your life), and because they have done this for you, they will make it so that no one else does a similar trial in your lifetime.

        For what value of x would you say no thanks, and what could you do to make that x smaller (e.g. adjust the estimate somehow)? What would your prior be for the true effect sizes of the requests in your life (which the Genie knows precisely)? How would you value/trade off the expected Type M and S errors?

        • Oops, I did not realize that the problem as given makes it impossible to define a likelihood, and that if I provide even enough information to do ABC (and so not need the likelihood [and I hate likelihoods] but get an approximation via posterior/prior: http://en.wikipedia.org/wiki/Approximate_Bayesian_computation ), it provides a backdoor path to calculate the true effect the Genie knew about when making the study design.

          But he is a Genie, so smart enough to make a good enough transformation that the mean and SE always provide an adequate quadratic approximation to a (pseudo)likelihood that, though wrong, is not too wrong (i.e. Charlie Geyer’s asymptopia à la Le Cam: http://www.stat.umn.edu/geyer/lecam/ ). So take that as _the_ likelihood – it’s just a goofy example to focus on the challenges of being _stuck_ with low power (a small simulation sketch follows below).
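
          For what it’s worth, here is a minimal sketch of the ABC-by-rejection idea in this goofy Genie setting. The observed estimate, the SE, the prior scale, and the tolerance are all illustrative assumptions: draw a candidate true effect from the prior, simulate what the Genie would report, and keep the candidate when the simulated report lands close to the observed one.

          ```python
          import numpy as np

          rng = np.random.default_rng(1)

          # Illustrative assumptions: the Genie reports one estimate with known SE.
          observed_estimate = 1.2
          se = 1.0                  # large relative to plausible effects => low power
          prior_sd = 0.5            # prior: true effects are small and variable
          tolerance = 0.05
          n_draws = 200_000

          # ABC rejection sampling: prior draws, simulated reports, keep the close ones.
          theta = rng.normal(0.0, prior_sd, size=n_draws)
          simulated = rng.normal(theta, se)
          accepted = theta[np.abs(simulated - observed_estimate) < tolerance]

          print(f"accepted draws: {accepted.size}")
          print(f"approximate posterior mean: {accepted.mean():.2f} (raw estimate: {observed_estimate})")
          print(f"approximate posterior sd:   {accepted.std():.2f}")
          ```

          The accepted draws shrink the raw estimate heavily toward the prior, which is one concrete answer to “adjust the estimate somehow” when stuck with low power.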

  5. Pingback: Friday links: an accidental scientific activist, unkillable birds, the stats of “marginally significant” stats, and more (link fixed) | Dynamic Ecology

  6. I am frustrated by the current deluge of low-quality “failed replications”.

    Here is an example.

    The Economist cites a paper with “Nine failed replications”

    Yet there are not 9 replications: only 1 to 3 of them are replications; the others, by all accounts, have changes of protocol.

    Even those that are supposedly replicas are very low powered and heterogeneous, with quality issues and missed protocol items.

    http://www.plosone.org/annotation/listThread.action?root=64751
    (This is the original authors’ response, but the replicators do not deny the above.)

    I believe the incentive issues around replication push replications toward low-profile people, undergraduates, etc., because top-of-the-line researchers and funds are not available for this, yielding not-so-convincing “failed replications” and making the discussion too murky to be useful.

    • Jazi, neither the replications nor the original results are reliable. All work done in this fashion will be ignored by future generations and is, at best, a waste of time and money. The only solution is testing precise predictions based on theories. If this is not possible for whatever reason, the publication should be purely descriptive. This is how science functioned before NHST and how it will function again once we are out of these dark ages.

      • > the publication should be purely descriptive

        Agreed (and I made arguments along those lines in Greenland S, O’Rourke K: Meta-Analysis. In Modern Epidemiology, 3rd ed. Edited by Rothman KJ, Greenland S, Lash T. Lippincott Williams and Wilkins; 2008).

        Really, investigators should not be _allowed_ to draw conclusions or make recommendations (as most journals demand), given their inherently biased view of their study’s at most partial contribution to addressing any empirical question. In fact, their later role in doing a meta-analysis of the collection of studies is problematic – how do they unbiasedly assess their contribution relative to others?

        Many insurmountable opportunities for purposeful scientific communication.

      • The original study I do not know about.

        But a replication has got to be very high quality. High power. An EXACT replica, and ask the original authors for full protocol details.

        In this fashion, you can say: I did replicate and found no effect.

        What we have now are replications of much lower quality than the original papers.

        • Jazi:

          You write that a replication has to be an “exact replica and ask the original authors for full protocol details.”

          I disagree. A replication of a published study should be a replication of what is published. Unfortunately it appears to be standard to publish studies without giving details about what is done: no survey forms, no explicit rules for data exclusion, etc. In such settings, I think the replication of the published result should follow as closely as possible what is published. It should not require asking authors. The public record is the public record. Asking authors is fine, but if an important detail of a study is not in the published paper (or supplementary material) then that’s a problem with the original publication, not with replication.

        • As far as science goes, using “he did not write it in the paper” is a no-go, a technical excuse.

          Moreover, many details are common sense to higher-level researchers (like using college students as in the original study, and not switching to a wildly heterogeneous population).

          Many details are in the literature referenced in the paper; they are technically referenced, but ignorant replicators tend not to be fully aware of that literature.

          (Example: using 30 out of 30 priming words in the old-age priming study, which even I knew is a no-go. It is explicit in the priming literature.)

          Another reason for consulting authors is that without close knowledge you might even miss things that are explicit in the protocol, and think they are unrelated technicalities.

        • Jazi:

          The problem is that what is believed is based on the public record. Suppose a published paper says, We did X, but actually what they did was Y. If X cannot be replicated, that is news, that is relevant to the scientific community, which otherwise might believe X because it was published in a peer reviewed journal. A replication or non-replication of Y could also be news, that’s fine too, but as far as the general scientific community is concerned, if X is what is published, X is what needs to be addressed. An improvement would be for researchers to reproduce all their protocols in the published record, but until that happens we have to deal with what was written.

        • When someone does DNA replication with non-sterilized equipment because the paper did not refer to equipment sterilization, nobody will use your logic to call it a “worthy” replication.

          The public record includes the relevant literature (usually referenced); even when it is not directly referenced, everyone in the discipline knows where to look. A replicator who has not read the basic textbooks of a sub-discipline is simply not qualified. That seems to me common sense.

          I think the old-people example illustrates that. I am hardly a priming expert. I have not read the relevant textbooks, etc. But I have read a popular book on the subject, and maybe ten papers.

          When *I* know a detail to be utterly obvious (that you prime with only a minority of the words in the puzzles, so that the priming stays subconscious), and a replicator was not only unaware of it beforehand but was so ignorant of the priming literature as to try to brush it aside afterwards!

          That is absurd.

        • Jazi and you have an interesting debate here. I think the situation is like the one with references, and the important point concerns author defenses. A reference should contain enough information to let the intended reader know how to find it. If it doesn’t, people can be justifiably offended by the effective lack of a reference. With replication, if there are any decisions in the procedure that aren’t obvious to the people who might replicate it, that’s the fault of the author, and he should be apologetic, not defensive, if the replicator says the author’s results are suspect.

  7. K, can you post an excerpt so I can see your argument? I do not have access to that book.

    I currently believe that this practice of making up a statistical hypothesis to test when there is no theoretical prediction available is at the root of the problems with statistics in the sciences.

    Scientists are actually “disproving” the opposite of their research hypothesis. This is illogical (or at least convoluted) and means that no theory can ever be falsified. Instead, theories can only fall in and out of fashion like fads.

    • Some commentators have even suggested that no randomized-trial result should be published without the inclusion of a meta-analysis in place of the narrative literature review section (O’Rourke and Detsky, 1989). Similarly, because of the rapid growth of epidemiologic literature, the traditional narrative review may no longer be a reliable way to summarize research in certain areas.

      Although we have no doubt that a well-conducted meta-analysis provides valuable information to put a new study in context and guide its design and interpretation, the thought and effort needed to conduct a reliable meta-analysis may be far too excessive to demand of authors of those studies. These authors will also have an intrinsic conflict of interest given their involvement in the new study that needs to be critically evaluated in the context of all other studies. This practical limitation has led some commentators to suggest that reports of single studies should be relieved of the tasks of formulating conclusions, and thus of reviewing the literature in detail. Instead, they suggest that single-study reports should focus on describing their methods and data in as much detail as feasible, to facilitate later meta-analyses of the topic (Greenland et al., 2004).

  8. I work in a gravitational wave data analysis group and we are very much concerned with the problem of separating real gravitational wave events (of which there are extremely few) from transient non-Gaussian instrumental noise artefacts (of which there are very many) of relatively large signal-to-noise ratio.

    The concepts of false positive and false negative fit this problem very well; I’m not sure how thinking in terms of type M / type S errors would bring any substantial benefit.

    We know there are gravitational waves passing through us all the time, almost all of them ‘undetectably’ weak, but in order to say we have detected one we need to clearly distinguish between the ‘real signal’ and ‘artefact’ event classes, which means setting up a comparison using a ‘no-signal’ null hypothesis – although we already know that this null is impossible and unphysical in reality.

    To rework this as a Type M error problem would involve bringing in arbitrary thresholds of loudness – i.e. we claim a detection if we have sufficient evidence that there was a real signal of greater strength than X – with the ‘false alarm’ being replaced by a severe overestimate of the strength of the real signal at the claimed event.

    We don’t need such a threshold, though, as in practice the noise sets its own threshold and the only thing we want or need is a sufficiently high Bayes factor for pure-signal against pure-noise, which again in practice is mapped to a sufficiently low false alarm probability.
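
    As a toy illustration of that last mapping (this is not a real gravitational-wave pipeline; the single-number statistic, the Gaussian noise model, the prior on the amplitude, and the observed value are all simplifying assumptions): if a whitened matched-filter output is modeled as N(0, 1) under pure noise and as N(A, 1) under signal plus noise, with a zero-mean normal prior on A, the Bayes factor and the noise-only false alarm probability can both be written down directly.

    ```python
    import numpy as np
    from scipy.stats import norm

    # Toy model, not a real detection pipeline. All numbers are illustrative.
    x_obs = 4.0          # observed whitened matched-filter output
    prior_sd_A = 5.0     # prior scale for the signal amplitude A

    # Under signal+noise the marginal distribution of x is N(0, 1 + prior_sd_A**2),
    # so the Bayes factor for signal-plus-noise against pure noise is a ratio of
    # two normal densities evaluated at the observed value.
    bayes_factor = norm.pdf(x_obs, 0, np.sqrt(1 + prior_sd_A**2)) / norm.pdf(x_obs, 0, 1)

    # "False alarm probability": chance that pure noise alone exceeds the observed value.
    false_alarm_prob = 2 * norm.sf(abs(x_obs))

    print(f"Bayes factor (signal vs noise): {bayes_factor:.1f}")
    print(f"false alarm probability (noise only): {false_alarm_prob:.2e}")
    ```

    In this toy version the Bayes factor is a monotone function of the observed statistic, and so is the tail probability, which is the kind of relationship that lets the former be mapped to a sufficiently low false alarm probability in practice.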

  9. Pingback: OK, sometimes the concept of "false positive" makes sense. - Statistical Modeling, Causal Inference, and Social Science

  10. Pingback: "A bug in fMRI software could invalidate 15 years of brain research" - Statistical Modeling, Causal Inference, and Social Science

  11. Pingback: Let's stop talking about published research findings being true or false - Statistical Modeling, Causal Inference, and Social Science

  12. Pingback: No, I don’t like talk of false positive false negative etc but it can still be useful to warn people about systematic biases in meta-analysis « Statistical Modeling, Causal Inference, and Social Science

Comments are closed.