“Statistical chemotherapy”: Jeremy Freese adds a new item to the lexicon!

In the context of reporting the latest on hurricanes/himmicanes, Freese comes up with a new one for the lexicon. Considering the latest manipulations performed by the hurricanes/himmicanes people, the sociologist writes:

Like statistical chemotherapy, even though it slightly poisons their key result, it still leaves it alive just below the conventional statistical cutoff (p = .035). But the diseased result in the wrong direction is now above the .05 threshold . . .

Bingo! An excellent contribution from the author of the quip, “more vampirical than empirical—unable to be killed by mere evidence” [we quoted here, for example] which applies to so much published research, on topics ranging from hurricanes to, ummmm, I dunno, ovulation and voting?

I find it distressing that people have so much difficulty admitting they could’ve made a mistake. I do attribute some of my own willingness to admit error to my training as a statistician. Maybe another thing that helped me along was the bad experience I had with my colleagues when I worked at the University of California, which gave me some appreciation for pluralism in a very general sense.

The ethics question

Another question is: at what point does stubborn denial and refusal to accept criticism (which is how I read the behavior of the hurricanes/himmicanes researchers) edge into flat-out unethical scientific behavior? I certainly don’t think it’s unethical for researchers to make mistakes or to publish mistaken work. But at some point, if they keep fighting, it makes me wonder whether at some level they realize what’s going on.

My take on it (without knowing these people personally) is that they started the exchange with Freese and others under the reasonable assumption that they’d done things correctly (after all, their paper got through peer review at the prestigious tabloid Proceedings of the National Academy of Sciences), but at some point when the criticisms started coming in faster and harsher, they switched to a war footing. They felt they were being attacked. And once you’re in a war (and you’re convinced that “the other side” will use any statistical tactic to get you), you feel that any statistical tactic is allowed in response. Nowhere do they give evidence that they ever stepped back to think: Hey, maybe we’re wrong!

That progression is all understandable to me. Still, as scientists we have some responsibility to the public. And to keep defending and defending the way they’re doing seems unethical to me. At some point, as a researcher, you have to shift from ignorance as a defense to a recognition of your limitations. If you don’t do that, you are in some sense violating a public trust.

P.S. This discussion from Bob O’Hara shows a revealing quote from the hurricane/himmicane authors:

The ladder test uses Stata’s sktest to determine normality transforms, based on the work of D’Agostino, Belanger, and D’Agostino Jr (1990), with an adjustment made by Patrick Royston, (1991), one of the leading statisticians worldwide in smoothing and transforms. He is the original author of fractional polynomials, which is much better than GAM at smoothing on complex situations. The results are the same with the Shapiro-Wilk and Shapiro-Francia tests for normality.

This paragraph indicates an unfortunately common attitude: that statistics is a collection of tests which can be applied without serious consideration of the application in question. At some level, of course, statistics is the science of defaults, and I wouldn’t write books full of statistical methods if I didn’t think they could be applied in some generality—but the quote just above gives a sense of how this attitude can go wrong.
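
To make concrete what this kind of mechanical workflow looks like, here is a minimal sketch in Python (not the authors’ Stata code): scipy’s normaltest, the D’Agostino-Pearson K² test, stands in for Stata’s sktest, and the “damage” data are simulated purely for illustration.

```python
# A minimal sketch (not the authors' actual analysis) of a ladder-of-powers
# search: try a sequence of power transforms and test each for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
damage = rng.lognormal(mean=8, sigma=2, size=92)  # toy "damage" values

ladder = {
    "identity": lambda x: x,
    "sqrt": np.sqrt,
    "log": np.log,
    "1/sqrt": lambda x: -1 / np.sqrt(x),
    "inverse": lambda x: -1 / x,
}

for name, f in ladder.items():
    stat, p = stats.normaltest(f(damage))  # D'Agostino-Pearson K^2 test
    print(f"{name:>8}: K^2 = {stat:6.1f}, p = {p:.3f}")

# Picking whichever transform "passes" a normality test, with no thought about
# what the variable means in the application, is the mechanical attitude at issue.
```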

43 thoughts on "“Statistical chemotherapy”: Jeremy Freese adds a new item to the lexicon!"

  1. Pingback: Handy statistical lexicon « Statistical Modeling, Causal Inference, and Social Science

    • Rahul:

      As I’ve said before, I wouldn’t be so hard on the referees. Referees work for free and get no credit for their good calls so I don’t think it’s fair to blame them for their misses.

        • It’s the editors who make the decisions, so they’re the ones who should be held accountable. The name of the editor is given with the paper, so we know who should carry the can.

        • Russ:

          In this case, I don’t think the problem is “due diligence” so much as that these tabloid journals (I’ll also include Psychological Science in that category) seem to favor headline-grabbing claims and statistically significant p-values. I’d guess that, in this case, a more reasonable paper (for example, presenting the claims as exploratory and recognizing the implausibility of the model in certain settings) would have essentially zero chance of acceptance. For the paper as it stands to have been accepted is just a matter of luck (it happened not to be sent to the right reviewers), but if the paper were more reasonable I don’t think it would’ve had a chance at all. I’m certainly not saying that everything, or even most, of what PNAS publishes is crap, but journals of this sort do seem to have a problem with social science, and it’s not clear to me that a longer review time would do much, given the commonly held view that if a claim is statistically significant and dramatic, it should be published in a prominent place.

        • Andrew,

          I agree with you. I was just responding to Rahul’s comment about due diligence. But these are not unrelated. If you have almost no time to review a paper, how will you judge it? The easiest thing is how exciting its claims are. The journal is itself sending such a signal.

        • If a journal policy prevents you from doing due diligence, I think the right thing to do is to refuse to be a reviewer.

          It’d be interesting to compile a list of days-to-review of various journals. Is this data available somewhere? On their websites?

          Also looking at how obviously crappy some of these recent papers are, 10 days is a generous amount of time. Unless that’s my hindsight bias at play.

        • To turn the question around: How many days (weeks? months?) do you think is a reasonable time frame to referee a paper?

        • Rahul:

          It can depend on the paper. In my field, math, 2-3 months is the norm, but people sometimes take a year (which they should not).

          These things (in any field) are not so earth-shatteringly important that they need to get into print immediately. In math, we commonly post preprints anyway, so people can indeed see them immediately.

  2. It’s funny, the point you are covering about being wrong and still persisting. This is a topic that Paul Krugman has been reflecting on recently.

  3. Freese has been great on this.

    Did the authors ever come up with an explanation for why low-damage storms with male names kill more people than similar storms with female names? Most storms are low damage and I believe this effect cancels out the publicized effect in total. Seems rather important from a public policy point of view (eyeroll).

    • The effects actually don’t cancel out. The model is on the log scale, so they cancel out on that scale, but if you back-transform to the deaths scale, the effects at higher damage are much larger. I plotted this too.

      I’m amused that this was actually the claim in the title of the paper, and what got the most attention, but the authors never actually demonstrated it!
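
      A small numeric sketch of that back-transformation point, using made-up coefficients rather than the paper’s estimates: equal-and-opposite effects on the log scale translate into very different numbers of deaths once you exponentiate.

      ```python
      # Illustrative numbers only (not the paper's fitted coefficients): suppose a
      # model on log(deaths) gives equal-and-opposite name effects of +/-0.4 at
      # low and high damage, around different baselines.
      import numpy as np

      low_base, high_base = np.log(2.0), np.log(40.0)  # log expected deaths
      effect = 0.4                                     # log-scale name effect

      # Low damage: male-named storms predicted deadlier; high damage: female-named.
      low_diff = np.exp(low_base + effect) - np.exp(low_base)     # ~ +1.0 death
      high_diff = np.exp(high_base + effect) - np.exp(high_base)  # ~ +19.7 deaths

      print(f"extra deaths at low damage:  {low_diff:.1f}")
      print(f"extra deaths at high damage: {high_diff:.1f}")
      # The +/-0.4 effects cancel on the log scale, but on the deaths scale the
      # high-damage difference is roughly 20 times larger.
      ```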

  4. Reading all these discussions, remodelings, and blog articles that have been written about the hurricane paper, I’m starting to think that science might even work again. People are actually interacting with each other and doing – well … – Open Science. Of course the original paper was a mess, but I actually like how the academic blogosphere and methodologists (to use a more general term than statisticians) seem to be getting more and more involved in scientific discussion and criticism outside of the realm of journals and conferences. I would even like to applaud the authors for not hiding and for reacting openly to the criticism.

    • Daniel:

      Agreed. But . . . even better would be if we were having this discussion around a serious piece of research. If all the pixels spilled on the sexy topics of himmicanes, ovulation and voting, fat arms and political attitudes, beauty and sex ratio, etc etc etc, were instead devoted to more serious topics such as early childhood intervention (a topic where well-publicized research also has some statistical concerns; see here and here), then I think science would be working even better!

      And, yes, I recognize that I’m contributing to this by blogging on bad research, and I’m sympathetic to Jeff Leek’s argument that we’d be better off ignoring the crap and focusing on the good stuff.

      The trouble is that the crap gets so much publicity. But perhaps Jeremy and I are making things worse by our violation of the don’t-feed-the-trolls principle.

      • Andrew:

        I do agree that there are more important areas of research, but public discussions of bad quantitative work, and especially attempts to show how to make it better, might have a good influence on researchers who care but did not know better. So it is important that this kind of discussion receives attention, and we cannot really control which research will get attention anyway. I don’t think it’s a good idea to ignore “the crap,” because then everyone can just throw out crap without any incentive to care about good scientific practice. Just because a topic is less serious or important does not mean we can throw out methodological rigor. Pointing out flaws in weak research and explaining what went wrong lets people learn a lot about statistics. I honestly think I learned more about (applied) statistics from your criticism of “himmicanes, ovulation and voting, fat arms and political attitudes, beauty and sex ratio” than from any university course or my own applied work.

      • I think there are two quite orthogonal issues here: (a) bad methodological research & (b) frivolous topic choice.

        It’d be entirely possible to do quantitatively sound studies of, say, fat arms & political attitudes.

        The emergence of (b) I attribute to a Freakonomics-like effect: a glorification of big methods applied to sexy but relatively unimportant topics.

        Sometimes I think it’s a blessing that these crappy studies are on frivolous topics: I’d rather have p-value fishing on a red-dress-fertility study than on some surgery efficacy analysis.

        • Why do you think that there are no “crappy studies” on some surgery efficacy analysis? Well I don’t know how things are in that specific area but we have seriously flawed research in medicine all the time. Lack of statistical literacy is a widespread problem.

        • Of course there are. I meant it in the sense: Every crappy study of a frivolous topic I encounter makes me happy thinking how much worse the alternative could have been.

        • Rahul,

          At least it keeps the kids off the street? I’ve been thinking for a while now that the primary purpose of academia is as a jobs program; any additional benefits are incidental.

        • There are many problems. Yes. The methodological knowledge is often limited. Yes. Most researchers I know are very serious about science, though. Nobody would be crazy enough to become a scientist to get “a job”. Okay, maybe someone is but at least in Germany it’s not the most clever career path to choose.

        • Daniel,

          At least in the biomedical field, the vast majority of researchers don’t know any better. They have been so confused by strawman NHST that they can’t think clearly when discussing evidence. That is my explanation for why the field is full of n=1, n=3, n=10 results under cherrypicked, “optimized” conditions that are never replicated. Another thing that happens is that, among the results that are replicated, controls have been missing for decades of experiments, because a significant p-value is misinterpreted to mean your explanation is the correct one. No one appears to read the papers carefully; the authors’ interpretations are just parroted.

          If 70-90% of claims are wrong, why not just flip a coin? That is why I think the main reason to keep funding such research is keeping people off the street. Some good references here:
          http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology

  5. I think that, perhaps, researchers are also less likely to take a step back and actually think about whether they did anything wrong because they get so much criticism. During my studies, there has always been a strong focus on being critical, up to the point where some people go searching for stuff to criticize, no matter how small. Combine this with more criticism from researchers who are simply of the opinion that their way is better than yours (while both ways are possible), and a defensive reaction to all criticism evolves, including to criticism that points to really major flaws (and should thus not be handled defensively). I am not saying researchers should be less critical of each other’s work – I actually believe this is really important – just that it may account for inherent defensiveness in (some) researchers.

    • This reminds me of the discussion about falsification in science and that it’s not possible for a researcher to really have a “fallibilistic” mind set. While I do hope and wish for everyone to be as critical with his own work as possible, as a social scientist I do believe that it ought to be the (scientific) community which does the criticism and falsification of each other’s work. Therefore I do not care very much about the defensiveness of some researchers, to be honest. While the factual institutionalization of peer review in the social sciences seems to have negative effects on the idea of active and concrete criticism, the recent developments in the academic blogosphere give me some hope. I think we need more of this criticism, not less.

      • Daniel,
        “it’s not possible for a researcher to really have a “fallibilistic” mind set.”

        If someone wants to present an idea as scientific, they should be able to tell others what evidence would make them abandon the idea. If this is not possible then I doubt the researcher is actually doing science.

        • question:

          I feel somewhat misquoted here. While I do agree that nobody can be a perfect fallibilist, that claim was part of “the discussion” I was summarizing, not my own position; my English was somewhat sloppy there. Sorry. I do agree with you in general that scientists should try to tell others what evidence would make them abandon an idea, but I think it’s usually fine that scientists defend their own ideas and critique the work of others. You don’t have to falsify your own work, and it’s also fine to try to defend your work. It’s actually an important part of science (see how defending Newtonian celestial mechanics brought huge developments and insights in the 19th century).

          My point was that the most important part of the fallibilistic enterprise works through social mechanisms – scientific discussion and criticism! – not through the super powers of individual scientists.

        • Daniel,

          I see your point that the falsification usually comes from elsewhere. However, the scientist with the “most to lose” from a falsification should set a kind of upper bound on what is required to abandon the idea. Without such a statement, made before the further experiments are done, we end up with endless argument until a theory becomes unfashionable. Such statements are all too rare, so I do not mean this as a description of how “science” works, but rather as a prescription. Requiring such statements to be made for publication seems like a simple, cheap way to improve the quality of scientific discourse.

          I consider most of what falls under the label “science” these days to be something else, some kind of mix of exploratory research and cargo culting.

    • @Sara:

      Reminds me of a quip (Twain I think): Interviews are always difficult because the stupidest group of men can ask more questions than the wisest man can ever answer.

      Sometimes academic criticism is a bit like that.

  6. I wonder if we are perhaps missing the core problem.

    Is there a mechanism for them to correct the research without retracting the paper? After all, if only in the past 50 years or so, the system has been established such that a PNAS paper is a major claim on a CV. We can do a self-righteous dance around this topic, but people are genuinely afraid for their careers in many cases. Existential fear is an enemy of virtue.

    The story reads differently if the resistance is based on a deep-seated feminist ethic such that they read the critique as a product of ideological opponents. Then we are seeing a fundamental mistrust of the critics – any critic is automatically an enemy. This could motivate a similar response but raises an entirely different set of difficulties.

    • I’m not sure what you are aiming at with your remark about “deep-seated feminist ethics”. I have deep-seated feminist ethics and think that the methodological criticism of this paper is useful, necessary, and even somewhat important. Anyone using feminism as a reason to discredit methodological criticism is abusing it, but this seems like a strawman to me anyway. I haven’t seen any feminist imply that the criticism of the paper is just ideological opposition.

  7. In line with the recent Badges movement I propose a new badge: the “I recognize and learn from mistakes” badge that scientists could add to their profiles, articles, etc.

    The badge would indicate adherence to a set of ethical principles along the lines discussed by Andrew. E.g., “I will step back and consider the interest of the public,” etc.

  8. Andrew, I’m curious, as you are criticizing the idea of statistics as a “collection of tests which can be applied without serious consideration of the application in question,” how that would apply to something like the Box-Cox transformation. Would you advise against such formalized reasoning for transformations?

    • Daniel:

      There perhaps are some applications where it makes sense to estimate a transformation from data in that way, but I don’t think what these researchers are doing makes much sense in the hurricanes example.
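
      (For concreteness, here is a minimal sketch of what estimating such a transformation from data looks like in Python, using scipy.stats.boxcox on simulated data; this is not the hurricane analysis, just an illustration of the formal procedure being discussed.)

      ```python
      # A minimal sketch of estimating a Box-Cox transformation from data; the
      # data here are simulated, not the hurricane data.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      x = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # positive, right-skewed data

      x_transformed, lmbda = stats.boxcox(x)  # lambda chosen by maximum likelihood
      print(f"estimated Box-Cox lambda: {lmbda:.2f}")  # near 0, i.e. close to a log

      # Whether this data-driven choice of lambda makes sense depends on the
      # application, which is the point under discussion here.
      ```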

      • Andrew:

        I think we agree that the normalizing procedure seems very odd, so we don’t need to discuss that. I don’t even know why they would want to normalize the predictor, but that may be me missing something. The problem, though, is not how they did it (from the data) but that they did it at all, right?
