Evaluating evidence from published research

Following up on my entry the other day on post-publication peer review, Dan Kahan writes:

You give me credit, I think, for merely participating in what I think is a systemic effect in the practice of empirical inquiry that conduces to quality control & hence the advance of knowledge by such means (likely the title conveys that!). I’d say:

(a) by far the greatest weakness in the “publication regime” in social sciences today is the systematic disregard for basic principles of valid causal inference, a deficiency either in comprehension or craft that is at the root of scholars’ resort to (and journals’ tolerance for) invalid samples, the employment of designs that don’t generate observations more consistent with a hypothesis than with myriad rival ones, and the resort to deficient statistical modes of analysis that treat detection of “statistically significant difference” rather than “practical corroboration of practical meaningful effect” as the goal of such analysis (especially for experiments). This problem is 1,000x as big as “fraud” or “nonreplication” (and is related at least to the latter, which is a predictable consequence of the substitution of NHT rituals for genuine comprehension of causal inference);

(b) the real peer review is *always* the one that happens after articles are published–how could this be otherwise? 1.5 or 3 in rare instances 4 people read a paper beforehand & many times that many after; didn’t someone write a paper recently on the need to apply knowledge on validity & reliability to the procedures we use for producing/disseminating/teaching such knowledge?!; and

(c) the “conclusion” or “finding” of a paper is never any stronger than — never anything *other than* — the weight that one can assign to the evidence it adduces in favor of a hypothesis that can never be deemed to be “proven,” so a paper can never ultimately have any more influence in a scholarly conversation than the validity and cogency of the causal inference its design genuinely supports (those determine weight).

To me this makes the impact of *bad* papers self-limiting. They will likely get published, even in good journals– b/c of (a). But because any problem genuinely worthy of being solved will always compel the sustained attention of serious people– ones who *get* that publication of a paper is merely an announcement (one that might well be wrong) that someone has generated some relevant evidence worthy of consideration & not a show-stopper “conclusive proof” of anything — (b) & (c) will inevitably blunt the impact of the bad papers that get published. So I don’t worry that much about publication of bad papers.

Note: this analysis excludes the problems w/ the “WTF” genre of studies, which aren’t about trying to solve real puzzles but instead about regaling people who think psychology is just “Ripley’s Believe it or Not” w/ ANOVAs. The harm of those papers isn’t that they’ll tempt us to accept *wrong* answers to important questions (the risk of the sort of bad papers I’m describing)–because they aren’t even addressing such problems; it is that (a) they will divert space in journals, and maybe creative effort by researchers who crave attention, away from papers that address real issues; and (b) they will diminish the credibility of social science in the minds of serious people.

I don’t know if I agree with Kahan’s claim that the impact of *bad* papers will be “self-limiting.” My reason for doubt is explained well by Jeremy Fox in his discussion of some findings that, when mistakes are published in the scientific literature, they tend to persist even after being corrected:

The data show that scientists rely on pre-publication peer review, to the exclusion of post-publication review. Once something has passed pre-publication peer review, the scientific community mostly either accepts it uncritically, ignores it entirely, or else miscites it as supporting whatever conclusion the citing author prefers. . . .

For better or worse, the only time most of us read like reviewers is when we’re acting as reviewers. Plus, pre-publication is the only time authors are obliged to pay attention to criticism. . . .

It’s not easy to criticize the work of others, because that often seems like criticizing the people who did the work, and nobody but a jerk enjoys criticizing other people. Pre-publication peer review is an institutionalized practice that gets around this very human desire to want to think well of one’s peers, and to have them think well of you. That’s why, as frustrated as I (and probably all of you) often get with pre-publication peer review, I’d like to see it reformed rather than replaced. . . .

Fox is talking about biology and ecology, but I suspect these problems are going on in other scientific fields as well, and Fox’s perspective seems similar to that of Nicolas Chopin, Kerrie Mengersen, and Christian Robert in our article, “In praise of the referee.”

But I brought this up right now not to discuss peer review but to emphasize that once a mistake is published, it’s hard to dislodge it.

Anyway, to continue with the main thread, here’s Kahan again:

I think the “WTF” findings are more likely to get “pounded in” than bad studies on things that actually matter. The things that matter are issues of consequence for knowledge or practice that usually admit of multiple competing explanations– the ones in the EOOOYKTA – “everything-is-obvious-once-you-know-the-answer” — set, which is where you will find *serious* social scientists laboring. There I think the life of an invalid study is likely to be short, even if it starts out w/ much fanfare. It is short, moreover, b/c it *lives* in the minds of serious people, who really want to know what’s going on. “WTF” is a kind of intellectual junk food, produced for people who generally don’t think critically. And, via the sort of science journalism you criticized in your Symposium article, it gets pounded “deeply and perhaps irretrievably into the recursive pathways of knowledge transmission associated with the internet.”

Science journalism is another one of the professions — like the teaching & propagation of knowledge relating to statistics — that is dedicated to transmitting information on what science has discovered through use of its signature method of disciplined observation & inference but that doesn’t use that method to assess its own proficiency in transmitting such insight.

A useful supplemental remedy to the one you propose — calling up lots of experts to see what they think — is for journalists simply to *read* scientific studies in the way they are supposed to be *read*: not for reports of “facts” discovered or conclusively proven, but as reports of the production of valid *evidence* that a thoughtful person could assimilate to everything else he or she knows to update an assessment that is itself subject to revision upon production of yet further valid evidence — forever & ever. A journalist who just reads the “intro” & “conclusion” — or just the university press release– & says “Science proves x!” not only doesn’t *get* the study. He or she can’t possibly be *telling the story* that a person who is genuinely interested in scientific discovery cares about. That *story* necessarily identifies the problem that motivated the researcher, describes the sort of observations a researcher collected to investigate it, explains the logic of the causal inference that connected those observations to a conclusion, and lays out the various statistical or other steps a researcher took to test and probe the strength of inference. A journalist ought to do that — just b/c he or she ought to; it’s the craft of the profession that person is in. But a journalist who does this routinely — who applies critical reasoning to a purported empirical proof — might well figure there’s a problem when a publisher press release announces that a researcher has “proven” that “people named Kim, Kelly, and Ken more likely to donate to Hurricane Katrina victims than to Hurricane Rita victims”!

I’m not so optimistic as Kahan here. For one thing, when I first encountered the “dentists named Dennis and lawyers named Laura” paper, I simply took it as true. Even now, after the paper has been subject to serious criticism, I still don’t know what to believe. I’m similarly on the fence regarding the Christakis/Fowler findings on the contagion of obesity. And, of course, Freakonomics (as well as, presumably, Kanazawa himself) got fooled by the beauty-and-sex-ratio study.

I think it’s fine for science reporters to read scientific papers, but I think it’s hard for any outsider to spot the flaws. If I can’t reliably do it and Steven Levitt can’t reliably do it, I think journalists will have trouble with this task too. As I wrote in my article, some of the problems of hyped science arise from the narrowness of subfields, but you can take advantage of this by moving to a neighboring subfield to get an enhanced perspective.

I’ll give Kahan the last word by linking to this recent post of his where he considers science communication in more detail.


  1. Bruce McCullough says:

    The impact of bad papers may be self-limiting if the bad papers can be identified. I believe that most bad papers are never identified; hence the self-correcting process of science cannot work. If I publish a bad paper that no one attempts to replicate, how will it ever be found out? It won’t. The idea that published results are true until false is not the way science should advance (think “cold fusion”). We should accept no result as true until it has been independently verified at least once.

  2. Bruce McCullough says:

    ooops: that should have been, “The idea that published results are true until PROVEN false is not the way science should advance.”

    • zbicyclist says:

      Can the process of peer review sometimes make it more difficult to tell which papers are bad? The first peer review might help clean up some of the messier parts of the paper — wouldn’t any author try to get rid of / downplay the aspects that led to rejection by the first set of reviewers, in order to better get acceptance by reviewers at a second journal?

      • Anonymous says:

        One would hope that authors would at least correct egregious errors pointed out by peer reviewers when their paper is rejected, but it doesn’t always happen. Look at this paper published in the journal Traffic Injury Prevention, which is described in the title and in the first sentence of the abstract as a case-control study. It is actually a cohort study (for those of you not familiar with epidemiological studies, to an epidemiologist, this is like the difference between black and white). (The authors named one cohort ‘cases’ and the other cohort ‘controls,’ thus, to them it became a case-control study.) I pointed out that error (along with many others) during peer review for another journal, which rejected the manuscript, but the authors did not correct even that most basic and obvious of errors.

    • zbicyclist says:

      Also relevant to McCullough’s contention is this quote from Ronald Coase, who’s got 3 articles on his passing in today’s WSJ (news, editorial, and op-ed by David R. Henderson). I’ll let Henderson put it in context:

      “Coase made many intellectuals uncomfortable by pointing out an obvious implication of their belief in government regulation: If regulation works so well in the market for goods, then it should work even better in the market for ideas.

      “Why? As Coase said in a 1997 Reason interview: ‘It’s easier for people to discover that they have a bad can of peaches than it is for them to discover that they have a bad idea.’

      “Many intellectuals thought Coase was arguing for government regulation of ideas. He wasn’t. His point was to get intellectuals to see that their case for regulating goods is weak.”

  3. jonathan says:

    It may be helpful to divide the reasons into intrinsic and extrinsic – or endogenous/exogenous or any inside/outside pairing that fits a jargon need.

    Intrinsic reasons include the reluctance of journals to correct and we recognize that can be motivated by a variety of reasons, such as not wanting to appear to need correction, not wanting to discourage submissions because authors fear corrections coming out later, not wanting to diminish the impact of papers published and thus journal and personal reputation, etc. These issues are, as you I think have noted, addressable using the methods of your fields: do studies that look at papers, check for errors & replicability where possible and develop results which then can be used to drive better practices, if only by exposing and thus punishing anti-correction motives.

    Extrinsic would include ideology. The R&R paper, for example, was grasped so firmly by people who wanted it to be true that I suspect it will be quoted for decades. The ultimate example may be Galen: his almost completely wrong views about disease became embedded in the religious structure which dominated Western society for the next 1300+ years. Extrinsic motives prevented even examination; if we presume truth, it need not be tested because testing implies doubt about what we presume is truth. Other than pointing out this kind of blindness is a good rationale for ideological diversity – because you never know what the minority beliefs preserve and develop – I have no idea how to address these extrinsic problems in a way that isn’t masturbatory. Yes we can construct ideological scaling, etc. but I don’t see how any form of judgement matters versus belief that negates judgement.

  4. K? O'Rourke says:

    I would argue prevention is much better than cure.

    At least in clinical research there usually is a bunch of studies (e.g. 7 +/- 2) almost always with more low quality than high quality. Low quality studies can be far from the truth and make the high quality studies look like outliers. But agreeing with Bruce, it’s extremely hard to assess the quality of the studies and we even expect the lower quality studies are written up to look like they were the high quality studies. Additionally, any modeling or adjustment for assessed quality will seldom if ever be very convincing.

    And lastly as an example the preferred method for adjusting Correspondence plots was totally wrong in the SAS manual for about 10 years before I contacted Greenacre and SAS quickly issued a correction. There were literally hundreds of published papers that simply quoted from the SAS manual (and even a couple textbooks) and as far as I know most of them have never been corrected.

  5. Nathan says:

    I found this bit interesting, as a contrast to my experience in the physical sciences:

    “(b) the real peer review is *always* the one that happens after articles are published–how could this be otherwise? 1.5 or 3 in rare instances 4 people read a paper beforehand & many times that many after;”

    When working in large collaborations, it is common for drafts to be distributed to several dozen co-authors before submission. Not all of these collaborators will carefully review the work they are signing their names to, but many do and provide substantive comments. Often the paper will then be posted to the arXiv after submission, and before publication, allowing for another round of mass peer review.

    This may just be an acknowledgement of laziness, but I am probably less likely to provide feedback to an author on a paper that has already been published, as opposed to one that has been submitted to the arXiv during peer review, because I know my comments can have no effect on the final version.

  6. Winston Lin says:

    Andrew, “In praise of the referee” is a great article. Hope you’ll make it the center of another blog post sometime. Thanks for sharing it.

  7. […] and very serious statistical flaws. I have an old post that draws the same conclusion (and in a follow-up post Andrew was kind enough to quote that old post of mine). Pre-publication review isn’t perfect. […]
