
Suggested resolution of the Bem paradox

There has been an increasing discussion about the proliferation of flawed research in psychology and medicine, with some landmark events being John Ioannidis’s article, “Why most published research findings are false” (according to Google Scholar, cited 973 times since its appearance in 2005), the scandals of Marc Hauser and Diederik Stapel, two leading psychology professors who resigned after disclosures of scientific misconduct, and Daryl Bem’s dubious recent paper on ESP, published to much fanfare in the Journal of Personality and Social Psychology, one of the top journals in the field.

Alongside all this are the plagiarism scandals, which are uninteresting from a scientific standpoint but are relevant in that, in many cases, neither the institutions housing the plagiarists nor the editors and publishers of the plagiarized material seem to care. Perhaps these universities and publishers are more worried about bad publicity (and maybe lawsuits, given that many of the plagiarism cases involve law professors) than they are about scholarly misconduct.

Before going on, perhaps it’s worth briefly reviewing who is hurt by the publication of flawed research. It’s not a victimless crime. Here are some of the malign consequences:

– Wasted time and resources spent by researchers trying to replicate non-findings and chasing down dead ends.

– Fake science news bumping real science news off the front page.

– When the errors and scandals come to light, a decline in the prestige of higher-quality scientific work.

– Slower progress of science, delaying deeper understanding of psychology, medicine, and other topics that we deem important enough to deserve large public research efforts.

This is a hard problem!

There’s a general sense that the system is broken with no obvious remedies. I’m most interested in presumably sincere and honest scientific efforts that are misunderstood and misrepresented into more than they really are (the breakthrough-of-the-week mentality criticized by Ioannidis and exemplified by Bem). As noted above, the cases of outright fraud have little scientific interest but I brought them up to indicate that, even in extreme cases, the groups whose reputations seem at risk from the unethical behavior often seem more inclined to bury the evidence than to stop the madness.

If universities, publishers, and editors are inclined to look away when confronted with out-and-out fraud and plagiarism, we can hardly be surprised if they’re not aggressive against merely dubious research claims.

In the last section of this post, I briefly discuss several examples of dubious research that I’ve encountered, just to give a sense of the difficulties that can arise in evaluating such reports.

What to do (statistics)?

My generic solution to the statistics problems involved in estimating small effects is to replace multiple comparisons by multilevel modeling, that is, to estimate configurations rather than single effects or coefficients. This tactic won’t solve every problem but it’s my overarching conceptual framework. There’s lots of room for research on how to do better in particular problem settings.
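To make the idea concrete, here is a minimal sketch of partial pooling in Python. It is an illustration under simplifying assumptions, not a full multilevel analysis: the function name `partial_pool` is hypothetical, the data are simulated, the between-group variance is estimated by a crude method of moments, and the grand mean is unweighted. The point is only that estimating all the effects together shrinks the noisy ones toward the group mean, so the most extreme estimates stop looking like discoveries.

```python
import random
import statistics


def partial_pool(estimates, std_errors):
    """Shrink noisy per-group estimates toward their common mean.

    Normal-normal hierarchical model with a method-of-moments estimate of
    the between-group variance tau^2 (an illustrative simplification).
    """
    grand_mean = statistics.fmean(estimates)
    # Observed spread = between-group variance + average sampling variance.
    observed_var = statistics.variance(estimates)
    mean_sampling_var = statistics.fmean(se ** 2 for se in std_errors)
    tau2 = max(observed_var - mean_sampling_var, 0.0)
    pooled = []
    for est, se in zip(estimates, std_errors):
        # Weight on the group's own data; 0 means complete pooling.
        weight = tau2 / (tau2 + se ** 2)
        pooled.append(grand_mean + weight * (est - grand_mean))
    return pooled


# Simulated data (assumption): 20 comparisons whose true effects are all zero.
random.seed(1)
std_errors = [1.0] * 20
raw = [random.gauss(0.0, se) for se in std_errors]
shrunk = partial_pool(raw, std_errors)

n_raw_sig = sum(abs(e) > 1.96 * se for e, se in zip(raw, std_errors))
print("raw estimates beyond 1.96 SE:", n_raw_sig)
print("largest |raw| estimate:", max(abs(e) for e in raw))
print("largest |shrunk| estimate:", max(abs(e) for e in shrunk))
```

When the observed spread of the estimates is no larger than what sampling noise alone would produce, the estimated between-group variance is zero and every estimate is pulled all the way to the common mean. A real analysis would fit the model properly, with uncertainty in the variance parameter, rather than plugging in a point estimate.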

What to do (scientific publishing)?

I have clearer ideas of resolutions (at least in the short term) of the Bem paradox; in short, what to do with dubious but potentially interesting findings.

So far there seem to be two suggestions out there: either the top journals should publish such claims (as for example Bem’s in JPSP, or the contagion-of-obesity paper in NEJM), or they should reject them (perhaps from some combination of more careful review of methodology, higher standards than classical 5% significance, and Bayesian skepticism).

The problem with the publish-in-top-journals strategy is that it ensures publicity for some mistakes and it creates incentives for researchers to stretch their statistics to get a prestigious publication.

The problem with the reject-’em-all-and-let-the-Arxiv-sort-’em-out strategy is that it’s perhaps too rigorous. So many papers have potential methodological flaws. Recall that the Bem paper was published, which means in some sense that its reviewers thought the paper’s flaws were no worse than what usually gets published in JPSP. Long-term, sure, we’d like to improve methodological rigor, but in the meantime a key problem with Bem’s paper was not just its methodological flaws, it was also the implausibility of the claimed results.

So here’s my proposed solution. Instead of publishing speculative results in top journals such as JPSP, Science, Nature, etc., publish them in lower-ranked venues. For example, Bem could publish his experiments in some specialized journal of psychological measurement. If the work appears to be solid (as judged by the usual corps of referees), then publish it, get it out there. I’m not saying to send the paper to a trash journal; if it’s good stuff it can go in a good journal, the sort where peer review really means something. (I assume there’s also a journal of parapsychology but that’s probably just for true believers; it’s fair enough that Bem etc would like to publish somewhere that outsiders would respect.)

Under this system, JPSP could feel free to reject the Bem paper on the grounds that it’s too speculative to get the journal’s implicit endorsement. This is not suppression or censorship or anything like it, it’s just a recommendation that the paper be sent to a more specialized journal where there will be a chance for criticism and replication. At some point, if the findings are tested and replicated and seem to hold up, then it could be time for a publication in JPSP, Science, or Nature.

From the other side, this should be acceptable to the Bems and Fowlers who like to work on the edge. You still get your ideas out there in a respectable publication (and you still might even get a bit of publicity), and then you, the skeptics, and the rest of the scientific community can go at it in public.

There have also been proposals for more interactive publications of individual articles, with bloglike opportunities for discussion and replies. That’s fine too, but I think the only way to make real progress here is to accept that no individual article will tell the whole story, especially if the article is a report of new research. If the Bem finding is real, this can be demonstrated in a series of papers in some specialized journal.

Appendix: Individual cases can be tough!

I’ve encountered a lot of these borderline research findings over the past several years, and my own reaction is typically formed by some mix of my personal scientific knowledge, the statistical work involved, and my general impressions. Here are a few examples:

“Beautiful women have more daughters”: I was pretty sure this one was empty just based on my background knowledge (the claim was a difference of 8 percentage points, which is much more than I could possibly expect based on the literature). Careful review of the articles led me to find problems with the statistics.

Dennis the dentist, Laura the lawyer, and the proclivity of Dave Kingman and Vince Koleman to strike out a lot: I was ready to believe the Dennis/Laura effect on occupations and only slightly skeptical of the K effect on strikeouts, but then the work was later strongly criticized on methodological grounds. Still, my back-of-the-envelope calculation led me to believe that the hypothesized effects could be there.

Warming increases the risk of civil war in Africa: This one certainly could be true but something about it rang some bells in my head and I’m skeptical. The statistical evidence here is vague enough that I could well take the opposite tack, believing the claim and being skeptical about skepticism of it. To be honest, if I knew these researchers personally I might very well be more inclined to trust the result. (And that’s not so silly: if I knew them personally I could ask them a bunch of questions and get a sense of where their belief in this finding is coming from.)

“45% hitting, 25% fielding, and 25% pitching”: I was skeptical here because it was presented as a press release with no link to the paper but with enough details to make me suspect that the statistical analysis was pretty bad.

“Minority rules: scientists discover tipping point for the spread of ideas”: I don’t know if this should be called “junk science” or just a silly generalization from a mathematical model. Here I was suspicious because the claim was logically inconsistent and the study as a whole fit the pattern of physicists dabbling in social science. (As I wrote at the time, I’ll mock what’s mockable. If you don’t want to be mocked, don’t make mockable claims.)

“Discovered: the genetic secret of a happy life”: There’s potentially something here but the differences are much smaller than implied by the headlines, the news articles, or even the abstract of the published article.

Whatever medical breakthrough happens to have been reported in the New York Times this week: I believe all of these. Even though I know that these findings don’t always persist, when I see it in the newspaper and I know nothing about the topic, I’m inclined to just believe.

That’s one reason the issue of flawed research is important! I’m as well prepared as anyone to evaluate research claims, but as a consumer I can be pretty credulous when the research is not close to my expertise.

If there is any coherent message from the above examples, it is that my own rules for how to evaluate research claims are not clear, even to me.


  1. Erin Jonaitis says:

    I find it weirdly reassuring that someone I admire has no clear answer for individual cases. Reassuring, I guess, because it supports my own sense that this stuff is actually harder than the popular image of science lets on.

    Your suggestion about what to do with implausible findings is interesting. The biggest difficulty I see is getting buy-in from the top journals. Science, Nature, and the like *want* to publish attention-grabbing findings; they don’t necessarily benefit much from publishing stuff that is already known, even if it’s known largely by a small group of experts in one area. This seems to me a fundamental aspect of journal economics and I’m not sure how to change it — unless we can find some way to increase the perceived cost to the journal of publishing something that later turns out not to be true. Dock journals’ impact scores for each retraction published? (Then that creates a bigger incentive to bury the dirty laundry.) Start a popular snark site making fun of dumb journal articles? (The key is that it has to be popular to work — though I guess the Ig Nobels have a headstart. It’s also a risky strategy as far as the funding climate for science goes.)

    I suspect in fifty years we will not have the same academic publishing system we do today. Heck, it might look pretty different in as few as ten. The one you mention (a more interactive, open review process — I’m picturing a sort of stackoverflow for scientific papers, with upvotes and such) is a possibility, and it could get rid of some of the perverse incentives at work here — if academics could be incentivized to use it! I agree that one article will never tell the whole story, of course; the challenge is what to do with that in a field where it’s very challenging to publish pure replications, and hence pretty pointless to do them at all.

  2. Paul says:

    I definitely agree with your statistical solution to the problem. It brought to mind research by a fellow grad student back in the 80’s who was applying a model to each of the 50 states, which I suggested handling in this manner. Stuff like this seems ideal for this treatment.

    … I was initially a little confused by the title of your post because I misinterpreted ‘Bem’ as ‘Bug-eyed monster’.

  3. Fernando says:

    I really liked your concluding sentence!

    To me it highlights a problem with your normative argument as to where Bem should have sent his paper. If we don’t know how we evaluate science, how can we even begin to modify the scientific process?

    In my view, before recommending normative solutions, it is good practice to have a positive diagnosis. In this case a diagnosis involves a theory about the sort of games scientists play.

    Namely, a theory specifying Who are the players (e.g. professors, reviewers, publishers, funders, corporations, etc); What are their possible actions; Who has the initiative; What is the information structure; What are the outcomes; and how do different players value these outcomes.

    Having more information about the game generating such perverse outcomes can then help guide our suggestions for improvement.

    Some people disagree with this approach. Randomistas would just try one intervention after another until something works. That is, in effect, an evolutionary approach – progress through trial and error. We know it works, but it also tends to be a very slow process. Presumably there are more efficient ways to do this, and theory might help.

  4. Jeremy Fox says:

    Um, do you mean Vince Coleman? Whose name starts with a “C”? There is no “Vince Koleman” in the database. And while Vince Coleman did strike out a fair bit (especially for a singles hitter), his propensity to do so was nowhere near that of Dave Kingman, or of many other sluggers whose names don’t start with K. Coleman was known for stealing bases, not for striking out.

  5. K? O'Rourke says:

    > If we don’t know how we evaluate science, how can we even begin to modify the scientific process?
    John Tukey’s warning not to base this on mere technical knowledge seems worth repeating here.

    > accept that no individual article will tell the whole story
    I believe that is the key point especially given the selective reporting.

    It took me a long time to get that clear in my mind and some others’ minds. As a graduate student I helped some clinical researchers I was employed by to figure out some stats methods for meta-analyses and we did a few and published our efforts. Soon after these were printed, I was ostracised by all of my fellow graduate students and probably all the faculty for having been so statistically naïve. I discovered that when one of the students, who was also at my martial arts club and felt obligated by honour to warn me, told me about it. A couple were going to present a talk claiming (reworded in modern technical terms) that any parameter for selective publication was not identified given the published study for any reasonable probability model one might use for it (best work showing how it is done is current by X at Warwick). I was told by them I would receive an abstract an hour before they gave the talk. My supervisor also did talk to me about it vaguely, suggested I had missed something and lost confidence in me (I believe). I never completed that degree, but a number of years later gave a talk in the statistics department on meta-analysis. I argued that the bias from selective reporting did not arise from collecting the studies together but instead affected each study _individually_ (unless it could be shown that study was not exchangeable i.e. never subject to the selection possibility). It seemed to be what was needed to have been pointed out, and my old supervisor thanked me for making the point and seemed much more comfortable with me afterwards.

    • Fernando says:

      If I read you correctly, you are arguing that the selection bias does not operate only at the level of what gets published, but also at the level of what questions are asked.

      That makes sense. Forward-looking individuals will likely factor in any publication bias very early on in their research.

      If so, the publication bias could affect what is published, what is rejected, what is asked, and what methods and materials are chosen.

      Downstream, of course, they might affect policy, regulation, consumer behavior and so on.

      • K? O'Rourke says:

        Yes, any (non-random) process that affects what ends up in publications.

        Better fleshed out in the clinical RCT field, and they do have names for different types.

        My fellow student, An-Wen Chan, tracked the outcomes reported in publications to the primary outcome given in the ethics applications. Originally the ethics boards refused to be involved as they thought such tracking would be un-ethical. But he found a _responsible_ ethics board and documented an important effect. But statisticians should know that from the theory – it’s not a big stretch…

        Hope people will take up Andrew’s challenge and think this through for their own fields – and then look to what the clinical RCT field is doing.

        By the way meant to replace the X with John Copas from Warwick.

  6. Ian Fellows says:

    I worry about explicitly having journals reject papers on the basis of ‘implausibility.’ In some sense, implausibility is just code for subjective bias. Sure it might be bias based on knowledge of the field, and years of experience, but then again it might be bias against ideas that contradict the editor’s own research program.

    It also would give credence to the widely held belief in the public that scientists suppress (i.e. reject from journals) articles that don’t support their agenda. This conspiracy theory about the scientific process is much more dangerous than the occasional false positive, which because of its implausibility, gets investigated thoroughly and immediately. In the case of Bem, the week was not out before the ‘failed to reproduce’ studies started rolling in.

    • Andrew says:


      I don’t think all journals should be rejecting papers on the basis of implausibility. I just don’t think Bem-like papers need to be appearing in the top journals. There’s a lot of solid, interesting, and new work that can be appearing in these top journals; I think the speculative work is better placed in specialized journals where the subject-matter experts can read, criticize, and (try to) replicate. Again, I’m recommending specialized but good journals, I’m not talking about trash journals here. I’m assuming that if the Bem paper was deemed good enough to appear in JPSP, it also could’ve been published in a lesser-ranked specialized journal of psychology.

    • Fernando says:

      “the occasional false positive”

      Umh, I’d say that what is truly occasional is a “true” positive.

      Some large replication studies support this view.

  7. ChristianKl says:

    The list of possible malign consequences is pretty interesting in what it leaves out.
    It seems to only care about the effects that false research has on the scientific community.
    Science isn’t only done for the sake of science.

    In the real world people take actions based on false knowledge that they got through reading false science.
    People can die when they take the wrong drug because the research was faulty.

    “At some point, if the findings are tested and replicated and seem to hold up, then it could be time for a publication in JPSP, Science, or Nature.”
    The problem is that those journals don’t want to publish studies that replicate already existing knowledge. Scientists don’t cite studies that replicate existing research enough and therefore the journals don’t want them.

    Maybe the goal should be about changing the paradigm to cite the first paper in the literature that introduces an idea when one references that idea in one’s own paper?

    It’s much easier to blame the journals for publishing the wrong paper than to take personal responsibility to cite more replication studies in one’s own papers.

    In total the main problem isn’t that bad journals promote papers that hurt the scientific community.
    The problem is that the scientific community values through citations work that hurts the general public who expects science to produce some sort of truth.

    • Andrew says:


      1. Yes, this was a list of “some” of the malign consequences. But please note the last item on my list: “Slower progress of science, delaying deeper understanding of psychology, medicine, and other topics that we deem important enough to deserve large public research efforts.” The “topics we deem important enough” bit is supposed to recognize that this is work that should ultimately benefit the general public.

      2. I think there’s an interaction between citation practices and journal review practices. I’m hoping that the sorts of discussions we’re having now will be helpful in moving things forward.

  8. Jeremy Fox says:

    Decades ago someone put out a book called The Scientist Speculates. It was explicitly a collection of “half-baked” ideas from leading scientists (and statisticians, IIRC–I think IJ Good wrote an essay for it). It included a whole chapter of speculations on ESP. I’m wondering about the possibilities for a journal along these lines as a way to address the issues this post identifies. The Journal of Implausible Results and Half-baked Ideas? I admit that that idea is itself half-baked and perhaps implausible…

    • Andrew says:


      I. J. Good did not just contribute to that book; he edited it. It could be interesting to track the book down in the library and see how many of its speculations went anywhere interesting or useful.

      • Jeremy Fox says:

        Your memory clearly is better than mine, since I actually own a copy of the book! ;-) Picked it up used years ago. Just didn’t have it in front of me when I wrote my previous comment. Agree that it would be interesting to see which speculations panned out in any fashion. Clearly all the ESP ones didn’t…

        The piece I recall best from the book was actually really out of place. It was a humorous science-to-plain English dictionary. I recall in particular the translations of descriptions of model fit:

        Scientific phrases:
        “The agreement with the predicted curve is…
        …as good as could be expected, given the approximations made in the analysis.”

        English translations:
        “The agreement with the predicted curve is…

  9. Jon M says:

    I think a lot of the problem, particularly in observational data, is researchers dredging through their datasets with 3000 hypotheses. It’s vaguely plausible that any variable you have might be socially transmitted so let’s just test all of them and publish the significant ones.

    I’ve often wondered about an individual researcher Bonferroni correction. Perhaps built into all statistical software that keeps track of all analyses run and adjusting for the number of unpublished hypotheses tested. Perhaps more realistically an analysis log that shows the timing of steps taken in the analysis to come to the final results. It will be pretty ugly seeing the sausage being made but isn’t it better to know where the problem is coming in.

    Also could a system be put in place where the authors of the initial study, published in a lesser journal, are offered something akin to authorship when their results are replicated and published in Nature. Could slightly reduce the incentive to publish in too high a journal to begin with.

  10. A large part of the problem stems from having (a) high prestige journals in the first place, and (b) peer review. (a) gives explicit credibility to work, which perhaps given (b), isn’t such a great idea. Peer review is noisy even from experts who are trying to be “fair” (whatever that means in this context). It’s even worse than it might seem due to cliques of reviewers and the notion of “peer”.

    As an extreme case, when I was a professor in the Philosophy Dept. at Carnegie Mellon, some of my colleagues were incensed at the tenure reviews in history and lit crit, which were all very deconstructionist (I hadn’t gotten tenure yet, so I wasn’t involved — only the old folks get to make decisions about the young folks, another bias in the system). As much as the output of researchers in these departments seemed like nonsense, they had armies of “peers” to argue their case. These peers had tenure at “top” institutions and stacks of publications in “top” journals. It’s hard to break out of this circularity.

    Chomsky’s Manufacturing Consent has an interesting analysis of media bias, but is more than a bit ironic given Chomsky’s stranglehold on American theoretical linguistics and its popular journals.

    • K? O'Rourke says:

      Bob: I do think you have a _not that wrong_ model of how much of academia does work (in the sense of maintaining a consensual pattern of activity) but here I think we are more concerned about the traces of supposed reports of randomized comparisons.

      It’s OK (but very inefficient) to agree to study _nonsense_ but the randomization in studying that nonsense is supposed to be self correcting (and sometimes nonsense turns out to not be nonsense)

      The bigger problem comes when more than just a few take those traces as truly being unselected reports of randomized comparisons and give them the credibility (that only unselected reports of randomized comparisons should be given) as in the ESP case.

      Now if, as is often done in gaining approval for a pharmaceutical drug, the proponents proposed a research program, subjected it to careful third party review, carried it out well and made all information available by request for auditing and let a third party expert group carefully review and decide on the merits – I think all would be much less wrong and I would have to take ESP (or some correlate) as a possibility.

      I believe that is happening in some clinical research areas where the clinical specialities have learned Don Rubin’s _motto_ – “smart people do not like being repeatedly wrong” and have decided to damn academia and go full speed ahead.

      I did have to watch my nephew with childhood cancer being given a treatment that an earlier meta-analysis had suggested the better studies showed harm and the really poorly done (biased) studies, some possible small benefit. The product was later taken off the market as it was assessed as no longer commercially profitable. (Later lab science found the mechanism that caused the stimulation of tumour growth, I believe).

  11. James says:

    I have a couple of concerns/questions with Andrew’s proposed solution:

    1) What constitutes “speculative” research? From the examples, it seems that research is speculative if I have a strong prior that the answer to the research question posed (e.g., “Does ESP exist?”) is “No”. But, one might think of research for which the answer to this question is “I have no idea” as equally speculative. So too with findings wherein I don’t believe the postulated theoretical mechanism driving the empirical findings, but have no issue with the results. Should all these examples be seen as speculative and treated in the same way?

    2) It seems that speculative — as it’s being used here — is being defined as necessarily “bad”. But, at least some speculative research is necessary for science to progress. Paraphrasing Einstein slightly: the worst thing isn’t for a scientist to be wrong, it’s for her to be uninteresting. Presumably, the problem is less that some (much) of published research is wrong, but rather that responses demonstrating these problems are not published or otherwise disseminated. In some senses, the response to the Bem paper actually shows the process working — the relevant research community responded swiftly and vocally, noting problems with Bem’s findings.

    3) In light of (2), I wonder if the issue is less with the research community than with the manner with which results are communicated to the broader population. I doubt that many researchers regard published results as sacrosanct. But, those from outside the given research community may do so, or at least may not immediately recognize problematic methods/findings. Perhaps one of the reasons these problems seem particularly pressing with medical and psychological research lies in the fact that many of those who read journals in these fields are practitioners — and in that journalists are particularly likely to pick up on findings in these fields.

    4) I also suspect there’s a practical problem with Andrew’s solution: Those seeking to replicate speculative findings must necessarily read and cite speculative papers. But there is less of an imperative for others to cite careful replications/meta-analyses. Since journal rankings are based on impact factors, it seems likely that the second tier journals to which speculative findings would be relegated would become elevated to the top tier; while the top tier journals would fall in the rankings.

    • Andrew says:


      The process kinda worked in the Bem setting, but I think it would’ve worked better had the publication and replication been in a lower-visibility journal so that the sorting-out could’ve been done without involving articles in the New York Times etc.

  12. Stephen Weigand says:

    I don’t think the Ioannidis article should be mentioned without mentioning a strong critique by Goodman and Greenland, and the response by Ioannidis.

    I think applying the “What Would Sander Say?” filter is, as always, helpful.

  13. Eli Rabett says:

    Late here, but one of the beauties of the open review system (see EGU publications) is that you get to see the reviews. This means the reviewers are on notice to do their job, and the reader can gain an appreciation of the problems as well as the strengths of the paper before or after (tastes differ) reading the paper.

  14. Greg Francis says:

    I do not mind astonishing (unbelievable) findings being published in important journals, but the quality of the work needs to support the claims.

    What was striking about the Bem paper was not only that the finding was astonishing, but that the experimental work was well beyond the required standards of the field. He replicated the main finding 9 times out of 10 experiments. To most psychologists such replication is pretty convincing. As it turns out, this belief is mistaken and actually provides evidence against Bem’s claim. I have a paper accepted in Psychonomic Bulletin & Review that addresses exactly this issue with regard to the Bem studies. A preprint is available.
