With a bit of precognition, you’d have known I was going to post again on this topic, and with a lot of precognition, you’d have known I was going to post today

Chris Masse points me to this response by Daryl Bem and two statisticians (Jessica Utts and Wesley Johnson) to criticisms by Wagenmakers et al. of Bem’s recent ESP study. I have nothing to add but would like to repeat a couple of bits from my discussions of last month, first here:

Classical statistical methods that work reasonably well when studying moderate or large effects (see the work of Fisher, Snedecor, Cochran, etc.) fall apart in the presence of small effects.

I think it’s naive when people implicitly assume that either the study’s claims are correct or the study’s statistical methods must be weak. Generally, the smaller the effects you’re studying, the better the statistics you need. ESP is a field of small effects and so ESP researchers use high-quality statistics.

To put it another way: whatever methodological errors happen to be in the paper in question probably occur in lots of research papers in “legitimate” psychology research. The difference is that when you’re studying a large, robust phenomenon, little statistical errors won’t be as damaging as in a study of a fragile, possibly zero effect.

In some ways, there’s an analogy to the difficulties of using surveys to estimate small proportions, in which case misclassification errors can loom large.
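
To see how quickly this bites, here is a minimal numerical sketch (the rates are made up for illustration, not taken from any particular survey): with a true prevalence of 0.1% and an instrument that misclassifies 1% of true negatives as positives, the observed proportion mostly measures the misclassification, not the quantity of interest.

```python
# Toy arithmetic: small misclassification rates swamp a small true proportion.
# All numbers are hypothetical.
true_prevalence = 0.001      # assumed true proportion: 0.1%
false_positive_rate = 0.01   # 1% of true negatives recorded as positives
false_negative_rate = 0.10   # 10% of true positives missed

observed = (true_prevalence * (1 - false_negative_rate)
            + (1 - true_prevalence) * false_positive_rate)

print(f"true proportion:     {true_prevalence:.4f}")   # 0.0010
print(f"observed proportion: {observed:.4f}")          # about 0.0109
# Roughly 90% of the observed "signal" is false positives, so the survey
# estimate is an order of magnitude off and says little about the true rate.
```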

And here:

[One thing that Bem et al. and Wagenmakers et al. both miss] is that Bayes is not just about estimating the weight of evidence in favor of a hypothesis. The other key part of Bayesian inference–the more important part, I’d argue–is “shrinkage” or “partial pooling,” in which estimates get pooled toward zero (or, more generally, toward their estimates based on external information).

Shrinkage is key, because if all you use is a statistical significance filter–or even a Bayes factor filter–when all is said and done, you’ll still be left with overestimates. Whatever filter you use–whatever rule you use to decide whether something is worth publishing–I still want to see some modeling and shrinkage (or, at least, some retrospective power analysis) to handle the overestimation problem. This is something Martin and I wrote about in our discussion of the “voodoo correlations” paper of Vul et al.
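
Here is a minimal simulation sketch of that filtering problem (the effect size, standard error, and prior scale are made-up numbers, not anything estimated from Bem’s data): conditioning on statistical significance leaves estimates of a small true effect that are, on average, several times too large, and shrinking them toward zero with a normal prior pulls them back down.

```python
# Significance filter vs. shrinkage: a hypothetical small effect measured
# noisily across many studies. Numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1          # assumed small true effect
se = 0.5                   # standard error of each study's estimate
n_studies = 100_000

estimates = rng.normal(true_effect, se, size=n_studies)

# Keep only the "statistically significant" studies (|z| > 1.96).
significant = estimates[np.abs(estimates / se) > 1.96]
print(f"mean of all estimates:         {estimates.mean():.2f}")    # about 0.10
print(f"mean of significant estimates: {significant.mean():.2f}")  # about 0.5, far above 0.10

# Partial pooling toward zero: posterior mean under an assumed N(0, 0.2^2)
# prior for the effect, i.e. estimate * prior_var / (prior_var + se^2).
prior_sd = 0.2
shrinkage = prior_sd**2 / (prior_sd**2 + se**2)
print(f"mean after shrinkage:          {(shrinkage * significant).mean():.2f}")
```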

Finally, my argument for why a top psychology journal should never have published Bem’s article:

I mean, how hard would it be for the experimenters to gather more data, do some sifting, find out which subjects are good at ESP, etc. There’s no rush, right? No need to publish preliminary, barely-statistically-significant findings. I don’t see what’s wrong with the journal asking for better evidence. It’s not like a study of the democratic or capitalistic peace, where you have a fixed amount of data and you have to learn what you can. In experimental psychology, once you have the experiment set up, it’s practically free to gather more data.

I made this argument in response to a generally very sensible paper by Tal Yarkoni on this topic.

P.S. Wagenmakers et al. respond (to Bem et al., that is, not to me). As Tal Yarkoni would say, I agree with Wagenmakers et al. on the substantive stuff. But I still think that both they and Bem et al. err in setting up their models so starkly: either there’s ESP or there’s not. Given the long history of ESP experiments (as noted by some of the commenters below), it seems more reasonable to me to suppose that these studies have some level of measurement error of magnitude larger than that of any ESP effects themselves.

As I’ve already discussed, I’m not thrilled with the discrete models used in these discussions and I am for some reason particularly annoyed by the labels “Strong,” “Substantial,” “Anecdotal” in figure 4 of Wagenmakers et al. Whether or not a study can be labeled “anecdotal” seems to me to be on an entirely different dimension than what they’re calculating here. Just for example, suppose you conduct a perfect randomized experiment on a large random sample of people. There’s nothing anecdotal at all about this (hypothetical) study. As I’ve described it, it’s the opposite of anecdotal. Nonetheless, it might very well be that the effect under study is tiny, in which case a statistical analysis (Bayesian or otherwise) is likely to report no effect. It could fall into the “anecdotal” category used by Wagenmakers et al. But that would be an inappropriate and misleading label.
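
To make that concrete, here is a toy Bayes factor calculation under a simple normal model with a point null against a normal prior on the effect (a common default setup; the numbers and the prior scale are my own, not the model or data of Wagenmakers et al.): a huge, perfectly randomized experiment with a tiny estimated effect can land squarely in the band that gets labeled “anecdotal,” even though nothing about the study design is anecdotal.

```python
# Toy Bayes factor: H0 says the effect is exactly zero, H1 puts a N(0, tau^2)
# prior on the effect. All numbers below are hypothetical.
from math import exp, pi, sqrt

def normal_pdf(x, sd):
    """Density of a mean-zero normal with standard deviation sd, evaluated at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))

estimate = 0.02   # tiny observed effect, two standard errors from zero
se = 0.01         # tiny standard error, i.e. a very large randomized experiment
tau = 0.05        # assumed prior sd for the effect under H1

# BF10 = marginal likelihood of the estimate under H1 divided by that under H0.
bf10 = normal_pdf(estimate, sqrt(se**2 + tau**2)) / normal_pdf(estimate, se)
print(f"BF10 = {bf10:.2f}")  # about 1.3, i.e. in the "anecdotal" band (1 to 3)
```

The label describes the strength of evidence under one particular pair of models; it says nothing about whether the study itself was anecdotal or rigorous.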

That said, I think people have to use what statistical methods they’re comfortable with, so it’s sort of silly for me to fault Wagenmakers et al. for not using the sorts of analysis I would prefer. The key point that they and other critics have made is that the Bem et al. analyses aren’t quite as clean as a casual observer might think, and it’s possible to make that point coming from various statistical directions. As I note above, my take on this is that if you study very small effects, then no amount of statistical sophistication will save you. If it’s really true, as commenter Dean Radin writes below, that these studies “took something like 6 or 7 years to complete,” then I suppose it’s no surprise that something turned up.

23 thoughts on “With a bit of precognition, you’d have known I was going to post again on this topic, and with a lot of precognition, you’d have known I was going to post today”

  1. Depends on "your" cost – practically free to gather more data.

    I do recall a project with a medical student where, with some fancy sigmoidal regression, we found a statistically significant mortality signal – but the senior clinician wanted to wait for more data.

    When we had more data – the signal was gone.

    The medical student was incensed – had we published the first false positive, we could have published a second paper retracting it – and with two published papers he would have qualified for a faculty position.

    But it's only real if it's replicated by another investigator – as JG Gardin used to put it, a scientist has no business replicating his own findings – and as that sensible paper points out, it's actually unfair to expect people to be able to do that!

    Fortunately, ESP, like other vampirical beliefs, can never be killed by mere data.

    K?

  2. The response was very good. I'm glad they took on Wagenmakers' critique directly. I felt his statistical criticism was highly unwarranted.

    Regarding the gathering of more evidence, I suppose it depends on how much weight you put on combining experiments. Bem reports a 'combined' p-value on the order of 10^-11. If you believe that, then gathering more data is unnecessary.

    I don't believe the result, but think the problem is much more likely to be an (unconscious) methodological bias. Last I checked no one has been able to reproduce the results.

  3. "No need to publish preliminary, barely-statistically-significant findings."

    Hmm, Bem ran 9 experiments involving 1,000 subjects, and this took something like 6 or 7 years to complete. Combined odds of those 9 experiments amount to over a billion to 1.

    But sure, let's wait another decade or two and see if a couple of generations of future scientists can replicate this study. Wait, Bem's studies were already based on previously successful experiments? Well then, perhaps it's best to just ignore it and prevent this sort of thing from being published.

  4. Ian:

    I wasn't thrilled with the Wagenmakers et al. article either (as noted in my earlier blog on the topic). I preferred Yarkoni's take on the story.

    Dean:

    Note the last sentence of Ian's comment above.

    Bem's work may very well have taken 6-7 years. Nonetheless, it would take much less time and effort to take his favorite of his experiments and replicate it.

  5. Andrew, once again, I agree with you on the substantive stuff. But I do think that if you believe JPSP published Bem's article too hastily, you're basically arguing that psychology journals need much more rigorous standards in general, because this wasn't an isolated failure. (As Dean pointed out, Bem's paper already contained substantially more data than most JPSP articles do, and most of the methodological problems I discussed in my post are pretty widespread in the literature, so they're not unique to his paper either.)

    I'm not saying there's anything wrong with that view (actually, it's pretty much what I believe), but you can't have it both ways… Either the publication of Bem's paper is an indictment of the field as a whole (or at least a large chunk of it), or it's a sign that the process is working the way it's supposed to. I don't see how you can argue that a paper that's methodologically no worse (and probably better) than most others in the same journal shouldn't have been published just because you don't like the conclusion… I mean, that was basically the argument Wagenmakers et al. made by appealing to a prior pulled out of thin air, and I agree with your reasons for thinking it wasn't a good one.

  6. One thing that is missing from sceptical discussion and criticism of the Bem paper is any mention of the previous research on which it is based (as noted by Dean Radin above) or any of the case reports of apparent ESP experiences.

    If you make a serious attempt to read these case reports (see L.E. Rhine's work), you will look upon this kind of research as far from silly and begin to see that it is actually quite rational.

  7. Ian: Always contrast before combining.

    Combining p-values is _only_ sensible under the assumption of replication.

    It's not a measure of replication.

    Just looked up an old book by Kalbfleisch, "Probability & Statistical Inference," an early user-friendly intro to likelihood and Bayes, which was surprisingly clear about contrasting likelihoods before combining them, as well as before combining likelihoods and priors.

    K?
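
As a rough numerical illustration of that point (the p-values below are made up, not Bem's per-experiment values), Fisher's method will happily return a tiny combined p-value from nine individually unimpressive studies, while saying nothing about whether those studies are mutually consistent:

```python
# Fisher's method for combining p-values; the inputs are hypothetical.
from math import log
from scipy.stats import chi2

p_values = [0.04, 0.01, 0.03, 0.05, 0.02, 0.04, 0.01, 0.03, 0.05]

fisher_stat = -2 * sum(log(p) for p in p_values)       # ~ chi-square with 2k df
combined_p = chi2.sf(fisher_stat, df=2 * len(p_values))
print(f"combined p-value: {combined_p:.1e}")           # on the order of 1e-7
# The combination step assumes the nine studies estimate the same effect;
# contrasting them (checking mutual consistency) has to come first, and the
# combined p-value is not itself evidence of replication.
```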

  8. Tal:

    As noted above, I think many classical statistical methods (including the discrete-model Bayesian approach favored by Wagenmakers et al. and Bem et al., which in my opinion is just a slight variant of classical null hypothesis testing) can work well when studying large effects such as Stroop but can fail when studying small effects such as ESP. The error is in assuming that a statistical method that works in one setting will necessarily work in another.

    The standard of statistics in ESP work has long been fairly high (although not at the top level of statistical methodology; compare, for example, Bem's statistics to the sorts of models that have been fit for decades in psychometrics), but good statistics won't help you if you're studying effects that are null or nearly so. At that point all you're doing is finding patterns in noise, discovering measurement bias, etc. (as noted in Ian's comment above).

    Dave:

    Uri Geller was ok but he wasn't so impressive when he went on Carson.

  9. One very critical topic seems to be missing from the critiques and discussions: the role of theory in study design and data analysis. I appreciate the statistical critiques (and that this is a statistics blog!), but such critiques will never convince psychological scientists. Tal's approach is rhetorically weak because any motivated researcher can develop a list of methodological and statistical flaws across a series of studies. Wagenmakers et al. focus on one flaw that can account for the results of all the studies because this is the kind of critique that social psychologists find convincing. The reason JPSP demands multiple studies is not replication, it is parsimony, i.e., building a body of evidence that can only be explained by the researcher's preferred theoretical framework.

    In the mind of a psychology researcher, it is the complete lack of a well-articulated theory that is consistent with what we currently know about causation that makes the one-sided tests and data peeking unacceptable.

    Joachim Krueger is the only person I know of who has actually articulated this idea that the issue is substantive, not statistical:

    http://www.psychologytoday.com/blog/one-among-man

  10. What MJ said. In fact, we can use this as a proof of Bayesianism, by the following syllogism:
    (1) When people find a conclusion they don't believe, they can always find a potent critique of the statistical underpinning.
    (2) When people have a conclusion they do believe, they never bother to question the statistical underpinning.

    Thus, posterior belief in a statistical result simply reflects the prior beliefs of the people reading the study -> Reader-based Bayesianism

  11. As Laplace might put it, really worried here about the likelihood, not the prior.

    The likelihood (or data model) very likely does not adequately reflect how the multiple studies were actually done and reported (see Ian's comments again).

    A fully prespecified and excellently executed set of new studies would fix that (with a bit of luck, because things can always go wrong).

    K?

  12. Bem and the journal were correct in publishing the results of these experiments. There are people like MJ who say

    "it is the complete lack of a well-articulated theory that is consistent with what we currently know about causation that makes the one-sided tests and data peaking unacceptable."

    Really? Maybe two-sided tests make the situation worse by not fitting Bem's paradigm and thus obscuring the results so that no one else will investigate further.

    In other words, a classic stall or suppression technique.

    The issue here is to prompt further investigations to advance knowledge and for theory construction. It's not to protect some cherished beliefs which academic scientists and psychologists have fallen in love with.

  13. MJS:

    I don't think anyone is talking about suppression. But are these inconclusive results worth publishing in JPSP–one of the top psychology journals? Maybe not. Lots of experiments are done every year but most are not published in JPSP.

  14. K?: Of course, these folks are using standard p-values, which aren't likelihoods and don't obey the likelihood principle. But I agree that their data model isn't well specified, so it is hard to decide whether their conclusions are warranted.

  15. @Andrew Gelman.

    Suppression happens explicitly or implicitly; you can call it cognitive dissonance if you like.

    Anyway, I see nothing wrong with their conclusions and see nothing premature about publishing where they did. When you ask if they warranted publication in a top psychology journal, I say of course they did. It's groundbreaking and paradigm-shifting research that needs to reach the widest relevant audience. The editor should be commended along with the reviewers.

    Maybe the ivory tower will finally realize that the masses are right and way ahead of them when it comes to the paranormal.

  16. Andrew, I'm going to assume that your response was either a) a lighthearted quip or b) a serious response implying that Uri Geller's performances represent the 'data' from which this kind of research is inspired.

    If a) then I understand. It's your blog and you don't have to respond seriously to every comment. If b) then I'm a little concerned and would offer the following: Geller has nothing to do with either L.E. Rhine's work on spontaneous cases or the previous work on which Bem's time-reversed paradigm is based. I would add that the majority of parapsychologists take Geller with a very large pinch of salt. However, if you want psychic 'superstars' rather than a large body of spontaneous cases, I suggest you look into Joe McMoneagle and his work with Dr. Ed May.

  17. Bill: Yes, it was the data model not being specified that I was pointing to.

    I use data model and likelihood (the function) interchangeably.

    In the ways the joint model can be wrong (both prior and likelihood), I sometimes want to focus more heavily on one than the other (and as Andrew has pointed out, it's not obvious how to do this).

    In Jessica Utts's online talk she seemed to focus mostly on prior problems, despite some encouragement from the audience to also focus on likelihood (data model) problems.

    But then I could do a Maynard Keynes and instead claim that my prior for the data models being importantly wrong was near 1.

    (And with small effects importantly wrong can arise from very slight almost undetectable mishaps.)

    K?

  18. I'd like more research not on parapsychology but on people who think there is substance to parapsychology, who seem well represented in this thread. I think that would do much more to push forward human understanding (and understanding humans).

    For example, what is their explanation of why phenomena they evidently believe are real, powerful, and ubiquitous are so difficult to establish in experimental and/or statistical terms? Or, if they consider that a loaded question: How would they rephrase that question? Or how is it loaded?

  19. E.J.:

    Dank je wel. I've added a P.S. to the above blog discussing your new paper.

    Nick:

    As I wrote in my earlier blog on the topic, it does seem evident that many people want ESP to be true. Unfortunately for them, psychological studies are small-scale enough that simple ideas of ESP can be easily refuted. In contrast, economics and political science are observational not experimental sciences, and so true believers of various wishful ideas in these fields never have to abandon or even modify their beliefs.

  20. By the way, it often seems to be overlooked that while Harold Jeffreys in his _Theory of probability_ worked hard on trying to make data analysis (as he wouldn't have described it) coherent and rigorous, he had a strong sense of all that makes that aim so difficult.

    More particularly, his monograph is rare as a statistical source that pays succinct but serious attention to "wishful thinking" as a difficulty to be confronted in statistics.

  21. Figure 1 could be explained by the fact that Experiment 9 (50 participants) was hypothesised to increase the effect size by introducing deeper encoding during the practice phase. And the experiment with 200 participants was expected to generate the smallest effect because it was using non-arousing stimuli.

  22. No prevention of confirmation bias without randomization.

    That is, with the usual freedoms in analyzing observational studies you can "adjust" until you get what you want – or close enough.

    Recall there was some creative guy who left Google or somewhere because the random evaluations of "good" ideas were cramping his style.

    For a better academic lifestyle – avoid areas where RCTs are readily doable.

    Or as my old boss used to say – any idiot can randomize to two groups and observe which one does better (with adequate care and resources, of course).

    Similar to Bruce Lee's lament – any idiot can point a gun and pull a trigger.

    K?
