Skip to content
 

One more time on that ESP study: The problem of overestimates and the shrinkage solution

Benedict Carey writes a follow-up article on ESP studies and Bayesian statistics. (See here for my previous thoughts on the topic.) Everything Carey writes is fine, and he even uses an example I recommended:

The statistical approach that has dominated the social sciences for almost a century is called significance testing. The idea is straightforward. A finding from any well-designed study — say, a correlation between a personality trait and the risk of depression — is considered “significant” if its probability of occurring by chance is less than 5 percent.

This arbitrary cutoff makes sense when the effect being studied is a large one — for example, when measuring the so-called Stroop effect. This effect predicts that naming the color of a word is faster and more accurate when the word and color match (“red” in red letters) than when they do not (“red” in blue letters), and is very strong in almost everyone.

“But if the true effect of what you are measuring is small,” said Andrew Gelman, a professor of statistics and political science at Columbia University, “then by necessity anything you discover is going to be an overestimate” of that effect.

The above description of classical hypothesis testing isn’t bad. Strictly speaking, one would follow “is less than 5 percent” above with “if the null hypothesis of zero effect were actually true,” but they have serious space limitations, and I doubt many readers would get much out of that elaboration, so I’m happy with what Carey put there.

One subtlety that he didn’t quite catch was the way that researchers mix the Neyman-Pearson and Fisher approaches to inference. The 5% cutoff (associated with Neyman and Pearson) is indeed standard, and it is indeed subject to all the problems we know about, most simply that statistical significance occurs at least 5% of the time, so if you do a lot of experiments you’re gonna have a lot of chances to find statistical significance. But p-values are also used as a measure of evidence: that’s Fisher’s approach and it leads to its own problems (as discussed in the news article as well).

The other problem, which is not so well known, comes up in my quote: when you’re studying small effects and you use statistical significance as a filter and don’t do any partial pooling, whatever you have that’s left standing that survives the filtering process will overestimate the true effect. And classical corrections for “multiple comparisons” do not solve the problem: they merely create a more rigorous statistical significance filter, but anything that survives that filter will be even more of an overestimate.

If classical hypothesis testing is so horrible, how is it that it could be so popular? In particular, what was going on when a well-respected researcher like this ESP guy would use inappropriate statistical methods.

My answer to Carey was to give a sort of sociological story, which went as follows.

Psychologists have experience studying large effects, the sort of study in which data from 24 participants is enough to estimate a main effect and 50 will be enough to estimate interactions of interest. I gave the example of the Stroop effect (they have a nice one of those on display right now at the Natural History Museum) as an example of a large effect where classical statistics will do just fine.

My point was, if you’ve gone your whole career studying large effects with methods that work, then it’s natural to think you have great methods. You might not realize that your methods, which appear quite general, actually fall apart when applied to small effects. Such as ESP or human sex ratios.

The ESP dude was a victim of his own success: His past accomplishments studying large effects gave him an unwarranted feeling of confidence that his methods would work on small effects.

This sort of thing comes up a lot, and in my recent discussion of Efron’s article, I list it as my second meta-principle of statistics, the “methodological attribution problem,” which is that people think that methods that work in one sort of problem will work in others.

The other thing that Carey didn’t have the space to include was that Bayes is not just about estimating the weight of evidence in favor of a hypothesis. The other key part of Bayesian inference–the more important part, I’d argue–is “shrinkage” or “partial pooling,” in which estimates get pooled toward zero (or, more generally, toward their estimates based on external information).

Shrinkage is key, because if all you use is a statistical significance filter–or even a Bayes factor filter–when all is said and done, you’ll still be left with overestimates. Whatever filter you use–whatever rule you use to decide whether something is worth publishing–I still want to see some modeling and shrinkage (or, at least, some retrospective power analysis) to handle the overestimation problem. This is something Martin and I discussed in our discussion of the “voodoo correlations” paper of Vul et al.

Should the paper have been published in a top psychology journal?

Real-life psychology researcher Tal Yarkoni adds some good thoughts but then he makes the ridiculous (to me) juxtaposition of the following two claims: (1) The ESP study didn’t find anything real, there’s no such thing as ESP, and the study suffered many methodological flaws, and (2) The journal was right to publish the paper.

If you start with (1), I don’t see how you get to (2). I mean, sure, Yarkoni gives his reasons (basically, the claim that the ESP paper, while somewhat crappy, is no crappier than most papers that are published in top psychology journals), but I don’t buy it. If the effect is there, why not have them demonstrated it for real? I mean, how hard would it be for the experimenters to gather more data, do some sifting, find out which subjects are good at ESP, etc. There’s no rush, right? No need to publish preliminary, barely-statistically-significant findings. I don’t see what’s wrong with the journal asking for better evidence. It’s not like a study of the democratic or capitalistic peace, where you have a fixed amount of data and you have to learn what you can. In experimental psychology, once you have the experiment set up, it’s practically free to gather more data.

P.S. One thing that saddens me is that, instead of using the sex-ratio example (which I think would’ve been perfect for this article, Carey uses the following completely fake example:

Consider the following experiment. Suppose there was reason to believe that a coin was slightly weighted toward heads. In a test, the coin comes up heads 527 times out of 1,000.

And they he goes on two write about coin flipping. But, as I showed in my article with Deb, there is no such thing as a coin weighted to have a probability p (different from 1/2) of heads.

OK, I know about fake examples. I’m writing an intro textbook, and I know that fake examples can be great. But not this one!

P.P.S. I’m also disappointed he didn’t use the famous dead-fish example, where Bennett, Baird, Miller, and Wolferd found statistically significant correlations in an MRI of a dead salmon. The correlations were not only statistically significant, they were large and newsworthy!

P.P.P.S. The Times does this weird thing with its articles where it puts auto-links on Duke University, Columbia University, and the University of Missouri. I find this a bit distracting and unprofessional.

17 Comments

  1. DanK says:

    Is it too inappropriate to grumble about a few points here?

    For many of us, it's not obvious what should be considered a "small" effect and what should be considered a "large" effect, and it seems harmful to dichotomize this way. As an occasional psychologist, I've always considered the Stroop effect somewhere in between, but different subpopulations of psychologists would have different (either pole) perspectives. Can't we do better than this? Wouldn't it be better to say that the effect size will be overestimated to the extent the study is underpowered (where we're considering power to detect the true effect size)? That at least gives the correct impression that "large" isn't some absolute numeric cutoff, but has more to do with robustness.

    Also, I wish you wouldn't promote the dead salmon study. In the popular media (and often among scientists), it's often mis-interpreted as some kind of indictment of brain imaging. Brain imaging deserves some degree of indictment, sure, but it's really unhelpful to demonstrate what everyone knows, that worst practice methods produce poor results. Those methods generally don't survive peer review at decent journals (although we all know the process is imperfect), and haven't for more than a decade.

  2. Tal Yarkoni says:

    Hi Andrew,

    Thanks for the link. Re: the point that the journal should have accepted the paper, I think you're misconstruing what I said, which was that if you find current publishing standards in psychology acceptable, then yes, the journal should have published the article. You don't get to give everything else that gets published in JPSP a free pass, but keep out the one article about ESP just because you don't like ESP.

    Now, if you think this episode shows that standards in psychology should be much tighter on the whole (which I personally agree with), then sure, the article shouldn't have been published. But that commits you to raising the bar for everyone, which many of the critics (on the psychology side) don't seem to realize.

  3. zbicyclist says:

    [repeating a comment I made on Tal's site]

    It seems likely that the major impact of Bem’s paper will be methodological — i.e. to show standard methods have pitfalls in actual use.

    If that’s what’s being demonstrated, then the paper would be a methodological artifact paper, and pretty much condemned to a side journal (particularly since the artifacts are a result of choices by the principal investigator).

    If you just showed these artifacts by a monte carlo study, you’d be more likely to make a blog post than a publication in a good journal.

    The cynic in me suggests that JPSP figured they will get a lot of citations out of the article, improving their impact rating.

  4. Richard D. Morey says:

    If you don't like the coin example, perhaps we should all switch to flipping buttered toast in our examples?

  5. Ian Fellows says:

    Just as people who want ESP to exist should be careful not to let their desire affect the interpretation of the evidence, we should be careful not to let the subject of the paper influence our assessment of its methodology.

    I read the paper a while back, and the statistical methods looked pretty standard to me, if a bit on the basic side. The author reported a series of marginally significant findings ( around the 0.01 – 0.05 level). Certainly interesting, but not enough to make me think that the laws of physics need a revision.

    Andrew's point about the inflation of effect sizes due to shrinkage are certainly correct, but in this case, the effect sizes are not of primary interest. No one really cares whether the prediction rate is 53% versus 51%. What matters is whether we can definitively say that the prediction rate is above 50%, because if it is above 50%, regardless of it's magnitude, we would need to fundamentally alter our view of how the world operates. Thus a frequentist hypothesis test is a perfectly reasonable method to use to assess this.

    While I think it unlikely that Bem's findings will be replicated, I think he has received undue criticism of his use of statistics.

  6. Kaiser says:

    I like Andrew's balanced approach to this age old issue. It's not a black and white issue of one method being better; the two approaches address different objectives and are appropriate based on the particulars of the problem.

    I can't help but think that we still don't have a satifactory solution to the problem of small effects when we have only one sample of small size. The classical approach is haunted by the possible (if rare) scenario where the specific sample that was observed is a tail event, leading to an overestimate; if we have repeated samples (not always possible), then this error will be exposed. The Bayesian approach shrinks towards the prior but the prior is based on "external" information – and the best such information would be past replicates, which brings us to a place not much better than classical. So I'm not convinced the problem has been solved.

  7. Mike says:

    I generally don't carry coins, so when my kids are arguing over something and I have to flip a coin, I have an app for that. Seriously. I have a coin-flipping app on my iPhone. Is the coin weighted? I don't know, but it would be very easy to make it that way. And I'm pretty certain that my kids would like to know if it is weighted towards heads.

  8. K? O'Rourke says:

    zbicyclist: That did happen to a colleague, published errors do increase one's citation count, sometime dramatically.

    (They use an analysis method thought inappropriate by probabilists and almost one hundred sent off articales and letters citing his paper as an example of getting it wrong)

    K?

  9. Andrew Gelman says:

    Dan:

    1. Yes, "large" and "small" effects are on a continuum.

    2. Regarding the dead fish, you write:

    Brain imaging deserves some degree of indictment, sure, but it's really unhelpful to demonstrate what everyone knows, that worst practice methods produce poor results. Those methods generally don't survive peer review at decent journals (although we all know the process is imperfect), and haven't for more than a decade.

    What makes you say this? Vul et al. find these problems in lots of recent studies. Whether or not "everyone knows" it's wrong, it appears that lots of people are doing it.

    Tal:

    Some topics are more important or more innovative than ESP, and I can see publishing inconclusive results rather than waiting for more data. For the ESP study, though, I don't see the hurry?

    Zbicyclist:

    It's hard for me to believe that the JPSP editors were trolling for hits. My guess is they didn't want to feel like they were being censored so they bent over backward to accept a paper that they very well may have thought was wrong.

    My impression is that things are different in econ and poli sci. In those fields, if you want to make a claim that is implausible or contradicts the dominant ideology, you need solid proof (or some statistical version thereof), not the merely border-of-statistical-significance-after-running-lots-of-tests sort of thing.

    Ian:

    I agree with you about the statistics in that study, and I think Tal Yarkoni agrees too. Here's what I wrote last week:

    I think it's naive when people implicitly assume that the study's claims are correct, or the study's statistical methods are weak. Generally, the smaller the effects you're studying, the better the statistics you need. ESP is a field of small effects and so ESP researchers use high-quality statistics.

    To put it another way: whatever methodological errors happen to be in the paper in question, probably occur in lots of researcher papers in "legitimate" psychology research. The difference is that when you're studying a large, robust phenomenon, little statistical errors won't be so damaging as in a study of a fragile, possibly zero effect.

  10. DanK says:

    Andrew, the reason I said that the dead salmon (DS) methods don't tend to survive review, despite Vul's survey (of only 55 studies), is that they're two different errors. The DS poster was about correction for multiple comparisons, while the Vul article was largely about exaggerated effect sizes (although it snowballed into a "double dipping" thing). These are different things in a way that's particularly relevant to public perception, so I do think it's harmful to create the impression that the DS error undermines fMRI in general. I could be wrong, it's possible DS errors get through more often than I think. Of course, more studies have the opportunity to make the DS error than the Vul error, and anyone who reviews articles sees it quite a bit. But my gut feeling about the DS error is that it's still not so common to justify informing the general public that all fMRI results are nonsense. (I also believe Vul's survey overestimates the prevalence and severity of the issues he described in fMRI.)

    I will admit, though, that I'm frustrated by how often I do see a specific inappropriate correction approach used (using numeric thresholds that have been passed down from generation to generation since 1995). DS errors are very common. But the ones I review don't get published that way. I rarely see it from groups I respect, nor do I see it as often in the most reputable imaging journals, though it happens. I tend to see it from people who are dabbling in imaging. So I will still believe, optimistically, that it's well known to be worst practice, despite the large number of inexpert practitioners.

  11. Bill Jefferys says:

    Andrew: About coin flipping:

    It may very well be that overweighting one side of the coin doesn't change the theoretical probability of "heads" versus "tails".

    But, suppose the coin is subtly bent? Surely if it is bent in an unsubtle way, for example, bent in half so that the head part of the coin is out, and the tail part in, it will have to fall "heads," right? The chance that it will land on the (now doubled) edge, and balances there (giving maybe a tail, depending on how you interpret "landing on the edge"), is very small, I would think…

    And if it is not bent so much, then there would be an advantage for heads in a "heads out" bending, I would think.

    Even if it were bent slightly, I would expect that there might be advantage for one side over the other. Some physical analysis would be appropriate here.

    In my classes, I talk about G. W.'s nose being heavy "or something", and that may not be right about the nose being heavy, but it might be right about the "or something". For example, G. W's nose might stick out a bit more than the eagle on the other side of the quarter, enough to subtly bias the coin toss (at least if tossed onto a flat and solid elastic surface, again to bring in physics.)

    What do you think, Andrew?

  12. Gustav says:

    Q: "statistical significance filter" = model selection based on P-values?

  13. K? O'Rourke says:

    Dank: Things that seem obvious once "you" get them- weren't always.

    In particular, I remember the DS error being pointed out to Keith Worsely in one of his very early talks on fMRI at U of Ts Stat Dept.

    Bill: I am with you with – the claim seems to imply that the null hypothesis of P(H) = .5 is exactly true _and_ robust…

    K?

  14. DanK says:

    K?, the issue is not that I personally get it. I realize that it wasn't obvious to everyone in 1995, but that was 16 years ago. At this point, the researchers who make this kind of error are at least 2 standard deviations out on the willful ignorance scale. It's that it's very misleading and harmful to showcase methods that are only used by the worst practitioners in a field, giving the impression that current functional imaging methods routinely show activation in dead salmon. I doubt any scientific field could survive being evaluated on the basis of its least responsible practitioners.

    I also doubt this is the effect intended by the authors of that poster. I imagine they just wanted to knock some sense into the heads of the willfully ignorant, as do we all. Unfortunately, the popular media picked up on the story in a way that's very misleading.

  15. DanK says:

    ps if it turns out that people who are only 1sd more willfully ignorant than average are still making this error, than i'll retract my comments, and agree that the field deserves to be ridiculed.

  16. K? O'Rourke says:

    DanK: In a different field – clinical research – a survey of those doing randomized clinical trials circa 2005 …

    found that about 50% thought DS stuff (multiplicity) was not relevant (just a theorectical rather than real concern).

    And one of my fellow DPhil students circa 2002 had his study to track original prespecified outcomes in REB approvals versus published claims on the effect on the rate false postive claims almost squashed by REBs who thought it would be unethical without the permission of investigators of those same studies.

    One REB felt otherwise and the results were that many were switching outcomes without noting so and mostly for data dredging reasons that inflated the effect estimates …

    And his papers were well read and felt highly informative.

    And one of my fellow faculty at an ivy league stats department circa 2007 argued that DS from a statistical inference point of view was simply bogus.

    And then there were those Bayesian who said it was not something they had to be concerned with and perhaps one of the best corrections to this was published just last year http://www.stat.duke.edu/%7Eberger/papers/EB-mult

    And, and I owe you an apology for ranting

    K?

  17. DanK says:

    K?: Certainly I wouldn't suggest that this error isn't prevalent in any community. I'm only suggesting that it's not common in the fMRI community (very little overlap, I believe, with the community of people doing clinical trials in 2005). My totally uninformed and unfair impression is that a lot of clinical trials are run by MDs who think a couple of required research rotations makes you a scientist. So I wouldn't be surprised it they were leading the league in DS errors, so to speak, as well as the other kinds of misdeeds you note (I am a little surprised by the mysterious view of your colleague, although I'll be an optimist and believe there was a reasonable point at the core of it).

    Thanks for the pointer to the Scott/Berger article, it looks like it should be very helpful. And for my part, I didn't detect any ranting requiring an apology.