
I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

It’s funny. I’m the statistician, but I’m more skeptical about statistics than these renowned scientists are.

The quotes

Here’s one: “You have no choice but to accept that the major conclusions of these studies are true.”

Ahhhh, but we do have a choice!

First, the background. We have two quotes from this paper by E. J. Wagenmakers, Ruud Wetzels, Denny Borsboom, Rogier Kievit, and Han van der Maas.

Here’s Alan Turing in 1950:

I assume that the reader is familiar with the idea of extra-sensory perception, and the meaning of the four items of it, viz. telepathy, clairvoyance, precognition and psycho-kinesis. These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately the statistical evidence, at least for telepathy, is overwhelming.

Wow! Overwhelming evidence isn’t what it used to be.

In all seriousness, it’s interesting that Turing, who was in some ways an expert on statistical evidence, was fooled in this way. After all, even those psychologists who currently believe in ESP would not, I think, hold that the evidence for telepathy as of 1950 was overwhelming. I say this because it does not seem so easy for researchers to demonstrate ESP using the protocols of the 1940s; instead there is continuing effort to come up with new designs.

How could Turing have thought this? I don’t know much about Turing, but it does seem, when reading old-time literature, that belief in the supernatural was pretty common back then, with lots of mention of ghosts and so forth. And there does seem, at least to me, an intuitive appeal to the idea that if we just concentrate hard enough, we can read minds, move objects, etc. Also, remember that, as of 1950, the discovery and popularization of quantum mechanics was not so far in the past. Given all the counterintuitive features of quantum physics and radioactivity, it does not seem at all unreasonable that there could be some new phenomena out there to be discovered. Things feel a bit different in 2014, after several decades of merely incremental improvements in physics.

To move things forward a few decades, Wagenmakers et al. mention “the phenomenon of social priming, where a subtle cognitive or emotional manipulation influences overt behavior. The prototypical example is the elderly walking study from Bargh, Chen, and Burrows (1996); in the priming phase of this study, students were either confronted with neutral words or with words that are related to the concept of the elderly (e.g., ‘Florida’, ‘bingo’). The results showed that the students’ walking speed was slower after having been primed with the elderly-related words.”

They then pop out this 2011 quote from Daniel Kahneman:

When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

And that brings us to the beginning of this post, and my response: No, you don’t have to accept that the major conclusions of these studies are true. Wagenmakers et al. note, “At the 2014 APS annual meeting in San Francisco, however, Hal Pashler presented a long series of failed replications of social priming studies, conducted together with Christine Harris, the upshot of which was that disbelief does in fact remain an option.”

Where did Turing and Kahneman go wrong?

Overstating the strength of empirical evidence. How does that happen? As Eric Loken and I discuss in our Garden of Forking Paths article (echoing earlier work by Simmons, Nelson, and Simonsohn), statistically significant comparisons are not hard to come by, even by researchers who are not actively fishing through the data.
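
To see how easy it is, here’s a toy simulation (the 20-comparisons-per-study setup is my own illustrative assumption, not a number from the Loken and Gelman paper): even with no true effects anywhere, a research team with 20 implicit chances at a comparison will come up with at least one nominally significant result about two-thirds of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Under the null hypothesis, each comparison's p-value is uniform on
# [0, 1]. Suppose each study implicitly allows 20 comparisons (an
# arbitrary illustrative number): how often does a study with no real
# effects still yield a "significant" result at p < 0.05?
n_studies, n_comparisons = 100_000, 20
pvals = rng.uniform(size=(n_studies, n_comparisons))
any_significant = (pvals < 0.05).any(axis=1).mean()

print(f"share of pure-noise studies with a 'finding': {any_significant:.3f}")
print(f"analytic answer, 1 - 0.95**20:                {1 - 0.95**20:.3f}")
```

Of course, the garden-of-forking-paths point is subtler than literal multiple testing (the alternative comparisons need never actually be run), but the underlying arithmetic is the same.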

The other issue is that when any real effects are almost certainly tiny (as in ESP, or social priming, or various other bank-shot behavioral effects such as ovulation and voting), statistically significant patterns can be systematically misleading (as John Carlin and I discuss here).
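
Here’s what that looks like in a toy simulation, in the spirit of the Type M (magnitude) and Type S (sign) errors from that discussion; the effect size and standard error below are made-up numbers chosen to represent a tiny effect studied with a noisy design:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: a tiny true effect (0.05 sd) measured with
# a noisy design (standard error 0.15). Condition on statistical
# significance and see what the "discovered" estimates look like.
true_effect, se = 0.05, 0.15
estimates = rng.normal(true_effect, se, size=1_000_000)
significant = estimates[np.abs(estimates) > 1.96 * se]

power = len(significant) / len(estimates)
exaggeration = np.abs(significant).mean() / true_effect   # Type M
wrong_sign = (significant < 0).mean()                     # Type S

print(f"share of replications reaching significance: {power:.3f}")
print(f"average exaggeration factor (Type M):        {exaggeration:.1f}")
print(f"share of significant results with wrong sign (Type S): {wrong_sign:.3f}")
```

Under these made-up numbers, the rare significant estimate overstates the true effect several-fold, and a nontrivial fraction of significant results even point in the wrong direction.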

Still and all, it’s striking to see brilliant people such as Turing and Kahneman making this mistake. Especially Kahneman, given that he and Tversky wrote the following in a famous paper:

People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions.


Having an open mind

It’s good to have an open mind. Psychology journals publish articles on ESP and social priming, even though these may seem implausible, because implausible things sometimes are true.

It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does not represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset.

One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise. Maybe not, maybe these are real effects being discovered, but you should at least consider the possibility that you’re chasing noise. Despite what Turing and Kahneman say, you can keep an open mind.

P.S.  Some commenters thought that I was disparaging Alan Turing and Daniel Kahneman.  I wasn’t. Turing and Kahneman both made big contributions to science, almost certainly much bigger than anything I will ever do. And I’m not criticizing them for believing in ESP and social priming. What I am criticizing them for is their insistence that the evidence is “overwhelming” and that the rest of us “have no choice” but to accept these hypotheses. Both Turing and Kahneman, great as they are, overstated the strength of the statistical evidence.

And that’s interesting. When stupid people make a mistake, that’s no big deal. But when brilliant people make a mistake, it’s worth noting.

Questions about “Too Good to Be True”

Greg Won writes:

I manage a team tasked with, among other things, analyzing data on Air Traffic operations to identify factors that may be associated with elevated risk. I think it’s fair to characterize our work as “data mining” (e.g., using rule induction, Bayesian, and statistical methods).

One of my colleagues sent me a link to your recent article “Too Good to Be True” (Slate, July 24). Obviously, as my friend has pointed out, your article raises questions about the validity of what I’m doing.

A few thoughts/questions:

(1) I agree with your overall point, but I’m having trouble understanding the specific complaint with the “red/pink” study. In their case, if I’m understanding the authors’ rebuttal, they were not asking “what color is associated with fertility” and then mining the data to find a color…any color…which seemed to have a statistical association. They started by asking “is red/pink associated with fertility”, no? In which case, I think the point they’re making seems fair?

(2) But, your argument definitely applies to the kind of work I’m doing. In my case, I’m asking an open ended question: “Are there any relationships?” Well, of course, you would say, the odds are that you must find relationships…even if they are not really there.

(3) So let’s take a couple of examples. There are thousands of economists building models to explain some economic phenomenon. All of these models are based on the same underlying data: the U.S. Income and Product Accounts. There are then tens of thousands of models built—only a handful are publication-worthy. So, by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?

(4) Another example: one of the things that we have uncovered is that, in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?

(5) A caveat: In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage (because we don’t have the resources to address every conceivable hazard in the airspace system). Perhaps fortunately, most of the people I deal with (primarily pilots and air traffic controllers) don’t understand statistics. So, the safety case we build must be based on more than just a mechanical analysis of the data.

My reply:

(1) Whether or not the authors of the study were “mining the data,” I think their analysis was contingent on the data. They had many data-analytic choices, including rules for which cases to include or exclude and which comparisons to make, as well as what colors to study. Their protocol and analysis were not pre-registered. The point is that, even though they did an analysis that was consistent with their general research hypothesis, there are many degrees of freedom in the specifics, and these specifics can well be chosen in light of the data.

This topic is really worth an article of its own . . . and, indeed, Eric Loken and I have written that article! So, instead of replying in detail in this post, I’ll point you toward The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

(2) You write, “the odds are that you must find relationships . . . even if they are not really there.” I think the relationships are there but that they are typically small, and they exist in the context of high levels of variation. So the issue isn’t so much that you’re finding things that aren’t there, but rather that, if you’re not careful, you’ll think you’re finding large and consistent effects, when what’s really there are small effects of varying direction.

(3) You ask, “by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?” My response: No, I don’t think that framing statistical statements as “true” or “false” is the most helpful way to look at things. I think it’s fine for lots of people to analyze the same dataset. And, for that matter, I think it’s fine for people to use various different statistical methods. But methods have assumptions attached to them. If you’re using a Bayesian approach, it’s only fair to criticize your methods if the probability distributions don’t seem to make sense. And if you’re using p-values, then you need to consider the reference distribution over which the long-run averaging is taking place.

(4) You write: “in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?” My response is, first, I’d like to see all the comparisons that you might be making with these data. If you found one interesting pattern, there might well be others, and I wouldn’t want you to limit your conclusions to just whatever happened to be statistically significant. Second, your finding seems plausible to me but I’d guess that the long-run difference will probably be lower than what you found in your initial estimate, as there is typically a selection process by which larger differences are more likely to be noticed.
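
That selection process is easy to simulate. In this sketch (all numbers invented for illustration, not from any aviation dataset), an analyst scans many candidate comparisons and reports the one that happens to look largest; on average, that reported estimate overshoots its own true value:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumptions: 50 candidate comparisons whose true
# effects are all modest, each measured with considerable noise.
# The comparison that "gets noticed" is the one with the largest
# observed estimate.
n_sims, n_comparisons = 20_000, 50
true = rng.normal(0.5, 0.1, size=n_comparisons)

overshoot = []
for _ in range(n_sims):
    est = true + rng.normal(0, 0.5, size=n_comparisons)  # noisy estimates
    i = np.argmax(est)              # the comparison that gets noticed
    overshoot.append(est[i] - true[i])

print(f"average overstatement of the 'noticed' effect: {np.mean(overshoot):.2f}")
```

The selected estimate is biased upward even though every individual estimate is unbiased; that is the sense in which the long-run difference will probably be lower than the initial finding.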

(5) Your triage makes some sense. Also let me emphasize that it’s not generally appropriate to wait on statistical significance before making decisions.

Bad Statistics: Ignore or Call Out?

Evelyn Lamb adds to the conversation that Jeff Leek and I had a few months ago. It’s a topic that’s worth returning to, in light of our continuing discussions regarding the crisis of criticism in science.

On deck this week

Mon: Bad Statistics: Ignore or Call Out?

Tues: Questions about “Too Good to Be True”

Wed: I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Thurs: Why isn’t replication required before publication in top journals?

Fri: Confirmationist and falsificationist paradigms of science

Sat: How does inference for next year’s data differ from inference for unobserved data from the current year?

Sun: Likelihood from quantiles?

We’ve got a full week of statistics for you. Welcome back to work, everyone!

On deck this month

Bad Statistics: Ignore or Call Out?

Questions about “Too Good to Be True”

I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Why isn’t replication required before publication in top journals?

Confirmationist and falsificationist paradigms of science

How does inference for next year’s data differ from inference for unobserved data from the current year?

Likelihood from quantiles?

My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan”

Suspicious graph purporting to show “percentage of slaves or serfs in the world”

“It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

One-tailed or two-tailed

What is the purpose of a poem?

He just ordered a translation from Diederik Stapel

Six quotes from Kaiser Fung

More bad news for the buggy-whip manufacturers

They know my email but they don’t know me

What do you do to visualize uncertainty?

Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Question about data mining bias in finance

Estimating discontinuity in slope of a response function

I can’t think of a good title for this one.

Study published in 2011, followed by successful replication in 2003 [sic]

My talk at the University of Michigan on Fri 25 Sept

I’m sure that my anti-Polya attitude is completely unfair

Waic for time series

MA206 Program Director’s Memorandum

“An exact fishy test”

People used to send me ugly graphs, now I get these things

If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a tabloid top journal. Cool!

Carrie McLaren was way out in front of the anti-Gladwell bandwagon

Avoiding model selection in Bayesian social research

One of my favorites, from 1995.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:
- “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”
- “How can BIC select a model that does not fit the data over one that does”
- “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

When we talk about the “file drawer,” let’s not assume that an experiment can easily be characterized as producing strong, mixed, or weak results

Neil Malhotra:

I thought you might be interested in our paper [the paper is by Annie Franco, Neil Malhotra, and Gabor Simonovits, and the link is to a news article by Jeffrey Mervis], forthcoming in Science, about publication bias in the social sciences given your interest and work on research transparency.

Basic summary: We examined studies conducted as part of the Time-sharing Experiments for the Social Sciences (TESS) program, where: (1) we have a known population of conducted studies (some published, some unpublished); and (2) all studies exceed a quality threshold as they go through peer review. We found that having null results made experiments 40 percentage points less likely to be published and 60 percentage points less likely to even be written up.

My reply:

Here’s a funny bit from the news article: “Stanford political economist Neil Malhotra and two of his graduate students . . .” You know you’ve hit the big time when you’re the only author who gets mentioned in the news story!

More seriously, this is great stuff. I would only suggest that, along with the file drawer, you remember the garden of forking paths. In particular, I’m not so sure about the framing in which an experiment can be characterized as producing “strong results,” “mixed results,” or “null results.” Whether a result is strong or not would seem to depend on how the data are analyzed, and the point of the forking paths is that with a given dataset it is possible for noise to appear as strong. I gather from the news article that TESS is different in that any given study is focused on a specific hypothesis, but even so I would think there is a bit of flexibility in how the data are analyzed and a fair number of potentially forking paths. For example, the news article mentions “whether voters tend to favor legislators who boast of bringing federal dollars to their districts over those who tout a focus on policy matters.” But of course this could be studied in many different ways.

In short, I think this is important work you have done, and I just think that we should go beyond the “file drawer” because I fear that this phrase lends too much credence to the idea that a reported p-value is a legitimate summary of a study.

P.S. There’s also a statistical issue that every study is counted only once, as either a 1 (published) or 0 (unpublished). If Bruno Frey ever gets involved, you’d have to have a system where any result gets a number from 0 to 5, representing the number of different times it’s published.

Pre-election survey methodology: details from nine polling organizations, 1988 and 1992

This one from 1995 (with D. Stephen Voss and Gary King) was fun. For our “Why are American Presidential election campaign polls so variable when votes are so predictable?” project a few years earlier, Gary and I had analyzed individual-level survey responses from 60 pre-election polls that had been conducted by several different polling organizations. We wanted to know exactly what went on in these surveys, but it was hard to learn much at all from the codebooks or the official descriptions of the polls. So Voss, a student of Gary’s who had experience as a journalist, contacted the polling organizations and got them to cough up lots of information, which we reported in this paper, along with some simple analyses indicating the effects of weighting. All the surveys had serious nonresponse issues, of course, but some dealt with them during the data-collection process while others did more of the adjustment using weights.

By the way, the paper has a (small) error. The two outlying “h” points in Figure 1b are a mistake. I can’t remember what we did wrong, but I do remember catching the mistake; I think it was before publication but too late for the journal to fix the error. The actual weighted results for the Harris polls are not noticeably different from those of the other surveys at those dates.

Polling has changed in the past twenty years, but I think this paper is still valuable, partly in giving a sense of the many different ways that polling organizations can attempt to get a representative sample, and partly as a convenient way to shoot down the conventional textbook idea of survey weights as inverse selection probabilities. (Remember, survey weighting is a mess.)
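
For context, here is the textbook version of weighting in its simplest possible form, a toy poststratification adjustment with made-up numbers; the point of the paper is that real polling practice is much messier than this:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up illustration: one binary demographic ("young"/"old") with a
# known population share, and young people responding at a lower rate,
# so the raw sample skews old.
pop_share_young = 0.40                    # known from the census, say
support = {"young": 0.60, "old": 0.40}    # true opinion by group
# True population value: 0.4 * 0.60 + 0.6 * 0.40 = 0.48

n_young, n_old = 200, 800                 # nonresponse skews the sample old
resp_young = rng.random(n_young) < support["young"]
resp_old = rng.random(n_old) < support["old"]

raw = np.concatenate([resp_young, resp_old]).mean()

# Weight each respondent by (population share) / (sample share).
w_young = pop_share_young / (n_young / (n_young + n_old))
w_old = (1 - pop_share_young) / (n_old / (n_young + n_old))
weighted = (w_young * resp_young.sum() + w_old * resp_old.sum()) / (
    w_young * n_young + w_old * n_old
)

print(f"raw estimate:      {raw:.3f}")       # biased toward the old group
print(f"weighted estimate: {weighted:.3f}")  # near the truth of 0.48
```

Even in this clean toy case the weights depend on knowing the right population shares; in practice the adjustment variables, the cells, and the trimming of extreme weights all involve judgment calls, which is part of why survey weighting is a mess.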

One of the worst infographics ever, but people don’t care?

This post is by Phil Price.

Perhaps prompted by the ALS Ice Bucket Challenge, this infographic has been making the rounds:
Infographic: disease deaths and dollars spent

I think this is one of the worst I have ever seen. I don’t know where it came from, so I can’t give credit/blame where it’s due.

Let’s put aside the numbers themselves – I haven’t checked them, for one thing, and I’d also say that for this comparison one would be most interested in (government money plus donations) rather than just donations — and just look at this as an information display. What are some things I don’t like about it? Jeez, I hardly know where to begin.

1. It takes a lot of work to figure it out. (a) You have to realize that each color is associated with a different cause — my initial thought was that the top circles represent deaths and dollars for the first cause, the second circles are for the second cause, etc. (b) Even once you’ve realized what is being displayed, and how, you pretty much have to go disease by disease to see what is going on; there’s no way to grok the whole pattern at once. (c) Other than pink for breast cancer and maybe red for AIDS none of the color mappings are standardized in any sense, so you have to keep referring back to the legend at the top. (d) It’s not obvious (and I still don’t know) if the amount of “money raised” for a given cause refers only to the specific fundraising vehicle mentioned in the legend for each disease. It’s hard to believe they would do it that way, but maybe they do.
2. Good luck if you’re colorblind.
3. Maybe I buried the lede by putting this last: did you catch the fact that the area of the circle isn’t the relevant parameter? Take a look at the top two circles on the left. The upper one should be less than twice the size of the second one. It looks like they made the diameter of the circle proportional to the quantity, rather than the area; a classic way to mislead with a graphic.
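
The arithmetic behind that mistake is worth spelling out: if the diameter is drawn proportional to the value, the visual impression (area) scales with the square of the value.

```python
import math

def apparent_area_ratio(value_ratio: float) -> float:
    """Area ratio you see when circle *diameters* are scaled by value_ratio."""
    return value_ratio ** 2

# A quantity twice as large gets a circle four times the area:
print(apparent_area_ratio(2.0))  # 4.0

# To be honest, the diameter should scale with the square root of the
# value, so that area is proportional to the quantity:
print(math.sqrt(2.0))  # the correct diameter multiplier for a 2x value
```

So a modest difference in the data gets squared into a dramatic difference on the page.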

At a bare minimum, this graphic could be improved by (a) fixing the terrible mistake with the sizes of the circles, (b) putting both columns in the same order (that is, first row is one disease, second row is another, etc.), (c) taking advantage of the new ordering to label each row so you don’t need the legend. This would also make it much easier to see the point the display is supposed to make.

As a professional data analyst I’d rather just see a scatterplot of money vs deaths, but I know a lot of people don’t understand scatterplots. I can see the value of using circle sizes for a general audience. But I can’t see how anyone could like this graphic. Yet three of my friends (so far) have posted it on Facebook, with nary a criticism of the display.

[Note added the next day:
The graphic is even worse than I thought. As several people have pointed out, my suspicion is true: the numbers do not show the total donations to fight the diseases listed, they show only the donations to a single organization. For instance, according to the legend the pink color represents donations to fight breast cancer, but the number is not for breast cancer as a whole, it's only for Komen Race for the Cure.

If they think people are interested in contributions to only a single charity in each category --- which seems strange, but let's assume that's what they want and just look at the display --- then they need a title that is much less ambiguous, and the labels need to emphasize the charity and not the disease.]

This post is by Phil Price.

Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

From 1994. I don’t have much to say about this one. The paper I was discussing (by Samuel Merrill) had already been accepted by the journal—I might even have been a referee, in which case the associate editor had decided to accept the paper over my objections—and the editor gave me the opportunity to publish this dissent which appeared in the same issue with Merrill’s article.

I like the discussion, and it includes some themes that keep showing up: the idea that modeling is important and you need to understand what your model is doing to the data. It’s not enough to just interpret the fitted parameters as is, you need to get in there, get your hands dirty, and examine all aspects of your fit, not just the parts that relate to your hypotheses of interest.

There is a continuity between the criticisms I made of that paper in 1994 and our recent criticisms of some applied models, for example of that regression estimate of the health effects of air pollution in China.