I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

It’s funny. I’m the statistician, but I’m more skeptical about statistics than these renowned scientists are.

The quotes

Here’s one: “You have no choice but to accept that the major conclusions of these studies are true.”

Ahhhh, but we do have a choice!

First, the background. We have two quotes from this paper by E. J. Wagenmakers, Ruud Wetzels, Denny Borsboom, Rogier Kievit, and Han van der Maas.

Here’s Alan Turing in 1950:

I assume that the reader is familiar with the idea of extra-sensory perception, and the meaning of the four items of it, viz. telepathy, clairvoyance, precognition and psycho-kinesis. These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately the statistical evidence, at least for telepathy, is overwhelming.

Wow! Overwhelming evidence isn’t what it used to be.

In all seriousness, it’s interesting that Turing, who was in some ways an expert on statistical evidence, was fooled in this way. After all, even those psychologists who currently believe in ESP would not, I think, hold that the evidence for telepathy as of 1950 was overwhelming. I say this because it does not seem so easy for researchers to demonstrate ESP using the protocols of the 1940s; instead there is continuing effort to come up with new designs.

How could Turing have thought this? I don’t know much about Turing but it does seem, when reading old-time literature, that belief in the supernatural was pretty common back then, lots of mention of ghosts etc. And there does seem, at least to me, to be an intuitive appeal to the idea that if we just concentrate hard enough, we can read minds, move objects, etc. Also, remember that, as of 1950, the discovery and popularization of quantum mechanics were not so far in the past. Given all the counterintuitive features of quantum physics and radioactivity, it does not seem at all unreasonable that there could be some new phenomena out there to be discovered. Things feel a bit different in 2014 after several decades of merely incremental improvements in physics.

To move things forward a few decades, Wagenmakers et al. mention “the phenomenon of social priming, where a subtle cognitive or emotional manipulation influences overt behavior. The prototypical example is the elderly walking study from Bargh, Chen, and Burrows (1996); in the priming phase of this study, students were either confronted with neutral words or with words that are related to the concept of the elderly (e.g., ‘Florida’, ‘bingo’). The results showed that the students’ walking speed was slower after having been primed with the elderly-related words.”

They then pop out this 2011 quote from Daniel Kahneman:

When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

And that brings us to the beginning of this post, and my response: No, you don’t have to accept that the major conclusions of these studies are true. Wagenmakers et al. note, “At the 2014 APS annual meeting in San Francisco, however, Hal Pashler presented a long series of failed replications of social priming studies, conducted together with Christine Harris, the upshot of which was that disbelief does in fact remain an option.”

Where did Turing and Kahneman go wrong?

Overstating the strength of empirical evidence. How does that happen? As Eric Loken and I discuss in our Garden of Forking Paths article (echoing earlier work by Simmons, Nelson, and Simonsohn), statistically significant comparisons are not hard to come by, even by researchers who are not actively fishing through the data.
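To make the forking-paths point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something taken from the article: pure-noise data, a treatment and a control group each split into two hypothetical subgroups "A" and "B", a z-test with known unit variance, and a researcher who reports whichever of three reasonable-looking comparisons turned out most striking. Taking the maximum is a crude stand-in for data-dependent analysis choices; the researcher in the story runs only one analysis, but which one depends on the data.

```python
import random
import statistics


def z_stat(a, b):
    """Two-sample z statistic, assuming known unit variance in both groups."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def one_study(rng, n_per_cell=25):
    """Simulate one study in which there is no true effect anywhere."""
    cells = {
        (g, s): [rng.gauss(0, 1) for _ in range(n_per_cell)]
        for g in ("treat", "ctrl")
        for s in ("A", "B")
    }
    overall = z_stat(cells["treat", "A"] + cells["treat", "B"],
                     cells["ctrl", "A"] + cells["ctrl", "B"])
    sub_a = z_stat(cells["treat", "A"], cells["ctrl", "A"])
    sub_b = z_stat(cells["treat", "B"], cells["ctrl", "B"])
    return overall, sub_a, sub_b


def false_positive_rates(n_sims=4000, seed=1):
    """Share of pure-noise studies reaching |z| > 1.96 under two strategies."""
    rng = random.Random(seed)
    prespecified = flexible = 0
    for _ in range(n_sims):
        overall, sub_a, sub_b = one_study(rng)
        # Strategy 1: the overall comparison was fixed before seeing the data.
        prespecified += abs(overall) > 1.96
        # Strategy 2: report whichever comparison looks most striking.
        flexible += max(abs(overall), abs(sub_a), abs(sub_b)) > 1.96
    return prespecified / n_sims, flexible / n_sims
```

Calling `false_positive_rates()` returns the two rates as a pair. The prespecified rate should sit near the nominal 5 percent, and the flexible one can never be lower, because the flexible researcher always has the overall comparison available too.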

The other issue is that when any real effects are almost certainly tiny (as in ESP, or social priming, or various other bank-shot behavioral effects such as ovulation and voting), statistically significant patterns can be systematically misleading (as John Carlin and I discuss here).
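A rough sketch of why that happens, under numbers I am making up for illustration: a true effect of 0.1 measured with a standard error of 0.5, so the study is very noisy relative to what it is trying to detect. The function below simulates many such studies, keeps only the estimates that reach conventional significance, and summarizes how far off those "successes" are.

```python
import random


def significance_filter(true_effect=0.1, se=0.5, n_sims=20000, seed=2):
    """Summarize the estimates that survive a |estimate| > 1.96 * se filter.

    true_effect and se are illustrative assumptions, not values from any study.
    """
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [est for est in estimates if abs(est) > 1.96 * se]

    # Share of simulated studies that reach significance at all.
    power = len(significant) / n_sims
    # Average magnitude of a significant estimate, relative to the truth.
    exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
    # Share of significant estimates pointing in the wrong direction.
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return power, exaggeration, wrong_sign
```

With these settings every significant estimate has magnitude above 1.96 * 0.5 = 0.98, so a significant result necessarily overstates the assumed true effect of 0.1 by nearly a factor of ten or more, and a nontrivial share of them have the wrong sign.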

Still and all, it’s striking to see brilliant people such as Turing and Kahneman making this mistake. Especially Kahneman, given that he and Tversky wrote the following in a famous paper:

People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions.

Indeed.

Having an open mind

It’s good to have an open mind. Psychology journals publish articles on ESP and social priming, even though these may seem implausible, because implausible things sometimes are true.

It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does not represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset.

One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise. Maybe not, maybe these are real effects being discovered, but you should at least consider the possibility that you’re chasing noise. Despite what Turing and Kahneman say, you can keep an open mind.

P.S.  Some commenters thought that I was disparaging Alan Turing and Daniel Kahneman.  I wasn’t. Turing and Kahneman both made big contributions to science, almost certainly much bigger than anything I will ever do. And I’m not criticizing them for believing in ESP and social priming. What I am criticizing them for is their insistence that the evidence is “overwhelming” and that the rest of us “have no choice” but to accept these hypotheses. Both Turing and Kahneman, great as they are, overstated the strength of the statistical evidence.

And that’s interesting. When stupid people make a mistake, that’s no big deal. But when brilliant people make a mistake, it’s worth noting.

120 Comments

  1. P says:

    In the 1930s and 40s, there were some experimental studies by J. B. Rhine and others that seemed to prove the existence of clairvoyance and telepathy. These results were famous at the time, and it took many years before the many experimental errors and fabrications in them were uncovered. I don’t think Turing believed in such things because people back then were more superstitious than today. Instead, he probably believed in them because the available scientific evidence seemed to support them.

  2. Anonymous says:

    Von Mises was careful to point out that his “collectives” – what we’d call a “data generation mechanism” or a stable frequency histogram – only rarely occurs. Most repeated phenomena do not amount to a collective in his sense, and he didn’t consider them a legitimate part of statistics, even though they superficially looked like ‘random variables’.

    Acting as if every repeatable event is a ‘collective’ when they rarely are is the source of all the problems. It’s the original sin of classical statistics.

    All that multiple comparisons stuff is joke. Any effects observed are real in that data set. The question is whether they’ll be there in other data sets. If you were dealing with a ‘collective’ then all those considerations about multiple comparison, cherry picking, p-value fishing, garden of forked paths and all the rest of it might be relevant. But since you’re rarely dealing with a ‘collective’, they’re positively misleading if not bizarrely stupid.

    When you’re not dealing with a ‘collective’, whether any effects seen in one data set will be there in other ones turns on factors unrelated to multiple comparison, cherry picking, p-value fishing, garden of forked paths and all the rest of that crap.

    • +1 on this. There are a number of realistic scientific “experiments” which are repeatable, and hence you could consider a ball-in-urn type notion of probability (in particular, I’m thinking of things like running algorithms on whole-genome datasets and selecting subsets according to some theory, you can definitely ask “does my theory generate datasets which are different from a random number generator choosing subsets of the genome?”)

      But, much of the time, we’re studying things which are unique: a particular earthquake, a particular year’s ice-pack, a particular set of tree rings, a particular sediment in a particular part of the ocean, a particular factory’s machinery…

      or, we’re studying something which is repeatable only because we’re ignorant of the stuff that makes it different, in other words, with a sufficiently blunt microscope everything looks like the same kind of blur, eg:

      a population of people who come to the doctor with a similar set of 3 symptoms (but we are totally ignorant of the biological details of the virus/bacteria, the immune history of the patients, the genetic predispositions, etc)

      Acknowledging ignorance is part of statistics, and extrapolating to larger populations is always possible, but there’s no reason we should think it would work considering that most processes are highly detailed relative to the measurements we use to group them together.

      • Kyle C says:

        Daniel, I’m surprised to see you endorse Anonymous’s view that Prof. Gelman’s concerns are “positively misleading if not bizarrely stupid.”

        • Anonymous says:

          (a) you cut off the “if” part of that if-then statement, (b) Gelman largely avoids the problem with Bayesian multilevel modeling, (c) the claim was that in most instances everything turns on other factors. Bayesians often sneak those “other factors” in through the prior. For example, Gelman’s use of prior information about the size of a possible effect.

        • You’ve shortcut the argument WAY too much here.

          Gelman’s core point seems to be that statistically significant effects often aren’t statistically significant. In other words, it’s easy to be fooled by p values. He gives some reasons such as data dependent analysis, and those are good points, but they already jump the gun.

          My point (and I think Anonymous’ or at least one of his points) is that when you don’t have a process that is somehow stable and repeatedly spits out something that looks like a random number from a single stable distribution, the idea of a p value based on such a process not being meaningful because of data dependent analysis is already jumping the gun. the p value isn’t meaningful because it’s a summary of data that hasn’t happened, under assumptions that further data will have a certain distribution which is already highly suspect to begin with!

          Gelman addresses these concerns through A) use of Bayesian methods that take the data as fixed and use probability theory to describe the parameters, and B) using hierarchical methods that borrow information from related data and incorporate prior information about effect sizes, and so forth to reduce the chance that you’ll be fooled into thinking something big is going on.

          In other words, p-values are often problematic even before the garden of forking paths.

          • Another way to frame this is “variable effect sizes”. Whatever you see in the sample of mechanical turk responders to shirt color and menstrual cycles will give you some effect which is valid on average for those people, but in other populations there’s no reason to believe that the effect sizes will stay the same, so a p value for the mechanical turk responders is telling you not very much about the population of say perimenopausal women, married women with children, black women between 25 and 40 years old in the northeastern states… etc.

            p values assume a stable set of future values from a stable population distribution. that’s already suspect in huge swaths of research.

          • Kyle C says:

            So you agree: “All that multiple comparisons stuff is [a] joke…. If you [aren’t] dealing with a ‘collective’ [which is always,] then … multiple comparison, cherry picking, p-value fishing, garden of forked paths and all the rest of it … [are] positively misleading if not bizarrely stupid … crap.”

            • I think I’ve made my point clear in my own words in other comments. please don’t put words (especially ones with your own editorial bracketed [] remarks) into my mouth.

              If you want to substantively engage those other comments in my words I will consider clarifying them, unless you’re just trolling, which I’m not sure about at this point. putting words in others mouths is classic troll behavior. I like to think we don’t get much of that here.

            • Anonymous says:

              Kyle C,

              Let’s put it in concrete modeling terms.

              If you have to choose which variables to put in your regression model, in most instances you’re better off using prior information to make the choice rather than doing a test to see if the coefficient of the variable is “statistically significant from zero”.

              Similarly, if you have a blip in your data set, in most instances you’re better off using prior information to decide whether you’ll see that blip in another data set than you are using a test to determine if it’s “statistically significant”.

          • Yet another way to frame this is “representativeness of the sample”.

            If you have a finite population of things (such as say, menstruating women in the US) you can always clump them all together and say that they have a “distribution” of measurable outcomes.

            Taking some sample of this population and then calculating a p value for the outcome to be on average different from some null we’ll call 0 requires us to calculate

            1-p(s_1(d),s_2(d),…,s_n(d))

            where p is a distributional form, and s_i are sample statistics calculated from the data d.

            if the data are sufficiently representative of the whole of the population, and if the outcomes are relatively stable in time, then future samples d’ will have similar characteristics when they are also representative of the whole population.

            but when d is only representative of a small region of the total range of outcomes the p value you calculate is meaningless as a summary of what to believe about other future samples. Also, when the function p is a function of t as well (that is, the distribution of outcomes changes in time, as when for example fashions in shirt color change for our favorite hobby horse study) then the p value is meaningless for future samples as well.

            These are already true even before we get to the fact that your choice of model p, and sample statistics s_i might be conditional on the data.

            whoo. I hope that’s helpful for someone.

            • Kyle C says:

              Daniel and Anonymous,

              Many thanks for engaging.

              I submit that it’s a bit much to accuse the person who questioned the use of terms like “joke,” “bizarrely stupid,” and “crap,” in direct reference to our host’s published lexicon, of trolling. You’ve now made your points temperately.

    • jrc says:

      Anonymous,

      I’m sympathetic to this argument, but I also think it is arguing a different point than the one our host is making.

      “Any effects observed are real in that data set.” This is true if we are looking at, say, point estimates of differences in means between two groups. That difference IS in the data, no question about it. But we want to know two further things, and that is where I think your argument misses something crucial.

      First – we want to know something like “is a difference in means that large likely to have come from chance alone under the assumption that there is no real difference between the groups.” So call that a p-value. I think it is very wrong to think that a p-value of less than .05 means anything vaguely resembling proof of non-random differences in means. If you look at the history of inference strategies in econometrics, you see a push towards continuously relaxing assumptions on standard error calculations, and that the new standard error estimators preferred in the literature generate substantially larger SEs (and thus larger p-values) than simple-random sampling p-values produce. My point is that you can very easily believe that some effect is not actually “in the data” and that the authors calculated lousy standard errors.

      Second – assuming that we have an effect in the data that appears to be too large to be due to chance, then do we think that effect is “in the world” or just “in the sample”? And this is where I think you make a good point. The ideas that a sample is “representative”, that results are stable across time and place and people, and that there is some “collective” or underlying data generating mechanism behind all effects we see in some particular dataset does seem epistemologically dangerous to me.

      If you were to ask me, I’d say that our host worries a lot about both points, but isn’t always clear on which one he is talking about at a given time. I think the first point means we should worry about what the data is telling us (p-hacking, things like that). I think the second one means we should worry about our conception of the world. I think the “garden of forking paths” problem actually relates to both – it’s a description of the way we get bad p-values by having a bad metaphysics/meta-statistics. Because we believe in the “collectives” we find ourselves likely to believe (well or poorly calculated) p-values. That said, I still struggle with the deep meaning of the Garden paper, but probably because I haven’t read it as closely as it deserves yet.

      All this somehow relates to the metaphysical differences that motivate “classical/parametric” inference approaches and “randomization test type” inference approaches… but that’s probably for a different thread.

      • Anonymous says:

        “is a difference in means that large likely to have come from chance alone”

        Those words, as usually interpreted, presume you’re dealing with a ‘collective’ or a physical data generation mechanism which throws off stable frequency histograms.

        Note statisticians almost never verify this assumption. Modeling a data set complete with some statistical “tests” doesn’t verify the physical assumption being made here. You can ALWAYS find a “good” model for a data set regardless of what physically produced it. So producing a “good” model doesn’t tell you whether it’s a ‘collective’.

        If the ‘collective’ assumption can be checked at all, it’s done by scientists conducting experiments to check that physical assumption. Usually, the statistician isn’t even close to being in a position to do those experiments. Even if they were, they’d typically give a negative result.

      • The “likely to have come from chance alone” part is problematic. Let’s use an econometrics type example:

        We have 1 million households in some county and we select some kind of sample of a few hundred of them and we ask them about their income last year for example. We find that in the sample households that identify as “asian” have higher incomes than households that identify as “latino” for example. We want to know if this “could have come from chance alone”. I can interpret that in several ways:

        1) We only have a sample from the population, so whatever method we used could have given us a sample that is different from the population. When we have a simple random sample from a finite population we can at least calculate the frequency with which a given sample would have certain sized deviations from population values. But quite often we don’t have a truly random sample, so what then?

        2) Incomes change through time, perhaps future populations will have more similar incomes. Often we’re actually modeling timeseries so we may be taking this into account, but there are plenty of non-timeseries analyses where stability in time is assumed (like for example effectiveness of a drug at regulatory approval time vs antibiotic resistance in the population of bacteria in the future)

        3) Changes from place to place, we may wish to generalize our sample from a California county to other populations of different counties, different states… sometimes it’s not so obvious that this is a problem. In econometrics perhaps it’s more common to be aware of this, but what about medical treatments from one hospital to another, behavioral issues from one population of criminals to another, physical characteristics of earthquakes from one fault to another? It’s very common to either pool things together or extrapolate from one pool to another pool where implicitly the extrapolation process is to assume that the extrapolated population is the same as the sampled population.

        4) Measurement issues, through time, place, and group: maybe people in some groups/times/places are more available/honest/willing to participate than others, maybe different studies use different granularity in their measurement outcomes (bucket sizes for income groups for example). etc. Basically Measurement error can vary and we need to have an explicit model for that.

        Each of these is a different kind of variability of a sample from a population. Randomization tests for example deal primarily with the issue of “random sampling variability” assuming that the sample is representative, whereas they can’t really say as much about measurement error, non-representativeness, and non-stationarity.

        Failing to acknowledge that all of these different kinds of uncertainty exist, and treating everything as if it were mainly an issue of random sampling noise is one of the big issues in reproducibility of scientific claims.

        • jrc says:

          Daniel (and Anonymous),

          I think we are saying similar things, but I’m probably doing it poorly. And I also think that, in general, I’m asking something different of the data itself than a lot of people are.

          So let me dichotomize your example of hispanic/asian income into the two parts of “inference” I discussed above, but simplify it by just comparing “asian” and “other”.

          First – suppose I take that data, randomly grab everyone’s race and re-assign it to someone else, and compute the difference in mean income between “asian” and “other”. Then I do that 20,000 times or something, look at the distribution of my estimated effects, and compare my “real” point estimate (the one from the actually recorded assignment of race) with my “placebo” estimates. If my real effect falls in the tail of that distribution of placebo effects (at whatever level you want), that is what I mean by “unlikely to have come from chance alone”. (Note: this is a simplification, and obviously ignores a lot of important things, but I think randomization tests of this sort can provide useful ways of thinking about the problem, and I’m happy if people use conservative standard error/p-value estimates that, in similar settings, have been shown to produce appropriate coverage rates).
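The shuffling procedure described above can be sketched as follows. This is only one way to write it, under my own assumptions: the function name, the argument names, the two-sided comparison, and the add-one correction in the p-value are illustrative choices, and `incomes` and `labels` stand for whatever parallel lists of recorded incomes and race categories one has.

```python
import random
import statistics


def permutation_p_value(incomes, labels, group="asian", n_perm=20000, seed=0):
    """Two-sided permutation p-value for a difference in mean income.

    Compares the mean for `group` against everyone else, then re-assigns the
    recorded labels at random n_perm times to build the "placebo" distribution.
    """
    def diff(lab):
        in_group = [x for x, l in zip(incomes, lab) if l == group]
        others = [x for x, l in zip(incomes, lab) if l != group]
        return statistics.fmean(in_group) - statistics.fmean(others)

    rng = random.Random(seed)
    observed = diff(labels)          # the "real" point estimate
    shuffled = list(labels)          # copy so the caller's labels are untouched
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)        # randomly re-assign race across people
        extreme += abs(diff(shuffled)) >= abs(observed)
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

The function returns the observed difference and the share of placebo differences at least as large in magnitude. As the comment itself says, a small value here speaks only to variability within this data set under label shuffling; it says nothing by itself about representativeness, stability, or measurement.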

          Now – that tells us absolutely nothing about the nature of the world. It tells us only about the data itself and its variability. So suppose I want to know about the actual population itself.

          Second – we want to know how well we think the result we found above maps on to some real feature of the world. Now we have all the problems you discussed above – stability, representativeness, measurement, etc. And a whole host of other questions about how we interpret that result – is it due to some omitted variable that is doing the real work, is it an artifact of how people report race, is it about non-representativeness of response, is it just an unlikely sample, is it…whatever. I don’t think statistics helps us much on this. Repeating and replicating, and good old fashioned reasoning and critical thinking, are what are required here.

          The point I was trying to make originally is that I think, conceptually, we can benefit from separating these two aspects of statistical inference: one about the data itself, and one about the relation of the data to the world. I think your points are mostly about 2, and that is where I most agreed with Anonymous (and agree with you). But obviously 1 is still important, even if nowhere near sufficient for “proving” (or even providing a basis for believing) that the effect observed in the data is a meaningful feature of the world itself.

          PS – I have a Go book for you.

          • jrc: I think your calculation has value, but I actually need to think about why. It’s not totally obvious. Consider

            1) If you’re interested in whether there’s something *in the data* you can just look at the data. Did you see larger income in “Asian” populations? If so, then within that sample of people… that’s there in the data.

            2) If you’re interested in sampling variability, then it’s not clear that randomly shuffling the race column in the data tells us anything about what other samples from the population might look like, since it’s unlikely that this random reassignment will mimic any feature of random sampling from the population. However, *perhaps* this uniform random reshuffling is like a maximum entropy alternative to resampling from the original population, since it treats all possible sequences containing the same number of each race category as equally possible. (that’s an interesting perspective but it’s not a very well developed idea)

            I think your issues about representativeness and so forth are good points, but I *do* think Bayesian models can help us deal with representativeness issues. Combining data from multiple sources where the sampling mechanisms are different to learn more about the population model is relatively straightforward within the Bayesian framework.

        • Anonymous says:

          “from chance alone” erroneously implies that there’s a well defined model one could refer to as the “by chance” when in reality, unless one is doing a randomized experiment or has a well-defined natural experiment, there’s not just a single stochastic model that one could define as _the_ “by chance” model.

          Even within resampling / bootstrap-style tests, you could potentially have different resampling processes which capture different correlation structures. There’s not a unique “bootstrapped null distribution”.

          This is why I wish phrases like “likely to have come from chance alone” or even “it could just be random” were eliminated from academic discourse, at least in contexts outside of randomized experiments.

    • Rahul says:

      @Anon

      I agree with your comment mostly. But you say:

      “Acting as if every repeatable event is a ‘collective’ when they rarely are is the source of all the problems. It’s the original sin of classical statistics.”

      So what’s the alternative. Not being snarky. Genuinely curious. If you don’t want to use the simplistic “collective” assumption, what’s the alternative?

      Indeed the million dollar question is that of external validity, whether the blip you saw in this dataset was an artifact or a real trend that extrapolates. But how does one address that question?

      Is the alternative to say ignore statistics & trust domain knowledge?

      • Anonymous says:

        The alternative is to use distributions to represent the range of values considered in a problem, only one of which will be realized. That range is our “uncertainty” about the one value that is realized. Specific conclusions about the one value realized are warranted only to the extent they hold for almost all values in that range considered.

        The range of values considered should be big enough to ensure, based on what’s known or assumed, it includes the one value realized.

        This doesn’t even require repeatable events let alone a ‘collective’. On the other hand, if the values you’re interested in are frequencies, then just put such a distribution on the frequencies.

        If you’re interested in an honest-to-goodness ‘collective’, then it is modeled as a sharply peaked distribution on frequencies (the concentration of the distribution on frequencies reflects its stability, hence appearing to be a ‘collective’).

        Not only is there an alternative, but it includes ‘collectives’ as well as non-stable frequencies as special cases.
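One way to picture the "distribution on frequencies" idea is the small sketch below. The Beta family is my assumption for illustration only; the comment above does not name any particular distribution, and the parameter values are made up. Draws from a sharply peaked distribution give nearly identical realized frequencies (collective-like behavior), while draws from a diffuse one do not.

```python
import random
import statistics


def realized_frequency_spread(a, b, n_draws=5000, seed=3):
    """Mean and standard deviation of frequencies drawn from Beta(a, b).

    Beta(a, b) is a hypothetical stand-in for a distribution P(f) over a
    frequency f between 0 and 1.
    """
    rng = random.Random(seed)
    freqs = [rng.betavariate(a, b) for _ in range(n_draws)]
    return statistics.fmean(freqs), statistics.pstdev(freqs)
```

For example, `realized_frequency_spread(500, 500)` describes a sharply peaked case in which almost every realized frequency is close to one half, while `realized_frequency_spread(2, 2)` has the same center but a far wider range of possible frequencies.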

        • Rahul says:

          Sounds great. But in practice, how often do we have enough information to know this distribution?

          Maybe it’s easier to talk specifics? Say, take the much-criticised Tracy-Beall pink=fertility paper or Bem’s ESP results or some other example you might prefer. Can you outline what the alternative approach would look like?

        • Nick Menzies says:

          To me it seems like you are describing a hierarchical model with only one observation at the cluster level, in which case everything rides on the prior for the cluster-level variance. To me this sounds like a more accurate approximation of the world.

          One argument is that this is how any savvy consumer should interpret research evidence — though the researchers claim that they are giving you the global mean and sd (and implying that this is what you should want), what they are actually giving you is the cluster mean and sd, and what we want to know is the mean and sd for a different cluster (as we are generally interested in generalization to some related-but-different population). You are suggesting the researcher should do this extra step before they give you the answer, but I think current practice is for the (savvy) consumer to do it themselves, not that practice shouldn’t change.

          This seems similar to a recent post by Thomas Lumley: http://notstatschat.tumblr.com/post/95710047826/taking-meta-analysis-heterogeneity-seriously

          Also: why a sharply-peaked distribution?

          • Anonymous says:

            “Also: why a sharply-peaked distribution?”

            If you have a sharply peaked distribution P(f) for the frequency, then almost all possible realized frequencies will be approximately the same (hence the idea of a collective).

            For example, Bernoulli’s weak law of large numbers derived a P(f) which was sharply peaked under certain conditions.

      • Anonymous says:

        This alternative is implicitly used all the time. We just need to see it for what it is.

        Here’s a paradox for you: how can statisticians get away with assuming NIID for errors when they are almost never normally distributed (in a ‘collective’ sense)?

        Answer: because while that Normal distribution fails to describe the stable histogram of long range errors (they typically don’t even have a stable histogram!), a NIID often does determine a reasonable range of possible errors for the one set of (unknown) errors realized in the data.

        If the one set of errors realized in the data is in that range considered, and our inferences are true for almost every set of errors in that range, then they’re likely true for the (unknown) realized errors in the data.

    • Anonymous says:

      From page 141 of von Mises’s book:

      “Mass phenomena to which the theory of probability does not apply are, of course, of common occurrence. In other words, not all repetitive events are collectives in the theoretical sense of the word.”

  3. Daniel Wright says:

    EJ et al’s paper is currently not linked on his website. It states he is awaiting permission to put it there.

  4. Chris Crandall says:

    This mixes a fairly weak argument with a lame swipe at a couple of prominent scientists. It’s bad rhetoric and unlikely to win anyone over to the ideas. The only evidence stated in the argument is that Hal Pashler and Christine Harris made a presentation at a conference that showed they couldn’t replicate some effects. That’s not strong evidence of . . . anything.

    Better arguments can be generated, and some of them statistical.

    • Andrew says:

      Chris:

      1. Too bad this is bad rhetoric. As a scientist, I’ll try to get as close to the truth as possible. I’ll let others deal with the rhetoric, persuasion, etc. It’s hard enough to try to get things right, without worrying about rhetoric.

      2. Turing and Kahneman both made big contributions to science, almost certainly much bigger than anything I will ever do. And I’m not criticizing them (or even lamely swiping them) for believing in ESP and social priming. What I am criticizing them for is their insistence that the evidence is “overwhelming” and that the rest of us “have no choice” but to accept these hypotheses. Both Turing and Kahneman, great as they are, overstated the strength of the statistical evidence.

      And that’s interesting. When stupid people make a mistake, that’s no big deal. But when brilliant people make a mistake, it’s worth noting.

      • Anonymous says:

        May I be permitted to quote the greatest philosopher of science that ever lived (C. Truesdell):

        “The mistakes made by a great mathematician are of two kinds: first, trivial slips that anyone can correct, and, second, titanic failures reflecting the scale of the struggle which the great mathematician waged. Failures of this latter kind are often as important as successes, for they give rise to major discoveries by other mathematicians. One error of a great mathematician has often done more for science than a hundred impeccable little theorems proved by lesser men. Since Newton was as great a mathematician as ever lived, but still a mathematician, we may approach his work with the level, tactless criticism which mathematics demands.”

        • D.O. says:

          Does he give any examples? What were the “titanic failures…as important as successes”? Fermat’s last theorem? Hilbert’s axiomatization program?

          • Anonymous says:

            The quote is from an article about Newton’s legacy and effect on what came later in a book on the history of mechanics. The greatest mathematicians of the next 150 years spent an enormous amount of time correcting problems in Newton’s work (large chunks of which are inspired but nonsense).

            • Jake says:

              You’re going to have to be more specific. The Principia came out in 1687, 150 years after that is 1837, which gets us Gauss and Euler and Laplace and Cauchy and even Galois was five years in the ground when that window expired. (And also some Bayes guy that y’all might have heard of).

  5. Erin Jonaitis says:

    One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise.

    I think we might have a couple kinds of selection here. One: if a given researcher thinks that there’s a sizable chance that a finding is noise, he or she may be averse enough to being embarrassed later to try a few things before publishing it — replicating it, for instance, or extending it. Having done those things, if the finding survives, the researcher is likely to be more confident in it. Two: given (a) the hiring and promotion metrics in academia and (b) the surfeit of researchers compared to jobs and funding opportunities, the population of researchers is probably skewed toward people who publish first and question rarely. So the most questionable findings are probably either circular-filed or published by people who will go to the mat for them.

    It’s also possible you would get more satisfaction if you pursued these conversations privately instead of in your blog. I mean, I know that’s not how science is supposed to work, but scientists are people, and they want to save face, and when you get in these scrapes I don’t often see that you’ve given them a way to do that.

    • Andrew says:

      Erin:

      I could contact the scientists privately, but (1) typically that doesn’t work, people are often just as resistant to admit mistakes in private, and (2) I’m usually not so concerned with these individual scientists, I’m more interested in reaching 10,000 people on the blog, and maybe some of these readers will be motivated to be more aware of uncertainty in their own work.

      • Erin Jonaitis says:

        Yeah, it depends on what you’re trying to do. If your goal is to make an example of these people, absolutely, no point in warning them about it first! But in that case you really can’t expect a response that’s any different from the one you typically get. And I must say in that case it strikes me as rather disingenuous for you to complain about it! If this whole line of blogposts is basically Scared Straight for Scientists, an admission of error from the targets isn’t necessary. I’d even surmise that the more recalcitrant the researcher, the better!

        • Andrew says:

          Erin:

          I’m never quite sure what “disingenuous” means, but, believe me, I’d be happier if the researchers would admit their mistakes. This would make them even better examples!

          • Rahul says:

            Why is it so important (to us) that they admit their mistakes? In the larger picture. Say someone publishes something sensational yet deeply wrong.

            And then someone like Andrew critiques the work and posts it on a medium with wide reach like a blog. Let’s assume Andrew is right about that work being wrong. And his argument is compelling and most readers are convinced.

            At that point isn’t the major work done? We’ve corrected an obvious misconception in a large chunk of people. What’s the additional benefit (to the community) of the original author admitting he was wrong?

            • Anonymous says:

              Closure. That’s why.

            • Phil says:

              The “major work is done” when ALL scientists are willing to admit their mistakes and, perhaps equally important, to be less sure of themselves in the first place and thus able to keep an open mind when evidence is suggestive but not definitive.

              Honestly, I find it baffling that people like you and Erin are evidently unwilling to distinguish between (a) trying to get specific scientists to correct their mistakes and (b) using the behavior of specific scientists as examples of a much larger problem that all of us should be aware of in our own work as well as in others. I refuse to believe that this point is beyond you — I am sure you are _able_ to make this distinction. Why won’t you do so? Why not take Andrew at his word, that he is trying to improve the practice of science, and of scientific communication?

              • Rahul says:

                No, I’m sincerely confused about what you are trying to say here.

                Ok, I’m totally with you that one should own up to one’s mistakes, admit them truthfully etc. as a matter of principle. OTOH pragmatically, I don’t care at all whether (say) Tracy and Beall issue a communique admitting they are wrong or not so long as the world does not start actually believing that the pink-fertility correlation is some fantastic new insight that’s factually true.

                It matters more to me whether the truth prevails than whether some individual scientist refuses to formally own up to his mistake out of malice, laziness, incompetence, etc.

                If 900,000 students & researchers are convinced that the Tracy-Beall result is an artifact, yet Tracy and Beall continue to steadfastly refuse to acknowledge it, I sure can live with that.

              • Erin Jonaitis says:

                I absolutely see the distinction. That’s what I was trying to get at in my second comment above. I trained as a psychologist originally and the problem Andrew keeps railing about is real and is what inspired me to move into applied statistics instead. Where I sort of part ways with him is in the way he keeps hammering at particular people and demanding they admit error. If the real goal is making the specific individuals in question change their ways, this is an ineffective way to do it. It’s sort of inconceivable to me that someone of Andrew’s stature would not know that, but since he keeps repeating the same refrain of “why won’t they just cry uncle already” in several contexts, I thought it was worth a brief attempt to explain why.

                If, on the other hand, the real goal is changing the practice of science — which I think is more likely, and which is a goal I support — then the hammering on specific individuals in such a public way feels either like gratuitous meanness or like demagoguery. Like, I think you can make an example of people without coyly wondering aloud why they’re not returning your calls. I guess whether the demagogue approach will improve the behavior of the other scientists who are watching the circus is an empirical question.

              • Andrew says:

                Erin:

                I’m not trying to get anyone to cry uncle. It’s not a wrestling match. There’s no reason why admitting a mistake should feel like losing a fight. When someone points out an error of mine, I don’t cry uncle or feel like I’ve lost. Instead, I admit the error, thank the person who pointed it out to me, and I move on. I learn a lot from my mistakes. When people don’t admit mistakes, or when they think that admitting a mistake is like “crying uncle” and losing a fight, then I think what they’re really losing is an opportunity to learn. Which I find particularly sad for young people who are just starting their careers. I don’t think it’s mean to inform people of their errors, not at all.

                Also, I’m not quite sure what you mean by saying “coyly wondering aloud why they’re not returning your calls.” If you’re referring to this example (in which I reported that a researcher did not respond to a message from me, even after I learned that he’d received it), yes, this bothered me. Scientific exchanges via blogs and published papers are fine, but sometimes a discussion can move much faster via email. There was nothing coy about my frustration.

                As Phil writes, sometimes you have to take what I write at face value. I don’t feel I’m in a fight with people, it’s about science. I don’t feel I’m in a fight with David Brooks either, I just would like him to correct the errors that he’s published in national media. To me, admitting error is so central to learning and understanding that I’m bothered when people don’t do it. Even though I understand that people have many motivations to not admit their mistakes, I still hate to see that behavior coming from anybody.

              • I understand Andrew’s motivations, and they are laudable. I make mistakes all the time and understand the value of learning from them.

                But as far as getting other people to admit and understand their mistakes, and to act on this new understanding in future work, the most effective way to achieve progress of the type Andrew wants is to publish papers that do the research in question properly and show that, indeed, the original claims were misguided. That will get the attention of the authors in question.

                For example, Andrew could easily collaborate in an experiment of the type that Tracy and Beall did, perhaps with Tracy and Beall themselves. They’ll learn something, and the best practice will gain an even wider audience. Most of all, there will be real change.

                If Andrew were to take on higher-value targets, say in medicine, the impact would be even bigger.

              • Rahul says:

                Isn’t it possible to learn from mistakes without admitting them? Say, Tracy & Beall realize Andrew’s point. And they’ll pre-register all their future studies. And use hierarchical models instead of NHST. And use higher power studies with larger samples. Of not just students. And within-person designs. And better definitions of fertile windows.

                Now, if they did all that, presumably they did learn something from Andrew’s critique. But they may still not have publicly admitted their mistake.

                Does it matter much?

              • Andrew says:

                Shravan:

                I understand what you’re saying but my own work hours are limited. It takes much less time and effort to explain what is wrong with a study and suggest improvements, than to perform a new study.

                Indeed, one of my criticisms of Tracy and Beall, and other researchers in that field, is that they’re trying to do science on the cheap. For reasons that Button et al., Loken, and I (and others) have discussed, it’s essentially hopeless to study small within-person effects using crude measurements and between-person comparisons.
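
                A minimal simulation sketch of why this is hopeless, under made-up numbers: a true effect of 0.05 standard deviations, a between-person comparison with 50 people per group, and a two-sided test at the 5% level. None of these values come from the Tracy and Beall studies; they are assumptions chosen only to show how the significance filter behaves when the effect is small relative to the noise.

```python
import random

def significance_filter(true_effect=0.05, n_per_group=50, trials=20000, seed=1):
    """Simulate estimates of a small effect and keep only the 'significant' ones.

    Returns (power, exaggeration, wrong_sign):
      power        - share of simulated studies reaching |estimate| > 1.96 * se
      exaggeration - mean |estimate| among significant results / true effect
      wrong_sign   - share of significant results with the wrong sign
    """
    rng = random.Random(seed)
    # Standard error of a difference in means, outcome sd = 1 (assumed).
    se = (2.0 / n_per_group) ** 0.5
    significant = []
    for _ in range(trials):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > 1.96 * se:
            significant.append(estimate)
    power = len(significant) / trials
    exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return power, exaggeration, wrong_sign

if __name__ == "__main__":
    print(significance_filter())
```

                Under these assumptions, only a small share of studies reach significance, and the ones that do overstate the effect several times over and sometimes get its sign wrong.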

                I myself am not particularly interested in the effects of ovulation on clothing choice. But I have repeatedly recommended to Beall and Tracy that, if they are interested in the topic, they perform careful within-subject studies and not fool themselves into thinking that they can learn anything from the statistically significant comparisons that they’ve seen so far. So far they have not been interested in this advice. But maybe others working in this field will be interested. As we’ve discussed in this thread, it’s a hard sell because what I’m basically saying is that researchers should be working a lot harder in their design and data collection.

                Anyway, if anyone wants to do these studies, they can feel free to do so. They don’t need me. I’ve made my contributions by working out the statistical issues. I’m a firm believer in the division of labor.

                Finally, I agree that medical studies can be more important than psychology studies. I just don’t know any medicine so it’s harder for me to be useful here. In contrast, my expertise in social science helps me out in understanding and interpreting social science research. Again, it’s the division of labor. The statistical principles are transferable, though, and I think that some people working in medical research are reading my articles and books and that this affects their practice.

              • I guess it doesn’t matter if they don’t admit their mistake, just act on their new understanding of all their mistakes. We’ll just have to wait and see whether they do that. If they do their study properly and find no effect, probably no journal will be willing to publish their null result (unless they want to try a psycholinguistics journal), but they can put it up on their home page and everyone will know they fixed their error. If they end up getting a red-means-ovulation result even in their properly done study, they will either mistakenly think Andrew’s wrong about his criticism, or do the study correctly in future. If they go down the right path in each case, Andrew’s work is done. They never have to come out and say, guys, we messed up in our earlier work.

                But Andrew should work with them in this attempted replication, to pre-empt later objections from him that they didn’t do this or that right.

              • Andrew:

                I understand your point about having limited time. Your point is very clear in your blog posts: one should do experiments properly. Your blog has been very effective in changing at least *my* research strategy, if not Tracy and Beall’s. However, I can afford to be as careful as I want, as I am a tenured professor. I can also afford to stop doing research and just study statistics for a while (I’m like Erin in that respect).

                But what about my students, and students in general? They will almost always find themselves in a situation where the study falls short of being perfect; this is a function of the limited time they have to do their PhD in. If they don’t publish their work, that’s it for them. They can’t afford to work forever on their PhD. Beall is or was a grad student doing his PhD with Tracy when the first paper came out. It could be they don’t have the resources [time, money] to do it exactly right. Also, they don’t have your statistical knowledge, so they probably don’t even know how to do it right. All this could be why they [apparently] dismiss your objections.

              • Hey! Wagenmakers et al are already on Tracy and Beall’s case:

                https://osf.io/973ej/wiki/home/

                The replication attempt is ongoing. I’m going to start pre-registering my own studies from now on.

              • Erin Jonaitis says:

                Sometimes the nesting limit is a hard master.

                Andrew, I’ve been trying to take you at your word, but I am apparently failing at deciphering it. I certainly do believe that at bottom your concern is about science. And I do not at all think it is mean to inform people of their errors! — I think it’s a kindness, and agree that the proper thing to do when informed by others is to admit your error gracefully and try to do better next time. That’s how I try to roll, anyway. On the basics we agree.

                Where we seem to disagree is on our understanding of human nature. Suppose we are standing at a cocktail party and I notice your fly is down. Imagine the following actions I could take.

                (1) I pull you aside and let you know privately that your fly is down.
                (2) We are standing in a circle with others and I say the same thing at normal volume.
                (3) We are standing in a circle and I step back so I can pull out my smartphone and blog about your fly.
                (4) I submit an essay on your zipper crisis to a prominent national magazine.

                If I go with Option 1, I am safe: everyone who survives middle school learns that no matter how much you’d like to blame the messenger, the messenger has actually done you a huge kindness in the least-damaging way possible. If I go with Option 2, the graceful response would be for you to respond exactly the same way, but you might not — it is odd enough behavior that you might decide you can get away with distracting the onlookers from your goof by pointing out that it was sort of rude of me to call attention to you in this way. The pragmatics of the other two actions I could take, in my opinion, almost force you to comment on that aspect. If you just say “oops, thanks,” you leave unquestioned the implicit assumption that your transgression was so bad it warranted national attention.

                With Beall and Tracy, as far as I can tell — and please correct me if I’m wrong [1] — you jumped right to Option 4. I enjoyed your Slate piece; I think it said important things; I shared it with my friends; but I do not really understand how you expected a different response from the authors than the one you got. That’s why I wondered whether your surprise wasn’t feigned for effect. What experience have you had with people that tells you to expect better behavior? (Like, clearly if that’s what your experience tells you about human nature, thou hast well pleased the God of Committee Assignments.)

                I will say that I think it critical for the progress of science that we create a culture in which people have the comfort and confidence to respond gracefully to quasi-public criticism, such as my action 2 above (which I conceive as a rough parallel for critiques published in the scientific record). I am not really sure how to get there from here. I don’t think we’re there yet. I suspect preconditions for cultural reform may be (a) fixing the resource scarcity problem somehow and (b) getting rid of the apprenticeship system of scientific mentoring, in which students’ learning of scientific norms is filtered through the behavior of their advisers. I am sure you’re right that Beall has learned some pretty poisonous and sad things from this episode, even though his adviser probably felt like she was protecting him.

                [1] Hee.

              • Andrew says:

                Erin:

                I wouldn’t say I was surprised by Tracy and Beall’s reactions (or, for that matter, by the reactions of David Brooks when his mistakes were pointed out). Still, I was disappointed. Human nature disappoints me sometimes.

              • Dan Wright says:

                Hi Erin,

                I thought the zipper analogy was good, but there are a couple of differences. Is how the person acts about their zipper important for which option to take? I agree that if someone is not publishing the state of their zipper in high profile journals or doing press releases about it (and most don’t), then the quiet whisper is probably appropriate. Also, with any of the four options the person would likely admit their zipper was down.

              • Rahul says:

                Andrew:

                Just as you were disappointed at Tracy & Beall, I would be disappointed at anyone who chooses Option #4 in Erin’s excellent analogy as the first course of action.

                Can you explain why you thought direct public shaming was the best course of action? What about, say, Erin’s alternative #1 do you not like?

              • Following up on Erin’s four-choices comment, are you 100% sure that option 4 was the right thing to do? I would have expected you to follow your own advice and respond that, yes, I agree that I might have taken the wrong option out of those four, I’m not sure if I did the right thing. But that was not your response, and it’s because it would involve loss of face. It’s hard to say, yeah, I might have made a mistake. Maybe you think that privately.

                It’s truly admirable that you have no problem acknowledging your errors (or uncertainty about your position/conclusions) in a scientific paper, but it’s different in real life, even for you. That’s OK because it’s human. But it’s important to realize that you are also susceptible to the same loss-of-face problem. Another example was the ethics column you wrote in Chance, in which you attacked a statistician for not sharing his data (which is fine), adding the gratuitous comment that he didn’t even have a PhD (which is not fine). When he replied in a letter to Chance, the right thing to do would have been to apologize to him, and to just reiterate your main point about the importance of sharing data. But that would involve loss of face, and so that’s not what you did; instead, you stood by your comment that not having a PhD (never mind that he had decades of experience behind him) made him somehow not up to the task of analyzing his data.

                Of course, as you said elsewhere, what I’m saying is unfalsifiable, since you can always claim that your action was the right one in each of these cases. But we do what we do buffeted by noisy decision criteria that don’t always have the best rational motivation. But once we’ve made a decision we just defend it post-hoc as if it was the only rational action possible.

              • Andrew says:

                Rahul:

                I did not “publicly shame” Tracy and Beall; I pointed out a statistical error they made (which is made by many other people, most notably Bem). When people point out my errors, they’re not shaming me either. Learning from errors is, in my experience, central to science. There’s no shame in making a mistake or in exhibiting a common misunderstanding.

                Shravan:

                In my article, I did not write that “not having a PhD (never mind that he had decades of experience behind him) made [the statistician] somehow not up to the task of analyzing his data.” What I wrote was that their analysis was flawed (regardless of the qualifications of the analysts) and that “the lead researcher and his statistician should have realized that, given their lack of expertise in statistics, it was at least plausible that an outsider could improve on their analysis.”

                Had the statistician had stronger academic qualifications, I think it still would’ve been a mistake for him to not accept an outside analysis. In that case it could’ve been an error of overconfidence; perhaps a Ph.D.-level statistician could’ve thought of himself as such an expert that he would not accept the advice of outsiders. You refer to his “decades of experience,” but it is a flaw to think that just because you’ve done something a certain way for many years, that you’ve been doing it right. Whatever qualifications this statistician had, he and his colleague made a mistake by not sharing their data; I felt that the particular details of their academic background were helpful in understanding the sort of mistake they were making.

              • Erin Jonaitis says:

                I fear we are going in circles.

                Some time ago a friend shared with me the text of a talk one of her math professors had given called “The Lesson of Grace in Teaching.” I reread it from time to time when I am feeling low. I think both Andrew and the Beall-Tracy team could benefit from the ideas it contains.

                http://mathyawp.blogspot.com/2013/01/the-lesson-of-grace-in-teaching.html

                Key quote: “Grace is precisely what makes hard conversations possible, and productive, between people. But you have to extend the grace first.”

  6. Anonymous says:

    “Things feel a bit different in 2014 after several decades of merely incremental improvements in physics.”

    Things feel a lot different in 2014. Those decades of “merely incremental” improvements in physics have effectively ruled out a lot of nonsense: http://blogs.discovermagazine.com/cosmicvariance/2008/02/18/telekinesis-and-quantum-field-theory/#.VAcrLWBdVph Furthermore, if ESP is real the experiments parapsychologists do aren’t capable of showing that it’s real anyway (for reasons mentioned in that Wagenmakers et al paper which has been taken down). The ESP researchers are doing futile cargo cult science.

  7. Peter Dorman says:

    I realize this is not a post about ESP per se, but (implicitly) about the appropriate response to statistical evidence for implausible effects: question the evidence that much more rigorously. Agreed.

    But since you brought up ESP, here’s a gripe I’ve been carrying around: tests of the presence of such mechanisms conflate three issues: whether ESP exists at all for anyone ever, whether it is a general phenomenon detectable at a population level, and whether it is responsive to the experimental conditions intended to elicit it. I’m convinced that the evidence shows that #2 and #3, at least taken together, are false. Case study evidence, however, is sufficient to establish #1. There can be non-sensory perception that is infrequent and uncontrollable — I’m pretty sure there is. From a philosophy/sociology of knowledge point of view, what I find interesting is the near-universal assumption that if an effect exists it has to be evident at a population level. It’s very important, of course, to identify population characteristics that can be detected via sampling, but in some contexts idiosyncratic outcomes also matter. Think of genetic mutations, individual students responsive to teaching methods that don’t work “in general”, and, here, ESP. Statistical evidence simply isn’t appropriate for capturing one-off’s, and the absence of such evidence doesn’t have much bearing on whether such one-off’s are valid.

    I bring this up because I think idiosyncratic effects or outcomes of importance are ubiquitous in most fields and are not given enough attention because of a bias toward population-level evidence. Of course, common phenomena that we can identify through sampling tend to be relatively consequential simply because they are common. It’s a question of balance.

    My apologies for taking this thread in an unintended direction.

    • Jonathan (another one) says:

      But “infrequent and uncontrollable” are going to require a huge number of observations to separate from chance. And we’re back to the discussion of the hot hand.

      As to Kahneman, I think he would say that the quote describes why statistics is so hard to teach and truly understand, but that published peer-reviewed articles are supposed to be immune to that particular cognitive bias. Surely you would grant that conditioning on peer review has *some* effect, even if it isn’t large enough to overcome a bad prior based on faulty statistical reasoning.

      • Peter Dorman says:

        Jonathan, I think you’re repeating the assumption I’m calling into question. Indeed, idiosyncratic events *are* the product of chance at the population level; that’s the point. A mutation is a chance event. A student responsive to a teaching method that does not produce discernible results over a population is a chance event. My proposition is that ESP can be (and I think probably is) a chance event in exactly the same sense. This is not a criticism of statistical methods but of applying them in contexts where they have nothing to contribute. Chance events must simply be documented on an individual level, and peer scrutiny can be applied to the adequacy of this documentation.

        • It’s fine to say that there are chance events which cause particular people to be in some sense consistently responsive to some kind of ESP. It’s another thing entirely to say “occasionally some people are randomly responsive to ESP”. Is there *any* way to distinguish between “randomly have a correct response due to ESP” and “randomly have a correct response, but not due to ESP” ?

          You have to show that ESP produces in some sense “a different kind of randomness” than non-ESP randomness. Furthermore, this is a very very good example of needing some theory with pre-registered predictions and data analysis. Because for a claim to be scientific, it HAS to have some *specific* prediction, predicting that “there will be a difference in outcomes between different subgroups” isn’t a specific enough prediction.

        • Jonathan (another one) says:

          Maybe I misunderstood your “infrequent.” If you mean there are few people with ESP, let’s say, five in the world, then the process of proving ESP is to advertise and test the people who respond, then winnow down that group and keep going until there is a subset of people who continually manifest whatever this effect is. (Note that this is actually what was done in the ’40s by Rhine, and the failure to get the effect to display itself continually was what started the fruitless search for experimental condition constancy.) In that case, you will have proven your case in entirely standard statistical ways. If the “infrequent” applies to an individual person, then you’re back to the complaint I made and that Daniel Lakeland reiterates. When Joe says he has ESP but says it only happens on one out of every hundred times I predict the next card, well, ha.
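
          To put a rough number on Joe's case, here is a back-of-the-envelope sketch using the standard normal-approximation sample size for a one-sided test of a proportion. It assumes a five-symbol deck (chance hit rate 0.2), that ESP "works" on 1 guess in 100 and Joe guesses at chance otherwise, and a 5% test with 80% power. The function names and all of these values are illustrative assumptions, not anything taken from Rhine's protocols.

```python
import math
from statistics import NormalDist

def hit_rate(p_chance, esp_fraction):
    """Overall hit rate if ESP gives a sure hit on esp_fraction of guesses
    and the remaining guesses succeed at the chance rate."""
    return esp_fraction + (1 - esp_fraction) * p_chance

def guesses_needed(p_chance, p_true, alpha=0.05, power=0.80):
    """Approximate number of guesses for a one-sided test of H0: p = p_chance
    to detect p_true with the given power (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    sd0 = math.sqrt(p_chance * (1 - p_chance))
    sd1 = math.sqrt(p_true * (1 - p_true))
    n = ((z_alpha * sd0 + z_power * sd1) / (p_true - p_chance)) ** 2
    return math.ceil(n)

if __name__ == "__main__":
    p0 = 0.2                      # assumed chance rate: five-symbol deck
    p1 = hit_rate(p0, 0.01)       # ESP on 1 guess in 100 (assumed)
    print(p1, guesses_needed(p0, p1))
```

          Under these assumptions the required number of guesses runs well into the thousands for a single person, which is the practical content of "well, ha."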

          • Peter Dorman says:

            I had hoped that the expression “one-off” would be clear, but obviously it’s not. Consider the difference between a single ESP episode and “having” ESP. I am not aware of any evidence that ESP is discernible in any sample of events, whether across individuals, trials with the same individual or both. I am saying only that this does not bear on one-off episodes. Evidence for such episodes would have to take the form of well (meaning objectively) documented accounts. Suppose Joe perceives complex event A, and A actually occurs but is distant in time or place. If Joe documents this perception in writing or through a spoken record, if it can be convincingly documented that Joe had no way of knowing of event A through ordinary sensation at the time of his description of it, and if the level of specificity and detail encompassed in A is such that it is highly unlikely that Joe’s account could be produced by random guessing, then we have a documented account of a one-off episode. This does not imply, however, that Joe is capable of generating a second such episode or that there is more than one such Joe in the world.

            What makes one-offs interesting is the truism that anything that has happened — even only once — can happen. In many realms of science the documentation of one-offs plays a role. Think about animal intelligence. Even if no other parrot is as smart as yours, and even if your parrot can solve a particular problem only once, if the solution is complex enough that the parrot is unlikely to stumble on it, and if you document this solution sufficiently, you have a result.

            Maybe switching the subject from ESP, which people have strong feelings about, to something more neutral might make the point easier to see.

            • I’m reminded of Dirk Gently’s Holistic Detective Agency (by humorist Douglas Adams), in which Dirk Gently through the usual methods of studying hard and making educated guesses about what his professor might test on, happens to reproduce the actual final exam to be given by his professor word for word and hand it out to his fellow classmates as a study guide (I’m paraphrasing a dim recollection).

              Which is to say, even bizarre, seemingly impossible coincidences can come about through non-ESP if given enough chances. In principle perhaps once someone levitated a pencil with their mind, and never again, and so… it is in principle possible… or in principle once a parrot sat down at a keyboard and typed out the first paragraph of a Shakespeare play…

              The key to making people believe this really happened though, is to do it in some kind of controlled environment in which we can rule out shenanigans. This either means it’s got to be repeatable, or we need to get extraordinarily lucky that such an event just happens to occur in one of the extremely rare circumstances we call “a controlled environment”. You might argue we should massively fund ESP research so we could ensure that enough “controlled environments” exist that such random events could be “caught” but I don’t think you’re going to get anywhere with me or most other people in terms of “bang for the buck” of dedicating such resources ;-)

              • In the end, if we “act as if ESP is totally impossible” and it turns out only to be extremely improbable and random in such a way that only a few instances will occur in the lifetime of the human race…. we have lost very little by acting like it’s impossible.

                On the other hand, if there are a small number of people who have consistent and repeatable ESP experiences, then we could gain a lot by identifying them (for example, perhaps they could help prevent tragedies or solve crimes or whatever).

                There are enormous numbers of one-off events that occur which prove that such one off events can occur. Some of them might even be kind of extraordinary. For example, I made some damn fine pancakes the other day, also a meteor or comet once impacted where the Chicxulub crater is and caused the extinction of non-avian dinosaurs…

              • Peter Dorman says:

                What would really impress me is if the parrot pecked out an opening soliloquy that *wasn’t* by Shakespeare or anyone else but was just as good. Personally, I wouldn’t toss this off as insignificant, even if subsequent writings by this same parrot were more pedestrian.

                Seriously, I don’t think there’s any point to pursuing this. Lots of science is concerned with answering the question “did X happen in this instance?” It’s not the end of all inquiry but it’s not nothing either.

            • Thomas says:

              Everything here hinges on “the level of specificity and detail encompassed in A [being] such that it is highly unlikely that Joe’s account could be produced by random guessing”. Indeed, not just “highly unlikely”, but impossible. Suppose there’s a one in a million chance of an image of an event appearing in my mind that is indistinguishable from, say, the image of it that I might call to mind if I had been there (last year) say and was remembering it. Now, by hypothesis, I wasn’t there, so I’m not remembering it. And, if one in a million counts as “highly unlikely” it’s highly unlikely that it would occur by chance. And yet, like Peter says, even though it happens only once, this proves it can happen. But does it prove ESP? (ESP meaning some unknown causal mechanism by which the event brought the image to my mind.)

              My answer is no. And there’s an obvious reason for this. In a population of a million it’s actually pretty likely that it’s going to happen to one of us. In fact, I encourage you to do this: imagine a burning house of some specific design with some specific number of windows and doors. Draw a picture of it. Now, if we don’t care when or where it happened, I’m pretty sure if we looked long enough we’d be able to find an “eerily” similar house fire, either sometime in the past (when/where you weren’t there to see it), right at this moment (OMG! spooky!) somewhere in the whole world, or (this is relatively easy) simply by waiting long enough for it to happen somewhere. If that’s the sort of “one-off” you’re looking for, it’s not very impressive, right?
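To put a rough number on “pretty likely”, here is a minimal sketch. It assumes that the one-in-a-million chance applies independently to each of a million people, which is a simplification made here for illustration and not something stated above; the function name is likewise hypothetical.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance that an event with per-person probability p
    happens to at least one of n independent people."""
    return 1.0 - (1.0 - p) ** n


# Assumed inputs: a one-in-a-million event, a population of a million.
p_single = 1e-6
population = 1_000_000

print(f"{prob_at_least_one(p_single, population):.3f}")  # about 0.632
```

Under those assumptions the chance that somebody has the “highly unlikely” experience is close to two in three, so a single documented instance says little by itself.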

              The only way to establish a causal link between the event and your (extra-sensory) perception of it is to demonstrate that there’s a reliable mechanism. As I let (pretty serious) ESP researchers teach me many years ago, it’s fine to think of “reliable” on the order of memory: i.e., it’s imperfect, but demonstrably something we have. If I got only one thing right about what I did last year, I would not be comfortable concluding that I was remembering it. More likely, I’m imagining something that then happens to be true.

              • Peter Dorman says:

                This reply points to the methodological issues in assessing the evidence for one-offs, which is all to the good. However, it is strange to see on this blog a post that asks for evidence that rises to the standard of “proof”. Surely we are all beyond this.

                I agree that the evidentiary weight for one-offs needs to be established according to accepted protocols. Every science that engages in this sort of work (I mentioned animal intelligence and, by implication, molecular biology) has them. Perhaps they can be improved by methodologists taking a close, critical look.

                I had started this tangent by noting that ESP research does not have to proceed only on a population (or repeated trial) basis, where it seems clear that disconfirmation is the prevailing pattern. Well-documented one-offs would be very interesting in this field. Of course, simply “seeing” a generic event, like a house fire, from afar or in advance, no matter how well documented, would not constitute very convincing evidence. But now we are talking about the relative weight of evidence, not whether one-offs can be of scientific interest and importance.

  8. Anonymous says:

    This came up before on this blog, but a big problem with testing unlikely hypotheses like ESP was pointed out by Jaynes just using the good old sum/product rule of probability. If H1=”ESP is real” then data may dramatically increase the odds of H1 being true, however if you did a fuller analysis using a hypothesis like H2=”there’s some error contaminating the data” (innocent or not), then that same data can increase the odds of H2 even more than H1.

    That is to say, given a full Bayesian analysis, the paradoxical effect of finding data supporting H1 can be to create an even stronger belief in H2.

    This is a general phenomenon. Whenever you have two hypotheses H1, H2, both of which have low probability, finding data that supports one can at the same time have a dramatic effect on the other. I believe Jaynes referred to it as “resurrecting a dead hypothesis”.
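The effect can be sketched numerically with Bayes’ rule over three hypotheses. All priors and likelihoods below are made-up illustrative numbers, not values from Jaynes; “chance” stands for “no ESP and clean data”, and the function name is hypothetical.

```python
def posterior(priors: dict, likelihoods: dict) -> dict:
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Illustrative assumptions: ESP (H1) starts far less plausible than
# contaminated data (H2); both explain sensational data equally well,
# while pure chance makes such data astronomically unlikely.
priors = {"chance": 1 - 1e-4 - 1e-12, "esp": 1e-12, "error": 1e-4}
likelihoods = {"chance": 1e-20, "esp": 0.5, "error": 0.5}

post = posterior(priors, likelihoods)
for h in priors:
    print(f"{h:>6}: prior {priors[h]:.3g} -> posterior {post[h]:.3g}")
```

With these numbers the data do raise the probability of ESP by several orders of magnitude, yet almost all of the posterior mass lands on the contamination hypothesis, because it started out so much more plausible.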

    • I was about to add the same comment. Jaynes wrote about this in his book in Chapter 4 on Hypothesis testing.

      “This kind of experiment can never convince me of the reality of … ESP; not because I assert [that ESP is impossible] dogmatically at the start, but because the verifiable facts can be accounted for by many alternative hypotheses (such as unintentional error in the record keeping, tricks by Mrs Stewart, withholding of data, or outright fabrication), every one of which I consider inherently more plausible than ESP, and none of which is ruled out by the information available to me.

      Indeed, the very evidence which the ESP’ers throw at us to convince us, has the opposite effect on our state of belief; issuing reports of sensational data defeats its own purpose. For if [we think the likelihood] for deception is greater than that of ESP [being real], then the more improbable the alleged data are …, the more strongly we are led to believe, not in ESP, but in deception. For this reason, the advocates of ESP (or any other marvel) will never succeed in persuading scientists that their phenomenon is real, until they learn how to eliminate the possibility of deception in the mind of the reader… The reader’s [perceived likelihood] for deception by all mechanisms must be pushed down below that of ESP.

      It is interesting that Laplace perceived this phenomenon long ago. His Essai Philosophique sur les Probabilites (1814, 1819) has a long chapter on the ‘Probabilities of testimonies’, in which he calls attention to ‘the immense weight of testimonies necessary to admit a suspension of natural laws’. He notes that those who make recitals of miracles, ‘decrease rather than augment the belief which they wish to inspire; for then those recitals render very probable the error or the falsehood of their authors. But that which diminishes the belief of educated men often increases that of the uneducated, always avid for the marvelous.’

      …Note that we can recognize the clear truth of this psychological phenomenon without taking any stand about the truth of the miracle; it is possible that the educated people are wrong. For example, in Laplace’s youth educated persons did not believe in meteorites, but dismissed them as ignorant folklore because they are so rarely observed. For one familiar with the laws of mechanics the notion that ‘stones fall from the sky’ seemed preposterous, while those without any conception of mechanical law saw no difficulty in the idea. But the fall at Laigle in 1803, which left fragments studied by Biot and other French scientists, changed the opinions of the educated – including Laplace himself. In this case, the uneducated, avid for the marvelous, happened to be right: c’est la vie.”

      • Anonymous says:

        Thanks for that!

        In essence Jaynes is showing how the sum and product rule mimic the way we do think in this case. It’s reminiscent of Polya’s sections in Mathematics and Plausible Reasoning showing how probability theory mimics our heuristic reasoning.

        Gelman has said he liked that argument from Jaynes (or at least liked the idea of explicitly modeling alternative hypotheses which embody these kinds of errors/mistakes). On the other hand it looks like he’s got an upcoming post which is negative on Polya.

      • Fernando says:

        Joe:

        Thanks for the quote.

        Scientists need to be much more systematic about incorporating the operating characteristics of their study into the inferential process. And in justifying their priors about said operating characteristics with reference to the procedures used in implementing the study, as described in a study protocol. This is why research design is so important, especially for low probability hypotheses like ESP.

        PS Nice website!

      • Bill Jefferys says:

        FWIW, I have used Jaynes’ ESP example in my course, and here (starting with Chart 76) is my exposition of this. Thank you, Joe, for mentioning this.

        http://bayesrules.net/courses/stat330.2012/10.%20Bayesian%20Hypothesis%20Testing.pdf

  9. Martha says:

    The quote from Kahneman and Tversky (“People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions.”) brings to mind a couple of things that are more or less the converse of assuming random implies representative:

    1.Something I noticed this morning looking at a textbook from the new book shelf:

    In talking about model assumptions, the authors did not say that samples should be random, but did say that they should be representative, elaborating with: “The idea is that if a sample is representative of a population, the numeric or mathematical characteristics of that population will be present in the sample. This attribute will ensure that statistical analysis of the sample would yield similar results to a (hypothetical) statistical analysis of the population.”

    2. The Schmider et al paper I “called out” in a comment yesterday on your “Bad Statistics: Ignore or Call Out?” post. The authors did “Monte Carlo” simulations, but tossed out samples that weren’t “representative” (as tested by a hypothesis test) of the distribution from which they were sampled.

  10. jonathan says:

    I totally agree with this post. A result is just a result. It may indicate something deep below but it’s likely a correlation that may not be nearly as important if you frame the matter differently. I have no problem accepting some of the implications of priming studies generally, with not only the caveat that individual ones maybe no but the word “some” and both personal and I believe general confusion about what “some” means. I doubt Kahneman was trying to delete “some” or replace it with “all”.

    As for Turing, I gather he wanted to believe. Wanting to believe is one of the great problems of life and it asks as well, “How can we escape the biases of our day?” That’s hard when figuring out mathematical proofs and is a great limiting factor when talking about this correlation versus that correlation.

  11. Mayo says:

    It might be relevant to consider how in the 80s the likes of Diaconis, IJ Good and Patrick Suppes agreed to take part in a session with me on statistics and ESP: http://errorstatistics.com/2012/09/22/statistics-and-esp-research-diaconis/
    I was even going to edit a book on it until I decided it wasn’t the best way to advance my early career. But I enjoyed this wave of serious attention because of the statistical methodology.

    • Keith O'Rourke says:

      Always liked this suggestion by Persi: “This suggests that magicians and psychologists be regularly used as observers”

      • Bill Jefferys says:

        Indeed, James Randi (who has debunked a lot of alleged strange results) has made this point emphatically. Especially the “magicians” part; I’m not so sure that psychologists are equipped to discover fraud in the way that magicians, by profession, are.

  12. Fernando says:

    Andrew: “when reading old-time literature, that belief in the supernatural was pretty common back then, lots of mention of ghosts etc. “

    I’m afraid ghosts are back in vogue: https://books.google.com/ngrams/graph?content=ghosts&year_start=1800&year_end=2014&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cghosts%3B%2Cc0

    • Daniel Gotthardt says:

      Fernando:

      Ghost stories, fantasy in general are in vogue but they are written and read as imaginary stories, not as something to actually believe in. I wouldn’t take the development of this measure to mean much regarding the development of belief in ghosts.

  13. […] a comment thread on Gelman's blog (complete with a little controversy) I discussed some of the realistic problems with that, which […]

  14. Rahul says:

    Is it a fair characterization of Andrew’s position to state that “Statistical evidence can never ever be overwhelming?”

    If not, then unlike Turing / Kahneman’s blunder what are counter-examples where Andrew thinks it is justifiable for someone to say “Statistical evidence is so overwhelmingly in favor of X that you have no choice but to accept X is true”?

    PS. Usually the interesting cases seem those where people see a certain phenomenon statistically but as yet don’t have a good causal explanation for it.

    • I think statistical evidence can be overwhelming. We still don’t know fundamentally what causes say quantum entanglement (oh sure, we have a mathematical model that can predict it, but we don’t have good ontology, that is, a good understanding of a mechanism) yet we can run quantum entanglement experiments all day and they will in fact work out a shockingly large fraction of the time (note, measurement errors and noise also make it so that individual instances of the experiment aren’t perfect)

      In less physicsy realms: “there is overwhelming evidence that voter preferences vary considerably by income and racial categories”, “there is overwhelming evidence that HIV causes AIDS” (no one has run the controlled experiment where randomized *people* have been purposely infected with HIV, though they have done similar things with SIV in monkeys, and also shown that the viruses are similar)

      In general the less specific or complicated the claim, the easier it is to establish statistically. so “there are some considerable differences in voting patterns” is a lot easier to establish than “poor people are 30-34% more likely to vote for …” or whatever.

      In some sense, as Hume pointed out, ALL scientific evidence is statistical.

      • Rahul says:

        So can you articulate why certain of these cases you consider overwhelming evidence & others not?

        • There is overwhelming evidence for a statistical hypothesis just in case the posterior probability is extremely large, let’s just pick a threshold, call it say greater than 0.9995 or if you like 0.99998 or something.

          so let’s say we have the question of “whether voter preference varies considerably by income and racial categories”. take one of Gelman’s models for voter preference in which he has parameters that describe say “percentage point difference from an overall average” call “not-considerably” any differences that are within say 2 percentage points of the overall average. Calculate the posterior probability in this model that *all* racial and income effects are *simultaneously* within +- 2 percentage points…. you’ll find the probability is likely tiny.

          ok, admittedly, you need to also believe that the model is a good one, and that the data are not terribly contaminated and biased (but the data are probably from a variety of reasonably high quality sources)

          I’m guessing, because I’m not actually that intimately familiar with his voter preference models, but that’s the kind of thing I’d find overwhelming.

          Now, is the evidence overwhelming that for some *specific* subset of race and income the difference is greater than 7 percentage points? probably a lot less overwhelming.
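The calculation described above can be sketched as follows, assuming posterior draws of the effects are already available from some fitted model. The draws here are invented placeholder numbers, not output from any of Gelman’s models, and both function names are hypothetical.

```python
def prob_all_within(draws, tol):
    """Fraction of posterior draws in which every effect
    lies within +/- tol of the overall average."""
    hits = sum(1 for draw in draws if all(abs(e) <= tol for e in draw))
    return hits / len(draws)


def is_overwhelming(prob, threshold=0.9995):
    """Apply the (arbitrary) posterior-probability threshold."""
    return prob > threshold


# Placeholder draws: each row is one posterior draw of three group
# effects, in percentage points relative to the overall average.
draws = [
    [5.1, -3.0, 0.4],
    [4.7, -2.2, 1.1],
    [1.5, -1.0, 0.2],
    [6.0, -4.1, -0.3],
]

p_not_considerable = prob_all_within(draws, tol=2.0)
p_considerable = 1.0 - p_not_considerable
print(p_not_considerable, p_considerable, is_overwhelming(p_considerable))
```

A real analysis would use thousands of draws; four rows are used only so the arithmetic can be checked by hand (one of the four draws has every effect inside the band).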

          • Rahul says:

            So examples with overwhelming evidence are all essentially trivially true?

            • You asked in essence for a definition of overwhelming evidence. I gave you one. Since it’s a definition, it’s trivially true that it is (my) definition.

              On the other hand, the statement “voter preference varies considerably by income and racial categories” isn’t trivially true as in “by definition it’s true”, it’s a statement about the world and it turns out when you look at data, it’s true… But logically, you might well find that it isn’t true. Especially for example if you pick some fringe issue we don’t have much experience with, like say “voter preference on the issue of whether horse flesh should be a commodity that is traded in agricultural markets in CA” (there was a ballot referendum around 1999 or something that had to do with this issue).

              It’s a substantive question, and if you went back and looked at the polls, you might well find that this issue doesn’t have any real variability across race and income categories… or maybe only marginal evidence, say a posterior probability of 0.28 that all categories are within 2% points of the overall mean… or something

      • hjk says:

        “The less specific or complicated the claim, the easier it is to establish statistically”

        So the less falsifiable? Isn’t this essentially Popper’s reason for not judging scientific hypotheses by their probability? Not that I’m completely on board with that particular argument, but still!

        • Rahul says:

          Excellent point.

          Also. Even if falsifiable, specificity ought to matter too. e.g. It’s like that World Cup predictive example. I could predict that a game’s score will fall between 10-0 and 0-10 and statistically make a very repeatably accurate prediction.

          But what good is that?

          • I can’t help but think how this goes back to a lot of what Entsophy used to rail against in terms of how it’s almost always possible to create an objectively true prior provided you don’t take a frequentist “distribution long term histogram” viewpoint.

            World Cup games almost all have less than 10 point margins is an objectively true fact that is established by looking at a large sample of world cup games (hence statistically).

            It’s not terribly specific (since almost all of them also have say less than 3 point margins) but it is true.

        • If you want to know that the speed of light is greater than 15000 m/s this is obviously easier to establish than if you want to know that the speed of light is within 0.002 m/s of 299792458 m/s

          (ignore for the moment that the meter and the second are defined in terms of the speed of light, we can rearrange the question in an equivalent way to ask whether a given distance is within 10^-9 m of 1 m or something like that)

          I don’t think it constitutes a crisis in the logic of science that these two claims are not somehow equally easy to establish statistically.

          • hjk says:

            ‘I don’t think it constitutes a crisis in the logic of science’

            Me either, but I also don’t see probability as the logic of science.

            • Ignoring probability as logic of science, falsifiability as Popper conceived of it (or as I understand it at least) is a binary logical concept. Is it in principle possible to give some data which would make you disbelieve a hypothesis? If not, then the hypothesis is not falsifiable.

              In that sense, a lot of statements which are perfectly falsifiable in principle are not falsifiable by statistics in practice.

              That’s basically Hume’s problem of induction right? Things might always be different in the future. Our data so far may be non-representative… etc.

              The only kinds of in practice falsifiable theories are the ones that say “there are no X such that Y” where we can falsify it by simply observing one X such that Y.

              But those are pretty boring types of theories. They’re either wrong before we start thinking about them, or they’re about stuff that at best rarely happens: “there are no breakfast cereals with blue cheese flavoring, there are no people who voted for Lefty the Clown in the presidential election… whatever.”

              Interesting stuff, like whether animals derive from each other by a process of natural selection, whether energy is a conserved quantity in the universe, or whether a certain medicine is more effective at treating disease Q than another, requires observing data and coming to a conclusion about what is “more likely to be true” (in a non-technical sense). Whether you use formal probability theory or you use some kind of other intuitionist theory, or you use classical statistical ideas, you’re only going to get “overwhelming evidence” not a falsified/non-falsified dichotomy.

              Conservation of energy is an example of a theory that has overwhelming statistical evidence (it’s held in every single case that it’s been carefully tested), but it’s still statistical.

              • Or the speed of light thing… faster than light neutrinos had a hard time against a very highly established theory. Mainly, because people thought it was more likely to be a flaw with a complex experiment than to be a true falsification of the theory that nothing goes faster than light.

                In the end, the flaw with the experiment was chased down and sure enough there it was (actually there were several).

              • hjk says:

                I’m no Popper expert (or disciple, etc) but I wouldn’t say ‘falsifiability as Popper conceived of it (or as I understand it at least) is a binary logical concept’ is true. He said

                ‘Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability; but there are degrees of testability: some theories are more testable, more exposed to refutation, than others; they take, as it were, greater risks.’

                He said much more along these lines, and I think you have an incorrect ‘feel’ for his views in general. He also discussed auxiliary hypotheses (re: neutrinos) in many places. I think Duhemian problems were still an issue for him in the end, though.

                So, whether he arrived at a coherent final formulation of his views I think is debatable, but he was hardly as simple-minded as sometimes portrayed.

              • hjk: thanks for the quote. I guess you know more about Popper than I do (because I know I don’t know very much about him) but also I think I mis-stated my point, which is more closely related to the fact that as soon as you decide to use “statistics” to try to refute theories, you are stuck with only statistical refutations… Whether “probability is the logic of science” or just “We’re using stats to see whether this thing seems weird” (a kind of naive classical statistics).. eventually you have to get down to “this does or doesn’t seem likely” for most theories.

                If Popper doesn’t want to rely on probability theory, he still needs to give an account of what it means to falsify via statistical evidence.

              • hjk says:

                ‘If Popper doesn’t want to rely on probability theory, he still needs to give an account of what it means to falsify via statistical evidence.’

                Well Popper never really did I don’t think, but one direct attempt at this, likely popular around here, would be Mayo’s, no? My understanding is that she uses statistics/probability to characterise a test’s ‘severity’ or ability to falsify a given hypothesis, in contrast to the bayesian’s use of probability as a confirmation measure assigned directly to a hypothesis.

                So Popper/Mayo: is this a good test of my hypothesis, statistically speaking? (Hypotheses e.g. pass/fail good tests are not themselves assigned probabilities, merely labelled as ‘passed/failed a good test’.)

                Bayesian: is this a probable hypothesis?

                Or something like that, right?

    • D.O. says:

      Let me try. There is overwhelming statistical evidence that girls, on average, acquire language earlier than boys. It is a quintessentially statistical phenomenon — not every girl is ahead of every boy (even if only normal development is considered) and overlap is probably substantial. Yet, there is no good theory behind it. Neither is it obvious, in the sense that we would know it without gathering large quantities of data.

  15. EJ Wagenmakers says:

    The publisher of the book wanted me to take down the preprint from my website, and I’m still negotiating to have it back up. In the meantime, those who would like a preprint can drop me an email at EJ.Wagenmakers@gmail.com. Then at some point I guess it may just happen that the preprint finds its way online.

    The paper also quotes Jaynes on ESP.

    Cheers,
    E.J.

  16. […] If you have some questions about statistical evidence in today’s go go world, check out “I Disagree with Alan Turing and Daniel Kahneman Regarding the Strength of Statistical Evidence.” […]

  17. […] Today Dwayne Woods was kind enough to direct me to one of my favourite statisticians — Andrew Gelman — who has a post up touching on the same […]

  18. Roy Abrams says:

    I’m calling Alan on my Ouija board right now to tell on you.

  19. […] Andrew Gelman on the strength of statistical evidence. […]

  20. Martha Smith says:

    I think this discussion is in many ways worthwhile, but seems to be getting bogged down in too much certainty all around. Here are some of my thoughts:

    1. Andrew does often seem to me too brash in the manner of his criticism. (I emphasize the manner, not the substance — and that this is my subjective impression).

    2. Erin’s fly analogy at first seemed a good one to me, but the more I think about it, the more I believe that it (like many analogies when examined closely) doesn’t fit very well with the situation it was prompted by (in this case, critiquing statistical methods used in research papers). Some details:
    If I were the person with the open fly (or my slip was showing, to give a situation I have more often encountered, at least when I was young), I most definitely would have preferred an observer to choose option 1. But then I started thinking about what would be the analogue of option 1 in critiquing a research paper. If you know one of the authors and can talk with them in person, then Erin’s option 1 fits well. But if you don’t (as is often the case), the current best analogue seems to be email.
    I have tried the email version of option 1 a number of times in the past few years. The typical response was no response. Occasionally I got a “thank you for taking an interest in my work” response, with no comment on the substance of my critique (once I got something like, “I don’t really think your comments are of great concern for my work.”) Only once did my email lead to a substantive discussion, with the recipient expressing some concern – but that recipient was a mathematician (as I am) with a master’s degree in statistics, and the article critiqued was in a (non-research) mathematics journal.
    When I heard about the special replications issue of Social Psychology and found some articles on a topic I had some familiarity with, I looked at them – and found that they had some of the problematical statistical issues that I often caution students about. I decided that since the topic of questionable statistical practices was more in the air now, and in light of my previous lack of response to emails to authors, it might be time to try something that seems somewhere between Erin’s options 1 and 2: Posting comments on the papers on my own blog, emphatically pointing out in the blog that my comments are not intended and should not be construed as singling out the authors mentioned for criticism (see http://www.ma.utexas.edu/blogs/mks/2014/06/22/beyond-the-buzz-on-replications-part-i-overview-of-additional-issues-choice-of-measure-the-game-of-telephone-and-twwadi/), and trying to avoid any hint of blaming. I also emailed one author of each paper mentioned in the posts, as a heads up, and repeating my caveat. (I did not receive replies to any of the emails.)
    I invite anyone reading this to read my blog posts (the one linked is the first of eight; I’ve been meaning to post a ninth on exploratory data analysis as well, but haven’t gotten around to it), and to give me feedback on whether or not the blogs are adequately respectful and if there is anything I could have done better to lessen the (understandable but inevitable) embarrassment my posts might have caused the authors involved.

    3. I’ve also looked at some of the psychology blogs regarding statistical practice. Some of them have a tone that to me seems unnecessarily blaming, some seem to have “rationalizations” for poor practices, and some seem to have a degree of concern for science balanced with courtesy and civility that I appreciate. One that has made a particularly good impression is David Funder’s Funderstorms (http://funderstorms.wordpress.com/). One of his posts had a link to a paper he co-authored making recommendations to improve research practice in his field. I sent him a commentary on it, with permission to share with co-authors or on his blog. He suggested that I ask to post it on the SPSP blog – which I was able to do (with an introduction requested by the blog keeper). If you care to read or critique it (either by comment on the blog or by email to me), see http://www.spspblog.org/comments-on-funder-et-al-improving-the-dependability-of-research-in-personality-and-social-psychology-recommendations-for-research-and-educational-practice/

    4. This discussion is related to Andrew’s comment about embracing variation at http://andrewgelman.com/2014/09/08/talk-simons-foundation-wed-5pm/. Human nature is one of those things that involves variation. In particular, what shames or embarrasses varies from one person to another (and from one subgroup to another). So some disagreement about what is and is not appropriate is inevitable; we need to be careful not to assume that others will react to something the same way we do. One person’s tact or courtesy may even be another’s rudeness.

    • Andrew says:

      Martha:

      Yes, I agree. I have my own way of communicating, and each person has to find his or her own style. I’ve tried many times to contact researchers directly, often with little success, but others may succeed where I’ve failed. It’s good that there’s variation. Indeed, this gets back to the problems with the Turing and Kahneman quotes: Turing didn’t just say “I believe in ESP” and Kahneman didn’t just say “I believe in priming”; rather, each of them insisted that the rest of us had to believe also. As Wagenmakers et al. wrote, “disbelief does in fact remain an option,” and I think Turing and Kahneman were in error in suggesting otherwise.

  22. TF says:

    I find it absurd that people need to defend Turing and Kahneman to the point where it starts to feel hostile. They are HUMAN!! The vitriol is starting to sound like Turing and Kahneman aren’t ALLOWED to make mistakes! I’m also sure they can take a little criticism.
