Confirmationist and falsificationist paradigms of science

Deborah Mayo and I had a recent blog discussion that I think might be of general interest so I’m reproducing some of it here.

The general issue is how we think about research hypotheses and statistical evidence. Following Popper etc., I see two basic paradigms:

Confirmationist: You gather data and look for evidence in support of your research hypothesis. This could be done in various ways, but one standard approach is via statistical significance testing: the goal is to reject a null hypothesis, and then this rejection will supply evidence in favor of your preferred research hypothesis.

Falsificationist: You use your research hypothesis to make specific (probabilistic) predictions and then gather data and perform analyses with the goal of rejecting your hypothesis.

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

How do these two forms of reasoning differ? In confirmationist reasoning, the research hypothesis of interest does not need to be stated with any precision. It is the null hypothesis that needs to be specified, because that is what is being rejected. In falsificationist reasoning, there is no null hypothesis, but the research hypothesis must be precise.
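To make the contrast concrete, here is a minimal sketch, assuming a simple z-test on a mean with known standard deviation. All the numbers are hypothetical (an observed effect of 0.1, a sharpened research hypothesis of 0.5, the sample size), and the helper `two_sided_p` is an illustrative name, not anything used in the studies discussed here.

```python
import math


def two_sided_p(estimate, hypothesized, sd, n):
    """Two-sided z-test p-value for a mean, treating sd as known."""
    z = (estimate - hypothesized) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers, for illustration only.
observed_effect = 0.1  # sample mean of the outcome
sd, n = 1.0, 2500

# Confirmationist test: try to reject the null hypothesis B (effect = 0).
p_null = two_sided_p(observed_effect, 0.0, sd, n)

# Falsificationist test: research hypothesis A, sharpened to "effect = 0.5",
# is itself the hypothesis put to the test.
p_research = two_sided_p(observed_effect, 0.5, sd, n)

print(f"p-value against B (effect = 0.0): {p_null:.2g}")
print(f"p-value against A (effect = 0.5): {p_research:.2g}")
```

With these made-up numbers both p-values fall below 0.05: the data reject B, but they also reject the precise version of A. Rejecting B tells us only that the effect is probably not exactly zero. It says nothing about whether A's actual prediction holds up.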

It is tempting to frame falsificationists as the Popperian good guys who are willing to test their own models and confirmationists as the bad guys (or, at best, as the naifs) who try to do research in an indirect way by shooting down straw-man null hypotheses.

And indeed I do see the confirmationist approach as having serious problems, most notably in the leap from “B is rejected” to “A is supported,” and also in various practical ways because the evidence against B isn’t always as clear as outside observers might think.

But it’s probably most accurate to say that each of us is sometimes a confirmationist and sometimes a falsificationist. In our research we bounce between confirmation and falsification.

Suppose you start with a vague research hypothesis (for example, that being exposed to TV political debates makes people more concerned about political polarization). This hypothesis can’t yet be falsified as it does not make precise predictions. But it seems natural to seek to confirm the hypothesis by gathering data to rule out various alternatives. At some point, though, if we really start to like this hypothesis, it makes sense to fill it out a bit, enough so that it can be tested.

In other settings it can make sense to check a model right away. In psychometrics, for example, or in various analyses of survey data, we start right away with regression-type models that make very specific predictions. If you start with a full probability model of your data and underlying phenomenon, it makes sense to try right away to falsify (and thus, improve) it.

Dominance of the falsificationist rhetoric

That said, Popper’s ideas are pretty dominant in how we think about scientific (and statistical) evidence. And it’s my impression that null hypothesis significance testing is generally understood as being part of a Popperian, falsificationist approach to science.

So I think it’s worth emphasizing that when a researcher tests a null hypothesis that he or she does not believe, in order to supply evidence in favor of a preferred hypothesis, this is confirmationist reasoning. It may well be good science (depending on the context) but it’s not falsificationist.

The “I’ve got statistical significance and I’m outta here” attitude

This discussion arose when Mayo wrote of a controversial recent study, “By the way, since Schnall’s research was testing ’embodied cognition’ why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?”

This comment was interesting to me because it points to a big problem with a lot of social and behavioral science research, which is a vagueness of research hypotheses and an attitude that anything that rejects the null hypothesis is evidence in favor of the researcher’s preferred theory.

Just to clarify, I’m not saying that this is a particular problem with classical statistical methods; the same problem would occur if, for example, researchers were to declare victory when a 95% posterior interval excludes zero. The problem that I see here, and that I’ve seen in other cases too, is that there is little or no concern with issues of measurement. Scientific measurement can be analogized to links on a chain, and each link—each place where there is a gap between the object of study and what is actually being measured—is cause for concern.

All of this is a line of reasoning that is crucial to science but is often ignored (in my own field of political science as well, where we often just accept survey responses as data without thinking about what they correspond to in the real world). One area where measurement is taken very seriously is psychometrics, but it seems that the social psychologists don’t think so much about reliability and validity. One reason, perhaps, is that psychometrics is about quantitative measurement, whereas questions in social psychology are often framed in a binary way (Is the effect there or not?). And once you frame your question in a binary way, there’s a temptation for a researcher, once he or she has found a statistically significant comparison, to just declare victory and go home.

The measurements in social psychology are often quantitative; what I’m talking about here is that the research hypotheses are framed in a binary way (really, a unary way in that the researchers just about always seem to think their hypotheses are actually true). This motivates the “I’ve got statistical significance and I’m outta here” attitude. And, if you’ve got statistical significance already and that’s your goal, then who cares about reliability and validity, right? At least, that’s the attitude, that once you have significance (and publication), it doesn’t really matter exactly what you’re measuring, because you’ve proved your theory.

I am not intending to be cynical or to imply that I think these researchers are trying to do bad science. I just think that the combination of binary or unary hypotheses along with a data-based decision rule leads to serious problems.

The issue is that research projects are framed as quests for confirmation of a theory. And once confirmation (in whatever form) is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

To this, Mayo wrote:

I agreed that “the measurements used in the paper in question were not” obviously adequately probing the substantive hypothesis. I don’t know that the projects are framed as quests “for confirmation of a theory”, rather than quests for evidence of a statistical effect (in the midst of the statistical falsification argument at the bottom of this comment). Getting evidence of a genuine, repeatable effect is at most a necessary but not a sufficient condition for evidence of a substantive theory that might be thought to (statistically) entail the effect (e.g., a cleanliness prime causes less judgmental assessments of immoral behavior, or something like that). I’m not sure that they think about general theories–maybe “embodied cognition” could count as general theory here. Of course the distinction between statistical and substantive inference is well known. I noted, too, that the so-called NHST is purported to allow such fallacious moves from statistical to substantive and, as such, is a fallacious animal not permissible by Fisherian or NP tests.

I agree that issues about the validity and relevance of measurements are given short shrift and that the emphasis–even in the critical replication program–is on (what I called) the “pure” statistical question (of getting the statistical effect).

I’m not sure I’m getting to your concern Andrew, but I think that they see themselves as following a falsificationist pattern of reasoning (rather than a confirmationist one). They assume it goes something like this:

If the theory T (clean prime causes less judgmental toward immoral actions) were false, then they wouldn’t get statistically significant results in these experiments, so getting stat sig results is evidence for T.

This is fallacious when the conditional fails.

And I replied that I think these researchers are following a confirmationist rather than falsificationist approach. Why do I say this? Because when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified. Meanwhile, when they falsify things themselves, they falsify the so-called straw-man null hypotheses that they don’t believe.

The pattern is as follows: Researcher has hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A. I don’t see this as falsificationist reasoning, because the researchers’ actual hypothesis (that is, hypothesis A) is never put to the test. It is only B that is put to the test. To me, testing B in order to provide evidence in favor of A is confirmationist reasoning.

Again, I don’t see this as having anything to do with Bayes vs non-Bayes, and all the same behavior could happen if every p-value were replaced by a confidence interval.
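A quick sketch of why swapping p-values for intervals changes nothing about the decision rule, assuming a normal approximation and hypothetical estimates and standard errors (the function names below are illustrative only):

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value(estimate, se):
    """Two-sided p-value for the null hypothesis that the true effect is zero."""
    z = abs(estimate / se)
    return 2 * (1 - STD_NORMAL.cdf(z))


def interval_95(estimate, se):
    """95% confidence interval under the same normal approximation."""
    half_width = STD_NORMAL.inv_cdf(0.975) * se
    return estimate - half_width, estimate + half_width


def excludes_zero(interval):
    """True when the whole interval lies on one side of zero."""
    lo, hi = interval
    return lo > 0 or hi < 0


# Hypothetical (estimate, standard error) pairs.
for estimate, se in [(0.30, 0.10), (0.15, 0.10), (-0.25, 0.10), (0.05, 0.10)]:
    significant = p_value(estimate, se) < 0.05
    print(estimate, se, significant, excludes_zero(interval_95(estimate, se)))
```

In each case "p < 0.05" and "the 95% interval excludes zero" give the same answer, so a researcher who declares victory on either criterion is following the same confirmationist pattern.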

I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

That might be ok. Maybe a confirmationist approach is fine; I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

Summary for the tl;dr crowd

In our paper, Shalizi and I argued that Bayesian inference does not have to be performed in an inductivist mode, despite a widely-held belief to the contrary. Here I’m arguing that classical significance testing is not necessarily falsificationist, despite a widely-held belief to the contrary.

60 Comments

  1. Dan Wright says:

    Another issue, discussed in Meehl (1967), is that as measurements improve (e.g., with bigger samples, better instruments), it becomes easier for the confirmationist to reject the null hypothesis he or she wants to reject, but harder for the falsificationist’s hypothesis to survive testing.

    Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
    http://www.tc.umn.edu/~pemeehl/074TheoryTestingParadox.pdf

    • mayo says:

      But this alludes to an utterly illicit practice of taking a statistically significant effect as evidence for a substantive theory. It is a canonical error, and if so-called NHST condones it, then it is a completely invalid animal that was condemned by Fisher, N-P, and lampooned for decades. I do fault Meehl, who was a friend, for suggesting that exaggerated abuses of tests were permissible (by Fisher). Meehl, you know, was a Popperian—but also a Freudian. It rankled him to no end that the pitiful p-value exercises he saw many carrying out would be deemed “scientific” whereas Popper condemned Freudian psychology to the not-yet-scientific (even though he thought it could become scientific). Thanks to Meehl, even Lakatos can be found denouncing significance tests in social science as “pseudointellectual nonsense” (or a term like that). On the other hand, N-P tests were regarded (by Lakatos and Popper) as exemplifications of Popperian “methodological falsificationism”.

      • Andrew says:

        Mayo:

        I think that by “so-called NHST” you’re referring to what the rest of us call “NHST.” And it is a canonical error, an error that unfortunately continues to be expressed despite the condemnation of Fisher, N-P, Meehl, and others. The continuation of this error has motivated Simonsohn, Loken, me, and many others to keep writing about it. One problem, I think, is that practitioners of this “invalid animal” (as you call it, and I agree) seem to think that they are applying falsificationist logic when they’re actually applying confirmationist logic. I’m sure I’ll keep blogging on this topic. The point is that, yes, NHST is fundamentally flawed, but it’s not enough to point this out. I think it’s necessary to recognize the real goals that researchers have and to understand their motivations in using NHST, in order for us to move forward. And I think that’s part of your research project too.

        Unfortunately, one of the selling points of NHST is that it’s easy to get statistical significance (and, thus, a higher chance of publication and a career boost) from noisy data. I’m not saying this is the only selling point of NHST, nor am I saying that researchers are choosing this approach because they want to cheat. Rather, the ease of rejecting null hypotheses from noisy data is a side-effect of how NHST works when applied in typical research settings, and this side-effect, whether or not it is intended, provides pleasant results for researchers. This can make it a hard sell if we want to replace NHST by methods of inference that are less likely to result in strong claims that can be promoted in a publication.

        • Keith O'Rourke says:

          >This can make it a hard sell if we want to replace NHST by methods of inference that are less likely to result in strong claims that can be promoted in a publication.

          Exactly or as Pogo used to say – the insurmountable opportunity.

        • Not to even mention the “you can’t get a paper into Nature unless you publish a proper p value” type editorial policies created by statistically naive editors at top journals. This makes it hard for researchers to publish high quality work when they actually know what they’re doing and purposefully choose to use something like a Bayesian method or even a frequentist method which focuses on estimation and confidence bounds instead of p values. I’m especially pointing at Biology because that’s the area where I have first or second hand experience of this kind of thing going on.

        • mayo says:

          Andrew: I think the term was invented in psychology and alludes to no real methodology. I agree it alludes to a fallacy of (probabilistic) affirming the consequent–one that is embraced by those who view evidence as a matter of increasing probabilistic confirmation. But I think there is a danger in confusing significance tests–be they “pure significance tests” (as David Cox calls them), or N-P tests–entities with their own confusions and misuses, with the affirming the consequent move from an observed effect to a substantive theory T (where T might be thought to render that effect more probable). By jumbling these together, much of the discussion has degenerated. The error control that is vouchsafed in proper significance testing and cognate methods is absent in the illicit form of these methods. It would be like criticizing a method, but not referring to that method at all. Worse, many of the critics of significance tests and related methods say that what we really want are ways to boost the confirmation of theory T (whether absolute or comparative). So they go right back to recommending confirmation boosts instead of NHST, which boils down to recommending a version of NHST instead of a methodology that would have precluded the unreliable move to T (whose errors have been poorly probed by merely finding effect x).

          Meehl was wrong about a number of things in the midst of his criticisms of tests in general. Like many people, he claimed that if you are going to allow an observed effect x to be evidence for theory T, then you must take the absence of the effect x as grounds to deny T.

          • mayo says:

            The last para on Meehl wasn’t to be included. It’s correct but goes onto a different topic that I decided not to take up here. It’s quite important, but I don’t want to mix issues.

          • Andrew says:

            Mayo:

            As you know, I’m a big fan of model checks, of comparing data to predictions from the model. I’m a big fan of using a model to make strong claims, then checking those claims with data. I think this is particularly important to do with models that I like.

            But I’m not a big fan of NHST, which is the procedure of trying to reject straw-man null hypothesis B in order to make a claim that a preferred hypothesis A is correct. NHST may very well refer to “no real methodology” (as you put it), but it’s a non-real methodology that is used a lot in psychology and elsewhere. Daryl Bem, Jessica Tracy, and Alec Beall are not the only scientists to take p<0.05 as strong evidence that their preferred model is true. This attitude appears all over the place. Again, I like model checking, indeed I've gone to a lot of effort to emphasize the continuity between classical and Bayesian goodness-of-fit tests. Posterior predictive checks, for example, can be viewed as a generalization of classical tests where there is uncertainty about the parameters and no easy analytical solution. But I don't like NHST.

            • mayo says:

              If you are criticizing an NHST from a stat sig effect to a substantive claim T, you should add, as did Meehl: “For the corroboration to be strong, we have to have ‘Popperian risk’ (Popper, 1959/1977, 1962), ‘severe test’ (Mayo, 1991, 1996).” (Meehl and Waller 2002)
              So you’d have to be critical of all statistical affirming the consequent where even though T “fits” or “explains” or “confirms” or is given a B-boost by, the observed effect, Prob (so good a fit, even if not T) = not low.

              That would NOT be to criticize statistical significance tests—certainly not any error statistical test–which must control two error probabilities, erroneously finding evidence against and erroneously failing to. Strictly speaking, N-P statistical tests only concern statistical hypotheses but we can extend the reasoning to any level. Your example concerns the former (essentially a type 1 error). If one moves from a statistically significant effect to a substantive claim T—where that claim has not had its errors probed, and so T has not been corroborated with severity—then you fail to control that error. That would not condemn tests where that error was controlled. Thus to criticize all statistical significance tests would be to preclude such inferences even where they are warranted, and in fact it would preclude empirical falsification in science. Consider a case where we’d allow it to be warranted:

              1. If this ebola drug didn’t work, then they wouldn’t be able to show such and such improved survival.
              2. They show improved survival (statistically).
              This is evidence the drug works.

              The bottom line is: we can regard criticisms of something called NHST as relevant only insofar as all cases of insevere corroboration are condemned.

              • Andrew says:

                Mayo:

                I don’t know why you continue to write things like “something called NHST” or “so-called NHST” or “no real methodology.” What you call “something called NHST” is what the rest of us call “NHST.” You might as well refer to “something called evolution” or “something called the Eiffel Tower.” “NHST” is a phrase that refers to a statistical method that is prevalent in psychology research and elsewhere. The method goes like this: a researcher has a substantive theory X and, from the data, he or she tests a null hypothesis Y that he or she does not believe, then gets p<0.05 and declares rejection of Y and then claims X is correct. In recent years, NHST has been used to demonstrate all sorts of things including the existence of ESP and huge effects of ovulation on vote preferences. From a statistical perspective, NHST is flawed for several well-known reasons that I have discussed many times on this blog and in published and unpublished papers. As we have discussed from time to time, the problem is not with p-values---the same errors will arise if instead people use confidence intervals---but rather with the NHST framework. Also as we've discussed, NHST has various properties that make it appealing to practitioners, as well as a superficial (but, as I've argued in the above post, false) connection to Popperian falsification. So for all these reasons, I think NHST is worth discussing. As I've also written many times, I don't think the framework of Type 1 and Type 2 errors is helpful in most of the applications I've seen.

              • question says:

                Mayo,
                Good example with the Ebola drug. Let’s take the recent ZMapp study (Qiu et al. 2014). The increased survival was because they euthanized the control animals for having high ‘clinical scores’ (bad symptoms). They do not tell us how this clinical score was calculated, but do mention that the study was not blinded.

                So we have a situation of unblinded researchers deciding how long monkeys survive according to unknown criteria then claiming increased survival in the treatment group….the stats do not even enter into the decision making process.

              • Rahul says:

                I think the flaws & blind spots of NHST are well recognized. But practitioners aren’t stupid. Mostly.

                What’s not available or practical in many use-cases is an improved alternative to NHST.

              • question says:

                “An earlier $58 million request for the Centers for Disease Control would help the agency ramp up production and testing of the experimental drug ZMapp, which has shown promise in fighting the Ebola epidemic in western Africa.”
                http://www.washingtontimes.com/news/2014/sep/5/white-house-asks-for-58m-for-ebola-drugs/

                Really, this is the evidence:

                Reversion of advanced Ebola virus disease in nonhuman primates with ZMapp. Nature (2014). doi:10.1038/nature13777
                http://go.nature.com/oY8pGI

                Why did editors/reviewers fail to make them include a description of the “clinical score” protocol? Why would you spend so much money on a project then fail to blind the people measuring your primary outcome?

                This is grade-school stuff.

              • Andrew says:

                Rahul:

                You write, “I think the flaws & blind spots of NHST are well recognized. But practitioners aren’t stupid. Mostly.” It’s not about stupidity. Statistics is hard! Recall our discussion the other day. Even brilliant scientists such as Turing and Kahneman can get snowed by what seems to them as overwhelming statistical evidence.

                I think it’s worth returning to these topics over and over again because they are difficult enough that non-stupid people get them wrong, over and over. I don’t delude myself that I can change everyone by one blog post or even by 100 posts and 10 journal articles. But I think these issues are worth thinking about, and I think that we make progress by elaborating and discussing them.

                And you write, “What’s not available or practical in many use-cases is an improved alternative to NHST.” Here, I think we have to return to the question of incentives. An alternative statistical method such as multilevel modeling can be an improvement in some ways (for example, giving more accurate and reproducible estimates of effects) but could be considered as negative in other ways (with multilevel modeling, it’s harder to get statistically significant p-values (see my 2000 paper with Tuerlinckx), hence harder to get publication). So I agree with those people who say that, along with working on statistical methods, we have to work to change some of the perverse incentives of the system.

              • Anonymous says:

                “The bottom line is: we can regard criticisms of something called NHST as relevant only insofar as all cases of insevere corroboration are condemned.”

                To the extent “Severity” has been made concrete, it’s only been tested on 200 year old problems where it’s numerically and functionally the same as using the Bayesian posterior. In other words, it hasn’t been tested.

                To the extent “Severity” is a meta principle, it’s infinitely malleable and can always be fudged on an ad-hoc basis to save face. This leads to a kind of Statistical Zeno’s paradox. As each new problem is found and fixed the methods inch ever closer to the Bayesian answer (just like ‘severity’ brings p-value methods closer to posteriors), without anyone having to admit that’s where they’re headed. At least when Abraham Wald encountered this phenomenon, he had the mathematical skill to see where perfection led and the integrity to name them “Bayes strategies”.

                Oh the delicious irony of Popperites (Popperazzi?) using Popper’s words to swear allegiance to untested theories and unfalsifiable ideologies.

                Of course if you don’t know enough math to verify these claims you can deny them indefinitely. That’s the chief perk of being a Philosopher I suppose. The statistical community will eventually discover the truth though if they take “Severity” seriously enough. The truth always wins with this sort of thing and it won’t make a spit of difference what the highly credentialed super geniuses on this blog have to say – just like the world learned the truth about Classical Statistical methods no matter how thoroughly the super stars of Frequentism indoctrinated each new generation and enforced their ideological prescriptions.

                So enjoy your adulation while it lasts Mayo, and pray those statisticians who’ve praised your work continue to spend their time complimenting it rather than using it.

                Oh and to the extent there’s a kernel of truth in “severity”, Jaynes of course did it 100 times better, with mathematical details, in yet another of his articles pregnant with useful ideas that any statistician worth their salt could turn into half a dozen profitable research programs: http://bayes.wustl.edu/etj/articles/what.question.pdf

            • Andrew says:

              Anonymous:

              Please be polite. Regarding your comment: I don’t think there are a lot of perks to being a philosopher; I think philosophers such as Mayo are doing their best to formalize what scientists do. Jaynes is great but there are lots of ways of doing statistics and I value what Mayo does even if I don’t agree with everything she writes. For that matter, I got a lot out of reading Popper and Lakatos, even though neither of them offered any methods that I could use.

      • statteacher says:

        “utterly illicit practice of taking a statistically significant effect as evidence for a substantive theory. It is a canonical error …. lampooned for decades”.

        Absolutely true. I drill into my stats students that a statistically significant effect is not evidence for a theory no matter how many frequentist guarantees are proffered unless the calculation is supplemented with a fuzzy, ill-defined, researcher-dependent ad-hoc “interpretation” of the results. I warn my students against using priors for this because probability theory can be derived from consistency requirements, and these additional constraints/requirements can greatly hamper the interpretive phase of the analysis.

        Not all stat teachers are guilty of teaching the objectivity theory of statistics wrong. Some of us get it right.

  2. hjk says:

    Ha, nice. Should have saved my comments from the last thread for this one. Do you assign probabilities to hypotheses or methods or neither? Posterior predictive checks and all that I suppose.

  3. Mike F says:

    Thank you for articulating clearly the phenomenon I’ve been calling “pseudo hypothesis testing” with a sneer. I never tried to put my finger on what made so many papers “pseudo”: tests, p-values, and “is/isn’t significant”s abound but no one is testing their hypotheses.

    Related I think is “interesting data comparison was insignificant according to us (not shown)” which both doesn’t quite make sense and excludes the data from the scientific record.

  4. jonathan says:

    Years ago, I was recruited for a study about diet. I decided not to participate because, all things described, I didn’t see the connection between what they were measuring and what they wanted to measure. That is, they wanted to change my diet with x but they weren’t measuring the effect of x itself – or if there was one – but whether x being changed would have a larger effect. When I learned the size of the study – pretty small – and the duration – pretty small – I said no; if they found a connection, I wouldn’t trust it so I was out. That’s a garden variety example.

    Another example refers to your discussion of “hot hand” studies. I threw in that maybe the question should be phrased in larger terms, in terms of the game itself, and used as an example research into “flow state” – whatever the heck that is – which finds significantly greater levels of reported involvement. In other words, I have trouble understanding the concept of “hot hand” redacted from the game in which it occurs. Modern basketball stats hint at how well a player does overall in his time on the court and can be pulled apart to find stretches where they’re more productive than others. That would be a “hot hand” in game terms, something which makes sense to me on many levels rather than what seems a relatively blunt and dumb idea that shooting by individual players can be isolated as “hot” or “cold” over fairly short game stretches. Modern basketball stats try to say what makes a better ballplayer, which is obviously meaningful to test. (Famous example Shane Battier’s effect.) My point about that kind of example is: think how much ink has been spilled talking about something, the individual “hot hand”, when the question is kind of silly (and seems at times to reveal an attraction to claimed mystical properties).

    To end with a question: much of scientific history occurs by accident. X-rays had a lot of oops in them. Penicillin is a story of oops almost lost. I suppose there’s survivorship bias at work: how many papers have been published over the many decades that have no meaning? Not no citation in decades but which literally have no value as work today except as artifacts? Before journals, when much was circulated, how much? Lots, I’d guess because if you list out all the achievements in any field, the number is not the continuum and yet there’s been a lot of activity over time.

  5. Martha Smith says:

    The confirmationist vs falsificationist dichotomy does make sense to me as descriptive of what is often done — but it also seems restrictive of describing what is sometimes (and might more often be) done.

    This view is undoubtedly partly (but not entirely*) influenced by the fact that I am about a third of the way through Fred Bookstein’s book Measuring and Reasoning: Numerical Inference in the Sciences (Cambridge U. Press, 2014). Bookstein focuses on three “forms of argument” in achieving scientific understanding: quantitative consilience, abduction, and strong inference.

    He gives the following definitions of these terms (p 5):

    Consilience is “the matching of evidence from data at different levels of observation or using essentially different kinds of measurement.”

    Abduction is “the sequence of finding something surprising in your data and then positing a hypothesis whose specific task is to remove that surprise.”

    Strong inference is “the wielding of a single data set to support one hypothesis at the same time it rejects several others that had previously been considered just as plausible.”

    He draws from his own experience as a morphometrician (he has a joint Ph.D. in zoology and mathematics), and acknowledges the influences of many others (Boulding, Peirce, E. B. Wilson, Karl Pearson, Jaynes, …). It’s an interesting and thought-provoking read.

    *Other influences involve listening to and talking with scientists in various disciplines, especially biology.

  6. Dan Riley says:

    In high-energy physics, we have “discovery” searches and “precision” measurements. Discovery searches look for something inconsistent with the known processes (the null hypothesis). Precision measurements are meant to confirm or refute a specific hypothesis. Replications of new discoveries often fall somewhere in between.

  7. Daniel Gotthardt says:

    I’ve always wondered about this. I’ve been a “critical rationalist” since long before I started to delve more seriously into (social science) statistics. Even though I’ve been much more influenced by Albert than by Popper, this difference is not relevant here. Although Popper and some of his ideas have been mentioned approvingly from time to time in my studies, nothing – or nearly nothing – anyone did seemed falsificationist at all to me. Null hypothesis testing especially never seemed like the “critical tests” Popper proposed. Still, classical statistics always seemed to be framed in some kind of pseudo-falsificationism, while qualitative and Bayesian ideas often seemed to be inductivist, which made them somewhat suspicious to me. You’ll always be able to find evidence in favor of your favorite hypotheses, and at least for a long time I agreed with Popper’s reasoning that it’s impossible to calculate the probability of a hypothesis being true or not. That’s why I’m always happy to see that Andrew seeks ways to incorporate falsificationist thinking into his statistical work, and I think we actually need to think more and not less about these issues. I think at least some of the problems with how p-values and the like are used stem from researchers following a problematic “confirmationist” way of thinking. Disliking replication and criticism also has a lot to do with it.

    On the one hand, Popper’s approach is not completely applicable. In the Social Sciences we just cannot have two competing hypotheses predicting two precise, different outcomes and use an empirical test to differentiate between them. On the other hand, I do not think that just because Popper himself rejected probabilistic reasoning, we should not even try to use “falsificationist” approaches for statistical research. That’s why I really like Andrew’s paper with Shalizi, even though I’m not sure if I completely agree. The problem, though, is that falsificationism of any kind can only work in a culture where falsificationism is at least a widespread if not dominant approach, and I just don’t see that in the Social Sciences at all. In the end that’s probably more important than the specific formal approach someone is using.

    • Keith O'Rourke says:

      I would put this more strongly as folks getting to the point where failed replication is more pleasing than successful replication, and criticism that hasn’t been partially anticipated or easily dealt with is more pleasing to get than the opposite.

      C.S. Peirce seemed to indicate such a stance, in particular in his quote that the best compliment he ever received from a critic was “the author does not seem fully convinced of his own arguments.”

      I think I have worked with a couple of clinical researchers who were close: when one pointed out the flaw in the other’s approach, there was often a detectable grin (on the one with the fault). But if one can do this and is talented and resourceful, one can _expect_ to exceed other researchers who can’t. (By resourceful, I primarily mean not having to worry about sources of funding.)

      _True_ Falsificationists actively seek out (productive) non-replication and (almost) insurmountable criticism.

      • Daniel Gotthardt says:

        Keith:

        That Peirce quote is quite awesome; do you happen to have any kind of source for it? Of course, what you describe would be even more preferable, but I don’t think it’s necessary for a scientist to be *happy* for his own work and theories to be falsified. I think it’s okay to have some kind of division of labor insofar as you usually (try to) falsify other people’s work and not your own. What we really need is acceptance and openness towards falsifications and criticism on the one hand and empirical (statistical) research that in itself is more akin to falsificationism on the other hand.

  8. mayo says:

    If you read my final response to Gelman, though, I think you’ll see the situation is much more complex than this falsification vs confirmation dichotomy makes it look.

    http://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/comment-page-1/#comment-85400

    • Kyle C says:

      This response? Could you expand? —

      AG wrote: “I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified….”

      Mayo wrote: “Andrew: Now I see what you’re alluding to, not a falsificationist logic* but a stringent, self-critical, testing account….
      *Your criticism is essentially pointing up the unsoundness of the falsification argument I laid out for them [psych researchers], in one of my comments above.”

      Why do you say that what AG calls “falsificationism” doesn’t follow “a falsificationist logic”? Is this a definitional disagreement?

      • Mayo says:

        This is more complicated than a mere blog comment can do justice to, but the thing is, setting out to falsify, formulating your “test” as a modus tollens, need not warrant the denial of the antecedent (in the case of a failed consequent) in the least. In the case at hand, turning the example (under criticism) into modus tollens doesn’t turn it into a critical affair or spare it from being questionable science. What makes a methodology pseudoscientific isn’t that it refuses to falsify so much as being unable to reliably pinpoint the blame of any apparent anomalies. Popper’s philosophy denied we could solve such “Duhemian problems” reliably (even though, personally, he thought we must manage to). At most, for Popper, you can infer something is wrong somewhere. That’s where Popper’s methodology falls apart (and mine, I hope, goes beyond his). This is on p. 1 of Error and the Growth of Experimental Knowledge (ch. 1, “Learning from Error”). To claim one is being stringent or self-critical simply because one is going to try to find flaws in a model or hypothesis is empty.

        • question says:

          “What makes a methodology pseudoscientific isn’t that it refuses to falsify so much as being unable to reliably pinpoint the blame of any apparent anomalies.”

          I think this is an apt description. The question is how to “pinpoint the blame” when we have a theory capable of only vague (higher/lower, no/some relationship) conditions. It seems to me you simply cannot do so; the solution is to:
          1) Observe carefully
          2) Record as much about the phenomenon as possible
          3) Think about what may be going on
          4) Guess (adduce) an explanation
          5) Formulate some assumptions of your guess in mathematical form
          6) Deduce precise predictions from these assumptions (upper/lower bounds, existence of a phenomenon, exact values)
          7) Check how close your predictions are to new data

          I made this awhile back based on some of Meehl’s publications, I wonder what you guys think of it:

          http://s29.postimg.org/3n52c0iqv/logical_structure.png

          It was based on these two papers:

          Meehl, P. E. (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141. doi:10.1207/s15327965pli0102_1
          http://rhowell.ba.ttu.edu/meehl1.pdf

          Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Erlbaum.
          http://www.tc.umn.edu/~pemeehl/169ProblemIsEpistemology.pdf

  9. Andrew:

    You wrote: “The issue is that research projects are framed as quests for confirmation of a theory.” And elsewhere you said you don’t understand why people are not willing to consider that they might be wrong.

    Ignoring the obvious reasons for all this (the fame, the money, tenure), why is all this surprising to you? Who doubts themselves (publicly)? If researchers really were to abandon blind loyalty to their own ideas, it would be a personal defeat for them. I’m sure that you also rarely back down from a position or opinion that you have; it would involve loss of face. It’s easy to fool oneself into thinking, that no, it’s not about loss of face, I’m really right about this.

    I think that’s a primary driver of the behavior of scientists and their theory-development, not any real belief in their theory. It’s mostly about not losing face.

    At least in my field, I have yet to encounter a scientist who backs down from a position, science related or not science related, that they have taken a stand on publicly. People are not even willing to express uncertainty about their beliefs; it’s a binary decision. I’m sure there must be people out there in other fields who are willing to express uncertainty publicly and in writing, but I’ve never met one.

    • Andrew says:

      Shravan:

      I’m happy to admit my mistakes; see for example here:
      http://www.stat.columbia.edu/~gelman/research/published/AOAS641.pdf
      and here:
      http://andrewgelman.com/2014/05/12/results-shown/
      and here:
      http://www.stat.columbia.edu/~gelman/research/published/GelmanSpeedCorrection.pdf
      and here:
      http://andrewgelman.com/2014/07/15/stan-world-cup-update/
      and here:
      http://andrewgelman.com/2009/05/11/discussion_and/

      And my colleagues such as Bob Carpenter, Jennifer Hill, Phil Price, etc., are the same way. Indeed I would find it difficult to do science without admitting errors when they occur. It is often through recognition of our errors that we learn the most.

      You write, “I’m sure that you also rarely back down from a position or opinion that you have; it would involve loss of face. It’s easy to fool oneself into thinking, that no, it’s not about loss of face, I’m really right about this.” Of course this is an impossible argument to refute, but really it would be silly for me not to back down from a position when I made a mistake, as that’s how I learn.

      In any case, sure, I realize that human nature is what it is, and I’m not expecting Marc Hauser, Ed Wegman, Anil Potti, etc., to admit they were wrong—they’ve had their chances to admit wrongdoing and haven’t taken those opportunities—nor do I expect Daryl Bem and the various “Psychological Science”-type researchers to admit that they have been spending years chasing noise. I agree with you that in these cases it would just be too difficult for these people to admit, even to themselves, what they’ve done. I suspect that even the out-and-out cheaters have a way to explain to themselves what they’ve done (for example that they’re being attacked by haters, or that their critics are politically motivated, etc.). For example, when I asked him about retracting the false numbers he’d put in his column David Brooks characterized his critics as “intemperate.” I think that in his mind it got him off the hook.

      So, sure, that’s the way it is. But I don’t have to like it. From an empirical or statistical standpoint, yes, I understand. But on an emotional level, I am continually surprised when people refuse to admit their errors. It just seems so weird to me.

  10. Peter says:

    “Suppose you start with a vague research hypothesis (for example, that being exposed to TV political debates makes people more concerned about political polarization). This hypothesis can’t yet be falsified as it does not make precise predictions.”

    I think more clarity is needed here between the concepts of Theory and Hypothesis (for which I don’t think there are adequately accepted definitions). I think of Theory as describing the theorized and unobservable *causal* relationship being considered, and the Hypothesis as the observable ‘correlational’ relationship *implied* by the theory. The theoretical relationship is between ‘theoretical level’ concepts (A&B), while the Hypothesis is about relationship between *operationalized* measures of those theoretical concepts (m(A) & m(B)).

    There is then a logical step involved: (A influences B) implies (m(A) correlates_with m(B))

    This logical step is open to critique: are m(A) and m(B) valid measures of the concepts A and B? Is the theorized relationship correct (e.g. linear vs. various forms of non-linearity)?

    Testing then proceeds under the assumption that the above situation is correct.

    Falsification seeks a ‘reductio ad absurdum’ situation, in which m(A) does not correlate with m(B), implying that one of the *many* assumptions of the test is false. The central assumption is that “(A influences B)”, but it is not the only assumption that should be considered. The critique questions above are two others. Others could involve whether or not there is another theory that would also imply that “(m(A) correlates_with m(B))”. Possibilities include: reverse causality (“B influences A”) and tertium quid (“C influences both A and B”). Another is sample bias: were the samples actually representative of (e.g. selected randomly from) the population to which the results are to be generalized?
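
    As a rough illustration of the tertium quid case (a sketch with made-up effect sizes and noise levels, not an analysis of any real study), the simulation below generates A and B from a common cause C with no direct influence of A on B. The names `c`, `a`, `b`, `m_a`, `m_b` and the Gaussian noise model are assumptions made only for this example.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical data-generating process: C drives both A and B.
# There is deliberately NO arrow from A to B.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

# Operationalized measures m(A), m(B): the concepts plus measurement noise.
m_a = [ai + rng.gauss(0, 0.5) for ai in a]
m_b = [bi + rng.gauss(0, 0.5) for bi in b]

r = pearson(m_a, m_b)
print(f"corr(m(A), m(B)) = {r:.2f}")
```

    Under these assumed variances the population correlation is 1/2.25, about 0.44, even though “A influences B” is false by construction, so observing the correlation cannot by itself separate the two theories.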

    “Confirmation” (in which m(A) *does* correlate with m(B)) says little. All of the same assumptions need to be considered. But once those have been considered, and assuming we’re pretty sure the assumptions hold, we would *tentatively* accept the idea that “A influences B”. But “*tentatively*” is a hard concept for our species to hold onto; we are simply too enamoured of certainty. Confirmation bias gets in the way of falsificationism. There may be a better theory out there that explains the correlational relationship, without the existence of a causal connection between the two concepts, but which we weren’t able to think up at present. If someone comes up with such a theory, we then need to look for operational level consequences of the old and the new theory that are inconsistent with each other, and see which of the two we can reject. (Which gets into the idea of competing research hypotheses, and *comparative* analysis, rather than simply testing a single theory’s Hypothesis against its own Null Hypothesis. And then there’s the idea of generating multiple operationalizations of the same theory: what other *testable* research hypotheses does the theory imply… and what if the different research and null hypotheses produce different results?)

    Similarly, the idea that we can ‘confirm’ a theory doesn’t work, unless we have full knowledge of all the other possible theories. We can only compare a theory against another theory (with the Null Hypothesis representing a ‘Null Theory’), and recognize that the theories have some assumptions involved that we can be aware of, and others that we are not yet equipped to be aware of.

    If you can argue (convince yourself? convince others?) that those other assumptions are valid, then “m(A) does (not) correlate with m(B)” does imply that “A does (not) influence B”. But the arguments around those assumptions are never certain… they too are subject to future insights, theorizing and testing.

    Science is *process* not *certainty*!

    =Peter

    • Fernando says:

      Peter:

      That is exactly the point made by Jaynes.

      For rejection of H0 to count as evidence for H1 we must ensure that these two hypotheses are a partition of the space of hypotheses. If so it must be the case that if H0 is unlikely then H1 must be likely, as beliefs must sum to 1.

      However, science is a human endeavor prone to failure. So we must always consider a third hypothesis, H2, that the evidence reflects some artifact, measurement error etc. If so the logic above breaks. H0 may be unlikely, but H1 even more so, so H2 must be who done it.

      The point of good design, thorough implementation, reliable instruments etc is precisely to minimize the probability of H2 so evidence against H0 counts as evidence for H1.

      Bayesians can also go wrong if they ignore H2.

      If you include H0-H2 in the analysis then results will inform you about the null, the alternative, and the probability of an artifact. Note that H2 can be a stand-in for all the ways in which a study might go awry. No need to specify each and every possibility, though this can be more informative.
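
      A toy calculation may make this concrete. It is only a sketch: the priors and likelihoods below are invented numbers, and `posterior` is a hypothetical helper, not anything from Jaynes. It applies Bayes’ rule over the three hypotheses (null H0, alternative H1, artifact H2) under a careful design and a sloppy one.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive, exhaustive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Invented likelihoods P(data | H): the data are unlikely under the null,
# fairly likely under the alternative, and very likely under an artifact.
likelihoods = {"H0": 0.01, "H1": 0.20, "H2": 0.50}

# Careful design: the prior probability of an artifact (H2) is small.
careful = posterior({"H0": 0.49, "H1": 0.49, "H2": 0.02}, likelihoods)

# Sloppy design: an artifact is quite plausible a priori.
sloppy = posterior({"H0": 0.30, "H1": 0.30, "H2": 0.40}, likelihoods)

for name, post in [("careful", careful), ("sloppy", sloppy)]:
    print(name, {h: round(p, 3) for h, p in post.items()})
```

      In both cases H0 ends up unlikely, but only in the careful design does that translate into strong belief in H1; in the sloppy design most of the posterior mass goes to the artifact.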

  11. ezra abrams says:

    Perhaps I missed it, the part where you do a “bewley” and go out and ask 500 researchers how they work, and thus, with one small data set, slay 100 theoretical arguments?

  12. Christian Hennig says:

    I’d like to add to the “confirmationist/falsificationist” distinction presented here that “testing a straw man H0” (which is commented on here as part of a “confirmationist” research programme) would, if understood properly and carried out with enough interpretational care, be something appropriate in the earlier steps of research, closer to exploratory work. One of my major interests is cluster analysis, and in cluster analysis all kinds of methods will partition the data into clusters, regardless of whether there are “really meaningful” underlying clusters. Part of this exercise then is (or at least should be) to test whether the patterns in the data can actually be explained by a null model that models homogeneity. In the easiest cases, such a model could be a simple Gaussian or uniform distribution. However, sometimes there is structure other than clustering structure in the data, such as spatial dependence, and a null model would have to take such structure into account.

    According to the discussion given here, this would look “confirmationalist”, because the researcher doesn’t really believe the null model, but rather a clustering alternative. The informative value is that if the null model is not rejected, it is clear that the given dataset cannot be used to argue that whatever clustering was found is real and meaningful (although of course it doesn’t mean that the H0 is true). On the other hand, a significant rejection is the more convincing, the harder the researcher tries and works to find a null model that models the data as well as possible (rejecting a naive Gaussian distribution with very low p is usually not enough).

    However, interpreting the result appropriately, it is clear that rejection doesn’t “confirm” the specific clustering that was found in the data by the researcher’s favourite method. It is rather an earlier step and only says that “some kind of clustering is going on here” (or even something else that was more complex than what was in the null model). Better than nothing (often enough there is indeed no evidence for this) but far away from making a “confirmative” statement about anything; so it isn’t really “confirmationist”, or only confirmationist regarding the quite weak hypothesis that “something is going on”.

    To me there seems to be nothing wrong with this apart from the fact that, as was discussed before, people want to make (and read) stronger statements, and they do, regardless of whether these are justified, and people don’t like to admit that what could be found in their data was quite weak (this would probably be easier if everyone else was more modest about their statements as well) and would need to be subjected to much more research including serious falsification attempts in order to generate reliable knowledge.

  13. question says:

    Christian,

    You make these two claims:
    A) “The informative value is that if the null model is not rejected, it is clear that the given dataset cannot be used to argue that whatever clustering was found is real and meaningful (although of course it doesn’t mean that the H0 is true).”

    B) “On the other hand, a significant rejection is the more convincing, the harder the researcher tries and works to find a null model that models the data as well as possible (rejecting a naive Gaussian distribution with very low p is usually not enough).”

    Claim A appears to be incorrect. If “model 2” is a much better fit (enough to cancel out any extra complexity), the results can be used to argue the clustering is real. “Rejecting” the null model has nothing to do with it. That the rejection criterion is usually arbitrary should really drive this point home intuitively.

    Claim B I also do not think is correct. It is talking about a family of “null models”, which is interesting. The implication seems to be that we should choose to work with the simplest model consistent with the data. First, I would say that the origins of the model also play a role (was it derived from first principles, is it totally ad hoc, what assumptions are necessary, etc). Second, we are once again talking about the relative complexity/accuracy tradeoff rather than trying to reject a model. In this case I think p-values may be useful by indexing likelihoods (at least for simple cases such as the t-test, see Michael Lew’s findings here: http://arxiv.org/abs/1311.0081), but the “rejection” step is not.

    If you could attempt to write out your thoughts on this in a more formal fashion perhaps it can help.

    • Christian Hennig says:

      Claim A is of course relative to the test statistic used. What I had in mind here (but have not written down) is a test statistic that formalizes how strong the clustering is, as used for example in C. Hennig and B. Hausdorf: Distance-based parametric bootstrap tests for clustering of species ranges. Computational Statistics and Data Analysis 45 (2004), 875-896 and ftp://ftp.stat.math.ethz.ch/Research-Reports/110.html. Rejection criteria are not arbitrary to me, but have to be related to what you want to find out about. It is true that there may be more than one test statistic worth looking at.

      What do you mean by “much better fit”? A significantly better one? (This *has* to do with rejections.) If it’s not significantly better, the data are compatible with the worse one, too.

      Claim B is not really a precise claim – what would need to be shown is that there is no model that isn’t interpreted as “clustering” and fits the data so well that it could have generated the data. If I reject just a single one, that’s a rather weak step in that direction. However, if I reject something that takes into account all non-clustering structure we can imagine, that’d be much better (although still no mathematical proof that no other possibility is left).

      Certainly I’m not saying with what model we should “work” – I’m interested in sets of models compatible or incompatible with the data in the sense of Laurie Davies’s “Data Features” (1995, Statistica Neerlandica). I’m not interested in ending up with a single “right” model. Also I’m pretty agnostic about how we should arrive at such models (this would probably need to be qualified when challenged).

      • question says:

        “What do you mean by “much better fit”? A significantly better one? (This *has* to do with rejections.) If it’s not significantly better, the data are compatible with the worse one, too.”

        This would depend on context and the practical consequences of deviation from the fit. Also, once again what we appear to be doing in this scenario is comparing the relative merits of multiple models, this is not NHST.

        • Christian Hennig says:

          Well, if an apparent clustering turns out to be not significantly more clustered than what can be expected under a certain non-clustering null model, this seems to be very much NHST to me.

          • question says:

            Christian,

            Can you write out some kind of proof (or link to such) of how the specific procedure you are talking about is helping you cluster? Possibly give an example of this occurring as well. Your link did not work but now I have accessed this paper which I presume is the one: http://www.sciencedirect.com/science/article/pii/S0167947303000914.

            As I keep saying, you appear to be talking about comparing the relative merits of different models based on fit/complexity; this is not NHST. One model will always be better or worse than the other based on AIC or whatever score you choose to use. If you are indeed using NHST-filtering steps this is most likely spurious, and this does appear to be going on in Hennig and Hausdorf 2004. You run the simulations once to reject the null model or not, then again to compare null model with alternative model. Why not just do the second?

            • Christian Hennig says:

              question,
              the paper you link is the right one. I don’t understand what kind of proof you want. If you have a test statistic T that quantifies the amount of clustering, and you have a model M that can be interpreted as “no clustering, all structure comes from other aspects than clustering such as spatial autocorrelation”, and the value of T in your data does not significantly differ from what is expected in M, I can say that in terms of clustering (as far as it’s measured by T), the data cannot be significantly distinguished from M and are therefore not significantly clustered.

              This holds regardless of whether I can find a model that indeed can be interpreted as clustering and fits the data better or not. Actually in the paper we were *not* interested to find a “best model” for the data (you may have been confused by the fact that we actually did specify an alternative model for power simulations; but for the logic I’m referring to here this is not needed). The clustering that we usually do for such data partitions the dataset based on defining a distance and running an MDS, so we get a clustering but this does *not* come with a model for the underlying data generating process. We don’t need the latter if only the clustering itself is of interest. However, in order to demonstrate that *any* real clustering is going on, we see whether there is significantly more clustering than what would be expected under M using T as a measure.

              The argument is less strong in case of significance than in case of insignificance, because in case of significance one could still suspect that another non-clustering model could explain the observed amount of clustering, which cannot be ruled out given how many non-clustering models are conceivable; however, as long as nobody can give me a model that a) formalises non-clustering and b) fits the data well, I will treat the data as significantly clustered.

              Generally I’m quite happy to avoid relying on restrictive model assumptions for this kind of thing. As long as there is a big set of models that could fit the data well, I don’t mind much whether in terms of likelihood one is better than another. As long as we cannot reject a model, it cannot be ruled out, whether there is a better one or not.
              Read Laurie Davies’s work on “Data Features” and “Approximating Data” for getting a taste of this kind of philosophy (although I’m not strictly following his ideas) and especially for why he thinks (as well as I do) that comparing likelihoods is a very questionable tool.
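
              The logic of comparing T in the data with what is expected under M can be sketched in a few lines. This is only a toy stand-in, not the procedure of the paper: the statistic `largest_gap_stat` (largest gap between neighbouring points, divided by the range), the one-dimensional uniform null model, and the simulated datasets are all assumptions chosen for illustration. Because this particular T is invariant to location and scale, simulating the null from Uniform(0, 1) suffices.

```python
import random


def largest_gap_stat(xs):
    """Toy clustering statistic T: largest gap between sorted points / range."""
    s = sorted(xs)
    gaps = [hi - lo for lo, hi in zip(s, s[1:])]
    return max(gaps) / (s[-1] - s[0])


def monte_carlo_test(data, n_sim=999, seed=0):
    """Compare observed T with its simulated distribution under a uniform null M."""
    rng = random.Random(seed)
    t_obs = largest_gap_stat(data)
    t_null = [
        largest_gap_stat([rng.random() for _ in data]) for _ in range(n_sim)
    ]
    # One-sided Monte Carlo p-value: large T counts as evidence of clustering.
    p_value = (1 + sum(t >= t_obs for t in t_null)) / (n_sim + 1)
    return t_obs, max(t_null), p_value


rng = random.Random(1)
# Hypothetical datasets: two tight groups versus a homogeneous sample.
clustered = ([rng.gauss(0.2, 0.03) for _ in range(30)]
             + [rng.gauss(0.8, 0.03) for _ in range(30)])
homogeneous = [rng.random() for _ in range(60)]

t_c, max_null_c, p_c = monte_carlo_test(clustered)
t_h, max_null_h, p_h = monte_carlo_test(homogeneous)
print(f"clustered:   T={t_c:.3f}, max T under M={max_null_c:.3f}, p={p_c:.3f}")
print(f"homogeneous: T={t_h:.3f}, max T under M={max_null_h:.3f}, p={p_h:.3f}")
```

              With 999 simulations the smallest attainable p-value is 0.001, so for very strong departures it is more informative to look at how far the observed T lies beyond the largest simulated value than at the p-value itself.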

              • question says:

                Christian,

                “However, in order to demonstrate that *any* real clustering is going on, we see whether there is significantly more clustering than what would be expected under M using T as a measure…in case of significance one could still suspect that another non-clustering model could explain the observed amount of clustering”

                So… we agree that rejecting the null model fails to achieve your goal of demonstrating any real clustering. Also, earlier you wrote: “Rejection criteria are not arbitrary to me, but have to be related to what you want to find out about.” Yet in this paper you have decided to use the “magical” p=0.05
                p<0.05 therefore ~H0

                ~H0 AND (no one bothered to try anything else) therefore Hcluster

                How do we get to that last step? Where does it come from?

              • question says:

                Something happened to that post…

                “however, as long as nobody can give me a model that a) formalises non-clustering and b) fits the data well, I will treat the data as a significantly clustered.”

                Ignoring the issue of choosing a rejection criterion, what is your reasoning behind (where the tilde ~ = NOT) ~H0 therefore Hcluster? This is what I mean by write up some sort of proof. For example if we had a theory T that predicted parameter A=100, then we measure A=10 we could write:

                If T then A=100
                A=10
                10/=100
                A/=100 therefore ~T

                You are doing something like the following:
                If H0 then p>=0.05
                p<0.05 therefore ~H0

                ~H0 AND (no one bothered to try anything else) therefore Hcluster

                How do we get to that last step? Where does it come from?

              • Christian Hennig says:

                question,

                “So… we agree that rejecting the null model fails to achieve your goal of demonstrating any real clustering.” The thing is, what are the standards here? I was making a rather modest statement conceding the limitations of what I have done, and you make a “fail” out of it. It is true that if I claim “there is a real clustering” because of this, it can be objected that there may exist a model that demonstrates otherwise which I didn’t try. However, this is the case with *whatever* model-based approach is taken. Because *no* parametric model exhausts the full space of possibilities and there is no work whatsoever, in any area, that can argue convincingly that all possible models formalising non-clustering (including the model that states that with probability one data had to look exactly how they look) are ruled out based on the data. Usually such models are just ruled out by assumption, which of course doesn’t restrict reality. So if you think that I failed, for this very reason, no success is possible at all. However, if I can convince a user that the model that I rejected is actually a pretty good attempt to explain the patterns in the data by other means than clustering, this is about as good as it gets. What I can say is that if T measures the amount of clustering and the observed value is too large for what we expect under M, that there is *significantly more clustering in the data than expected under M*. If you want to translate this into an alternative hypothesis, I have rejected M in favour of the class of models under which T can be expected to produce larger values. We could take this as a *definition* of your “Hcluster”, how does that feel? (This certainly depends on whether I can convince a user that the definition of T is appropriate for measuring the kind of clustering they are interested in.)

                I don’t mind much about the “magical” 5%. If you want a crisp yes/no-decision, you need a cutoff, but I’m rather happy to say that with p between about 0.01 and 0.07, say, evidence is moderate but not strong, and I know very well that there is no rational reason to defend precise values. (I probably got your initial text on “arbitrary criteria” wrong because I thought that you were not only talking about the significance borderline but also about the test statistic.)

                The thing is, the biogeographical theories that the work was meant to explore don’t come as precisely specified statistical models with parameter values. We have proposed a null model that formalises what some biogeographers were saying informally (it’s up to them to decide whether we succeeded in this), so rejection should tell them something. Some others were mentioning that they expect some kind of clustering – and that was the *direction* in which our rejection points.

              • Christian Hennig says:

                Regarding the “magical” decision boundary, I should add that in such parametric bootstrap simulations we cannot, for computational reasons, accurately determine p-values like 10^{-8}. What I take as really strong evidence against M is not p<0.05, but rather that the observed T is some distance away from the most extreme value seen under the null model. With the model and statistics in the paper, we have seen this happening for several datasets, as well as non-significance, so the procedure certainly has some discriminative power.
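As a rough illustration of the kind of procedure described in the comment above (not the code or the model from the paper), here is a minimal Python sketch of a parametric bootstrap test. Everything specific in it is an assumption made for the example: the null model M is a single fitted Gaussian on one-dimensional data, and the hypothetical statistic `clustering_stat` plays the role of T. It reports the Monte Carlo p-value and also the most extreme T seen under M, since with `n_sim` simulations the p-value cannot be resolved below 1 / (n_sim + 1).

```python
import random
import statistics


def clustering_stat(xs):
    """Toy T (an assumption for this sketch): total sum of squares divided by
    the smallest within-group sum of squares over all two-group splits of the
    sorted data. Larger values mean a more pronounced two-group structure."""
    s = sorted(xs)

    def sum_sq(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best_within = min(sum_sq(s[:k]) + sum_sq(s[k:]) for k in range(1, len(s)))
    return sum_sq(s) / best_within


def parametric_bootstrap(xs, n_sim=999, seed=0):
    """Simulate T under the fitted null model M and compare with the observed T.

    Returns (observed T, Monte Carlo p-value, largest T simulated under M)."""
    rng = random.Random(seed)
    # Fit the (assumed) null model M: one Gaussian, no clusters.
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    t_obs = clustering_stat(xs)
    # Draw datasets of the same size from M and recompute T on each.
    t_sim = [
        clustering_stat([rng.gauss(mu, sigma) for _ in xs])
        for _ in range(n_sim)
    ]
    n_at_least = sum(t >= t_obs for t in t_sim)
    # The +1 terms count the observed dataset itself, so p is never exactly 0.
    p_value = (n_at_least + 1) / (n_sim + 1)
    return t_obs, p_value, max(t_sim)


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Two well-separated groups, so T should be large relative to M.
    data = [data_rng.gauss(0, 1) for _ in range(20)] + \
           [data_rng.gauss(10, 1) for _ in range(20)]
    t_obs, p, t_max = parametric_bootstrap(data)
    print(f"observed T = {t_obs:.2f}, p = {p:.4f}, max T under M = {t_max:.2f}")
```

Comparing the observed T with the largest simulated value is a way of expressing "far beyond anything seen under M" when the simulation budget is too small to estimate a very small p-value. A rejection here only says that there is more clustering, as measured by T, than M produces.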

              • question says:

                Christian,

                I meant no offense by the “fail” terminology; I don’t think your attempt to deal with these data was a fail at all.

                I just really don’t understand your justification for rejecting/disproving the null model in order to claim something about a different model. It seems to me you should not do this because there are reasons for the model to be rejected other than the presence of clusters. It is very straightforward.

                “We have proposed a null model that formalises what some biogeographers were saying informally (it’s up to them to decide whether we succeeded in this), so rejection should tell them something. Some others were mentioning that they expect some kind of clustering – and that was the *direction* in which our rejection points.”

                I mostly agree with this. Your model is not a strawman, it appears to be a good attempt at describing the data and a good tool for exploring it. My problem is 100% with this “rejection” step. No one expects it to be perfect to begin with so rejecting the perfection of the model does not seem capable of providing useful information. Checking how good the fit is, however, does provide useful information if compared to other models.

                The way I see it there should be two models, one with clusters and one without. They should be compared by p-value, AIC, BIC, whatever. The deviations from the models should be explored (as you did in your paper) to see if anyone comes up with an idea to improve them. Then some non-algorithmic thoughts run through people’s brains and they decide (informally, as recommended by Fisher) which model seems best or if more data is necessary.
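To make the "two models, compared by AIC or BIC" suggestion concrete, here is a small Python sketch under strong simplifying assumptions: the data are one-dimensional, both models are Gaussian with a common variance, and group membership is treated as known rather than estimated, which sidesteps the harder problem of fitting a real clustering model. The function names are hypothetical.

```python
import math


def gaussian_loglik(residuals):
    """Maximised Gaussian log-likelihood given residuals from fitted means,
    using the maximum-likelihood variance estimate RSS / n."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def information_criteria(loglik, k, n):
    """AIC and BIC for a model with k free parameters fitted to n points."""
    return {"aic": 2 * k - 2 * loglik, "bic": k * math.log(n) - 2 * loglik}


def compare_models(group_a, group_b):
    """Compare 'one common mean' against 'one mean per group'."""
    pooled = group_a + group_b
    n = len(pooled)
    grand_mean = sum(pooled) / n
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    res_common = [x - grand_mean for x in pooled]
    res_groups = [x - mean_a for x in group_a] + [x - mean_b for x in group_b]
    return {
        # Parameters: one mean + one variance.
        "no_groups": information_criteria(gaussian_loglik(res_common), 2, n),
        # Parameters: two means + one variance.
        "two_groups": information_criteria(gaussian_loglik(res_groups), 3, n),
    }


if __name__ == "__main__":
    a = [0.1, -0.2, 0.3, 0.0, -0.1]
    b = [5.1, 4.8, 5.2, 5.0, 4.9]
    for name, crit in compare_models(a, b).items():
        print(name, {key: round(val, 2) for key, val in crit.items()})
```

Lower AIC or BIC indicates the preferred model. Neither model is "rejected" in this setup; the criteria only rank the candidates that were actually fitted, which is the contrast with the rejection step being drawn in the comment above.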

  14. Christian Hennig says:

    “There is some clustering” is a rougher qualitative statement than specifying a precise model for clustering. Cluster analysis is only partly statistical in the sense that there are many techniques that are not probability model-based or only semi model-based (like ours). I use probability models as tools and they’re not my only tools. If I’m interested in a clustering, I don’t necessarily need a model for the data. I can get a clustering in other ways (for example, I may want a clustering based on a tailor-made dissimilarity measure because this quantifies properly what should count as “being close” in this application; but fitting and evaluating likelihoods for such models is tough). I wouldn’t make much of the clustering, though, if it could be explained by non-clustering structure such as spatial autocorrelation alone, as formalised in the null model.
    I’m not in principle against having a more comprehensive model including clustering and against comparing them. I just don’t subscribe to the dogma that that’s the only way of doing this, and there may be good reasons to take another way.

    Oh, and let me reiterate: I don’t reject the “perfection” of the null model and you’re right, I don’t believe it literally anyway. What is important is the *direction* of clustering formalised in my T, against which I’m rejecting the model.

    The statistic T is a better formalisation of what we’re interested in in this application than the likelihood ratio of two specific models.

    • Christian Hennig says:

      This was a reply to “question” above.

    • question says:

      Christian,

      “I just don’t subscribe to the dogma that that’s the only way of doing this, and there may be good reasons to take another way.”

      To be clear, I am not so much interested in the best way to do this particular case; I am trying to figure out why people seem to think that disproving one model provides evidence in favor of another (Inf-1 = Inf). It became clear to me in my own work that the usual “model” being rejected of “two groups of animals are exactly the same on average” was not helpful to me in any way because there will always be differences at baseline and “lurking variables” that arise during the course of the study. It is at best a spurious step on top of estimating effect sizes. At worst it has been a source of widespread confusion.

      Your use of a null model is different in some ways, but I still do not understand why you think “rejecting” it has helped you. By rejecting you are implying that some sort of deduction can be made from the premise that H0 is false. I can see no other purpose to rejecting. I can however see a purpose to comparing the model to the data in order to 1) make predictions and 2) compare to other models. The usefulness of your paper is to provide a null model that future models should be compared to.

      You write: “What is important is the *direction* of clustering formalised in my T”. I interpret this as “if H0 is false (or a bad fit or whatever) then clustering is more likely”. Please make this into the form of a proof as I suggested above so that I can understand:

      “You are doing something like the following:
      If H0 then p>=0.05
      p<0.05 therefore ~H0

      ~H0 AND (no one bothered to try anything else) therefore Hcluster”

      • Christian Hennig says:

        Well, I wrote before that in principle non-rejection is the stronger result because it clearly shows that a non-clustering model is compatible with the data, so I’m OK with you being skeptical about the implications of rejection.
        However, you should realise that I’m making a weaker statement than “I have evidence in favour of another (specific) model”. I’m just saying that there is significantly more clustering in the data (as measured by T) than expected under M. If you are happy to call *just this* Hcluster, it’s Hcluster indeed, but this Hcluster is not a very strong or specific statement. There is nothing to prove here; I’m just explaining how I interpret what was shown.
        OK, this is going round in circles now.

        What would it mean to you to say that a clustering is “real”, and how would you go on to show it?

  15. […] Gelman highlights a nice exchange he had with Deborah Mayo on falsificationist vs. confirmationist approaches to […]

  16. […] as a good Popperian, Meehl would agree with me completely that null hypothesis significance testing wears the cloak of falsificationism without actually being […]

  17. […] “significant” is insignificant. A pernicious mistake that scientists constantly make is assuming that every rejection of the null is confirmation for the alternative. The fact that the data is unlikely under the null hypothesis doesn’t mean it’s any […]
