Of hypothesis tests and Unitarians

Xian, Judith, and I read this line in a book by statistician Murray Aitkin in which he considered the following hypothetical example:

A survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address . . . The question of interest is whether there has been a change in support between the surveys . . . We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.

Here is our response:

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypothesis, not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, it makes more sense not to try to rehabilitate it or treat it as treasured metaphor but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population.
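To make "attacking the problem directly" concrete, here is a minimal sketch of estimating the change rather than testing it. The counts are invented, the two waves are treated as independent samples with uniform Beta priors, and the paired structure of the actual survey (same 100 people twice) would call for a richer model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (not from the post): 100 respondents per wave
yes_before, n_before = 52, 100
yes_after,  n_after  = 45, 100

# Beta(1,1) priors; conjugate updating gives Beta posteriors per wave
p_before = rng.beta(1 + yes_before, 1 + n_before - yes_before, 100_000)
p_after  = rng.beta(1 + yes_after,  1 + n_after  - yes_after,  100_000)

delta = p_after - p_before            # posterior draws of the change in support
lo, hi = np.percentile(delta, [2.5, 97.5])
print(f"posterior mean change: {delta.mean():+.3f}")
print(f"95% interval: ({lo:+.3f}, {hi:+.3f})")
```

The output is an estimate of how much support changed, with uncertainty, rather than a yes/no verdict on whether it changed at all.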

I like the line about the Unitarian Church, also the idea of hypothesis testing as a religion (since people are always describing Bayesianism as a religious doctrine).

15 thoughts on “Of hypothesis tests and Unitarians”

  1. I don’t think the Unitarian Universalist Association makes any attempt to follow the Bible. Perhaps the line would work better (but more controversially) if applied to the many Christians who ignore the Bible’s story of creation, but still accept all the parts about Christ.

  2. We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.

    Here is our response:

    Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

    A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypothesis, not that the change is zero but that it is nearly zero.

    This is a straw man argument. Here’s the appropriate way to think about this particular problem. When things are measured they have a natural variability. For example, a “fair” coin has a probability of producing heads half the time, yet if we flip it 10 times, the probability of exactly 5 heads is rather small. On the other hand, the probability of between 4 and 6 heads the first time we flip the coin and between 4 and 6 heads the second time is fairly sizable. If we observe 4 heads the first time and 6 heads the second time, what we would say is not that there has been a change of 2 (though that is certainly possible, as the power of discrimination is rather low with 10 flips), but rather that the hypothesis that this is a fair coin cannot be rejected on either trial. Similarly, here, we would infer the variability from the first and second polls (and if that variability has changed, that itself indicates something has changed) and use this as a basis for judging whether there has been a change in public opinion _greater_ than the natural variability of the underlying random process. This is how to interpret “change”: it is change different from what would ordinarily be expected given our assumptions about how the probability is modeled.
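The commenter’s binomial arithmetic can be checked directly; a quick sketch under the fair-coin assumption:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    # Probability of exactly k heads in n independent flips with P(heads) = p
    return comb(n, k) * p**k * (1 - p)**(n - k)

p5 = binom_pmf(5, 10)                               # exactly 5 heads in 10 flips
p4to6 = sum(binom_pmf(k, 10) for k in range(4, 7))  # 4, 5, or 6 heads

print(f"P(exactly 5 of 10):   {p5:.3f}")      # ~0.246: 'rather small'
print(f"P(4-6 of 10):         {p4to6:.3f}")   # ~0.656
print(f"P(4-6 on both trials): {p4to6**2:.3f}")  # ~0.431: 'fairly sizable'
```

So a fair coin produces exactly its expected count only about a quarter of the time, while a loose band around it is hit on both trials more than 40% of the time, which is the commenter’s point about natural variability.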

    I think the reason Bayesian statistics has caught on so much in the social sciences is that it allows one to get one’s ego involved. It’s hard to imagine CERN doing multi-level inference or an agricultural statistician doing a Latin Squares ANOVA with an inverse-Gamma prior. Certainly, physical scientists will accept and expect that the nature of the underlying probability distribution will affect the observed results (just as gravity is a distortion of a uniform space which affects the path of a particle). But for Bayesian inferential techniques (as opposed to the numerical techniques which have been developed) to really score a victory over frequentist methods, examples from outside the social sciences are demanded, IMHO.

    • Numeric:

      Agricultural statisticians have been doing Bayesian inference for a long time. And BUGS was produced by the (U.K.) Medical Research Council. Finally, in answer to the question in your last sentence, see my 1996 paper with Bois and Jiang. The topic: toxicology.

      • I do a google search on

        factor analysis agricultural statistics

        and get 3,600,000 matches (approximately). The same for

        bayesian agricultural statistics

        is 174,000. The immediate inference (see Corey’s comment below, and derive a method of inference from that) is that factor analysis is 20 times more important in agricultural statistics than Bayesian inference. Given what most statisticians think about factor analysis (generate data from a box and run a FA on it), this is not a recommendation. Referencing Corey’s comments about astronomy, I have a belief approaching unity that Bayesian statistics are used more (a lot more) in fields where reasonable experimentation is limited or not available.

        Here’s what I want. I want a paper in the physical sciences, that uses Bayesian methodology, that derives a different scientific result than the frequentist methods, from the same data, and this scientific result is then verified experimentally (within the usual standards of proof in that field).

        As a final thought, I leave you with a table of bogusness for a variety of techniques. Make of it what you will.

        Presence of Critical Theory Signifiers for Different Search Terms*

        Search term             “Modalities”        “Emergent”          Any match
                                Percent (N)         Percent (N)         N
        Discourse Analysis      2.80  (24,500)      17.65 (167,000)       940,000
        Self-Organizing Maps    3.00  (5,390)       14.83 (26,700)        180,000
        Factor Analysis         4.00  (104,000)     10.2  (265,000)     2,600,000
        Linear Regression      1.80  (83,900)       5.59  (260,000)     4,640,000

        *From a Google search on 3/17/10

  3. How about this interpretation of the null hypothesis test: you’re checking whether there’s enough data to say anything conclusive about the change. It’s just a way to summarize the size and positioning of the confidence interval. Of course there was a real change, the question is how informative your data is about it.

  4. @Brendan, that doesn’t really make sense to me; isn’t a confidence interval itself already a summary? It seems like you might as well just look at the confidence interval.

    • Hang on, there’s an equivalence between hypothesis tests and confidence intervals: CIs are constructed by inverting HTs, and contain the values that wouldn’t be rejected by a HT. So isn’t the HT vs. CI argument really pointless? You need one to get the other, so it’s not like we’re in an either/or situation – a HT is a first step in making an interval estimate.

      I also think there are circumstances where hypothesis testing can be more relevant than making inferences about how much the support has changed. Two examples are (a) testing a composite null (say, movement > 0) where you might just be interested in a direction – have things got worse? – or (b) testing a sharp null through permutation, which makes fewer assumptions than a model-based CI or estimate of change. I’m not arguing you shouldn’t try to estimate quantities – that’s a perfectly good point – but there are circumstances where hypothesis testing has value too.
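A minimal sketch of (b), the sharp-null permutation test, for hypothetical paired yes/no responses. The counts of opinion-switchers are invented for illustration; under the sharp null of no individual change, each respondent’s before/after labels are exchangeable, so each discordant difference is equally likely to be +1 or −1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical paired data: only discordant pairs matter.
# Say 12 respondents switched No -> Yes, 20 switched Yes -> No.
diffs = np.array([+1] * 12 + [-1] * 20)   # after minus before, per switcher

observed = diffs.sum()                    # observed net change: -8

# Sharp null: flip each difference's sign at random, many times,
# to build the permutation distribution of the net change.
signs = rng.choice([-1, 1], size=(100_000, diffs.size))
perm_stats = (signs * np.abs(diffs)).sum(axis=1)

# Two-sided p-value: how often is a permuted net change at least as extreme?
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"observed net change: {observed}, permutation p ~ {p_value:.3f}")
```

The only assumption is exchangeability of each pair under the null, not any parametric model for the responses.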

        • You *can* always invert a (family of) tests to get a confidence interval, just as you can go from a confidence interval to a test of any point null. However, the validity of both may end up being sensitive to the veracity of modeling assumptions (aptly labelled “dire” in the original post). If the model’s wrong, one can get very poor behavior.

          Even when the model is right, there may be some datasets for which the interval is a point (or the whole real line) and the corresponding test is equally non-intuitive. Here, the problem is that one is sacrificing sensible inference using the data at hand in favor of guaranteeing frequentist properties, in repeated uses. It’s similar to the use of ridiculous unbiased estimators, just because they are unbiased.

          These seem like different problems, though in practice it may be hard to tell them apart.
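The inversion equivalence is easy to see concretely; a sketch for a normal mean with known variance (the sample, seed, and grid are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# One sample of size 50 from N(0.3, 1), with sigma = 1 treated as known
x = rng.normal(0.3, 1.0, size=50)
xbar, se = x.mean(), 1.0 / np.sqrt(len(x))
z975 = 1.959963984540054                  # 0.975 quantile of the standard normal

ci = (xbar - z975 * se, xbar + z975 * se)  # 95% confidence interval

def rejected(mu0):
    # Two-sided z-test of H0: mu = mu0 at the 5% level
    return abs((xbar - mu0) / se) > z975

# The CI is exactly the set of null values the test fails to reject:
agree = all(
    (ci[0] < mu0 < ci[1]) == (not rejected(mu0))
    for mu0 in np.linspace(-1, 1, 2001)
)
print("CI = set of non-rejected nulls:", agree)
```

This is the textbook duality: each confidence level corresponds to a test size, and each interval endpoint to a borderline null.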

  5. A different kind of comment is that the value, or effectiveness, of any analogy depends on how clear it is to its audience.

    If you consider the range of readers in different countries, how many would be familiar with the idiosyncrasies of Unitarian theology inside the United States, which is the reference here? (People who might be Unitarians in other countries are usually something else in those countries.)

    Otherwise put, this is likely to be a drily amusing analogy to those who get it and a very obscure one to those who don’t, by my wild guess a strong majority of your likely readers.

  6. I wonder if part of the enduring appeal of HT is that it imposes a dichotomous framework on the analysis. People seem to like to reduce things into yes/no answers because it’s more memorable and easier to combine arguments/reasoning about phenomena (insert taleb reference to procrustes here). “We know that X is more effective than Y…”, “They showed that X has a negative effect on…” etc.

    I hate this approach to science for the usual reasons (the real world is complicated, and it’s too tempting to generalize these rigid yes/no results when it is unwarranted). Personally, I would like to see a more fluid approach to science, where hypotheses are more nuanced and change more dynamically with evidence, but from a cultural/social-engineering standpoint, it’s difficult to push things in that direction.
