What hypothesis testing is all about. (Hint: It’s not what you think.)

I’ve said it before but it’s worth saying again.

The conventional view:

Hyp testing is all about rejection. The idea is that if you reject the null hyp at the 5% level, you have a win, you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for which he was slammed by Orwell, a lesser poet but a greater political scientist), a small bit of solid knowledge in our otherwise uncertain world.

But (to continue the conventional view) often our tests don’t reject. When a test does not reject, don’t count this as “accepting” the null hyp; rather, you just don’t have the power to reject. You need a bigger study, or more precise measurements, or whatever.

My view:

My view is (nearly) the opposite of the conventional view. The conventional view is that you can learn from a rejection but not from a non-rejection. I say the opposite: you can’t learn much from a rejection, but a non-rejection tells you something.

A rejection is, like, ok, fine, maybe you’ve found something, maybe not, maybe you’ll have to join Bem, Kanazawa, and the Psychological Science crew in the “yeah, right” corner—and, if you’re lucky, you’ll understand the “power = .06” point and not get so excited about the noise you’ve been staring at. Maybe not, maybe you’ve found something real—but, if so, you’re not learning it from the p-value or from the hypothesis tests.
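
To see what the “power = .06” point looks like in numbers, here’s a minimal sketch (the true effect of 0.3 standard errors below is an assumed value, picked only because it yields power of about 6%; it is not taken from any particular study):

```python
# Minimal sketch of the "power = .06" point.  The true effect of 0.3
# standard errors is an assumed value, chosen only because it gives
# power of roughly 6%; it is not taken from any particular study.
import numpy as np
from scipy.stats import norm

true_effect = 0.3          # true effect, in standard-error units (assumed)
z_crit = norm.ppf(0.975)   # two-sided 5% critical value

# Power: probability an unbiased, normally distributed estimate clears the threshold.
power = norm.sf(z_crit - true_effect) + norm.cdf(-z_crit - true_effect)
print(f"power = {power:.2f}")   # about 0.06

# Among the lucky "significant" results, how exaggerated is the estimate?
rng = np.random.default_rng(0)
est = rng.normal(true_effect, 1, size=1_000_000)
sig = np.abs(est) > z_crit
print(f"mean |estimate| when significant = {np.abs(est[sig]).mean():.2f}")
# Several times larger than the true effect: exactly the noise-chasing worry above.
```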

A non-rejection, though: this tells you something. It tells you that your study is noisy, that you don’t have enough information in your study to identify what you care about—even if the study is done perfectly, even if your measurements are unbiased and your sample is representative of your population, etc. That can be useful knowledge: it means you’re off the hook trying to explain some pattern that might just be noise.

It doesn’t mean your theory is wrong—maybe subliminal smiley faces really do “punch a hole in democratic theory” by having a big influence on political attitudes; maybe people really do react differently to himmicanes than to hurricanes; maybe people really do prefer the smell of people with similar political ideologies. Indeed, any of these theories could have been true even before the studies were conducted on these topics—and there’s nothing wrong with doing some research to understand a hypothesis better. My point here is that the large standard errors tell us that these theories are not well tested by these studies; the measurements (speaking very generally of an entire study as a measuring instrument) are too crude for their intended purposes. That’s fine; it can motivate future research.

Anyway, my point is that standard errors, statistical significance, confidence intervals, and hypothesis tests are far from useless. In many settings they can give us a clue that our measurements are too noisy to learn much from. That’s a good thing to know. A key part of science is to learn what we don’t know.

Hey, kids: Embrace variation and accept uncertainty.

P.S. I just remembered an example that demonstrates this point: it’s in chapter 2 of ARM and is briefly summarized on page 70 of this paper.

In that example (looking at possible election fraud), a rejection of the null hypothesis would not imply fraud, not at all. But we do learn from the non-rejection of the null hyp; we learn that there’s no evidence for fraud in the particular data pattern being questioned.

21 thoughts on “What hypothesis testing is all about. (Hint: It’s not what you think.)”

    • @Erin:

      Actually, judging by what I see on the internet, better clickbait would have to include a number:

      “10 reasons p-values are wrong (and they are not what you think!)”

      “5 ways to misunderstand p-values”

      “3 common ways to get NHST wrong”

      “The 7 things you need to know about NHST”

      “The end of the world is near. 7 ways to test for it.”

      and so on…

  1. You said: “I say the opposite: you can’t learn much from a rejection”. I think I’d say that you can’t learn much from a _weak_ rejection. There are lots of conversations around the 5%/standard-deviation/p-value threshold these days, and rightly so, but that’s a borderline case; an artificial line in the sand. The closer you are to the line, the less you learn, and in that I agree; the nearer to 5% you are, the less valuable your rejection. But the same is true on the other side — you can learn more from some non-rejections than others. Near the line, it’s silly to ask if you learned more from a 0.0501 or a 0.0499 result. There’s no useful difference.

    It’s not a matter of whether you accept or reject that informs the learning experience, it’s by how much (assuming an otherwise reasonably laid out experiment… high enough power and all that).

  2. People keep thinking there is some nuance of explanation, some special way to look at classical statistics, some verbal formulation which will make all the problems with the frequentist worldview disappear. If there were, it would have been discovered by the 1930s at the latest, and we wouldn’t be talking about the real meaning of hypothesis testing today.

    Here’s the bottom line. Any acceptance or rejection of a hypothesis is really a claim about what you’ll see out of sample, or in a future sample, or whatever. Frequentist ideology (which, in truth, most Bayesians are heavily influenced by) leads people to believe their model-building techniques, steps, verifications, and model tests “confirm” the model and thus “confirm” those out-of-sample predictions. The vast majority of the time they do not. The vast majority of the time those model verification steps have precisely nothing to say about those out-of-sample predictions.

    Until that is fixed, there will continue to be a nearly random connection between the findings of statistical studies and the truth.

  3. Check out figures 3-4 of Meehl (1990) where he attempts to define a “corroboration index”:

    “Working scientists who never heard of Popper, and who have no interest in philosophy of science, have for at least three centuries adopted the position that a theory predicting observations “in detail,” “very specifically,” or “very precisely” gains plausibility from its ability to do this. I have not met any scientist, in any field, who didn’t think this way, whether or not he had ever heard of Karl Popper. If my meteorological theory successfully predicts that it will rain sometime next April, and that prediction pans out, the scientific community will not be much impressed. If my theory enables me to correctly predict which of 5 days in April it rains, they will be more impressed. And if I predict how many millimeters of rainfall there will be on each of these 5 days, they will begin to take my theory very seriously indeed. That is just scientific common sense, part of the post-Galilean empirical tradition that does not hinge on being a disciple of Popper or Lakatos.”
    Meehl, P (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141. doi:10.1207/s15327965pli0102_1
    http://rhowell.ba.ttu.edu/meehl1.pdf

    The information content of rejection vs non-rejection depends upon the precision of the estimate and precision of the *research* hypothesis. A theory that predicts “some relationship” or “positive relationship” along with a statistical test that rejects “no relationship” or “negative relationship” is not impressive. Why? The more imprecise the research hypothesis the easier it is to come up with alternative explanations for the observations.

    • If you have two hypotheses H1 and H2 which are both consistent with (or confirmed by) the data, in the sense that the observed data isn’t a low-probability extreme value of either distribution, then Bayes’ theorem will pick out the one that makes the sharper predictions. If P(data|H1) is sharply peaked (narrow uncertainty), for example, while P(data|H2) is spread out, then P(observed data|H1) will be a much larger number than P(observed data|H2), thus favoring P(H1|observed data) over P(H2|observed data).
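
      Here’s a minimal numeric sketch of that point; the two predictive distributions and the observed value are assumptions made up for illustration:

```python
# Two hypotheses, both "consistent with" the observed data, but H1 makes
# the sharper prediction.  The distributions and the observed value are
# assumptions for illustration only.
from scipy.stats import norm

x_obs = 0.5                                  # observed data; not extreme under either hypothesis
like_h1 = norm.pdf(x_obs, loc=0, scale=1)    # H1: sharp prediction
like_h2 = norm.pdf(x_obs, loc=0, scale=10)   # H2: vague prediction

# With equal prior probabilities, the posterior odds equal the likelihood ratio.
print(f"P(data|H1) = {like_h1:.3f}")
print(f"P(data|H2) = {like_h2:.3f}")
print(f"posterior odds H1:H2 = {like_h1 / like_h2:.1f}")   # roughly 9 to 1 in favor of H1
```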

      The sum and product rules of probability theory, on which the Bayesian formalism rests, are a better guide to scientific inference than any one person’s intuition. Especially the intuition of some philosopher poser like Popper who never made a real-world scientific inference in his entire life.

      • Anonymous,

        I agree that the Bayesian approach is superior. However, when analyzing my own data I really did have trouble choosing a non-uniform prior; it just felt wrong/unnecessary.

        The data was similar to figure 5 of this paper (a set of roughly sigmoidal learning curves): http://www.ncbi.nlm.nih.gov/pubmed/15364484. I wished to fit some curve to each individual and then see how the curves differed (in magnitude and type) across some groups. Then I thought I could say, e.g., “insofar as this functional form is a good summary of the data, which it appears to be from the plots (see figure X), group A was slower learning on average than group B.” Obviously this is a classic time to fit a hierarchical model (e.g., with Stan).
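
        For concreteness, here is a rough sketch of the kind of per-individual sigmoid fitting I mean; the functional form, parameter names, and simulated data are my own assumptions, not the analysis from the linked paper:

```python
# Per-individual sigmoid fits, then a crude group comparison of one parameter.
# The functional form, parameters, and simulated data are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, upper, rate, midpoint):
    """Roughly sigmoidal learning curve."""
    return upper / (1.0 + np.exp(-rate * (t - midpoint)))

rng = np.random.default_rng(1)
trials = np.arange(1, 31)

def fit_individual(midpoint_true):
    """Simulate one learner's data and fit the curve to it."""
    y = sigmoid(trials, 1.0, 0.4, midpoint_true) + rng.normal(0, 0.05, trials.size)
    params, _ = curve_fit(sigmoid, trials, y, p0=[1.0, 0.3, 15.0])
    return params  # (upper, rate, midpoint)

# Group A is assumed to learn later (larger midpoints) than group B.
group_a = np.array([fit_individual(m) for m in rng.normal(18, 2, size=10)])
group_b = np.array([fit_individual(m) for m in rng.normal(12, 2, size=10)])

print(f"mean fitted midpoint, group A: {group_a[:, 2].mean():.1f}")
print(f"mean fitted midpoint, group B: {group_b[:, 2].mean():.1f}")
```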

        1) I chose not to estimate group-level parameters using a hierarchical model, preferring to calculate them from the individual fits. I did not like Bayesian shrinkage, which I felt assumed group differences a priori. Similarly, I was not comfortable using non-uniform priors because it seemed to be adding in additional (unnecessary) assumptions.

        2) In the case of the uniform priors, I found the credible intervals for each parameter to be *mathematically* equivalent to confidence intervals (see the sketch after this list). So when summarizing the data either process could work just as well; there was no practical issue with transposing the conditional: p(data|H) = p(H|data). I have to say this really made me question what the big issue is that “frequentists” have with using uniform priors. It also made me doubt the “Bayesian” claim that misinterpretations of confidence intervals are a big problem in practice. It must be for problems other than the one I was dealing with!

        3) I found that I did not like any use of arbitrary (90%, 95%, 99%, etc.) intervals (either confidence or credible). The problem was that my choice of functional form (the model) was already somewhat arbitrary, only expected to be a “good enough” summary. Assigning such numerical precision on top of this felt dishonest. I did like the posteriors, though; they much better conveyed the uncertainty given the model.

        4) I realized that even if we accept that, e.g., Group A was slower than Group B according to some arbitrary cutoff, that was not informative enough to be of any use. A treatment can lead to slower learning for all sorts of reasons that may or may not be of any interest, or even be due to opposite types of effects. Maybe slower learning is associated with developing “good habits,” which is beneficial. On the other hand, maybe the treatment caused some obstacle/deficit to learning, which is harmful.
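
        (Sketch for point 2 above: for a normal mean with known sigma and a flat prior, the central 95% credible interval coincides with the classical 95% confidence interval. The data below are simulated just for illustration.)

```python
# Flat prior on a normal mean with known sigma: the central 95% credible
# interval equals the classical 95% confidence interval.  Simulated data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma, n = 1.0, 25
y = rng.normal(0.4, sigma, size=n)

ybar, se = y.mean(), sigma / np.sqrt(n)
z = norm.ppf(0.975)

conf_int = (ybar - z * se, ybar + z * se)            # classical 95% CI
# Flat prior => posterior for the mean is Normal(ybar, se), so:
cred_int = norm.interval(0.95, loc=ybar, scale=se)   # central 95% credible interval

print("95% confidence interval:", np.round(conf_int, 3))
print("95% credible interval:  ", np.round(cred_int, 3))   # identical
```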

        In the end, I could not be comfortable drawing any kind of meaningful conclusions from the parameter estimates. Yes, we want to know p(H|data) rather than p(data|H), but that is not the end of the story. Even p(H|data) is not enough; I was still unable to assign normative statements (i.e., Group A was better/worse than Group B) to the results.

        So I appear to disagree with you in that I do not think the most fundamental problem lies in Bayesian vs. frequentist approaches to analyzing data; it is instead the lack of meaningful models of the data-generating process. If there are group differences (and there almost always are for some reason or other), that is only a subset of the data that needs to be explained. We also need to propose answers to what processes may lead to sigmoidal time series, why with that set of midpoints, etc. Then meaningful inferences can be made in the context of the theory (or better theories) that develop.

        • ****”I was not comfortable using non-uniform priors because it seemed to be adding in additional (unnecessary) assumptions.”

          You should use whatever knowledge is relevant and known to be true. If you don’t use a uniform assumption you are (typically) effectively ruling out certain possibilities. You shouldn’t do this unless you have positive information to rule them out. The uniform distribution is conservative in the sense that it says roughly, “we give full consideration to any possibility for the parameter; we won’t rule any possibility out before seeing the data.”

          Even if you do have some info which implies a non-uniform distro, you may still use a uniform because it’s easier. Doing so will widen the final uncertainties, but if the uncertainty is still small enough to answer the question you care about, then the greater effort and precision would have been wasted.

          ****”I found the credible intervals for each parameter will be *mathematically* equivalent to confidence intervals….It must be for problems other than the one I was dealing with!”

          In a historically important simple group of problems that’s true. Outside of that there are huge differences. For example, in real and straightforward-looking problems you can get confidence intervals for parameters which provably (using the same assumptions used to derive the CI) can’t contain the true parameter value. This can never happen with a full Bayesian version, because Bayes will automatically be consistent with anything you can deduce logically.
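
          One classic example of this sort of thing, sketched below under assumed conditions: observations are theta plus standard-exponential noise, so theta can never exceed the smallest observation, yet an exact 90% confidence interval built from the sample mean can land entirely above it.

```python
# Illustration: x_i = theta + e_i with e_i ~ Exponential(1), so theta <= min(x)
# always.  An exact 90% CI based on the sample mean can still lie entirely
# above min(x), i.e. in a region that provably cannot contain theta.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
theta, n = 10.0, 3
draws, found = 0, None
while found is None:
    x = theta + rng.exponential(1.0, size=n)
    # mean(e) ~ Gamma(shape=n, scale=1/n); exact 90% CI for theta from the mean:
    lo = x.mean() - gamma.ppf(0.95, a=n, scale=1/n)
    hi = x.mean() - gamma.ppf(0.05, a=n, scale=1/n)
    draws += 1
    if lo > x.min():          # CI sits entirely in the impossible region
        found = (x, lo, hi)

x, lo, hi = found
print(f"after {draws} samples: data = {np.round(x, 2)}")
print(f"90% CI = ({lo:.2f}, {hi:.2f}), but theta must be <= min(x) = {x.min():.2f}")
```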

          ****”I did like the posteriors though, they much better conveyed the uncertainty given the model.”

          The totality of evidence points towards the posterior you worked out. That is the result. The frequentist habit of adding a gratuitous “hypothesis test” step and accepting or rejecting portions of the parameter space is effectively truncating the posterior with no additional information/evidence. This “fallacy” is the origin of a huge percentage of frequentists’ problems in practice (especially when they string a series of these together when analyzing one data set; the total error involved grows rapidly, well beyond the stated alpha levels).

          If Bayesians start copying it, then they have those same problems. If you have to make a final decision or best guess then use the posterior with a loss function in a real decision analysis. Otherwise always retain the full posterior. That’s what your evidence and model implied.
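
          A quick simulation of that error-inflation claim (the number of tests and the independence of the test statistics are assumptions for illustration):

```python
# Stringing together many 5% tests inflates the overall error rate well
# beyond 5%.  Ten independent tests with all nulls true, for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n_tests = 10          # tests run on one data set (assumed)
n_sims = 100_000

# All nulls true: z-statistics are independent standard normals.
z = rng.normal(size=(n_sims, n_tests))
any_rejection = (np.abs(z) > norm.ppf(0.975)).any(axis=1)

print(f"chance of at least one 'significant' result: {any_rejection.mean():.2f}")
print(f"theoretical value 1 - 0.95**{n_tests} = {1 - 0.95**n_tests:.2f}")   # about 0.40
```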

          ****”I could not be comfortable drawing any kind of meaningful conclusions from the parameter estimates”

          That happens all the time. I have a hypothesis that Gelman sits at home all day with a giant red hat and sings “La Marseillaise” while drinking Zimas. My data consists of the entire internet. So far, I can’t tell whether my hypothesis is true despite having an unimaginable amount of data.

          ****”it is instead the lack of meaningful models of the data generating process.”

          Laplace got a mass of reproducible results over 200 years ago using Bayesian significance tests. But the kind of problems he worked on looked something like this. We know what path a planet should take given Newton’s laws, and we know the magnitude of the measurement errors. So is the observed path significantly different from Newton’s path given errors of that magnitude? If so, there might be a new planet or other physical effect perturbing the motion. Then he used his knowledge of physics to go find the new thing. Sometimes the new “effect” was just that prior approximation techniques for calculating the planet’s path were inadequate, and there was no new planet or new physics involved. They just needed better approximation techniques to solve the differential equations.

          Contrast this with, for example, financial quants who look at stock data, which suggests a functional form, and then model the stock prices as that functional form with errors. Their predictions are highly unreproducible.

          Superficially and mathematically, they’re doing identical things: comparing functional forms plus errors to observed data. In reality Laplace was using Bayesian statistics as a relatively small adjunct to real science. The overwhelming trend today is to use statistics as a substitute for doing real science. I don’t recommend doing that at all, but since academics seem hell-bent on wasting every last penny of taxpayer funding doing that kind of “research,” it may be hard to fight the tide on that one. Just know that it’s extraordinarily unlikely you’ll ever make a permanent contribution to science like Laplace did that way. Your research results will have about the same shelf life as those quant models for predicting the stock market.

  4. The question of what is learned by failing to reject is simple, but it has been so thoroughly muddled by frequentist blather that even seemingly smart people can’t see through the haze of confusion. Look at the question from a purely Bayesian viewpoint.

    Suppose you have some parameter of interest lambda. You form a posterior for it, P(lambda|evidence). From that you get a Bayesian credibility interval for lambda. That interval represents basically all values of lambda which are reasonably consistent with the evidence. If the evidence is true, then the true value of lambda must be consistent with it, and that’s the basis for thinking the true lambda is in that interval.

    So consider two cases which would both be considered “failing to reject the null”.

    First, the interval for lambda includes zero but is very wide. Here “wide” means something like “if lambda were that far from zero, but we set it equal to zero in our equations, we’d get an important and large error.” In this case “failing to reject the null” obviously doesn’t mean you can set lambda=0 in your equations.

    Second, the interval for lambda includes zero but is very narrow around zero. Here “narrow” means “lambda may not be zero, but it has to be so close to zero that if we set it equal to zero, we will only get small, irrelevant errors.” In this case “failing to reject the null” obviously means we can set lambda=0 in our equations.
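
    A numeric version of those two cases, with made-up posteriors and a made-up threshold for what counts as a practically relevant value of lambda:

```python
# Both 95% intervals include zero, but only the narrow one justifies
# setting lambda = 0 in later calculations.  The posteriors and the
# "practical relevance" threshold are assumptions for illustration.
from scipy.stats import norm

relevant = 0.5   # values of lambda beyond this would matter in practice (assumed)

for label, sd in [("wide posterior", 2.0), ("narrow posterior", 0.05)]:
    lo, hi = norm.interval(0.95, loc=0.02, scale=sd)
    safe_to_zero = max(abs(lo), abs(hi)) < relevant
    print(f"{label}: 95% interval = ({lo:.2f}, {hi:.2f}); "
          f"includes 0: {lo < 0 < hi}; safe to set lambda = 0: {safe_to_zero}")
```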

  5. I am wondering what people think about the whole machine learning/data mining approach in the context of this discussion. No p-values, no confidence intervals, just predictions. While there are certainly erroneous predictions, there are many ways to protect against overfitting, and the results of these studies, by design, do not lend themselves to simple causal or conclusive statements. I see these as a third approach to analysis: neither frequentist nor Bayesian, but something else. Mis-used, over-hyped, but often providing fairly good predictions.

    Just wondering what people think of that approach which avoids the whole NHST paradigm, as well as issues of establishing priors.
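
    For what it’s worth, here is a minimal sketch of that “just predictions” workflow: judge models only by held-out prediction error. The data-generating process and the polynomial degrees are assumptions for illustration:

```python
# Fit on one part of the data, judge the model only by held-out prediction
# error.  The data-generating process and polynomial degrees are assumptions.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=60)
y = 1.5 * x + rng.normal(0, 0.5, size=60)   # truth is linear plus noise

train, test = np.arange(40), np.arange(40, 60)

for degree in (1, 10):
    coefs = np.polyfit(x[train], y[train], degree)
    mse_train = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    mse_test = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_train:.2f}, held-out MSE = {mse_test:.2f}")
# The degree-10 fit typically looks better in sample but worse out of sample.
```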

    • Sometimes your main aim is prediction quality, sometimes your main aim is evaluating evidence in favour/against certain unobservable hypotheses, which doesn’t necessarily translate into predictions. It’s all fine to have some methods that do one thing well, but this doesn’t mean that you always want to do this one thing only.

      • Given how much trouble people get into when they want to evaluate “evidence in favour/against certain unobservable hypotheses,” I wonder if it might be better to just emphasize prediction. I don’t really believe this, as there are too many interesting things involved with evidence and its evaluation, but given how much time (on this blog, for instance) is spent trying to undo the damage caused by overstating inferences, it does make me wonder whether prediction might be under-rated by researchers.

  6. Seems to me that the likelihood function that corresponds to a P-value is just as informative when the P-value is small as when the P-value is large. The only ‘trick’ needed to get information from a hypothesis test when the null is not rejected is to contemplate parameter values other than the one originally chosen as the null.
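
    A small sketch of that trick: scan the same test statistic across a range of candidate null values instead of only the original null (the estimate and standard error below are made-up numbers):

```python
# Instead of a single p-value against one null, compute the p-value and
# likelihood across many candidate parameter values.  The estimate and
# standard error are made-up numbers for illustration.
import numpy as np
from scipy.stats import norm

estimate, se = 0.8, 0.5        # assumed summary of a study

for theta0 in np.arange(-0.5, 2.51, 0.5):
    z = (estimate - theta0) / se
    p = 2 * norm.sf(abs(z))                          # two-sided p-value against theta0
    lik = norm.pdf(estimate, loc=theta0, scale=se)   # likelihood at theta0
    print(f"null = {theta0:+.1f}: p = {p:.2f}, likelihood = {lik:.2f}")
# The full curve over theta0 is informative whether or not the p-value at
# theta0 = 0 happens to cross 0.05.
```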

  7. I appreciated this post. I do most of my work with very large data sets, and most covariate-type data that I have is of a sort that has extremely well-established relationships to the things I care about. I run a lot of standard multivariate regression models when I need to take a look at a new data set, usually to figure out its data quality because I have found that to be a time-efficient way to apply my intuition to what I have.

    And that’s the reason that this post excited me: When I started reading it, I realized that when I run these investigatory regression models, the things that set off red flags for me are usually fail-to-reject situations, like if I found (this is a fake example–I don’t work with this kind of data) that race and blood pressure were unrelated in my data, I would be really, really concerned.

  8. I am honestly not keeping up with the celebrity game but this was a valuable post. I would add that quite often hypothesis testing is used to move past something YOU DON’T CARE ABOUT. For example, in clinical research there is a tremendous number of data points gathered, many for safety or regulatory purposes. There should be no clinical relationship between something like blood pressure reduction and back skin, just off the top of my head. So the field for ‘skin-back’ gets marked ‘normal’, the p-value is not significant, and the researchers move on.
