I’ve said it before but it’s worth saying again.
The conventional view:
Hypothesis testing is all about rejection. The idea is that if you reject the null hypothesis at the 5% level, you have a win: you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for which he was slammed by Orwell, a lesser poet but a greater political scientist), a small bit of solid knowledge in our otherwise uncertain world.
But (to continue the conventional view) often our tests don’t reject. When a test does not reject, don’t count this as “accepting” the null hypothesis; rather, you just don’t have the power to reject. You need a bigger study, or more precise measurements, or whatever.
My view is (nearly) the opposite of the conventional view. The conventional view is that you can learn from a rejection but not from a non-rejection. I say the opposite: you can’t learn much from a rejection, but a non-rejection tells you something.
A rejection is, like, ok, fine, maybe you’ve found something, maybe not, maybe you’ll have to join Bem, Kanazawa, and the Psychological Science crew in the “yeah, right” corner (and, if you’re lucky, you’ll understand the “power = .06” point and not get so excited about the noise you’ve been staring at). Maybe not, maybe you’ve found something real; but, if so, you’re not learning it from the p-value or from the hypothesis tests.
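To make the “power = .06” point concrete, here is a minimal Python sketch of what a statistically significant result looks like when power is that low. The numbers are assumptions chosen for illustration (a true effect of 2 and a standard error of 8.1, which works out to about 6% power for a normally distributed estimate), and the function name `retrodesign` is just a label used here, not anything from the studies mentioned above.

```python
import random
from statistics import NormalDist


def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Power, Type S (wrong-sign) rate, and exaggeration ratio for a
    normally distributed estimate. Assumes true_effect > 0."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se

    # Probability of a significant result in each direction.
    p_pos = 1 - z.cdf(z_crit - shift)   # significant, correct sign
    p_neg = z.cdf(-z_crit - shift)      # significant, wrong sign
    power = p_pos + p_neg
    type_s = p_neg / power

    # Simulate replications and keep only the "significant" estimates.
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > z_crit * se:
            significant.append(abs(est))

    # How much a significant estimate overstates the true effect, on average.
    exaggeration = (sum(significant) / len(significant)) / true_effect
    return power, type_s, exaggeration


# Hypothetical inputs: small true effect, noisy study.
power, type_s, exaggeration = retrodesign(true_effect=2, se=8.1)
print(f"power={power:.2f}  type_s={type_s:.2f}  exaggeration={exaggeration:.1f}")
```

Under these assumed inputs, a significant estimate has roughly a one-in-four chance of having the wrong sign, and its magnitude overstates the true effect many times over. That is the sense in which the rejection, on its own, teaches you very little.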
A non-rejection, though: this tells you something. It tells you that your study is noisy, that you don’t have enough information in your study to identify what you care about—even if the study is done perfectly, even if measurements are unbiased and your sample is representative of your population, etc. That can be some useful knowledge, it means you’re off the hook trying to explain some pattern that might just be noise.
It doesn’t mean your theory is wrong: maybe subliminal smiley faces really do “punch a hole in democratic theory” by having a big influence on political attitudes; maybe people really do react differently to himmicanes than to hurricanes; maybe people really do prefer the smell of people with similar political ideologies. Indeed, any of these theories could have been true even before the studies were conducted on these topics, and there’s nothing wrong with doing some research to understand a hypothesis better. My point here is that the large standard errors tell us that these theories are not well tested by these studies; the measurements (speaking very generally of an entire study as a measuring instrument) are too crude for their intended purposes. That’s fine, it can motivate future research.
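One way to read a standard error as a statement about the measuring instrument is to compare the confidence interval with the range of effect sizes you would consider plausible. The sketch below does that in Python. The function `interval_report`, the `plausible_effect` argument, and the example numbers are all hypothetical, and the “too noisy” rule is one simple choice among many, not a standard test.

```python
from statistics import NormalDist


def interval_report(estimate, se, plausible_effect, level=0.95):
    """Summarize what an estimate and its standard error can tell us.

    plausible_effect is an assumed upper bound on the size of a
    realistic effect (in either direction), supplied by the analyst.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lo = estimate - z_crit * se
    hi = estimate + z_crit * se

    rejects_null = not (lo <= 0 <= hi)
    # The interval covers every plausible effect of either sign, so the
    # data cannot separate "no effect" from any realistic alternative.
    too_noisy = lo <= -plausible_effect and hi >= plausible_effect

    return {"interval": (lo, hi), "rejects_null": rejects_null, "too_noisy": too_noisy}


# A noisy study: no rejection, and the interval swallows all plausible effects.
print(interval_report(estimate=5, se=8.1, plausible_effect=2))

# A precise study with the same estimate: the interval is informative.
print(interval_report(estimate=5, se=1.0, plausible_effect=2))
```

In the first call the non-rejection is informative in exactly the sense described above: it says the study cannot tell you what you wanted to know, not that the theory is false.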
Anyway, my point is that standard errors, statistical significance, confidence intervals, and hypothesis tests are far from useless. In many settings they can give us a clue that our measurements are too noisy to learn much from. That’s a good thing to know. A key part of science is to learn what we don’t know.
Hey, kids: Embrace variation and accept uncertainty.
P.S. I just remembered an example that demonstrates this point. It’s in chapter 2 of ARM and is briefly summarized on page 70 of this paper.
In that example (looking at possible election fraud), a rejection of the null hypothesis would not imply fraud, not at all. But we do learn from the non-rejection of the null hypothesis; we learn that there’s no evidence for fraud in the particular data pattern being questioned.