One-tailed or two-tailed

This image of a two-tailed lizard (from here, I can’t find the name of the person who took the picture) never fails to amuse me.

But let us get to the question at hand . . .

Richard Rasiej writes:

I’m currently teaching a summer session course in Elementary Statistics. The text that I was given to use is Triola’s Elementary Statistics, 12th ed.

Let me quote a problem on inference from two proportions:

11. Is Echinacea Effective for Colds? Rhinoviruses typically cause common colds. In a test of the effectiveness of echinacea, 40 of the 45 subjects treated with echinacea developed rhinovirus infections. In a placebo group, 88 of the 103 subjects developed rhinovirus infections (based on data from “An Evaluation of Echinacea Angustifolia in Experimental Rhinovirus Infections,” by Turner et al., New England Journal of Medicine, Vol. 353, No. 4). We want to use a 0.05 significance level to test the claim that echinacea has an effect on rhinovirus infection.

The answer in the back of the teacher’s edition sets up the hypothesis test as H0: p1 = p2, H1: p1 ≠ p2, gives a test statistic of z = 0.57, uses critical values of ±1.96, and gives a P-value of .5686.

I was having a hard time explaining the rationale for the book’s approach to my students. My thinking was that since there is no point in claiming that echinacea has an effect on the common cold unless you think it helps, we should be doing a one-tailed test with H0: p1 = p2, H1: p1 < p2. We would still fail to reject the null hypothesis, but with a P-value of .2843. Or is what I am missing that, if you are testing the claim that something has an effect, you also want to test the possibility that the effect is the opposite of what you’d normally want (e.g., this herb is bad for you, or inhaling smoke is good for you, etc.)? Any advice you could give me on how best to parse this problem for my students would be greatly appreciated. I already feel very nervous stating, in effect, “well, that’s not the way I would do it.”

My reply:

The quick answer is that maybe echinacea is bad for you! Really, though, the example is pretty silly, as one can simply compare 40/45 and 88/103 and look at the sampling variability of the proportions. I don’t see that the hypothesis test and p-value add anything.

This doesn’t sound like much, but, amazingly enough, Rasiej replied later that day:

I guess I was led astray by the lead-in to the problem, which seemed to imply that there was a benefit. Obviously it’s better to read the claim carefully and take it literally. So, “test the claim that echinacea has an effect” is two-tailed since ANY effect, beneficial or not, would be significant.

That said, I do agree with you that the example is silly, given the data in the problem.

Thanks again for your insights. They helped in my class today.

Perhaps (maybe I should say “probably”) he was just being polite, but I prefer to think that even a brief reply can convey some useful understanding. Also I think it’s a good general message to take what people say literally. This is not a message that David Brooks likes to hear, I think, but it is, to me, an essential aspect of statistical thinking.

P.S. Perhaps I should stress that in my response above I wasn’t saying that confidence intervals are some kind of wonderful automatic replacement for p-values. I was just saying that, in this particular case, it seems to me that you’d want a summary of the information provided by the experiment, and that this summary is best provided by the estimated proportions and their standard errors. To set it up in a p-value context would seem to imply that you’re planning on making a decision about echinacea based on this single experiment, but that wouldn’t make sense at all! No need to jump the gun and go all the way to a decision statement; it seems enough to just summarize the information in the data.
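
For readers who want to see the arithmetic, here is a minimal sketch in Python; only the counts 40/45 and 88/103 come from the problem, and the rest is standard two-proportion machinery. It reproduces the pooled z-test that the textbook answer appears to use, and then prints the simpler summary described above: each proportion, its standard error, and their difference.

```python
from math import erfc, sqrt

# Counts from the problem: 40 of 45 echinacea subjects and 88 of 103 placebo
# subjects developed rhinovirus infections.
x1, n1 = 40, 45
x2, n2 = 88, 103
p1, p2 = x1 / n1, x2 / n2

def normal_sf(z):
    """Upper-tail probability of a standard normal, 1 - Phi(z)."""
    return 0.5 * erfc(z / sqrt(2))

# Pooled two-proportion z-test, which is what the textbook answer appears to use.
p_pool = (x1 + x2) / (n1 + n2)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_pool            # about 0.57, as in the book
p_two = 2 * normal_sf(abs(z))      # about 0.57 (the book's .5686 rounds z to 0.57 first)
p_one = normal_sf(z)               # one-tailed, in the direction of the observed difference

# The simpler summary: each proportion with its standard error, and the
# difference with an unpooled standard error.
se1 = sqrt(p1 * (1 - p1) / n1)
se2 = sqrt(p2 * (1 - p2) / n2)
se_diff = sqrt(se1 ** 2 + se2 ** 2)

print(f"echinacea: {p1:.3f} (se {se1:.3f})   placebo: {p2:.3f} (se {se2:.3f})")
print(f"difference: {p1 - p2:+.3f} (se {se_diff:.3f})")
print(f"z = {z:.2f}, two-sided p = {p_two:.3f}, one-sided p = {p_one:.3f}")
```

The estimated difference of about 0.03 is well under one standard error, which is really the whole story the z-test is telling.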

16 thoughts on “One-tailed or two-tailed”

  1. The majority of the time, in the journals I publish in, you’re going to need to include a p-value to get the paper accepted. So it might be true that “I don’t see that the hypothesis test and p-value add anything,” but until we change the editors and reviewers, we’re going to need them. Given that we’re going to need them, here are my thoughts on one-tailed tests.

    Martin Bland (an old colleague of mine) said (something like) that the only time he’d used a one-tailed test appropriately was when examining outcomes for doctors – specifically, do more of this doctor’s patients die than we would expect (this was in the aftermath of Harold Shipman, in the UK – http://en.wikipedia.org/wiki/Harold_Shipman). If a doctor’s patients have a higher probability of dying than we would expect, we’ll look into that. If they have a lower probability, we don’t care (for the purpose of this study), regardless of how low that probability is.

    Personally, the only time I’ve seen one-tailed tests done in articles that I’ve reviewed for psychology journals is when the two-tailed p-value is just above 0.05, so the researcher then decides that it was a one-tailed test all along and the result is therefore significant.

    • If the book had been written before 1985, there would be no p-value, just a statement of significance based on the Z-statistic and the critical value. A confidence interval for the difference could just as easily be constructed and checked against zero. When software became available that could easily provide p-values to the fourth digit, it created endless misuse. But it is expected now.

      Many clinical statisticians concede that they are usually working with a one-sided 97.5% test when they are looking for superiority. But the two-sided procedure is retained because there are cases where both sides are meaningful. For example, a pill with an old active ingredient and a new but commonly used inert ingredient that performs worse might call the inert ingredient into question. Also, since two-sided tests on Z-statistics correspond to two-sided confidence intervals, they are usually retained.

      Overall there is a lot of material in this example: the difference of proportions, the normal approximation for proportions, and the estimation of the variance of the difference. I would provide the confidence intervals for the treatment proportion, the placebo proportion, and their difference. I find that this is a hard topic for many clinical researchers. Significance testing is often used to show something in a shared context, that is, to signify, not to make a one-off decision.
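
      As a quick numerical sketch of those three intervals, assuming nothing beyond the counts in the problem and the same ±1.96 multiplier as the book’s critical values (ordinary Wald intervals from the normal approximation):

      ```python
      from math import sqrt

      Z95 = 1.96  # same two-sided 95% multiplier as the book's critical values

      x1, n1 = 40, 45      # echinacea: infections / subjects
      x2, n2 = 88, 103     # placebo:   infections / subjects
      p1, p2 = x1 / n1, x2 / n2

      se1 = sqrt(p1 * (1 - p1) / n1)
      se2 = sqrt(p2 * (1 - p2) / n2)
      se_diff = sqrt(se1 ** 2 + se2 ** 2)   # unpooled standard error of the difference

      for label, est, se in [("echinacea ", p1, se1),
                             ("placebo   ", p2, se2),
                             ("difference", p1 - p2, se_diff)]:
          print(f"{label}: {est:+.3f}   95% CI ({est - Z95 * se:+.3f}, {est + Z95 * se:+.3f})")
      ```

      The interval for the difference runs from roughly -0.08 to +0.15, so it comfortably includes zero; that is the same information the two-sided test conveys, plus a sense of how large an effect the data can and cannot rule out.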

  2. The two-sided test is clearly a bit silly, since it is designed to answer a question we already know the answer to. Of course[*] echinacea has some effect on colds.

    [*] If there is serious dispute on this – really? – it’s not something any amount of data could help us resolve.

    But if the one-sided test is also silly, the argument is substantially more subtle. We cannot say a priori whether echinacea has a positive effect on colds or not; maybe true, maybe false. One can still complain about the question: why would we care if the effect is positive but vanishingly tiny? Can a small effect exist (does this even _mean_ anything?) unless you pin down the population substantially more precisely than I bet was done here? And so forth. But these are different arguments, and they deservedly get substantially more push-back than the claim that the two-sided test is absurd.

    • Bxg:

      Indeed.

      In addition, the effect (to the extent that it exists) certainly varies. The effect will be bigger for some people than for others, it will be bigger under certain conditions than under others, and indeed the effect could quite possibly be positive in some settings and negative in others. This is yet another reason why I don’t think the null hypothesis significance testing (NHST) framework makes sense. If someone wants to say that NHST is often useful, or that it can be a good data summary, or that its use has generally positive effects for science, I can accept that sort of argument (even if I might disagree). What really bothers me is the commonly held idea that NHST is not just a sometimes useful, if flawed, procedure to apply to data, but that NHST should be the foundational principle of statistical inference. This is one reason I push back so hard against the oft-stated claim that confidence intervals are just inversions of hypothesis tests.

    • I don’t see the two-tailed test that way. The normal question for testing the null-hypothesis is whether we see, statistically speaking, a large enough effect not to be swamped by sampling error. Whether that means the effect is large or small or positive or negative for any purpose is a different question entirely. In an alternative (statistical) universe, if your ratio is the other way around (echinacea is bad), you simply don’t care whether that is small samples or a real effect, which is sort of stupid, but that’s what people probably do. In other words, you can do a one-sided test if you agree to use censored statistics (wrong sign gets rejected outright), which, for a symmetric distribution, is the same thing as a two-sided test.
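
      One way to read that last claim, as a small sketch (the alpha level, the grid of z values, and the function names are illustrative choices, not anything from the comment itself): for a symmetric statistic such as z, “one-sided at alpha/2 with wrong-sign results thrown out” flags exactly the same results as “two-sided at alpha, then check the sign.”

      ```python
      from math import erfc, sqrt

      ALPHA = 0.05  # illustrative level

      def p_two_sided(z):
          """Two-sided normal p-value, 2 * (1 - Phi(|z|))."""
          return erfc(abs(z) / sqrt(2))

      def p_one_sided(z):
          """One-sided normal p-value for 'the effect is positive', 1 - Phi(z)."""
          return 0.5 * erfc(z / sqrt(2))

      def claim_positive_two_sided(z):
          # Two-sided test at ALPHA, then report the effect only if it has the right sign.
          return p_two_sided(z) < ALPHA and z > 0

      def claim_positive_censored(z):
          # One-sided test at ALPHA / 2, with wrong-sign results discarded outright.
          return z > 0 and p_one_sided(z) < ALPHA / 2

      # For z > 0 the one-sided p is exactly half the two-sided p, so the two rules
      # agree at every point on a fine grid of z values.
      grid = [k / 100 for k in range(-400, 401)]
      assert all(claim_positive_two_sided(z) == claim_positive_censored(z) for z in grid)
      print("identical decisions at every z from -4.00 to 4.00")
      ```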

      • > The normal question for testing the null-hypothesis is whether we see, statistically speaking, a large enough effect not to be swamped by sampling error

        But the question explicitly asked “Is Echinacea Effective for Colds?”. That just doesn’t map onto your “normal question”.

        Suppose I am interested in a proposed mathematical theorem X. I tell you: “I’m wondering if X is true.” And you come back later saying “I found a paper establishing that someone named ‘Joe’ just isn’t capable of proving X.” I don’t know who Joe is, and I wasn’t particularly asking for an entangled fact about X and some particular person’s mathematical ability. I’ll take the information on board if that’s all you can give me, because it is something, but just not a lot (and if you can’t tell me very much about Joe, because the paper didn’t bother to report it, it’s hardly anything).

        (I’d be even more unhappy if later on you mentioned, “By the way, X is well known to be false – I didn’t mention this because I just assumed that your _real_ interest was whether Joe was smart enough to prove it.”)

        • Sorry, I didn’t get the theorem X story, but as I see it, statistical analysis goes in stages. One of them is to figure out whether we have enough data to discuss a purported effect or we shouldn’t even bother. Maybe that’s too clear-cut. Maybe if the data are insufficient at some p-level, we still want to speculate, but it is instructive to have some metric of whether the data are sufficient. That’s what the null-hypothesis test is doing.

        • I don’t disagree with you in the slightest.

          But look at the original question. “We want to use a 0.05 significance level to _test the claim that_ echinacea has an effect on rhinovirus infection.” It does not say “We will use a test to see if our current data set is sufficient to warrant additional stages of analysis”. These students are seemingly being taught (and looking at published literature, it’s clear that this is commonplace) that hypothesis tests can be an endpoint of analysis. That’s only rarely so.

  3. There was a paper published over 20 years ago by a group at Princeton who were attempting to demonstrate parapsychological effects by having Princeton students (mostly) act as subjects attempting to “influence” a gadget that worked with either shot noise or radioactive decays to produce one of two different random events. They gave feedback as to success or failure by a red or green light that would flash when a sequence of events had taken place.

    They had over 100 million individual events after a decade or so of doing this experiment; they said the events went in the direction the subject was attempting to “will” the generator toward significantly more often than expected. They analyzed the data with a one-sided p-value.

    But the parapsychological literature is replete with examples where the alleged effect was in the opposite direction to whatever was intended. The people who do this claim that when this happens, it is due to “psi-missing” and is equally evidence for a real effect, just in the wrong direction. That is, some people are very good at getting the intended result, and some are very bad.

    Obviously, if there is a possibility that you’re going to claim significance no matter what direction the actual experiment turns up, then you should be using a two-sided test, not a one-sided test.

    (This experiment was deeply flawed in many ways and I don’t believe the result for one minute. But the question posed by Richard Rasiej was whether one-sided or two-sided tests are appropriate. It all depends on what you’re trying to do. As Andrew points out, when comparing two treatments, it sometimes is the case that there is something wrong with the treatment used in that wing, which can make it not work as well as placebo because it actually has a deleterious effect. So that could be a reasonable argument for eschewing a one-sided test).

    • > This experiment was deeply flawed in many ways and I don’t believe the result for one minute.

      The funny part is, no matter what the experiment, I would be very loath to believe any result that came out in favor of ESP or a similar effect.

      i.e., if statistical analysis of seemingly kosher experimental data indicates ESP, I’d rather examine the validity of the statistical method than start believing in ESP.

      • Ed Jaynes, in his well-known book, points out that if you put a very small prior probability on a hypothesis such as “psi is real,” and you then find the evidence overcoming that prior, you may upon reflection realize that there are other hypotheses with much greater prior probability that would also explain the data. Rather than supporting the hypothesis that you really doubt, the evidence will “raise the dead hypothesis” of one of these alternatives.

        For example, if I shuffle a deck of cards thoroughly and hand it to a performer who claims to have psychic powers, and that performer correctly names each card before turning it over, I am much more likely to believe that he has accomplished this feat by using the conjuring skills of a magician than to believe that he actually has psychic powers. There are hundreds of ways that magicians can accomplish feats like this (and good magicians will actually use different methods to create the same effect).

        It turns out that there are a number of statistical flaws with the experiment I cited. It’s also the case that they continued the experiment until they had over 800 million individual events, and the absolute size of the effect decreased[*] even though the p-value became somewhat more “significant” (they included the first 100 million events in the analysis of the 800 million — that is, they obtained 700 million more and analyzed the whole batch — a no-no in my book). I’ve also heard skeptics voice concerns that the experiment itself may not be all that it seems.

        [*] This decline in size of effect is commonly seen in these parapsychological experiments, to the point where it is known as the “decline effect” and by some is counted as evidence in favor of psi being real. Go figure. The more plausible explanation is that controls were tightened, the experiment was improved, or some such, which indicates that the earlier experiment was flawed even though this possibility is not often even mentioned.
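
        On the “no-no” a few sentences up: here is a small simulation, not of that study, just of a fair coin under a true null, showing why testing a first batch and then testing again with that batch folded into a larger extended sample pushes the false-positive rate above the nominal 5%. The batch sizes (100, then 700 more, loosely echoing the 100 million and then 800 million totals) and the number of simulated studies are arbitrary choices.

        ```python
        import random
        from math import sqrt

        random.seed(1)

        N1, N_EXTRA = 100, 700   # first batch, then a further batch (arbitrary sizes)
        TRIALS = 10_000          # simulated studies, all under a true null (a fair coin)
        Z_CRIT = 1.96            # nominal two-sided 5% threshold

        def z_stat(heads, n):
            """Normal-approximation z statistic for testing P(heads) = 0.5."""
            return (heads - 0.5 * n) / sqrt(0.25 * n)

        false_positives = 0
        for _ in range(TRIALS):
            h1 = sum(random.random() < 0.5 for _ in range(N1))        # first batch
            h2 = sum(random.random() < 0.5 for _ in range(N_EXTRA))   # the extension
            look1 = abs(z_stat(h1, N1)) > Z_CRIT                      # test the first batch
            look2 = abs(z_stat(h1 + h2, N1 + N_EXTRA)) > Z_CRIT       # retest with the first batch folded in
            false_positives += look1 or look2                         # "significant" at either look counts
        print(f"false-positive rate with two looks: {false_positives / TRIALS:.3f} (nominal 0.050)")
        ```

        With these settings the rate comes out well above 0.05, which is the basic problem with deciding to extend and reanalyze after seeing the data.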

      • Rahul:

        I think that statement of yours is in many ways data-based; it’s a fair reaction to nearly a century of failed ESP studies. That is, your statement represents a “prior distribution” for any current or future ESP experiment, but it is a “posterior distribution” with respect to the many failed studies of the past.

  4. “To set it up in a p-value context would seem to imply that you’re planning on making a decision”.

    … and if you are using p-values to make decisions (i.e., choosing between courses of action under uncertainty), then you have bigger problems than the issue of 1-tailed vs. 2-tailed tests, I would think.

    • Ain’t that the truth.

      If you want to make decisions, use decision theory.

      (But as Andrew noted just below, it is silly to use loss functions that are highly nonlinear when the number of dollars changes only a little bit.)

  5. By the way, echinacea has repeatedly worked for me in staving off colds over the last 17 years since I started using it while undergoing chemotherapy. For example, I just spent three weeks with my son in his apartment. He came down with a bad cold. I subsequently came down with the kind of sore throats that usually lead to a cold the next day, but I staved off the cold with about 20 cups of echinacea tea over a couple of weeks.

    On the other hand, echinacea has never been effective for either of my sons. For all I know it may make their immune system less effective at fighting incipient colds. This might sound implausible to people expecting a straightforward dose-response relationship that is common across all humans, but immune systems are immensely complicated and idiosyncratic. It’s possible that echinacea is, on average, bad for people but also that it’s quite good for a small percentage of individuals such as myself. You can find out by experimenting on yourself.

    Now, you’re probably thinking I’m just some crank full of weird diet and health advice, but in reality I only publicize a few opinions: that echinacea might work for some fraction of the population, that for a lot of people a low carb diet is easier to stay on than a high carb diet, and that rituximab is good for fighting some versions of non-Hodgkin’s lymphoma.

    My meta-view on diet and health is that there is a lot of variation among individuals and that treatments that work or don’t work on average might have the opposite effect on certain individuals, and that our ways of thinking about diet and health statistics need to become more flexible to take that into account.
