Raghu Parthasarathy presents a wonderfully clear example of disastrous p-value-based reasoning that he saw in a conference presentation. Here’s Raghu:
Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph:
. . . from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different than the control cells (p > 0.05, 2 sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does. (And that, of course, is what the speaker claimed.)
I [Raghu] will pause while you ponder why this is nonsense.
Another interpretation of this graph is that the “treatments 1 and 2” data are exactly what we’d expect for drugs that don’t interact at all. Treatment 1 and Treatment 2 alone each increase growth by some factor relative to the control, and there’s noise in the measurements. The two drugs together give a larger, simply multiplicative effect, and the signal relative to the noise is higher (and the p-value is lower) simply because the product of 1’s and 2’s effects is larger than each of their effects alone.
And now the background:
I [Raghu] made up the graph above, but it looks just like the “important” graphs in the talk. How did I make it up? The control dataset is random numbers drawn from a normal distribution with mean 1.0 and standard deviation 0.75, with N=10 measurements. Drug 1 and drug 2’s “data” are also from normal distributions with the same N and the same standard deviation, but with a mean of 2.0. (In other words, each drug enhances the growth by a factor of 2.0.) The combined treatment is drawn from a distribution of mean 4.0 (= 2 x 2), again with the same number of measurements and the same noise. In other words, the simplest model of a simple effect. One can simulate this ad nauseam to get a sense of how the measurements might be expected to look.
Did I pick a particular outcome of this simulation to make a dramatic graph? Of course, but it’s not un-representative. In fact, of the cases in which Treatment 1 and Treatment 2 each have p>0.05, over 70% have p<0.05 for Treatment 1 x Treatment 2 ! Put differently, conditional on looking for each drug having an “insignificant” effect alone, there’s a 70% chance of the two together having a “significant” effect not because they’re acting together, but just because multiplying two numbers greater than one gives a larger number, and a larger number is more easily distinguished from 1!
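For readers who want to try this themselves, here is a minimal sketch of the simulation Raghu describes, using only the Python standard library. Some choices are my assumptions, not his: a pooled-variance two-sample t-test, a single control sample shared by all three comparisons, and a hard-coded critical value (2.101, the two-sided 5% cutoff for 18 degrees of freedom) in place of computed p-values. The function names are made up for illustration.

```python
import math
import random
import statistics

N = 10          # measurements per condition, as in the post
SD = 0.75       # common noise standard deviation, as in the post
T_CRIT = 2.101  # two-sided 5% critical value of Student's t, 18 df (assumed test)


def t_stat(treated, control):
    """Pooled-variance two-sample t statistic (treated minus control)."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * statistics.variance(treated)
                  + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(treated) - statistics.mean(control)) / se


def draw(rng, mean):
    """One simulated dataset: N normal draws with the given mean."""
    return [rng.gauss(mean, SD) for _ in range(N)]


def simulate(n_sims=20000, seed=0):
    """Return (number of runs where both single drugs are 'insignificant',
    fraction of those runs where the combination is 'significant')."""
    rng = random.Random(seed)
    both_insig = 0
    combo_sig = 0
    for _ in range(n_sims):
        control = draw(rng, 1.0)
        drug1 = draw(rng, 2.0)
        drug2 = draw(rng, 2.0)
        combo = draw(rng, 4.0)  # 2 x 2: no interaction, just multiplication
        if (abs(t_stat(drug1, control)) < T_CRIT
                and abs(t_stat(drug2, control)) < T_CRIT):
            both_insig += 1
            if abs(t_stat(combo, control)) > T_CRIT:
                combo_sig += 1
    frac = combo_sig / both_insig if both_insig else float("nan")
    return both_insig, frac


if __name__ == "__main__":
    n_cases, frac = simulate()
    print(f"both drugs 'insignificant' in {n_cases} runs; "
          f"combination 'significant' in {frac:.0%} of those")
```

The exact fraction will depend on the test variant and on the random seed, so treat any one run as a rough check on the "over 70%" figure, not a replication of it.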
As we’ve discussed many times, the problem here is partly with p-values themselves and partly with the null hypothesis significance testing framework:
1. The problem with p-values: the p-value is a strongly nonlinear transformation of data that is interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null. My criticism here is not merely semantic or a clever tongue-twister or a “howler” (as Deborah Mayo would say); it’s real. In settings where the null hypothesis is not a live option, the p-value does not map to anything relevant.
To put it another way: Relative to the null hypothesis, the difference between a p-value of .13 (corresponding to a z-score of 1.5), and a p-value of .003 (corresponding to a z-score of 3), is huge; it’s the difference between a data pattern that could easily have arisen by chance alone, and a data pattern that is highly unlikely to have arisen by chance. But, once you allow nonzero effects (as is appropriate in the sorts of studies that people are interested in doing in the first place), the difference between z-scores of 1.5 and 3 is no big deal at all; it’s easily attributable to random variation. I don’t mind z-scores so much, but the p-value transformation does bad things to them.
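Here is the arithmetic in a few lines of Python, as a rough sketch. The two-sided normal p-value is standard; the comparison of the two z-scores assumes (my assumption, for illustration) that they come from independent estimates, each with standard error 1, so that their difference has standard error sqrt(2).

```python
import math
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def z_of_difference(z1, z2):
    """z-score for the difference of two estimates, assuming they are
    independent and each has standard error 1 (an illustrative assumption)."""
    return (z1 - z2) / math.sqrt(2)


if __name__ == "__main__":
    print(f"z = 1.5 -> p = {two_sided_p(1.5):.2f}")
    print(f"z = 3.0 -> p = {two_sided_p(3.0):.3f}")
    z_diff = z_of_difference(3.0, 1.5)
    print(f"difference: z = {z_diff:.2f}, p = {two_sided_p(z_diff):.2f}")
```

Under those assumptions the difference between the two results has a z-score of about 1, which nobody would call strong evidence of anything.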
2. The problem with null hypothesis significance testing: As Raghu discusses near the end of his post, this sort of binary thinking makes everything worse in that people inappropriately combine probabilistic statements with Boolean rules. And switching from p-values to confidence intervals doesn’t do much good here, for two reasons: (a) if all you do is check whether the confidence interval excludes 0, you haven’t gone forward at all, and (b) even if you do use them as uncertainty statements, classical intervals have all the biases that arise from not including prior information: classical confidence intervals overestimate magnitudes of effect sizes.
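One simple way to see point (b), as a sketch under made-up assumptions: suppose true effects across studies are themselves drawn from a normal distribution centered at zero with standard deviation `tau`, and each study's classical estimate adds independent noise with standard error `se`. Both values below are hypothetical, chosen only for illustration. The raw estimates are then more spread out than the true effects, so on average they are too large in magnitude.

```python
import random
import statistics


def mean_magnitudes(n=20000, tau=1.0, se=1.0, seed=0):
    """Return (mean |true effect|, mean |classical estimate|) over n
    simulated studies. tau and se are hypothetical illustrative values."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, tau) for _ in range(n)]
    # Classical estimate = true effect + sampling noise, with no prior applied.
    estimates = [theta + rng.gauss(0.0, se) for theta in true_effects]
    mean_abs_true = statistics.mean([abs(t) for t in true_effects])
    mean_abs_est = statistics.mean([abs(e) for e in estimates])
    return mean_abs_true, mean_abs_est


if __name__ == "__main__":
    mean_abs_true, mean_abs_est = mean_magnitudes()
    print(f"mean |true effect| = {mean_abs_true:.2f}")
    print(f"mean |estimate|    = {mean_abs_est:.2f}")
```

With `tau` equal to `se`, the estimates should come out roughly 40% too large in magnitude on average (a factor of sqrt(2)); the noisier the study relative to the spread of true effects, the worse it gets.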
Anyway, we know all this, but recognizing the ubiquity of fatally flawed significance-testing reasoning puts a bit more pressure on us to come up with and promote better alternatives that are just as easy to use. I do think this is possible; indeed I’m working on it when not spending my time blogging. . . .