Alain Content writes:
I am a psycholinguist who teaches statistics (and also sometimes publishes in Psych Sci).
I am writing because, as I prepare some future lessons, I keep coming back to a very basic question that has been worrying me for some time, related to the reasoning underlying NHST [null hypothesis significance testing].
Put simply, what is the rational justification for considering the probability of the observed test statistic together with all more extreme values of it?
I know of course that the point-value probability cannot be used, but I can’t figure out the reasoning behind including all more extreme values. I mean, wouldn’t it be as valid (or invalid) to consider, for instance, the probability of some (conventionally) fixed interval around the observed value? (My null hypothesis is that there is no difference between Belgians and Americans in chocolate consumption. Say I find a mean difference of 3 kg. I decide whether to reject H0 based on the probability of the interval [2.9, 3.1].)
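To make this concrete, here is a small sketch in Python, with made-up numbers (suppose the standard error of the difference under H0 is 1.5 kg), of the two probabilities I have in mind:

```python
from scipy import stats

# Made-up numbers for the chocolate example: under H0 the mean difference
# is 0 kg, its standard error is 1.5 kg, and we observe a difference of 3 kg.
se = 1.5
observed = 3.0

# Conventional two-sided tail-area p-value: P(|difference| >= 3 | H0)
p_tail = 2 * stats.norm.sf(abs(observed) / se)

# The alternative I have in mind: P(2.9 <= difference <= 3.1 | H0)
p_interval = stats.norm.cdf(3.1, loc=0, scale=se) - stats.norm.cdf(2.9, loc=0, scale=se)

print(f"tail-area probability:      {p_tail:.4f}")      # ~0.046
print(f"fixed-interval probability: {p_interval:.4f}")  # depends on the chosen width
```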
My reply: There are 2 things going on:
1. The logic of NHST. To get this out of the way, I don’t like it. As we’ve discussed from time to time, NHST is all about rejecting straw-man hypothesis B and then using this to claim support for the researcher’s desired hypothesis A. The trouble is that both models are false, and typically the desired hypothesis A is not even clearly specified.
In your example, the true answer is easy: different people consume different amounts of chocolate, and the averages for the two countries will differ. The averages also differ from year to year, so a more relevant question might be how large the differences between countries are, compared to the variation over time, the variation across states within a country, the variation across age groups, etc.
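If you wanted to make that comparison, a minimal sketch (simulated data with made-up effect sizes, nothing like real consumption figures) might look like this: simulate country, region, and year variation, then put the between-country difference next to the other sources of variation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Purely illustrative fake data: per-capita chocolate consumption (kg/year)
# for two countries, five regions per country, over ten years.
rows = []
for country, base in [("Belgium", 6.0), ("USA", 5.0)]:
    region_effects = rng.normal(0, 0.8, size=5)   # variation across regions
    year_effects = rng.normal(0, 0.5, size=10)    # variation over time
    for r, r_eff in enumerate(region_effects):
        for y, y_eff in enumerate(year_effects):
            rows.append({"country": country, "region": f"{country}-{r}",
                         "year": 2010 + y,
                         "kg": base + r_eff + y_eff + rng.normal(0, 0.3)})
df = pd.DataFrame(rows)

# Put the between-country difference next to the other sources of variation.
country_means = df.groupby("country")["kg"].mean()
region_means = df.groupby(["country", "region"])["kg"].mean()
year_means = df.groupby("year")["kg"].mean()

print("country difference:          ", round(country_means.max() - country_means.min(), 2))
print("sd of region means (within): ", round(region_means.groupby(level="country").std().mean(), 2))
print("sd of yearly means:          ", round(year_means.std(), 2))
```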
2. The use of tail-area probabilities as a measure of model fit. This has been controversial, and I don’t have much to say about it. On the one hand, if a p-value is extreme, it does seem like we learn something about model fit: if you’re seeing p=.00001, that does seem notable. On the other hand, maybe there are other ways to see this sort of lack of fit. In my 1996 paper with Meng and Stern on posterior predictive checks, we computed some p-values, but these days I’m much more likely to perform a graphical model check.
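To show what I mean by the two versions, here’s a minimal sketch (a toy normal model fit to heavier-tailed fake data, not the setup from that paper) of a posterior predictive check done both as a tail-area probability and as a graph:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Toy example: fit a normal model to data that are actually heavier-tailed,
# and check a tail-sensitive test statistic.
y = rng.standard_t(df=3, size=100)
n = len(y)

# Posterior draws for (mu, sigma) under the normal model with the standard
# noninformative prior: sigma^2 | y ~ Inv-chi2(n-1, s^2); mu | sigma, y ~ N(ybar, sigma^2/n).
n_sims = 2000
sigma = np.sqrt(np.sum((y - y.mean()) ** 2) / rng.chisquare(n - 1, size=n_sims))
mu = rng.normal(y.mean(), sigma / np.sqrt(n))

# Replicated datasets and a test statistic (the largest absolute observation).
T_obs = np.max(np.abs(y))
T_rep = np.array([np.max(np.abs(rng.normal(m, s, size=n))) for m, s in zip(mu, sigma)])

# The tail-area version: a posterior predictive p-value.
print("posterior predictive p-value:", np.mean(T_rep >= T_obs))

# The graphical version: where does the observed statistic fall among the replications?
plt.hist(T_rep, bins=40)
plt.axvline(T_obs, color="red")
plt.xlabel("T(y_rep) = max |y_rep|")
plt.title("Posterior predictive check")
plt.show()
```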
In any case, you really can’t use p-values to compare model fits or to compare datasets. This sort of example illustrates the failure of the common approach of using the p-value as a data summary.
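Here’s the usual illustration of that point, with made-up numbers: two studies that look very different when summarized by their p-values, even though the difference between them is nowhere near statistically significant.

```python
import numpy as np
from scipy import stats

# Made-up numbers: two studies with the same standard error but different estimates.
est_a, se_a = 25.0, 10.0   # study A: z = 2.5
est_b, se_b = 10.0, 10.0   # study B: z = 1.0

p_a = 2 * stats.norm.sf(est_a / se_a)   # ~0.01, "significant"
p_b = 2 * stats.norm.sf(est_b / se_b)   # ~0.32, "not significant"

# But the difference between the two estimates is not close to significant:
diff = est_a - est_b
se_diff = np.hypot(se_a, se_b)
p_diff = 2 * stats.norm.sf(diff / se_diff)   # ~0.29

print(f"study A: p = {p_a:.3f}")
print(f"study B: p = {p_b:.3f}")
print(f"difference between studies: p = {p_diff:.3f}")
```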
My main message is to use model checks (tail-area probabilities, graphical diagnostics, whatever) to probe flaws in the model you want to fit, not as a way to reject null hypotheses.