Solution to the problem on the distribution of p-values

See this recent post for background.

Here’s the question:

It is sometimes said that the p-value is uniformly distributed if the null hypothesis is true. Give two different reasons why this statement is not in general true. The problem is with real examples, not just toy examples, so your reasons should not involve degenerate situations such as zero sample size or infinite data values.

And here’s the answer:  Reason 1 is that if you have a discrete sample space, the p-value will have a discrete distribution.  Reason 2 is that if you have a composite null hypothesis, the p-value will, except in some special cases, depend on the value of the nuisance parameter.
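To make both reasons concrete, here is a minimal simulation sketch (not part of the exam question or the post; the tests, sample sizes, and seed are arbitrary choices). The first part uses an exact binomial test on discrete data, so the p-value can take only a handful of values and P(p <= 0.05) falls below 0.05; the second uses a pooled two-sample t-test under a true but composite null of equal means, where the p-value distribution depends on the nuisance variance parameters.

```python
# Minimal sketch illustrating both reasons; all numbers are arbitrary choices.
import numpy as np
from scipy.stats import binomtest, ttest_ind

rng = np.random.default_rng(0)
n_sims = 10_000

# Reason 1: discrete sample space -> discrete, non-uniform p-values.
# Exact two-sided binomial test of p = 0.5 on n = 20 coin flips; the null is true.
n = 20
p1 = np.array([binomtest(rng.binomial(n, 0.5), n, 0.5).pvalue for _ in range(n_sims)])
print("Reason 1: distinct p-values =", len(np.unique(p1)),
      " P(p <= 0.05) =", np.mean(p1 <= 0.05))   # few distinct values, and below 0.05

# Reason 2: composite null with nuisance parameters.
# Pooled (equal-variance) two-sample t-test; the null of equal means is true, but the
# p-value distribution depends on the unequal variances and unequal sample sizes.
p2 = np.array([
    ttest_ind(rng.normal(0, 5, size=10), rng.normal(0, 1, size=50), equal_var=True).pvalue
    for _ in range(n_sims)
])
print("Reason 2: P(p <= 0.05) =", np.mean(p2 <= 0.05))  # far from 0.05 despite a true null
```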

All four of the students gave Reason 1, but none of them gave Reason 2. And none of them gave any other good reason.

We’ll do the final question tomorrow.

20 thoughts on “Solution to the problem on the distribution of p-values”

    • It’s not clear what you mean by that. Take the usual Bernoulli distribution with p = 1/2 as the null hypothesis. After N trials you’ll have a distribution over 0, 1, …, N successes, and the 1-sided (for simplicity) p-values will be 1/2^N times the cumulative sums of binomial coefficients; the probability associated with each p-value will be 1/2^N times the binomial coefficient itself. This is a “uniform” distribution only for an unusual definition of uniformity. On the other hand, the “p-density” = (probability of a p-value) / (width of the interval around it where it is the closest possible p-value) is approximately constant for large N. If you rescale the [0, N] range of successes to frequencies on [0, 1], then in the large-N limit you’ll get a uniform distribution in the continuous sense.
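For concreteness, here is a short sketch of the calculation described in the comment above (N = 10 is an arbitrary choice), taking the one-sided p-value to be P(X <= k) for X ~ Binomial(N, 1/2):

```python
# Sketch of the calculation above: one-sided p-values P(X <= k) for X ~ Binomial(N, 1/2)
# and the probability mass the null puts on each of them. N = 10 is an arbitrary choice.
import numpy as np
from scipy.stats import binom

N = 10
k = np.arange(N + 1)
pvals = binom.cdf(k, N, 0.5)    # p-value for k observed successes (cumulative sums / 2^N)
masses = binom.pmf(k, N, 0.5)   # chance of seeing that p-value under the null (C(N,k) / 2^N)

for ki, p, m in zip(k, pvals, masses):
    print(f"k = {ki:2d}   p-value = {p:.4f}   P(this p-value) = {m:.4f}")
```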

  1. “discrete” sounds too narrow to me, though maybe it depends on how you interpret this. Suppose the distribution of the test statistic covered a continuous range but also had a delta-spike at some value in that range. Is that a discrete distribution? Either way, the p-value wouldn’t cover a continuous range.

  2. Reason 1 is that if you have a discrete sample space, the p-value will have a discrete distribution.

    Really? I have n observations that take values in {0,1} and want to test the hypothesis of equal probability of a 0 or a 1 using a Pearson chi-square test. Under the null hypothesis, won’t the p-value of the chi-square be uniform?

    Also, I’m surprised that no one argued that while the p-values may be uniform, the reported p-values are very much non-uniform! (at least in the academic literature, and that’s where nearly all p-values are reported).

    • The p-value will only take on at most n+1 values in your example (based on the number of 1s). A uniform random variable is continuous and can take on an uncountable number of values, so it is not possible for the p-value to be uniform.
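A quick sketch of this point (n = 30 is an arbitrary choice): with n binary observations, the Pearson chi-square statistic depends on the data only through the number of 1s, so enumerating every possible dataset shows the p-value can take at most n + 1 distinct values.

```python
# Sketch: the Pearson chi-square p-value on n binary observations is a function of the
# number of 1s alone, so at most n + 1 distinct p-values are possible. n is arbitrary.
import numpy as np
from scipy.stats import chisquare

n = 30
pvals = [chisquare([k, n - k]).pvalue for k in range(n + 1)]  # every possible outcome
print("possible p-values:", sorted(set(np.round(pvals, 10))))
```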

      • For the above two comments:

        The sample space of _any_ experiment is discrete. Why? Because our measurement technologies and our data storage capabilities only allow a measurement to a certain number of decimal places, so there is always a finite sample space (assuming boundedness of the presumed underlying probability distribution). Hence this question only makes sense in a theoretical setting, that is, when the assumed underlying probability distribution generating the data is continuous. But the interesting thing about a sample of {0,1} with equal probability is that the chi-square is not multinomial with parameters (n+1, 1/(n+1), …, 1/(n+1)), which I should have realized (that would have been the continuous analogue; the actual probabilities can be calculated from the underlying binomial generated by the independent Bernoulli trials). I wonder about the relationship between the nature of the probability masses of a discrete distribution and the distribution of the test statistic.
        Note Freshman’s comment below: I can always add a vanishingly small amount to my chi-square and make it “continuous” (or you could add it to the sample space), yet the p-values won’t be uniform.
        I’ll have to look up the various theorems unless someone has a simple summary.

        For the record, the second part of the question is the most interesting and not generally realized (I didn’t realize it).

    • The exact null distribution of the chi-square is discrete.

      But I do agree with you that Reason 1 is not that solid. Whether the p-value has a discrete distribution depends on what test statistic is used. I can generate a number uniformly between 0 and 1, independently of the data, and reject the null if that number is smaller than 0.05.
      It is a valid test procedure (according to the definition of a significance test that I learnt). However, the question emphasized that “The problem is with real examples”, so I think my example would have earned a cross in the exam.
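For what it’s worth, a sketch of that data-free “test” (the seed and dataset are arbitrary): the p-value is drawn uniformly on (0, 1) independently of the data, so it is exactly uniform under the null; a formally valid level-0.05 procedure, just not a real example.

```python
# Sketch of the data-free "test" described above: draw a Uniform(0,1) number,
# ignore the data entirely, and reject when it falls below alpha.
import numpy as np

rng = np.random.default_rng(1)

def silly_test(data, alpha=0.05):
    """Return a p-value that ignores the data, plus the reject decision."""
    p = rng.uniform()
    return p, p < alpha

data = rng.binomial(1, 0.5, size=100)        # any dataset at all
p, reject = silly_test(data)
print(f"p-value = {p:.3f}, reject = {reject}")
```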

    • X,

      We don’t require that the students do perfectly on the exam! They’re just supposed to do something reasonable. In any case, as I noted in a response to a comment on one of the other posts, I don’t actually like that we have this exam. It seems silly to me that, after several semester-long courses, students are judged based on an exam of a few hours’ length.

      • CMU abandoned qualifying exams for that reason. The only one left is a Data Analysis Exam, where you’re given a dataset and some interesting questions and have eight hours to write a report summarizing your findings. I’m currently grading first-years practicing for the exam, and it turns out it’s very difficult to turn theoretical competence into good answers to practical questions. I see a lot of very fancy models which don’t actually answer the questions. So I think this is a very useful experience for them.

  3. That’s funny… what I was expecting was something like this: “The null hypothesis is a restriction on the model. The restriction could be true even though the model itself is misspecified. If so, the distribution of p-values won’t be uniform.” Andrew, if you get a chance, could you say whether you would have accepted this as an answer?

    • I was having thoughts in this direction too: for example, if the null hypothesis is mean = 0 and the p-value is calculated under an assumption that the actual distribution of the data fails to satisfy (e.g., a normality assumption when the data are actually logistic). But the phrase “the null hypothesis is true” is ambiguous and could be interpreted to cover the model too, not just the point in parameter space…
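In the spirit of this subthread, here is a sketch of a related instance (the choice of test and numbers is for illustration only, not the exact example above): the null restriction Var(X) = 1 is true, but the normal-theory chi-square test for a variance is calibrated under normality while the data are heavier-tailed, so the p-values are far from uniform.

```python
# Sketch: a true null value (Var = 1) under a misspecified data model. The chi-square
# test for a variance assumes normality; here the data are scaled t_5, so the p-values
# under the true null are far from Uniform(0,1). Sample size, seed, etc. are arbitrary.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, n_sims, df = 50, 10_000, 5
scale = np.sqrt((df - 2) / df)               # makes Var(scale * t_df) exactly 1

pvals = np.empty(n_sims)
for i in range(n_sims):
    x = scale * rng.standard_t(df, size=n)
    stat = (n - 1) * x.var(ddof=1) / 1.0     # (n - 1) s^2 / sigma_0^2 with sigma_0^2 = 1
    pvals[i] = 2 * min(chi2.cdf(stat, n - 1), chi2.sf(stat, n - 1))  # equal-tailed two-sided

print("P(p <= 0.05):", np.mean(pvals <= 0.05))   # well above 0.05 despite a true null value
```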

    • Mark, Corey and Xi’an:

      There is a context within a program of study that narrows the interpretation of the questions, but to me

      The null hypothesis is a given parameter value (or set of parameter values) and does not refer to the form of the data model.

      So that would make a third one: model misspecification (e.g., unknown confounding in epidemiological studies), and arguably the most important, as it could give the distribution most discrepant from Uniform(0,1).

      The first is always true, in that samples are measured to finite accuracy and the distribution of p-values will be a step function rather than a continuous distribution, but often the steps are very tiny, so the continuous distribution is an accurate enough and extremely convenient approximation.

      The second is almost always true in actual applications, as the null will be composite and, except in some special cases, the p-value will depend on the value of the nuisance parameter.

      So how does it come to be _speculated_ as usually Uniform(0,1) by so many who have studied statistics?

      Xi’an: I would not take the students not getting this as a reflection on them, or even on the teaching in the program, but rather as a reflection of the difficulty of going from theory to practice. In theory everything can be exact and even (in the abstract) continuous, but in practice it is always messy, approximate, and discrete. Somehow the clarity/exactness of theory gets projected onto the applications in ways that can be very nonsensical. As Alex put it, “a lot of very fancy models which don’t actually answer the questions” but are suggested as answers.

      A neat example was an invited session at the JSM about 10 years ago on exact methods, where closeness of the distribution of p-values under the null to Uniform(0,1) was the central measure of success, with Persi Diaconis as one of the speakers (probably on discrepancy measures from Uniform(0,1)). One of the other speakers presented a case study based on an epidemiology study.

      I complained that it was a silly example, as with residual confounding the distribution of p-values would never be anywhere near close to Uniform(0,1). I was informed by a number of participants that I was wrong, and to resolve the difference of opinion Persi was asked to explain to everyone exactly why I was wrong. Persi commented that I had pointed out that the model was misspecified and, given that, no one would be able to determine what the distribution of p-values under the null would be. My inference was that no one other than Persi had a good enough grasp of this issue to understand what I was pointing to.

  4. Another reason for non-uniformity is real-world selection bias. If the p-value is close enough to zero, the null hypothesis is rejected; and in fact many tests are done with a reasonable expectation of rejecting the null hypothesis, so GIVEN that the null hypothesis is correct, the p-values would be skewed upwards (say, towards [0.05, 1]).
    It may be a somewhat smart-assed answer, but the question did ask for a real-world example.
