Sam Behseta sends along this paper by Laura Lazzeroni, Ying Lu, and Ilana Belitskaya-Lévy, who write:

P values from identical experiments can differ greatly in a way that is surprising to many. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later.

I agree that the randomness of the p-value—the fact that it is a function of data and thus has a sampling distribution—is an important point that is not well understood. Indeed, I think that the z-transformation (the normal cdf, which takes a z-score and transforms it into a p-value) is in many ways a horrible thing, in that it takes small noisy differences in z-scores and elevates them into the apparently huge differences between p=.1, p=.01, p=.001. This is the point of the paper with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The p-value, like any data summary, is a random variable with a sampling distribution.
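
To make the sampling distribution of the p-value concrete, here is a minimal simulation sketch. It assumes each replication of an experiment yields a z-score equal to a fixed expected value plus standard normal noise; the expected z of 2.5, the number of replications, and the function names are illustrative choices, not anything taken from the Lazzeroni et al. paper.

```python
import random
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def replicate_p_values(true_z=2.5, n_reps=10_000, seed=1):
    """Simulate identical experiments and convert each z-score to a p-value.

    true_z is the expected z-score (effect divided by standard error);
    each replication observes true_z plus standard normal noise.
    Both values are assumptions made for illustration.
    """
    rng = random.Random(seed)
    return sorted(two_sided_p(rng.gauss(true_z, 1.0)) for _ in range(n_reps))


p_values = replicate_p_values()
n = len(p_values)
for q in (0.1, 0.5, 0.9):
    print(f"{int(q * 100)}th percentile of p: {p_values[int(q * n)]:.5f}")
print("share of replications with p < 0.05:",
      sum(p < 0.05 for p in p_values) / n)
```

Under these assumptions the z-scores of the replications differ only by unit-variance noise, yet the resulting p-values spread over several orders of magnitude.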

Incidentally, I have the same feeling about cross-validation-based estimates and even posterior distributions: all of these are functions of the data and thus have sampling distributions, but theoreticians and practitioners alike tend to forget this and instead treat them as truths.

One problem with this particular article is that it takes p-values at face value, whereas in real life p-values typically are the product of selection, as discussed by Uri Simonsohn et al. a few years ago in their “p-hacking” article and as discussed by Eric Loken and myself a couple years ago in our “garden of forking paths” article. I think real-world p-values are much more optimistic than the nominal p-values discussed by Lazzeroni et al. But in any case I think they’re raising an important point that’s been under-emphasized in textbooks and in the statistics literature.

There was an article in The American Statistician back in 2008 by Duncan Murdoch et al. that discussed this concept:

P-Values are Random Variables

TAS, Volume 62, Issue 3, 2008

http://www.tandfonline.com/doi/abs/10.1198/000313008X332421

Another factor is that p-values are highly nonlinear in relation to the actual deviation from the mean. This causes extreme variations in p when there are (otherwise) mild sampling variations. So the p-values are essentially an unstable measure of something-or-other.

The z-values are more stable and, so, at least for me, preferable. If |z| > 3 or 4, we probably have pretty strong evidence (or forking paths), for |z| ~ 2 the data are suggestive (especially without forking paths), and |z| ~ 1 or less, well, who knows, maybe there’s an effect or maybe not. It seems ridiculous to claim that e.g., z = 1.93 means nothing whereas z = 1.96 does.

That’s of course if we have to boil it down to just a single simple measure.

>The z-values are more stable and, so, at least for me, preferable. If |z| > 3 or 4, we probably have pretty strong evidence (or forking paths), for |z| ~ 2 the data are suggestive (especially without forking paths), and |z| ~ 1 or less, well, who knows, maybe there’s an effect or maybe not.

But what’s the thinking behind what strength of evidence a z value corresponds to, if not the quantiles they correspond to in the sampling distribution?

I didn’t get the critique about cross-validation-based-estimates being functions of data.

I mean sure they are, but then what isn’t?

Rahul:

Yes, everything is a function of data. But when people put a lot of effort into computing something, and if that something has some appealing theoretical properties, there seems to be a tendency to forget that it’s a function of data. I see this with p-values, with cross-validation, and with posterior distributions. In each case, the output is often taken too seriously, as an object in itself, without full recognition that it’s just a product of a particular, randomly sampled, dataset.

Cross validation error is a very blurry image of a very important value. People often forget about the blurry image part. I’ve seen people claim that their model must be better because cross-validation error was reduced 32.7% to 32.6%. This was during a candidate talk, not just some kid on Twitter.

To be fair, the candidate was a numerical analyst, so superiority of their model was someone else’s professional concern, not theirs.

The biases and errors that went into the data are still there when you cross validate.

As opposed to when? When are the biases & errors not there?

I read Elin’s “still” as “also” or “remain”.

Hmm – a question – would these authors and their apparent readers have spent their time more purposefully reading more informed accounts rather than producing a calculator and published paper?

“A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature.” http://link.springer.com/article/10.1007/s10654-016-0149-3

In particular see point 18 or observed power in http://www.sciencedirect.com/science/article/pii/S1047279712000221

If everything were perfect and ideal, then the p-values would be reasonably indicative. Then Z = 1.96 would indicate the same thing as p = 0.05 (assuming normality, of course). But Z is a random variable, and so is p. Z varies in a much more restrained manner than p when there are sampling variations. For example, increasing Z from 1.96 to 2.16 (about a 10% change) reduces p from 0.05 to 0.03 (a 40% change).

In addition, it’s much easier to assess the influence of non-random errors when thinking about Z-values (or just standard deviations, for that matter). For example, say we measure someone’s weight and find it is 75 kg. We have found that day-to-day variation is 0.5 kg, for a Z-value of 0.5/75 = 0.0067. But we know from experience that home scales can easily be 1 kg out of calibration, or Z = 0.0133. We can easily tell that the calibration error could be much larger than the statistical error. But if we only knew the p-value for a measurement, we would be hard-pressed to tell that.

> For example, increasing Z from 1.96 to 2.16 (about a 10% change) reduces p from 0.05 to 0.03 (a 40% change).

Z=1.96 is 51% more probable than Z=2.16 and Z>1.96 is 62% more probable than Z>2.16. The “equivalent” (10%) change of increasing Z from 0.196 to 0.216 has almost no effect (less than 2%) on the corresponding probabilities. If one is interested in the changes in tail probabilities, one wants to distinguish these two cases. The sensitivity to changes in the value of Z is a desirable feature.
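
The ratios quoted in this comment can be checked with a few lines of Python. This is only a sketch using the standard normal density and upper-tail probability; the helper names are made up for the illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 1.0 - std_normal.cdf(z)


def density_ratio(z_low, z_high):
    """How much more probable (in density) z_low is than z_high."""
    return std_normal.pdf(z_low) / std_normal.pdf(z_high)


def tail_ratio(z_low, z_high):
    """How much more probable Z > z_low is than Z > z_high."""
    return upper_tail(z_low) / upper_tail(z_high)


for z_low, z_high in [(1.96, 2.16), (0.196, 0.216)]:
    print(f"Z from {z_low} to {z_high}: "
          f"density ratio {density_ratio(z_low, z_high):.3f}, "
          f"tail ratio {tail_ratio(z_low, z_high):.3f}")
```

A ratio of 1.51 corresponds to "51% more probable", and so on. The same 10% change in Z matters a great deal in the tail and hardly at all near the center.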

One could of course reason as well in terms of sigmas, because there is a well-known correspondence. But one would not usually consider that the change from one sigma to two sigma is in any way comparable to going from five sigma to six sigma.

> We have found that day-to-day variation is 0.5 kg, for a Z-value of 0.5/75 = 0.0067.

That’s not a Z-value.

P Hacking Commercial

1. P-values determine publication, career success, standing among peers, marital happiness, and overall well being.

2. P-values are random.

3. You don’t have to let a lottery determine your future!

4. Learn the P-hacking way to happiness!

Call Us at 1-800-PHACK*** NOW & receive the life changing booklet “1000 Degrees of freedom: P-hacking for success”

All for $39.99 +pp

More seriously, I think the notion that the p-value is random explains why most published findings are false.

Who in their right mind gambles their career on a p-value lottery?

The evaluation of scientists should not be based on something random but on items under their control:

– The importance / novelty of their research questions, and

– The quality of their research designs.

+1

The p value isn’t just a random variable, under data actually generated by a random number generator with the null hypothesis distribution, it’s a UNIFORM random variable between 0 and 1. I think people get something like p = 0.061 and they think “gee it’s ‘nearly significant'” and maybe next time with a bigger sample size it will be p = 0.044 or something. But if the null is literally true, the next time it could be 0.85, there’s no “continuity” that means next time you’ll get something near to what you got before.

Leaving aside all the other problems with p values, this uniform between 0,1 property is very different from stuff people are used to in the sciences. Almost everything is continuous or nearly so, small perturbations in what you do result in small changes to outcomes, but definitely NOT with sampling under the null and transforming to p values.
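
A quick way to see the uniform behavior is to simulate it. The sketch below assumes the simplest possible setup: a two-sided z-test of mean zero with known unit variance, applied to data that really are N(0,1). The sample size, replication count, and seed are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def null_p_value(rng, n_obs=25):
    """Two-sided z-test p-value for H0: mu = 0 on data drawn from N(0, 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    z = fmean(sample) * sqrt(n_obs)  # known sigma = 1, so this is a z-score
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(2016)
p_values = [null_p_value(rng) for _ in range(10_000)]

# Share of p-values falling in each tenth of the unit interval.
decile_counts = [0] * 10
for p in p_values:
    decile_counts[min(int(p * 10), 9)] += 1
decile_shares = [count / len(p_values) for count in decile_counts]

for i, share in enumerate(decile_shares):
    print(f"p in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {share:.3f}")
```

Each tenth of the interval should receive roughly a tenth of the p-values, so under the null a p-value of 0.061 in one experiment says nothing about where the next one will land.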

Daniel:

The bit about the p-value being uniformly distributed is more misleading than helpful, I think, because that statement is conditional on the null hypothesis being true, and it in general isn’t.

Then why do people test the null that \mu = 0? Conditional on the assumption that such a null hypothesis is worth testing, pointing out that the p-value has a uniform distribution under the null is very worthwhile. If you say that testing such a hypothesis in the first place is not worth it, then there’s no point in bringing up this point, agreed. But if you buy into the NHST framework, this uniform distribution detail is one of the first things people should be taught.

Viewing the world through Andrew’s Type S errors, not Type I, is a much more logical framework, and the proper way to understand the motivation behind why hypothesis testing might make any sense whatsoever. It also helps see why Andrew says that thinking of p-values as uniformly distributed is very misleading.

But using the likelihood ratio test, testing Type S errors becomes equivalent to testing type I errors; mu = 0 is on the boundary of the set mu <= 0. So testing

H_o: mu 0

will lead to exactly the same rejections (**assuming alpha 0.5…) to testing

H_o: mu = 0

vs

H_a: mu > 0

Oops. First set of hypotheses should have been

H_o: mu 0

there’s some weird text processing that I do not understand.

Ho: mu less than or equal to 0

Ha: mu greater than 0

the blog eats things after a less than sign because it thinks it’s the start of an HTML tag… it’s the seriously most annoying thing about WordPress, together with the lack of preview.

Ah, that explains it, thanks. For a moment I thought the blog just ate meaningless hypothesis tests.

>Conditional on the assumption that such a null hypothesis is worth testing

Isn’t this a bit similar to “conditioning on the assumption that 1+1==3 is worth testing (…)”?

I’m sure it’s strange and interesting. But since I know it’s false and wouldn’t spend a fraction of a penny to test otherwise, this seems an empty hypothetical to me.

Perhaps you can suggest an example of a studied null-hypothesis which isn’t just obviously false (or obviously true, I suppose I should cover that weird case) and thus is “worth testing”. Even for de minimis values of “worth”.

There are many papers in psychology where they want to argue for mu=0. They haven’t thought much about what exactly that means. Indeed several famous psycholinguists have based their entire careers on showing that mu=0.

One could argue that the very concept of p-value is more misleading than helpful. But if p-values are kept, it’s essential to know that if the null hypothesis is true then it will be distributed uniformly between 0 and 1. How can you have any real understanding of p-values otherwise? However, I think this fact is widely ignored (while on the other hand I think most people are aware that the p-value depends on the data, or they wouldn’t bother running their experiments to get it).

I had a discussion with someone who was convinced that if the null hypothesis was false as you added data your p-value would trend to 0 (ok) but if the null hypothesis was true as you added data the p-value would trend to 1 (wtf?). Pointing to many references explaining that this was not the case didn’t help, his conclusion was that then the p-value would be useless. I was also surprised to see that finding these references was harder than I expected.

Carlos:

> it will be distributed uniformly between 0 and 1. How can you have any real understanding of p-values otherwise?

Completely agree – but given all the myriad of background assumptions above and beyond simply absolutely no effect, such as properly randomized, no dropouts, no selection, no biased outcome assessments, etc., etc, (more fully discussed in the paper link I gave above) if there ever is absolutely no effect – the distribution of p_value will be undefined but likely far from uniformly between 0 and 1. So only helpful when dealing with ideally conducted animal experiments?

> surprised to see that finding these references was harder than I expected

I used to use it as a diagnostic question to get a sense if another statistician had much of a purposeful grasp of statistics, “if the treatment was completely inert and all the assumptions needed for the analysis were absolutely fine what would the distribution of p-values be?” (common answer Normally distributed – wtf?)

These days I might try, of course you know that if the null hypothesis is true then p_values will be distributed uniformly between 0 and 1 but what other implicit assumptions are needed for this to be actually true?

My bigger question which I might have been a bit snarky about above, why don’t people read informed views on such statistical issues (especially if they are writing papers on them)? Are they unable to locate/discern where these informed views may be found?

It’s the bit about people kind of expecting a normal distribution that is my main point. People expect the p value to “cluster” close to a common value. Even if the null hypothesis is definitely false, there are plenty of distributions that the p value might have which don’t look like a tightly clustering distribution. Like maybe a section of an exponential curve, or something like beta(1.5,0.75) or whatever

We don’t disagree. But let me insist that the relevant point is not that the p-value is going to be distributed uniformly (we hope it will not, because we expect the null hypothesis to be false, and even if it was true you are right that there are so many other factors that can spoil our theoretical expectation). The point is that this theoretical distribution, if the null were true and the model correct, is the only thing that gives a meaning to the p-value. The p-value is a measure of how often would I see a statistic more extreme than the one at hand if the null was true. So by definition, if the null is true I get a value corresponding to p<0.05 with probability 0.05 and I get a value corresponding to p<0.01 with probability 0.01. If people are surprised to learn that the p-value is distributed like that when the null hypothesis is true, how do they interpret the p-value? What do they think they are measuring?

> If people are surprised to learn that the p-value is distributed like that when the null hypothesis is true, how do they interpret the p-value?

Actually we agree on that, it’s knowing the p_values would be Uniform(0,1) if there was no effect and all necessary assumptions were true enough that allows a possibly sensible interpretation.

See http://andrewgelman.com/2016/07/26/29552/#comment-289122

DUH they think they are measuring the probability that their favorite hypothesis is false :-)

This is not as “wtf” as it may appear.

For one-sided tests, where the null is true but the true value of the parameter is not at the dividing line (typically zero) then adding more data the p-value *will* trend to one.

For one or two-sided tests, in situations where a point null (i.e. parameter value exactly zero) is implausible – as often discussed here – then no sensible analysis ever ends up with U(0,1) p-values.

Obviously it’s not “wrong” to talk about p-values being U(0,1) in some circumstances. But in some contexts there may be better ways to motivate them, and to convince others of their (limited) utility.

> For one-sided tests, where the null is true but the true value of the parameter is not at the dividing line (typically zero) then adding more data the p-value *will* trend to one.

I don’t follow you. What is the “dividing line”?

Suppose you’re testing whether a parameter is positive, your null is that it’s negative. The true parameter is well and truly negative, like say -10 on some scale where 1 is a big deviation. I don’t understand the point about why that trends towards 1 though. It seems like the null hypothesis is true here, so the p value should probably be uniform(0,1) if the test is well constructed. Perhaps he means that if the parameter is +10 the p value trends towards 0?

I have to confess I didn’t know that composite null hypotheses were a thing. I’m not sure it solves more problems than it creates (frankly, it seems an abomination) but I can see how the p-value can then trend to 1. If we do the test corresponding to H0: mu=0, but allow all the values mu<0 as being "null", we would consider that mu=-10 satisfies the null hypothesis and as we add data our mean will be closer to -10 and further (in sigma terms) from 0 (so the p-value will approach 1, in the same way that if mu=10 it would approach 0). And the p-value doesn't have a definite meaning in this setting, it's not that "if the null is true then we get a value this extreme with probability p" but "if the null is true then we get a value this extreme with probability p or lower".

Carlos: by the “dividing line”, I mean the point in the parameter space where the null hypothesis stops being true. So if your null is that a parameter is positive, this is true for all parameters above zero and false below zero. The dividing line is zero.

Carlos and Daniel: well-calibrated p-values are smaller than alpha in *no more* than proportion alpha of repeated experiments. This means that they can be smaller than alpha *far less* than the nominal level of control and still be valid. This is what happens, in most situations, when the truth is null but not at the dividing line.

If you want a concrete example, try testing H0: mu positive when 100 independent observations are N(mu,1), using test statistic the sample mean. When mu is zero you’ll get U(0,1) p-values. For large positive mu you’ll get large p-values basically every time – but the null still holds.
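
For anyone who wants to run this example, here is a rough sketch. It assumes the p-value is computed at the dividing line mu = 0, with small sample means counting as evidence against the null, and it takes mu = 0.5 as a stand-in for "large positive mu" (that is 5 standard errors with 100 observations). The function names, replication count, and seed are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, median


def one_sided_p(sample):
    """p-value for H0: mu >= 0 against Ha: mu < 0, evaluated at mu = 0.

    The test statistic is the sample mean of N(mu, 1) observations, so
    p = P(mean <= observed mean | mu = 0) = Phi(sqrt(n) * observed mean).
    """
    return NormalDist().cdf(sqrt(len(sample)) * fmean(sample))


def simulate(mu, n_obs=100, n_reps=4000, seed=0):
    """p-values from repeated experiments with true mean mu."""
    rng = random.Random(seed)
    return [one_sided_p([rng.gauss(mu, 1.0) for _ in range(n_obs)])
            for _ in range(n_reps)]


p_boundary = simulate(mu=0.0)   # null true, exactly at the dividing line
p_interior = simulate(mu=0.5)   # null true, well away from the dividing line

for label, ps in [("mu = 0.0", p_boundary), ("mu = 0.5", p_interior)]:
    share = sum(p < 0.05 for p in ps) / len(ps)
    print(f"{label}: median p = {median(ps):.3f}, share with p < 0.05 = {share:.3f}")
```

At the dividing line the p-values behave like U(0,1) draws. Away from it they pile up near 1, so the rejection rate stays below alpha and the test remains valid, just conservative.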

Carlos: sorry, my reply crossed with your response. Yes, composite nulls are a thing. You might also want to look up unbiased tests, which define a related form of testing behavior often deemed sensible. Hope you can get back to your “someone” too; their intuition was not so terrible.

george: thanks for the explanation. It’s true that I already knew that the p-value is not *always* distributed uniformly (discrete distributions are the obvious counter-example), but in the discussion I mentioned it was completely clear that the null hypothesis was a single value (the usual N(mu,1) stuff, with H0: mu=0). Simple cases are already complex enough!

Look at the p-value distribution in our paper here:

https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg (paper.pdf)

It’s the test of astrological sign for every other variable in a very large dataset. The distribution of p values is very close to uniform.

The point about the posterior distribution also having a sampling distribution seems to basically be completely neglected.

I think this reflects a different emphasis and a different meaning for a Bayesian analysis. A Bayesian posterior gives you a measurement of plausibility of different parameter values *given the data*. The fact that some other data set would have given you some other posterior distribution is not really relevant. I mean, we could also say “if you’d been born as a Zlinglian on planet Zork you’d have 3 eyes” but that’s not the actual case.

>>>The fact that some other data set would have given you some other posterior distribution is not really relevant.<<<

Isn't this what external validity is about? Or is that different?

In some sense this is what it’s about, but in another sense I think external validity is about the validity of the *form of the model*. Basically, if you assume there is a universal “true” parameter value for a “true” model (or at least, let’s say “one good” parameter value for a “widely applicable” model), when in fact there isn’t, then a symptom of that problem will be that the posterior distribution for this “true” parameter value will vary widely based on your dataset (like for example, perhaps with two different datasets you will find two different high probability regions that are completely disjoint).

So, that’s an interesting diagnostic criterion when you have several datasets to look at, and it’s a concern that you should try to make your explanations have broad “external” applicability, but if you do have a good model form, the fact that two different datasets have high probability regions that overlap but are not exactly the same is not a bug, it’s a feature, or rather it’s simply an inevitable consequence of having a limited amount of information in your data.

But if your model is good, & has external validity, shouldn’t the posterior be relatively stable on varying the datasets?

If your parameter sets vary crazily contingent on which particular dataset you use you probably don’t have a good model?

Yes, relatively stable, especially for large datasets for a good model, and crazily contingent on the dataset for a bad model.

But even with a good model there is a hypothetical issue of how would the posterior differ under some alternative hypothetical dataset. As we said, it should be relatively stable for a good model. But a Frequentist might ask you a question like “you say there’s a 95% chance that the parameter X is positive, but what would you say if we collected a different hypothetical dataset? How could you know it’s a 95% chance if it isn’t the case that you’d say it’s positive under 95% of hypothetical repetitions of the experiment?” etc…

and that’s just not what the posterior distribution is about, and that’s what I think of when someone mentions that the posterior is “a random variable” (i.e., contingent on the data). Technically speaking, the posterior distribution of N different parameters is an infinite dimensional function on an N dimensional manifold. You can use this fact pre-data to design an experiment to give you “the best” inference you can (where you have to define best via some objective function to maximize, typically trading off cost vs information content). But after seeing the data, if you’re committed to the model, what would have happened in alternative worlds is… questionably relevant.

Isn’t the role of the prior basically to represent your assumptions on the sampling distribution? The observed likelihood varies with the data, the prior attempts to stabilise it. The Frequentist approach to stabilisation is the device of hypothetical replications. In this way it seems to me that both Frequentists and Bayesians ‘make up’ additional data to stabilise estimates based on the data at hand. Likelihoodists, Machine Learners etc etc can also make up stabilisation methods, and all seem equally well justified – useful, but with minimal guarantees of correctness/performance when compared against actual reality.

The posterior distribution is, explicitly, conditional on the data.

p(Parameters | Data, Assumptions)

where Assumptions is the label we’re using to stand in for the form of the likelihood and for the priors on the parameters.

So, saying that “with different data you’d have got a different posterior over parameters” is like saying “if you had steered left your car would have turned”, yes, true, that’s what it’s for!

The main situation where you *DO* use this fact is in design of experiments (where the data haven’t yet been collected!). Bayesian design of experiments should basically look like optimizing some tradeoff between cost of experiments and expected entropy of the posterior distribution. You can generate fake datasets under an experimental scheme, using your priors to sample the parameters, then generate a dataset, calculate posterior samples, calculate an entropy, and then repeat under a wide variety of experimental schemes and try to find the scenario that gives you the best tradeoff between cost and uncertainty.

I think you missed my point.

Whoops I think I was trying to reply to Olav directly.

I’m not so sure that the recent emphasis on describing the exact nature of the p-value is ultimately going to benefit researchers who use its interpretation in practical science.

No matter how we slice it the p-value depends heavily on what we know of the data. Is the data a collective (i.e. complete) or a sample? Is the data measured with error or very little to non-existent error? To the extent that error exists, is it random or is it systematic? Do we believe even in the presence of pure certainty of the generating function of the data that the same generating function will continue to hold in the future?

Each question/answer will lead to a different practical interpretation of the p-value. Since there are so many interpretations available, depending on what we know of the data, maybe it would behoove us to put more emphasis on urging researchers to think about what they’re doing at each critical time-step. For example, instead of asking researchers to think of the p-value as a random variable, why not ask them to think about why they believe it to be a good summary of their data? What does the p-value mean under the conditions of their experiment and assumed data generating function? Why have they not chosen to summarize it otherwise?

P.S. I say this despite the fact I think it’s quite cool to think of the p-value as having its own distribution under certain conditions. I’m just tired of this whole business. At what point do the adults show up and get researchers to think about what they’re doing?

“Is the data a collective (i.e. complete) or a sample?”

If the data are complete, then it doesn’t make sense to calculate a p-value — the p-value is only used for trying to make an inference from a sample to the “population”.

“it would behoove us to put more emphasis on urging researchers to think about what they’re doing at each critical time-step.”

Part of that thinking is knowing and remembering what the p-value is, and in particular that the p-value they calculate is just one particular value of a random variable — i.e., that it depends on the data, even if all the model conditions are satisfied.

“why not ask them to think about why they believe it to be a good summary of their data?”

Thinking of the p-value as a “summary of the data” (let alone a good summary) seems to me like missing what the p-value is (and isn’t).

+1 for the last sentence!

> If the data are complete, then it doesn’t make sense to calculate a p-value — the p-value is only used for trying to make an inference from a sample to the “population”.

If one wants to test the hypothesis that planetary orbital planes are laid out at random, one can calculate a p-value even though the complete set of planets is known. I guess one can say that there is a “population” also when the data are complete.

> Thinking of the p-value as a “summary of the data” (let alone a good summary)

It might not be a good one, but I’d say p-values are a summary of the data (as any other statistic).

@Carlos

I guess we have some terminology differences here.

1. I interpret “complete data” as complete data from all units in the population — so if you are testing the hypothesis that planetary orbital planes are laid out at random, the population would be “all possible planetary planes”.

2. I guess “summary” is sufficiently vague that it is subject to lots of interpretations. But it does seem extreme to say that any statistic is a summary of the data. I suppose I could accept the idea of “summary for certain purposes” — in which case I guess the p-value could be considered a summary for certain purposes. Still, referring to it as a summary of the data misses a lot of what it is (and isn’t).

Martha, I don’t think we are in disagreement. I just found that the concept of “summary” as “for certain purposes” was implicit in the comment you were replying to (“[…] why not ask them to think about why they believe it to be a good summary of their data? What does the p-value mean under the conditions of their experiment and assumed data generating function? Why have they not chosen to summarize it otherwise?”).

What about this note from Taleb on the p-value distribution?

https://arxiv.org/pdf/1603.07532.pdf