https://arxiv.org/pdf/1603.07532.pdf ]]>

I guess we have some terminology differences here.

1. I interpret “complete data” as complete data from all units in the population — so if you are testing the hypothesis that planetary orbital planes are laid out at random, the population would be “all possible planetary planes”.

2. I guess “summary” is sufficiently vague that it is subject to lots of interpretations. But it does seem extreme to say that any statistic is a summary of the data. I suppose I could accept the idea of “summary for certain purposes” — in which case I guess the p-value could be considered a summary for certain purposes. Still, referring to it as a summary of the data misses a lot of what it is (and isn’t).

]]>If one wants to test the hypothesis that planetary orbital planes are laid out at random, one can calculate a p-value even though the complete set of planets is known. I guess one can say that there is a “population” also when the data are complete.

> Thinking of the p-value as a “summary of the data” (let alone a good summary)

It might not be a good one, but I’d say p-values are a summary of the data (as any other statistic).

]]>If the data are complete, then it doesn’t make sense to calculate a p-value — the p-value is only used for trying to make an inference from a sample to the “population”.

“it would behoove us to put more emphasis on urging researchers to think about what they’re doing at each critical time-step.”

Part of that thinking is knowing and remembering what the p-value is, and in particular that the p-value they calculate is just one particular value of a random variable — i.e., that it depends on the data, even if all the model conditions are satisfied.

“why not ask them to think about why they believe it to be a good summary of their data?”

Thinking of the p-value as a “summary of the data” (let alone a good summary) seems to me like missing what the p-value is (and isn’t).

]]>No matter how we slice it the p-value depends heavily on what we know of the data. Is the data a collective (i.e. complete) or a sample? Is the data measured with error or very little to non-existent error? To the extent that error exists, is it random or is it systematic? Do we believe even in the presence of pure certainty of the generating function of the data that the same generating function will continue to hold in the future?

Each question/answer will lead to a different practical interpretation of the p-value. Since there are so many interpretations available, depending on what we know of the data, maybe it would behoove us to put more emphasis on urging researchers to think about what they’re doing at each critical time-step. For example, instead of asking researchers to think of the p-value as a random variable, why not ask them to think about why they believe it to be a good summary of their data? What does the p-value mean under the conditions of their experiment and assumed data generating function? Why have they not chosen to summarize it otherwise?

P.S. I say this despite the fact I think it’s quite cool to think of the p-value as having its own distribution under certain conditions. I’m just tired of this whole business. At what point do the adults show up and get researchers to think about what they’re doing?

]]>But even with a good model there is a hypothetical issue of how would the posterior differ under some alternative hypothetical dataset. As we said, it should be relatively stable for a good model. But a Frequentist might ask you a question like “you say there’s a 95% chance that the parameter X is positive, but what would you say if we collected a different hypothetical dataset? How could you know it’s a 95% chance if it isn’t the case that you’d say it’s positive under 95% of hypothetical repetitions of the experiment?” etc…

and that’s just not what the posterior distribution is about, and that’s what I think of when someone mentions that the posterior is “a random variable” (i.e., contingent on the data). Technically speaking, the posterior distribution of N different parameters is an infinite-dimensional function on an N-dimensional manifold. You can use this fact pre-data to design an experiment to give you “the best” inference you can (where you have to define best via some objective function to maximize, typically trading off cost vs information content). But after seeing the data, if you’re committed to the model, what would have happened in alternative worlds is… questionably relevant.

]]>If your parameter sets vary crazily contingent on which particular dataset you use you probably don’t have a good model?

]]>So, that’s an interesting diagnostic criterion when you have several datasets to look at, and it’s a concern that you should try to make your explanations have broad “external” applicability, but if you do have a good model form, the fact that two different datasets have high probability regions that overlap but are not exactly the same is not a bug, it’s a feature, or rather it’s simply an inevitable consequence of having a limited amount of information in your data.

]]>Isn't this what external validity is about? Or is that different?

]]>Isn’t this a bit similar to “conditioning on the assumption that 1+1==3 is worth testing (…)”

I’m sure it’s strange and interesting. But since I know it’s false and wouldn’t spend a fraction of a penny to test otherwise, this seems an empty hypothetical to me.

Perhaps you can suggest an example of a studied null-hypothesis which isn’t just obviously false (or obviously true, I suppose I should cover that weird case) and thus is “worth testing”. Even for de minimis values of “worth”.

https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg (paper.pdf)

It’s the test of astrological sign for every other variable in a very large dataset. The distribution of p values is very close to uniform.

]]>p(Parameters | Data, Assumptions)

where Assumptions is the label we’re using to stand in for the form of the likelihood and for the priors on the parameters.

So, saying that “with different data you’d have got a different posterior over parameters” is like saying “if you had steered left your car would have turned”, yes, true, that’s what it’s for!

The main situation where you *DO* use this fact is in design of experiments (where the data haven’t yet been collected!). Bayesian design of experiments should basically look like optimizing some tradeoff between cost of experiments and expected entropy of the posterior distribution. You can generate fake datasets under an experimental scheme, using your priors to sample the parameters, then generate a dataset, calculate posterior samples, calculate an entropy, and then repeat under a wide variety of experimental schemes and try to find the scenario that gives you the best tradeoff between cost and uncertainty.
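Here is a rough sketch of that loop, just to make the recipe concrete. Everything specific in it is an assumption of mine for illustration: the model is a single Bernoulli success probability on a grid with a flat prior, the "experimental schemes" are just different sample sizes, and the tradeoff is expected posterior entropy plus a made-up linear cost per observation. The function names are hypothetical, not from any library.

```python
import math
import random

# Hypothetical toy model: theta is a Bernoulli success probability on a grid.
GRID = [i / 100 for i in range(1, 100)]
PRIOR = [1 / len(GRID)] * len(GRID)  # flat prior (an assumption)


def posterior(k, n):
    """Grid posterior over theta after k successes in n trials."""
    logw = [math.log(p) + k * math.log(t) + (n - k) * math.log(1 - t)
            for t, p in zip(GRID, PRIOR)]
    m = max(logw)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in logw]
    total = sum(w)
    return [v / total for v in w]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_entropy(n, n_sims, rng):
    """Average posterior entropy over fake datasets of size n."""
    total = 0.0
    for _ in range(n_sims):
        theta = rng.choice(GRID)                         # sample parameter from the flat prior
        k = sum(rng.random() < theta for _ in range(n))  # generate a fake dataset
        total += entropy(posterior(k, n))                # score the resulting posterior
    return total / n_sims


def best_design(sizes, cost_per_obs, n_sims=200, seed=0):
    """Pick the sample size minimizing expected entropy plus linear cost."""
    rng = random.Random(seed)
    scores = {n: expected_entropy(n, n_sims, rng) + cost_per_obs * n
              for n in sizes}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    choice, scores = best_design([5, 20, 50, 100], cost_per_obs=0.01)
    print(choice, scores)
```

In a real problem you would use posterior samples instead of a grid and a cost function that reflects the actual experiment; the point is only that "different data, different posterior" is the thing being averaged over here, before any data exist.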

]]>Who in their right mind gambles their career on a p-value lottery?

The evaluation of scientists should not be based on something random but on items under their control:

– The importance / novelty of their research questions, and

– The quality of their research designs.

Carlos and Daniel: well-calibrated p-values are smaller than alpha in *no more* than a proportion alpha of repeated experiments. This means that they can be smaller than alpha *far less* often than the nominal level of control and still be valid. This is what happens, in most situations, when the truth is null but not at the dividing line.

If you want a concrete example, try testing H0: mu positive when 100 independent observations are N(mu,1), using the sample mean as the test statistic. When mu is zero you’ll get U(0,1) p-values. For large positive mu you’ll get large p-values basically every time – but the null still holds.
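A quick simulation sketch of this example, for anyone who wants to try it. It assumes the null is read as mu >= 0 (so mu = 0 is the dividing line), that sigma = 1 is known, and that the p-value is the lower-tail normal probability of the standardized sample mean. The function names are mine, not from any package.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def p_value(sample):
    """Lower-tail p-value for H0: mu >= 0 vs Ha: mu < 0, with sigma = 1 known."""
    n = len(sample)
    z = math.sqrt(n) * sum(sample) / n
    return normal_cdf(z)


def rejection_rate(mu, alpha=0.05, n=100, reps=2000, seed=0):
    """Share of simulated experiments with p < alpha when the true mean is mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, 1) for _ in range(n)]
        hits += p_value(sample) < alpha
    return hits / reps


if __name__ == "__main__":
    print(rejection_rate(0.0))  # at the dividing line: close to alpha
    print(rejection_rate(0.5))  # null true but away from the line: essentially never
```

At mu = 0 the rejection rate sits near alpha, as it should for U(0,1) p-values. At mu = 0.5 the null is still true but rejections essentially never happen, which is the "far less than nominal, yet still valid" case.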

]]>Actually we agree on that: it’s knowing that the p_values would be Uniform(0,1), if there was no effect and all necessary assumptions were true enough, that allows a possibly sensible interpretation.

See http://andrewgelman.com/2016/07/26/29552/#comment-289122

]]>I don’t follow you. What is the “dividing line”?

]]>For one-sided tests, where the null is true but the true value of the parameter is not at the dividing line (typically zero), then as you add more data the p-value *will* trend to one.

For one or two-sided tests, in situations where a point null (i.e. parameter value exactly zero) is implausible – as often discussed here – then no sensible analysis ever ends up with U(0,1) p-values.

Obviously it’s not “wrong” to talk about p-values being U(0,1) in some circumstances. But in some contexts there may be better ways to motivate them, and to convince others of their (limited) utility.

]]>> it will be distributed uniformly between 0 and 1. How can you have any real understanding of p-values otherwise?

Completely agree – but given all the myriad of background assumptions above and beyond simply absolutely no effect, such as properly randomized, no dropouts, no selection, no biased outcome assessments, etc., etc, (more fully discussed in the paper link I gave above) if there ever is absolutely no effect – the distribution of p_value will be undefined but likely far from uniformly between 0 and 1. So only helpful when dealing with ideally conducted animal experiments?

> surprised to see that finding these references was harder than I expected

I used to use it as a diagnostic question to get a sense if another statistician had much of a purposeful grasp of statistics, “if the treatment was completely inert and all the assumptions needed for the analysis were absolutely fine what would the distribution of p-values be?” (common answer Normally distributed – wtf?)

These days I might try: “Of course you know that if the null hypothesis is true then p_values will be distributed uniformly between 0 and 1, but what other implicit assumptions are needed for this to be actually true?”

My bigger question which I might have been a bit snarky about above, why don’t people read informed views on such statistical issues (especially if they are writing papers on them)? Are they unable to locate/discern where these informed views may be found?

]]>I had a discussion with someone who was convinced that if the null hypothesis was false, as you added data your p-value would trend to 0 (ok), but if the null hypothesis was true, as you added data the p-value would trend to 1 (wtf?). Pointing to many references explaining that this was not the case didn’t help; his conclusion was that then the p-value would be useless. I was also surprised to see that finding these references was harder than I expected.
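For what it’s worth, this is easy to check by simulation. The sketch below assumes a two-sided z-test of the point null mu = 0 with known sigma = 1, and draws the sample mean directly from its sampling distribution rather than generating every observation. The helper names are hypothetical.

```python
import math
import random


def two_sided_p(xbar, n):
    """Two-sided z-test p-value for H0: mu = 0, with sigma = 1 known."""
    z = math.sqrt(n) * abs(xbar)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def median_p(mu, n, reps=4000, seed=0):
    """Median p-value over simulated experiments of size n with true mean mu."""
    rng = random.Random(seed)
    ps = sorted(two_sided_p(rng.gauss(mu, 1 / math.sqrt(n)), n)
                for _ in range(reps))
    return ps[reps // 2]


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, median_p(0.0, n), median_p(0.1, n))
```

With mu = 0 the median p-value stays around one half however large n gets, because the distribution stays uniform. With mu = 0.1 it heads to zero as n grows.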

]]>Z=1.96 is 51% more probable than Z=2.16 and Z>1.96 is 62% more probable than Z>2.16. The “equivalent” (10%) change of increasing Z from 0.196 to 0.216 has almost no effect (less than 2%) on the corresponding probabilities. If one is interested in the changes in tail probabilities, one wants to distinguish these two cases. The sensitivity to changes in the value of Z is a desirable feature.
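Those ratios can be reproduced in a few lines. This is only a check of the arithmetic under a standard normal reference distribution; the helper names are mine.

```python
import math


def density(z):
    """Standard normal density at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for lo, hi in [(1.96, 2.16), (0.196, 0.216)]:
    print(lo, hi,
          round(density(lo) / density(hi), 3),        # ratio of densities
          round(upper_tail(lo) / upper_tail(hi), 3))  # ratio of tail probabilities
```

The first pair gives ratios of about 1.51 and 1.62; the second pair gives ratios that are both within 2% of one.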

One could of course reason as well in terms of sigmas, because there is well-known correspondence. But one would not usually consider that the change from one sigma to two sigma is in any way comparable to going from five sigma to six sigma.

> We have found that day-to-day variation is 0.5 kg, for a Z-value of 0.5/75 = 0.0067.

That’s not a Z-value.

]]>Ho: mu less than or equal to 0

Ha: mu greater than 0

H_o: mu <= 0 vs H_a: mu > 0

]]>But using the likelihood ratio test, testing Type S errors becomes equivalent to testing Type I errors; mu = 0 is on the boundary of the set mu <= 0. So testing

H_o: mu <= 0

vs

H_a: mu > 0

will lead to exactly the same rejections (**assuming alpha < 0.5…) as testing

H_o: mu = 0

vs

H_a: mu > 0

To be fair, the candidate was a numerical analyst, so superiority of their model was someone else’s professional concern, not theirs.

]]>Yes, everything is a function of data. But when people put a lot of effort into computing something, and if that something has some appealing theoretical properties, there seems to be a tendency to forget that it’s a function of data. I see this with p-values, with cross-validation, and with posterior distributions. In each case, the output is often taken too seriously, as an object in itself, without full recognition that it’s just a product of a particular, randomly sampled, dataset.

]]>The bit about the p-value being uniformly distributed is more misleading than helpful, I think, because that statement is conditional on the null hypothesis being true, and it in general isn’t.

]]>Leaving aside all the other problems with p values, this uniform-between-0-and-1 property is very different from stuff people are used to in the sciences. Almost everything is continuous or nearly so: small perturbations in what you do result in small changes to outcomes. But that is definitely NOT the case with sampling under the null and transforming to p values.

]]>1. P-values determine publication, career success, standing among peers, marital happiness, and overall well being.

2. P-values are random.

3. You don’t have to let a lottery determine your future!

4. Learn the P-hacking way to happiness!

Call Us at 1-800-PHACK*** NOW & receive the life changing booklet “1000 Degrees of freedom: P-hacking for success”

All for $39.99 +pp

]]>