The answer is: if a) you used an RNG to do assignment of the groups and b) you replicate the whole experiment exactly multiple times and continue to get small p values, *then* you will be able to conclude “yes, the assumptions of my test are wrong,” and it won’t matter how many buttons you have on your keyboard, or whether you looked at the data first or after, or had soup for breakfast, or wear red on Wednesdays, or secretly want the results to be one way or another.

Now, you can legitimately infer that “either something is unusual about these two groups such that the assumptions of my test are wrong, or the assumptions of my test are correct and I have one of the unusual samples that lead to p = 0.01 even when the assumptions are correct”.

How can you distinguish? Very simple: run the randomized experiment, say, 20 or 100 times, and see whether the p values you get from repeating the same experiment and conducting the same exact Tn test are non-uniformly distributed. If you consistently get small p values, then you can conclude “the assumptions of my test are probably wrong”. (For all the good that does you!)
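The logic above is easy to check in simulation. A minimal sketch (the two-sample t test standing in for the generic Tn, which is my choice, not something specified in the comment): when the groups really are assigned by an RNG and the test’s assumptions hold, repeating the whole experiment yields p values that are uniform on [0, 1], so only about 5% fall below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_experiment(n=50):
    # Randomly assign 2n subjects to two groups using the RNG,
    # with no true difference between the groups (assumptions hold).
    pool = rng.normal(loc=0.0, scale=1.0, size=2 * n)
    idx = rng.permutation(2 * n)
    a, b = pool[idx[:n]], pool[idx[n:]]
    return stats.ttest_ind(a, b).pvalue

# Replicate the whole experiment many times and look at the p values.
pvals = np.array([one_experiment() for _ in range(1000)])

# Under correct assumptions the p values are ~uniform on [0, 1]:
print(f"fraction below 0.05: {np.mean(pvals < 0.05):.3f}")  # close to 0.05
```

Consistently seeing a much larger fraction of small p values across replications is exactly the signal that some assumption of the test is wrong.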

Wait, you say, I can’t possibly afford to run a $25M drug safety and effectiveness test 20 to 100 times in a row!!!

Oh, well, then you’d better do science (i.e. causal modeling using realistic mechanistic prediction models) and analyze the science using Bayesian analysis (i.e. an analysis of the plausibility of various facts given model assumptions) instead of relying on terribly inefficient frequency properties of random assignment in long-run repetitions to detect “there is or there isn’t a difference of the type Tn”.

The whole call for “replication” is really just people starting to realize that a single p value tells you “either my assumptions are wrong, or I have randomly got one of the weird samples”, and the only way to distinguish which using p values is to replicate over and over again until you’re convinced of which one it is. Maybe in another 20 or 30 years people will also finally realize that “my assumptions are wrong” is not all that informative either, so that even if a difference replicates it still isn’t really that informative about the science. I’m not holding my breath though.

Any other use of a p value amounts to saying that some other process you are calling “my experiment” is a kind of random number generator. If you want to make that claim, you must first get it to pass the battery of frequency tests. Oh, wait, you can’t repeat your experiment 100 billion times in order to generate a sample sufficient to feed into Die Harder? Ah, well, I guess you’ll just have to either do actual randomization using an RNG that does pass the tests, or be Bayesian and give up on p values.
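To make the “battery of frequency tests” concrete, here is a toy stand-in for a single test of the kind Die Harder runs in bulk (this particular equidistribution check is my illustration, not part of the actual Die Harder suite): bin a large sample of the generator’s output and check the bin counts against a flat distribution with a chi-square test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One simple frequency (equidistribution) test: draw many uniforms,
# bin them, and compare the bin counts to the flat expectation
# with a chi-square test.
draws = rng.uniform(size=100_000)
counts, _ = np.histogram(draws, bins=20, range=(0.0, 1.0))
chi2, p = stats.chisquare(counts)

# A large p value here means no evidence against uniformity for this
# one test; a real battery runs many such tests on far more output.
print(f"chi-square p value: {p:.3f}")
```

The point of the comment stands: a process you merely *call* an experiment never gets subjected to anything like this, whereas a hardware or software RNG can be.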

This is not a statement of fact; rather, it illustrates the inherent absurdity of Frequentism when pushed to extremes. What happens when the researcher presses the key at the same moment they look at the data? Does it go into a quantum superposition of legitimate and illegitimate p-values?

There’s much discussion in the frequentist literature of what to condition on in a hypothesis test. For example, suppose you do a simple experiment in which you first choose the sample size N at random. You can define p-values conditional on N or unconditional on N, and these give different answers. This particular problem might seem too silly to be interesting but the same issues arise in more complicated settings.

What are different things that might be conditioned on here?

Also, is there a typo in his name? Should it be “Wynar”?

Note that “exploring the data” doesn’t mean roaming around until you find a “good” p value. For an eye-opening account of what it means to explore data carefully, try Cleveland’s book “Visualizing Data”.
