Alain Content writes:

I am a psycholinguist who teaches statistics (and also sometimes publishes in Psych Sci).

I am writing because as I am preparing for some future lessons, I fall back on a very basic question which has been worrying me for some time, related to the reasoning underlying NHST [null hypothesis significance testing].

Put simply, what is the rational justification for considering the probability of the test statistic and any more extreme value of it?

I know of course that the point value probability cannot be used, but I can’t figure the reasoning behind the choice of any more extreme value. I mean, wouldn’t it be as valid (or invalid) to consider for instance the probability of some (conventionally) fixed interval around the observed value? (My null hypothesis is that there is no difference between Belgians and Americans in chocolate consumption. I find a mean difference of say 3 kgs. I decide to reject H0 based on the probability of [2.9-3.1].)

My reply: There are 2 things going on:

*1. The logic of NHST.* To get this out of the way, I don’t like it. As we’ve discussed from time to time, NHST is all about rejecting straw-man hypothesis B and then using this to claim support for the researcher’s desired hypothesis A. The trouble is that both models are false, and typically the desired hypothesis A is not even clearly specified.

In your example, the true answer is easy: different people consume different amounts of chocolate. And the averages for two countries will differ. The average also differs from year to year, so a more relevant question might be how large are the differences between countries, compared to the variation over time, the variation across states within a country, the variation across age groups, etc.

*2. The use of tail-area probabilities as a measure of model fit.* This has been controversial. I don’t have much to say on this. On one hand, if a p-value is extreme, it does seem like we learn something about model fit. If you’re seeing p=.00001, that does seem notable. On the other hand, maybe there are other ways to see this sort of lack of fit. In my 1996 paper with Meng and Stern on posterior predictive checks, we did some p-values, but now I’m much more likely to perform a graphical model check.

In any case, you really can’t use p-values to *compare* model fits or to compare datasets. This example illustrates the failure of the common approach of using p-value as a data summary.

My main message is to use model checks (tail area probabilities, graphical diagnostics, whatever) to probe flaws in the model you want to fit—not as a way to reject null hypotheses.

I’ll try, not that I can defend all of this by any means, or even like it. It’s called NHST for a reason. The “NH” has real content. The null hypothesis has been given a privileged status. It’s not just whether Hypothesis A or Hypothesis B fits the data better, it’s that the null is the champ and you’ve got to knock it out. Consequently, you only reject the null when your actual result is far away from what the null says. (Note: not the null and things close to the null, which is what Alain’s method gives you.) While you could use the likelihood ratio of the result observed to the null likelihood under the null (that’s what Royall advocates) which then avoids all the tail area calculations and also conforms to the likelihood principle in which results you didn’t observe are irrelevant, the tail area (or better, the two tail areas farther from the point null than what you observed) are all the results that would have been even more surprising than the ones you actually observed. If there isn’t a lot more surprising under the null than what you got, then you’re suspicious of the null.

“It’s not just whether Hypothesis A or Hypothesis B fits the data better, it’s that the null is the champ and you’ve got to knock it out.”

Imagine a range of values that your observation can take and assign intervals to different statistical hypotheses. The values consistent with the Null are marked N, and with the alternative are marked a:

MIN-aaaaNaaaa-MAX

Or sometimes we have a directional alternative hypothesis a and “significant” results in the other direction correspond to some other hypothesis b:

MIN-bbbbNaaaa-MAX

The null hypothesis N doesn’t look like the champ, he’s the heavily handicapped underdog because he’s the only one who is precise, the others are vague and cover a huge range of possible outcomes. In fact, better names would be:

“Null hypothesis”=”Precise statistical hypothesis”

“Alternative hypothesis”=”Vague statistical hypothesis”

We want our precise statistical hypothesis to correspond to the research hypothesis and “something else is going on” to the vague statistical hypothesis. Then it is an impressive feat for the research hypothesis when we fail to reject the precise statistical hypothesis.

The research hypothesis has then been put to a severe test. The more precision we can get, the more severe the test. Therefore, the more impressive the failure to reject. It becomes more difficult to think of alternate research hypotheses that also predict the observation will fall within the same precise range of values.

I think I’ve written hypothesis enough for one day.

To me, the most sensible use of p-values is in “pure significance tests”, in which there is a null hypothesis but no clearly formulated alternative hypothesis. Instead, the choice of test statistic highlights alternatives in some direction, without requiring one to formalize exactly what the alternative data distribution might be. The test statistic in this context is designed so that “more extreme” corresponds to “something that might well be more plausible under some sort of ill-defined alternative than under the null hypothesis”. So it then makes sense to look at the probability (under the null) of the observed value of the test statistic or a more extreme value.

This doesn’t make sense in all circumstances. Consider the null that x is from the Uniform(-1,+1) distribution, and the alternatives you have in mind are that x is from the Uniform(m-1,m+1) distribution for some non-zero m. As test statistic, you use |x|. If |x|=0.99, it doesn’t actually make sense to say that the p-value of 0.01 is evidence against the null, since this value has no higher probability (density) under any alternative. (There might be some other set of possible alternatives under which this p-value would be evidence against the null, however.)
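The uniform example above can be checked with a few lines of arithmetic. This is only a sketch: the helper `uniform_density` and the grid of shifts `m` are illustrative choices, not anything from the original comment.

```python
def uniform_density(x, center):
    """Density of Uniform(center - 1, center + 1) evaluated at x."""
    return 0.5 if center - 1 <= x <= center + 1 else 0.0

x_obs = 0.99

# Tail-area p-value for the statistic |x| under the null Uniform(-1, +1):
# P(|X| >= 0.99) = 1 - 0.99.
p_value = 1 - abs(x_obs)

# Density of the observation under the null (m = 0).
null_density = uniform_density(x_obs, 0.0)

# Density under a grid of non-zero shifts m (the grid is an assumption
# made for illustration; any non-zero m gives either 0.5 or 0.0).
shifts = [k / 100 for k in range(-200, 201) if k != 0]
best_alt_density = max(uniform_density(x_obs, m) for m in shifts)

print(round(p_value, 4), null_density, best_alt_density)
```

The p-value is small, yet the best alternative density equals the null density, so the likelihood ratio never favours any of these alternatives over the null.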

Fisher, I believe, would have suggested you think of repeated studies, with the first study and its p-values used simply to suggest whether the repeated studies are likely to be worthwhile or not – see – http://andrewgelman.com/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/

What a great question.

The construction of the test statistic – where more extreme values are meant to indicate larger discrepancies of interest from the null – does make it “sensible” to consider tail areas, as Radford notes. But I haven’t seen any stand-alone argument for why one always has to look at the tail probabilities beyond the observed T – just like one doesn’t have to focus on tails of posterior distributions.

It’s certainly convenient to use tail probabilities – for example, we don’t have to specify how far either side of the observed T might be worth considering, and the tail probability respects order-preserving transformations of the test statistic, and the calibration of the p-value is simple. Going beyond (pure) significance tests and bringing in issues of power – and related quantities – then one does get motivations beyond convenience; the optimal tests are those that use tail areas, at least for some definitions of “optimal”, so not using them throws away power.

But without that step, I think it’s like arguing whether the mean or median is a better summary of central tendency. There is no uniquely right answer, and depending on what you think good “reasoning” is, different choices will win out.

NB Keith: I think pure significance tests are compatible with Fisher’s significance test (where a null can be rejected but no alternative is specified or accepted) – reporting p alone just summarizes all the significance tests one might do, at different alphas.

Ken:

I agree. Tail-area seems perfectly intuitive until you look at it too closely. For example, we’re flipping coins and we’re supposed to have a binomial(100, 0.5) distribution. We observe 60 heads. So what’s the probability of seeing 60/61/62/63/64/65/etc. (forgetting about the whole two-tailed thing for now)?

But why not ask what’s the probability of seeing something close to 60? 58/59/60/61/62, for example?
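For concreteness, both quantities in the coin example can be computed exactly. A minimal sketch using only the standard library; the helper name `binom_pmf` is just for illustration.

```python
from math import comb

def binom_pmf(k, n=100, p=0.5):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# One-sided tail area: 60 heads or more.
tail = sum(binom_pmf(k) for k in range(60, 101))

# "Something close to 60": the window 58..62 from the comment above.
window = sum(binom_pmf(k) for k in range(58, 63))

print(f"P(X >= 60)       = {tail:.4f}")
print(f"P(58 <= X <= 62) = {window:.4f}")
```

The tail area comes out near 0.028 and the window near 0.061, so the two summaries would lead to different decisions at a 5% threshold even though they describe the same observation.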

One answer is that, if we define the tail area test, and if the null hypothesis is true, and if it’s a simple (non-composite) null hypothesis, then the probability of rejecting at the 5% level is no more than 5%. But that’s a pretty thin reed on which to hang all the burden that’s placed on p-values as a measure of evidence.

Yes, your example with nominal level 5% and true level not more than 5% is what I meant by “calibration”. As (of course) tail areas are not the only way to get calibration – there’s e.g. the very extreme, much-discussed example of ignoring the data and generating p from U(0,1) – then this property doesn’t force one to use tail probabilities.
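The “ignore the data” example is easy to simulate. The sketch below assumes 100,000 hypothetical studies and a fixed seed; the function name `ignore_data_p_value` is made up for illustration.

```python
import random

def ignore_data_p_value(rng):
    """A 'p-value' that never looks at the data: just a U(0,1) draw."""
    return rng.random()

rng = random.Random(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_studies = 100_000      # illustrative number of simulated studies

rejections = sum(ignore_data_p_value(rng) <= alpha for _ in range(n_studies))
rejection_rate = rejections / n_studies

print(f"rejection rate at alpha={alpha}: {rejection_rate:.4f}")
```

The rejection rate sits close to 5% whatever the truth is, which is the sense in which this procedure is calibrated while carrying no information about the null.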

If Alain’s reading this, it’d be fun to know what he told the class.

There is no real justification for it and consequently it breaks down easily. Although it’s common in complicated high dimensional models where no analysis of rejection regions is practical to stick to simple tail area reasoning (usually using simple averages as a statistic) with the view this is safe, the exact opposite is true. It’s very dangerous and can easily, if not usually, lead to disastrous inferences. See here for a very stripped down version of this.

http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism/

As Jeffreys explained 75 years ago, the tail area reasoning was developed because Frequentists weren’t able to use something like P(Model_1|Data)/P(Model_2|Data) proportional to P(Data|Model_1)/P(Data|Model_2) to compare competing models. Taking ratios gives a scale to P(Data|Model_1). Frequentists rejected all this and just wanted to use P(Data|Model_1) which isn’t usable for continuous distributions. So they had to find an ad-hoc way to give P(Data|Model_1) some mass. Tail area reasoning is one way, but as the correspondent in the post noted it’s not the only way, or even the most intuitive way.

Having said all that there are real world subtleties here. Usually, we don’t explicitly consider every potential model. For example, we usually don’t create a “the data collector cheated” model. So we usually have a host of models which we could consider, or at least develop with enough effort, which we know have some possibility but we effectively set P(Model_i)=0. Getting a very low P(Data|Model_1) is a warning sign we need to take these other models seriously. It’s a warning because a low P(Data|Model_1) makes Model_1 beatable by some otherwise unpromising alternative models. See here for a longer explanation:

http://www.bayesianphilosophy.com/rough-and-ready-model-selection/

So Bayesians could work this up into a more formal approach. Using just the sum/product rules of probability theory (i.e. all that’s needed to derive Bayes Theorem) they could create from P(Data|Model_1) a kind of measure of “how easily Model_1 could be beaten by alternatives”. If you did so you’d see that this is the (Bayesian) result Fisher et al were intuitively trying to grasp at, but fell somewhat short of.

Incidentally, most commenters on that first link claimed Frequentists would never make that mistake. Yet the greatest living Frequentist Statistician claims that mistake isn’t a bug, it’s a feature:

http://errorstatistics.com/2016/01/02/sir-harold-jeffreys-tail-area-one-liner-sat-night-comedy-draft-ii-2/

Suppose I take an observed difference d_0 as grounds to reject H_0 on account of its being improbable under H_0, when in fact larger differences (larger D values) are more probable under H_0. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed with accounts that only look at the improbability of the observed outcome d_0 under H_0.

To make it clear what’s being said, suppose again there are only two possible hypotheses H_0 and H_1, and the only evidence distinguishing them is the observed Data*. Further suppose, P(Data*|H_0)=.00000001 and P(Data*|H_1)=.9, but we also have that data larger than the observed value are “more probable” so that for example P(Data greater than Data*|H_0)=.15.

For a uniform prior, a Bayesian will see this as very strong evidence for H_1 over H_0, while a Frequentist will see it as insufficient evidence to reject H_0. Mayo claims this represents the superiority of Frequentism over Bayesianism.
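The arithmetic behind this two-hypothesis example is short enough to write out. The probabilities are the ones stated above; the 0.05 threshold for the tail-area test is an assumption added for illustration.

```python
# Numbers stated in the comment above.
p_data_h0 = 1e-8      # P(Data*|H_0)
p_data_h1 = 0.9       # P(Data*|H_1)
tail_h0 = 0.15        # P(Data greater than Data*|H_0)

# Uniform prior over the two hypotheses.
prior_h0 = prior_h1 = 0.5

# Bayes' theorem for the posterior probability of H_1.
posterior_h1 = (p_data_h1 * prior_h1) / (
    p_data_h1 * prior_h1 + p_data_h0 * prior_h0
)

# With equal priors the posterior odds equal the likelihood ratio.
bayes_factor = p_data_h1 / p_data_h0

# Tail-area decision at an assumed conventional threshold of 0.05.
alpha = 0.05
rejects_h0 = tail_h0 <= alpha

print(f"posterior P(H_1|Data*) = {posterior_h1:.10f}")
print(f"likelihood ratio H_1:H_0 = {bayes_factor:.3g}")
print(f"tail-area test rejects H_0: {rejects_h0}")
```

The likelihood ratio is about 9e7 in favour of H_1 while the tail area of 0.15 does not reject H_0, which is exactly the disagreement the comment describes.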

Incidentally, most commenters on that first link claimed Frequentists would never make that mistake.

Indeed, after reading the material on that link, I thought you should rename the link from “http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism/” to “http://www.bayesianphilosophy.com/example-f-bayesian-fanaticism/”

I always thought part of the motivation for tail-areas was when one’s model gives a density over data space. In that case, and given the test statistic makes sense as a “measure of extremeness”, considering a tail area rather than the density at a point leads to an interpretable/dimensionally meaningful measure of ill-fit.

In some few cases it is a measure of ill-fit. Like I said, Fisher et al were grasping toward something real, they just didn’t come close to reaching it. Even minor variations destroy their cozy tail area reasoning. Take iid normal observations and use the mean as a statistic, but make one of the observations have dramatically lower variance than the others. That simple change already destroys their results in general. In this case, it’s simple enough that Frequentists will spot the error (sometimes) in practice and won’t do it this way, but the point is in a modern high dimensional highly heterogeneous model (think for example of one of those weather models using tens of thousands of different inputs), using simple averages with tail area reasoning is incredibly dangerous. People think it’s safe because that’s what their stat class led them to believe. It’s not.
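One way to read the unequal-variance example is in terms of how much precision the simple average gives up. The sketch below is an interpretation, not the commenter’s own calculation: it assumes ten independent normal observations of a common mean, nine with standard deviation 1 and one with standard deviation 0.01.

```python
import math

# Assumed setup: nine noisy observations and one very precise one.
sigmas = [1.0] * 9 + [0.01]
n = len(sigmas)

# Sampling variance of the plain (unweighted) mean.
var_simple = sum(s**2 for s in sigmas) / n**2

# Sampling variance of the precision-weighted mean, which weights each
# observation by 1/sigma^2.
var_weighted = 1.0 / sum(1.0 / s**2 for s in sigmas)

sd_simple = math.sqrt(var_simple)
sd_weighted = math.sqrt(var_weighted)

print(f"sd of simple mean:   {sd_simple:.4f}")
print(f"sd of weighted mean: {sd_weighted:.4f}")
print(f"ratio:               {sd_simple / sd_weighted:.1f}")
```

Under these assumptions a tail area computed from the simple mean behaves as if the data were about thirty times noisier than they really are, because the one precise observation is averaged in with the rest.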

There’s another part of this I didn’t mention. In general a single test statistic throws away some of the information in the data. So even if the tail area reasoning is otherwise ok in your particular example, the correct Bayesian method which uses the full P(Data|Model) will beat it (sometimes dramatically so if a lot of information is thrown away). This point is obscured by the simple examples we see in intro stat class because they use sufficient statistics, so it doesn’t arise.

The bottom line here is the same old story of the last 100 years of statistics. Frequentists reject the answer supplied by the basic equations of probability theory simply because they don’t understand what probabilities mean. They then use their intuition to guess at a solution. Their guess is good enough for a few simple problems, but is horribly wrong in general. Each time they realize this they try to patch it up with some ad-hoc fixes. These fixes never go far enough, so somewhere down the road they find more problems. All the while, Bayesians who stick to the basic equations of probability theory, get the right answer without any of this hassle.

The meaning of probability isn’t obscure. It’s not beliefs, it’s not frequencies. It was understood correctly several hundred years ago and possibly well before then. To quote from McElreath’s new book Statistical Rethinking:

In modest terms, Bayesian inference is no more than counting the numbers of ways things can happen, according to our assumptions. Things that can happen more ways are more plausible. And since probability theory is just a calculus for counting, this means that we can use probability theory as a general way to represent plausibility.

Once you understand this, you can bypass 95% of the utter bullshit of the last 100 years of Frequentist statistics and move on to something far more useful.

Can you actually provide an example of a frequentist who rejects “the answer supplied by the basic equations of probability theory”? Clearly, frequentists have a different interpretation of probability, and that restricts them regarding the kind of events that they can assign a probability to, but I never came across a frequentist that rejected any of the basic equations of probability.

And you may call probabilities degree of belief or degree of plausibility or degree of XXXX. The fact remains that if a frequentist tells me that the probability of this event happening is 95%, I have an idea what she or he tries to express. When a Bayesian tells me that his or her belief/plausibility for an event is 95%, I am not sure what he or she tries to communicate; but it does give me a starting point to calibrate the beliefs/plausibilities of that Bayesian against my beliefs/plausibilities.

My favourite example on this are the estimates for a catastrophic failure of a space shuttle before the launch of Challenger in 1986. NASA management put it at 1:100,000 while their engineers put it at between 1 in 100 and 1 in 300. All clearly Bayesian probabilities/beliefs/plausibilities (and Feynman made the mistake of misinterpreting these in the Frequentist sense when there was no basis for that), apparently very different, but we do not really know without knowing how the plausibilities from these different groups need to be calibrated to be comparable.

Well, if I remember correctly Feynman talked about how NASA management started with 1/100000 as the desired answer and then made people calculate individual quantities (such as failure probabilities of certain bolts) from a requirement that they get the desired answer. From this we can conclude that no credence should be given to the management’s calculations.

My memory is (presumably we could look it up, but then we would not have the fun of talking about it) that the 1/100000 management figure was due to the legislation that required a self-destruction device (that could be triggered from the command centre) on the shuttle if the chance for a failure was higher than this number. I do not remember that he described any calculations of the kind that you are alluding to. OTOH, Gigerenzer describes in his book that such calculations were done to determine the chances of a successful launch of an Ariane rocket (he calls the probabilities used in these kinds of calculations “propensity based probabilities”).

In either case, none of these calculations were done using probabilities that are based on a Frequentist interpretation, hence the Frequentist interpretation of the final number by Feynman was not really warranted.

Apparently, I cannot reply to your later comment (below) anymore, but I think you are quite correct. Since in a Frequentist interpretation of probabilities one cannot assign probabilities to fixed quantities, Frequentists do not think that they are denying probabilities.

Aside from Good’s classification of Bayesians into 46656 categories, there seems to be a more fundamental dichotomy. Those that see the parameters as random variables (i.e. priors as part of the data generating process) and those that see parameters as fixed but unknown quantities and view priors as expression of prior belief/information/plausibilities/&c. The former would definitely have to justify their view for problems in which it is known that the parameter of interest does not behave like a random variable (see Thompson, “The Nature of Statistical Evidence”, Springer LNS 189, for an example). The latter might be fine, but sometimes I do wonder what the justification would be for expressing prior information on a parameter via a probability distribution when you know that the parameter does not behave like a random variable.

Berwin:

I don’t know who’s reading this deep in the thread, but . . . in answer to your question about “what the justification would be for expressing prior information on a parameter via a probability distribution when you know that the parameter does not behave like a random variable,” I would answer that it’s exactly the same justification that a practicing statistician has when using a logistic regression to model survey responses or votes or whatever. People’s responses to a question such as, “Do you support the death penalty for a person convicted of murder?” are not random, but it can still be useful to model them as such. You could of course go to an approach to statistics that avoids probability models entirely (except for the rare situations where you’re modeling coin flips, die rolls, radioactive decay, etc.), but in that case your quarrel will be with most of statistics, nothing special about Bayesians.

Can you actually provide an example of a frequentist who rejects “the answer supplied by the basic equations of probability theory”?

Yes. If you use the equations of probability theory, then everything you do must be consistent with deductive logic. In other words, your results can’t be in conflict with what can be deduced from the same assumptions in a given problem. Deductive logic is a limiting case of the basic equations of probability theory obviously.

As is well known, Confidence Intervals in real problems can sometimes cover values all of which are provably impossible from the same assumptions used to derive the confidence interval (i.e. you might get a 95% CI which is entirely negative for something which has to be positive for example).

I know frequentists don’t understand Bayesian probabilities. If they did they would be Bayesians. But it’s not that difficult. A Bayesian saying P(H|E)=.95 means roughly that 95% of the possibilities left unknown by E favor H. Or alternatively the truth of H is fairly robust with respect to things left unknown by E.

That doesn’t look much like an example.

Uh? So which part are you denying?

(1) Confidence intervals can sometimes give logically impossible ranges in real problems?

or

(2) This implies Confidence Intervals sometimes contradict the basic equations of probability theory since if CI’s were completely consistent with them then CI’s would be completely consistent with deductive logic as well.

Since you haven’t figured out Google yet, here’s an example:

http://bayes.wustl.edu/etj/articles/confidence.pdf

the example is about half way down page numbered 196 and continues through page numbered 201. Start where it says “the following problem has occurred in several industrial quality control situations.”

How is this even controversial? The entire raison d’être for Frequentism is to limit the use of the basic equations of probability to a restricted class of variables (r.v.’s), and then use other methods for inference when dealing with variables outside that class.

Obviously the resulting Frequentist methods will contradict to some extent what you’d get if you consistently used the basic equations of probability theory for all variables to do inference. The examples involving CI’s which only contain logically impossible values are just an extreme instance of this.

Laplace: I think the controversial thing is that Frequentists don’t deny the basic equations of probability theory when working with random variables… and they deny that the equations are appropriate when not working with random variables, and therefore don’t feel like they’re denying probability theory in those cases, since the things they do deny they don’t call an application of probability theory…

It’s a little like if you only believe that color theory is applicable to photons and not to say chemical pigments… when you fail to predict how pigments work… it’s not (in your mind) because you deny color theory, since you’re perfectly happy with color theory applied to beams of light.

I still think you’re right. But I understand why Frequentists don’t think they’re denying probability theory.

If a Bayesian were to say “P(H|E)=.95 means roughly that 95% of the possibilities left unknown by E favor H”, then he or she would interpret the probability P(H|E) in a Frequentist way, and that would need some justification. I agree with your alternative that “the truth of H is fairly robust with respect to things left unknown by E”; except that you left the bit about “given what my belief system is” out.

Statistics is, inherently, inductive. So I am not sure where your claim of necessity to “be consistent with deductive logic” comes from. In fact, if we take this normative approach, then in the limit we should all have the same posteriors, no matter what priors we started off with. So why do we not live in one-party states where everybody believes the same? Or have humans not yet collected enough data to all arrive at the same posterior? ;-)

Your critique of confidence intervals seems to be based on a fundamental misunderstanding of where and when Frequentists make a probability statement (given their interpretation of probabilities) and realisations of confidence intervals on a particular data set. Even Jaynes notes in the example that you give that “any numbers y1, y2 satisfying F(y2)-F(y1)=0.9 determine a 90% confidence interval” and then goes on to choose a particular set. But there are other choices that can be made beforehand, e.g. on what statistics to choose to base the CI on. Jaynes chooses the sample mean. These days (not sure what the state of the art was when Jaynes wrote his article) Frequentist statistical theory prefers to base CI on (asymptotic) pivotal quantities. I will leave it as an exercise to you to determine what the pivotal quantity is in the example that Jaynes uses, and what the resulting CI is.
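As a rough sketch of what a pivot-based interval could look like, suppose the model in the linked example is a shifted exponential with density exp(θ − x) for x > θ. That model, the choice of pivot (the sample minimum minus θ), and the three failure times below are all assumptions made for illustration; none of them are given in this thread. Under that model the minimum of n observations minus θ is exponential with rate n, so an interval follows directly.

```python
import math

def pivot_interval(data, level=0.90):
    """Confidence interval for theta from the pivot min(data) - theta.

    Assumes the shifted-exponential model p(x|theta) = exp(theta - x)
    for x > theta, under which min(data) - theta ~ Exponential(rate=n).
    """
    n = len(data)
    x_min = min(data)
    # Solve P(x_min - theta <= q) = 1 - exp(-n * q) = level for q.
    q = -math.log(1 - level) / n
    return x_min - q, x_min

# Illustrative failure times (hypothetical data).
lower, upper = pivot_interval([12.0, 14.0, 16.0])
print(f"90% interval for theta: ({lower:.4f}, {upper:.4f})")
```

By construction this interval can never extend above the smallest observation, so under the assumed model it never covers values of θ that the data rule out.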

While this person may be a fine psycholinguist, his question suggests that he has a very poor understanding of basic statistics. While many highly-qualified statisticians (such as Andrew) may have issues with “NHST”, I can’t imagine any of them not having a clue about where the null distribution of a test statistic comes from.

This is not meant as a personal attack against this person, but rather an expression of dismay with statistics education in general.

More specifically, it seems that this type of question would come from someone raised on “statistics for dummies”-type courses, where the instruction is of the following form: push button A to make your computer spit out test statistic T and p-value p; if p<0.05 reject the null. I find it rather scary that someone who is teaching statistics was raised on this type of course! More generally, is it surprising that we see the junk we do in Psychological Science if this is how psychologists are being taught statistics?

Anon, I do agree with you that button-pushing statistics probably dominates the statistical education of many psychologists. In my experience, at least, even graduate courses in statistics for psychologists fail to cover some very basic concepts – e.g.: a well-known and respected teacher said that ‘p-value means the probability of making a mistake’, without qualifying this definition and keeping the ‘p<0.05 = reject the null' mantra. Bayesian methods are almost unheard of. But this shallow understanding of statistics does not stop them from publishing many papers with complex Item Response or Structural Equations models, because the computer spits out a p-value (or another thresholded fit statistic) that allows an easy interpretation.

On the other hand, I do believe that the question asked by the author is legitimate. The question isn't about the sampling distribution of a test statistic, but why tail areas are used to compute p-values. Under a Neyman-Pearson setting, the tail area makes sense because it is uniformly distributed under the null, so we can control the Type I error rate in the long run.
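The uniformity claim can be checked by simulation. A minimal sketch, assuming a standard normal test statistic under the null and a one-sided upper-tail p-value; the seed, sample size, and thresholds are arbitrary choices.

```python
import math
import random

def upper_tail_p(z):
    """One-sided tail area P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

rng = random.Random(1)   # fixed seed for repeatability
n_sim = 50_000           # illustrative number of simulated null studies

# Draw the test statistic under the null and convert to a tail-area p-value.
p_values = [upper_tail_p(rng.gauss(0.0, 1.0)) for _ in range(n_sim)]

# If p is uniform under the null, P(p <= alpha) should be close to alpha.
rates = {
    alpha: sum(p <= alpha for p in p_values) / n_sim
    for alpha in (0.01, 0.05, 0.10, 0.50)
}
for alpha, rate in rates.items():
    print(f"alpha={alpha:.2f}  fraction of p <= alpha: {rate:.4f}")
```

Each fraction lands close to its alpha, which is the long-run Type I error control that the Neyman-Pearson argument relies on.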

But in a more classical Fisherian sense, in evidential reasoning with p-values, why do we care about values equal or more extreme? Fisher defines the 'P [as] the probability that [a test statistic] shall exceed any specified value'. Seems like an ad hoc way to talk about probabilities, as Laplace stated above. Fisher talks about some plausible and implausible values for p-values, suggesting the 0.05 threshold as a good thing, for some unjustified reason. Uniformity under the null, coherence (if I consider a value x extreme evidence against the null, I should consider higher values as more extreme evidence), no reasonable justification is given by Fisher.

The question is perfectly legitimate.

Instead of attacking the questioner, why don’t you answer the question? Irrespective of one’s state of knowledge, we all learn by thinking and asking questions.

Yeah, the guy (who I don’t know) asked an innocent question and he even identified himself (unlike Anon). What are we, the R mailing list? He wants to know more; isn’t that worth something?

Anon’s other point, that we are not teaching this stuff well and no wonder things are going wrong, is valid. I think there is a solution: Psych and suchlike departments should hire statisticians to teach stats classes instead of teaching it themselves.

I don’t think whether he identified himself should matter. Agree with the rest of your comment though.

In my experience, the reliance on NHST comes from binary thinking, as Andrew points out in the cited post. Several times a day I get asked “Is there an effect of ___ on ___, yes or no?” NHST nicely fits that logic. A small p-value means you “showed an effect of ___ on ___,” and a large p-value shows the converse. I want to know which came first: binary thinking in scientific practice or NHST?

binary thinking has been around a long time, i think NHST just made it easier.

Binary thinking aside, even if researchers interpreted p-values as some sort of continuous evidence statistic, there’s still the problem that effort is conflated into the p-value definition:

http://www.johnmyleswhite.com/notebook/2012/07/17/criticism-5-of-nhst-p-values-measure-effort-not-truth/

My advisor in grad school referred to NHST as “Statistical Hypothesis Inference Testing.” It makes a better acronym.

I once thought of suggesting in a SAMSI working group that the task was to consider Fitting, Understanding, Criticizing and Keeping only models that made sense. As this needs to be continually iterated, one could indicate that with an “!” mark at the end.

(Actually, I suggested FCUK, hoping criticizing before understanding would be tolerated.)

I guess I am reading the question a bit differently. I think a lot of it goes back to Fisher’s exact test, where he added up the “this or more extreme” probabilities. There isn’t exactly a conventional value to make an interval around in a contingency table. Yes proportions but …

I don’t think anyone has bothered to answer the question. So let’s look at it again.

Put simply, what is the rational justification for considering the probability of the test statistic and any more extreme value of it? Wouldn’t it be as valid (or invalid) to consider for instance the probability of some (conventionally) fixed interval around the observed value?

Answer: We use “the probability of the test statistic and any more extreme value of it” for the following reason. If it is very small it is very unlikely that the test statistic obtained was as a result of the null hypothesis being true. So it casts doubt on the validity of the null hypothesis. The smaller the value the more we doubt, but we can never be certain that the null hypothesis is untrue. The bigger the value the less we doubt. The common cut off point of 5% or p=0.05 to distinguish between true and false is entirely arbitrary. The reason we don’t consider the probability of a fixed interval around the observed test statistic is that we would be including values more likely than the test statistic actually obtained and excluding values much less likely.

In answering the question we don’t need to introduce alternative hypotheses, nor discuss evidential reasoning with p-values, nor ad-hoc ways of talking about probabilities. This is a very basic question asked by someone who is struggling to understand current practice. It is not a high-level philosophical question about the use of p-values by someone who understands what they are.

Peter, I guess my comment did not address your question directly, but I did mention two possible reasons to integrate over “equal or more extreme values”: uniform distribution of p-values under the null, and coherence in decision.
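
The first of those two reasons, the uniform distribution of p-values under the null, can be checked with a small simulation. This is only a sketch: it assumes a standard normal test statistic, a one-sided upper-tail p-value, and a fixed random seed, none of which come from the discussion itself.

```python
import random
from statistics import NormalDist

# Assumption for illustration: the test statistic is N(0, 1) when the null is true.
null = NormalDist(0.0, 1.0)
rng = random.Random(0)  # fixed seed so the run is reproducible

def upper_tail_p(z):
    """One-sided tail-area p-value for an observed statistic z."""
    return 1.0 - null.cdf(z)

# Draw many test statistics with the null true and convert each to a p-value.
p_values = [upper_tail_p(rng.gauss(0.0, 1.0)) for _ in range(20000)]

def fraction_below(alpha):
    """Share of simulated p-values at or below alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

for alpha in (0.01, 0.05, 0.10, 0.50):
    print(f"alpha={alpha:.2f}  fraction of p-values <= alpha: {fraction_below(alpha):.3f}")
```

If the tail-area p-value is uniform under the null, each printed fraction should be close to its alpha, which is what makes "reject when p <= alpha" a rule with error rate alpha.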

The reason why I think it’s a really interesting question is that, prior to more formal approaches to statistical hypothesis testing, like the Neyman-Pearson framework, there seems to be no rational reason for doing this. When Laplace analysed the babies data, he wanted to know the probability that the proportion of boys was greater than 0.5, so the tail area makes sense. But in the context of goodness-of-fit tests, from which Fisher borrowed the definition of p-values, the notion of more extreme events is not obvious.

I’m not well versed enough in math to know if there is any obvious motivation to work with tail areas besides the fact that it’s one way to obtain probability statements. It seems to me that the tail area as a definition of the p-value was an ad hoc, practical way of obtaining probability statements for hypotheses that were later better formalized and justified, as Mark explains below.

To address the emailer’s question: we use tail probabilities because these represent optimal rejection regions against alternative hypotheses for certain classes of null and alternative distributions.

Remember, for any null distribution, we must also define an alternative distribution. Then, for any event we can define both the probability of the event under the null hypothesis and the alternative hypothesis. Then we want a test procedure that partitions the sample space of possible events into those that suggest we reject the null hypothesis vs. those that suggest we accept the null. If we want to reject the null when it is true no more than alpha percent of the time, then the rejection region should have no more than alpha probability under the null. Conversely, we want this region to be highly probable under the alternative.

For situations where we can approach this task by ordering all the events using a test statistic, and the likelihood ratio between the alternative and the null is monotone (either non-decreasing or non-increasing) in that test statistic, the optimal partition of the data is one in which we look in the tail (by the Karlin-Rubin theorem). That is, the tail is the partition that has no more than alpha probability under the null, but maximum probability under the alternative.
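
A small numerical sketch of this point, under assumptions chosen only for illustration: the statistic is N(0, 1) under the null and N(2, 1) under a hypothetical alternative, and the competing "interval" rejection region starts at an arbitrary lower edge of 1.0. Both regions are built to have the same probability alpha under the null, and the code compares their probability under the alternative.

```python
from statistics import NormalDist

alpha = 0.05
null = NormalDist(0.0, 1.0)   # assumed null distribution of the statistic
alt = NormalDist(2.0, 1.0)    # hypothetical alternative (shifted mean)

# Region A: the upper tail, {z > cutoff}, with null probability alpha.
cutoff = null.inv_cdf(1.0 - alpha)

# Region B: an interval {lo < z < hi} with the same null probability alpha.
lo = 1.0                                   # arbitrary lower edge for the example
hi = null.inv_cdf(null.cdf(lo) + alpha)    # upper edge chosen to match alpha

def prob_tail(dist):
    """Probability of the tail region under a given distribution."""
    return 1.0 - dist.cdf(cutoff)

def prob_interval(dist):
    """Probability of the interval region under a given distribution."""
    return dist.cdf(hi) - dist.cdf(lo)

print(f"size:  tail={prob_tail(null):.4f}  interval={prob_interval(null):.4f}")
print(f"power: tail={prob_tail(alt):.4f}  interval={prob_interval(alt):.4f}")
```

With these particular numbers the two regions have the same size, but the tail region is far more probable under the alternative, which is the sense in which it is the better rejection region here.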

What can confuse this process sometimes (and others note this in their responses) is that we don’t always clearly define the alternative. Rather, we define the test statistic and work the logic in the other direction with the idea that the tails of the test statistic are optimal against some alternative, which is defined by the test statistic but not immediately obvious.

While I’ve been using the language of “reject-accept” based on a fixed alpha level, all this plays through to p-values, which are sometimes known as the “observed significance level”. That is, the alpha level you would have set to make your data just at the border of the rejection region.
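
To make the "observed significance level" reading concrete, here is a minimal sketch, assuming a one-sided z-test with a standard normal null and a hypothetical observed statistic of 2.1. It checks that the p-value is the alpha at which the observed value sits on the border of the rejection region.

```python
from statistics import NormalDist

# Assumption for illustration: one-sided test, statistic N(0, 1) under the null.
null = NormalDist(0.0, 1.0)

def rejects(z_obs, alpha):
    """Fixed-level rule: reject when z_obs falls in the upper-tail region of size alpha."""
    return z_obs > null.inv_cdf(1.0 - alpha)

def p_value(z_obs):
    """Tail area beyond the observed statistic."""
    return 1.0 - null.cdf(z_obs)

z_obs = 2.1            # hypothetical observed statistic
p = p_value(z_obs)

print(f"p-value = {p:.4f}")
print("reject at alpha slightly above p:", rejects(z_obs, p * 1.01))
print("reject at alpha slightly below p:", rejects(z_obs, p * 0.99))
```

Any alpha larger than the p-value leads to rejection and any smaller alpha does not, so reporting the p-value lets each reader apply their own fixed-level rule.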

As to the second question, defining an interval, there is an increasing interest in defining tests based on “substantive significance” (http://www.carlislerainey.com/papers/nme.pdf). For example, if a change of 0.1 units isn’t important, but 1 unit would be, test the null hypothesis that the difference of means for the two groups is between -1 and 1 against the alternative that it is either less than -1 or greater than 1. Equivalently, use a confidence interval, which can best be described as the set of null hypotheses that would not be rejected at a given alpha level.
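
The description of a confidence interval as the set of null values that would not be rejected can be sketched numerically. The example below assumes a two-sided z-test with a known standard error, and the estimate (3.0) and standard error (0.8) are hypothetical numbers chosen for illustration. The final check, rejecting the interval null when the whole confidence interval lies outside [-1, 1], is one simple conservative approach and is not necessarily the procedure used in the linked paper.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

def two_sided_p(estimate, se, null_value):
    """Two-sided z-test p-value for the point null 'true difference = null_value'."""
    z = abs(estimate - null_value) / se
    return 2.0 * (1.0 - std_normal.cdf(z))

def confidence_interval(estimate, se, alpha=0.05):
    """Usual (1 - alpha) z-interval around the estimate."""
    half = std_normal.inv_cdf(1.0 - alpha / 2.0) * se
    return estimate - half, estimate + half

estimate, se, alpha = 3.0, 0.8, 0.05   # hypothetical difference of means and standard error
lower, upper = confidence_interval(estimate, se, alpha)

# Scan a grid of candidate null values and keep those the test does not reject.
grid = [i / 100 for i in range(-200, 801)]
not_rejected = [d for d in grid if two_sided_p(estimate, se, d) > alpha]

print(f"confidence interval: ({lower:.3f}, {upper:.3f})")
print(f"non-rejected nulls on the grid: {min(not_rejected):.2f} to {max(not_rejected):.2f}")

# Conservative check of the interval null |difference| <= 1:
# reject only if every value in [-1, 1] lies outside the confidence interval.
interval_null_rejected = lower > 1.0 or upper < -1.0
print("interval null rejected:", interval_null_rejected)
```

Up to the grid spacing, the non-rejected null values coincide with the confidence interval, which is the equivalence described above.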

Hope this clarifies the reasoning of using tails. I’ll leave it to others to debate whether the goals of hypothesis testing are worth the effort.

A quick comment: The use of outcomes less discrepant from Ho than the observed isn’t at all mysterious if you understand how falsification works in the real world (where you’d rarely have a logical contradiction between general claims and data). In realistic cases, we need statistical falsification rules. Any data will be improbable under Ho in some respect, so we clearly don’t want to look merely at Pr(x;Ho). Nor do we want the character to be selected based on the data. Such improbable events aren’t of interest to science, because they aren’t genuine physical effects. Only deviations that may be reliably brought about count as genuine discrepancies from Ho, and thus statistically falsify Ho. That’s why Fisher insisted we’re not interested in isolated small P-values (you had to know how to bring about differences that rarely fail to be statistically significant). He required a high probability of a smaller deviation than d*, under the assumption of Ho, before regarding it statistically significant. (i.e., Pr( P-value > alpha; Ho) = 1 – alpha.) The tail area let him get sensible results without an explicit alternative; but you still needed a sensible test statistic d. (See this post http://errorstatistics.com/2016/01/02/sir-harold-jeffreys-tail-area-one-liner-sat-night-comedy-draft-ii-2/)

Now Neyman and Pearson don’t start with a tail area; it drops out of their requirement for a reliable falsification/no-falsification rule. Beginning with Fisher’s intuition (we need a sensible test statistic d(x), and the ability to compute Pr(d(x) > do;Ho)), they deduced rules of form: infer (the data indicate) a discrepancy from Ho whenever d(x) > d*. Of course they put it in terms of a rule: reject Ho at level alpha iff d(x) > d*. Neyman-Pearson, like Fisher, required the consideration of the probability of {d(x) >d*} under Ho –because the probability had to be associated with the method or rule (not assigned to the statistical hypothesis, unless there was a legitimate prior). But unlike Fisher, Neyman-Pearson also required considering the probability of {d(x) >d*} under statistical alternatives to Ho. Also, the null and alternative had to exhaust the parameter space. The “tail area” falls out from their test requirements: the rule should have a high probability of inferring discrepancies just to the extent they exist. By specifying the characteristic of the data pre-data, they could avoid the ills of cherry-picking and post data selection effects (sufficiently low type I error), and ensure tests had adequate capability to discern discrepancies of interest (adequate power). Post-data, power was designed to be used to ascertain if a non-rejection was indicative of a “confirmation”—i.e., evidence of the absence of discrepancies (of specified size) from the null. Well, this is way too quick, but I wanted to say something about tail areas.

The NHST animal is generally reserved for a highly fallacious monster in which a low p-value is taken to warrant a research hypothesis. Worse, biasing selection effects are apparently permitted so that the reported P-value has no relation to the actual P-value. Perhaps some people have been taught this fallacious method. Often it arises as a convenient straw method to attack.

Mayo:

Yes, I agree that it is important to distinguish NHST (the procedure by which rejection of a straw-man null hypothesis is taken as evidence in favor of some other model) from the p-value or tail-area probability.

From one direction, NHST does not require p-values. You can do NHST using Bayes factors and I hate that too.

From the other direction, p-values can be used for purposes other than NHST. You can use p-values to evaluate some observed lack of fit of model to data, as I have recommended at times in chapter 6 of BDA and elsewhere. This operation has some logical holes (it’s not quite clear what to make of a statement such as, “If the model is true and we were to repeat this experiment 100 times, we’d expect 98 of these replications to show a higher value T(y.rep) than the T(y) in the observed data”) but it still seems to me that it can be useful. In any case, love it or hate it, it’s not NHST because it’s about assessing failures of an existing model, it’s not about rejecting a null.

So, yes, NHST does not require p-values, and the use of p-values does not imply NHST.

Grossman 2011, Ch. 7 has a good discussion of this issue from the perspective of an advocate of the Likelihood Principle: http://bunny.xeny.net/linked/Grossman-statistical-inference.pdf

Thanks for the reference! The summary has already caught my eye.

And this:

“If you teach statistics to bright undergraduates, you find that occasionally a student asks, “Yes, but why do we calculate the tail area?” I know of only two justifications for this grouping strategy, and only one of them makes it seem more than ad hoc. The first justification is that it works: it gives answers which, by and large, are not counter-intuitive. But then, sometimes they are counter-intuitive. The other justification — the less ad hoc one — is that it can be shown mathematically that this grouping strategy gives the same results as a Bayesian analysis of the data in a variety of moderately common cases (Deely & Lindley 1981). Accepting the correctness of a Bayesian procedure is the best way I know to answer the student’s question. Should the student not want to accept Bayesianism then I can see no answer to her question. And should the student want to accept Bayesianism, she would presumably see no reason for calculating Frequentist tail areas.

Despite its ad hocness, this is the path that applied statistics has taken: one calculates Frequentist tail areas. Since no-one (to the best of my knowledge) has suggested any other completely general way to instantiate error rates statistically (apart from confidence intervals, discussed below), we will have to take this as a given for the moment.” (p. 188-189)

Now, he doesn’t mention the points explained above by Mark and Mayo about how tail areas have desirable properties under the Neyman-Pearson framework (unless this has something to do with the Deely and Lindley reference, which I haven’t read). But, on the same page, he argues (in a footnote!) that the rejection region doesn’t have to be tail areas, according to Neyman-Pearson theory! I guess I will have to read the full chapter to understand it better.

If you focus on the “Why tail probabilities” question, and also want a sample value that goes down when the discrepancy goes up, then consider that

1. Real data values are discrete and can be tied so that

2. Empirical sampling distributions (randomization or resampling/sub-sampling/bootstrap) are also discrete and can have ties,

3. Empirical point probabilities aren’t monotone, but observable survival functions are, by definition.

So the tail area gives a “nicer” answer than the empirical point or local region probabilities. There is also the small matter that the point probability goes down when the sample size goes up. That doesn’t happen with tail probabilities, and it also allows convenient mathematical approximations (I don’t think I’ve ever analyzed real-world Gaussian or Weibull data, but they sure are useful approximations).
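
The remark about sample size can be illustrated with an exact binomial calculation. This is only a sketch under assumptions picked for the example: a fair-coin null, and an observed count held at two standard deviations above the mean as the sample size grows.

```python
from math import comb, isqrt

def point_prob(n, k):
    """P(K = k) for K ~ Binomial(n, 1/2), computed with exact integers."""
    return comb(n, k) / 2**n

def tail_prob(n, k):
    """P(K >= k) for K ~ Binomial(n, 1/2), computed with exact integers."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

results = []
for n in (100, 400, 1600):
    # The standard deviation is sqrt(n)/2, so this count is 2 standard deviations above n/2.
    k = n // 2 + isqrt(n)
    results.append((n, k, point_prob(n, k), tail_prob(n, k)))

for n, k, point, tail in results:
    print(f"n={n:5d}  k={k:4d}  point={point:.5f}  tail={tail:.5f}")
```

In this setup the point probability of an equally "surprising" count roughly halves each time the sample size is quadrupled, while the tail probability stays in about the same range.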