A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

As D.M.C. would say, bad meaning bad not bad meaning good.

Deborah Mayo points to this terrible, terrible definition of statistical significance from the Agency for Healthcare Research and Quality:

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05). Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

The definition is wrong, as is the example. I mean, really wrong. So wrong that it’s perversely impressive how many errors they managed to pack into two brief paragraphs:

1. I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right? You could try to give them some slack and assume they meant, “whether the results of a study represent a true pattern in the general population” or something like that—but, even so, it’s not clear what is meant by “true.”

2. Even if you could somehow get some definition of “likely to be true,” that is not what statistical significance is about. It’s just not.

3. “Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.” Ummm, this is close, if you replace “an effect” with “a difference at least as large as what was observed” and if you append “conditional on there being a zero underlying effect.” Of course in real life there are very few zero underlying effects (I hope the Agency for Healthcare Research and Quality mostly studies treatments with positive effects!), hence the irrelevance of statistical significance to relevant questions in this field.

4. “The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).” No no no no no. As has been often said, the p-value is a measure of sample size. And, even conditional on sample size, and conditional on measurement error and variation between people, the probability that the results are true (whatever exactly that means) depends strongly on what is being studied, what Tversky and Kahneman called the base rate.

5. As Mayo points out, it’s sloppy to use “likely” to talk about probability.

6. “Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).” Ummmm, yes, I guess that’s correct. Lots of ignorant researchers believe this. I suppose that, without this belief, Psychological Science would have difficulty filling its pages, and Science, Nature, and PPNAS would have no social science papers to publish and they’d have to go back to their traditional plan of publishing papers in the biological and physical sciences.

7. “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.” Hahahahahaha. Funny. What’s really amusing is that they hyperlink “probability” so we can learn more technical stuff from them. OK, I’ll bite, I’ll follow the link:

Probability

Definition: The likelihood (or chance) that an event will occur. In a clinical research study, it is the number of times a condition or event occurs in a study group divided by the number of people being studied.

Example: For example, a group of adult men who had chest pain when they walked had diagnostic tests to find the cause of the pain. Eighty-five percent were found to have a type of heart disease known as coronary artery disease. The probability of coronary artery disease in men who have chest pain with walking is 85 percent.

Fuuuuuuuuuuuuuuuck. No no no no no. First, of course “likelihood” has a technical use which is not the same as what they say. Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.

It’s refreshing to see these sorts of errors out in the open, though. If someone writing a tutorial makes these huge, huge errors, you can see how everyday researchers make these mistakes too.

For example:

A pair of researchers find that, for a certain group of women they are studying, three times as many are wearing red or pink shirts during days 6-14 of their monthly cycle (which the researchers, in their youthful ignorance, were led to believe were the most fertile days of the month). Therefore, the probability (see above definition) of wearing red or pink is three times more likely during these days. And the result is statistically significant (see above definition), so the results are probably true. That pretty much covers it.

All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data. But now I see it. It’s the two steps: (a) the observed frequency is the probability, (b) if p less than .05 then the result is probably real. Plus, the intellectual incentive of having your pet theory confirmed, and the professional incentive of getting published in the tabloids. But underlying all this are the wrong definitions of “probability” and “statistical significance.”

Who wrote these definitions in this U.S. government document, I wonder? I went all over the webpage and couldn’t find any list of authors. This relates to a recurring point made by Basbøll and myself: it’s hard to know what to do with a piece of writing if you don’t know where it came from. Basbøll and I wrote about this in the context of plagiarism (a statistical analogy would be the statement that it can be hard to effectively use a statistical method if the person who wrote it up doesn’t understand it himself), but really the point is more general. If this article on statistical significance had an author of record, we could examine the author’s qualifications, possibly contact him or her, see other things written by the same author, etc. Without this, we’re stuck.

Wikipedia articles typically don’t have named authors, but the authors do have online handles and they thus take responsibility for their words. Also Wikipedia requires sources. There are no sources given for these two paragraphs on statistical significance which are so full of errors.

What, then?

The question then arises: how should statistical significance be defined in one paragraph for the layperson? I think the solution is, if you’re not gonna be rigorous, don’t fake it.

Here’s my try.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.
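
For concreteness, here is a minimal R sketch of the definition above; the two-group setup, effect size, and sample size are all invented for illustration.

set.seed(1)
drug_a <- rnorm(50, mean = 0.3, sd = 1)  # hypothetical outcomes under Drug A
drug_b <- rnorm(50, mean = 0.0, sd = 1)  # hypothetical outcomes under Drug B
t.test(drug_a, drug_b)$p.value
## This p-value is the probability of a difference in means at least this large
## if the two drugs truly had identical effects; by the convention above,
## p < 0.05 would be declared "statistically significant."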

I think that’s better than their definition. Of course, I’m an experienced author of statistics textbooks so I should be able to correctly and concisely define p-values and statistical significance. But . . . the government could’ve asked me to do this for them! I’d’ve done it. It only took me 10 minutes! Would I write the whole glossary for them? Maybe not. But at least they’d have a correct definition of statistical significance.

I guess they can go back now and change it.

Just to be clear, I’m not trying to slag on whoever prepared this document. I’m sure they did the best they could, they just didn’t know any better. It would be as if someone asked me to write a glossary about medicine. The flaw lies with whoever commissioned the glossary and didn’t run it by some expert to check. Or maybe they could’ve just omitted the glossary entirely, as these topics are covered in standard textbooks.

[Screenshot: Screen Shot 2015-07-18 at 3.40.10 PM]

P.S. And whassup with that ugly, ugly logo? It’s the U.S. government. We’re the greatest country on earth. Sure, our health-care system is famously crappy, but can’t we come up with a better logo than this? Christ.

P.P.S. Following Paul Alper’s suggestion, I made my definition more general by removing the phrase, “that the true underlying effect is zero.”

P.P.P.S. The bigger picture, though, is that I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest. If you’re gonna define statistical significance, you should do it right, but really I think all this stuff is generally misguided.

190 thoughts on “A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program”

  1. “All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data”

    Easy.

    Step 1: Collect data, create a histogram. Then pick a distribution that looks similar. Then run a test to “show” the data is “drawn” from this distribution.

    Step 2: Use this “test” to claim the histogram of future data will look roughly like the data you already have.

    Step 3: Make an assertion which will be true 95% of the time if future data really does look like the data you currently have.

    Step 4: When such assertions turn out to be true 10% of the time, indignantly claim steps 1-3 haven’t been taught correctly and refuse to admit your understanding of probabilities had anything to do with the failure.
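
    A throwaway R sketch of Steps 1-3, assuming a skewed process that gets mistaken for a normal one (the lognormal choice and n = 30 are arbitrary):

    set.seed(1)
    x <- rlnorm(30)                      # the data we happen to have (skewed)
    ks.test(x, "pnorm", mean(x), sd(x))  # Step 1: "show" the data are normal
    ## With n = 30 this test often fails to reject, so we "conclude" normality
    ## and (Steps 2-3) make 95% predictions from Normal(mean(x), sd(x));
    ## those predictions will miss badly in the long right tail of future data.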

  2. Going public, unfortunately, is like sticking your head above the parapet. Andrew’s definition of p-value ignores any situation where “zero” is not the focus. Moreover, what about one-sided nulls as in <= "zero"? In a sense there should be some sympathy for the attempt by the authors at the Agency for Healthcare Research and Quality to define the p-value; it is hard to be concise and still cover all the nuances. Simultaneously pleasing the experts and informing the lay population is a tough job.

      • “If the null hypothesis is true” is a reasonable phrasing that doesn’t specify what the null is. I agree it is very hard to explain in an understandable way, as illustrated by the many instances of people who should know better getting it wrong even when writing for more numerate audiences.

        I don’t think it particularly strengthens your argument to accuse women with Ph.D.s, writing in their own area and using the standard epidemiological definitions, of “youthful ignorance” on something not even related to your post. It just seems unnecessary when the point is about concerns about p-hacking, not about your or their interpretation of the Wilcox data on fertility windows https://www.musc.edu/mbes-ljg/Courses/Biology%20of%20Reproduction/Paper%20pdfs/Wilcox%20fertile%20window.pdf. It’s just a distraction.

        • Elin:

          The authors of the papers on fertility included both men and women.

          And, no, they weren’t using the standard values for peak fertility. Including day 6 in the peak fertility range indicates an unfamiliarity with the standard recommendations in this area.

          And I do think their youth is an issue. When my wife and I were trying to have our last child, we were in our 40s and we learned the days of the month that are the best for having babies. A young person might happen to know this information, but he or she might also not ever happen to have encountered that information. As you get older, this is the sort of biological fact you’re more likely to come across.

        • Yes, they were. If you read the literature in the field, day 6 has over a 10% risk of conception, and that’s a reasonable cutoff. You can say you want a higher cutoff and that’s fine, but their choice was absolutely standard and based on the epidemiology.

  3. “…. is a frequency, not a probability.”

    There are two definitions of probability which are so easily confused most do, but are very different. One is the old Bernoulli/Laplace/Bayesian one:

    “The probability of x is the ratio of the number of cases favorable to it to the number of cases possible”

    and the frequentist one:

    “The probability of x is the ratio of the number of times x occurs to the total number of repetitions”

    Despite a superficial similarity, the first one is a quantification of the uncertainty remaining when our knowledge/assumptions aren’t strong enough to remove all cases unfavorable to x (i.e. to deduce x) and applies even if x only occurs once or is a hypothesis. The second is a rather strong claim about the dynamical evolution of the universe and only applies if repetitions can be performed.

    • My understanding of probability does not agree with either of the definitions offered above.

      A frequentist professor of mine, one with a bent towards physics, used a definition of probability along the lines of
      The probability of x is the ratio of the number of cases favorable to it to the total number of cases in an infinite sequence of trials.

      That definition will yield the right answer if you have a real random number generator uniform on [0,1] and a process that compares the random number to 1/Pi and outputs a 1 if the number is greater than 1/Pi and a zero otherwise.

      Any finite sequence of trials will yield the wrong number.

      Bob
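
      A quick R sketch of the 1/Pi device above (run lengths arbitrary): the long-run frequency approaches 1 - 1/pi, but no finite run of trials hits it exactly.

      set.seed(2)
      true_p <- 1 - 1/pi          # probability the device outputs a 1
      mean(runif(100) > 1/pi)     # short run: a frequency near true_p
      mean(runif(1e6) > 1/pi)     # long run: closer, but still not exact
      true_p                      # approximately 0.6817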

  4. It’s also worth noting the replication projects in psychology and biology: a lot of studies did not replicate. There are complicated questions as to procedures and power. But, given the failures to replicate, the “true” talk is so so so off-base.

  5. When I was doing college statistics I remember an exam question where you had to pick the right definition of statistical significance. One of the wrong examples was worded nearly identically to the HHS wording.

    • I can’t get worked up over the quoted text. There isn’t a shred of evidence people indoctrinated in the nuances of the definitions of p-values, hypothesis testing, confidence intervals, and “significance” produce statistical analysis any better than those who can’t parrot back official verbiage.

      The whole thing has the character of physicists arguing over the definition of “impetus” they read in an 11th century scholastic work written in Latin and acting like it was important for getting the Mars rover to its destination.

        • That doesn’t get at the fact that the whole piece is castigating a website for daring to have a lay-friendly definition that doesn’t get so lost in the weeds of p-value nuance, so loved by statisticians, that it fails to actually give the reader a usable definition.

        • Alex:

          Yes, I will castigate a U.S. government agency for daring to give a misleading definition. It’s not about “nuance,” it’s about what the purported definition is conveying.

          In what sense is it “usable” to say that the probability equals the observed proportion? In what sense is it “usable” to tell people that statistical significance answers the question of “whether the results of a study are likely to be true”? How will these definitions serve the goal of helping people understand research studies? I don’t see it. Rather, I see these definitions as teaching and reinforcing confusion, the sort of confusion that’s led to the research stylings of Daryl Bem being published in a top psychology journal, the sort of confusion that led to that himmicanes and hurricanes story, and so forth.

          Simplified and reader-friendly is fine. False, not so much.

      • Still, you’re often giving examples of “Stats 101” analysis which are intentionally wrongheaded, and then someone here is inevitably coming back with “but you’ve obviously made mistake X which frequentist statisticians would never do and you’re characterizing researchers as basically ignorant and stupid” to which your reply is usually “but if you go out into the world, this is the kind of analysis you find in almost every textbook/guidelines/glossary etc” and then… Andrew went out and found one and was SHOCKED, SHOCKED I TELL YOU… to see such a poor concept of stats in some kind of official govt document…

        so, I’m pointing to this and saying “see, Anonymous isn’t so over the top after all” because this kind of stuff *IS* out there everywhere

        • I think a lot of this has to do with poor choice of words; you can see this elsewhere.

          “The p-value is not really a probability, and the toy examples we use in introductory texts aren’t really how you should use it.”

          Compare to the greenhouse effect: “The atmosphere is not really like a greenhouse, and the toy S-B law examples we use to first demonstrate it aren’t really calculating its magnitude.”

          For example:

          “I will say that I do not particularly like this model as a suitable introduction to the greenhouse effect. It is useful in many regards, but it fails to capture the physics of the greenhouse effect on account of making a good algebra lesson, and opens itself up to criticism on a number of grounds; that said, if you are going to criticize it, you need to do it right, but also be able to distinguish the difference between understood physics and simple educational tools.”
          https://www.skepticalscience.com/postma-disproved-the-greenhouse-effect.htm

          Please do not respond with arguments about the greenhouse effect existing/etc. I just mean to convey that misguided pedagogy can lead to endless confusion. It is very important to not simplify things too much.

        • Miss:

          Pedantic is in the eye of the beholder. But I think these misconceptions have real consequences. There really appear to be researchers who think that statistical significance is “a mathematical technique to measure whether the results of a study are likely to be true.” And, sure, Psychological Science is a bit of a punch line, but it’s also a leading journal in psychology. And psychology is important.

          So, although this bores you, I think it’s important.

          And it’s not just psychology research. Remember that paper a while ago on air pollution in China? Or that paper on early childhood intervention? Real decisions are on the line here, so I think it’s a bad idea to spread wrong ideas about commonly used statistical concepts.

        • Wrong conclusions have consequences, but getting these definitions just right doesn’t seem to have any effect.

          Doing analysis using classical statistics/frequentism is like trying to heal someone using the four humors theory of disease. No one would dare apply the theory literally and consistently or it would be obvious it’s junk. In reality, they use some combination of trial and error, fudge factors, rules of thumb, intuition, guesswork, and rank speculation, sprinkled over with a thin layer of four-humors (frequentist) vocabulary to make it seem respectable.

          The definitions just don’t play a big role.

        • Here’s a good example Andrew:

          http://errorstatistics.com/2015/05/04/spurious-correlations-death-by-getting-tangled-in-bedsheets-and-the-consumption-of-cheese-aris-spanos/

          It’s by an econometrician named Aris Spanos who bills himself on his CV as one of the 20 best or something. Two time series appear to move in unison. One is “death by getting tangled in bedsheets” and the other is “consumption of cheese”. After calculating the sample correlation coefficient, which is close to 1, Spanos declares “the key issue for this inference result is whether it is reliable, or statistically spurious”.

          He then spends a few slides massaging the data in whatever way he feels like doing until he gets a new correlation coefficient which isn’t “statistically significant” and declares the correlation is proven “spurious”.

          This is as close to a public admission as I’ve ever seen that classical statistics is the direct modern equivalent of reading chicken entrails to predict the future. It was obvious the correlation is spurious without any analysis and he simply worked toward that goal. You could repeat this exact same analysis on a hundred other time series, half of which were spurious and half weren’t, and you’d get a basically random combination of outcomes.

          If the numbers had been exactly the same, yet the two series were connected, Spanos would have simply found another analysis which produced a “statistically significant” correlation.

          If legerdemain like this actually worked, science would be much easier. Just collect millions of time series, pair each of them against each other, run them through a computer program without even knowing what the numbers mean and whenever you get a correlation coefficient which is “statistically significant” you’ve made a major scientific breakthrough.

          The definitions make no difference. It makes absolutely no difference whether Spanos can give the “right” definitions for “statistically significant” or not.

        • I’m surprised to see you dismiss with such ease the clear causal connection between consumption of cheese, which clearly to an econometrician would include Cheeze-Whiz, and death by Tangled Bedsheets …

        • Of course there’s no connection between the calculations and reality. Spanos just contorted the analysis to agree with his prior that it was spurious. Frequentists do this all the (every?) time, but it’s nice to see a clear example of it, so it’s worth savoring the irony.

          There is a real possibility a sizable chunk of that “spurious” correlation (rho=.94) is due to obesity or something and isn’t quite so spurious. If that’s the case then the frequentist failure here is 10 times as embarrassing.

          But what’s the purpose of identifying a correlation as spurious? The usual reason is to warn the correlation may not continue. If that’s the takeaway anyone got from Spanos’s analysis then they’re dead wrong. A correct analysis would likely show both are related to population growth or something, so the observed correlation would be expected to continue.

        • @anon

          Spanos does kind of say that in the comments:

          “A trend provides a generic way one can use to account for heterogeneity in the data in order to establish statistical adequacy…Once that is secured, one can then proceed to consider the substantive question of interest, such as a common cause. The latter can be shown by pinpointing to a substantively meaningful variable z(t) that can replace the trend in the associated regression and also satisfies the correlation connections with y(t) and x(t) required for a common cause”

          but I can agree that the original post left it pretty unclear on what then was achieved by the manipulations and seemed to contain a number of fairly arbitrary assumptions.

        • Ok, since we’re going all “straight man” on this anyway, here’s what I think of that analysis:

          1) Per capita consumption of cheese has dimensions of {mass}/{person}{time}
          2) Rate of death by tangled bedsheets has dimensions of {person}/{time}

          In the range 2000-2009 the population increases by a few percent, but this is due to babies being born, and so we can assume the adult population increases less, let’s let it be near constant.

          In the range 2000-2009 presumably based on all the hype, obesity is increasing, meaning that per capita consumption of calories has increased. Per capita consumption of calories, considering an average caloric content of the general diet, can be converted to per capita consumption of food mass, which also has dimensions of {mass}/{person}{time}. It is therefore reasonable to believe that 1) is a proxy for obesity.

          Now, the adult population was assumed to be about constant, certainly not doubling over the time period, as the death in bedsheets data does. So if we convert the death in bedsheets data to per capita, it will be like dividing by a constant, and the new per capita death in bedsheets data will have dimensions of 1/{time} and have the same shape, but I’m going to create a new dimension called “risk” which is really a dimensionless ratio of {person death}/{person alive}, so the second data series is now {risk}/{time}

          Now, clearly based on the high correlation coefficient, if we divide the death data by the cheese consumption data we get {risk}{person}/{mass}, which when you consider the definition of {risk} works out to {death}/{mass}, and considering the trend, the result will probably be nearly constant.

          Considering a Taylor series about the initial 2009 death vs person’s mass set-point, and considering that obesity is linked to death, is there any question that the first-order term in the Taylor series has a positive coefficient? Probably not. So death_t ~ death_2009 + C_mass * (mass_t-mass_2009).

          you’ll notice that C_mass has dimensions of {death}/{mass} which is exactly the dimensions of the ratio I constructed above, which is more-or-less a constant.

          We now hypothesize the following causal structure CALORIES -> BODY MASS -> RISK OF DEATH BY TANGLED BEDSHEETS

        • So I actually did the math, here’s the R code:

          ## Per capita US cheese consumption and annual deaths by becoming
          ## tangled in bedsheets, 2000-2009 (the two series discussed above).
          cheese <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8);
          death <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717);
          uspop <- 300e6+20e6*(seq(0,9)); ## approximated from graph on google

          ## Per capita deaths divided by cheese consumption, scaled by 1e8:
          ## roughly the {death}/{mass} ratio discussed above, plotted by year.
          plot(death*1e8/uspop/cheese,ylim=c(0,10));

          Sure enough, you have more or less a constant around 4.

        • Daniel: “In the range 2000-2009 presumably based on all the hype, obesity is increasing, meaning that per capita consumption of calories has increased.”

          Wrong. You can get fat and be close to starvation. E.g. you do not stop a cancer tumor growing by fasting.

          Calories in and out is only a part, if at all, of why people are getting fat. You can get fat (or lose weight) holding calorie intake constant.

        • Fernando: you CAN do those things, but on average in the population, people who are more massive also consume more calories. For the overall statistics, the special cases are not so important.

        • Fernando wrote: “E.g. you do not stop a cancer tumor growing by fasting.”

          Look up “caloric restriction cancer”…

        • http://www.tylervigen.com/spurious-correlations I spent some time in my class this semester having the students look at these and I think it was pretty memorable. And then we looked at the chocolate/Nobel prize graph and they were totally on it. And then we looked at Gapminder and it was an interesting discussion because they were really primed (uh oh) to be skeptical.

        • But in this case, see above, I think it’s reasonable that cheese and death by bedsheets really could be causally related through overall calorie consumption and the general risk of death it carries, especially cardiovascular disease (which might result in death during sex, or death on waking; the most common time for a heart attack is the morning), and that the data, when transformed in a reasonable manner, produce a constant which can be interpreted as a coefficient in a risk model that is independent of time.

          In other words. More Models, less Frequentist statistical claptrap ;-)

        • There was a person who regularly published stuff that drew wacky connections. My favorite was the finding – which got reported fairly widely – that consuming hot dogs was associated with children’s leukemia … except the effect was only significant between something like 16-20 hot dogs a week, not below and not above. Pretty much a clear message that statistical significance is fairly meaningless on its own.

        • Well, it really isn’t big news that the technical details of statistical significance are not widely understood. I don’t know exactly what should be done about it. I suppose I am glad that people are calling attention to it.

          But at least their definition has a sort of internal coherence. Yes it is totally weird to talk about small p-values as indication that ‘the results are true.’ I guess the alternative using this weird language would be that ‘the results are due to chance.’ The conceptualization is a little off, but as Anonymous notes it is not off so much that it has much effect on their substantive conclusions.

          The insistence that authors in the popular press adhere to the same standards of strict rigor as used in statistical literature can become a little silly and that is what I often find overly pedantic. Also when people say stuff like it is sloppy to use ‘likely’ when talking about probability — that’s a little hard to take seriously. But this isn’t popular press, this is research institutions and research journals, so, I agree it is important.

        • As an applied statistician, I can tell you that misunderstanding of p-values leads to LOTS of lost taxpayer money. Well worth nitpicking over.

          I work with biologists, who spend lots of money on experiments. I have people saying things like “Well, X’s results are non-reproducible because when they ran the experiment, they got p = 0.03, but when I tried to run the experiment, I got p = 0.07” or “we know that treatment 1 is significantly different than placebo and we know that treatment 2 is not significantly different than placebo, so we know that treatment 1 is better than treatment 2 (despite the estimated effects being nearly identical)”. These are really expensive mistakes (made by very high-level researchers) that come from not understanding the definition (or, moreover, the basic concept) of a p-value.

          Perhaps the problem is in making the definition precise, we have lost the ability to convey the basic concept to those who need to know?
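
          A small R simulation of that second mistake, assuming two treatments with the same true effect tested against placebo in separate experiments of modest size (all numbers invented):

          set.seed(42)
          n <- 25
          placebo1 <- rnorm(n, 0, 1); treat1 <- rnorm(n, 0.5, 1)
          placebo2 <- rnorm(n, 0, 1); treat2 <- rnorm(n, 0.5, 1)
          t.test(treat1, placebo1)$p.value  # may land below 0.05
          t.test(treat2, placebo2)$p.value  # may land above 0.05
          ## Identical underlying effects; any difference in "significance"
          ## between the two experiments is just sampling noise.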

        • I think there are two separate but related issues here:
          (1) The correct definition of the p-value
          (2) The role that p-values should play in drawing conclusions from data.

          Understanding (1) might help in establishing a reasonable view about (2), but it doesn’t seem necessary. A person could misunderstand (1) but it may be inconsequential, practically speaking, because they still have some sensible views about (2). Alternatively, a person might be solid on (1), but have badly considered views on (2).

          It sounds like your collaborators are confused about (2), but it’s not clear that this is because their views about (1) are mistaken. P-values are only one form of information, and they should be considered in the context of effect size, sample size, background information, and so on. The abuse of p-values in (2) I think often comes as much (or perhaps more) from discomfort in dealing with uncertainty, ambiguity, and the difficulty of constructing a formulaic method of incorporating all available information to arrive at conclusions, as it comes from a misunderstanding of (1).

        • I would agree with what you say.

          But a further note on the matter is that the issues you mentioned in (2) are very widespread and, in my opinion and experience, in many ways slow down the pace of science. I think these issues stem from non-statisticians trying to come up with simple rules on how to use p-values based on an overly summarized idea of what a p-value is. This occasionally leads to totally counterintuitive, counterproductive conclusions.

          It’s hard to think of a way in which such researchers could use p-values productively without thoroughly understanding what they are. I will certainly admit that giving them a very precise definition is not the same as giving them a thorough understanding.

        • “I think these issues stem from non-statisticians trying to come up with simple rules on how to use p-values based on an overly summarized idea of what a p-value is.”

          This is an instance of what I call The Game of Telephone (TGOT)* phenomenon: Some well-meaning person comes up with a simple way of explaining something, but that isn’t quite correct. This becomes adopted by others, who make further oversimplifications, and so on, until a lot of people are using a version of the concept that is far from the original.

          *The name refers to the kids’ game where they sit in a circle, one person whispers something into a neighbor’s ear, the neighbor whispers what they hear into the next person’s ear, and so on, until the “message” has gone all the way around the circle, and usually comes out hilariously different from the original.

  6. An explanation of the p-value that will satisfy both the experts and the uninitiated is akin to a statement of the Heisenberg Uncertainty Principle that will pass muster with quantum physicists and with us plain folk. From https://simple.wikipedia.org/wiki/Heisenberg%27s_uncertainty_principle:

    “Historically, the uncertainty principle has been confused with a somewhat similar effect in physics, called the observer effect. This says that measurements of some systems cannot be made without affecting the systems. Heisenberg offered such an observer effect at the quantum level as a physical ‘explanation’ of quantum uncertainty…Measurement does not mean just a process in which a physicist-observer takes part, but rather any interaction between classical and quantum objects regardless of any observer.”

    Of course, to make matters more opaque, Heisenberg was not writing in English. Just as probability and likelihood are sort of similar in plain English but technically very different to statisticians, “uncertainty” in English is only somewhat the same as “indeterminacy.” From https://en.wikipedia.org/wiki/Uncertainty_principle:

    “Throughout the main body of his original 1927 paper, written in German, Heisenberg used the word, ‘Ungenauigkeit’ (‘indeterminacy’), to describe the basic theoretical principle. Only in the endnote did he switch to the word, ‘Unsicherheit’ (‘uncertainty’). When the English-language version of Heisenberg’s textbook, The Physical Principles of the Quantum Theory, was published in 1930, however, the translation “uncertainty” was used, and it became the more commonly used term in the English language thereafter.”

  7. “Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.”

    I think you mean proportion, not frequency.

    And I think what Anonymous is saying up above is that a finite “proportion” has been posited as a definition of probability by some of the classical folks. It satisfies Kolmogorov’s Three Axioms, no?

    Anyhow, I do agree that this is not at all a modern definition of probability that would be applicable to the practice of data analysis.

    But it seems forgivable to botch the definition of probability when most of us are unsure of what it means.

    JD

    • I think Andrew’s point is that the observed frequency (or proportion) is not the probability, in the same way as the sample mean is not the population mean. My two cents.

        • But my point is simply that this type of proportion (what we’d call a sample proportion in statistics) has been used as a definition of probability, albeit perhaps not the concept of probability we’d reference in this context.

          And to reiterate, it is perhaps forgivable for them to link to such a definition since there are *many* definitions of probability and it seems that when statisticians are taught the “definition” of probability we just give them the Three Axioms. Note that the Three Axioms cannot serve alone as a definition of probability, because there are concepts that satisfy the axioms that to many of us are clearly not probabilities.

          It seems that probability and statistics texts tend to just let students retain whatever their “intuitive” understanding of probability is, perhaps only briefly making the point that there are many “concepts” of probability.

          So the question is, how would we prefer them to have defined “probability” here? Will there not be a problem with just about any definition they provide? Whether it is one of the so-called frequentist definitions (there isn’t just one), a propensity definition, a “subjective” definition? So, ok, I accept that they have linked to a definition that is one of the less relevant ones for this context. But I don’t think “defining” probability is as straightforward as it might be presented to us when we’re taking our probability and statistics courses.

          I include myself in the “unsure of what it means” category.

        • I may not know what probability is — is there even a single right answer to the question “what is the probability a National League team will win the 2030 World Series?” — but I know some things it isn’t, and one thing it isn’t is an observed frequency. If I flip a coin 8 times and get 6 heads, it’s wrong to say “the probability of getting heads with this coin is 75%.”
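
          A quick R check of the coin example, using the exact binomial interval just as a rough gauge of the uncertainty:

          binom.test(6, 8, p = 0.5)$conf.int
          ## roughly 0.35 to 0.97: six heads in eight flips pins down the heads
          ## probability only very loosely, so the observed 75% frequency is not
          ## "the probability."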

        • Are you saying that if 100 events per year occur in a population of 1000 for 5 years straight, the probability of randomly selecting a member of the population who has experienced the event is not .10?

        • James:

          Why would you think Phil is saying that?? He explicitly said he was talking about flipping a coin 8 times and getting 6 heads. Nowhere did he talk about 100 events or a population of 1000. If you have large n and random sampling, all definitions of probability converge to the same answer.

        • Perhaps I misunderstood. But this is how I read his comment:

          1. He made a general statement, ‘one thing it isn’t is an observed frequency’.
          2. He gave a single example to support this claim, ‘If I flip a coin 8 times and get 6 heads, it’s wrong to say “the probability of getting heads with this coin is 75%’.

          It seemed to me that the broad statement was not accurate and I gave an example to support that.

        • James:

          A probability is not the same thing as an observed proportion. When sample size is large and bias is low, an observed proportion is a good approximation to a probability—that’s the law of large numbers. But probability and observed proportion are different. Phil gave an example to illustrate that probability and observed proportion are different concepts. They coincide in a certain limiting case but not in general.
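
          A one-line R illustration of that limiting case (sample sizes chosen arbitrarily): the observed proportion of heads settles toward the true probability of 0.5 only as the number of flips grows.

          set.seed(7)
          sapply(c(8, 100, 10000), function(n) mean(rbinom(n, 1, 0.5)))
          ## typically well off 0.5 at n = 8 and very close at n = 10000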

        • Andrew:

          Let me see if I can articulate the distinction you are making:

          1. The probability of the event occurring in the future is not equal to the proportion of events in the past across some time frame.
          2. The probability of selecting without replacement a member of the population who experienced the event is equal to the proportion, but is conceptually different.

          Is this close to what you mean? Or am I still missing it?

        • In pure mathematics / probability theory we need some kind of a precise definition. And since mathematics doesn’t deal with making inferences from data about physical processes etc, the definition needs to be in terms of something mathematics deals with. The main thing left for probabilists to work with is large or even countably infinite sequences of numbers. If, like me, you are a fan of nonstandard analysis, you could do something like:

          “if (x_i) is a sequence of numbers, each of which is either 0 or 1, and whose length is a finite nonstandard integer N, then the probability of a given arbitrarily chosen number being a 1 is st(sum(x_i)/N)”

          But this doesn’t help us, because it’s a mathematical definition about pure mathematical objects. It seems to me that the flaw in Frequentist statistics is assuming that the same definition needs to be brought over directly into the realm of using probability to make inferences about scientific questions.

    • JD, no I wasn’t taking issue with finite proportions, but since you brought it up it’s worth mentioning the Infinite Set Irrelevance Supposition (ISIS, catchy name I know) which says

      “any issue in probability theory that only occurs for infinite sets is irrelevant to both the foundations and practice of statistics”

      • Hi Anonymous,

        “finite” wasn’t really the operative word in my mistaken rephrasing of your post. I was just, perhaps mistakenly, assuming you meant what I say up above: that what we know of in statistics as the “sample proportion” has been used as a definition of probability before, with the caveat that I acknowledge it is not the most relevant concept of probability for the context here.

        JD

  8. I also have suggested improvements for the key terms on my current blog. Gelman’s is nearly the same as mine. Feel free to add to it, maybe we can send it to them.
    http://errorstatistics.com/2015/07/17/statistical-significance-according-to-the-u-s-dept-of-health-and-human-services-i/

    Anonymous’ claimed definition for frequentist probability is wrong. It’s actually scarcely different from the other one he gives, but I have no desire to have an exchange with him.

  9. “under the null hypothesis (which is commonly the hypothesis that there is no effect)”
    I think the use of ‘effect’ here leads to confusion for laymen who interpret it causally, which is understandable since that is the colloquial meaning of the term and there’s nothing to signal that it has a different technical meaning in this context.

  10. There is a slightly different version of this. It is somehow conveyed that a p-value is not directly related to the truth of the null/alternative hypotheses (perhaps because p means probability which means you can never know for sure), so the mind substitutes “real” for “true”:

    “Definition: A mathematical technique to measure whether the results of a study are likely to be *real*. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are *real*). Researchers generally believe the results are probably *real* if the statistical significance is a P-value less than 0.05 (p<.05)."

    The confusion really is quite a fascinating subject when in a position to view it from afar.

  11. 1. Andrew wrote: “I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right?” Mike LaCour and John Lott send their regards;-) Actually, Lott doesn’t. He’s off sulking in a corner. But Mary Rosh sends her best.

    2. Are likelihood ratios a thing? Sure, I care about p(H0|data) but I care at least as much about L=p(H1|data)/p(H0|data). Am I really supposed to disregard a favorable L for a p-value of 0.06? That makes no sense to me. (For example, in the context of a Hotelling’s T2 test, computing a p-value seems like a non sequitur. Am I missing something?)

    3. Setting aside the arbitrary cutoff of 0.05, using a p-value to reject seems akin to reducing your decision hypotheses to H0 and ~H0. Why would anyone do that if they had an H1 hypothesis?

  12. There is basically just one reason to use p values, and that is to determine whether a random number generator of a certain type would be likely or unlikely to produce a certain dataset. If you have some process which might reasonably be considered to be “like a random number generator” (such as a roulette wheel, a repeated polling procedure with low non-response rate, an algorithm for picking genes to study out of the genome, a “well oiled” manufacturing process which very regularly produces similar output, or a continuous-recording seismometer) then you can use p values to see if something that comes out of that process is “weird” compared to some calibrated historical performance.

    Pretty much any other use of p values is in my opinion wrongheaded. Sometimes, we can discover something interesting by using p values, but it will always be of the form “this previously well calibrated random number generator is no longer doing what it used to”, which is hugely informative to manufacturing engineers (stop the production line and look for a broken machine!), seismologists (There’s a tiny earthquake!), casino owners (Take that slot machine offline and get it repaired!) or the like.

    Selecting small samples from processes not proven to be stationary and trying to argue that there’s a difference of interest between them due to a p value is a large part of what’s wrong with the “mythology” of Stats 101 as taught to undergrads everywhere.
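
    A minimal R sketch of that legitimate use, with made-up numbers: a production line whose output has historically been very regular, Normal(10, 0.2), and a new batch that may have drifted.

    set.seed(11)
    new_batch <- rnorm(15, mean = 10.15, sd = 0.2)  # hypothetical drifted batch
    t.test(new_batch, mu = 10)$p.value
    ## A small p-value here flags that the previously well calibrated process
    ## is "no longer doing what it used to," which is the one question a
    ## p-value actually answers.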

  13. Communicating statistical concepts in an accurate way can be a daunting task. Kudos to Andrew for providing concise, meaningful criticisms of the definition contained in the glossary written by the Agency for Healthcare Research and Quality.
    I would also like to comment on Andrew’s Definition of Statistical Significance. I am accustomed to thinking that a small p-value could arise due to several conditions:
    1. The ‘true underlying effect (in the population)’ is not consistent with the null hypothesis.
    2. The sample is not representative of the ‘true underlying effect (in the population)’ (Type I Error).
    3. The assumptions of the method that produced the p-value are not met.
    I’m not convinced that Andrew’s ‘try’ gives sufficient weight to all of the potential causes of a small p-value. In particular, I feel like we should always check into the possibility that assumptions are not met.
    Furthermore, ‘strength’ seems uncomfortably close to ‘power’ when I read Andrew’s Definition; could this also be considered ‘sloppy’?
    In any event, I have come away from reading this post with a richer understanding of what p-values are not, so Thanks, Andrew!

  14. The original post on the government website wasn’t intended for pedantic statistics textbook writers; it was intended for confused people who have no idea or desire to know what null hypotheses, underlying effects, etc. are. Yes there are issues with the definitions as read by somebody who already knows the material. But for the ignorant layman for whom the definition was intended? Those definitions allow that layperson to meaningfully grasp with the concepts found on other parts of the website. Something that even the definition of P value put forth by the author in this smug takedown piece fails to accomplish. If you ignore the writing and reading standards for lay-accessible websites and presume that the readers all hold bachelor’s degrees then sure, perfect definition!

    But as someone who has to explain these concepts to students who have not taken a statistics course and never will – a discussion of underlying effects, sample sizes, null hypotheses, and what have you is not only an irrelevant and self-indulgent exercise in pedantry, it’s counterproductive as I am adding more confusion rather than reducing it. If your goal is to inform the public to the level of a PhD student, that’s admirable, but not really feasible. And I think the linked definitions strike a good balance between the two.

    • Alex:

      I disagree with your claim that these definitions “allow that layperson to meaningfully grasp with the concepts.” If they read about a p-value on another page of the website and think that it is “measuring whether the results of a study are likely to be true,” I think this sends them in the wrong direction, to an unfortunate state of credulity, perhaps followed by a nihilistic skepticism once they realize that they’ve been misled.

      I fully support your proposal to make definitions that are comprehensible to the layperson. It’s just that I think that the definition on that website is (a) not actually comprehensible to the layperson, (b) misleading, and (c) false.

      • I am wondering if animation might help folks see the underlying process at work better than a technically correct definition.

        Here is what I am thinking of doing:

        Simulate a histogram of p_values from Norm(0,1) samples of say 30 – the no underlying effect world.

        Simulate another histogram of p_values from Norm(3,1) samples of say 30 – the _important_ underlying effect world.

        Animate showing one histogram above the other with p_values of different colours raining down from them.

        This shows them the Uniform(0,1) distribution from no underlying effect world and non-uniform one from _important_ underlying effect world and experience the raining down of p_values from these.

        Now most of the screen gets blacked out and one grayed p-value falls down – in Ken Rice’s conceptualization* the question appears “Is it worth calling attention to the possibility of this p_value coming from an _important_ underlying effect world – i.e. should others do more studies to learn if we regularly get small p_values doing this study over and over again?”

        If they were interested and watched closely a few times they might get the sense of the process.

        Also, further vary the _important_ underlying effect size Normal(E,1) and other assumptions.

        * http://statmodeling.stat.columbia.edu/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/
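
        For anyone who wants to try it, here is a rough (static, not animated) R version of the two worlds described above, using the stated setup of samples of 30 and a one-sample test of zero mean:

        set.seed(123)
        pvals_null   <- replicate(5000, t.test(rnorm(30, 0, 1))$p.value)
        pvals_effect <- replicate(5000, t.test(rnorm(30, 3, 1))$p.value)
        par(mfrow = c(2, 1))
        hist(pvals_null, breaks = 20,
             main = "No underlying effect: p-values roughly Uniform(0,1)")
        hist(pvals_effect, breaks = 20,
             main = "Important underlying effect: p-values pile up near 0")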

        • Do laypeople want to know the gory details?

          A lot of these problems arise from answering a question that no one had in the first place. I’m not sure people ever have the question that a p-value answers.

        • I think that a lot of people wouldn’t understand the idea of a simulation. If you do something physical, that might work. When my daughter was in 1st grade they played “Rolling Dice 6” and later “Rolling Dice 12” constantly (6 and 12 dice respectively), making histograms all the time. It was really memorable, not to mention that they really did love that game, which was great because everyone won sometimes. I can’t even remember what constituted winning, I just know whatever it was it appealed to 6 year olds.

        • > people wouldn’t understand the idea of a simulation. If you do something physical that might work
          I agree, but you can’t explain anything without relying on other things being understood.

          With a physical model they still need to see it as a representation of something in the world using a concept of randomness (not easy).

        • True. I’ve been thinking about making a little lego people population and doing something with that. Or maybe a list of all the public schools in NYC. Actually I do use SAMP in my class and that’s a simulation, but there is a story behind it. I have my students each taking dozens of samples of different sizes and graphing the distribution of the results. One thing that always happens is that even though I have talked about sampling as a concept and we’ve done various things with it, when we first go to SAMP and I say okay, now everyone take a 1% sample and I do it on the screen, it’s always kind of a light bulb moment for some people that we don’t all get the same results.

        • One physical illustration that I (and a lot of other people) have used is to start by asking, “What proportion of M and M’s do you think are tan?” (or whatever), then give each student a small package of the candies, and have them use that as a sample to do a hypothesis test or form a confidence interval. Then compare notes. That seems to help many understand the idea of sampling variability. (Often big packages of little packages of M and M’s or other suitable candies are on sale cheap after Halloween, so one can stock up then and be prepared.)
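
          For instructors without a candy budget, a rough R stand-in for the same exercise (the 12% tan figure and the bag size of 20 are made up):

          set.seed(5)
          true_tan <- 0.12                                     # assumed true proportion
          bags <- replicate(30, rbinom(1, 20, true_tan) / 20)  # one "bag" per student
          hist(bags, breaks = 10, main = "Sample proportions of tan M&M's across the class")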

        • Btw, I shared this with some medical researchers I was working with a couple of years ago, and at least one of them said it was “eye opening”. It’s certainly entertaining.

        • Thanks, close to what I had in mind, but I wish there was not so much judgement being overlaid: “highly unreliable p_values”, “hug a confidence interval”. Not necessarily good or bad concepts, just badly interpreted/used.

          (In meta-analysis, the combination of p_values can be the least wrong way to go.)

    • Basically agree with this. This glossary entry is essentially a descriptivist definition of p-value. Like it or not, within the context of the US Health & Human Services Agency for Healthcare Research and Quality’s Effective Health Care Program, this is functionally what the p-value means & how it is used. The glossary isn’t trying to precisely define the concept, it’s trying to explain how it is actually used. Prescriptivist quibbling doesn’t change that.

      Side note, “unfortunate credulity or nihilistic skepticism” seems like a pretty decent characterization of the options available to a layperson confronted with the use of p-values in policy-relevant fields at the moment.

      • S:

        Indeed, this definition is what a lot of people use. It’s given us himmicanes and hurricanes, ovulation and voting, everything causes cancer, ESP, and various other areas of junk science. I and many others think we as a scientific community can do better. I and many others think the U.S. government can do better.

        Millions of people believe in astrology too, but I don’t want the U.S. Department of Health and Human Services to recommend astrology either.

    • Alex: “But as someone who has to explain these concepts to students who have not taken a statistics course and never will – a discussion of underlying effects, sample sizes, null hypotheses, and what have you is not only an irrelevant and self-indulgent exercise in pedantry, it’s counterproductive as I am adding more confusion rather than reducing it.”

      And how’s this “dumbing down” of absolutely critical statistical concepts working out for your students? How many of them are being given a false sense of competence regarding these concepts and then end up botching a statistical analysis and wasting resources as we’re seeing on a massive scale?

      Yes, the theory and practice of statistics is difficult. That should be OK.

      • I think actually when you teach p value or types of error in a basic stats course for non stats students, what you are or should be teaching is don’t believe your eyes (or what other people say) when it comes to data. You could get data that show large differences between groups … but it could be the result of chance. You could get data that show no difference between groups … as a result of chance. Then on the basis of that be skeptical of your own and others’ conclusions. Ask questions about the sample size and how the sample was drawn. And keep in mind that statistical significance does not mean practical significance (just like random and normal don’t mean the same things in statistics that they do in normal conversation).

        I think it is admirable to try to help the public improve statistical literacy, but it’s hard, and if it is done in a sloppy way (as in this example) it is counterproductive. Physics, biology and philosophy are also hard. Actually so is writing essays. That doesn’t mean we don’t teach intro courses, courses for non majors, or start teaching kids basic concepts in elementary school.

  15. I feel compelled to post this as a general response because I have some contact with this side of healthcare. First of all, the government does publish good statistics guides; see http://www.itl.nist.gov/div898/handbook/. But NIST has a 200-year legacy, and much of the government/non-government healthcare metrics movement is a rapid response to the lesser-known parts of Obamacare. And so far, despite lack of rigor, it seems to be working. Hospitals are penalized for re-admissions and infections so follow-up is improved. Early reports suggest some success.

The people in this field come from a range of professional and educational experiences. They know their subject matter, like operating room materials or treatment plans, but never expected to be put on a committee for metrics. They struggle with basic math like fractions, ratios, and proportions. So just explaining that successes are the numerator and the total sample the denominator is a big deal. The distinction between rate of event, counts, and time to event is overwhelming. So talking about effect size (a difference in a metric) is impossible until they 'get' the metric. So, as with the quality techniques in the NIST handbook, significance testing is about acting on a signal, not a research claim. I suspect the writer of this definition was copying something or dumbed down the definition.

    Re: rates, probabilities, frequencies. It is common practice to INTERPRET a rate as a probability in biostatistics, but not to say a rate IS a probability. I saw INSIDE OUT this weekend and will use a metaphor from the movie. As statisticians we have all opened the door to Room of Abstract Thinking but we can’t let ourselves be reduced to shapes and colors.

    • I think they are estimates of the population proportion, right? Of course people interpret them as probabilities, that’s how they can use data to try to predict what will happen next month. And that’s a good thing to try to do rather than going by gut instinct.

      I agree that for huge numbers of smart people even figuring out the numerator and denominator (especially which denominator is the right one to use) is really hard.

  16. I’m curious if someone can suggest a rewrite of their first example that’d make sense to a lay audience (i.e. don’t use the word “null”):

    “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.”

      • I don’t think “chance variation” is very helpful by itself. It is precisely the nature of what “chance” has to do with the results that has people confused. Perhaps what should be tried is “The results were consistent with the two drugs having identical effects on the whole population, with the observed differences merely due to chance variations in what part of the population turned up in the samples.”

        • Srp:

          Sure, but once you try to be accurate there’s more and more details that can go in. For example, it’s not just sampling variation, there’s also measurement error.

17. Given your writing about this issue, I expected you to say that a proper discussion would be about what a p-value isn't, rather than trying to fix a bad definition that almost no one without statistical training would understand. As in: you see a lot of p-values, and you should be aware of these problems; then go into sample size and power, the nature of the null hypothesis, the nature of the model, etc.; and then highlight how, given these limitations, p-values and measures of statistical significance can be useful.

    • > “but it would be OK so long as they reported the actual type I error, which is the P-value”

Nope. P-value is not Type I error. Suppose, for example, that I'm making a fire control decision based on some data. I'm going to decide to pull the trigger based on the value of p(H1|data)/p(H0|data), my hypotheses being H1=target and H0=non-target. The Type I error (false positive) rate is the fraction of the time that I fire on non-targets*. p(data|H0) tells me whether the data collected is typical or unusual under the H0 hypothesis. It says nothing about whether the data is typical or unusual with respect to H1. The data may be "not unusual" with respect to H0 but even more typical of H1, thus resulting in a likelihood ratio which favors H1.

You can't make an informed decision in a vacuum. In order to make an informed decision you also need to know p(data|H1). Yes, p(data|H0) factors into the H0 vs H1 decision, but the error rates (Type I and Type II) follow from p(H1|data)/p(H0|data). (A numeric toy version of this is sketched at the end of this comment.)

      *At the risk of stating the obvious, Type II error also has its downside.
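Here is that toy version in R (the distributions and the observed value are invented for illustration, not taken from any real fire-control system):

# H0: background noise ~ N(0, 1); H1: target return ~ N(2, 1); one observation x
x <- 1.5
dnorm(x, mean = 0)                        # density under H0: the data are "not unusual" for noise
dnorm(x, mean = 2)                        # but even more typical of a target
1 - pnorm(x, mean = 0)                    # one-sided tail probability under H0 alone, about 0.07
dnorm(x, mean = 2) / dnorm(x, mean = 0)   # likelihood ratio, about 2.7, favoring H1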

  18. Ah, this explains why I find statistics textbooks so impenetrable. I learn a lot from your blog but if this definition of statistical significance is all I had to go by, I’d never be any the wiser. The AHRQ definition is wrong and dangerously misleading but wrong in a way that can be understood by the lay reader – which is what makes it so dangerously misleading. It achieves that by using clear language and giving examples. Your definition does neither. It tries to be maximally generative and minimally expressive – which is a virtue in mathematics but a sin in pedagogy. You must keep in mind a reader who will have only the definition to go by after reading it – not the expert who will nit pick every aspect of the answer. Your definition will not mislead the lay reader but it will also not lead them. Therefore, they will likely fall back on their common sense understanding of the word ‘significance’. A pedagogic definition should make it clear what it does not mean as well as what it means.

    Something like:
    Statistical significance is often misunderstood
    * Statistical significance does not mean real world significance
    * Statistical significance does not mean true results
    * Statistical significance is often used by researchers but more and more statisticians believe it is not useful

    In fact, all that statistical significance measures is [insert your definition – slightly reworded].

Often, it is expressed as p<0.05, but the difference between p=0.04 and p=0.06 is itself not significant. The 0.05 cutoff is essentially arbitrary and has no real foundation.

    So for example: [insert some examples from your blog].

  19. Your problems with the definition are all pedantic – their text is sufficient for the masses who are not well versed in statistical inference.

    That being said you are wrong on at least two fronts.
1.) P values, while influenced by the size of your sample, are not a measure of sample size. While a P value of .051 with 1000 respondents may tick to .049 with 1500, it's just as likely to tick up, especially if the relationship is indeed due to chance.

2.) A P value of .05 is sufficient in the social sciences given the complexity of the things we measure. That being said, very few regression results in social science journals will have results at this level. We ideally shoot for .01. I know the hard sciences generally like six sigma, but that is simply not tenable when studying people.

• And also this: “‘likelihood’ has a technical use which is not the same as what they say. Second, ‘the number of times a condition or event occurs in a study group divided by the number of people being studied’ is a frequency, not a probability.”

Yes, likelihood has a technical meaning and a whole suite of modeling techniques associated with it (MLE), but the number of times an event is observed over the number of observations made (the possible number of times that event can occur) is a probability, whilst the number of times an event is observed without standardizing it to the number of observations is a frequency.

100 people have read this article and believe what you've written; that's a frequency. If that is out of 1,000 people who've just read it, then that's 10%, and we can say anyone reading this article has a 10% chance of believing it.

      • Josh:

I think you may be trolling, but . . . your statement, "the number of times an event is observed over the number of observations made (possible number of times that event can occur) is a probability," is not in general true, as is illustrated by Phil's example above of the coin that was flipped 8 times with 6 heads but does not have a probability of 75% of coming up heads. Again, if you increase N, the empirical proportion gets closer to the probability (that's the law of large numbers), but that's the whole point: the empirical proportion and the probability are not the same thing; it is only under certain conditions that they approximately coincide. (A one-line simulation at the end of this comment illustrates the convergence.)

        Of course, given the existence of people who (a) wrote the definition quoted in the above post, (b) didn’t realize their errors, it’s no surprise that there are blog commenters who share these mistaken attitudes about probability and statistical significance. Lots of people don’t understand Newton’s laws of motion either. Science is hard. If it weren’t hard, it all would’ve been discovered earlier.
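The one-line simulation mentioned above (fair coin, increasing N; the seed is arbitrary):

set.seed(8)
sapply(c(8, 100, 10000, 1000000),
       function(n) mean(rbinom(n, 1, 0.5)))   # empirical proportion of heads at each n
# Small samples can easily land far from 0.5; the large ones hug the true probability.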

• I'm not trolling; I'm very well versed in statistics and ABD on my PhD.

You're right about the law of large numbers bringing the empirical results closer to the underlying probability. However, A.) in that example we can say of our observations that any one has a 75% probability of being heads. If, as is the case in all of our studies, we do not know the underlying probability, this is where B.) the P value comes in. In this example, with an N of 8, your P value would not be significant for your estimated probability of heads. Increasing the sample size, as you indicated, will bring your estimate more in line with the underlying reality, and thus your P value will improve.

It is important to note that the improvement in your P value occurs because your estimate gets closer to reality, not because you have a larger sample. Though it is that larger sample that brings you closer to reality, this is not always the case.

I really adore the fact that you're slagging on HHS while doing so using similarly inaccurate and misleading prose. Your contention that P values are a measure of sample size is just wrong. In your words, "FUUUUUK no no no no no". You also come off as pompous ridiculing a P value of .05 – which literally means there is a 95% chance that observed relationship is not due to chance and by itself is very convincing and goes a long way to demonstrate the veracity of your hypothesis (note I didn't say prove; that word belongs in math and distilleries). I'm not saying you're wrong in contending that P values are sensitive to sample sizes, but you are very very wrong in concluding that P values are a measure of sample size.

        • Josh:

          I am sorry but no, a p-value of .05 does not “literally means there is a 95% chance that observed relationship is not due to chance.” This is simply wrong. If you are really studying for your Ph.D. in a quantitative field, I strongly recommend you speak with your supervisor, the other faculty in your program, and some of your fellow students and ask them to explain your confusion.

          In all seriousness, it’s never too late to learn, and I suggest you swallow your pride and get this cleared up.

        • Andrew,

          Is it even safe to conclude that Josh’s supervisor will be able (or want) to clear up his confusion?

          Have you looked at the “research methods” teaching materials for some of our colleagues in the social sciences?

          I have. And I’ve seen one esteemed “colleague” that presents himself as having expertise in statistics refer to such accuracy in definitions of statistical concepts that you provide here as “statistical bluster”. And then in the same teaching materials make very glaring errors in simple formulas. My conclusion is that if that “colleague” had spent more time understanding the “statistical bluster” they would be better equipped at recognizing when a formula they write is in error. But if you’re gonna reduce statistics to just the application of a bunch of formulas, it seems to me you should at least get the formulas right.

          Reading between the lines of those teaching materials, it seems to me that the message being sent to that “colleague’s” graduate students is that getting statistics right will only serve in slowing them down in their efforts to publish as much as possible.

          JD

• You are pompous. I am not only studying for my PhD in a quantitative field but doing so in a program that is highly quantitative itself; having taken 6 stats classes (in addition to game theory) and studied at ICPSR, I'm very confident in my position. I've studied a suite of methods you've likely not heard of, and I can tell you that yes, a P value tells you the probability that the observed relationship is due to chance or measurement error etc. (unless you're Bayesian), which I'm thinking you must be.

I would suggest that if you cannot tell the difference between an observed probability, a frequency, and the true probability, you should pack it in; no need to keep a blog on stats which you obviously are not well versed in. Add to that your argument that a P value is a measure of sample size and you've got no leg to stand on.

Let's look at the calculation of a P value, shall we. In your example above the standard error is sqrt[(.75*.25)/7]=.163. If your null hypothesis is that you'd observe heads 50% of the time, your test statistic would be z=(phat-p)/SE(p), or T=.25/.163=1.533; the associated P value for a sample with 7 degrees of freedom is thus .1705 (or a 17.05% chance your observed results are due to random chance, or measurement error). You can tell the results here are not significant because the 95% confidence interval (which uses a P value of .05) puts the true population mean between .36 and 1.138, and as you cannot have a proportion greater than 1, it's not significant.

Rather than hurling insults at me, why don't you explain any errors you see in my thinking or calculations.

• Hey Daniel Lakeland, it's a good thing you knew "Andrew" here was Andrew Gelman, given there are millions of Andrews in this world.

          Thanks for your pomposity

        • “which literally means there is a 95% chance that observed relationship is not due to chance”

          Sheesh. This is precisely why we need to emphasize that a p-value *is not* the probability that the results were due to chance. That’s the typical layman’s interpretation of a p-value, but we need to be very clear that that is not what it is. Otherwise, this is quite logically how one would interpret 1-p, and it’s just plain wrong.

• What would be your interpretation of a P value, then?

BTW I think we've lost sight of the author's original claim that a P value is a measure of sample size, which it is most certainly not.

        • Josh,

          For what it’s worth — I define the p-value as follows:

          “p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand, assuming
          i. the model assumptions are all true, and
          ii. the null hypothesis is true, and
iii. the random variable is the same (including the same population), and
          iv. the sample size is the same.”

          For more details, you may download the slides for Day 2 at http://www.ma.utexas.edu/users/mks/CommonMistakes2015/commonmistakeshome2015.html

• Josh is a troll; he makes statements like "no need to keep a blog on stats which you obviously are not well versed in" to Andrew Gelman, a professor of statistics, the first author of one of the most comprehensive textbooks on modern Bayesian statistics (BDA3), and the manager of a group that produces the most cutting-edge statistical inference software in the world (Stan).

          a p value is the probability of seeing data as extreme or MORE extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis)

If the null hypothesis is that the data come from a unit normal N(0,1) but the data actually come from a N(1,.05) and you have 1 data point, you are almost guaranteed that your p value is not "significant". Perhaps p ~ 0.16. Does this mean that there's a 16% chance that your data "occurred by chance", or that there's an 84% chance that they didn't occur by chance? (A two-line version of this calculation is sketched at the end of this comment.)

          The phrase “occurred by chance” is meaningless by itself.
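The two-line version of that calculation (a one-sided test of H0: N(0,1); the single draw is of course random):

x <- rnorm(1, mean = 1, sd = 0.05)   # the data really come from N(1, 0.05)
1 - pnorm(x)                         # p-value under H0, typically around 0.16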

        • @Martha: You wrote, “I define the p-value as follows: “p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand…”

          Yes, exactly. It’s the test statistic that matters. A couple things:

          1) Unless your test is just an anomaly detector you need both H0 and H1 hypotheses to compute a test statistic. If all you’ve got is H0 then you’re not in a position to test anything. Just this afternoon one of my interns ran F-tests for models she developed and calculated associated p-values. For reasons not worth getting into, we took the F- and p-values with a grain of salt but the point was to see whether the p-value was really low or really high. It was a quick way to compare two hypotheses. (It was a bit of an academic exercise. You could see by eye that the more complex model didn’t fit the data appreciably better than the simple one. If it had looked like it did, we would have computed AIC or BIC values or the like.)

          2) Am I out of my mind in thinking that I pretty routinely see people write “data at least as extreme” where they should be writing “test statistic at least as extreme”? It’s not a subtle distinction, is it?

• Chris G: Now that I consider the question for longer than two minutes, it occurs to me that according to Egon Pearson the test statistic is really (or ought really to be) defined by a system of level curves in the sample space across which we become "more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts". (I call them level curves because by their construction, we should only care about which level curve the data lie on and should be indifferent to the position of the data within that level curve.) From that point of view the distinction between extreme data and an extreme test statistic is rather blurry, don't you think?

        • @Corey: Mayo’s post is not a lite read. I’m going to have to chew on it and the level curve suggestion for a bit.

          To clarify, I absolutely believe p(data|H0) is instructive but it’s only part of the story. Yes, you want to know whether your data is consistent with H0 but in order to make any definitive conclusions about what’s going on you need to establish whether the data is or isn’t consistent with other hypotheses. For example, suppose my H0 hypothesis is x~N(0,1^2) and H1 is x~N(3,1^2). I measure x=10. The value is extreme under both H0 and H1. In practice, I’d probably decide ~H0 and ~H1 because I wouldn’t believe that the H0 and H1 pdfs were actually normal out to seven and ten sigma, i.e., I’d figure that 1) the pdfs were good-faith estimates but that nowhere near enough data had been collected to characterize the tails out that far and 2) there’s probably some other Hn hypothesis that I hadn’t thought of which is more consistent with the data. If forced to make a decision between H0 and H1, I’d want to know the penalty for incorrect decisions in each case. Since I’d suspect both hypotheses are wrong I’d choose the one with the lowest penalty for error. I’m now way off topic. I’ll look more at Mayo’s post.

20. Because A.) I did not know who Andrew Gelman is and B.) I am a frequentist rather than a Bayesian does not make me a troll. BTW I think Bayesian statistics belongs in game theory, not in results. Had I known the man was indeed a Bayesian, then our cause for difference of opinion would have been apparent.

This article is hitting HHS for the sloppiness of their language, but rather than educating me and pointing out errors in my thinking, you make fun of me. Could that be why people in general are not statistically literate? Instead of accusing me of being a troll and making fun of me, my advisor, and my program, your time would be better spent pointing out the inaccuracy of my statements. Martha did a great job on this front.

Martha, your explanation is very informative and makes sense, and I see the inaccuracy of my statement. I would say that the claim that a P value is an estimate of the chance that the observed relationship is due to chance is not wrong per se but an oversimplified statement.

Further, not one of you has addressed my underlying points, which are as follows: A.) the P value is not a measure of sample size; B.) a P value of .05 is fine in some cases, especially because we're not taking anything as proof, it just provides credible supporting evidence; and C.) the author confuses observed probability with frequencies.

    • Josh:

      This is my last try . . . Nowhere did I make fun of you or your advisor or your program. I am 100% serious that you talk over p-values with some people you work with, and maybe they can clear up your misunderstanding. This stuff really can be confusing. There’s no shame in being mistaken. Just take it as an opportunity to learn.

      • Andrew:

You're right, you didn't make fun of me; I misspoke. I did nevertheless feel attacked and marginalized by the accusation of being a troll, a slight heightened by Daniel's comments.

I will admit to falling into the lay and incomplete understanding of P values. Thanks to Martha's explanation and linked write-up, I understand that now. I still think it's a bridge too far to claim P values are a measure of sample size, which was my original point. You could have a sample of a billion and still not get any significance for the relationship between foot size and political ideology.

        • Josh:

          Nope. With a sample of a billion, you’ll almost certainly get a statistically significant correlation between foot size and political ideology. Again, I suggest that, instead of arguing on the internet, you talk with some statisticians at your university and maybe they can explain this to you.

        • Andrew:

I am, actually; a colleague of mine has made the same point. I suggested we try a Monte Carlo simulation for such a thing. We used eye color and liking of Star Trek. Here is my R code, and although I did not have enough RAM to simulate a billion, I did do it with 100,000,000 with no significance.

          set.seed(324)
          blue<-rbinom(100000000,1,0.08)
          startrek<-rbinom(100000000,1,.5)
          gc()

          cor.test(blue, startrek)

        • This isn’t the sort of thing you can simulate. The issue is twofold: (i) with gigantic sample sizes, tests have tiny Type II error rates even for minuscule effect sizes; (ii) in social science, effect sizes may be minuscule in practical terms but they are never truly zero.

• Just so the simulations don't require gazillions of bytes of RAM, suppose instead that the probability of blue is, say, 0.25 instead of 0.08; also suppose that in the finite population of 10e6 people surveyed, the data are consistent with a 1e-3 increased chance of liking startrek.

> set.seed(324)
> blue <- rbinom(10e6, 1, 0.25)                   # P(blue eyes) = 0.25
> startrek <- rbinom(10e6, 1, 0.5 + 1e-3 * blue)  # blue-eyed people get a 1e-3 bump
> cor.test(blue, startrek)

          Pearson’s product-moment correlation

          data: blue and startrek
          t = 3.027, df = 9999998, p-value = 0.00247
          alternative hypothesis: true correlation is not equal to 0
          95 percent confidence interval:
          0.0003374269 0.0015770158
          sample estimates:
          cor
          0.0009572217

          Hold on while I write up a new Psychological Science article…

        • Daniel:

You got significance from the minuscule effect you added to the simulation, and Corey is right that in practical terms the effects are never truly zero, and thus with very large sample sizes you'd get sig. on what would otherwise be a minuscule effect. There is a difference between significance in the statistical sense and significance in the real-world sense.

But you have to keep in mind that your simulation detected, with significance, the pre-determined relationship because the sample size made your estimates more BLUE. The significance itself is otherwise unrelated to sample size. As I told my colleague, we can all agree that the level to which your tires are inflated affects your gas mileage, but it would be just wrong to say that the degree to which your tires are properly inflated is a measure of your gas mileage.

        • The designers of numerical random number generators go to extreme lengths to ensure that they pass large batteries of tests. The Mersenne-Twister algorithm is very good at being a random number generator.

          Therefore, when you use Mersenne-Twister to generate a sequence of independent binomial random variables with p exactly 0.5 you get a sequence that satisfies that criterion VERY well. If you test using null hypotheses of your data coming exactly from a random number generator, unsurprisingly, with all that effort put into it, you get exactly what you’re looking for.

          Fortunately for those of us who enjoy human nature, but unfortunately for those who wish to use NHST, samples of humans rarely produce such high quality random sequences

        • Daniel:

That was very much Corey's point and one I agree with. But my point is a simple one: P values are not, nor have they ever been, a measure of sample size.

To demonstrate this point I simulated two orthogonal and wholly independent variables, confirming that a test of the correlation between the two remains very much not significant even with a sample size of 1e8.

Now real data are not so random, and the larger your sample, the smaller the effect you will be able to detect (with significance), regardless of whether that effect has real-world implications or not, e.g. people with blue eyes being 1e4 times more likely to like Star Trek.

In your simulation you began by generating data such that the two were dependent to that extent, and unsurprisingly you found that very effect (significantly so) with a sample of 1e7. And while I agree that you'd likely not detect that effect with a sample of, say, 1000, the P value improves with a larger sample size because your estimates are more BLUE, not because of the sample size itself. P values have never been and will never be a measure of your sample size.

        • They’re not a measure of sample size when you use independent random numbers from a random number generator, but in the presence of “every effect in real science is not exactly zero” p is indeed highly affected by sample size.

          If you agree that “every effect in real science is not exactly zero” then if you get your sample big enough, you can detect the non-zeroness and hence, p *in real science* depends strongly on sample size.

        • vals <- sapply(10^seq(1,7),function(x){t.test(rnorm(x,0,1),rnorm(x,0+3e-3,1))$p.value});
          > vals
          [1] 8.281375e-01 7.857281e-01 1.786090e-01 7.281725e-01 1.272459e-01
          [6] 4.639469e-03 3.562949e-14

          this is the difference between doing probability theory, and doing statistics on scientific data.

        • ” [T]hus with very large sample sizes you’d get sig. on what would otherwise be a minuscule effect.” I think actually for the general public the issue is that you can get a large observed effect size in a small sample even though there is no real difference (and I think there are plenty of cases where there are no real differences) or a difference that is so small it is meaningless in a practical way. So when they read about homeopathy or ESP or numbers presented in advertisements where claims for the existence of meaningful effects are made it’s certainly reasonable for them to ask “what is the probability of getting this effect size or greater in this sized sample if the effect is not real?”

          Chris G mentions test statistic, but I think that comes after asking the question and for this level of explanation it basically is part 2, which is how do you determine what this probability is, which is about test statistics and their assumptions.

          So in other words there is “What is a p value?” and then “How is a p value calculated” and those are two different questions.

    • Wikipedia defines trolling as:

      “a person who sows discord on the Internet by starting arguments or upsetting people, by posting inflammatory…” etc etc

Coming here and, from a point of view of ignorance, posting personal attacks about how the blog owner should shut down because he doesn't know anything about his topic… that's trolling. If you don't want to be called a troll, don't do that sort of thing.

21. To a lay person, replacing the incorrect "p-value is the probability an effect is false" with the correct "p-value is the probability of that effect or one more extreme conditional on the null hypothesis being true" looks just like word masturbation. I've seen plenty of posts like this, and to a non-statistician like me they always looked like arguments over semantics. It was not until playing with numbers that it clicked. Unless you already understand why the former is incorrect and the latter is correct, you won't see the distinction between the two sentences, defeating the purpose of the instruction.

    Setting up some simulated data, and showing how applying the common wrong interpretations leads to real and substantial incorrect quantitative and qualitative answers and how applying the correct definitions leads to correct answers is the best way to get this stuff across to people. This is where graphs of data do wonders.

For example, on the definition of probability. "[Probability] is the number of times a condition or event occurs in a study group divided by the number of people being studied". Saying "No, that's a frequency, not a probability" has no effect on one's understanding. To get the point across, you need to say something like "The probability of getting heads on a coin toss is 0.5. If I toss a coin 5 times and I get 3 heads and 2 tails, the frequency of heads was 0.6. If I were to use that result of 0.6 as the probability of getting heads (rather than understanding it was just the frequency of heads on those 5 tosses), then I would believe that setting up a gambling game where I win $4.1 for heads and lose $5.9 for tails would net me on average $10 per 100 tosses, when in fact I would lose on average $90 per 100 tosses."
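That arithmetic in a few lines of R (the payoffs are the ones from the example above):

p_true <- 0.5; p_obs <- 0.6                  # true probability vs. observed frequency
100 * (p_obs  * 4.1 - (1 - p_obs)  * 5.9)    # believed: about +$10 per 100 tosses
100 * (p_true * 4.1 - (1 - p_true) * 5.9)    # actual: about -$90 per 100 tosses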

    • Luke:

      The first job when writing this sort of glossary is to not get anything wrong. I agree that further explanation of the sort you suggest is a good idea. Phil earlier in the comments suggested that coin-flipping example.

      I’m certainly not saying that I should be the one who writes these glossaries. But no amount of explanation will help if the underlying definitions are so screwed up. You can see this in the government-supplied material above: the definitions are wrong and the accompanying examples are no good.

    • Here are 3 examples I believe elucidate both the distinction and relationship between frequency and probability and also highlight a difference between Frequentist and Bayesian approaches.

      Example 1 (Known baserate):

      What is the probability of choosing a King from a deck of cards?
      1. If we draw a single card from a deck of cards 52 times without replacement the frequency of Kings drawn between any given two draws will be between 0 and 4.
      2. The probability of drawing a King:
      a. 1st draw = 4/52
      b. 2nd – 52nd draw = (4-k)/(52-d) (Where k=number of Kings drawn and d=number of cards drawn)

      After all 52 draws the frequency of Kings = 4 and the frequency ratio = 4/52 and the probability of choosing a King on a single draw = 4/52.

      Example 2 (Frequentist approach):

      What is the probability of choosing a King from a deck of 30 randomly chosen cards from another normal 52 card deck?

      What is our best estimate of the probability of choosing a King from the 30? (Frequentist approach)
      a. Draw a single card 30 times and count frequency. Frequency of Kings/30 = probability of selecting a King on a single draw.
      b. Do the same 100 times.
      c. Test the Null Hypothesis that Pr(King) = 0 versus Alternative Hypothesis that Pr(King) > 0.
      d. If the Null Hypothesis is rejected, then

      In this example the actual number of Kings is unknown and therefore the probability must be estimated. As the number of repetitions increases the average frequency of Kings stabilizes around an estimate assumed to be the true score.

      Example 3 (Prior information approach):

      What is the probability of choosing a King from a deck of 30 randomly chosen cards from another normal 52 card deck?

      What is our best estimate of the probability of choosing a King from the 30? (Bayesian approach)
      a. 1st draw = (30*(4/52))/30 = ~2.308/30
b. 2nd – 30th draw = (2.308-k)/(30-d)

      In this example, again the actual number of Kings is unknown and therefore the probability must be estimated. Utilizing information from the known probability of Kings in a 52 card deck allows us to estimate the number of kings in the 30 randomly drawn cards. We know this estimate on the first draw and can show that the variation around it in 100 such studies will converge to this value. This is a far more efficient estimate of the true probability than the approach in Example 2.
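A minimal R sketch of Examples 2 and 3 (the seed and the number of repeated single draws are arbitrary choices):

set.seed(7)
deck30 <- sample(rep(c(1, 0), c(4, 48)), 30)   # 30 cards at random from a 52-card deck; 1 = King
mean(replicate(1e4, sample(deck30, 1)))        # Example 2 flavor: frequency over repeated single draws
sum(deck30) / 30                               # what that frequency stabilizes around
4 / 52                                         # Example 3 flavor: the prior-information estimate, before any draw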

  22. Their definition of a confidence interval is just as bad. It starts with “The range in which a particular result (such as a laboratory test) is likely to occur for everyone who has a disease. “Likely” usually means 95 percent of the time.”

    • And their example for a CI is even worse:

      “For example, a study shows that the risk of heart attack from a drug is 3 percent (0.03). The confidence interval is shown as “95% CI: 0.015, 0.04.” This means that if you conduct this study on 100 different samples of people, the risk of heart attack in 95 of the samples will fall between 1.5 percent and 4 percent. We are 95 percent confident that the true risk is between .015 and .04.”

      • The example is the type of thing I include on a list of attempts to explain what a 95% confidence interval means. I ask students to classify them as correct, partly correct but missing some details, doesn’t get it, or I haven’t got a clue. We then have a whole class discussion. It usually helps some students understand and (possibly equally importantly) helps some students realize they don’t understand. But a lot of people really have difficulty understanding a concept involving so many conditionals.

• So many conditionals, and the added trick of having to use the sample mean as the estimate for the population mean to estimate the standard error. That's the part where students put their heads down on the table. To me that's always the "pay no attention to the man behind the curtain" point. Of course that again is a difference between what a confidence interval is and how it is calculated, but unfortunately in a lot of stats 101 textbooks these are really collapsed.

        • Elin:

          Ultimately I think the problem is that the classical theory of confidence intervals is clever but is generally a bad idea. It’s hard to explain in part because it ultimately doesn’t make sense in typical settings.

          A classical confidence interval represents all the points that are not rejected by the data, in the context of a particular statistical model. Or, to use Dan Lakeland’s helpful terminology in an earlier comment, a classical confidence interval represents a set of random number generators that are not rejected by the data using a particular test statistic.

          The classical confidence interval can thus be seen to have two serious problems: First, it is dependent on the test statistic. Using different test statistics, you’ll have different random number generators that are rejected by any particular dataset. Indeed, as we discussed in a post a few years ago, if you put enough information in the test statistic you’ll have no problem essentially rejecting every random number generator in your set of candidates. This gives you an empty confidence interval.

Even worse, it's possible to have a dataset that fits the model class almost as badly, so that almost every random number generator is rejected. This will give you a very narrow confidence interval, which is typically taken as a sign of strong evidence in the data.

          The second problem with the confidence interval is the same as with other classical methods, which is that it’s entirely data-focused, it’s all about what the data can reject. But ultimately we care about the underlying truth, about the larger population, not just about the data or sample. As John and I discuss in our recent paper, when data are noisy and the confidence interval excludes 0, some of the points outside the confidence interval can represent reality much better than any of the points inside the interval.

          OK, this needs its own post…

        • Andrew: “The second problem with the confidence interval is the same as with other classical methods, which is that it’s entirely data-focused, it’s all about what the data can reject.”

          I am not sure I would characterize it that way. Though I am no historian of thought, I would say often classical methods are focused on making decisions: Should I use fertilizer in my plots or not?

          I don’t think Fisher saw p-value as an indication of truth, or as rejecting anything. More like “On the basis of this evidence we should do A”.

          At this practical level I am pretty sure one can design Bayesian and frequentist decision procedures that have the exact same practical consequences (e.g. same number of correct decisions, mistakes).

          However, if the goal is to characterize truth, then the Bayesian framework makes a lot more sense. The problem is when the search for truth, and the need for practical decisions, get all mixed up.

• Yup, 161+ comments and it's clear how to present this to others, given our widely shared clear understanding ;-)

In John's and your paper, this part of your definition, "A mathematical technique to measure the strength of evidence from a single study," is a big fail, no?

John's and your paper gets at a pragmatic or purposeful grasp of the p-value concept, what to make of them (e.g. "difficulties of interpretation"), which goes beyond the nominal definition of what they are or the ability to pick them out (e.g. Martha's "classify them [here confidence intervals] as correct, partly correct but missing some details"). Folks need the first two, of course, but the pragmatic grasp is much harder to get across and, I think, necessary for the concept to do more good than harm.

        • “OK, this needs its own post…” – can you get at the relationship between this definition and coverage rates too, please? As in – we can think of a CI as the set over which we don’t reject the null, but also we can think of it as the output of a procedure that generates a set of parameter values which covers the true population parameter some proportion of the time in repeated trials.

          Certainly these must reduce to the same thing in many basic settings, but I’ve been working on some situations where I think the latter is better than the former. Also because I think of the relationship between CIs and the sampling distribution of BetaHat as more fundamental than the relationship between CIs and hypothesis tests. But I’ll bet you disagree with some of that (or have deeper thoughts about it), and I’d be interested to hear more.

        • “but also we can think of it as the output of a procedure that generates a set of parameter values which covers the true population parameter some proportion of the time in repeated trials”

One thing we need to get right in this definition is that it only applies to repeated random sampling from a fixed population. So it works great for things like taking random samples of boxes in a warehouse where the contents are static, but it's entirely conceivable that repeated samples in time from a dynamic process can produce results where the CI includes the current value of the population parameter 0% of the time. Probably that's a bit pessimistic in practice, but the point is that the "guarantee" of coverage is nothing like a guarantee in the presence of dynamics rather than repeated random sampling of a static population.
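A quick coverage check in the spirit of that definition, under the static-population caveat above (the numbers are invented, loosely echoing the AHRQ heart-attack example: true risk 3%, n = 1000 per study):

set.seed(42)
true_risk <- 0.03; n <- 1000
covered <- replicate(1e4, {
  x  <- rbinom(1, n, true_risk)        # one new study
  ci <- prop.test(x, n)$conf.int       # one off-the-shelf interval estimator
  ci[1] <= true_risk && true_risk <= ci[2]
})
mean(covered)                          # roughly the nominal 95% in this static setup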

        • Daniel – yes, absolutely. Thinking clearly about the data generating process you believe underlies your data is necessary before you can choose the procedure for generating CIs. And to be clear, I’m almost as suspicious as you are about the coverage properties of most CIs that people actually compute.

          But I think the conceptual difference between thinking of CIs in relation to sets in which the null hypothesis is rejected as opposed to thinking of them as procedures that produce sets with certain coverage properties is interesting. To me, the former links CIs with hypothesis tests and the latter with the sampling distribution of the parameter estimate. I think the latter relationship is more interesting, but I’m shopping around for other opinions on that.

        • Hi Martha,

          I’ve thought about how to teach students about “confidence intervals” a lot too. This topic always comes up *after* point estimation, so I try to use the terms “interval estimator” and “interval estimate” as much as possible, and am light on the use of “confidence interval” because it does not let me distinguish between the two. Further, I suspect to some, “confidence interval” refers to a specific kind of “interval estimator/estimate”.

So then, I can be clear that the probabilistic statements apply to the random interval that is the interval estimator.

          JD

23. I find it very interesting to read sentences like:

    Miss Ingdata: I also get a little bored over the pedantic nitpicking that statisticians do over the definition of the p-value.

    Andrew: Pedantic is in the eye of the beholder. But I think these misconceptions have real consequences.

"Misconceptions" and "real consequences" are also in the eyes of the beholder. One very important misconception occurs when someone defines p-values as something like "p-value = P(data | H0)". It is a bad notation/analogy and promotes many misunderstandings and avoidable controversies, and it has real consequences (such as the banishment of classical tools from a psychological journal).

The p-value is not a conditional probability. Many conclusions drawn by treating p-values as conditional probabilities at the very least sidetrack and overshadow deeper discussions.

    First: The informal definition of conditional probabilities is P(A|B) = P(A&B)/P(B), where P(B) is positive. The events A and B must be measurable in the same space. This fact is important.

Second: In a classical statistical model, the null hypothesis H0 makes statements about probability measures (the parameter space is just an indexer of probability measures), and therefore H0 is not defined in the same sigma-field as the data (if you impose a sigma-field that includes H0, you are imposing a Bayesian structure and will also be imposing many non-allowed logic rules).

Third: A general p-value has a formal definition that has nothing to do with the definition of conditional probabilities. If you decide to explain a p-value to a general public (or to practitioners) by using conditional probabilities, you will be importing the Bayesian interpretation into a non-Bayesian tool. Practitioners who do not have technical training will import many logical rules that are allowed in the Bayesian world but not in the classical world to elaborate their (mis-)conclusions.

    Unfortunately, as statisticians have little contact with logics, alternative logics, limitations of probability, other types of uncertain measures, non-additive measure theory, etc, these misunderstandings will continue perhaps for many years.

    • Alexandre:

      Indeed, the p-value is not Pr (data | H0). It is Pr (T(y.rep) >= T(y) | H0).

      Two very important things here. First, it’s not at all the probability of the data, it’s always the probability of seeing something at least as extreme as the data—a much different thing! Second, the p-value depends crucially on the test statistic T(y) and how it is defined if other data were seen (that’s T(y.rep)).

      P.S. The above definition is itself a simplification in that it assumes that H0 is a point null hypothesis or that, if it is composite, that the p-value is constant across all models that make up the composite null. More generally, the p-value can vary across the space of the composite null hypothesis, and then a single p-value is created by some reduction. Methods proposed in the literature include using the supremum of all p-values; or the p-value from a plug-in estimate of the parameters in the null hypothesis; or some weighted average of the p-values.
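To make that concrete, here is a small simulation sketch of Pr(T(y.rep) >= T(y) | H0). The toy setup (20 draws, H0 a standard normal, T the absolute sample mean) is purely illustrative, not anything from the post:

set.seed(123)
y <- rnorm(20, mean = 0.4)                               # observed data; the true mean is unknown to the test
T_obs <- abs(mean(y))
T_rep <- replicate(1e5, abs(mean(rnorm(20, mean = 0))))  # replicated test statistics under H0
mean(T_rep >= T_obs)                                     # simulation-based p-value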

      • Andrew

Of course, but I was afraid to use the symbol ">" and have problems with my post. "Data" was shorthand for "subset of the sample space", since {T(Y) > t(y)} is a subset of the sample space.

In your definition you still use conditional probability to define p-values; this notation generates misunderstandings even for a point null hypothesis. I hope one day you will realize this important issue and start to avoid this bad notation (as you have strong influence, you should rethink your position with care). By the way, I think we are converging to the same notation; in older posts you used different arguments to defend the definition of p-values via conditional probabilities, and it is good to see a change in your position.

Indeed, p-values depend on T, and T depends on H0, as I have explicitly written on page 3 of my paper published in FSS.

        • I’d be interested to see one real “in-the-wild” example of a serious, consequential misunderstanding arising from Bayesians’ willingness to write Pr( thing | H0 ) instead of Pr( thing ; H0) or similar.

        • I’d have to agree with Alexandre that the conceptual distinction between conditional probability and the hypothetical reasoning used to calculate a p-value is important enough to merit the notation P(T(Y) >= T(y); H0) for the latter. While I cannot cite an “in-the-wild” consequence of this misunderstanding, I think, especially in the teaching of statistics, it is important to be clear and consistent in our language and notation about what is treated as random and what is not in that other paradigm of statistics.

        • Corey,

          Let A = {T(X) greater than or equal to t(x)} and H0: theta in Theta0.

          First, the null hypothesis in the classical paradigm makes statements about the probability measures of the statistical model, then “theta in Theta0” means that those probability distributions that potentially explain the data should satisfy the restriction “theta in Theta0”. If M is the family of probability measures, then M0 is the family of probability measures under the restrictions in H0.

Second, when you write P(A|H0) you are making many impositions. What does "P" mean in the statistical model? It does not make any sense outside the Bayesian language, since H0 makes statements about the probabilities in the statistical model. The definition P(A|H0) appears to suffer from self-reference: you define a P that depends on a statement that depends on P that… Well, this causes so many paradoxes that von Neumann in 1925 introduced the axiom of regularity, which in particular prohibits self-reference.

Third, with the notation P(A|H0), only Bayesian interpretations make sense, since that P must be a measure other than those listed in H0. You are explicitly imposing that the events {theta in Theta0} and {theta in (Theta – Theta0)} are in the same sigma-field as the data event A; then, in this context, P must be the Bayesian product measure. However, we want to use a classical tool, so why on earth are we importing Bayesian interpretations? This just confuses readers, students, practitioners…

Fourth, if you interpret a classical quantity by using Bayesian language, you will conclude that the classical quantity is not good. Of course, if you impose a Bayesian criterion to compare methods, you will conclude that Bayesian methods are better.

          If you want more detailed explanations, theoretical and numerical examples, I can provide but not here.

          Reference:
von Neumann, J. (1929), Über eine Widerspruchsfreiheitsfrage in der axiomatischen Mengenlehre, Journal für die reine und angewandte Mathematik 160: 227–241.

• To me, it is a language problem. Bayesians feel comfortable with this conditional notation and teach by employing it, even for classical theory. To me, it is a disaster for classical methods in the long run.

          In my opinion, if we want to teach classical theory, we must use a good notation to better understand its real/apparent problems. Otherwise, just be honest and say that we do not know how to formalize correctly the classical theory to avoid those so claimed “problems”.

        • Alexandre, where to begin?

          (1) Classical theory is a disaster no matter what.

          (2) You did not answer Corey’s question at all. The reason you can’t cite a real example where this caused a problem is because there is no such example. I’ve never heard of one, neither has anyone else.

          (3) Measure Theory is not statistics. All measure theory does is provide a certain kind of extension of statistics to infinite sets. If measure theory contradicts any aspect of what we consider necessary for statistics, such as that all probabilities are conditional, then measure theory should be dropped and a different formalism for statistics should be found.

          which brings me to:

          (4) Any issue in probability theory that only appears for infinite sets is irrelevant to both the foundations and practice of statistics.

(5) Frequentist insistence that Bayesians should use P(thing ; H0) instead of P(thing | H0) is the most asinine thing I've ever heard. Frequentists may have jumped headfirst into the crazy swamp, but that doesn't mean Bayesians are required out of politeness to jump in with them.

        • > (5) Frequentist insistence that Bayesians should use P(thing ; H0) instead of P(thing | H0) is the most asinine thing I’ve ever heard.

          You obviously don’t listen to talk radio but, ignoring talk radio for the moment, it scores pretty high (low?) for me too.

          I’m with Corey, if I’m to take the distinction seriously then I need to see some examples where the authors booted their analysis because of confusion over notation.

        • Anon,

I am not advocating in favor of frequentism. Classical statistics is much more than frequentism.

The real examples are the conclusions written in many papers. If you use poor language, you get poor conclusions. Poor analogies produce poor comprehension.

Perhaps logic/mathematics must be considered here, mustn't it? If a definition makes use of self-reference, then it is problematic.

          Examples of problematic sentences for the standard logic:

          1. “This sentence is false”.

          2. (A): “The sentence (B) is true”; (B): “The sentence (A) is false”.

          and finally:

          3. H: “P is of a certain type”, define: P(A|H).

          Self-reference is a real problem for standard mathematics.

        • This isn’t… responsive…?

          I am acquainted with your four points (even the Axiom of Regularity thing, a bit). I admit the charges: the notation could potentially cause confusion. What I’d like to see is historical evidence of some specific person (non-student preferred) actually becoming confused with said confusion being clearly attributable to Bayesians’ sloppiness in this regard.

• Well, there are so many confusions, but one recent one is that a psychology journal, BASP, decided to banish p-values. If you go after the papers cited in that editorial note, you will see the reasons and the answer to your question…

        • Could you maybe narrow it down to one specific paper cited in that editorial? Perhaps the one in which Bayesians’ use of P( thing | H0) is most clearly to blame?

        • Corey,

          See for instance:

          Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes’s theorem. Psychological Review, 110, 526–535.

The author draws many conclusions about p-values by using all the (prohibited) rules lurking in the definition P(A|H0).

        • Yup, that’s pretty bad. Thanks for bearing with me.

          Trafimow doesn’t appear to understand the procedure he’s criticizing: he writes P( F | H0 ) where F stands for “finding”. But of course that’s not what a p-value is; the whole “probability of a result as or more extreme than the result actually observed” thing seems to have gone over his head.

        • Corey,

Yep, but I think this is a minor issue. The major problem is to write P(A|H) and to use all the operations it tacitly allows. Of course, those conclusions are valid only in a restricted Bayesian domain and do not touch classical p-values; however, they think they do…

This paper has 100+ citations in Google Scholar. Well, if highly recognized statisticians utilize this notation, these guys will certainly use it and will import all the natural operations the notation induces.

If he had written p-value = sup_{P in M0} P(A), where M0 is the family of probability measures restricted by H0, he would never have reached such conclusions. Conditional probability as a representation of p-values is a bad fiction…

        • Alexandre, there are a lot of misconceptions in the paper; I singled out the “as or more extreme” one because I think it’s a prerequisite for treating P( F | H0 ) as a thing that can be put into Bayes’ Theorem. If a person becomes aware that the p-value is the probability of a hypothetical event that hasn’t actually occurred then it stops making sense to condition on that event in the typical Bayesian fashion. That realization would have headed off Trafimow’s entire line of inquiry in the paper.

        • Hi Keith,

Thanks for your intervention. I am not arguing against this linguistic principle; I am arguing against a very specific notation.

Indeed, some notations force misunderstandings: assume that the quantity you want to define should have only the properties listed in A = {A1, A2, …}. If you define it using a notation that allows one to reach an outside property B not in A, then your quantity is ill-defined and will produce misunderstandings.

There are many ways to justify bad notations, and I will not take up this fight on blogs. If what I've said is not enough, I would have to explain many basic things about logic principles, axioms of set theory, probability, statistical models, sigma-fields, and so on and so forth. These subjects are all interconnected; they are part of the literacy required to comprehend statistics while avoiding many misunderstandings. Of course, it is possible to have a kind of comprehension without these tools, but that comprehension will be poorer than it could be.

24. A lovely definition in this post — just like the one I learnt 35 years ago from a mathematician at NASA. In the intervening years, I wonder what happened to my profession (clinical psychology) that such rubbish now is taught. Chris Fonnesbeck is correct in observing that the curricula in most places are 30 years too old. But it is more than that, at least here; the curriculum has evolved to reward clever use of packages in commercial software that are almost never understood in terms of fundamentals. I regularly re-explain the ideas around what independent variables are and how these are not like dependent variables. Barely lecture 1 of basic stats. A classic example in NZ is the use, in my former workplace (Corrections), of multiple measures of risk of reoffending, where these are treated additively — they are NOT independent. So thank you; I am working through your 3rd edition book, which is wonderful, and I enjoy your posts — they help remind me I wasn't going insane. One small aside: I'd love to see some evidence, or even a post, around the USA being the greatest country in the world — if you ever come to NZ, maybe this claim will seem a little closer to the null hypothesis than you presently think :) L

  25. No offense, but this could be the most arrogant academic article I’ve ever read, even if your argument is completely correct. Educate people instead of criticizing them. I imagine you’d be the type of academic to write to that columnist that they were wrong about the Monty Hall problem.
