My favorite definition of statistical significance

From my 2009 paper with Weakliem:

Throughout, we use the term statistically significant in the conventional way, to mean that an estimate is at least two standard errors away from some “null hypothesis” or prespecified value that would indicate no effect present. An estimate is statistically insignificant if the observed value could reasonably be explained by simple chance variation, much in the way that a sequence of 20 coin tosses might happen to come up 8 heads and 12 tails; we would say that this result is not statistically significantly different from chance. More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance. Standard error is a measure of the variation in an estimate and gets smaller as a sample size gets larger, converging on zero as the sample increases in size.

I like that. I like that we get right into statistical significance, we don’t waste any time with p-values, we give a clean coin-flipping example, and we directly tie it into standard error and sample size.
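
To see the arithmetic, here is a quick R sketch (a rough check, using the standard error under the null hypothesis of 50 percent; the observed-proportion version gives essentially the same number):

phat <- 8 / 20                  # observed proportion of heads: 0.4
se   <- sqrt(0.5 * 0.5 / 20)    # standard error under the null of 0.5: about 0.11
(phat - 0.5) / se               # about -0.9, i.e., less than two standard errors from 50 percent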

P.S. Some questions were raised in discussion, so just to clarify: I’m not saying the above (which was published in a magazine, not a technical journal) is a comprehensive or precise definition; I just think it gets the point across in a reasonable way for general audiences.

101 thoughts on “My favorite definition of statistical significance”

  1. “we give a clean coin-flipping example”

    Examples are super useful in education!

    More so if you use them regularly, so people already know what you are talking about from earlier discussions.

    And even more so if you can connect this to other things you want to teach people about.

    The only thing to keep in mind though is that you can only use the coin-flipping example a few times, otherwise it’s called bullying.

  2. Hmm. I’m surprised you still like this definition, Andrew. I don’t like the focus on “chance,” because I think it seduces people into thinking that a significant result is “not by chance,” i.e., probably “real.” I wonder what others will say.

    • I don’t object necessarily to the use of “chance”, but I think the explanation would be better if it left out the sentence,

      “More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance.”

      That sentence gets too fuzzy and subject to misinterpretation.

    • I agree that using the word “chance” is easily misleading. I was just reading an otherwise interesting article on peer review (see link below) and came across this sentence: “It may look convincing if an experiment yields a result that would have had only a 1 per cent probability of happening purely by chance”, a classic misinterpretation of p (sorry for using the letter p…) as the probability that chance alone produced the result.

      https://www.the-tls.co.uk/articles/public/the-end-of-an-error-peer-review

  3. “we don’t waste any time with p-values, … and we directly tie it into standard error and sample size.” I do it this way too; the emphasis is on the statistical significance of the difference between what was observed and what would be expected were the data generated by a process approximately described by the test hypothesis.

  4. I think you can take it further to demonstrate the replication problem with some studies. Some studies do the equivalent of performing the 20-toss experiment many times, and then selectively presenting the rarest result that has occurred.

  5. I don’t like that chance word at all (although sometimes I do use it too…). Think about what it means in this case. Say you flip a coin n = 3 times and it must always land heads or tails. In that case there are 8 possible outcomes of your flips:

    > n = 3
    > m = expand.grid(rep(list(c("H", "T")), n))
    > m
    Var1 Var2 Var3
    1 H H H
    2 T H H
    3 H T H
    4 T T H
    5 H H T
    6 T H T
    7 H T T
    8 T T T

    For each row we count the number of heads:
    > nHead = apply(m, 1, function(x) length(which(x == "H")))
    > nHead
    [1] 3 2 2 1 2 1 1 0

    Then count up how many times 3, 2, 1, 0 appeared and divide by the total number of possibilities:
    > pHead = table(nHead)/nrow(m)
    > pHead
    nHead
    0 1 2 3
    0.125 0.375 0.375 0.125

    This is the binomial distribution when p = .5:
    > dbinom(0:n, n, 0.5)
    [1] 0.125 0.375 0.375 0.125

    So where is “chance” in this? Chance is something we have encoded into the model. Here it is the assumption that there is no reason to prefer any of the 8 possibilities over another. E.g., why did outcome 4 happen rather than outcome 7? No idea.

    If we change the model, we change the meaning of “chance” as well. Saying “the outcome could clearly have occurred by chance” is just a confusing way of saying “the model predicted this outcome was highly likely”.

      • I think a lot of the time the model is pretty much “magic” to the end-user. Having no formal training in stats, it was to me until I derived that case for this post. If this is a standard thing I’d love to see someone modify the code for cases where p != .5. Do you somehow weight each possible outcome, or what? Perhaps it requires a totally different derivation, not dependent on the number of possible outcomes?

        • I don’t know what kind of derivation you are expecting, but it seems to me the very definition of the binomial distribution answers your question.

          http://mathworld.wolfram.com/BinomialDistribution.html

          The formula counts how many equivalent outcomes there are (how many permutations with 0 heads, how many with 1 head, etc.) [this is the binomial coefficient] and “weights” them, as you say, according to their probability [this is the p^n * (1-p)^(N-n) factor]. Of course when p=1-p=0.5 the second term is constant and equal to 1/2^N (all the counts are normalised by dividing by the total number of possible outcomes, nrow(m) in your previous comment).

        • and “weights” them, as you say, according to their probability [this is the p^n * (1-p)^(N-n) factor]

          I am asking to derive this using steps like the ones I posted (which I, of course, find very intuitive). I.e., there are 8 possible outcomes and we have some reason to prefer some over others because heads is more/less likely than tails.

          If the probability of T and H are not equal, is it possible to derive the binomial distribution by assigning weights (other than 1, as in my example) to each of these possible outcomes:

          > m
          Var1 Var2 Var3
          1 H H H
          2 T H H
          3 H T H
          4 T T H
          5 H H T
          6 T H T
          7 H T T
          8 T T T

          What I am thinking of has that as the starting point.

        • Sorry, but I’m not able to work it all out at the moment. Is it as trivial as using something like:

          1: p(H)*p(H)*p(H)
          2: (1-p(H))*p(H)*p(H)

        • If the probability of T and H are not equal, is it possible to derive the binomial distribution by assigning weights (other than 1, as in my example) to each of these possible outcomes:

          Here is an example with p=0.3, if this is not enough I don’t really know how to make it more explicit:

          outcome weight
          H H H p^3 = 0.027
          T H H p^2*(1-p) = 0.063
          H T H p^2*(1-p) = 0.063
          T T H p*(1-p)^2 = 0.147
          H H T p^2*(1-p) = 0.063
          T H T p*(1-p)^2 = 0.147
          H T T p*(1-p)^2 = 0.147
          T T T (1-p)^3 = 0.343

          Count the outcomes with 0 heads (1) and multiply by the weight (0.343) => 0.343
          Count the outcomes with 1 head (3) and multiply by the weight (0.147 in all cases) => 0.441
          Count the outcomes with 2 heads (3) and multiply by the weight (0.063 in all cases) => 0.189
          Count the outcomes with 3 heads (1) and multiply by the weight (0.027) => 0.027

          Note that the count of outcomes with k=0,1,2,3 heads out of n flips is the binomial coefficient.
          Alternatively, you could use the equivalent procedure:

          Sum the weights for the outcomes with 0 heads ( 0.343 ) => 0.343
          Sum the weights for the outcomes with 1 head ( 0.147 + 0.147+ 0.147 = 3 * 0.147 ) => 0.441
          Sum the weights for the outcomes with 2 heads ( 0.063 + 0.063 + 0.063 = 3 * 0.063 ) => 0.189
          Sum the weights for the outcomes with 3 heads ( 0.027 ) => 0.027

          > dbinom(0:3, 3, 0.3)
          [1] 0.343 0.441 0.189 0.027

        • Just for the record, I *know* how to make it more explicit ;-)

          outcome weight
          H H H p*p*p = p^3 = 0.027
          T H H (1-p)*p*p = p^2*(1-p) = 0.063
          H T H p*(1-p)*p = p^2*(1-p) = 0.063
          T T H (1-p)*(1-p)*p = p*(1-p)^2 = 0.147
          H H T p*p*(1-p) = p^2*(1-p) = 0.063
          T H T (1-p)*p*(1-p) = p*(1-p)^2 = 0.147
          H T T p*(1-p)*(1-p) = p*(1-p)^2 = 0.147
          T T T (1-p)*(1-p)*(1-p) = (1-p)^3 = 0.343
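
          In R, the same weighted count can be computed directly from the table of outcomes (a quick sketch; the object names just mirror the earlier code):

          p <- 0.3
          m <- expand.grid(rep(list(c("H", "T")), 3))                      # the 8 possible outcomes
          w <- apply(m, 1, function(x) prod(ifelse(x == "H", p, 1 - p)))   # weight of each outcome
          nHead <- apply(m, 1, function(x) sum(x == "H"))                  # heads in each outcome
          tapply(w, nHead, sum)                  # 0.343 0.441 0.189 0.027, matching dbinom(0:3, 3, 0.3)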

        • Thanks, I actually don’t like this since I was interpreting probability to mean selecting 1 out of n possible outcomes without preference in the p = 0.5 case. This seemed like it was allowing a deeper understanding.

          Here another layer of “chance” is included (the p^2*(1-p), etc) and we need to consider the “chance” of Heads or Tails of each individual toss as well. So “chance” is encoded into the model by looking at the probability of Heads on each toss, which isn’t enlightening.

        • And if you want to experiment… First, they don’t always land heads or tails; sometimes in the real world they land on their side. Second, try spinning instead of flipping. Third, try sliding the coin off the edge of a table.

    • This is the elaboration of what I suggested above. I think partly we are getting at the difference between a model that affirmatively predicts the observed variation, and a model in which the variation around a central value is treated as “errors.” If your own model predicts a tail likelihood of an event of 1/20, that event didn’t happen strictly “by chance.” (The “chance” is that it happened in this particular run of your experiment.) In the errors scenario—say, if the null hypothesis is that the speed of light in a given medium is invariant—I would be more comfortable saying an outlier result is or is not probably due to chance. But this breeds confusion when people conflate the scenarios.

      • The “error” perspective does make sense in some circumstances, but not in others.
        For example, in a manufacturing process where it is important to get the product as close to specification as possible, deviation from the specification is reasonably called error in that context.
        Also, if we are trying to estimate the mean of some random variable in some population, then “error” makes sense for the deviation of the sample mean from the (unknown) population mean.
        But in most cases where we are talking about individual measurements of some quantity, “error” is a misleading choice of terminology.

        • Yes. And this reminds me of something I ought to have mentioned in my comment: It is helpful to distinguish between two kinds of uncertainty, often called “epistemic” (or “epistemological”) and “aleatory”.

          Epistemic uncertainty is uncertainty from lack of knowledge (e.g., we don’t know the value of some quantity exactly; it is sometimes also applied to models: we are not certain that a model exactly fits). The term “error” does fit well here.

          Aleatory uncertainty is uncertainty that arises from randomness — not just as in tossing a die, but also in random sampling. For example, when we choose a random sample of students to estimate the average score (over the population from which the sample is taken) on a standardized exam, the scores from the sample have aleatory uncertainty. So using the term “error” for the difference between each student’s score and the sample average is misleading.

      • I think partly we are getting at the difference between a model that affirmatively predicts the observed variation, and a model in which the variation around a central value is treated as “errors.” If your own model predicts a tail likelihood of an event of 1/20, that event didn’t happen strictly “by chance.” (The “chance” is that it happened in this particular run of your experiment.)

        This is a good distinction. If you flip a coin 20 times and get all heads, that is consistent with the binomial model. If we accept that result, the model is still ok. On the other hand, if the speed of light really does vary the model claiming it can’t needs to be improved/replaced.

        I think the speed of light is a good example to work with. An interesting fact is that the standard analysis shows the model is wrong. However, people instead basically just conclude that the statistical model of the measurement error is incorrect:

        Henrion and Fischhoff, “Assessing uncertainty in physical constants.” American Journal of Physics 54, 791 (1986); https://doi.org/10.1119/1.14447

        My impression is that there hasn’t been much thought put into the model of the errors in this case.

  6. I think “the outcome could clearly have occurred by chance” is going to lead students to think that significance means the outcome is unlikely to have occurred by chance, i.e. that the posterior probability is low.

    I like the coin flip examples. I like to get students to consider how many times you have to flip so that it is even possible for the observed to be at least 2 SEs from the mean, i.e. so that power is greater than 0. Though I do reference the term ‘p-value’. Acknowledging that people don’t find probabilities intuitive, I think beginning students tend to have less intuition of SE.
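
    One way to set that exercise up in R (a quick sketch; the two-SE cutoff and the exact-test comparison are just for illustration):

    n  <- 1:10
    se <- sqrt(0.5 * 0.5 / n)            # standard error under the null of 0.5
    z  <- (1 - 0.5) / se                 # z-score if every flip comes up heads; equals sqrt(n)
    data.frame(n, se, z_all_heads = z)   # z first reaches 2 at n = 4
    2 * 0.5^(5:6)                        # exact two-sided p for all heads: 0.0625 (n = 5), 0.03125 (n = 6)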

  7. I have a model that predicts the relative frequencies of occurrence of various events _in a large number of trials_.

    I observe the result of a _single trial_. It is one of the events lying in a region of low relative frequency of occurrence under _a large number of trials_.

    Q: What can I say?

    A: Not much?

    • Point being – I don’t think a single significant (or insignificant) result is very interesting. _At most_ it gives a hint of where to look next. Most cases where a significant result is interesting are, I would argue, interesting for reasons beyond being statistically significant. The same p-value can ‘mean’ very different things.

      • Calling a single result significant only makes sense if you are using it as an indication, one which you then intend to demonstrate corresponds to an interesting ‘effect’ by _repeatedly_ ‘bringing about’ that level in further experiments (Fisher), or as an error rate in _long term_ decision making (NP).

        Any definition which doesn’t connect it to repetition requirements or long term decisions is, I think, misleading.

        • + 1. I still think Andrew Gelman’s explanation is useful though, in that it suggests a kind of filter. If we are 2 SEs means something is likely going on…look into it further. The key is to not map this (or any) arbitrary threshold into some kind of real/not-real or true/not-true dichotomy.

        • ergh, blog ate my symbols. If we are less than two SEs, all kinds of unaccounted variability could have produced observed data, best not to conclude much either way. If we are greater than two SEs, something is more likely to be going on…look into it further.

        • Yeah but I still think any attempt to compare a single realisation with a distribution is somewhat problematic unless ‘repetition’ is taken into account (which is just a way of going beyond single realisations!).

          In fact, Fisher’s requirement to ‘repeatedly bring about’ results that are unlikely under the model could be rephrased as ‘show the observed distribution of relative frequencies of events in a large number of trials looks very different to the model distribution of relative frequencies in a large number of trials’.

          In Fisher’s case we’re just saying the actual process assigns much higher frequency of occurrence to some events in a large number of trials than the model does. The two _distributions_ look quite different.

          At the level of distributions things are almost deterministic: we just compute d(Pn,Ptheta) where Pn is the empirical distribution. This difference will _always_ be large for small n, no matter whether the event that occurs is more probable/less probable than the others. Eg for a single trial it is a point mass at the observed data.
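
          As a rough numerical illustration (my own sketch, using total variation distance between the empirical distribution of fair-coin flips and the model distribution):

          set.seed(1)
          tv <- function(n) {                          # distance between Pn and the model (0.5, 0.5)
            flips <- sample(c("H", "T"), n, replace = TRUE)
            emp <- c(mean(flips == "H"), mean(flips == "T"))
            sum(abs(emp - 0.5)) / 2
          }
          sapply(c(1, 10, 100, 10000), tv)             # 0.5 for a single trial (a point mass), then typically shrinking toward 0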

        • I agree if we are working with frequency-based confidence intervals, as the distribution to which we are comparing a single realization. For various reasons, demonstrating actual frequency properties seems very very tough! Would you agree?

        • > For various reasons, demonstrating actual frequency properties seems very very tough! Would you agree?

          Yup. Firstly you can never fully verify and secondly because it requires you to collect more data/carry out replications.

          But as a guide – you need replication – it seems useful and important.

        • More practically, much of nonparametric statistics is built on the large sample properties of Pn as a plug-in estimator of the unknown distribution. No large enough sample, no good plug-in estimator of your distribution.

  8. Yes, I think a student (or anyone) is interested to know if we have learnt anything about our hypotheses and not just the estimate. This seems to remain (somewhat understandably) hidden from the description.

    • “Not just the estimate” diminishes the importance of the estimate. We need the estimate in order to answer questions such as “is the estimate big enough to tell us that the hypothesis is of any practical importance?” (e.g., is the benefit worth the cost in money, time, effort, risk, etc.)

  9. What is your favorite definition when the data are within two standard errors? We often tell students they can’t conclude the null is true, so the standard is “fail to reject the null; there is no evidence of a treatment effect” or “the difference is not statistically significant.” Seems this has in part created the non-publishing of null results and p-hacking.

    • I would say “nonsignificance” does not mean there is no evidence of a treatment effect. From our review, or more precisely from Sander Greenland: “if the p-value is less than 1, some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model (Greenland et al. 2016).”
      https://peerj.com/articles/3544

      • Thanks for a very nice paper. It is very challenging to decide how to teach these matters to introductory statistics students. I personally wonder if we should just take hypothesis testing out of introductory statistics, as it is rather clear that it is not at all philosophically a simple idea. This would be a hard sell, though, as “client departments,” many statisticians, and my consulting clients are all so focused on them.

        • Perhaps introductory statistics needs to be two (or more) courses. For example, a first course devoted to probability, exploratory statistics, and sample variability. Then a second course building on the first to develop a more thorough understanding (than what is in a typical statistics course) of what inferential frequentist statistics are and are not. (And then a third course in multilevel modeling and Bayesian statistics!)

        • I used to say that I as an EE and math major unfortunately never had statistics in college. Then I realized: while fellow students in the social sciences were taking courses with the word “statistics” in the title, I had a course in probability using Dubes “The Theory of Applied Probability” and then a course in detection and estimation using Van Trees “Detection Estimation and Modulation Theory, Part I.” That missed some of what you describe (not much EDA, little on frequentist approaches or on using these tools to address the sort of questions that the social scientists were considering, no MCMC, and no third course), but I agree: I think a structure like this could work well. Interestingly, the emphasis was on, well, probability, detection, and estimation, not on “statistics.” That freed us up from having to think about significance and the like (we thought about utilities, priors, likelihood ratios, and decisions), but it also focused us on “big problems” (radar detection and the like) and didn’t make the connection to simpler problems (you mean I can use this to understand the probability that my circuit will work, given the assumed distributions of components parameters? Or, in light of the Shewhart discussion, you mean I might be able to use this to think about what’s signal and what’s noise in the measured results I get from a process?).

          So I’m beginning to like this general idea coupled with using the process in a broad enough range of example cases and perhaps without being coupled to the word “statistics.”

    • I totally agree regarding the danger in using the language “not statistically significant”. It is too easy for people to turn this into “our data says there’s no difference”, and this is painfully commonplace when reading results sections: “____ had no effect (p>0.05)” or “there was no correlation (p>0.05)”.

      I have recently been moving more toward describing significance / non-significance in terms of the data being inconsistent with or consistent with some hypothesis. So, reject the null is interpreted as “our data are inconsistent with the hypothesis of zero difference”; fail to reject the null is “our data are not inconsistent with the hypothesis of zero difference.”

      I’m not sure yet how well this is going over for my students (I know they hate the double and triple negatives), but I like that it sounds nothing like “our data suggest that there is no difference.” We then cover the fact that data resulting in “fail to reject the null” are consistent with both the null being true, and the null being false within some bounds.

      I think the students pick up on this when it’s presented in the context of interpreting a confidence interval… we can note that zero is in the confidence interval, so in this sense our data are consistent with a population effect of zero. But, our data are also consistent with any other value in the confidence interval. Only a very narrow interval around zero can reasonably be interpreted as “our data suggest no effect (or a very small effect).”

      I don’t know if emphasizing this line of thinking would help with publication bias. If anything it suggests null results are even more ambiguous than the phrase “not significant” seems to imply.

      • Ben,
        Your proposed,
        “reject the null is interpreted as “our data are inconsistent with the hypothesis of zero difference”; fail to reject the null is “our data are not inconsistent with the hypothesis of zero difference.” ”
        does not sit well with me. I suggest as more accurate:

        “Reject the null” means “our data would be very unusual if the model (including the hypothesis of zero difference) were true” (i.e., “very unusual” as opposed to “inconsistent”).

        and

        “Fail to reject the null” means that “our data are consistent with the null hypothesis” (but emphasize that “consistent with” is not the same as “prove that it is true”).

        • Thanks Martha, I like your interpretation of “reject the null” better than mine – “inconsistent with” does make rejection sound like a stronger statement than it really is.

          I’m torn on saying that “fail to reject” means the data are “consistent with” the null being true. On the one hand, it’s certainly correct – on the other, it sounds too much like “accept the null”. I think the more painful phrase “not inconsistent with…” better reflects how I’d like such results to be treated.

          Maybe the “surprise factor” interpretation is the better way to go. It’s less precise but easier to understand. Reject => “we would be surprised to see data like this if the null were true”, FTR => “we would not be surprised to see data like this if the null were true”. That at least doesn’t sound too temptingly close to “our data suggest the null is true”.

  10. I’m surprised to read that your favorite definition of statistical significance (or more precisely, insignificance) is simply saying that

    “an estimate is not at least two standard errors away from some null hypothesis or prespecified value that would indicate no effect present”

    is equivalent to

    “the observed value could reasonably be explained by simple chance variation.”

    Does it also mean that if the estimate is at least two standard errors away then the observed value cannot reasonably be explained by simple chance variation?

    I like illustrations based on coins though (or even better, on dice: loaded dice are more plausible from a physical point of view than biased coins).

  11. Response based on The American Statistical Association’s Statement on p-Values: Context, Process, and Purpose (http://dx.doi.org/10.1080/00031305.2016.1154108):
    “2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
    Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

    And I have a simple illustration of that:
    8 heads and 12 tails as a result is statistically insignificant both for a null hypothesis (50% probability) and for an alternative hypothesis with a probability of around 30%. So, can we say it “occurred by chance”?
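
    A quick check of that illustration with exact binomial tests (a sketch; neither p-value is anywhere near the conventional 0.05):

    binom.test(8, 20, p = 0.5)$p.value   # 8 heads in 20 vs. a 50% coin: well above 0.05
    binom.test(8, 20, p = 0.3)$p.value   # 8 heads in 20 vs. a 30% coin: also well above 0.05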

  12. “I would say “nonsignificance” does not mean there is no evidence of a treatment effect.”

    I totally agree with this. P = 0.1 doesn’t mean no evidence. It just doesn’t. Not under the Fisherian significance test framework or the N-P hypothesis testing framework. For Fisher, evidence was continuous, and the threshold for significance was dependent on context. If an effect was consistently significant after repeated testing, this provided strong evidence of the effect – replication was critical. The N-P framework doesn’t provide evidence at all because they felt it was impossible to obtain evidence from a sample. Instead, the method is all about behaviors and controlling how often we make erroneous decisions. But the acceptance or rejection of a hypothesis was not the actual decision. The decision was what was done based on acceptance/rejection of a hypothesis, like implementing a public policy, and this decision must be made by balancing the risks of both types of errors, i.e., the requirement for an a priori power analysis.

    I’m a master’s student with an interest in statistical inference, in the misuse and misinterpretation of statistics, especially p-values, in causal inference, and in the history of statistics. I’m being taught in all my classes that if p>0.05 there is no evidence of an effect. This is generally in the context of examples and assessments using an observational dataset, with no power analysis, where the null hypothesis of exactly no effect is implausible. I complete assignments noting these things, not carrying out zillions of mindless hypothesis tests, and my grades have suffered. It is incredibly frustrating. People aren’t good at these things partly because they are difficult, but also because a lot of people teaching are teaching an incoherent method for interpreting p-values.

    If anyone is willing to take an enthusiastic and motivated PhD student that doesn’t have perfect grades, as described above, please let me know.

    • “I’m being taught in all my classes that if p>0.05 there is no evidence of an effect.”

      What your teachers have missed is that one can *decide* to take a particular significance level to use as a cut-off for essentially defining one’s personal criteria for what is “evidence” vs “no evidence”. But there is no God-given law as to what that cut-off should be — indeed, if one decides to use such a cut-off, sensible practice requires using whatever information is available (especially consequences of different types of errors) to choose that cut-off for a particular situation. What has happened is that the common desire to have some “authoritative” rule has (all too often) won out over good practice.

      • This suggests to me that one of the first things to do in teaching NHST is to hammer in the fact that the significance threshold is a convention – and by that fact alone it cannot reveal objective truths about the world.

    • When you say “not carrying out millions of hypothesis tests,” are you doing all the work required and then making a philosophical argument? Even if you disagree with the conclusions, it can be beneficial as a student to do the work and give the expected conclusion, perhaps with a disclaimer (“using the method as taught in class, we would conclude…”), and then add one’s own additional perspective. It will give you more credibility and hopefully improve your grades and open more doors for you.

  13. Yes, I also find “different from chance” and “occurred by chance” to be confusing. In coin-tossing experiments, the results are always by chance, even if the coin is biased. You seem to be using “chance” as a synonym for “tossing an unbiased coin”, and thereby concealing a hidden assumption about the coin.

    • Roger:

      No hidden assumptions. The probability of a coin flip landing heads is 1/2; see here. If you flip a coin 1000 times (without bouncing the coin) and get 763 heads, this can’t have come by chance; the flips have to have been rigged.
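
      To put a number on “can’t have come by chance,” here’s a quick check (a sketch):

      (763 - 500) / sqrt(1000 * 0.5 * 0.5)          # about 16.6 standard errors above the expected 500
      pbinom(762, 1000, 0.5, lower.tail = FALSE)    # probability of 763 or more heads: astronomically small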

        • Exactly. The problem is how do we define ‘chance’? If we make the effort to clearly define it, we are going to clarify a lot. I repeat what the American Statistical Association says about this:
          “2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
          Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

        • So, if your empirical/physical claim (H0) is that the coin is ‘perfect’, so heads have a 50% prob, then the result of 763 heads in 1000 flips may indicate that the data were not produced by random chance alone. However, if your empirical/physical claim (a different H0) is that the coin is loaded and heads have a 76.3% prob, then ‘763 heads from 1000 flips’ could possibly come from a ‘random chance’ procedure generating the data.
          As you can see, we need to assume some aspects of either the reality or the data in order to (try to) explain what’s happening.

        • > More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance.

          What if the observed proportion of heads was 25%, with a standard error of 11%? The data would be more than two standard errors away from the null hypothesis (necessarily true!) of 50%.

          Would you just say the the outcome could not-so-clearly have occurred by chance?

        • Ok, maybe not “loaded”, if this is a specific technical term. But you can have biased coins or conditions in coin flipping/tossing:
          https://en.wikipedia.org/wiki/Fair_coin
          https://en.wikipedia.org/wiki/Coin_flipping#Physics
          https://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair
          But this is not the real problem in your “favourite definition”. IMHO, the main problem is that the “true population” and hypotheses about it are confounded with the “observed data” and the procedures to get them.

        • Jl:

          You write, “you can have biased coins or conditions in coin flipping/tossing.” No. You can have a biased flipping procedure (most obviously by starting at a fixed position (heads or tails) and then flipping a very small number of rotations) but not a biased coin as usually conceived. Deb Nolan and I discuss this in our paper.

          Beyond this, see my P.S. above. The definition is not intended to be rigorous. I do think it conveys how the idea of statistical significance is used in practice, which was my purpose within that magazine article. The problem was with the title of this post, which did not convey the purpose of that definition.

      • This is very confusing to me. When I compute the binomial probability of 763 heads in 1000 tosses, given a fair coin (tossing procedure), what am I computing? I thought I’m computing the probability of something that *can* happen with a fair coin. You seem to be saying that there’s no way this could happen with a fair coin.

        • Alex:

          It could happen but it would take a really long time, so it would be much more plausible that the tosses are rigged or that there was some error in the data recording or transmission or that these 1000 tosses were not sequential but were selected out of a longer sequence, or some other weird thing.

        • I’m thinking something like ‘the probability of it taking a short time is low and the expected time is a long time’, no? It _could_ take a short time (minimal time relative to other sequences) it’s just unlikely…

        • Thanks.

          So the upshot is surely: you can’t explain probability in terms of ‘time taken’ if ‘time taken’ is itself explained in terms of probability, right?

          So instead of:

          > It could happen but it would take a really long time

          you may as well just say:

          It could happen but it is very improbable (under the null model).

          (And, I suppose: it is much more probable under an alternative model….)

  14. You could say that 763 can’t happen under the given assumptions, or that it can’t happen under the null hypothesis, or just that it can’t happen. But what does “by chance” add? Are you trying to say that the flips were rigged using some wholly deterministic process? That no chance was involved in getting that 763?

    That is the confusing thing about saying that something was by chance, or not by chance. It makes it sound as if the issue is whether some deterministic or random process was involved. That is not the issue. The issue is whether the outcome was a plausible consequence of the hypothesis, whether chance was involved or not.

      • I don’t think there is really anything “simple” about “chance variation” unfortunately. It’s a loaded term imbued with meaning by the audience, but ultimately not the *same* meaning for each audience member.

        It’s relatively straightforward though to say that to make such a thing come about relatively consistently requires a hidden physical mechanism other than the ones normally at work in coin flipping such as contact forces between the hand and the coin, drag forces in the air, and the contact forces as the coin hits the ground.

  15. I have a question. I am not a statistician but a psychologist who has done research in domains with both “strong” (anchoring) and “weak” (embodiment) effects. Now, what do we mean by “chance” (as in the new definition of significance)? In my understanding, “chance” means “ignorance”. Do we really assume that, in principle, coin tosses are not subject to the laws of mechanics? Probably not. If we knew all the relevant determinants, we would be able to predict the outcome just as precisely as any other mechanical result. Rather, it is our lack of knowledge (and control) that creates the error, not the quality of the phenomenon. Thus, I find it somewhat problematic to talk about “random” phenomena or to even claim that a phenomenon by itself is “not real” or “does not exist”. Instead, the contrast between systematic and error variance seems to be a contrast of knowledge versus ignorance. As a consequence, as knowledge grows, weak effects may become stronger over time. This could be called “scientific progress”. Am I completely wrong?

    • Fritz:

      Yes, chance is the part of the model that is not explained or considered predictable. In general, “chance” is defined only relative to the model, and advances in modeling and data collection can reduce the amount of variation that is labeled as chance.

      • Andrew:

        Thanks, but let’s be a bit more concrete. Do you find it appropriate to qualify a result/phenomenon that does not reach a given level of significance as “nonexistent” or “not real”? Would it not be more to the point to describe it as “not sufficiently understood”?

        • I think the Amrhein paper linked above addresses this nicely: https://peerj.com/articles/3544/

          “Non significant” results should not be interpreted as positive evidence that an effect is nonexistent, because non-significant results are nearly always consistent with both “no effect” and “some effect that was not detected at P<0.05 using our model". There is a method called equivalence testing that is capable of getting close to "accept the null" – basically it amounts to looking at the width of the interval around your estimate (see the sketch at the end of this comment). If the interval is narrow and contains zero, this could be treated as evidence for either no effect or a very small effect. This requires high power. Most of the time, "not significant" results are consistent with both no effect and at least a medium sized effect. For low power studies, "not significant" results are consistent with no effect and with a huge effect.

          I'd also hesitate to say that "non significant" results should be interpreted as "the proposed phenomenon I am investigating is not sufficiently understood", because this language seems to pre-suppose that it exists and simply needs to be understood better. I can easily make up wild hypotheses that, when tested, produce insignificant results. So the phrase "not sufficiently understood" would need to be broad enough to include "the mechanism that I think I'm testing for is non-existent".

          On the broader point regarding chance, I agree that what we call "error variance" can be thought of as representing ignorance. Our models may treat it as "pure random chance", but those are models, and reasonable people can disagree on the extent to which "pure random chance" is real. I also heed Popper's reminder that "our ignorance is sobering and boundless", and so we shouldn't take the interpretation of error variance as representing ignorance as being suggestive that we'll be able to overcome this ignorance if only we're clever and resourceful enough.
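
          Here is a minimal sketch of the equivalence-testing idea mentioned above, with made-up data; the ±0.2 margin is an arbitrary choice, and the 90% interval corresponds to two one-sided 5% tests (TOST):

          set.seed(1)
          x <- rnorm(200, mean = 0.03, sd = 1)             # hypothetical effect measurements
          margin <- 0.2                                    # equivalence bounds, chosen for illustration
          ci90 <- t.test(x, conf.level = 0.90)$conf.int    # 90% confidence interval
          ci90
          all(ci90 > -margin & ci90 < margin)              # TRUE only if the whole interval sits inside the bounds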

    • How does your view of “chance”, “ignorance”, “random”, and “not real”, fit with the following results described in this paper?:

      http://journals.sagepub.com/doi/pdf/10.1177/0956797611417632

      “An analysis of covariance (ANCOVA) revealed the predicted effect: People felt older after listening to “Hot Potato”
      (adjusted M = 2.54 years) than after listening to the control song (adjusted M = 2.06 years), F(1, 27) = 5.06, p = .033. In Study 2, we sought to conceptually replicate and extend Study 1. Having demonstrated that listening to a children’s song makes people feel older, Study 2 investigated whether listening to a song about older age makes people actually younger.”

      “An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger
      after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years),
      F(1, 17) = 4.92, p = .040.”

    • Your post here reminds me of your paper: https://www.frontiersin.org/articles/10.3389/fpsyg.2017.00702/full

      I read it, and thought it might possibly contain several crucial errors in reasoning and (re-)presentation of evidence.

      I don’t know if you are interested, but in the following thread you can read all about them. For instance:

      http://statmodeling.stat.columbia.edu/2017/09/27/somewhat-agreement-fritz-strack-regarding-replications/#comment-573477

  16. “More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance.”
    It seems to me that rather than being precise, this statement is misleading and arguably even wrong, because by saying that “the outcome could clearly have occurred by chance” it implies that this is not true of some of the other possible outcomes. In fact, any outcome *could* clearly have occurred by chance.

        • Technically there’s a 1 in 2^90 chance of getting 90 heads in a row. This is so small that virtually any other explanation is more plausible, but it’s not impossible to get 90 heads in a row, just ultra-incredibly improbable.

        • Another way to think about this. there’s a 1 in 2^90 chance of getting H,T,H,T,H,T… even though there’s exactly 50% head and 50% tails, it’s incredibly improbable to get that specific result. Basically it’s incredibly improbable to get **any** specific result.

        • > This is so small that virtually any other explanation is more plausible

          As you say in the other comment, 90 heads in a row is exactly as unlikely as any other result. This is like saying a lottery was rigged just because John (a random person you know nothing about) was the winner. But if a coin lands heads 90 times in a row (as opposed to some “random” looking sequence such as HTHHTHHTTTH…), we’d usually be right to suspect that the coin tossing setup wasn’t fair. Why? To simplify the problem, let’s suppose that there are only two possible hypotheses with non-negligible priors: either the coin is fair or somebody is purposefully controlling the outcome of the coin. Getting 90 heads in a row is much more likely under the human control hypothesis, that’s why we “reject” the fair coin hypothesis in this case. But now imagine a distant planet where people regard the “random” sequence HHTHTTHHHTH as very special, but think long sequences of heads are just as boring as THTTHTH or whatever. If somebody on this planet tosses a coin and gets HHTHTTHHHTH, this might be evidence against the fair coin hypothesis, even though the sequence looks “random”.

          See also the short paper “Remarks on ‘Random Sequences’” by Fitelson and Osherson.

        • Indeed. However, one way to think about whether a sequence is “random” or not is to use Per Martin-Löf’s definition of randomness, which is basically sequences that pass a stringent test of randomness (in his case it’s actually “the most stringent computable test” which he proves the existence of without constructing it… so it’s a purely theoretical notion)

          Stringent tests of randomness would certainly look to see (among other things) if the sequence had “the right number” of heads (that is, a number which isn’t wildly incompatible with the “fair coin” hypothesis).

          Although we might find aliens which really like the sequence HHTHTTHHHTH and in that case its presence repeatedly in the data would be suspicious, it’s not suspicious strictly from the mathematics of random sequences of length 11. However a sequence of 11 heads is certainly a bit suspicious. A sequence of 90 is astronomically unlikely and can be rejected outright.

    • The statement “the outcome could clearly have occurred by chance” really needs “alone” added at the end.

      The whole paragraph would better serve readers of the literature if it recognized that everyone’s favorite whipping boy, the P-value, can be more quickly translated into coin-tossing than any other statistic, with absolutely no implied dependence on normality as arises with “standard errors” or Z-statistic descriptions. Just take the base-2 log of 1/p and there it is, s, the number of heads in a row that would give the same p when plugging “fairness” 1/2 of a single toss into the probability of s heads in a row: p = (1/2)^s.

      So with p = 0.04 that’s s = log2(1/.04) ≈ 4.6, which is to say p = 0.04 provides about the same evidence against the model p is “testing” as about 4 or 5 heads in a row provides against the tosses being independent with probability 1/2 of heads.

      This description applies to any P-value regardless of the underlying distributions or model, as long as that P-value is approximately uniform under the model (which may be a “null model” or any other set of constraints on the data generator).
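
      In R the translation is a one-liner (a sketch; the p-values below are arbitrary examples):

      p <- c(0.5, 0.25, 0.05, 0.04, 0.005)
      s <- log2(1 / p)        # equivalently -log2(p): the heads-in-a-row equivalent of each p
      round(s, 1)             # 1.0 2.0 4.3 4.6 7.6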

      • Sander, I love your idea of the surprisal value and have passed it along several times. Do you have good examples of its use in the research literature? I’m thinking particularly of applications to non ‘point null’ models.

        • Thanks Chris. I’ve seen S-values/surprisals picked up and used in some studies, and now that you ask I wish I had kept track of some to cite.

          Regarding “non ‘point null’ models”, I’m not quite sure what you had in mind.
          The geometric interpretation of P-values and S-values I used in my 2019 TAS article at
          http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
          implies using max(p) and thus min(s) over test regions. In this Fisherian extension, p is merely the percentile at which the data fell in the distribution along the orthogonal projection of the data onto the test region, with the distribution determined by the model in the region nearest the data, i.e., the model that is the point of projection. In maximum-likelihood fitting, this model reduces to the model in the region nearest the data according to the Kullback-Leibler information criterion; other fitting methods correspond to other divergence criteria (e.g., ordinary least squares corresponds to Euclidean distance).

          As a technical footnote, this Fisherian extension to test regions differs from the UMPU extension of P-values and does not suffer from the “incoherence” problem raised in some P-value criticisms (e.g., Schervish TAS 1996), that one can have a higher p for a region contained inside another with lower p. I think of that problem with the UMPU extension as an artefact of letting Neyman-Pearson fixed-alpha test considerations distort direct geometric (Fisherian) considerations, e.g., that test regions closer to the data should have higher P-values and thus lower surprisals than more distant test regions.
