From my 2009 paper with Weakliem:

Throughout, we use the term statistically significant in the conventional way, to mean that an estimate is at least two standard errors away from some “null hypothesis” or prespecified value that would indicate no effect present. An estimate is statistically insignificant if the observed value could reasonably be explained by simple chance variation, much in the way that a sequence of 20 coin tosses might happen to come up 8 heads and 12 tails; we would say that this result is not statistically significantly different from chance. More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance. Standard error is a measure of the variation in an estimate and gets smaller as a sample size gets larger, converging on zero as the sample increases in size.

I like that. I like that we get right into statistical significance, we don’t waste any time with p-values, we give a clean coin-flipping example, and we directly tie it into standard error and sample size.
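The arithmetic behind the coin example can be checked in a few lines of R. This is only a sketch: the variable names are made up for illustration, and it assumes the standard error is computed under the null value of 0.5 (plugging in the observed 0.4 instead also gives about 11 percent).

```r
# Coin example from the quoted passage: 8 heads in 20 tosses, null value 0.5
n_tosses <- 20
n_heads  <- 8
p_null   <- 0.5

p_hat <- n_heads / n_tosses                      # observed proportion of heads
se    <- sqrt(p_null * (1 - p_null) / n_tosses)  # standard error under the null, about 0.11
z     <- (p_hat - p_null) / se                   # distance from the null, in standard errors

# The standard error shrinks like 1/sqrt(n) as the sample grows
se_by_n <- sqrt(p_null * (1 - p_null) / c(20, 200, 2000, 20000))

z        # less than 2 in absolute value
se_by_n
```

The estimate sits less than one standard error from 50 percent, which is the sense in which the passage calls it not statistically significant.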

**P.S.** Some questions were raised in discussion, so just to clarify: I’m not saying the above (which was published in a magazine, not a technical journal) is a comprehensive or precise definition; I just think it gets the point across in a reasonable way for general audiences.

“we give a clean coin-flipping example”

Examples are super useful in education!

More so if you use them regularly so people already know what you are talking about and have talked about.

And even more so if you can connect this to other things you want to teach people about.

The only thing to keep in mind though is that you can only use the coin-flipping example a few times, otherwise it’s called bullying.

Hmm. I’m surprised you still like this definition, Andrew. I don’t like the focus on “chance,” because I think it seduces people into thinking that a significant result is “not by chance,” i.e., probably “real.” I wonder what others will say.

I don’t object necessarily to the use of “chance”, but I think the explanation would be better if it left out the sentence,

“More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance.”

That sentence gets too fuzzy and subject to misinterpretation.

I agree that using the word “chance” is easily misleading. I was just reading an otherwise interesting article on peer review (see link below) and came across this sentence: “It may look convincing if an experiment yields a result that would have had only a 1 per cent probability of happening purely by chance”, a classic misinterpretation of p (sorry for using the letter p…) as the probability that chance alone produced the result.

https://www.the-tls.co.uk/articles/public/the-end-of-an-error-peer-review

“we don’t waste any time with p-values, … and we directly tie it into standard error and sample size.” I do it this way too, the emphasis is on the statistical significance of differences between the aspect observed and what would be expected were the data generated by a process approximately described according to the test hypothesis.

I think you can take it further to demonstrate the replication problem with some studies. Some studies do the equivalent of performing the 20-toss experiment many times, and then selectively presenting the rarest result that has occurred.
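A rough simulation of that selective-reporting scenario, as a sketch only. The number of repeated experiments (100) and the seed are arbitrary assumptions, and "rarest result" is read here as the run farthest from 50 percent in standard-error units.

```r
set.seed(1)            # arbitrary seed, only so the run is reproducible
n_tosses      <- 20
n_experiments <- 100   # assumed number of repeated 20-toss experiments

# Every experiment uses a fair coin, so the null hypothesis is true throughout
heads <- rbinom(n_experiments, size = n_tosses, prob = 0.5)
se    <- sqrt(0.5 * 0.5 / n_tosses)
z     <- (heads / n_tosses - 0.5) / se

# What a selective reporter would present: the single most extreme run
most_extreme <- heads[which.max(abs(z))]

# Exact chance that one experiment lands 2 or more SEs from 0.5 ...
k        <- 0:n_tosses
p_single <- sum(dbinom(k, n_tosses, 0.5)[abs(k / n_tosses - 0.5) / se >= 2])
# ... and that at least one of the n_experiments does
p_any    <- 1 - (1 - p_single)^n_experiments

c(p_single = p_single, p_any = p_any)
```

With enough repetitions, a "significant" run is close to guaranteed even though nothing but a fair coin is involved.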

This is spot on.

When I’m teaching I often have every student doing something like this and recording results on the board or in a Google sheet. Then we can do this.

I don’t like that chance word at all (although sometimes I do use it too…). Think about what it means in this case. Say you flip a coin n = 3 times and it must always land heads or tails. In that case there are 8 possible outcomes of your flips:

> n = 3

> m = expand.grid(rep(list(c("H", "T")), n))

> m

Var1 Var2 Var3

1 H H H

2 T H H

3 H T H

4 T T H

5 H H T

6 T H T

7 H T T

8 T T T

For each row we count the number of heads:

> nHead = apply(m, 1, function(x) length(which(x == "H")))

> nHead

[1] 3 2 2 1 2 1 1 0

Then count up how many times 3, 2, 1, 0 appeared and divide by the total number of possibilities

> pHead = table(nHead)/nrow(m)

> pHead

nHead

0 1 2 3

0.125 0.375 0.375 0.125

This is the binomial distribution when p = 0.5:

> dbinom(0:n, n, 0.5)

[1] 0.125 0.375 0.375 0.125

So where is “chance” in this? Chance is something we have encoded into the model. Here it is the assumption that there is no reason to prefer any of the 8 possibilities more than another. Eg, why did outcome 4 happen rather than outcome 7? No idea.

If we change the model, we change the meaning of “chance” as well. Saying “the outcome could clearly have occurred by chance” is just a confusing way of saying “the model predicted this outcome was highly likely”.

+1

Indeed, all too often the dependence of outcomes on the *model* is slipped under the rug.

I think a lot of the time the model is pretty much “magic” to the end-user. Having no formal training in stats, until I derived that case for the post it was to me. If this is a standard thing I’d love to see someone modify the code for cases where p != .5. Do you somehow weight each possible outcome or what? Perhaps it requires a totally different derivation not dependent on number of possible outcomes?

I don’t know what kind of derivation you are expecting, but it seems to me the very definition of the binomial distribution answers your question.

http://mathworld.wolfram.com/BinomialDistribution.html

The formula counts how many equivalent outcomes there are (how many permutations with 0 heads, how many with 1 head, etc.) [this is the binomial coefficient] and “weights” them, as you say, according to their probability [this is the p^n * (1-p)^(N-n) factor]. Of course when p=1-p=0.5 the second term is constant and equal to 1/2^N (all the counts are normalised dividing by the total number of possible outcomes, nrow(m) in your previous comment).

I am asking to derive this using steps like the one I posted (which I, of course, find very intuitive). Ie there are 8 possible outcomes and we have some reason to prefer some over others because heads is more/less likely than tails.

If the probability of T and H are not equal, is it possible to derive the binomial distribution by assigning weights (other than 1, as in my example) to each of these possible outcomes:

What I am thinking of has that as the starting point.

Sorry, but not able to work it all out at the moment. Is it so utterly trivial as using something like:

1: p(H)*p(H)*p(H)

2: (1-p(H))*p(H)*p(H)

If the probability of T and H are not equal, is it possible to derive the binomial distribution by assigning weights (other than 1, as in my example) to each of these possible outcomes:

Here is an example with p=0.3, if this is not enough I don’t really know how to make it more explicit:

outcome weight

H H H p^3 = 0.027

T H H p^2*(1-p) = 0.063

H T H p^2*(1-p) = 0.063

T T H p*(1-p)^2 = 0.147

H H T p^2*(1-p) = 0.063

T H T p*(1-p)^2 = 0.147

H T T p*(1-p)^2 = 0.147

T T T (1-p)^3 = 0.343

Count the outcomes with 0 heads (1) and multiply by the weight (0.343) => 0.343

Count the outcomes with 1 heads (3) and multiply by the weight (0.147 in all cases) => 0.441

Count the outcomes with 2 heads (3) and multiply by the weight (0.063 in all cases) => 0.189

Count the outcomes with 3 heads (1) and multiply by the weight (0.027) => 0.027

Note that the count of outcomes with k=0,1,2,3 heads out of n flips is the binomial coefficient.

Alternatively, you could use the equivalent procedure:

Sum the weights for the outcomes with 0 heads ( 0.343 ) => 0.343

Sum the weights for the outcomes with 1 head ( 0.147 + 0.147+ 0.147 = 3 * 0.147 ) => 0.441

Sum the weights for the outcomes with 2 heads ( 0.063 + 0.063 + 0.063 = 3 * 0.063 ) => 0.189

Sum the weights for the outcomes with 3 heads ( 0.027 ) => 0.027

> dbinom(0:3, 3, 0.3)

[1] 0.343 0.441 0.189 0.027
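For anyone who wants the weighted version of the earlier enumeration as code, here is one possible sketch. It reuses the names m and nHead from the earlier comment; the weight vector w and the use of tapply are illustrative choices, not anything from the original posts.

```r
n <- 3
p <- 0.3   # probability of heads, as in the table above

# All 2^n sequences of heads and tails
m <- expand.grid(rep(list(c("H", "T")), n), stringsAsFactors = FALSE)
nHead <- rowSums(m == "H")

# Weight of each sequence: a factor p for every head, (1 - p) for every tail
w <- p^nHead * (1 - p)^(n - nHead)

# Sum the weights over all sequences with the same number of heads
pHead <- tapply(w, nHead, sum)
pHead

# Agrees with the built-in binomial probabilities
all.equal(as.vector(pHead), dbinom(0:n, n, p))  # TRUE
```

Setting p to 0.5 makes every weight equal to 1/8, which recovers the count-and-divide calculation from the equal-probability case.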

Just for the record, I *know* how to make it more explicit ;-)

outcome weight

H H H p*p*p = p^3 = 0.027

T H H (1-p)*p*p = p^2*(1-p) = 0.063

H T H p*(1-p)*p = p^2*(1-p) = 0.063

T T H (1-p)*(1-p)*p = p*(1-p)^2 = 0.147

H H T p*p*(1-p) = p^2*(1-p) = 0.063

T H T (1-p)*p*(1-p) = p*(1-p)^2 = 0.147

H T T p*(1-p)*(1-p) = p*(1-p)^2 = 0.147

T T T (1-p)*(1-p)*(1-p) = (1-p)^3 = 0.343

Thanks, I actually don’t like this since I was interpreting probability to mean selecting 1 out of n possible outcomes without preference in the p = 0.5 case. This seemed like it was allowing a deeper understanding.

Here another layer of “chance” is included (the p^2*(1-p), etc) and we need to consider the “chance” of Heads or Tails of each individual toss as well. So “chance” is encoded into the model by looking at the probability of Heads on each toss, which isn’t enlightening.

And if you want to experiment … first they don’t always land head or tail, sometimes in the real world they land on their side. Second, try spinning instead of flipping. Third try sliding off the edge of a table.

+1

One nice thing about spinning is that certain coins don’t give an equal probability of heads or tails.

This is the elaboration of what I suggested above. I think partly we are getting at the difference between a model that affirmatively predicts the observed variation, and a model in which the variation around a central value is treated as “errors.” If your own model predicts a tail likelihood of an event of 1/20, that event didn’t happen strictly “by chance.” (The “chance” is that it happened in this particular run of your experiment.) In the errors scenario—say, if the null hypothesis is that the speed of light in a given medium is invariant—I would be more comfortable saying an outlier result is or is not probably due to chance. But this breeds confusion when people conflate the scenarios.

The “error” perspective does make sense in some circumstances, but not in others.

For example, in a manufacturing process where it is important to get the product as close to specification as possible, deviation from the specification is reasonably called error in that context.

Also, if we are trying to estimate the mean of some random variable in some population, then “error” makes sense for the deviation of the sample mean from the (unknown) population mean.

But in most cases where we are talking about individual measurements of some quantity, “error” is a misleading choice of terminology.

+1

And don’t forget measurement error.

Yes. And this reminds me of something I ought to have mentioned in my comment: It is helpful to distinguish between two kinds of uncertainty, often called “epistemic” (or “epistemological”) and “aleatory”.

Epistemic uncertainty is uncertainty from lack of knowledge (e.g., we don’t know the value of some quantity exactly; it is sometimes also applied to models: we are not certain that a model exactly fits). The term “error” does fit well here.

Aleatory uncertainty is uncertainty that arises from randomness, not just as in tossing a die but also in random sampling. For example, when we choose a random sample of students to estimate the average score (over the population from which the sample is taken) on a standardized exam, the scores from the sample have aleatory uncertainty. So using the term “error” for the difference between each student’s score and the sample average is misleading.

This is a good distinction. If you flip a coin 20 times and get all heads, that is consistent with the binomial model. If we accept that result, the model is still ok. On the other hand, if the speed of light really does vary the model claiming it can’t needs to be improved/replaced.

I think the speed of light is a good example to work with. An interesting fact is the standard analysis shows the model is wrong. However, instead people basically just think the statistical model of the measurement error is incorrect:

Assessing uncertainty in physical constants. Henrion and Fischhoff. American Journal of Physics 54, 791 (1986); https://doi.org/10.1119/1.14447

My impression is that there hasn’t been much thought put into the model of the errors in this case.

I think “the outcome could clearly have occurred by chance” is going to lead students to think that significance means the outcome is unlikely to have occurred by chance, i.e. that the posterior probability is low.

I like the coin flip examples. I like to get students to consider how many times you have to flip so that it is even possible for the observed to be at least 2 SEs from the mean, i.e. so that power is greater than 0. Though I do reference the term ‘p-value’. Acknowledging that people don’t find probabilities intuitive, I think beginning students tend to have less intuition of SE.
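The "how many flips" exercise can be worked out directly. The sketch below assumes a null value of 0.5 and takes the most extreme possible outcome to be all heads (or all tails); the names are illustrative.

```r
p_null <- 0.5
n      <- 1:10

# Standard error of the proportion under the null, for each sample size
se <- sqrt(p_null * (1 - p_null) / n)

# The most extreme possible result is a proportion of 1 (or 0),
# so the largest achievable distance from the null in SE units is:
max_z <- (1 - p_null) / se   # equals sqrt(n) when p_null is 0.5

# Smallest number of flips for which 2 SEs is reachable at all
min_n <- min(n[max_z >= 2])
min_n
```

With three or fewer flips, no outcome whatsoever can be two standard errors from 50 percent, so the procedure cannot flag anything.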

+1

Without p-values, how do you explain the choice of two standard errors? Isn’t it motivated by a p=0.05 two-sided test?

I may be wrong, but my guess is that the idea of “being outside a certain interval” preceded the idea of p-value.

I agree. The choice of 2 standard errors, rather than 1 or 3 or 6, develops out of a probability assessment.
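The link between "two standard errors" and 0.05 is easy to see numerically under a normal approximation, which is the assumption behind this sketch (variable names are illustrative):

```r
# Two-sided tail area beyond 1, 2, and 3 standard errors under a normal model
tail_area <- 2 * pnorm(-c(1, 2, 3))
names(tail_area) <- c("1 SE", "2 SE", "3 SE")
tail_area            # the 2 SE entry is a bit under 0.05

# Multiplier that gives exactly 5% in the two tails combined
exact_cutoff <- qnorm(0.975)
exact_cutoff         # about 1.96, conventionally rounded to 2
```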

I have a model that predicts the relative frequencies of occurance of various events _in a large number of trials_.

I observe the result of a _single trial_. It is one of the events lying in a region of low relative frequency of occurance under _a large number of trials_.

Q: What can I say?

A: Not much?

…occurrence…

Point being – I don’t think a single significant (or insignificant) result is very interesting. _At most_ it gives a hint of where to look next. Most cases where a significant result is interesting are, I would argue, interesting for reasons beyond being statistically significant. The same pvalue can ‘mean’ very different things.

Calling a single result significant only makes sense if you are using it as an indication that you intend to follow up, either by demonstrating that it corresponds to an interesting ‘effect’ by _repeatedly_ ‘bringing about’ that level in further experiments (Fisher), or as an error rate in _long term_ decision making (NP).

Any definition which doesn’t connect it to repetition requirements or long term decisions is, I think, misleading.

+ 1. I still think Andrew Gelman’s explanation is useful though, in that it suggests a kind of filter. If we are 2 SEs means something is likely going on…look into it further. The key is to not map this (or any) arbitrary threshold into some kind of real/not-real or true/not-true dichotomy.

ergh, blog ate my symbols. If we are less than two SEs, all kinds of unaccounted variability could have produced observed data, best not to conclude much either way. If we are greater than two SEs, something is more likely to be going on…look into it further.

Yeah but I still think any attempt to compare a single realisation with a distribution is somewhat problematic unless ‘repetition’ is taken into account (which is just a way of going beyond single realisations!).

In fact, Fisher’s requirement to ‘repeatedly bring about’ results that are unlikely under the model could be rephrased as ‘show the observed distribution of relative frequencies of events in a large number of trials looks very different to the model distribution of relative frequencies in a large number of trials’.

In Fisher’s case we’re just saying the actual process assigns much higher frequency of occurrence to some events in a large number of trials than the model does. The two _distributions_ look quite different.

At the level of distributions things are almost deterministic: we just compute d(Pn,Ptheta) where Pn is the empirical distribution. This difference will _always_ be large for small n, no matter whether the event that occurs is more probable/less probable than the others. Eg for a single trial it is a point mass at the observed data.
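To make the d(Pn, Ptheta) point concrete, here is a small simulation sketch. The comment does not say which distance d is meant, so total variation distance is assumed here, and the model is taken to be the number of heads in 3 fair tosses; both choices, the seed, and the function name are illustrative.

```r
set.seed(2)            # arbitrary seed, only so the run is reproducible
n_tosses <- 3
model    <- dbinom(0:n_tosses, n_tosses, 0.5)   # Ptheta: model probabilities

# Total variation distance between the empirical distribution Pn
# from n_trials simulated trials and the model distribution
tv_distance <- function(n_trials) {
  x         <- rbinom(n_trials, n_tosses, 0.5)
  empirical <- tabulate(x + 1, nbins = n_tosses + 1) / n_trials
  0.5 * sum(abs(empirical - model))
}

trial_sizes <- c(1, 10, 100, 10000)
d <- sapply(trial_sizes, tv_distance)
names(d) <- trial_sizes
d
```

For a single trial the empirical distribution is a point mass, so the distance is one minus the model probability of whatever was observed, which is large whichever outcome occurs. It only shrinks as trials accumulate.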

I agree if we are working with frequency-based confidence intervals, as the distribution to which we are comparing a single realization. For various reasons, demonstrating actual frequency properties seems very very tough! Would you agree?

> For various reasons, demonstrating actual frequency properties seems very very tough! Would you agree?

Yup. Firstly you can never fully verify and secondly because it requires you to collect more data/carry out replications.

But as a guide – you need replication – it seems useful and important.

More practically, much of nonparametric statistics is built on the large sample properties of Pn as a plug-in estimator of the unknown distribution. No large enough sample, no good plug-in estimator of your distribution.

Yes, I think a student (or anyone) is interested to know if we have learnt anything about our hypotheses and not just the estimate. This seems to remain (somewhat understandably) hidden from the description.

“Not just the estimate” diminishes the importance of the estimate. We need the estimate in order to answer questions such as “is the estimate big enough to tell us that the hypothesis is of any practical importance?” (e.g., is the benefit worth the cost in money, time, effort, risk, etc.)

What is your favorite definition when the data are within two standard errors? We often tell students they can’t conclude the null is true, so the standard is “fail to reject the null; there is no evidence of a treatment effect” or “the difference is not statistically significant”. It seems this has in part created the non-publication of null results and p-hacking.

I would say “nonsignificance” does not mean there is no evidence of a treatment effect. From our review, respectively from Sander Greenland: “if the p-value is less than 1, some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model (Greenland et al. 2016).”

https://peerj.com/articles/3544

Thanks for a very nice paper. It is very challenging to decide how to teach these matters to introductory statistics students. I personally wonder if we should just take hypothesis testing out of introductory statistics, as it is rather clear that it is not at all a philosophically simple idea. This would be a hard sell though, as “client departments”, many statisticians, and my consulting clients are so focused on hypothesis tests.

Perhaps introductory statistics needs to take two (or more) courses. For example, a first course devoted to probability, exploratory statistics, and sample variability. Then a second course building on the first to develop a more thorough understanding (than what is in a typical statistics course) of what inferential frequentist statistics are and are not. (And then a third course in multilevel modeling and Bayesian statistics!)

I used to say that I as an EE and math major unfortunately never had statistics in college. Then I realized: while fellow students in the social sciences were taking courses with the word “statistics” in the title, I had a course in probability using Dubes “The Theory of Applied Probability” and then a course in detection and estimation using Van Trees “Detection Estimation and Modulation Theory, Part I.” That missed some of what you describe (not much EDA, little on frequentist approaches or on using these tools to address the sort of questions that the social scientists were considering, no MCMC, and no third course), but I agree: I think a structure like this could work well. Interestingly, the emphasis was on, well, probability, detection, and estimation, not on “statistics.” That freed us up from having to think about significance and the like (we thought about utilities, priors, likelihood ratios, and decisions), but it also focused us on “big problems” (radar detection and the like) and didn’t make the connection to simpler problems (you mean I can use this to understand the probability that my circuit will work, given the assumed distributions of components parameters? Or, in light of the Shewhart discussion, you mean I might be able to use this to think about what’s signal and what’s noise in the measured results I get from a process?).

So I’m beginning to like this general idea coupled with using the process in a broad enough range of example cases and perhaps without being coupled to the word “statistics.”

I totally agree regarding the danger in using the language “not statistically significant”. It is too easy for people to turn this into “our data says there’s no difference”, and this is painfully commonplace when reading results sections: “____ had no effect (p>0.05)” or “there was no correlation (p>0.05)”.

I have recently been moving more toward describing significance / non-significance in terms of the data being inconsistent with or consistent with some hypothesis. So, reject the null is interpreted as “our data are inconsistent with the hypothesis of zero difference”; fail to reject the null is “our data are not inconsistent with the hypothesis of zero difference.”

I’m not sure yet how well this is going over for my students (I know they hate the double and triple negatives), but I like that it sounds nothing like “our data suggest that there is no difference.” We then cover the fact that data resulting in “fail to reject the null” are consistent with both the null being true, and the null being false within some bounds.

I think the students pick up on this when it’s presented in the context of interpreting a confidence interval… we can note that zero is in the confidence interval, so in this sense our data are consistent with a population effect of zero. But, our data are also consistent with any other value in the confidence interval. Only a very narrow interval around zero can reasonably be interpreted as “our data suggest no effect (or a very small effect).”
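The confidence-interval reading can be illustrated with the 8-heads-in-20 coin example from the post. This is a sketch under assumptions: a rough plus-or-minus 2 SE interval, the "effect" defined as the difference from 0.5, and illustrative variable names.

```r
n_tosses <- 20
p_hat    <- 8 / n_tosses
se       <- sqrt(p_hat * (1 - p_hat) / n_tosses)   # plug-in standard error

# "Effect" here means the difference from a fair coin
effect <- p_hat - 0.5
ci     <- effect + c(-2, 2) * se                   # rough 95% interval
ci

# Zero is inside the interval, but so are many non-zero effects
zero_inside <- ci[1] < 0 && 0 < ci[2]
width       <- diff(ci)
c(zero_inside = zero_inside, width = width)
```

The interval covers zero but is more than 40 percentage points wide, so the data are also consistent with sizeable effects in either direction. That is quite different from "our data suggest no effect".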

I don’t know if emphasizing this line of thinking would help with publication bias. If anything it suggests null results are even more ambiguous than the phrase “not significant” seems to imply.

Ben,

Your proposed,

“reject the null is interpreted as “our data are inconsistent with the hypothesis of zero difference”; fail to reject the null is “our data are not inconsistent with the hypothesis of zero difference.” “

does not sit well with me. I suggest as more accurate:

“Reject the null” means “our data would be very unusual if the model (including the hypothesis of zero difference) were true” (i.e., “very unusual” as opposed to “inconsistent”)

and

“Fail to reject the null” means that “our data are consistent with the null hypothesis” (but emphasize that “consistent with” is not the same as “prove that it is true”).

Thanks Martha, I like your interpretation of “reject the null” better than mine – “inconsistent with” does make rejection sound like a stronger statement than it really is.

I’m torn on saying that “fail to reject” means the data are “consistent with” the null being true. On the one hand, it’s certainly correct – on the other, it sounds too much like “accept the null”. I think the more painful phrase “not inconsistent with…” better reflects how I’d like such results to be treated.

Maybe the “surprise factor” interpretation is the better way to go. It’s less precise but easier to understand. Reject => “we would be surprised to see data like this if the null were true”, FTR => “we would not be surprised to see data like this if the null were true”. That at least doesn’t sound too temptingly close to “our data suggest the null is true”.

Yes, I can see your point about using the “surprise factor” in both cases.

I’m surprised to read that your favorite definition of statistical significance (or more precisely, insignificance) is simply saying that

“an estimate is not at least two standard errors away from some null hypothesis or prespecified value that would indicate no effect present”

is equivalent to

“the observed value could reasonably be explained by simple chance variation.”

Does it also mean that if the estimate is at least two standard errors away then the observed value cannot reasonably be explained by simple chance variation?

I like illustrations based on coins though (or even better, on dice: loaded dice are more plausible from a physical point of view than biased coins).

Response based on The American Statistical Association’s Statement on p-Values: Context, Process, and Purpose (http://dx.doi.org/10.1080/00031305.2016.1154108):

“2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

And I have a straightforward illustration for that:

8 heads and 12 tails is a statistically insignificant result both for a null hypothesis of 50% probability of heads and for an alternative hypothesis with probability around 30%. So, can we say it “could have occurred by chance”?
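This can be checked with exact binomial tests. The sketch uses binom.test from base R's stats package; the 30% value comes from the comment above, and the variable names are illustrative.

```r
# Same data, two different hypothesized probabilities of heads
p_value_fair <- binom.test(8, 20, p = 0.5)$p.value
p_value_low  <- binom.test(8, 20, p = 0.3)$p.value

c(p0_0.5 = p_value_fair, p0_0.3 = p_value_low)

# Neither hypothesis is rejected at the conventional 0.05 level
c(p_value_fair > 0.05, p_value_low > 0.05)
```

The same 8 heads are unsurprising under either hypothesis, so "occurred by chance" only has a meaning once a particular model is named.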

“I would say “nonsignificance” does not mean there is no evidence of a treatment effect.”

I totally agree with this. P = 0.1 doesn’t mean no evidence. It just doesn’t. Not under the Fisherian significance test framework or the N-P hypothesis testing framework. For Fisher, evidence was continuous, and the threshold for significance was dependent on context. If an effect was consistently significant after repeated testing, this provided strong evidence of the effect – replication was critical. The N-P framework doesn’t provide evidence at all because they felt it was impossible to obtain evidence from a sample. Instead the method is all about behaviors and controlling how often we make erroneous decisions. But the acceptance or rejection of a hypothesis was not the actual decision. The decision was what was done based on acceptance/rejection of a hypothesis, like implementing a public policy, and this decision must be made by balancing the risks of both types of errors, i.e. requirement for a priori power analysis.

I’m a master’s student with an interest in statistical inference, in the misuse and misinterpretation of statistics, especially p-values, in causal inference, and in the history of statistics. I’m being taught in all my classes that if p>0.05 there is no evidence of an effect. This is generally in the context of examples and assessments using an observational dataset, with no power analysis, where the null hypothesis of exactly no effect is implausible. I complete assignments noting these things, not carrying out zillions of mindless hypothesis tests, and my grades have suffered. It is incredibly frustrating. People aren’t good at these things because they are difficult, but also because a lot of people teaching are teaching an incoherent method to interpret p-values.

If anyone is willing to take an enthusiastic and motivated PhD student that doesn’t have perfect grades, as described above, please let me know.

“I’m being taught in all my classes that if p>0.05 there is no evidence of an effect.”

What your teachers have missed is that one can *decide* to take a particular significance level to use as a cut-off for essentially defining one’s personal criteria for what is “evidence” vs “no evidence”. But there is no God-given law as to what that cut-off should be; indeed, if one decides to use such a cut-off, sensible practice requires using whatever information is available (especially consequences of different types of errors) to choose that cut-off for a particular situation. What has happened is that the common desire to have some “authoritative” rule has (all too often) won out over good practice.

This suggests to me that one of the first things to do in teaching NHST is to hammer in the fact that the significance threshold is a convention – and by that fact alone it cannot reveal objective truths about the world.

Maybe cite one of Gelman’s papers in your next assignment. Or better yet, cite Neyman and Pearson!

When you say “not carrying out millions of hypothesis tests” are you doing all the work required and then making a philosophical argument? Even if you disagree with the conclusions, it can be beneficial as a student to do the work and give the expected conclusion, perhaps with a disclaimer (“using the method as taught in class, we would conclude…”); then you could add your own additional perspective. It will give you more credibility and hopefully improve your grades and open more doors for you.

Yes, I also find “different from chance” and “occurred by chance” to be confusing. In a coin-tossing experiment, the results are always by chance, even if the coin is biased. You seem to be using “chance” as a synonym for “tossing an unbiased coin”, and thereby concealing a hidden assumption about the coin.

Roger:

No hidden assumptions. The probability of a coin flip landing heads is 1/2; see here. If you flip a coin 1000 times (without bouncing the coin) and get 763 heads, this can’t have come by chance; the flips have to have been rigged.

So for you ‘chance’ here means according to a particular null distribution?

And this is an empirical/physical claim in the case of a coin?

Exactly. The problem is how do we define ‘chance’? If we make the effort to clearly define it, we are going to clarify a lot. I repeat what the American Statistical Association says about this:

“2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

So, if your empirical/physical claim (H0) is that the coin is ‘perfect’, so heads have a 50% prob, then the result of 763 heads in 1000 flips may indicate that the process to generate the data was not produced by random chance. However, if your empirical/physical claim (a different H0) is that the coin is loaded and heads have 76.3% prob, then ‘763 heads from 1000’ flips could possibly come from a ‘random chance’ procedure generating the data.

As you can see, we need to assume some aspects of either the reality or the data in order to (try to) explain what’s happening.

Jl:

There is no such thing as a coin that is loaded and heads have a 76.3% prob. You can load a die, but you can’t bias a coin (when flipped and caught in the air).

> More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance.

What if the observed proportion of heads was 25%, with a standard error of 11%? The data would be more than two standard errors away from the null hypothesis (necessarily true!) of 50%.

Would you just say the outcome could not-so-clearly have occurred by chance?

Ok, maybe not “loaded”, if this is a specific technical term. But you can have biased coins or conditions in coin flipping/tossing:

https://en.wikipedia.org/wiki/Fair_coin

https://en.wikipedia.org/wiki/Coin_flipping#Physics

https://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair

But this is not the real problem in your “favourite definition”. IMHO, the main problem is that the “true population” and hypotheses about it are conflated with the “observed data” and the procedures used to get them.

Jl:

You write, “you can have biased coins or conditions in coin flipping/tossing.” No. You can have a biased flipping procedure (most obviously by starting at a fixed position (heads or tails) and then flipping a very small number of rotations) but not a biased coin as usually conceived. Deb Nolan and I discuss this in our paper.

Beyond this, see my P.S. above. The definition is not intended to be rigorous. I do think it conveys how the idea of statistical significance is used in practice, which was my purpose within that magazine article. The problem was with the title of this post, which did not convey the purpose of that definition.

This is very confusing to me. When I compute the binomial probability of 763 heads in 1000 tosses, given a fair coin (tossing procedure), what am I computing? I thought I’m computing the probability of something that *can* happen with a fair coin. You seem to be saying that there’s no way this could happen with a fair coin.

Alex:

It could happen but it would take a really long time, so it would be much more plausible that the tosses are rigged or that there was some error in the data recording or transmission or that these 1000 tosses were not sequential but were selected out of a longer sequence, or some other weird thing.

The idea that it would ‘take a long time’ is fairly murky right? In what sense does time enter into probability statements?

Ojm,

It depends on how long it takes to flip a coin, I suppose. I was imagining each flip took a few seconds.

I suggest revising your statement to say, “It could happen but it *could* take a really long time, so …”

I’m thinking something like ‘the probability of it taking a short time is low and the expected time is a long time’, no? It _could_ take a short time (minimal time relative to other sequences) it’s just unlikely…
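A rough sketch of that reading, in Python: treat each non-overlapping block of 1000 fair flips as one trial, so the number of blocks until the first one with at least 763 heads is geometric and its mean is one over the tail probability. The block framing, the helper names, and the 3 seconds per flip are all assumptions made for illustration.

```python
import math

def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def upper_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), summed on the log scale."""
    logs = [log_binom_pmf(k, n, p) for k in range(k_min, n + 1)]
    largest = max(logs)
    return math.exp(largest) * sum(math.exp(v - largest) for v in logs)

SECONDS_PER_FLIP = 3.0                 # assumed: "a few seconds" per flip
SECONDS_PER_YEAR = 365.25 * 24 * 3600

p_block = upper_tail(763, 1000, 0.5)   # chance that any single block succeeds
expected_blocks = 1 / p_block          # mean of the geometric waiting time
expected_years = expected_blocks * 1000 * SECONDS_PER_FLIP / SECONDS_PER_YEAR

print(f"P(at least 763 heads in one block) = {p_block:.2e}")
print(f"expected waiting time = {expected_years:.2e} years")
```

The very first block succeeds with probability `p_block`, so it could take a short time; the "really long time" is just the reciprocal of that same probability, rescaled by the flipping speed.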

Ojm’s suggestion sounds reasonable to me.

Thanks.

So the upshot is surely: you can’t explain probability in terms of ‘time taken’ if ‘time taken’ is itself explained in terms of probability, right?

So instead of:

> It could happen but it would take a really long time

you may as well just say:

It could happen but it is very improbable (under the null model).

(And, I suppose: it is much more probable under an alternative model….)

These were my thoughts, too. That bringing in the time factor does not really add to the explanation, and might even be confusing.

You could say that 763 can’t happen under the given assumptions, or that it can’t happen under the null hypothesis, or just that it can’t happen. But what does “by chance” add? Are you trying to say that the flips were rigged using some wholly deterministic process? That no chance was involved in getting that 763?

That is the confusing thing about saying that something was by chance, or not by chance. It makes it sound as if the issue is whether some deterministic or random process was involved. That is not the issue. The issue is whether the outcome was a plausible consequence of the hypothesis, whether chance was involved or not.

Roger:

763 could not happen by chance alone. As in the quote above, 763 could not “reasonably be explained by simple chance variation.”

I don’t think there is really anything “simple” about “chance variation” unfortunately. It’s a loaded term imbued with meaning by the audience, but ultimately not the *same* meaning for each audience member.

It’s relatively straightforward though to say that to make such a thing come about relatively consistently requires a hidden physical mechanism other than the ones normally at work in coin flipping such as contact forces between the hand and the coin, drag forces in the air, and the contact forces as the coin hits the ground.

+1 to first paragraph.

I have a question. I am not a statistician but a psychologist who has done research in domains with both “strong” (anchoring) and “weak” (embodiment) effects. Now, what do we mean by “chance” (as in the new definition of significance)? In my understanding, “chance” means “ignorance”. Do we really assume that in principle, coin tosses are not subject to the laws of mechanics? Probably not. If we knew all the relevant determinants, we would be able to predict the outcome just as precisely as any other mechanical result. Rather, it is our lack of knowledge (and control) that creates the error, not the quality of the phenomenon. Thus, I find it somewhat problematic to talk about “random” phenomena or to even claim that a phenomenon by itself is “not real” or “does not exist”. Instead, the contrast between systematic and error variance seems to be a contrast of knowledge versus ignorance. As a consequence, as knowledge grows, weak effects may become stronger over time. This could be called “scientific progress”. Am I completely wrong?

Fritz:

Yes, chance is the part of the model that is not explained or considered predictable. In general, “chance” is defined only relative to the model, and advances in modeling and data collection can reduce the amount of variation that is labeled as chance.

Andrew:

Thanks, but let’s be a bit more concrete. Do you find it appropriate to describe a result/phenomenon that does not reach a given level of significance as “nonexistent” or “not real”? Would it not be more to the point to describe it as “not sufficiently understood”?

I think the Amrhein paper linked above addresses this nicely: https://peerj.com/articles/3544/

"Non significant" results should not be interpreted as positive evidence that an effect is nonexistent, because non-significant results are nearly always consistent with both "no effect" and "some effect that was not detected at P<0.05 using our model". There is a method called equivalence testing that is capable of getting close to "accept the null"; basically it amounts to looking at the width of the interval around your estimate. If the interval is narrow and contains zero, this could be treated as evidence for either no effect or a very small effect. This requires high power. Most of the time, "not significant" results are consistent with both no effect and at least a medium-sized effect. For low-power studies, "not significant" results are consistent with no effect and with a huge effect.
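As a rough illustration of the interval-width idea, the Python sketch below compares an estimate plus or minus two standard errors against a margin for what would count as a negligible effect. It is a simplified interval check, not the formal equivalence-testing procedure; the function name `classify`, the margin, and the example numbers are all made up.

```python
def classify(estimate: float, se: float, margin: float, z: float = 2.0) -> str:
    """Label a result by where its +/- z*se interval sits relative to zero and the margin."""
    lower = estimate - z * se
    upper = estimate + z * se
    if -margin < lower and upper < margin:
        # Narrow interval lying entirely inside the "negligible" band.
        return "no effect or a very small effect"
    if lower > 0 or upper < 0:
        # Interval excludes zero and is not confined to the negligible band.
        return "effect detected"
    # Interval contains zero but also reaches beyond the margin.
    return "inconclusive: consistent with no effect and with a sizeable one"

MARGIN = 0.1  # assumed smallest effect size worth caring about

high_power = classify(estimate=0.01, se=0.02, margin=MARGIN)
low_power = classify(estimate=0.30, se=0.40, margin=MARGIN)
clear_effect = classify(estimate=0.50, se=0.10, margin=MARGIN)

print(high_power)
print(low_power)
print(clear_effect)
```

Both of the first two examples are "not significant", but only the precise one says anything about the effect being small; the low-power one is compatible with zero and with an effect many times the margin.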

I'd also hesitate to say that "non significant" results should be interpreted as "the proposed phenomenon I am investigating is not sufficiently understood", because this language seems to pre-suppose that it exists and simply needs to be understood better. I can easily make up wild hypotheses that, when tested, produce insignificant results. So the phrase "not sufficiently understood" would need to be broad enough to include "the mechanism that I think I'm testing for is non-existent".

On the broader point regarding chance, I agree that what we call "error variance" can be thought of as representing ignorance. Our models may treat it as "pure random chance", but those are models, and reasonable people can disagree on the extent to which "pure random chance" is real. I also heed Popper's reminder that "our ignorance is sobering and boundless", and so we shouldn't take the interpretation of error variance as representing ignorance as being suggestive that we'll be able to overcome this ignorance if only we're clever and resourceful enough.

How does your view of “chance”, “ignorance”, “random”, and “not real”, fit with the following results described in this paper?:

http://journals.sagepub.com/doi/pdf/10.1177/0956797611417632

“An analysis of covariance (ANCOVA) revealed the predicted effect: People felt older after listening to “Hot Potato” (adjusted M = 2.54 years) than after listening to the control song (adjusted M = 2.06 years), F(1, 27) = 5.06, p = .033. In Study 2, we sought to conceptually replicate and extend Study 1. Having demonstrated that listening to a children’s song makes people feel older, Study 2 investigated whether listening to a song about older age makes people actually younger.”

“An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.”

Your post here reminds me of your paper: https://www.frontiersin.org/articles/10.3389/fpsyg.2017.00702/full

I read it, and thought it might possibly contain several crucial errors in reasoning and (re-)presentation of evidence.

I don’t know if you are interested, but in the following thread you can read all about them. For instance:

http://andrewgelman.com/2017/09/27/somewhat-agreement-fritz-strack-regarding-replications/#comment-573477

Also from Strack’s paper:

“Science progresses through critical discourse, and this is what must be revived again”

Good quote.