## He’s a history teacher and he has a statistics question

Someone named Ian writes:

I am a History teacher who has become interested in statistics! The main reason for this is that I’m reading research papers about teaching practices to find out what actually “works.”

I’ve taught myself the basics of null hypothesis significance testing, though I confess I am no expert (Maths was never my strong point at school!). But I also came across your blog after I heard about this “replication crisis” thing.

I wanted to ask you a question, if I may.

Suppose a randomised controlled experiment is conducted with two groups and the mean difference turns out to be statistically significant at the .05 level. I’ve learnt from my self-study that this means:

“If there were genuinely no difference in the population, the probability of getting a result this big or bigger is less than 5%.”

So far, so good (or so I thought).

But from my recent reading, I’ve gathered that many people criticise studies for using “small samples.” What was interesting to me is that they criticise this even after a significant result has been found.

So they’re not saying “Your sample size was small so that may be why you didn’t find a significant result.” They’re saying: “Even though you did find a significant result, your sample size was small so your result can’t be trusted.”

I was just wondering whether you could explain why one should distrust significant results with small samples? Some people seem to be saying it’s because it may have been a chance finding. But isn’t that what the p-value is supposed to tell you? If p is less than 0.05, doesn’t that mean I can assume it (probably) wasn’t a “chance finding”?

My reply: See my paper, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” recently published in the Personality and Social Psychology Bulletin. The short answer is that (a) it’s not hard to get p less than 0.05 just from chance, via forking paths, and (b) when effect sizes are small and a study is noisy, any estimate that reaches “statistical significance” is likely to be an overestimate, perhaps a huge overestimate.

1. We need to make it accessible for more folks to grasp that statistics “works” with respect to an inexhaustible collection of inquiries rather than any or every particular individual inquiry. Sometimes good inquiry will mislead but if good inquiry is persisted in – you can expect to eventually be corrected.

Here is one attempt to do that from an (informed) Bayesian perspective https://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/ (the simulation more than the prose, I think)

Perhaps the next most dangerous words after “a new study shows” are “we did a small noisy study but got a convincing result”.

2. Josh says:

There’s also the concern that a small sample may not adequately represent the target population so you cannot extrapolate from your sample to the population. This is of course possible in a large sample but is more common in a small sample.

3. yyw says:

For an effect to be statistically significant in a small sample study, the effect size often has to be implausibly large. Exercising common sense can filter out many of these. As a teacher, you are best equipped to judge if some simple intervention can plausibly improve student performance by a large amount. Err on the side of being too skeptical. If you are looking for ways to improve your own classroom teaching, I guess it doesn’t hurt to try innovations that you think might work if they are not costly and unlikely to do harm. If you are making a policy decision that might impact many people, consult a good statistician.
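To put a rough number on "implausibly large", here is a minimal sketch. It assumes an equal-variance two-sample t-test with the same number of subjects in each group, and expresses the observed difference in units of the pooled standard deviation. The helper name `min_sig_d` is made up for this illustration.

```r
# Smallest observed standardized difference (difference in means divided by
# the pooled SD) that reaches two-sided p < alpha in an equal-variance
# two-sample t-test with n subjects per group.
min_sig_d <- function(n, alpha = 0.05) {
  qt(1 - alpha / 2, df = 2 * n - 2) * sqrt(2 / n)
}

n_per_group <- c(5, 10, 20, 50, 200, 1000)
thresholds <- sapply(n_per_group, min_sig_d)
print(data.frame(n_per_group, min_significant_d = round(thresholds, 2)))
```

Under these assumptions, a study with 10 subjects per group only reaches significance if the observed difference is about 0.94 pooled standard deviations, which is a very large effect for most classroom interventions.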

4. JM says:

One need not necessarily attribute a lack of confidence in small studies to forking paths or p-hacking/HARKing. Giving the authors the presumption of innocence in these, I usually teach my medical/public health students that threats remain in both bias and confounding. Randomisation requires large numbers to work; in a small randomised study, confounding will almost certainly persist and in general is not controlled because of the assumption of ceteris paribus from adequate randomization. Bias in small samples (recruitment, representativeness) will also almost certainly operate in the small-sample, particularly if chosen on convenience or availability, unless demonstrable measures were taken to control….and can’t be fixed in the analysis

5. Michael J Lew says:

This is a good question because it is based on a misunderstanding common among statistics commenters. The probability of obtaining P<0.05 when the null hypothesis is true is _exactly the same_ for small samples and large samples. (Assuming the model is appropriate and the samples are random, etc.)

Small samples do not predispose towards small P-values. The calculation of the P-value takes into account the uncertainty in a calibrated manner at any size of sample. Small samples and large samples are just as likely to explore garden paths. The issue with small samples is that many small sample results with P<0.05 overestimate the true effect or underestimate the true variability. The significance filter exaggeration machine works best with small samples, but that does not mean that there is a greater than 5% chance of a P<0.05 result with a small sample.
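A minimal sketch of this claim, assuming the model is exactly right: both groups are drawn from the same normal distribution and the equal-variance t-test is used. The function name, seed, sample sizes, and number of replications are choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Fraction of simulated two-group experiments with p < 0.05 when both groups
# are drawn from the same normal distribution (the null is exactly true).
false_positive_rate <- function(n, reps = 10000) {
  p <- replicate(reps, t.test(rnorm(n), rnorm(n), var.equal = TRUE)$p.value)
  mean(p < 0.05)
}

rates <- sapply(c(4, 40, 400), false_positive_rate)
print(rates)
```

When these assumptions hold, the rejection rate should sit near 0.05 at every sample size, up to simulation noise. The simulations further down the thread show what happens when the normality assumption is dropped.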

• Peter Chapman says:

At last someone is talking sense.

• You’re right, it’s not the smallness of the samples, it’s the inappropriateness of an assumption of unbiased random sampling from a particular distribution, which for various reasons including asymptotic distributional assumptions that don’t hold and failure to actually sample randomly becomes particularly problematic in small samples.

• Björn says:

And of course people are tempted to run lots of little studies. Even if they work with exactly pre-specified methods/hypotheses, this still means that most of these are negative (no power for realistic effect sizes) and disappear (journals are not so likely to publish small failed trials), while slightly over 5% are “statistically significant” and get published. Since you can do a lot of small trials, this floods the scientific literature with small trials that overestimate any true effect that might be there (the tested interventions might well do next to nothing).
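The "slightly over 5%" figure can be checked with a quick power calculation. This is only a sketch: the true effect of 0.1 standard deviations and the group sizes are assumed values chosen for illustration, and `power_at` is a made-up helper around R's `power.t.test`.

```r
# Power of a two-sided, two-sample t-test for a true effect of 0.1 SD.
# strict = TRUE counts rejections in both tails, which matters when power
# is close to the significance level.
power_at <- function(n) {
  power.t.test(n = n, delta = 0.1, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided",
               strict = TRUE)$power
}

n_per_group <- c(10, 50, 1000)
print(data.frame(n_per_group, power = round(sapply(n_per_group, power_at), 3)))
```

With 10 subjects per group the power barely exceeds the 5% significance level, so nearly all such trials come out "negative" even though the assumed effect is real.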

6. Anonymous says:

>”I’ve taught myself the basics of null hypothesis significance testing”

This was your first mistake. https://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf

• Kyle C says:

GREAT paper. I just emailed it to my daughter, a college freshman. Thank you.

7. Mayo says:

Statistical significance tests are often criticized for their dependence on sample size (as we see in the famous Jeffreys-Lindley paradox or Bayes-Fisher disagreement). Suppose the significance level is fixed. As the sample size increases, the cut-off for rejecting the null hypothesis gets closer and closer to the null. Many recommend lowering the required P-value (or significance level cut-off) with increasing sample size. That’s fine, but we can always avoid inferring a discrepancy beyond what’s warranted (with a confidence bound or severity assessment).
An observed difference that is just statistically significant at level p indicates less of a discrepancy from the null hypothesis if it results from a larger rather than a smaller sample size–IF the assumptions of the test are approximately met. That’s a big if. What people suspect with small sample sizes, and rightly so, is that the assumptions are not met, and the result was cherry-picked or otherwise due to biasing selection effects.

8. Peter Chapman says:

I found the original question quite straightforward but some of the responses quite confusing. In situations like these I tend to wonder what a simulation exercise would demonstrate.

We have a simple situation with two groups and we wish to know whether the (true) means are different. If we simulate a scenario in which the (true) means are identical we will get a p<0.05 in 5% of runs, and this will be the case whether we have a large or a small sample.

If we simulate a scenario in which the (true) means are different we will get p<0.05 in more than 5% of runs; the actual percentage will depend upon the true difference and how big it is relative to underlying variation. The expected value of the difference E(xbar1 – xbar2) is mu1-mu2, and for a straightforward situation the simulated observed differences will be spread symmetrically around this value. So, no overestimation.

So, the answer to the history teacher is that the theory works fine whatever the sample size. Much of the discussion goes beyond the scope of the question and focuses on whether or not the theory breaks down in the presence of small samples. I'm not convinced by some of these arguments. For example, are small samples more likely to be unrepresentative of the population? One could argue that every sample in which the observed mean and variance are different from the true mean and variance (ie every sample) is unrepresentative, but the theory accommodates this situation just fine.

Of course, if you don't randomise properly the theory breaks down completely.

• That’s right, it’s guaranteed by the theory, after all here we have an example using a proper random number generator and everything that will prove it!

d=data.frame(x=replicate(10000,t.test(rexp(4),rexp(4))$p.value))
sum(d$x < 0.05)/NROW(d)
[1] 0.0246

Hmm… must have been unlucky, let's do it 100k times instead.

d=data.frame(x=replicate(100000,t.test(rexp(4),rexp(4))$p.value))
sum(d$x < 0.05)/NROW(d)
[1] 0.02592

• On the other hand, suppose I use a sample of 40 items and take those means?

d=data.frame(x=replicate(100000,t.test(rexp(40),rexp(40))$p.value))
sum(d$x < 0.05)/NROW(d)
[1] 0.04731

Hunh, pretty close, but still seems off a little. ok, now how about 400 samples, that's still "small" right?

d=data.frame(x=replicate(100000,t.test(rexp(400),rexp(400))$p.value))
sum(d$x < 0.05)/NROW(d)
[1] 0.04883

That's really not doing a lot better… let's try 4000

d=data.frame(x=replicate(100000,t.test(rexp(4000),rexp(4000))$p.value))
sum(d$x < 0.05)/NROW(d)
[1] 0.05004

Ok, so as long as by "independent of sample size" we mean "for samples greater than about 4000 items carefully randomly selected using a validated random number generator from a distribution not too weird compared to an exponential distribution"

Just as Andrew has said that the biggest assumptions in linear regression are not related to the normality of the errors but the appropriateness of fitting a line, so too, the biggest assumptions in testing random samples for differences in parameters are

1) That you actually have a process that behaves like random sampling from a fixed distribution.
2) That the shape of the distribution is not too far away from whatever you are assuming.

But both of those are routinely violated, and in real life as well as in the mathematical (asymptotic) theory, the violations are related to sample size.

• Anonymous says:

I think your simulations are off because you used the default var.equal = F argument.

• The null is exactly true here. I’m simulating from a single distribution two separate samples. Variance *is* equal.

• aha. now I see what you mean, the default is to assume the variances are *not* equal. All this does is reduce the power of the test though. The real reason the percentages are off is because I’m not simulating samples from two normal distributions, I’m simulating from exponential distributions. In the asymptotic realm of large samples, the CLT holds for the mean, but for samples less than several hundred the CLT fails to be good enough.

The point here was really that essentially *every* time we do a test of two means on real data, the data is *not* from a normal distribution, and it’s not even that hard to find wildly non-normal data (like say incomes or number of alcoholic drinks consumed in the last 30 days, or whatever you ask people in social sciences)
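To separate the two explanations raised here (the `var.equal` default versus the non-normal data), here is a minimal sketch that runs both versions of the test on the same simulated exponential samples. The function name, seed, and number of replications are choices made for this illustration.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Rejection rate at the 0.05 level when both samples come from the same
# exponential distribution, for Welch's test (R's default) and the pooled
# (var.equal = TRUE) test, using the same simulated data for both.
compare_tests <- function(n, reps = 10000) {
  p <- replicate(reps, {
    x <- rexp(n); y <- rexp(n)
    c(welch  = t.test(x, y)$p.value,
      pooled = t.test(x, y, var.equal = TRUE)$p.value)
  })
  rowMeans(p < 0.05)
}

res <- compare_tests(4)
print(res)
```

With equal group sizes the two tests share the same t statistic and differ only in degrees of freedom, so Welch's test can never reject more often than the pooled test on the same data. Any remaining gap between the pooled rate and 0.05 would then be attributable to the exponential data rather than to the choice of test.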

• It brings up an interesting point, where you don’t get away from Bayesian thinking even in a full on Frequentist analysis.

Which one do you put, var.equal = TRUE or FALSE? Remember, in reality we aren’t calling the same function twice… we’re sampling from say incomes in Indiana and incomes in Missouri. Are they the same exact distribution? Do they have the same variance? We don’t know. We can test a particular hypothesis, but since we have no reason to believe that the variance is necessarily exactly the same, what should we use as our “null”? People act as if “The null hypothesis” is a relatively clear-cut thing in most cases. It’s not.

• Corey says:

The expected value of the difference E(xbar1 – xbar2) is mu1-mu2, and for a straightforward situation the simulated observed differences will be spread symmetrically around this value. So, no overestimation.

Try simulations that estimate E[abs(xbar1 – xbar2 ) | p < 0.05] for various effect sizes and sample sizes. (Or just reason out how it will look — I imagine you're perfectly capable of that.)

• library(ggplot2)
# Simulate one two-group experiment: x has mean 1, y has mean 1+d.
experiment = function(n,d){
x=rexp(n,1);y=rexp(n,1/(1+d)); p=t.test(x,y)$p.value
return(list(x=mean(x),y=mean(y),p=p))
}

mydata=data.frame(x=c(),y=c(),p=c())

for(i in 1:10000){mydata=rbind(mydata,experiment(4,.1))}

# Grey: differences from runs with p < 0.05; red: differences from all runs.
ggplot(mydata[mydata$p < 0.05,])+geom_histogram(aes(y-x,..density..),alpha=.5)+geom_histogram(aes(y-x,..density..),data=mydata,fill="red",alpha=.5)

Red is the real distribution of difference in means, grey is the distribution of difference in means given p less than 0.05. I hope it doesn’t eat my code.

• The function experiment draws x and y from rexp with the mean value being 1 for x and 1+d for y. In my example d=0.1 so we’re comparing two random number generators that have mean 1 and mean 1.1. For those who aren’t going to run the code, typical values of the magnitude of the difference in means when p is less than 0.05 are around 1-2, or about 10 to 20 times larger than the real difference of 0.1.

• Corey says:

By doing this with the exponential distribution you’re mixing two different critiques (significance filter and distributional assumptions).

Frankly I find the distributional assumption critique to be by far the more important one for the individual user to know because it’s easy for a lone practitioner to compensate for the significance filter pitfall if they’re aware of it.

• Realistically the distributional assumption is violated every time in an unknown way. It’s important to know about. Plus the significance filter is a major problem too. So it’s good in some sense to mix both because in reality that’s always what happens.

• Keith O’Rourke says:

Daniel one way to make that clearer in conversations is to add some words like these.

“If there were genuinely no difference in the population [and all ancillary assumptions are correct], the [assessment of the] probability of getting a result this big or bigger is less than 5%.”

• Malte Lau Petersen says:

In this case the substantive results are the same for sampling from a normal. Effect size conditional on p<=0.05 is 10-20 times too large.

experiment = function(n,d){
x = rnorm(n, 0);
y = rnorm(n, 0 + d);
p = t.test(x,y)$p.value
return(list(x=mean(x),y=mean(y),p=p))
}
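The function above is defined but never run in the comment. A minimal sketch of one way to use it follows, estimating how much the significant results exaggerate the true difference at several sample sizes. The `exaggeration` helper, the seed, the sample sizes, and the number of replications are additions for this illustration, and `experiment` is restated so the block runs on its own.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

# Same setup as the normal-data experiment above: two groups of size n whose
# true means differ by d (SD = 1 in both groups).
experiment <- function(n, d) {
  x <- rnorm(n, 0)
  y <- rnorm(n, d)
  c(diff = mean(y) - mean(x), p = t.test(x, y)$p.value)
}

# Mean |estimated difference| among significant runs, divided by the true d.
exaggeration <- function(n, d, reps = 5000) {
  sims <- replicate(reps, experiment(n, d))
  sig <- sims["p", ] < 0.05
  mean(abs(sims["diff", sig])) / d
}

sizes <- c(4, 40, 400, 4000)
ratios <- sapply(sizes, exaggeration, d = 0.1)
print(data.frame(n = sizes, exaggeration = round(ratios, 1)))
```

A ratio near 1 means the significant estimates are about the right size; a ratio far above 1 means the significance filter is selecting estimates that overstate the true difference.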

9. Thomas says:

An interesting read on this question here:
Richard M. Royall. The Effect of Sample Size on the Meaning of Significance Tests. The American Statistician, Vol. 40, No. 4 (Nov., 1986), pp. 313-315.

10. AS says:

I am not sure people are really answering Ian’s question. Assuming that questionable research practices (QRPs like publication bias, p-hacking, forking paths etc.), biases and confounds are not in play, then the long term error rate associated with the described Null Hypothesis Significance Testing approach is independent of sample size – repeated studies with 50 participants will draw a false positive conclusion about 5% of the time and repeated studies with 100000 participants will draw a false positive conclusion about 5% of the time. The argument behind NHST holds equally in both cases, as do the assumptions on which that argument is based. We certainly don’t need large numbers to make randomisation ‘work’ – the probabilistic underpinnings take account of sample size.

Perhaps studies with 100000 participants are less likely to suffer from QRPs, may be more likely to be externally valid and controlled to avoid post-randomisation confounds and biases than studies of 50; but small sample size is not in and of itself a reason to criticise a study. If the effect being researched is likely to be quite substantial (say teaching one group proper CPR and giving another group a ‘distractor’ activity so their knowledge of CPR is no more than folk knowledge; tested using a test sensitive to CPR rhythm, pressure, safety checks etc.) then small samples may be reasonable – though equally one might argue that really large effects should be so visible that we don’t really need an RCT to confirm them!

So we should be cautious about small sample studies, not because the sample size affects NHST, but because of the social processes involved in research: a larger proportion of non-significant small sample studies probably go unpublished, small sample researchers as a whole may have a weaker understanding of methods and their underpinnings (and so inadvertently use QRPs), less thought may go into external validity, and so on. But we shouldn’t just say (as too many of my students say when critiquing literature), ‘the sample was too small’.

• Austin Fournier says:

Hold on, there’s a statistical reason to disagree with this. We’re not, in this case, interested in whether a future study of a false hypothesis will nevertheless turn up statistically significant. We’re looking at a statistically significant result and trying to figure out whether it’s a false positive or a true positive (unless the person in question is genuinely concerned with something other than whether to believe the hypothesis – which I would find strange). It should be noted that these are calculated rather differently.

Anyhow, it turns out that significant studies in low-power fields are more likely to be false positives, by Bayes’ theorem. I have an explanation below, or you could look up “Positive Predictive Value” (this is the formal name for the probability that a positive result is a true positive).

I definitely agree that QRPs and the file drawer are more of a worry for small samples though.

Alternatively, if we’re looking at the effect estimates the studies are producing rather than the results of the statistical tests per se, small samples just won’t cut it because of the imprecision (plus upward bias from file drawer / garden of forking paths type stuff).

11. jrkrideau says:

I was just wondering whether you could explain why one should distrust significant results with small samples?

Let’s say you are investigating something to do with height in a high school.

You sample some small number of male students (20?) and by sheer chance, let’s say you get 3 members of the men’s basketball team.

This is likely to distort your sample mean and standard deviation as compared to the “real” mean and s.d.

If you have a much larger sample those basketball players will not have much effect.

Or you are looking at the mean wealth of US citizens. If you have a small sample and accidentally include Bill Gates or Mark Zuckerberg or alternatively include too many people living in tents on the streets of Los Angeles, oops!

• Clark says:

I think this comment clearly gets to the point of the issue. There may be lots of other side-issues, but a sample which is not representative of the population is at the core of the problem.

Sometimes there are other demographic measures available which can be adjusted for in a regression model which could help to reduce this type of bias. Further, controlling for “prognostic” variables may also reduce the residual error as well, helping to make the effect of interest more obvious. For instance, when studying a treatment for children, it can be beneficial to control for age or weight.

• All of these criticisms are actually at their heart equivalent to my criticism above, they arise because the distribution is *not normal* so there are outliers that dramatically affect the small sample statistics in ways that are not part of the assumptions allowable in the design of typical tests.

Eventually, with a large enough sample things like the CLT take hold, but that large enough sample can be anything from several tens (20-40 samples) to several thousands (2000-4000). Worst case you have something where the concept of the mean is flawed, like a Cauchy distribution.

• Dzhaughn says:

If you get Bill Gates, N = 10000 won’t save you, as far as mean wealth goes.

• Yes, but N of maybe 2 million would. So again, the non-normality of lots of real world data means that in order for the assumptions of the tests to work, the sample size needs to be bigger than some unknown number that depends on the unknown shape of the distribution. Asymptotics just guarantees you that there exists a number bigger than which the CLT will hold. If that number is N=330,000,000 then the test is never good for most real world datasets.

For example, if I sampled from a unit Cauchy truncated to [-10000,10000] we’d need *really* big sample sizes. I modified my code (posted elsewhere) and found that for a true null of two truncated Cauchys a sample size of 4000 wasn’t enough, but a sample size of 40000 was.

Note, the *real* assumption of the t test is that the mean of the sample is normally distributed about the actual mean of the data distribution with a standard deviation that can be estimated in a reliable way from the sample standard deviation. That’s *automatically* true if the raw data is normally distributed. It’s asymptotically true for large sample sizes from sufficiently nice distributions (ie. has a mean and variance) where the N required varies depending on the shape of the underlying distribution.
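The modified code for the truncated distribution is not shown in the thread, so the following is only a sketch of one way such a check could be set up, not the commenter's actual code. The helper names `rtcauchy` and `null_rejection_rate` are made up here, and the sample sizes and number of replications are kept small so the example runs quickly.

```r
set.seed(4)  # arbitrary seed, only for reproducibility

# Draw n values from a standard Cauchy truncated to [lo, hi] by inverting the
# CDF on the corresponding probability interval.
rtcauchy <- function(n, lo = -10000, hi = 10000) {
  u <- runif(n, pcauchy(lo), pcauchy(hi))
  qcauchy(u)
}

# Rejection rate at the 0.05 level when both samples come from the same
# truncated distribution, so the null is exactly true.
null_rejection_rate <- function(n, reps = 2000) {
  p <- replicate(reps, t.test(rtcauchy(n), rtcauchy(n))$p.value)
  mean(p < 0.05)
}

rates <- sapply(c(40, 400, 4000), null_rejection_rate)
print(rates)
```

Because the truncated distribution has a finite mean and variance, the rejection rate must eventually approach 0.05 as n grows; the sketch only lets you see how far off it is at sample sizes that would count as large in most applied work.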

• Chris Wilson says:

Yup, and the higher moments take longer to converge with increasing N (where they exist at all)…

12. yyw says:

We are more likely to see statistical significance in a small sample study because the true effect is almost never exactly zero. The OP should “distrust significant results with small samples” because these studies produce very noisy (and typically much exaggerated) estimates of the intervention effects. Especially in the OP’s field of education, any small study claiming statistical significance should be treated with extreme skepticism.

• yyw says:

Clarification: more likely (in reality) than the 5% chance under the assumption of no effect. Larger sample studies have even higher odds of statistical significance, but at least the estimated effect size will have a higher chance to be appropriately small (if the actual effect is small and the study is done right).

13. Bill Harris says:

Educate me: aren’t we missing the major point? Aren’t p-values all about P(D|H), while the conclusion we want to draw is all about P(H|D), and P(D|H) P(H|D)?

• Bill Harris says:

Oops: WordPress ate my equation.

That last should have been P(D|H) != P(H|D). I tried “.LT. .GT.” as “not equal,” and it obviously didn’t play well.

14. Austin Fournier says:

I don’t know if said history teacher is reading these comments, but I have a somewhat different answer. As the email alludes to, smaller sample sizes are less likely to give positive results even when your hypothesis is true – and indirectly, this means that a smaller percentage of positive results will correspond to true hypotheses.

It is helpful to remember that the false hypotheses won’t be achieving significance any less based on sample size – the one and only factor that affects the frequency of false positives is what p-value you choose as the cutoff for statistical significance (so long as test assumptions are met).

An easy way to see this is to draw up a two-by-two grid where the columns represent whether the hypothesis is true or not, and the rows represent whether you achieve statistical significance or not. Put dots in every quadrant of the grid (representing tested hypotheses), and count what percentage of the dots in the “significant” row are also in the “true” column (that is, the probability that the hypothesis is true given that it is significant). Now, to simulate the effects of a smaller sample size, move some of the dots from the “true and statistically significant” square to the “true and not statistically significant” square. If you recalculate the probability that a random statistically significant hypothesis is true, it will now be smaller than before.
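The grid exercise can also be written as a short calculation. This is a sketch with made-up inputs: the share of tested hypotheses that are true (`prior`) and the two power values are assumptions chosen purely for illustration, and `ppv` is a hypothetical helper name.

```r
# Share of statistically significant results that correspond to true
# hypotheses, from Bayes' theorem.
#   prior: fraction of tested hypotheses that are true (assumed value)
#   power: probability that a true hypothesis reaches significance
#   alpha: probability that a false hypothesis reaches significance
ppv <- function(prior, power, alpha = 0.05) {
  (prior * power) / (prior * power + (1 - prior) * alpha)
}

high <- ppv(prior = 0.5, power = 0.8)  # well-powered studies
low  <- ppv(prior = 0.5, power = 0.1)  # small, noisy studies
print(c(high_power = high, low_power = low))
```

With these illustrative numbers, about 94% of significant results correspond to true hypotheses when power is 80%, but only about 67% when power is 10%, even though the false positive rate is 5% in both cases.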

As may be noted from the other comments, there is some skepticism over whether there’s ever truly no effect at all (in my model, whether any dots belong in the “false” column). I would agree that there’s probably always going to be a real association between variables in correlational studies, but I’m not so sure there’s always going to be a real causal effect for you to find in experimental studies. Maybe – but I’m more skeptical of that conclusion.

The point about probable overestimation in small sample studies – and just bad estimation generally – is also quite good.