I’ll say it again

Milan Valasek writes:

Psychology students (and probably students in other disciplines) are often taught that in order to perform ‘parametric’ tests, e.g. an independent t-test, the data for each group need to be normally distributed. However, in the literature (and various university lecture notes and slides accessible online), I have come across at least four different interpretations of what it is that is supposed to be normally distributed when doing a t-test:

1. population
2. sampled data for each group
3. distribution of estimates of means for each group
4. distribution of estimates of the difference between groups

I can see how 2 would follow from 1 and 4 from 3, but even then there are two different sets of interpretations of the normality assumption.
Could you please put this issue to rest for me?

My quick response is that normality is not so important unless you are focusing on prediction.

31 thoughts on “I’ll say it again”

    • @Brendon: you might view it that way, but it’s not an argument that’s going to help many psychology students.

      For getting approximately valid frequentist inference when testing equality of means of two groups using a t-test, the quantity we *need* to be roughly Normal is 4 (the small simulation sketched below illustrates this).

      While it’s probably overkill here, approximate Bayesian validity of the corresponding point estimates and intervals also follows, so long as the priors are not doing all the work.
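
      A minimal R sketch of that point (exponential data and a group size of 30 are arbitrary choices): the raw observations are far from Normal, but the simulated sampling distribution of the difference in group means is already close to Normal.

      nrep <- 10000
      n <- 30
      ## difference in means of two skewed (exponential) samples, many times over:
      diffs <- replicate(nrep, mean(rexp(n)) - mean(rexp(n)))

      ## the raw data are skewed, but the difference in means looks Normal:
      qqnorm(diffs); qqline(diffs)
      hist(rexp(n), breaks = 30)   # for contrast: one sample of the raw data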

  1. Here’s a paper written by some of my colleagues that talks sensibly about this issue. Often there are much bigger issues than how Normal the data might be.

  2. …ummmm… in an OLS setting we are talking about normality of the error term, right? That is not any of 1-4. Or am I totally missing the point here?

    A frequentist would be concerned about the distribution of E | X – that is, in a linear least squares regression setting, the distribution of error terms given covariates and a linear model.

    Another way of saying it: if Y = X’B + E, then we need E ~ N(0, sigma). 1-4 seem to be focused on the distribution of X’B + E. What am I missing?

    • you are missing part of the point. the normality assumption about the distribution of the errors means that the conditional distribution of the outcome is normal.

      • If you just want inference about linear contrasts (i.e. the difference in mean outcome comparing two groups, as in the original t-test) then you don’t *need* Normality of outcomes – conditional or otherwise – to motivate OLS. You *can* motivate it via Normality, but you *need* not.

        • sure, which is similar to andrew’s point. my point was just that a normality assumption about the errors is a normality assumption about the conditional distribution of the outcomes.

  3. Hi Andrew,

    I want to say this again:

    People in areas like psychology, who are focused on null hypothesis testing, often misinterpret your statement on this topic to mean that one should *never* transform dependent measures like reading times. You wrote in your previous post that for positive responses like RTs you would take the log; in Gelman and Hill I think you wrote that the reason for this is that you want to respect the additivity and linearity assumptions, not because you think that the normality assumption on residuals should be checked.

    Here’s a reviewer of a paper of mine trying to force me to analyze reading time data on the raw scale, citing your book:

    “While it’s true that normal residuals (i.e. normally distributed noise) is an assumption of linear models, it is the least important assumption (Gelman and Hill, p. 50). Gelman and Hill note that violations of this assumption have almost no effect on estimation of coefficients (the content of the authors’ analyses).”

    One problem with statements like these is that we don’t just estimate coefficients, we carry out null hypothesis tests. As I argued in my comment on your last post on this topic, I believe we should not ignore the normality assumption of residuals in that case.

    One of the things people do a lot in areas like psychology is to make positive claims on the basis of null results. If serious violations of the normality assumption in the residuals cause a loss of power (as I showed in the simulations I posted to your last entry on this topic), this has real consequences for the advice you give in the book.

    It gets even crazier. In eye tracking data for reading, there is a dependent measure called re-reading time: the amount of time you spend revisiting each word after you have passed it in a left-to-right sweep. Often, 80% of the values are 0 ms (i.e., no re-reading occurred). People will fit linear mixed models to a dependent variable with 0 values in 80% of the data, and non-zero values in the remaining rows. Residuals are not checked, but it is pretty impressive to look at the residuals in this situation (a small sketch at the end of this comment shows what they look like).

    I’m only suggesting that it would help people a lot if you were to qualify your statement for the reading time or reaction time type of situation, instead of making a single blanket statement. Your recommendation is pithy, but it seems to lead people down the wrong path. I’m not a statistician (or at least, not yet anyway), so I’m happy to be corrected on this.
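
    A minimal sketch of what those residuals look like, with made-up numbers (the 80% zero rate is from above; the lognormal parameters for the non-zero times are arbitrary):

    ## toy re-reading-time data: ~80% zeros, the rest positive and skewed
    set.seed(1)
    n <- 1000
    cond <- rep(c(0, 1), each = n/2)
    reread <- ifelse(runif(n) < 0.8, 0, rlnorm(n, meanlog = 5.5, sdlog = 0.6))
    fm0 <- lm(reread ~ cond)

    ## the residuals are nowhere near Normal:
    qqnorm(residuals(fm0)); qqline(residuals(fm0))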

    • Just to elaborate: I love transformations. The reason I transform is typically not to get to a normal distribution, but rather to get to a model that makes more sense. It’s not about “loss of power,” it’s about validity, additivity, linearity, etc.
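
      A tiny made-up illustration of the additivity point: if a manipulation slows responses multiplicatively, say by 20%, the effect is a constant shift on the log scale but grows with the baseline on the raw millisecond scale.

      ## made-up RTs: a 20% multiplicative slowdown in condition 1
      set.seed(1)
      n <- 2000
      baseline <- rlnorm(n, meanlog = 6, sdlog = 0.5)   # baseline RTs in ms
      cond <- rep(c(0, 1), each = n/2)
      rt <- baseline * ifelse(cond == 1, 1.2, 1)

      coef(lm(rt ~ cond))       # raw scale: the effect estimate depends on the baseline RTs
      coef(lm(log(rt) ~ cond))  # log scale: the effect is roughly log(1.2), a constant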

      • > it’s about validity, additivity, linearity
        Cochran, Tukey, Cox, etc. stressed that kind of thinking too, but there is so much to get across that folks get caught up in less important stuff (and often even get it wrong).

        For instance, in the paper Ken gives, some quick glances and a search for additivity find very little about that, but rather a lot of other material folks might benefit from knowing more about.

        Maybe this should have been a direct reply to Shravan: a really bad reviewer can be someone who knows a bit about statistics but does not realise how little.

        • +1 on last sentence.

          Relatedly, I once had a reviewer complain about transformations. The reviewer felt I was massaging the data, kind of cheating. Go figure.

      • Hi Andrew,

        I do apologize for harping on about this, but I’m just trying to understand your response.

        I understand the point about linearity and additivity. I also understand the part about the model making sense.

        Regarding your comment that it’s not about loss of power, I’m trying to understand what that means *in a null hypothesis testing* setting. In your previous post about this topic, I posted a simulation (on the advice of one of the other contributors) which suggests that there is a loss of power when the normality assumption of residuals is not satisfied.

        What’s wrong with that statement?

        In psychology and related areas, we often run low power studies, sometimes out of necessity. If on top of that (as I show below) we lose even more power due to non-normal residuals, at least from the frequentist viewpoint (the only viewpoint in such areas!) we should not be drawing conclusions from null results (well, we shouldn’t be doing that anyway, but let’s put that aside for the moment).

        Here’s the code again:

        nsim<-100
        n<-100
        pred<-rep(c(0,1),each=n/2)
        store<-matrix(0,nsim,5)

        ## should the distribution of errors be non-normal?
        non.normal<-TRUE

        ## true effect:
        beta.1<-0.5

        for(i in 1:nsim){
        ## assume non-normality of residuals?
        ## yes:
        if(non.normal==TRUE){
        errors<-rchisq(n,df=1)
        errors<-errors-mean(errors)} else {
        ## no:
        errors<-rnorm(n)
        }
        ## generate data:
        y<-100 + beta.1*pred + errors
        fm<-lm(y~pred)
        ## store coef., SE, t-value, p-value:
        store[i,1:4]<-summary(fm)$coefficients[2,]
        ## number of influential points (Cook's distance > 4/n):
        store[i,5]<-sum(cooks.distance(fm)>4/n)
        }

        ## “observed” power for raw scores:
        table(store[,4]<0.05)[2]

        ## t-values' distribution:
        summary(store[,3])

        ## CIs:
        upper<-store[,1]+2*store[,2]
        lower<-store[,1]-2*store[,2]
        ## CIs' coverage is unaffected by skewness:
        table(lower<beta.1 & beta.1<upper)

        ## distribution of num. of influential values:
        summary(store[,5])

        ## power about 40% with non-normally distributed residuals.
        ## power about 70% with normally distributed residuals.

        ## typical shape of residuals in reading studies:
        library(car)
        qqPlot(residuals(fm))

  4. Although personally I find frequentist null hypothesis testing almost always the wrong thing to do, a lot can be said for simple simulation tests here if you want to understand what’s needed.

  5. The t-test of a difference in means in independent (not paired) samples assumes that the sample mean of the data has a normal sampling distribution under repeated experiments, and the sample variance has a chi-squared sampling distribution under repeated experiments. For a paired test, we’re assuming that the sample mean of the paired differences is normal under repeated experiments and that the sample variance of the paired differences is chi-squared under repeated experiments. This is more or less pure probability theory. I call something probability theory if it tells us about how random number generators work rather than how the world works. (A small simulation at the end of this comment checks both of these claims for Normal data.)

    The robustness of the test to violations in these assumptions varies depending on the degree of violation. On the other hand, as a Bayesian, I find the whole thing distasteful for reasons often repeated on this blog, such as:

    1) the whole thing depends on a potentially infinite sequence of repeatable experiments which will consistently generate data with a distribution that is constant between experiments. Tomorrow, when the lighting conditions are different, or your subjects have recently recovered from a hangover due to the big football game, or when your supplier gives you “identical” bolts from a different manufacturer without telling you… your infinite sequence of repeatable experiments goes out the window.

    2) It ignores the actual magnitude of the effect in favor of a dimensionless ratio of effect size to size of statistically discernible effect (the t statistic). With large sample size it is possible to find “statistically significant” effects that are meaningless in practice.

    3) The meaning of NHST is often very poorly understood in practice. People take a “significant” result as evidence in favor of their favorite hypothesis, instead of evidence against the idea that nothing is going on.

    etc etc.

    I think those problems are usually way more of an issue than the normality assumption.
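
    Regarding the two sampling-distribution claims at the top of this comment, here is a minimal R check for Normal data (the sample size and parameters are arbitrary choices):

    set.seed(1)
    n <- 15; mu <- 10; sigma <- 2
    sims <- replicate(10000, {
      x <- rnorm(n, mu, sigma)
      c(m = mean(x), v = (n - 1) * var(x) / sigma^2)
    })

    ## the sample mean is Normal(mu, sigma/sqrt(n)):
    qqnorm(sims["m", ]); qqline(sims["m", ])

    ## (n-1) s^2 / sigma^2 is chi-squared with n-1 degrees of freedom:
    qqplot(qchisq(ppoints(10000), df = n - 1), sims["v", ]); abline(0, 1)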

    • Re 1: The repeatability concept applies to a theoretical infinity of experiments now, under identical conditions. What happens tomorrow is irrelevant.
      Re 2: If you don’t want to detect “statistically significant effects” that are meaningless in practice, don’t take such big samples. Statistical methodology has tools that enable you to design a sensible experiment.
      Re 3: I have often been amused by, and never had much sympathy with, this idea that hypothesis testing and the p-value are poorly understood. When I first graduated I was offered the opportunity to study for a PhD in Quantum Electrodynamics. Had I taken this opportunity, then later in life I would not have been surrounded by amateurs trying (very badly) to do the same job as me. But that is what it has been like in Statistics. If you understand something poorly you should either give up and pay someone else to do it for you, or study it properly until you understand it.

      • RE 1: in the future things can change without you realizing it. We can’t control all variables in experiments, only those we know are “relevant”. This is a huge part of the non-replication of many effects in psychology labs etc. Accidental things that are idiosyncratic to the setup can control the effect. Almost no-one is interested in whether you could continue to replicate your ESP effect over and over again in one particular room in your lab at a state college in New Mexico; and nowhere else in the world. What happens tomorrow, at a different lab, in a different country with different “irrelevant” variables is *EXACTLY* what science is all about.

        RE 2: I disagree: much of the problem that Andrew mentions over and over in this blog is about the statistical significance filter producing a bunch of overestimates of null or small effects (Type M errors). You should prefer a statistical methodology in which you evaluate the size of the estimated effect against a “practically insignificant size” and try to maximize the precision using high-powered studies while minimizing cost. Instead, what is done is to compare the estimate only to its own precision (i.e., test for significance) and use low-powered studies until tenure is obtained.

        RE 3: The fact is that lots of people use statistics who are not statisticians, just as lots of people use cars who are not mechanical engineers or professional stunt drivers. I don’t have much sympathy for the argument that only the dedicated high priests of stats should be the ones doing analysis. We should encourage non-experts to use the tools that are the most robust and least likely to cause confusion, and to use them in a sensible way. We should not have textbooks that still basically talk about the whole role of statistics as disproving null hypotheses.

        Also, I don’t know where you’ve been but it seems like Quantum physics is where all the armchair guys are trying hard to put in their two cents. See recent posts on this blog for example ;-)

        • hi daniel,

          regarding point 3, i do not think the analogy you draw with cars and mech engineers or professional stunt drivers is correct. yes, people drive. but not everyone is allowed to build a car, nor is everyone allowed to do the crazy stunts which stunt drivers perform. on the other hand, anyone is allowed to do incredibly stupid things with statistics, with little to no repercussions.

          regarding point 2, i do not know that you and peter are necessarily disagreeing, but that you are misunderstanding each other. as peter states, “Statistical methodology has tools that enable you to design a sensible experiment.” that is related to andrew’s point about type m errors. well designed experiments do try to avoid such problems.

    • [cite]
      The t-test of difference in means in independent (not paired) samples assumes that the sample mean of the data has a normal sampling distribution under repeated experiments, and the sample variance has a chi-squared sampling distribution under repeated experiments.
      [/cite]
      I do not remember so. Can you give any link? Thank you. I’ve always believed that it is the observations that have to be normally distributed. By the way, a normal mean implies normal observations, so in my opinion you are stating that the observations have to be normally distributed.

  6. It is all explained in The Analysis of Variance (Scheffe, Wiley, 1959). In Chapter 1 Scheffe discusses estimation and introduces the Gauss-Markoff Theorem, which states that “every estimable function has a unique unbiased linear estimate which has minimum variance in the class of all unbiased linear estimates.” In other words, a linear least squares estimate is the best you can do. The estimator (note the distinction between estimate and estimator) is unbiased and has a smaller variance than any other linear estimator. Chapter 1 does not discuss the Normal distribution at all. Normality is introduced in Chapter 2, which focusses on hypothesis tests and confidence intervals. In paragraph 1 Scheffe says “we now assume further that the observations have a normal distribution”. I take this to mean that the population from which the observations are sampled follows a normal distribution. He then goes on to show how, under this assumption, one can construct/calculate confidence ellipsoids, hypothesis tests, tests derived from the likelihood ratio, and power charts and tables. So, in summary, it is the population from which the raw data values are taken that needs to follow a normal distribution, but only for tests and confidence intervals, not for estimation.

    In Chapter 10 (The Effects of Departures from Underlying Assumptions) he shows what happens with non-normal data and when you do (and when you don’t) need to worry about it.

    I read this book from cover to cover several times whilst studying for a PhD in the mid 70s so it was nice to have an excuse to dip into it again.

    • A = “every estimable function has a unique unbiased linear estimate which has minimum variance in the class of all unbiased linear estimates”

      B = “a linear least squares estimate is the best you can do”

      C = “a linear least squares estimator is the best you can do in the class of unbiased estimators”

      B does not follow from A. C does, but that doesn’t help me if I have relevant prior information and seek to use it to trade off variance for “bias”. (I use scare quotes because in this case, it’s bias towards estimates already supported by the prior information.)
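
      A minimal sketch of that trade-off, with made-up numbers (the 0.5 shrinkage factor just stands in for what an informative prior centered near zero would do): the shrunken estimate is biased but has a lower mean squared error when the true effect really is small.

      set.seed(1)
      true.effect <- 0.2
      n <- 20
      est <- replicate(10000, mean(rnorm(n, true.effect, 1)))  # unbiased estimates
      shrunk <- 0.5 * est                                      # crude shrinkage toward 0

      mean((est - true.effect)^2)      # MSE of the unbiased estimator (~0.05)
      mean((shrunk - true.effect)^2)   # smaller MSE, despite the bias (~0.02)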

  7. The data most emphatically DOES NOT have to come from a normal distribution for hypothesis testing of a difference in means. The only thing the t test guarantees is that if there is no difference in means, and the means have an approximately normal sampling distribution, that p < 0.05 only 5% of the time… For example, here is some simple R code that takes samples of size 10, 20, and 50 from two identical exponential distributions and t-tests them, 2000 times for each sample size, and then computes the false positive rate (how often p < 0.05 occurs):

    size <- 10; ttests <- replicate(2000, t.test(rexp(size), rexp(size))$p.value); sum(ttests < 0.05)/2000
    size <- 20; ttests <- replicate(2000, t.test(rexp(size), rexp(size))$p.value); sum(ttests < 0.05)/2000
    size <- 50; ttests <- replicate(2000, t.test(rexp(size), rexp(size))$p.value); sum(ttests < 0.05)/2000

    ## sampling distribution of the mean of 10 exponential draws:
    plot(density(replicate(1000, mean(-log(runif(10))))))

    It’s not *super* normal, but it’s pretty decent. The Central Limit Theorem is a powerful result.

      • As a lay user and reader of statistical research, can I deduce from your and Andrew’s positions that it’s wrong to say that the negative impact of outliers and heavy-tailed distributions on accuracy is not as important as stated, for example, in Wilcox, Modern Statistics for the Social and Behavioral Sciences (2012)? More generally, if normal distribution assumptions are not relevant, do we need robust statistical methods only in the case of heteroscedasticity?

        • lubetano: Some deviations from normality are a problem, others are not. Outliers and heavy tails are indeed a problem. The general shape of the normal distribution is not important, but it is important that some sources of specific trouble can be excluded. This, I think, is simultaneously compatible with both what Andrew wrote and what you cite here.

    • [cite]
      The only thing the t test guarantees is that if there is no difference in means, and the means have an approximately normal sampling distribution, that p < 0.05 only 5% of the time…
      [/cite]
      I do not agree at all. You are speaking about the distribution under a null hypothesis. t has nothing to do with it. Replace t with z and you obtain the same valid statement.

  8. I like this hardline comment: “If you understand something poorly you should either give up and pay someone else to do it for you or study it properly until you understand it.”

    I literally implemented the above comment: I don’t really understand statistical theory so I spend my evenings studying statistics in a part-time MSc program (Sheffield), it’s a four-year commitment. I am making a good-faith attempt and am half-way through it.

    But the above comment is not a realistic position in general. A lot of people cannot afford to not work with data just because they don’t understand statistical theory. In a world where people are using statistical tools with imperfect understanding, many people rely on recommendations from statisticians.

    One problem with the recommendations of statisticians is that they will often tell you what to do, but it doesn’t fit with the details spelled out in countless statistics textbooks.

    An interesting test case for me is the claim that the normality of residuals doesn’t matter. I’ve already spelled out the issues in comments to the previous post on this topic.

    I guess my point here is that pithy recommendations tend to lead people down the garden path because they deliver incomplete information, and people misinterpret them. (BTW, I am fully on board with the arguments against NHST, but I am not in control of the methods used in psychology and related areas.)

  9. >pithy recommendations tend to lead people down the garden path
    Or as someone put it “In the land of the blind, the one eyed man is King”

    There are some real challenges in learning and more so helping others to learn.

    Your R code won’t run (blog likely chewed it up) but for one thing, you are using a test that is efficient (based on _linear_ functions of observations) for the Normal model but very inefficient for other models, so a loss of power _should not_ be surprising.

    For real data – there are real arguments about what to do – and no widely accepted solutions (despite Scheffe 1959 manifesto).

    Robustness as an insurance payment against never being able to get the assumptions correct – to me that seems an unwise purchase.

    Stigler has provided a more insightful and balanced? assessment https://files.nyu.edu/ts43/public/research/Stigler.pdf

  10. From Greene, W. “Econometrics”, 5th edition, pg 17:

    “[The assumption of] Normality is often viewed as an unnecessary and possibly inappropriate addition to the regression model”

    In general the assumption is made to justify using an exact Normal distribution for inference in small samples.

    In large samples the assumption is not needed due to some central limit theorem, where we use an asymptotic distribution.

    See Sections 4.7 and 6.4.
