Statisticians take tours in other people’s data.

All methods of statistical inference rest on statistical models. Experiments typically have problems with compliance, measurement error, generalizability to the real world, and representativeness of the sample. Surveys typically have problems of undercoverage, nonresponse, and measurement error.

Real surveys are done to learn about the general population. But real surveys are not random samples. For another example, consider educational tests: what are they exactly measuring? Nobody knows. Medical research: even if it’s a randomized experiment, the participants in the study won’t be a random sample from the population for whom you’d recommend treatment. You don’t need random sampling to generalize the results of a medical experiment to the general population but you need some substantive theory to make the assumption that effects in your nonrepresentative sample of people will be similar to effects in the population of interest.

Very rarely, the assumptions of a statistical model will be known to be correct. The only examples of this that I’ve ever seen up close have been samples of documents. For example, we had a spreadsheet with a list of a few thousand legal files and we took a random sample of 600. The sample files were examined and then we used these to get inference for the full population. This doesn’t happen in surveys of people because we have nonavailability, nonresponse, and shifting sampling frames. But in rare cases we are sampling documents and the statistical theory is exactly correct.
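To make the document-sampling case concrete, here is a minimal Python sketch. The numbers are assumptions for illustration only: a population of 3000 files, a made-up true rate of 30%, and the function name `srs_proportion_estimate` are all hypothetical and not taken from the actual project. Because the sample really is a simple random sample from a known list, the textbook standard error with the finite population correction applies as stated.

```python
import math
import random

def srs_proportion_estimate(population, n, rng):
    """Estimate a population proportion from a simple random sample
    drawn without replacement. Returns (estimate, standard_error)."""
    N = len(population)
    sample = rng.sample(population, n)   # simple random sample, no replacement
    p_hat = sum(sample) / n
    fpc = (N - n) / N                    # finite population correction
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se

# Hypothetical population: 3000 files, 1 = file has the property of interest.
rng = random.Random(0)
population = [1] * 900 + [0] * 2100      # assumed true proportion of 0.30
p_hat, se = srs_proportion_estimate(population, 600, rng)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"estimate {p_hat:.3f}, approx. 95% interval ({lower:.3f}, {upper:.3f})")
```

The correction factor matters here because 600 out of a few thousand is a sizable fraction of the list; it shrinks the standard error relative to the infinite-population formula.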

Textbook statistical theory is like the physics in an introductory mechanics text that assumes zero friction, etc. Friction can be modeled, but that turns out to be a bit “phenomenological,” that is, approximate.

Models are great and there’s no reason to be embarrassed about them. Assumptions are the levers that allow us to move the world.

Statisticians take tours in other people’s data. Assumptions about the underlying world + assumptions about the data collection process + the data themselves -> inferences about the world.

I think an important issue here is that some assumptions and some violations of assumptions are much more important than others. My standard example for this is that gross outliers do much more harm to analyses that are derived from the normal assumption than more or less strong discreteness of the data. Assumptions and assumption checking should point us to where a violation of an assumption may have the effect that the researchers get a result that will make them grossly misinterpret the situation.

We should not get too hung up with arbitrary violations of assumptions because there is always something, but quite a bit of this is actually harmless. Some isn’t, though.
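The outlier-versus-discreteness comparison can be illustrated with a small sketch. Everything here is an assumption chosen for illustration: simulated data with true mean 2 and standard deviation 1, rounding to integers as the form of discreteness, and a single value shifted by 100 as the gross outlier.

```python
import math
import random
import statistics

def t_statistic(xs):
    """One-sample t statistic for H0: mean = 0."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

rng = random.Random(1)
# Hypothetical data: 30 observations with true mean 2 and sd 1.
clean = [rng.gauss(2.0, 1.0) for _ in range(30)]
rounded = [float(round(x)) for x in clean]        # strong discreteness
with_outlier = clean[:-1] + [clean[-1] + 100.0]   # one gross outlier

t_clean = t_statistic(clean)
t_rounded = t_statistic(rounded)
t_outlier = t_statistic(with_outlier)
print(f"clean {t_clean:.2f}, rounded {t_rounded:.2f}, outlier {t_outlier:.2f}")
```

Rounding changes the t statistic only slightly, while the single outlier inflates the standard deviation so much that a clear effect looks like noise. That is the sense in which some violations are close to harmless and others are not.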

People tend to have a wrong understanding of the meaning of a model assumption. If we do something that is optimal under normality, it doesn’t mean we are not allowed to use it if normality isn’t fulfilled. What exactly happens and how much it is a problem if model assumptions are violated (which they pretty much always are) depends on what we’re doing and on how exactly the assumptions are violated (which is sometimes hard to find out).

That’s a very good point. When people test normality, what they really want to know is “will my methods, which have been derived under normality, work well?” as opposed to “is the distribution of this estimator exactly normal?” Hypothesis tests with H_0: “the distribution is normal” test the latter, not the former.

Sure, but the way comments like these are interpreted by non-statisticians is that no model assumptions matter at all. People don’t do nuanced analyses. They fit a single line of code and that’s it. Even experienced people don’t even bother to make a boxplot to look at the distribution to check if there is anything unusual. For them, this kind of comment is very dangerous and misleading.

I’m not sure if that’s really true. I think it would help a lot to explain and teach people when and to some degree why assumptions are relevant. Some arcane assumptions nobody really seems to care about when you read any publications are dangerous, too.

“Sure, but the way comments like these are interpreted by non-statisticians is that no model assumptions matter at all.”

Well, if people can’t read, they can misinterpret whatever anyone writes…

Agreed. But it’s not that they can’t read, and they’re not dumb. What I meant to say is that statements like

“What exactly happens and how much it is a problem if model assumptions are violated (which they pretty much always are) depends on what we’re doing and on how exactly the assumptions are violated (which is sometimes hard to find out)”

need to be fleshed out, not with a laundry list of do-this-don’t-do-that (“recommendations”), but with concrete examples that lead to real understanding of the issues.

Even Christian’s first point about outliers, which seems pretty obvious, needs to be stressed in books (I haven’t seen much of that). See, for example, what happens in figure 1, a published dataset I re-analyzed:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0077006

Would you agree that doing a paired t-test on the top left data might be a problem? Yet that’s exactly what was done, with p0.05 and the paper I got from that data would be unpublishable, because that was the basis for the whole paper. I have lots of data sets like these, so it’s a pretty common situation. When I talk to people about such problems in my field, they tell me that they’re just following Gelman and Hill 2007.

I don’t consider myself an expert by any means, and I’m happy to be corrected on all this. It’s great to be able to talk to real and experienced statisticians on this blog.

Shravan, my comment was a blog comment… you’re right that for use in practice this needs to be fleshed out. The best advice often is to work closely with a statistician.

Still, if somebody makes “no model assumptions matter at all” out of my initial comment, they don’t bother to read stuff properly and shouldn’t complain if they get things wrong.

Regarding the data you linked, you’re right, this plot shouldn’t make people feel comfortable about a t-test, though Andrew or others would need to comment on whether using Gelman and Hill as a reference for this also rather means that they didn’t read properly, which I suspect.

Christian: > work closely with a statistician

Really do think that depends on the statistician.

Also I don’t think most people can read stuff properly unless they almost already understand what is being written about.

Shravan: in your paper you write that

“for the relatively large datasets used in psycholinguistics, hypothesis testing crucially requires that the normality assumption regarding residuals is satisfied.”

This seems rather confused. For comparison of means via a t-test with large datasets without hugely heavy tails, Normality is a minor issue at best. In small samples one might be more concerned about non-Normality… but rather than go all Box-Cox, a simple permutation test lets you obtain accurate p-values while retaining a test statistic that’s interpretable and minimally ad hoc.

You also write about “Estimating the variance of parameters… depend[s] on distributional assumptions being satisfied”, but unless we’re being Bayesian we don’t estimate variances of parameters.
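For readers who have not seen one, here is a minimal sketch of the kind of permutation test mentioned above, in its paired (sign-flip) form. The function name `paired_permutation_test`, the Monte Carlo setup, and the data are all hypothetical; this is one common way to set it up, not the analysis from the paper under discussion.

```python
import random

def paired_permutation_test(x, y, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test for paired data.

    Under H0 the sign of each within-pair difference is exchangeable,
    so signs are flipped at random and the mean difference recomputed.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    count = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            count += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Made-up paired measurements for 12 subjects (illustration only).
before = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.1, 5.0, 5.6, 5.2, 4.9]
after = [4.6, 4.5, 5.2, 5.1, 5.0, 4.4, 4.9, 5.3, 4.8, 5.0, 4.7, 4.6]
p_value = paired_permutation_test(before, after)
print(f"permutation p-value: {p_value:.4f}")
```

The test statistic is still the mean difference, so it stays interpretable; only the reference distribution changes, from the t distribution to the one generated by the sign flips.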

fred: Shravan’s reference to Gelman and Hill suggests to me that he’s thinking of multilevel models, in which (depending on the statistical jargon one is using) some “random effect” parameters are modeled as random variables drawn from a common distribution (with its own “hyperparameters”). In this context, estimation is either fully Bayesian or “empirical Bayes” — with frequentist assessment of estimators — and in the latter case, “estimating the variance of parameters” still makes sense.

(But perhaps Shravan has something else in mind…)

I remember that part of the book, too. However, I am a little unsure if Andrew and Jennifer mentioned anything regarding the assumptions of the so-called ‘random effects’ (or hyperparameters). There are some books that mention that the random effects have to be normally distributed, but has anyone come across anything that tells us whether those assumptions really matter?

Hi fred,

the type of objection you raise here is one point I’ve been trying to understand for the last three years, so help me out here. I’ve been doing an MSc in statistics from Sheffield, and everything I’ve learnt so far indicates that what you call my confused statement :) is correct. I understand that normality might be a minor issue compared to other issues. But the hypothesis test relies on it. I have raised this point before in more detail (Aug 5 2013):

https://andrewgelman.com/2013/08/04/19470/

If my reasoning in that post (sorry for the LaTeX) is wrong, I really would appreciate being told so (and why).

Responding to Corey, I had linear mixed models in mind.

> being told so (and why)

That is a real challenge and why I don’t think it’s just folks reading carefully.

(And MSc and PhDs in statistics just provide the math and skills to start learning about how to apply statistics, or at least this was my impression of watching two batches of bright students doing the MSc at Oxford Statistics department.)

This quote points to the problem.

“Many of the problems with students learning statistics stem from too many concepts having to be operationalized almost simultaneously. Mastering and interlinking many concepts is too much for most students to cope with. We cannot throw out large numbers of messages in a short timeframe, no matter how critical they are to good statistical practice, and hope to achieve anything but confusion.”

http://www.rss.org.uk/pdf/Wild_Oct._2010.pdf

There are many reasons for taking transformations other than to get “residuals that are approximately Normal,” e.g. getting the additivity and linearity that Andrew referred to, which serve to get commonness* of underlying parameters. Additionally, many features of techniques need to be considered, and arguably getting the coverage of intervals correct is usually focussed on first and taken as being required, not just desired.

To Fisher, his t-test math was just a way to get an approximation to the permutation test in the early 1900s – that’s all – and it’s very good (not perfect) for large sample sizes.

I believe you have read history (early papers), conceptual papers by leading statisticians and blogs like this.

* for the brave and adventurous I have written on how to more fully focus on the parameter space here https://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf

I believe you have TO read history (early papers), conceptual papers by leading statisticians and blogs like this.

“To Fisher, his t-test math was just a way to get an approximation to the permutation test in the early 1900s – that’s all”

Exactamundo.

Hi Shravan

It’s a big issue to get into in a blog comment, but (with large samples) there are ways to justify most well-known regression estimates, and tests of whether the corresponding parameters are zero, without parametric modeling assumptions. A quick and rather general introduction is here; the same authors also wrote a recent book that’s good.

These justifications are no cure-all, but understanding how and why they work (and how they relate to more model-based justifications) should make it easier to understand which assumptions are actually doing the work – i.e. which assumptions we can basically ignore, depending on the circumstances, and which are critical.

Shravan: I remember this… yes, you can have a serious loss of power if distributions are skew and data set sizes are moderate. The thing about transformation is that if you decide how to transform depending on the data, this invalidates the theory, strictly speaking. The data may look more normal but you apply statistics based on theory that doesn’t take into account data-dependent transformation.

Again, this may not be a big issue, but how big an issue it actually is depends on how precisely you make your decisions. If you know in advance that you may try out logs if data look quite skew but nothing else, it won’t do much harm. Doing something very flexible such as Box-Cox with parameter estimation makes the problem somewhat bigger but may still not hurt that much, at least if the data set isn’t very small. Trying out transformations until the t-test comes out significant on the other hand is seriously evil.

Then, loss of power is often not such a big problem because it only really hurts in borderline situations, and also, if you interpret a non-significant outcome carefully (particularly not taking it to say that the null hypothesis is true), low power won’t lead you to a wrong conclusion but rather only means that you may miss a possible conclusion.

Apart from permutation testing, the Wilcoxon rank test is also a good option in such a situation (actually that’s probably what I’d do).

K? O’Rourke, what is this reference? The link is broken.

http://www.rss.org.uk/pdf/Wild_Oct._2010.pdf

Here’s a Gelmanian link to the Wild read paper on teaching statistics

I’m a little puzzled. The full figure shows boxplots of (1) the raw data; (2) log-transformed data; (3) inverse-transformed data, with (4) a Box-Cox plot showing that the inverse transformation is best supported. It’d be kind of surprising (although I admit freely that I am leaping to conclusions just from the figure, not having read the rest of the paper) if the authors had the thought to produce this figure and then went on to do the paired t-test on the raw data scale … ??

If by the authors you mean me and my co-authors, we were just re-analyzing a dataset that was published in an article and presented a paired t-test on the untransformed data. The point of the boxplots is to show that the analysis on the raw data was not a good idea. If by authors you mean the authors of the original article, I am guessing they didn’t look at boxplots but rather went right ahead and did the paired t-test (actually they did an anova, but since t^2=F, it amounts to the same thing). This is what a lot of people do: load data, run test.

Or maybe you mean that the original authors tried to get statistical significance from the data; I think that the probability of that being the case is near 0; I think it was just an oversight. The point I was trying to make in this exchange on this blog was that assumptions (or maybe it’s just the role of outliers or influential values) are not taken sufficiently seriously, at least in some circles.
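The t^2 = F remark above can be checked numerically. This sketch assumes the ANOVA in question is a one-factor repeated-measures ANOVA with two conditions, which is the case where the identity with the paired t-test holds exactly; the data and function names are hypothetical.

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic: mean difference over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def repeated_measures_f(x, y):
    """F statistic for the condition effect with subjects as blocks."""
    n = len(x)
    grand = (sum(x) + sum(y)) / (2 * n)
    cond_means = [sum(x) / n, sum(y) / n]
    subj_means = [(a + b) / 2 for a, b in zip(x, y)]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = 2 * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_error = ss_total - ss_cond - ss_subj
    # df_condition = 1 and df_error = n - 1 when there are two conditions.
    return ss_cond / (ss_error / (n - 1))

# Made-up paired measurements for 12 subjects (illustration only).
before = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.1, 5.0, 5.6, 5.2, 4.9]
after = [4.6, 4.5, 5.2, 5.1, 5.0, 4.4, 4.9, 5.3, 4.8, 5.0, 4.7, 4.6]
t_stat = paired_t(before, after)
f_stat = repeated_measures_f(before, after)
print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
```

Since the two statistics coincide, any concern about outliers or skew in the paired differences applies equally to both analyses.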

I’m out of reply depth here. I like the idea of randomization and bootstrap tests; we should use them a lot more. I don’t know why we don’t.

Christian, at least in psycholinguistics a surprising number of people argue for null results, drawing strong conclusions that \mu=0 from low-power studies (with power lowered even more by violations of normality of the residuals). So it’s a real problem.

I’ll look at all the other things K? O’Rourke and fred linked to, thanks for that.

So long as you stop to check once in a while whether you are using the right levers…

Andrew:

I would phrase it a little differently:

Theory about the world ⇒ assumptions behind measurement and structural model + data = inference.

I would focus more on theory rather than assumptions. The latter ought to be the restrictions imposed by the former. (But granted plenty of assumptions are made out of convenience).

[…] -The “assumptions” of statistical models, Andrew Gelman writes, “are the levers that allow us to move the world.” [AndrewGelman.com] […]

“Medical research: even if it’s a randomized experiment, the participants in the study won’t be a random sample from the population for whom you’d recommend treatment. You don’t need random sampling to generalize the results of a medical experiment to the general population but you need some substantive theory to make the assumption that effects in your nonrepresentative sample of people will be similar to effects in the population of interest.”

I think this is the assumption that most often causes problems. You have an “unbiased estimate”, but of what? What is the population? In most cases the population is really only the set of people/animals/cells you performed the study on.

question:

Additionally, in medical research, you have an “unbiased estimate” only where there is absolutely no effect even internally given non-compliance, drop-outs, missing data, patients being clearly told the experimental treatment may have no benefit at all, loss of blinding, etc.

One way of putting this is that randomised clinical trials are really good at identifying a treatment effect but often horrible at estimating the size and/or variability of the effect.

Also exactamundo.

Another similar sampling example where the model is known is sampling from stored blood samples for further analysis such as genotyping or measuring new biomarkers.

I agree with Andrew here. The only thing I would add is that when the outside world hears the word “assumption” they sometimes hear “faith”. (My SO likes to say “when you assume you make an ass out of u and me”. I think she is kidding.) In my mind a key idea about assumptions is that they are (almost) all testable, and often even (weakly) within the data set you have in hand.

David:

Good point about assumptions being testable (or, as I sometimes like to say, grounded in reality). This is one of Deborah Mayo’s big points about the philosophy of statistics: what makes a method “objective” is not that it is conventional (for example, we can’t simply label a Jeffreys prior as “objective” just because it is a standard choice, and we can’t simply label a maximum likelihood analysis as “objective” just because it is the default textbook thing to do) but rather that it is tied (ideally, in multiple ways) to reality, for example by being motivated from some combination of logical argument and historical data, and being checkable in some way (possibly not until the future) by comparison to data.

Would love a post on the subject: What constitutes a test of an assumption; what assumptions are not testable, and so on? And bonus points if you can slip in a Scott Adams reference. I miss fraac.

One important point here is that it is more precise to ask “what assumptions can be distinguished from what alternatives”. All tests (or visual diagnostics) can distinguish the model assumption from some alternatives but not from others. For example, independence can be distinguished, given enough data, from regular positive or negative dependence between neighboring observations and from some other more complex but regular patterns of dependence, although this requires assumptions about identical distributions, either in the marginals or in innovations or something like this. If every different observation is allowed to follow their own distribution, nothing can be diagnosed about dependence. And whatever you observe can well be explained by some potentially weird dependence pattern or some potentially weird violation of identical distributions, so you can tell neither independence nor identical distributions apart from a “catch-all alternative”.

The implication of this is that although much can be tested, in order to say anything you need to make some kind of i.i.d. (or exchangeability) assumption that cannot be tested, either for the data themselves or at least on some level where it produces a regular pattern of dependence or non-identity as in regression or standard time series models.
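As a small illustration of distinguishing independence from one specific alternative, the sketch below runs a permutation test on the lag-1 autocorrelation. It is only a sketch under stated assumptions: the simulated AR(1) series, its coefficient of 0.8, and the function names are hypothetical, and the shuffling step itself relies on the observations being identically distributed, which is exactly the kind of untestable assumption described above.

```python
import random

def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

def permutation_pvalue(xs, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: order carries no information.

    Shuffling is valid only if the observations are exchangeable,
    i.e. identically distributed with no dependence on position.
    """
    rng = random.Random(seed)
    observed = abs(lag1_autocorrelation(xs))
    shuffled = list(xs)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(lag1_autocorrelation(shuffled)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Hypothetical series with regular positive dependence between neighbours.
rng = random.Random(42)
series = [0.0]
for _ in range(199):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

p_dependent = permutation_pvalue(series)
print(f"lag-1 autocorrelation {lag1_autocorrelation(series):.2f}, p = {p_dependent:.4f}")
```

The test picks up this regular pattern easily, but it says nothing about dependence structures that leave the lag-1 autocorrelation near zero, and it cannot separate dependence from a drifting mean.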

Given enough data, we’re all dead.

I’m not sure what property called “independence” you think you’re testing for with all that data (all of which is taken in overlapping light cones), but whatever it is, there is a possibility it isn’t actually being relied on:

http://www.entsophy.net/blog/?p=276

Entsophy: I read this some time ago and it’s a nice result and a good thing to know indeed. However, often in Statistics I am rather interested in having a model (even if this is based on some untestable assumptions, and even if I’m using this rather to discuss its weaknesses than to “believe” it) for the process generating the data rather than some kind of distributional representation of the data that doesn’t attempt to connect in any way to how the data came about.

Christian, that response is like a homeless man currently living in a cardboard box who refuses to accept a house as a gift. When asked why they don’t want the house, they respond “because I like having a roof over my head when I sleep”.

How does your “house” give me the roof I want?

> “objective” … being checkable in some way (possibly not until the future)

Would be found too wrong if enquired into sufficiently.

I have been thinking about this quote from Thomas Hoskyns Leonard, A personal history of Bayesian statistics

“It was used [by Ramsey] to demonstrate that Bayesian probability measures can be falsified, and so met an empirical criterion of Charles S. Peirce, whose work inspired Ramsey.”

At least, this does seem consistent with my understanding of Peirce, who was nonetheless very disparaging of Laplace’s indifference priors.