Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

Shravan Vasishth writes:

I saw on your blog that you listed aggregation as one of the desirable things to do. Do you agree with the following argument? I want to point out a problem with repeated measures ANOVA in a talk:

In a planned experiment, say a 2×2 design, when we do a repeated measures ANOVA, we aggregate all responses by subject for each condition. This leads us to underestimate the variability within subjects. The better way is to use linear mixed models (even in balanced designs) because they allow us to stay faithful to the experimental design and to describe how we think the data were generated.

The issue is that in a major recent paper the authors did an ANOVA after they failed to get statistical significance with lmer. Even setting aside the cheating and p-value chasing aspect of it, I think that using ANOVA is statistically problematic for the above reason alone.

My response: Yes, I think this is consistent with what I say in my 2005 Anova paper. But I consider that sort of hierarchical model to be a (modern version of) Anova. As a side note, classical Anova is kind of weird because it is mostly based on point estimates of variance parameters. And classical textbook examples are typically on the scale of 5×5 datasets, where the estimated variances are very noisy.
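To make the aggregation point concrete, here is a minimal simulation sketch (not from the post; the design, sample sizes, and variance parameters are all made up for illustration). It simulates many trials per subject per condition, then aggregates to one mean per cell, as a repeated measures ANOVA would. The trial-level residual spread is large, but after aggregation the only trial noise that survives is the much smaller sampling noise of the cell means, roughly the trial SD divided by the square root of the number of trials.

```python
import random
import statistics

random.seed(1)
n_subj, n_trials = 20, 10
sigma_subj, sigma_trial = 1.0, 2.0   # assumed variance components
cond_effect = {"A": 0.0, "B": 0.5}   # assumed condition effect

# Simulate trial-level data: y = subject intercept + condition effect + trial noise
cells = {}
for s in range(n_subj):
    alpha = random.gauss(0, sigma_subj)
    for cond in ("A", "B"):
        cells[(s, cond)] = [
            alpha + cond_effect[cond] + random.gauss(0, sigma_trial)
            for _ in range(n_trials)
        ]

# Trial-level residual SD: average within-cell SD (estimates sigma_trial = 2)
raw_sd = statistics.mean(statistics.stdev(v) for v in cells.values())

# Aggregate to one mean per subject-by-condition cell, then look at the
# residual spread after removing subject and condition means (two-way fit).
agg = {k: statistics.mean(v) for k, v in cells.items()}
grand = statistics.mean(agg.values())
subj_mean = {s: statistics.mean(agg[(s, c)] for c in ("A", "B"))
             for s in range(n_subj)}
cond_mean = {c: statistics.mean(agg[(s, c)] for s in range(n_subj))
             for c in ("A", "B")}
resid = [agg[(s, c)] - subj_mean[s] - cond_mean[c] + grand
         for s in range(n_subj) for c in ("A", "B")]
agg_sd = statistics.stdev(resid)

# The aggregated analysis never sees raw_sd; its residual spread is far smaller.
print(f"trial-level SD: {raw_sd:.2f}, post-aggregation residual SD: {agg_sd:.2f}")
```

In an lmer-style mixed model the trial-level observations stay in the analysis, so the within-subject variance component is estimated directly rather than being collapsed away before modeling.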