A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold.
Here’s my reply:
As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components.
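To make the variance-components idea concrete, here is a minimal sketch (simulated data, with made-up group and error standard deviations) of the classical method-of-moments decomposition for a balanced one-way layout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a one-way layout: J groups, n observations each (numbers invented)
J, n = 8, 20
sigma_between, sigma_within = 1.5, 2.0      # true standard deviations of the components
group_means = rng.normal(0.0, sigma_between, size=J)
y = group_means[:, None] + rng.normal(0.0, sigma_within, size=(J, n))

# Classical ANOVA mean squares (method of moments)
grand_mean = y.mean()
msb = n * ((y.mean(axis=1) - grand_mean) ** 2).sum() / (J - 1)          # between groups
msw = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (J * (n - 1))  # within groups

var_within = msw
var_between = max((msb - msw) / n, 0.0)  # truncate a negative estimate at zero

print(f"within-group sd estimate:  {var_within ** 0.5:.2f} (true {sigma_within})")
print(f"between-group sd estimate: {var_between ** 0.5:.2f} (true {sigma_between})")
```

The point is that the variance components, not the F-tests, are the summary of interest; a multilevel model estimates the same quantities while also giving partially pooled estimates of the group effects themselves.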
The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes.
I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an informative prior. Or, in the short term, set interactions to 0 if they help you understand the model, and use statistical significance as a guideline if you’d like, but in concert with your substantive goals. If a certain interaction is something you’re just including to correct for potential imbalance between groups, and it’s not statistically significant, maybe you can toss it. But if it’s central to your understanding, keep it in, while recognizing that you will have a lot of uncertainty in coefficients and comparisons that arise from that interaction of factors.
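To illustrate the partial-pooling idea, here is a toy sketch with made-up interaction estimates and standard errors, shrunk toward zero by the standard normal-normal formula under a hypothetical N(0, tau^2) prior:

```python
import numpy as np

# Hypothetical raw interaction estimates and their standard errors (invented)
beta_hat = np.array([0.8, -0.3, 1.2, 0.1])
se = np.array([0.5, 0.4, 0.6, 0.3])

tau = 0.5  # sd of an informative N(0, tau^2) prior on each interaction

# Posterior mean under the normal-normal model: shrink each estimate toward 0,
# more strongly when the estimate is noisy relative to the prior
shrinkage = tau**2 / (tau**2 + se**2)
beta_pooled = shrinkage * beta_hat
post_sd = np.sqrt(1.0 / (1.0 / tau**2 + 1.0 / se**2))

for b, bp in zip(beta_hat, beta_pooled):
    print(f"raw {b:+.2f} -> partially pooled {bp:+.2f}")
```

Rather than the all-or-nothing choice of keeping or dropping an interaction based on statistical significance, the prior pulls noisy interactions most of the way to zero while leaving well-estimated ones nearly alone.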
Regarding your conceptual point, yes yes yes yes yes, I agree that you should use those continuous variables; don’t chop them up into binary variables, as that just throws away information. If you _must_ discretize a variable, break it into 3 categories and compare high to low (see my paper with David Park on splitting a predictor at the upper quarter or third and the lower quarter or third).
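For concreteness, here is a small simulated example (all numbers invented) of splitting a continuous predictor at the lower and upper thirds and comparing high to low:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)              # a continuous predictor
y = 0.5 * x + rng.normal(size=300)    # hypothetical outcome with a linear relation

# Split at the lower and upper thirds rather than dichotomizing at the median
lo, hi = np.quantile(x, [1 / 3, 2 / 3])
low_group = y[x <= lo]
high_group = y[x >= hi]

# Compare high to low; the middle third is set aside
diff = high_group.mean() - low_group.mean()
print(f"high-vs-low difference: {diff:.2f}")
```

The high-vs-low comparison retains much more of the signal than a median split, because the two extreme groups are farther apart on the underlying scale.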
And now here’s the question:
Recently, there has been a shift in the field away from ANOVA to the use of mixed effects logit models. It was primarily based on the advice in this paper: Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59: 434–446. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613284/)
It is becoming the gold-standard technique in the field, but I [the psychologist who’s asking this question] am having some issues understanding it. Learning to program it is relatively easy; learning how to use it appropriately, and especially how to interpret logit models, is much harder. And I have overheard too many discussions about interactions amongst my poli sci and economist friends, especially in logit models, not to be somewhat sceptical of the advice in said paper. So I took to reading, and have ended up more confused than I started. Basically, given some of the issues, I am not sure that the switch is worth it. But my priors lead me to believe that the author of the paper knows much more about stats than I do, and, given that, that I’m confused. Hopefully this will be intriguing enough to you to respond.
Here’s what I’m having trouble with.
The main impetus for the shift away from ANOVA to logit is two-fold: 1) arguing that we actually have categorical response data, and 2) a demonstration of a spurious interaction effect in ANOVA – as in, it’s significant in ANOVA (even using transformed data) but not in the logit model. I will deal with the latter first, since it is a statistical issue, whereas I see the former as conceptual, and I can see arguments for both sides.
As far as I can tell, the interpretation of interactions in logit is very tricky. This point is made by Golder and colleagues (https://files.nyu.edu/mrg217/public/presentation_interaction.pdf), among others. Moreover, it depends on the type of variables considered in the interaction, i.e., continuous vs. categorical (http://www.ats.ucla.edu/stat/stata/seminars/interaction_sem/interaction_sem.htm). Given all the complications, I am loath to throw away a result because it was not significant in a logit model. Basically, logit results are being treated as if they were ANOVA results (just look at the p value and you know all you need to know), as though the switch got rid of a problem and came with bonus information (effect sizes, in this case odds ratios, although they are not reported in the papers I’ve read using logit). But according to Golder and colleagues, “the coefficient and standard error on the interaction term does not tell us the direction, magnitude, or significance of the ‘interaction effect’”. Or, from some folks at UCLA (http://www.ats.ucla.edu/stat/stata/seminars/interaction_sem/interaction_sem.htm): “Just because the interaction term is significant in the log odds model, it doesn’t mean that the probability difference in differences will be significant for values of the covariate of interest. Paradoxically, even if the interaction term is not significant in the log odds model, the probability difference in differences may be significant for some values of the covariate.” Or Berry, DeMeritt, and Esarey (2010, AJPS vol. 54). Or Ai and Norton (2003): in probit or logistic regressions, one cannot base statistical inferences on simply looking at the coefficient and statistical significance of the interaction terms. So reading off the p values for an interaction term is not a straightforward matter, or should I say, using them to directly reject the hypothesis that there is an interaction is not the same as in an ANOVA.
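To see the UCLA point numerically, consider a toy logit model (coefficients invented for illustration) in which the interaction coefficient on the log-odds scale is set to exactly zero; the difference in differences on the probability scale still comes out nonzero:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical logit model with NO interaction on the log-odds scale
b0, b1, b2, b3 = -1.0, 1.0, 1.5, 0.0   # intercept, x, z, x*z (invented values)

def p(x, z):
    """Predicted probability for binary predictors x and z."""
    return sigmoid(b0 + b1 * x + b2 * z + b3 * x * z)

# Difference-in-differences on the probability scale
did = (p(1, 1) - p(0, 1)) - (p(1, 0) - p(0, 0))
print(f"log-odds interaction: {b3}, probability DID: {did:.3f}")
```

Because the logistic curve is nonlinear, the effect of x on the probability scale depends on where z puts you on the curve even with no interaction term at all, which is exactly why the interaction coefficient and the probability-scale "interaction effect" can disagree.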
This is not to say that using untransformed data is unproblematic. But there are other transformations that can, I think, deal with the problem raised, as long as one is willing to think about the data as more or less continuous rather than binary. (I am not very sure about this, however, hence the “I think”; see http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/Data%20Transformation%20Handout.pdf).
This brings me to my more conceptual point. Often, we are interested in an underlying variable that is not binary; rather, it varies along some dimension, e.g., strength of a memory, effectiveness of learning given different study or teaching methods, things of that sort. And by probing a person’s knowledge multiple times, we hope to have an estimate of that underlying variable. This same point is made by Long (1997). So if we have an estimate that is closer to being continuous (it will always be somewhat constrained by the number of times we ask people what is essentially the same question), doesn’t it make sense to use it? Often, researchers are restricted to binary response variables as measures, and that makes their lives more complicated. So if we have data more closely related to what we are interested in (i.e., the person’s overall performance), why not use it? We do not care about particular responses, we care about overall patterns in responses, and we have those. Example: I want to know if one learning method is better than another. I have two groups of people learn under different conditions. I test their eventual knowledge. I now have their responses on individual test items, and their overall performance. Since I care about their overall performance, why would I use an approximation, or put differently, a single sample of their performance, to test whether learning methods affect overall performance? Moreover, using overall performance avoids folding in the random variation in an individual’s performance on any single item. One might argue that it does not account for the consistency in a person’s performance over items, but that seems to be misguided. (And if you have a repeated measures design where the same person is included in different conditions, then you will have some estimate of that part of the total variance.) However, I can see reasons for doing both.
E.g., if I were interested in whether increased studying led to better performance on a test, I would want to use overall performance. If I were instead interested in whether my individual questions were fair (meaning, they each reflected the relationship between studying and better performance) then I would most definitely want to include data from each question individually. But that’s a different question.
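To make the example concrete, here is a toy simulation (invented success rates; it deliberately gives every person in a condition the same true ability, ignoring the person-level variation a mixed model would target) in which each person's items are aggregated into an overall score and the two conditions are compared with an ordinary t statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical learning conditions, 30 people each, 40 binary test items per person
n_people, n_items = 30, 40
p_a, p_b = 0.60, 0.70                           # assumed true success rates
items_a = rng.random((n_people, n_items)) < p_a  # item-level binary responses
items_b = rng.random((n_people, n_items)) < p_b

# Aggregate each person's item responses into an overall score (a proportion)
score_a = items_a.mean(axis=1)
score_b = items_b.mean(axis=1)

# Ordinary two-sample t statistic on the aggregated scores
diff = score_b.mean() - score_a.mean()
s2 = (score_a.var(ddof=1) + score_b.var(ddof=1)) / 2
t = diff / np.sqrt(s2 * 2 / n_people)
print(f"difference in overall performance: {diff:.3f}, t = {t:.2f}")
```

With real data the per-person scores would also carry between-person ability variation, and that extra variance component is precisely what the random subject effect in a mixed model is meant to absorb.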
The points just mentioned are about whether the switch (to logit) is really necessary. (Clearly, if the data have to be considered binary, then linear regression isn’t appropriate.) The following points are about implementation and recommendations as I understand them. If I’m going to have to use logit, I want to do it right.
Fixed or random effects. I was both heartened and disheartened to see your posting on random vs. fixed effects. I have been trying to figure out why things like participant are being treated as random effects, and can’t: there is no discussion in my field (as far as I can tell, anyway) about whether predictor variables should be treated as fixed or random, although I see that there are ways of deciding based on the nature of the error (rather than any a priori assumptions). I ask because it seems as if the decision is not trivial. My understanding of the difference (from the perspective of assumptions) is that random effects are more efficient but biased, and that in other disciplines the choice of a random effects model would have to be tested and justified. Moreover, they fail to account for what they are supposed to account for: error that is consistently attributed to an individual and is associated with that person in each and every measurement taken by that person. So while they make the model more efficient, they are also less conservative. They also have different interpretations (http://www.cscu.cornell.edu/news/statnews/stnews76.pdf). Given that we psychologists are typically interested in the effect in general, not simply with respect to the people we are testing, then, according to the site just referenced, it would seem that individuals should be treated as fixed effects for logical reasons as well. The push to use mixed effects models has been predicated on ‘the fact that ordinary logit models provide no direct way to model random subject and item effects’. But if the error attributed consistently to individuals can be handled as fixed effects, and this is a less biased model, then it would be preferred. Or am I missing something?
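The efficiency/bias trade-off I have in mind can be sketched with a toy simulation (made-up variance components, assumed known): the "random effects" estimates of the subject offsets are just the raw per-subject means, the "fixed effects" estimates, shrunk toward the grand mean:

```python
import numpy as np

rng = np.random.default_rng(3)

# Each subject has a true offset; we observe a few noisy measurements per subject
n_subj, n_obs = 12, 5
sigma_subj, sigma_obs = 1.0, 2.0   # invented between- and within-subject sds
true_offsets = rng.normal(0, sigma_subj, n_subj)
y = true_offsets[:, None] + rng.normal(0, sigma_obs, (n_subj, n_obs))

# "Fixed effects": one free dummy per subject, i.e., the raw subject mean
fixed = y.mean(axis=1)

# "Random effects": shrink each subject mean toward the grand mean,
# weighting by the (here assumed known) variance components
w = sigma_subj**2 / (sigma_subj**2 + sigma_obs**2 / n_obs)
random_eff = y.mean() + w * (fixed - y.mean())

mse_fixed = ((fixed - true_offsets) ** 2).mean()
mse_random = ((random_eff - true_offsets) ** 2).mean()
print(f"MSE of fixed-effect estimates:  {mse_fixed:.2f}")
print(f"MSE of shrunken (random) ests:  {mse_random:.2f}")
```

Each shrunken estimate is biased toward the grand mean, but its variance is reduced; on average, across subjects, the shrinkage buys more in variance than it costs in bias, which is the usual argument for treating subjects as random effects.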
Multicollinearity: One of two approaches to multicollinearity is being advocated (more elsewhere than in the article): 1) drop a variable, or 2) center the variables. The first approach seems to me somewhat suspect, as it introduces bias into the model. (I guess one could argue that if you had never thought about that variable and never included it, the same bias would exist. But since we are supposedly testing our hypotheses about how something works (your theory is built into the design of the experiment; why include an IV if you don’t think it does something?), not including a variable in your model because of multicollinearity seems like knowingly (even if not intentionally) introducing bias that could make your other predictor variables seem more important than they really are, thereby artificially inflating the evidence for other parts of your theory of the phenomenon. But given what I just talked about, random vs. fixed effects, bias doesn’t seem to be too much of a concern…) My reluctance seems to be supported by Kennedy (1998). The second solution is argued against by Golder and colleagues (https://files.nyu.edu/mrg217/public/presentation_interaction.pdf) as a) trying to get rid of part of the error that is seen as an overestimation but in reality is not, and b) not doing what people think, because it doesn’t add any additional information (which is the real solution).
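The centering point can be checked numerically. In this simulated example (coefficients invented), centering greatly reduces the correlation between a predictor and its product term, but, as Golder and colleagues note, it adds no information: the model's fit is identical either way:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.normal(5.0, 1.0, n)          # predictor with a nonzero mean
z = rng.normal(3.0, 1.0, n)
y = 1.0 + 0.5 * x + 0.3 * z + 0.2 * x * z + rng.normal(0, 1.0, n)

def fit(xv, zv):
    """Least-squares fit of y on an intercept, xv, zv, and their product."""
    X = np.column_stack([np.ones(n), xv, zv, xv * zv])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, (resid ** 2).sum()

_, rss_raw = fit(x, z)
_, rss_centered = fit(x - x.mean(), z - z.mean())

# Centering slashes the correlation between x and the product term...
xc, zc = x - x.mean(), z - z.mean()
r_raw = np.corrcoef(x, x * z)[0, 1]
r_centered = np.corrcoef(xc, xc * zc)[0, 1]
print(f"corr(x, x*z) raw: {r_raw:.2f}, centered: {r_centered:.2f}")
# ...but the residual sum of squares, i.e. the fit, is unchanged
print(f"RSS raw: {rss_raw:.2f}, centered: {rss_centered:.2f}")
```

The centered design spans exactly the same column space as the raw one, so centering only relabels the coefficients; it cannot manufacture precision that is not in the data.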
I should be clear that all of this is not to say that I think ANOVA is perfect. (I have seen, but I will admit not yet carefully read, your paper on ANOVA; it seems to be more about complicated designs than I typically use, so I may be fine with the standard version. But even though I use SPSS, I can program in it (I learned to use it back in the SPSS for DOS days), so using an improved ANOVA model is something I could do with some work.) But I know how to interpret things in ANOVA. Unlike what seems to be the case for practitioners of regression (from what I gleaned from a presentation and paper by Golder and colleagues), I was taught to be careful about interpreting main effects given a significant interaction in an ANOVA. Basically, I was taught not to interpret them, but instead to analyse the effect of one IV separately at each level of the other. So moving to something that is difficult to interpret causes me some trepidation. Regression clearly has some benefits, in particular coefficients, but I am unconvinced that logit is the way to go. But if not logit (or something like it), then people like me will run into serious issues with ns: many of our experiments just don’t have enough participants to include many variables in a linear regression (which I can also interpret) with one or two observations per person (given repeated measures designs, we can have, e.g., two different percentages from the same person, but usually not too many more than that).
(Not to mention that adding and removing variables from an analysis based on results is something that drives me batty. As an experimentalist, my view is that you have a theory about how things work, that theory drives your experimental design, and you should include all the variables you included in your experiment in your statistical model. Even if something does not explain a significant amount of the variation, leaving it out will produce biased estimates of the contributions of the other variables. As experimentalists we are not in the business of finding the model that best fits our data; we are in the business of testing our theories. That can be done just as well, possibly better, with regression techniques as compared to ANOVA, but the temptation to fiddle with models and see what’s best is contrary to the logic of running a well-controlled experiment. But that’s a bit of an aside: not a problem with the technique, just with how it seems to be used.)