Bill Harris wrote in with a question:
David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent.’
I recently fit a model to data in which we know the ‘effect’ pretty well (we measure it), while we know the ‘cause’ less well (it’s estimated by people who only need to get it approximately correct). A model of the form ‘cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging.
For a simplistic example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor air temperature (T). We know E, because we can measure it. We typically get T at a “nearby” location (within 5-10 miles, perhaps), but we know microclimates cause that reading to be in error relative to the temperature that matters at the particular building.
So ‘E ~ T’ makes sense, but ‘T ~ E’ may violate fewer regression assumptions. Yet, at least in the short term and over a volume bigger than that covered by the air conditioner’s exhaust plume, the natural interpretation of the latter (“the outdoor air temperature is a function of the energy you consume to cool the building”) is hard to swallow.
How do you handle this? In a complete modeling sense, I see modeling the uncertainty in x and y, but often a simpler ‘lm(y ~ x)’ suffices. Which would you put as x and which as y? If you do ‘T ~ E’, how do you interpret the results in words?
Do we really use the terms “independent” and “dependent” variables in this sense in ARM and BDA? I don’t think so. If we do, those are mistakes we should fix; I’ve never liked those terms. In ARM I think we make it pretty clear that regression is about predicting y from x. There is no rule that y must have higher variance than x. Sometimes people want to predict y from x, but x is not observed; all that is available is z, a noisy measure of x. In this case one can fit a measurement error model. I believe we discuss this briefly somewhere in our books, but it’s an important enough topic that for the next edition of ARM, I think I’ll add a section on such models.
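To see what goes wrong when you regress on z instead of x, here is a small simulation (my illustration, not from ARM; pure standard library, with variable names chosen to mirror the E-and-T example and all noise levels made up). Classical measurement error in the predictor attenuates the least-squares slope toward zero by the reliability ratio var(x) / (var(x) + var(noise)); and when the outcome is measured cleanly, flipping the regression and inverting the slope happens to recover the truth, which is one way to see why the flipped fit can look better.

```python
import random
import statistics

random.seed(1)
n = 10_000
true_intercept, true_slope = 5.0, 2.0

# True temperature at the building (never observed directly):
T = [random.gauss(20, 5) for _ in range(n)]
# Energy use, measured well (small residual noise):
E = [true_intercept + true_slope * t + random.gauss(0, 1) for t in T]
# Nearby-station reading: T plus measurement error (sd 5, same as sd of T,
# so the reliability ratio is 25 / (25 + 25) = 0.5):
T_obs = [t + random.gauss(0, 5) for t in T]

def ols_slope(x, y):
    """Least-squares slope of y regressed on x: cov(x, y) / var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

slope_true_x = ols_slope(T, E)       # regression on the true predictor: ~ 2.0
slope_noisy_x = ols_slope(T_obs, E)  # attenuated: ~ 0.5 * 2.0 = 1.0
slope_rev = ols_slope(E, T_obs)      # flipped regression, T_obs on E

print(slope_true_x, slope_noisy_x, 1.0 / slope_rev)
```

In this setup 1 / slope_rev lands back near the true slope of 2, but only because E’s own residual noise is small relative to its variance; the flipped regression is a special-case trick, not a substitute for modeling the measurement error.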
Bill then responded:
I must have been looking too fast; I can’t find that anymore. I do see p. 37, which seemed crystal clear until I read Hogg (below); then it wasn’t clear whether the predictor on p. 37 of ARM really means what I think it means (energy use doesn’t drive outdoor air temperature, at least in the short term, but I /could/ interpret it as meaning that energy use can /predict/ outdoor air temperature more accurately than temperature can predict energy use).
In footnote 5 of http://arxiv.org/abs/1008.4686, Hogg et al. mention that you should regress x on y, not y on x, in those cases if you don’t model the measurement error.
I sense you don’t agree that Hogg’s approach is a reasonable intermediate step between a simple lm(y ~ x) and a full-blown model. Perhaps that’s something to cover more fully in a new ARM: is there anything in particular to do when working up from a simple lm() to a full-blown measurement error model (or perhaps you have covered this and I forgot or missed it)?
My reply: We’ll definitely cover this in the next edition of ARM. We’ll do it in Stan, where it’s very easy to write a measurement error model. Indeed, we’re planning to get rid of lm() and glm() entirely.
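For a sense of what such a model looks like, here is a minimal sketch in Stan (my illustration, not code from ARM or any of its editions): the true predictor x is latent, z is its noisy measurement with an assumed-known error sd tau, and the regression is on x rather than z.

```stan
data {
  int<lower=0> N;
  vector[N] z;         // noisy measurements of the true predictor x
  vector[N] y;         // outcome, measured well
  real<lower=0> tau;   // measurement-error sd, assumed known here
}
parameters {
  vector[N] x;         // latent true predictor values
  real alpha;
  real beta;
  real<lower=0> sigma;
  real mu_x;
  real<lower=0> sigma_x;
}
model {
  x ~ normal(mu_x, sigma_x);            // population model for the true predictor
  z ~ normal(x, tau);                   // measurement model
  y ~ normal(alpha + beta * x, sigma);  // regression on the true predictor
}
```

With tau set to zero this collapses to ordinary regression; as tau grows, the posterior for beta widens and the attenuation of the naive fit is undone.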