Regression and causality and variable ordering

Bill Harris wrote in with a question:

David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent.’

I recently fit a model to data in which we know the ‘effect’ pretty well (we measure it), while we know the ‘cause’ less well (it’s estimated by people who only need to get it approximately correct). A model of the form ‘cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging.

For a simplistic example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor air temperature (T). We know E, because we can measure it. We typically get T at a “nearby” location (within 5-10 miles, perhaps), but we know microclimates cause that to be in error for what counts at the particular building.

So ‘E ~ T’ makes sense, but ‘T ~ E’ may violate fewer regression assumptions. At least in the short term and over a volume that’s bigger than covered by the exhaust plume from the air conditioner, the natural interpretation of that (“the outdoor air temperature is a function of the energy you consume to cool the building”) is hard to swallow.

How do you handle this? In a complete modeling sense, I see modeling the uncertainty in x and y, but often a simpler ‘lm(y ~ x)’ suffices. Which would you put as x and which as y? If you do ‘T ~ E’, how do you interpret the results in words?

I replied:

Do we really use the terms “independent” and “dependent” variables in this sense in ARM and BDA? I don’t think so. If we do, these are mistakes that we should fix. I never like the use of these terms. In ARM I think we make it pretty clear that regression is about predicting y from x. There is no rule that y must have higher variance than x. Sometimes people want to predict y from x, but x is not observed; all that is available is z, which is some noisy measure of x. In this case one can fit a measurement error model. I believe we discuss this briefly somewhere in our books, but it’s an important enough topic that I think for the next edition of ARM, I’ll add a section on such models.
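[A quick illustrative sketch, not from ARM or BDA: simulate y from a true predictor x, observe only a noisy proxy z, and compare the two regressions. All variable names and noise levels below are made up.]

    set.seed(123)
    n <- 500
    x <- rnorm(n)                       # true predictor (unobserved in practice)
    z <- x + rnorm(n, 0, 0.8)           # noisy measure of x
    y <- 1 + 2 * x + rnorm(n, 0, 0.5)   # outcome generated from the true x

    coef(lm(y ~ x))   # slope close to the true value of 2
    coef(lm(y ~ z))   # slope attenuated toward zero

[The attenuation in the second fit is the problem that the measurement error models discussed below are meant to address.]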

Bill then responded:

I must have been looking too fast; I can’t find that anymore. I do see p. 37, which seemed crystal clear until I read Hogg (below); then it wasn’t clear if the predictor on p. 37 of ARM really means what I think it means (energy use doesn’t drive outside air temperature, at least in the short term, but I /could/ interpret it as meaning that energy use can be used to /predict/ outdoor air temperature more accurately than temperature can predict energy use).

In footnote 5 of http://arxiv.org/abs/1008.4686, Hogg et al. mention that you should regress x on y, not y on x, in those cases if you don’t model the measurement error.

I sense you don’t agree that Hogg’s approach is a reasonable intermediate step between a simple lm(y ~ x) and a full-blown model. Perhaps that’s something to cover more fully in a new ARM: is there anything in particular to do when working up from a simple lm() to a full-blown model of measurement error (or perhaps you have covered this and I forgot or missed it)?

My reply: We’ll definitely cover this in the next edition of ARM. We’ll do it in Stan, where it’s very easy to write a measurement error model. Indeed, we’re planning to get rid of lm() and glm() entirely.

30 thoughts on “Regression and causality and variable ordering”

  1. In a simple regression with only x and y, it is well known that if x is measured with error it is best to regress x on y. I think Wooldridge discusses this in his Intro to Econometrics textbook.

    • Maybe Andrew can weigh in here also. Standard regression assumes that in y = a + xb + u, x is measured without error, and the “error” is in y. If there is error in both x and y, this is an errors-in-variables problem, and the coefficients are not identified unless the distributions of x and y allow the use of higher moments to do so (in other words, x and y may not be normal). Yet, intuitively, pretty much all variables in the social sciences have considerable errors (biological, astronomical, and environmental sciences also; I might give physics a pass). How does BDA resolve these types of issues?

      • As Andrew says in the body of the post, “Sometimes people want to predict y from x, but x is not observed; all that is available is z, which is some noisy measure of x. In this case one can fit a measurement error model.”

        Chapter 10 of the Stan manual discusses measurement error models (and meta-analysis).

        For example, if we have a high-throughput sequencer that uses image processing on fluorescence, there are known biases at a per-base level. We can model these and use them to adjust our predictions. Similarly, we know that at least the short-read sequencers I was familiar with a while ago (Illumina and SOLiD) had a strong front-of-the-sequence bias with increasing noise the longer the read was, so we can model this process in an aligner.

        Physics has the same sort of problems, by the way — any science where you use imperfect measurement tools does.

    • Fernando, Numeric:

      If you have errors in both variables, the model can (sometimes) be written as
      y ~ N(a + b*x, sigma_y)
      x_obs ~ N(x, sigma_x)
      and then you need a model for x, and you’re good to go. We discuss this model in our books (and of course it is a standard model in statistics and econometrics) but I believe we don’t emphasize it enough. In next editions it will get its own section, at least. Also it’s easy to write and fit the model in Stan.
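      [A minimal rstan sketch of the model written just above, with sigma_x treated as known; the normal model for x, the implicit flat priors, and the simulated data are illustrative assumptions, not from the books.]

      library(rstan)

      me_code <- "
      data {
        int<lower=0> N;
        vector[N] x_obs;          // noisy measurements of the true predictor
        vector[N] y;
        real<lower=0> sigma_x;    // measurement sd, treated as known here
      }
      parameters {
        vector[N] x;              // latent true predictor
        real a;
        real b;
        real<lower=0> sigma_y;
        real mu_x;                // parameters of the 'model for x'
        real<lower=0> tau_x;
      }
      model {
        x ~ normal(mu_x, tau_x);          // model for x
        x_obs ~ normal(x, sigma_x);       // x_obs ~ N(x, sigma_x)
        y ~ normal(a + b * x, sigma_y);   // y ~ N(a + b*x, sigma_y)
      }
      "

      ## Fake data just so the sketch runs end to end:
      n <- 200
      x_true <- rnorm(n)
      x_obs  <- x_true + rnorm(n, 0, 0.5)
      y      <- 1 + 2 * x_true + rnorm(n, 0, 0.7)
      fit <- stan(model_code = me_code,
                  data = list(N = n, x_obs = x_obs, y = y, sigma_x = 0.5))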

      • Yes. The best approach is to model the measurement error. But in the very simple case we are discussing here, I remember from Stats 101 the advice to regress x on y. But I am with you on this. Of course I’d suggest laying out your measurement theory using a DAG.

      • > We discuss this model in our books…

        I’ve got an “uncertainty in x and y” problem I need to work. Right now the only reference at my fingertips is Reed’s “Linear least-squares fits with errors in both coordinates.” I’ll check BDA3 tomorrow.

  2. The relevant quote from endnote 5 (on page 40) is “If x has much smaller uncertainties, then you must fit y as a function of x; if the other way then the other way, and if neither has much smaller uncertainties, then that kind of linear fitting is invalid. We have more to say about this generic situation in later Sections.”

    The context of endnote 5 is on pages 2 and 3; I guess the most relevant bit of context is the point “It is a miracle with which we hope everyone reading this is familiar that if you have a set of two-dimensional points (x,y) that depart from a perfect, narrow straight line y = m x + b only by the addition of Gaussian-distributed noise of known amplitudes in the y direction only, then the maximum-likelihood fit or best-fit line for the points has a slope m and intercept b that can be obtained justifiably by a perfectly linear matrix-algebra operation…” That is, the context is: if you want to use weighted linear least-squares fitting of this form, you’d better choose x and y such that y is the “noisy” variable and x is the “not noisy” variable. Etc.

    Interested to know if what’s written there is false or misleading or ought to be modified; that paper is on arXiv so we can revise it reversibly!
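    [For concreteness, the “perfectly linear matrix-algebra operation” in that quote is the standard weighted least-squares solution; here is a small R sketch with made-up data and per-point noise levels.]

    wls_fit <- function(x, y, sigma_y) {
      A <- cbind(1, x)                  # design matrix with intercept column
      Cinv <- diag(1 / sigma_y^2)       # inverse of the diagonal noise covariance
      solve(t(A) %*% Cinv %*% A, t(A) %*% Cinv %*% y)   # returns (intercept b, slope m)
    }

    x <- 1:20
    sigma_y <- runif(20, 0.5, 2)        # known noise amplitudes in y
    y <- 3 + 0.7 * x + rnorm(20, 0, sigma_y)
    wls_fit(x, y, sigma_y)              # should be near c(3, 0.7)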

    • David:

      I don’t understand what you’re saying. It seems to me that fitting y as a function of x, or x as a function of y, or some other model, would depend on your applied goals, not on the uncertainties of measurement.

        • All I mean is that if you are doing weighted linear least-squares fitting and think that you are maximizing a likelihood, you are wrong except in this case (noise in y, no noise in x). It relates to the connection between “procedure” (WLS) and “concept” (maximizing likelihood). If you are willing to use flexible noise models (and you should be!), then you should do whatever meets your goals.

          • Is this (“you are wrong”) even mathematically true? Consider the opposite extreme, i.e., our model is y = a + b z, but I only have noisy observations x_i = z_i + N(0, sigma^2), so _all_ the noise is in the “independent” variables. To evaluate the likelihood I’ll need a distribution over the z_i’s, but, if I’m doing this correctly (I may not be), the maximum likelihood estimate of a and b exists, is independent of the distribution over z, and is given by the OLS regression line of y on x.

    • David:

      I just downloaded http://arxiv.org/abs/1008.4686. A quick question: say I use Eqs. (26)-(32) in Section 7 to compute the ML regression line given my data. Is it straightforward to compute confidence and prediction intervals about that line?

      (For what it’s worth, I’m calibrating a radiometer. There’s some measurement noise (uncertainty in y) as well as some uncertainty in the calibration source (x) that I’m using. For all intents and purposes the measurement noise is Gaussian, with a standard deviation that I can accurately determine. In contrast, there’s essentially no random error in x, but I only know the value to within a percentage of x; i.e., I have a sensor which reports x, but the true value of x is (1+k)*x, where |k|<<1 and the same for all values of x.)

      • > I have a sensor which reports x but the true value of x is (1+k)*x where |k|<<1 and the same for all values of x.

        In the interest of saving time I’ve kluged a solution. I’ve set up a straight y vs. x weighted least-squares regression and then add an additional factor of k uncertainty to the ML estimate of the slope. My goal is to estimate x (and call out confidence limits) given a measurement y. I think I can get away with what I’m doing because the percentage error is the same for all x-values.

      • This model would be pretty easy to run in Stan; you could do it via rstan about as easily as running lm().

        The main question I have is: is k identified? In other words, do you have ANY information about k contained in the data? If it really is just constant throughout the measurements I don’t think you could find out what k is. You can always use a prior over k to propagate your prior knowledge of k through the calculation and determine appropriately wider bounds on the calibration coefficients, but you won’t get an improvement in your knowledge of k without some way to extract k from the data.

        Could you perhaps get at least one measurement using a different source, maybe one which isn’t adjustable but has very precisely known output? That precisely known reference would be enough to zero in on k. Another technique might be to use a variety of sources each with their own k and give a prior over the k values. You won’t be able to zero in on any given k that way, but you might do all right in calibrating the instrument itself.
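        [A sketch of what such an rstan model might look like, under my reading of this thread: the reported x is off by one shared factor (1 + k), k gets the N(0, sigma_k^2) prior mentioned in this exchange, and the noise sd in y is known. k is not identified by the data; the prior simply propagates its uncertainty into the calibration coefficients. All names and numbers here are made up.]

        library(rstan)

        cal_code <- "
        data {
          int<lower=0> N;
          vector[N] x_rep;          // sensor-reported x values
          vector[N] y;              // measured responses
          real<lower=0> sigma_y;    // known measurement sd
          real<lower=0> sigma_k;    // prior sd for the shared scale error k
        }
        parameters {
          real a;
          real b;
          real k;                   // shared multiplicative error: true x = (1 + k) * x_rep
        }
        model {
          k ~ normal(0, sigma_k);   // prior only; k is not identified by the data
          y ~ normal(a + b * (1 + k) * x_rep, sigma_y);
        }
        "

        ## Made-up lab data so the sketch runs:
        x_rep <- seq(1, 10, by = 0.5)
        y <- 0.2 + 1.5 * x_rep + rnorm(length(x_rep), 0, 0.1)
        fit <- stan(model_code = cal_code,
                    data = list(N = length(x_rep), x_rep = x_rep, y = y,
                                sigma_y = 0.1, sigma_k = 0.02))

        [The posterior for b then carries both the measurement noise and the extra slope uncertainty from k, which is essentially the kluge described above, done inside the model.]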

        • > The main question I have is: is k identified? In other words, do you have ANY information about k contained in the data?

          Short answer: No.

          It’s reasonable to say k~N(0,sigmak^2). I can look up vendor data and do an error propagation analysis to get sigmak to reasonable accuracy but there’s no info from which to infer the actual value of k. That’s acceptable. What I want to do is incorporate sigmak in calculating the uncertainty in my estimated value of x given measurement y. (My calibration curve is determined by fitting x-y data in the lab. I then make measurements in the field and use the calibration curve to estimate x from a measured y-value.) I believe I can avoid a full-blown ‘uncertainty in x and y’ analysis if all the x-values are off by the same percentage. Correlated errors just introduce additional uncertainty in the slope of the cal curve. If k were different for each x-value, i.e., each value corresponded to a different draw from the distribution, then I’d need to do the full analysis but it’s the same k-value for each x so that shouldn’t be necessary. Am I missing something?

        • I think you’re right, but if you could get a sample of different sources with different k values (which hopefully have an average value near zero, based on your N(0,sigmak^2) model), you could calibrate your instrument better and then have less uncertainty in your final measurement. If that’s not possible, then you’ll probably just have to live with the unknown factor of (1+k).

  3. I consider independent/dependent something different than cause/effect. To me, dependent means there is a dependency: based on a theory, we can predict the value of one variable from another (independent) variable, but without assuming cause and effect. In teaching sociology research methods this is really important, because for example we can predict salary based on gender, but that does not mean we think gender is the cause in some biological or other simple sense; we think that gender represents a common cluster of identities and experiences (of everyone involved) that come into play when a salary is being determined. Similarly, we believe the census should collect race information not because we think race is a biological cause of various things but because the experience of people of different races in American society is different. Now, can we predict gender based on salary, or race based on whether someone has been subject to stop-and-frisk or not? Sure, and that is probably of interest to someone, like maybe a lawyer looking for clients who want to sue somebody. But in general, assignment to independent and dependent is based on your theory. It is very hard for students to get this straight. So we practice saying “Mean salary depends on gender” and contrasting it with “Gender depends on mean salary,” which might be true for people who are data mining but makes limited sense otherwise. And most research is about trying to understand what those variables proxy for (employer attitudes, time out for child care, past salary history, etc.).

    (Okay, I’m sitting here thinking… maybe if I hear about someone being stopped and frisked I unconsciously assume they are not white, but in that case I’d say the variable is imagined race.)

    The only time in my class that I talk about cause and effect is when I get to experimental design and evaluation research, and honestly a lot of that is about (a) the extreme challenges of it and (b) situations where someone is going to make a decision about whether to continue to spend money and time on an intervention and in that case, yes, we are extremely interested in knowing if an intervention seems to be causing the intended change in the outcome measure. Since lots of my students will end up in jobs where they are implementing or making decisions to implement various interventions this matters a lot. We spend lots of time talking about all the reasons that research that seems to show cause and effect change actually may be deceiving.

    Of course this is just an undergraduate introduction to methods class, and a great deal is simplified. But to me the language of independent/dependent is helpful and a huge improvement over cause/effect.

    • Elin,
      The problem here is that *you* may “consider independent/dependent something different than cause/effect,” but your students and others reading what you write may interpret it using the definition of independent/dependent variables that they learned in high school: the independent variable is what you plug into a function, and the dependent variable is what you get out.

      To attempt to forestall this possible confusion, I try to use the terms “predictor” and “response.” I think this language tends to be more commonly used in, e.g., engineering than in e.g., the social sciences, but I wish it were more common in the social sciences, since the distinction is indeed important in promoting thinking in terms of prediction rather than causality.

        • Well, I hope my students don’t, since we spend quite a lot of time on it; further, if someone leaves a sociology major thinking that we study race and gender because we think they are simple causes and not social constructs, we would have failed in our curriculum. I actually do like the term predictor in many cases, certainly after you have collected data. I think language matters a lot, and as I said, lots of things can predict values of other things, but that does not mean that they make sense theoretically or logically as sociological explanation. So yes, I could and would say to my students that you can predict someone’s gender based on their salary, and that is really useful for certain kinds of problems and definitely for certain kinds of data where we haven’t been able to design the data collection. But that does not mean the same thing or tell the same story as saying that I can predict someone’s salary based on their gender. You would design data collection really differently to study those two questions. (Writing this is making me think about how female software developers in open source projects are sometimes encouraged to use gender-neutral or masculine screen names in IRC, and that does make me think about gender as not even fixed in time.) I agree that a lot of the time methods textbooks (especially older ones from the methods-wars days) do tend to conflate the two (just like they will describe gender and race as “fixed” variables). Now, race and gender are kind of straw variables; of course what it really is about is which variable you are trying to understand. Do you want to test a theory about why some students drop out of school and others don’t, or do you want to understand why some are involved in delinquency and others aren’t? The language of prediction there is okay but not really quite enough; you really have to decide what it is you are trying to explain and what your proposed explanation is. (And my students are beginners, so while we would talk about the possibility that the relationship will be complex over time, we probably would not actually do anything about that.) But you’ve made me think about using the term predictor more this fall and seeing if it helps them.

        I’m not sure that in sociology “response” really would be the right word, especially for undergraduates, for a few reasons: one, because a lot of times we are talking about responses to survey questions, so it would probably be confusing that way since everything is a response, and we want to always remember that. This is probably a good example of why disciplines tend to develop their own vocabularies.

        In my experience almost no students have any idea what dependent and independent variables are and have a lot of trouble even linking the idea to a more likely/less likely hypothesis while I can say that cause/effect is easy for students, which may be why people use that. I’ve never had a student who learned independent/dependent in high school or at least any student who thought such an idea would have any meaning in my classes. Maybe if we were writing formulas rather than sentences it would be different? I don’t know. Maybe I just have unusual students.

        • Why do you call gender a “straw variable”? Ok, maybe there’s some niche cohort which is fuzzy or switches gender but surely for most of the population gender is as much a “fixed” or “independent” variable as anything else?

          If one cannot even call gender an independent variable what can one actually call independent at all?

        • That’s why: because they are easy cases to make, and the case where you would treat them as dependent variables would be a pretty unusual edge case (like studying passing), since for most people they are fixed. The tougher case is when you have variables like whether a 17-year-old is attending school and whether s/he is involved in delinquency.

  4. When you say you’ll get rid of lm() and glm() (and probably also lmer()?) entirely in a new edition of ARM, does that mean that you will go “completely” Bayesian in a new ARM from the beginning? I always thought that it’s a big feature of ARM to be somewhat agnostic, using standard R functions in the first half to two-thirds of the book. I might not even have read and used it at that time otherwise.

    • Daniel:

      I’m mostly talking computation here. The idea is to have R functions lm_stan() and glm_stan() which will be essentially equivalent to lm() and glm() from the user’s point of view, except that they will directly access Stan models, which will make it essentially trivial to generalize the models in various ways such as nonadditivity, nonlinearity, unequal variances, etc. So, for that purpose, it’s fine to think of all this as maximum likelihood; indeed, you can compute the maximum likelihood estimate with a single call to Stan. But then it will be good to have priors at hand when you need them. Even non-Bayesians get a little antsy when fitting logistic regression with complete separation. It will also be convenient that glm-like models such as multinomial logit will be fittable in the exact same way. So no need to keep introducing new functions to fit new classes of models. We’ll start with the stan code for linear regression and then it’s easy to go to logistic regression, overdispersed Poisson regression, multinomial logit, nonlinear regression, measurement-error models, and so forth. All this can be done Bayesianly, but the Bayesian part isn’t needed for the basic models. And we’ll make it clear that (a) lots of models can be fitted using maximum likelihood, and (b) Bayes is there for you if you want to add prior information. The key reason to switch from lm/glm/polr/etc to Stan is flexibility in modeling, which is something that non-Bayesians should want too.
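      [To make the “single call to Stan” idea concrete, here is a sketch in plain rstan; lm_stan() and glm_stan() are described above as planned functions, so the code below is just an illustration, with made-up data: a linear regression whose likelihood line can be swapped out for a logistic, overdispersed Poisson, or measurement-error model.]

      library(rstan)

      reg_code <- "
      data {
        int<lower=0> N;
        int<lower=0> K;
        matrix[N, K] X;
        vector[N] y;
      }
      parameters {
        real alpha;
        vector[K] beta;
        real<lower=0> sigma;
      }
      model {
        y ~ normal(alpha + X * beta, sigma);   // change this line to change the model
      }
      "
      m <- stan_model(model_code = reg_code)

      ## Made-up data; maximum likelihood in one call (the mode under flat priors):
      X <- cbind(rnorm(100), rnorm(100))
      y <- as.vector(1 + X %*% c(2, -1) + rnorm(100))
      mle <- optimizing(m, data = list(N = 100, K = 2, X = X, y = y))
      mle$par[c("alpha", "beta[1]", "beta[2]", "sigma")]

      ## Full Bayes with the same model, when priors are wanted:
      ## fit <- sampling(m, data = list(N = 100, K = 2, X = X, y = y))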

      • Andrew:

        Thanks for the answer; that sounds quite interesting and reassuring. I’m used to being introduced to R by the basic functions, though. Having learned R that way, it was quite easy to get into ARM myself, and I thought I’d teach R similarly: introducing the basic functions, teaching how to use them, and pointing to ARM for the advanced quantitative students in the social sciences. I wonder if it will be a bigger jump from base R to ARM in the new edition. Nonetheless, using Stan without enforcing Bayes might be an even better way to “sneak” in Bayesian modeling ideas and to allow for flexibility in modeling, as you say. The first edition was already excellent for delving further into regression and multilevel modeling and hinting towards Bayesian options for the unsuspecting frequentist.

    • It’ll be 2 volumes (because we think it’s important to have a stand-alone regression book, as there are so many courses on that topic). I’m not sure when the new books will come out, maybe a year or two? We have to do some writing and some research first!
