Why I don’t like the terms “DV” and “OLS”

Carlisle Rainey writes:

In an earlier blog post, you suggest: “…do a global search-and-replace to change ‘DV’ to ‘outcome’ and to change ‘OLS’ to ‘linear regression’.” Would you provide a quick explanation why or point me somewhere to find the answer myself?

My reply:

1. I don’t like the term “dependent variable” because of confusion with dependence of random variables. To me, “outcome” makes it clearer that you are choosing which variables to use as predictors and which as outcome. “Predictee” would be ok too, I guess.

2. “OLS” focuses on the optimization task; “linear regression” focuses on the model. I think the model is more important than how it’s estimated. To put it another way, “OLS” generalizes to weighted least squares, least absolute deviation, etc. “Linear regression” generalizes to logistic regression, nonlinear regression, etc. I find the latter set of generalizations more important and interesting.
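To make the model-versus-estimator distinction concrete, here is a minimal sketch in plain Python. It is not from the original exchange: the function name `fit_line`, the data, and the weights are all made up for illustration. The model stays fixed (a straight line, y ≈ intercept + slope·x), and only the estimation criterion changes, from ordinary least squares to weighted least squares.

```python
def fit_line(x, y, weights=None):
    """Fit y ~ intercept + slope * x by (weighted) least squares.

    With weights=None every point gets weight 1, which is ordinary
    least squares. The model (a straight line) is the same either way.
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    # Weighted cross-product and sum of squares around those means.
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: the first four points lie on y = 1 + 2x; the last is an outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 20.0]

# Same linear model, two estimation choices.
ols = fit_line(x, y)
# Hypothetical weights that give the outlier zero weight.
wls = fit_line(x, y, weights=[1.0, 1.0, 1.0, 1.0, 0.0])

print("OLS intercept, slope:", ols)
print("WLS intercept, slope:", wls)
```

Both calls fit what one would call a linear regression; "OLS" names only the first way of estimating it. Moving to logistic or nonlinear regression would instead mean changing the model itself, which a sketch like this does not cover.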

8 thoughts on “Why I don’t like the terms “DV” and “OLS””

  1. Also, in a multiple regression setting, usually all the variables are correlated with each other to some degree, and the distinction between "independent variables" and "dependent variable" is artificial. Other possibilities are "target variable" and "response variable".

  2. I don't like DV because it implies that the (value of the) variable depends on the other variables. And we don't know that: (a) finding out is why we're running the model, and (b) it assumes we know the causal direction.

    I like the argument about OLS. I'm going to stop saying it. (Unless I mean it, of course).

  3. I always thought the orthodox reason for the names was that dependent meant 'dependent on the error term' (not of any/all the IVs) and independent meant 'independent of the error term' (not of the other IVs). They refer to the structure of the model – not any actual relations between the variables in reality. I know that people don't normally interpret the names that way, but the distinction's useful.

    Personally, I don't like 'outcome' because the word carries connotations of causation and time order. If that's what you do – deal with measurements clearly ordered in time and have the assumptions needed to make a causal inference met – then by all means go for it and call the DV an outcome. But 99% of the time people are not going to be in that situation. I think the same thing about 'predictee': sometimes people do use regression for prediction, but again, that's not always (or even usually?) the objective.

    So unless you're in quite a specific line of work I don't think using these terms is totally appropriate, and other times it'll just be plain weird. What if you're trying to specify a model of a measurement taken last week using events that happened this week? You're not trying to predict anything and something that happened before your IVs isn't an outcome. I'm not sure if switching the names would help more than it'd hurt.

  4. … and gain some clarity (at least for us outsiders). Jargon (as shorthand) can increase bandwidth, or (as tribal signifiers) exclude an audience. In this case, thank you for including!

  5. I agree about DV.

    Less sure about OLS as i don't think it's always clear to replace it by 'linear regression' — when i'm contrasting the estimate from an instrumental variables model to the estimate from OLS, it's confusing to call the latter 'linear regression' as the IV model is also linear, and also a form of regression. Often the econometrics literature will talk about the two-stage least squares (2SLS) estimate vs. the OLS estimate, but i agree that seems to focus too much on the mechanics of the estimation procedures. Some of my epidemiologist colleagues have used the terms IV estimate vs. 'observational' estimate, 'conventional' estimate or even 'naive' estimate, but i don't really like those terms either so have sometimes gone back to OLS.

    Maybe just 'instrumented' vs. 'non-instrumented' estimates? Perhaps i'll try that in the next paper and see what the co-authors and referees think.

  6. I've always *hated* the terms DV/IV in the social sciences. They are misleading, and as soon as you want to talk to someone working in the area about actual dependence and independence, confusion ensues.

    "independent variables" usually aren't independent.

    Before you fit your model, you are hoping the "dependent variable" is dependent on your candidate predictors (at least, dependent enough that it's not swamped by a little noise)… but at that stage you can't say for sure.

    In order to have a discussion of any depth about the dependence structure among the variables, you have to abandon the DV/IV terminology. Why use it to begin with? There are better terms available.

    I agree, more weakly, about the OLS issue. I don't think the estimation method should take priority over the model. You specify the model and derive a good estimator, not shoehorn your model to fit a prechosen algorithm. I like my model to be explicit, not having its assumptions hidden away in the estimation algorithm.

    That way, if I choose to use least squares to fit a model for which it's not optimal, I can at least tell I am doing that.

  7. I can see the IV/DV problem. They're ambiguous, confusing and probably should not be used much, if at all.
    Your problem with "OLS" is more one of style than substance, though. To get published in most social sciences, you have to compare different ways of specifying the problem – OLS, IV (haha!), or other "linear" types of models – in order to justify your conclusions and demonstrate thoroughness. If you can suggest a reason why "OLS" could be taken as ambiguous or misunderstood by a statistically literate audience, I'll bite on your suggestion. Otherwise, I'll have no qualms continuing to use it.
