How to understand coefficients that reverse sign when you start controlling for things?

Denis Cote writes:

Just read this today and my unsophisticated statistical mind is confused.

“Initial bivariate analyses suggest that union membership is actually associated with worse health. This association disappears when controlling for demographics, then reverses and becomes significant when controlling for labor market characteristics.”

From my education about statistics, I remember being taught to be suspicious of multiple regression coefficients that are in the opposite direction of the bivariate coefficients. What am I missing? I vaguely remember something about the suppression effect.

My reply:

There’s a long literature on this from many decades ago. My general feeling about such situations is that, when the coefficient changes a lot after controlling for other variables, it is important to visualize this change, to understand what is the interaction among variables that is associated with the change in the coefficients. This is what we did in our Red State Blue State paper, for example, and we also developed some such tools in our paper on police stop-and-frisk. I have not read the particular article you cite, so I can’t comment on that particular application, but in general I think these multiple regression analyses can be fine; I just like to understand where in the data the switching of sign is coming from.

I think there’s still a lot of useful work to be done on graphical methods for understanding the effects of conditioning on regression models.

27 thoughts on “How to understand coefficients that reverse sign when you start controlling for things?”

  1. Often you can see it in the zero-order correlation matrix if it’s small enough. Assume the variables are in order Y, X1, X2, with Y the criterion and X1, X2 the predictors, and that they have the following zero-order correlations (lower triangle):

         Y    X1   X2
    Y    1
    X1  .6    1
    X2  .2   .5    1

    Then the bivariate regression coefficients of Y on X1 and of Y on X2 will both be positive. Conditionally, however, they are about .67 and -.13: X1’s coefficient stays positive while X2’s flips negative. This is because X2 correlates more strongly with X1 (.5) than it does with Y (.2). (Statistical significance is more a function of sample size, so assume it’s “sufficiently large.”) If you know matrix regression, the formula for these coefficients is a single matrix solve (a sketch follows at the end of this comment); they are the standardized partial regression coefficients, closely related to semi-partial correlations.

    There is a literature on suppression in multiple regression. The Google didn’t turn up any useful results, but there are several papers in The American Statistician as I recall, e.g., Schey, H. 1993. The Relationship between the Magnitudes of SSR(x2) and SSR(x2 | x1): A Geometric Description. Am. Stat., 47(1), 26-30. This literature tends not to be graphical at all, and in a high-dimensional model the source of a reversal can be quite tricky to track down. Anthony Atkinson’s book Plots, Transformations, and Regression (Oxford, 1985) has a number of useful results.
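    A quick numerical check of the example above (a minimal numpy sketch, using the correlation values given in this comment):

    import numpy as np

    # Zero-order correlations from the example:
    # corr(Y, X1) = .6, corr(Y, X2) = .2, corr(X1, X2) = .5
    R_xx = np.array([[1.0, 0.5],
                     [0.5, 1.0]])   # predictor intercorrelation matrix
    r_xy = np.array([0.6, 0.2])     # correlations of each predictor with Y

    # Standardized partial regression coefficients: beta = R_xx^{-1} r_xy
    beta = np.linalg.solve(R_xx, r_xy)
    print(beta)  # [ 0.6667 -0.1333]: X2's coefficient flips negative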

  2. Multicollinearity. The bivariate relationship is not very strong to begin with (p ~ 0.1 for N > 11000). Then you include variables that must be highly correlated with union membership (education level, self-employment dummy, occupation dummies), and the coefficient switches sign and its significance shifts. This fits multicollinearity.

    • But that is just to say that the variables are correlated. In fact, the only reason the OLS coefficient can change at all is a (sample) correlation between that variable and the one added.

      I find that a lot of folks trained in data analysis in social science statistics courses think of multicollinearity as some kind of special phenomenon and use it to justify excluding confounders from their regressions.

  3. Union members are slightly more likely to be healthy than adequately matched non-members (possibly because they have better health plans). However, union members are less likely to be well educated than non-members (possibly because professions requiring college degrees are less unionized), and college-educated adults are considerably more likely to report good health than uneducated adults. The different mix of college-educated adults and adults without high school degrees among union members and nonmembers reverses the sign of the correlation if you don’t control for education; the simulation sketched below makes the arithmetic concrete.
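    A minimal simulation of that composition story (all rates here are made up for illustration, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Made-up composition: union members are less often college-educated.
    union = rng.random(n) < 0.3
    college = np.where(union, rng.random(n) < 0.2, rng.random(n) < 0.5)

    # Within each education level union members are slightly HEALTHIER,
    # but college has a much bigger effect on health than union status.
    p_good = 0.5 + 0.05 * union + 0.30 * college
    health = rng.random(n) < p_good

    print(health[union].mean(), health[~union].mean())  # aggregate: union looks worse
    for c in (False, True):
        m = college == c
        # within each education stratum: union looks better
        print(c, health[union & m].mean(), health[~union & m].mean())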

  4. I thought the whole point of using multiple regression is to control for potentially confounding factors that could lead to spurious inferences in simple bivariate comparisons. If I suspected that union members were more likely to be less educated, to work in hazardous occupations, etc., then if someone showed me a correlation between poor health and union status my first reaction would be: run the multiple regression, control for X1, X2, and X3, then show me the results! The sign reversal is not surprising or suspicious at all, to me. Of course there will still be issues of selection on unobservables, etc., which loom large in labor economists’ attempts to estimate the wage effect of unions, for example.

  5. Table 3 in the paper gives you a pretty good idea of what’s going on. Health effects of union membership are small. Health effects of education are _huge_ (someone with a BA is more than twice as likely to report favorable health as someone with just a HS degree, and almost four times as likely as someone without a HS degree).
    Given patterns of private-sector unionization in the US, you’d expect unionized workers to be on average less educated, so, as per Nameless above, a unionized worker will have slightly better health outcomes than a non-unionized worker with similar education. Their matching analyses confirm that.
    That said, I’d have liked to see explicit interaction terms in there. I’m also a bit confused by the unionization patterns, because a large share of union members in the US are in the public sector and have college degrees.

    • This is also a case where straightforward linear regression hides some of the important interactions.

      Unionization affects different groups to different degrees. A unionized worker with only a high school degree is much less likely to be uninsured than a matched non-unionized worker, but that’s not necessarily the case for workers with college degrees.

      Table 5 of the paper repeats the regression on subsets of the full sample. Unionization improves health outcomes for men, workers without college degrees, and workers below the 75th percentile by income. It’s neutral for women, and the coefficient is negative for workers with college degrees and those above the 75th percentile by income.

      As to the unionization patterns, notice that they are running their analysis over the sample that goes back in time as far as 1973. I tried to rerun their GSS queries and I see that union members are, on average, less educated than nonmembers prior to 1990 or so (before the completion of private sector union busting) and more educated than nonmembers after 1990.

  6. It looks to me (just looking at the table, and assuming that the negative sign on the Z-score gives us the direction of the coefficient, which is weird to me, but ok) that it is the inclusion of “Industry and Occupation dummies” that changes the sign. So one way of interpreting that is just to say, “within any particular occupation, union membership is associated with better health outcomes.” It’s not really that surprising: it’s due to a correlation between occupation and health, and a correlation between unionization and occupation. Occupation is an omitted variable that is correlated with health status and with unionization, so when you leave it out of the regression, you get omitted variables bias (the standard identity is spelled out below; and I know, use of the word “bias” on this blog is not totally welcome, but in this case I think it provides a reasonable interpretive framework, though my next paragraph will go back to an interpretation that doesn’t think of the differences in terms of bias).

    Now… which coefficient you are interested in (conditional or unconditional on occupation) would likely depend on the empirical question you are asking. If you are asking “Do people in unions have better or worse health than the average person in society?”, you’d want the unconditional correlation. It just turns out that union members have worse health than the total population mean, and this is likely related to the occupations where unions still operate. But if you want to ask something like “Supposing someone does occupation X, are they likely to have better or worse health if they are in a union?”, you’d want the conditional regression. If you really wanted to know “What is the causal effect of union membership on health?”, you’d need a whole different empirical set-up (in my opinion).
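    To spell out the omitted-variable arithmetic from the first paragraph (the standard textbook identity; the notation here is mine, not the paper’s): if the long regression is

    health = a + b_union*union + b_occ*occ + error,

    then dropping occ gives a short-regression union coefficient of b_union + b_occ*d, where d is the slope from regressing occ on union. If unions are concentrated in occupations that are bad for health, the product b_occ*d can be negative and large enough to flip the sign of the estimated union coefficient.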

    • The way to understand this phenomenon is to recall that sign reversal (due to data aggregation) is not uncommon among statistical parameters; this is what Simpson’s paradox is all about (see the Wikipedia entry “Simpson’s paradox,” or Wasserman’s “All of Statistics”). The reason such sign reversals appear surprising is that we tend to confuse statistical with causal parameters, which do not change sign under aggregation. And, yes, graphs are needed to predict when sign reversal might take place and when it is impossible. Wikipedia even claims that Simpson’s paradox is an exercise in graph theory.

      • Galileo:

        Sign reversal can occur in causal settings as well. All you need are interactions. The math doesn’t care whether the associations are causal.

        • Andrew,
          Not so. Causal parameters (i.e., effect sizes) do not undergo reversal upon aggregation. No drug can be harmful to males, harmful to females, and beneficial to the population as a whole. It is not the “setting” that counts, but the nature of the parameters.

          Let us summarize: causal effects do not undergo reversal upon aggregation of subpopulations whose sizes are not affected by the treatment (e.g., gender). This is a theorem.

          You wrote: “All you need are interactions.” Not so. Associations can be reversed even in linear systems, with no interaction whatsoever. Conversely, no reversal of causal effects is possible, no matter how strong the interaction.

          The reason that association reversals evoke surprise (even in linear systems) is that people tend to misinterpret associations as “causal effects,” which, as the theorem says, cannot undergo reversal.

          Finally, “The math doesn’t care whether the associations are causal.” Almost. The math of associations does not care; the math of causation does.
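          (To spell out the arithmetic behind that theorem: if treatment does not change the group sizes, the population effect is the convex combination ATE_pop = P(male)*ATE_male + P(female)*ATE_female, so it must lie between the two subgroup effects and cannot oppose both in sign. Associational slopes obey no such constraint, because the effective weights can differ across treatment levels.)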

        • Suppose your datapoints (say, drug efficacy vs. dosage) for men are (1,100), (2,101), (3,102) and for women are (4,1), (5,2), (6,3). Wouldn’t the sign of efficacy vs. dosage change for the aggregated population?

          You might say this example has obvious separation issues with the data, but more subtle versions of this can occur anytime you’re working outside a perfectly controlled experiment. For example, you could have overlap, but sample women more heavily at higher dosages and men more heavily at lower dosages. (A sketch of the arithmetic follows.)
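          Here is that toy example checked in numpy (just reproducing the numbers above): the within-group slopes are both +1, while the pooled least-squares slope is strongly negative.

          import numpy as np

          # Toy data from the comment: (dosage, outcome) pairs.
          men   = np.array([(1, 100), (2, 101), (3, 102)], dtype=float)
          women = np.array([(4, 1), (5, 2), (6, 3)], dtype=float)

          for name, g in (("men", men), ("women", women)):
              print(name, np.polyfit(g[:, 0], g[:, 1], 1)[0])  # slope = +1.0 in each group

          pooled = np.vstack([men, women])
          print("pooled", np.polyfit(pooled[:, 0], pooled[:, 1], 1)[0])  # about -25: sign flips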

        • Anonymous:
          What you are asking is whether a drug can be protective in one population and harmful in the other. That’s possible, at least theoretically. However, you misunderstood Galileo’s point. I will quote him:

          “No drug can be harmful to males, harmful to females, and beneficial to the population
          as a whole.”

          Don’t you agree with this?

        • Actually the drug is helpful in _both_ males and females in my example (in the sense that +dosage is associated with +outcome in both subpopulations). It’s basically the Wikipedia illustration of Simpson’s paradox: http://en.wikipedia.org/wiki/File:Simpson%27s_paradox_continuous.svg
          I don’t claim that it’s novel, but it’s an issue that arises all the time in practice.

          I don’t doubt that his theorem is valid; I’m guessing it refers to a special case of sign change under aggregation, and he probably has a specific idea in mind of what it means to make a statement about a causal effect in each subpopulation. However, my example is not atypical of the kind of data from which inferences about causality are drawn. Even if one is working within a causal inference framework, this kind of issue can arise if the causal assumptions are violated (the graphical model was wrong, the instrument is not exogenous, etc.).

        • Anonymous,
          One nice thing about having math to aid us is that we are no longer at the mercy of guesswork as to what one “has in mind” or “what it means”; it is all in the math.

          The theorem says exactly what you understood it to mean: “efficacy of treatment does not undergo reversal upon aggregation of subpopulations.” This works for continuous and discrete treatments and outcomes, no exception.

          Are you saying that your example demonstrates reversal of efficacy? I do not see it. Referring to your data and to the Wikipedia plots, what we see is two dose-response curves, but no efficacies. The dashed, down-sloping line does not represent aggregate efficacy, but a best-fit line. Efficacy is whether the population as a whole would benefit from a high dose compared with a small dose.

          No matter what interpretation you give to the data, negative efficacy is not what I read in it. If I were a policy maker I would recommend this drug to the population as a whole because, no matter whether a patient is male or female, he/she would benefit from the treatment.

          The confusion between best-fit slopes and causal efficacy is precisely what gives Simpson’s paradox its paradoxical flavor: slopes reverse signs, effects do not, so when we interpret the former as the latter our mind resists the reversal. Your data show slopes, not efficacy.

          There are, of course, cases where the aggregate data give us the correct efficacy (and the correct sign), but that would take the discussion too far. All I wanted to convey on this blog is that the age of hand-waving about sign reversals is over.

        • “If I were a policy maker I would recommend this drug to the population as a whole because, no matter whether a patient is a male or a female, he/she would benefit from the treatment.”

          I’m glad you’re not a policymaker. The problem here is that the sampling of the independent variable differs between the male and female populations and is incomplete and nonrandom. One doesn’t know what will happen to the male population when dosed at x = 4, 5, and 6. What’s to say that the dose response of the male population at 4, 5, and 6 doesn’t follow that of the female population, so that if one estimated the effect over the full range of the independent variable (dosing here, but it could be anything), one would obtain a negative slope? Likewise with the female population for x = 1, 2, 3.

          Granted, this is a toy example, and you could make a fair counterargument that what you intended was to dose the male population within the range where its data are available. But it is easy to misapply an absolute statement like “a sign change is mathematically impossible” when the conditions under which the data are acquired don’t perfectly conform to the conditions needed to make such guarantees, which is often the case with observational data.

        • Dear Anonymous,
          I thought you presented your data in order to refute the general statement: “efficacy of treatments does not undergo reversal upon aggregation of subpopulations.” Now you say that there are problems with the data, because they are incomplete, imperfect, and more.

          Fine! But does it refute the statement?

          Please note, I did not intend the statement to apply only to a limited range where data are available. I intended it to be a general and provable statement about the ACE (the average treatment effect in populations), a well defined property of populations that exists even before we take any data, and which we may attempt to estimate by various designs.

          Here it is again: “efficacy of treatments does not undergo reversal upon aggregation of subpopulations.” The only data that could potentially refute this statement would be taken from three large-sample randomized experiments, one on males, one on females, and one on the mixed population. Any other kind of data would describe associations and imperfections, and would cause us to argue all day about what a decision maker should make of it.

          So, please make me understand:
          1. Do you object to the general statement: “efficacy of treatments does not undergo reversal upon aggregation of subpopulations”?
          2. Does your data refute this statement?
          3. Can any data refute this statement?
          4. Do you have another explanation why people get irritated when they see sign reversal in associations? (This is where this discussion started.)
          5. Do you agree that the days of hand-waving about reversals are over? (This is where I hope this discussion will end.)

  7. Because the data are observational, somewhat different language would be more appropriate. The logistic regression analyses do not “control for” the effects of the covariates; they “adjust for” those effects. The covariates are not under control, so it would be more accurate not to refer to them as “controls” in the first place.

    In presenting results, the authors usually mention the types of covariates in the model, but it is easy for readers to pick out an odds ratio without paying adequate attention to the fact that it is an adjusted odds ratio. To interpret it properly, one has to keep in view the whole list of covariates involved in the adjustment. (The definition of a regression coefficient in a multiple regression includes the list of other variables in the model.)

    Another comment suggested that the purpose of multiple regression (in this study, multiple logistic regression) is to “control for potentially confounding factors.” All multiple regression can do, however, is “adjust for” those factors. The phrase “control for” invites readers to think that those other variables are being held constant, but that is not how multiple regression (and related methods) works.

    I wonder about the use of Age Squared in the models. I did not see an explanation that the authors had explored the functional form of the contribution of Age and determined that it was adequately approximated by a quadratic.

    Finally, the analysis used only complete cases. It would be useful to consider whether multiple imputation could be applied to deal with the missing values.

    • A lot of things were overlooked:

      - The odds ratio is not collapsible, i.e., crude and adjusted ORs can differ even in the absence of confounding (see the sketch below). I would be more concerned if the outcome were not rare. Not sure the authors are aware of this.

      - They did a mediation analysis but did not adjust for mediator-outcome confounders. Also not sure whether they examined exposure-mediator interaction. Are they familiar with the current literature on this type of analysis?

      - Graphical models would have helped to determine whether adjustment is harmful or not. I don’t see anything surprising about the change in sign; it all depends on the direction of effect (protective or harmful) among the covariates.
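      A minimal sketch of the non-collapsibility point (coefficients made up for illustration; the covariate Z is independent of the exposure X, so there is no confounding, yet the marginal odds ratio differs from the common conditional one):

      import numpy as np

      def expit(v):
          return 1.0 / (1.0 + np.exp(-v))

      # True logistic model: logit P(Y=1 | X, Z) = b0 + b1*X + b2*Z
      b0, b1, b2 = -1.0, 1.0, 2.0  # conditional log-OR for X is b1 in BOTH strata of Z

      def p_marginal(x):
          # Z ~ Bernoulli(0.5), independent of X: average over Z
          return 0.5 * expit(b0 + b1 * x) + 0.5 * expit(b0 + b1 * x + b2)

      odds = lambda p: p / (1.0 - p)
      print("conditional OR:", np.exp(b1))                              # 2.72
      print("marginal OR:", odds(p_marginal(1)) / odds(p_marginal(0)))  # about 2.23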

    • You’ve got to be kidding, how did you go from “understand what is the interaction among variables” to “don’t be afraid of stereotyping”?
