Modeling y = a + b + c

Brandon Behlendorf writes:

I [Behlendorf] am replicating some previous research using OLS [he’s talking about what we call “linear regression”—ed.] to regress a logged rate (logged to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomenon divided by the population of the unit of analysis. The problem I am encountering is that Y is a composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has run regressions either with Y or with an individual phenomenon [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0].

My goal is to find a methodology that allows a researcher to regress Y on a number of Xs while accounting for the uneven variation in the distributions of the individual phenomena [A, B, C] that constitute Y. I have thought it could be treated within a Structural Equation Model as multiple dependent variables, or through a process of joint estimation, but in essence I know the latent factor (Y) that one usually does not know when measuring it through some sort of SEM or Rasch model. I have also considered weighting [A, B, C] by converting them into percentages of the total count of each phenomenon within the sample (i.e., (A1/sum A(1-100)) + (B1/sum B(1-100)) + (C1/sum C(1-100))), but the result lacks interpretability as to the overall relationship between the Xs and Y.

My reply:

First off, the reason for logging is to model a multiplicative relationship using an additive model. Skewness is typically irrelevant (see the discussion of regression assumptions in chapter 3 or 4 of ARM). No big deal here, I just wanted to get that out of the way. Also, if y is a count, you might want to use an overdispersed Poisson regression as discussed in chapter 6.
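To make the logging point concrete, here is a minimal simulated sketch (all numbers and names are hypothetical, not from the question): if the rate depends on a predictor multiplicatively, taking logs turns it into exactly the additive model that least squares fits, and OLS on the log scale recovers the multiplicative coefficient.

```python
import numpy as np

# Hypothetical illustration: the outcome is multiplicative in x,
#   y = exp(alpha + beta * x + noise),
# so log(y) is linear in x and OLS on log(y) recovers beta.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 2, size=n)
alpha_true, beta_true = 1.0, 0.7
y = np.exp(alpha_true + beta_true * x + rng.normal(scale=0.3, size=n))

# OLS of log(y) on x, with design matrix [1, x]
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)  # close to (1.0, 0.7)
```

The skewness of y itself never enters: the regression assumptions are about the errors on the scale you fit, which is the point of the ARM discussion referenced above.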

My main question is, if you have a, b, and c, why not just model them separately? Is it a sample size issue, that by combining a,b,c into y, you get more stable estimates? If so, that’s ok, and you could always try weighted averages if that makes sense in your application.
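One way to see why modeling the components separately can matter is a small simulation (everything here is hypothetical, just to illustrate the "why not model them separately?" point): when a, b, and c have different, even opposite-signed, associations with a predictor, the slope for the aggregate y = a + b + c is a compromise that can mask the component-level effects.

```python
import numpy as np

# Hypothetical sketch: three components with different multiplicative
# associations with one predictor x, and their aggregate y = a + b + c.
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 1, size=n)
a = np.exp(0.5 + 1.0 * x + rng.normal(scale=0.2, size=n))  # positive slope
b = np.exp(0.2 - 0.5 * x + rng.normal(scale=0.2, size=n))  # negative slope
c = np.exp(0.1 + 0.0 * x + rng.normal(scale=0.2, size=n))  # no association

y = a + b + c
X = np.column_stack([np.ones(n), x])

def ols_slope(outcome):
    # Slope from OLS of log(outcome) on [1, x].
    coef, *_ = np.linalg.lstsq(X, np.log(outcome), rcond=None)
    return coef[1]

slopes = {name: ols_slope(v) for name, v in [("a", a), ("b", b), ("c", c), ("y", y)]}
# The aggregate slope falls between the component slopes, so the
# opposite-signed effects of a and b partially cancel in y.
print(slopes)
```

If the component fits are too noisy on their own, partial pooling across a, b, c (or a weighted average of the component estimates, as suggested above) is the natural compromise.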

5 thoughts on “Modeling y = a + b + c”

  1. Would you still model a, b, and c separately if they are different aspects of the same observation? Say you’re modeling dietary preferences, and a=number of calories in a dish, and b=whether the meal was consumed in a solitary setting.

  2. My response, if given this question, would first be to ask for more context! Then some reasonable model surely could be found. I won’t try to propose something without a context, as here.

  3. Context would be helpful. I would commonly model an aggregate if it were more stable than the components, the components were easy substitutes, and the predictor variables were more closely associated with the aggregate Y than with the individual a, b, c components.

    Example: I would rather model a brand’s sales (Y) than the components (anchovy, barbecue and chocolate flavors), particularly if brand advertising was a major predictor.

    In this specific example the 3 components (anchovy, barbecue and chocolate) are equally important and substitutable and the fact that chocolate might outsell the other two isn’t of interest. I would feel differently if Y was total crimes and a,b,c were murder, burglary and jaywalking.

  4. “First off, the reason for logging is to model a multiplicative relationship using an additive model. Skewness is typically irrelevant (see the discussion of regression assumptions in chapter 3 or 4 of ARM).” While we’re quoting authorities, see p. 59 of ARM: “It commonly makes sense to take the logarithm of outcomes that are all-positive.” (Yes, I see the first sentence in the next paragraph, too.)

    But an unbounded all-positive outcome is pretty much guaranteed to be skewed, right? What does that say, if anything, about the probability of all-positive unbounded outcomes being multiplicative in nature? And how does that all fit into the Box-Cox transformation discussion of a few weeks ago? In EDA, we aren’t really saying we need to know a priori whether the model is additive or multiplicative, are we?
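The Box-Cox point in the comment above can be checked numerically. A minimal sketch (simulated data, all settings hypothetical): for an all-positive outcome that really is multiplicative (lognormal), the Box-Cox profile likelihood over lambda peaks near lambda = 0, i.e., it picks out the log transform without us asserting additivity or multiplicativity a priori.

```python
import numpy as np

# Hypothetical check: lognormal (multiplicative) data, so the Box-Cox
# MLE of lambda should land near 0, which corresponds to log(y).
rng = np.random.default_rng(2)
y = np.exp(rng.normal(loc=1.0, scale=0.5, size=2000))  # all-positive, right-skewed

def boxcox_loglik(y, lam):
    # Profile log-likelihood of the Box-Cox model at lambda,
    # assuming normal errors (constants dropped).
    if abs(lam) < 1e-8:
        z = np.log(y)
    else:
        z = (y ** lam - 1) / lam
    n = len(y)
    return -n / 2 * np.log(z.var()) + (lam - 1) * np.log(y).sum()

lams = np.linspace(-1, 1, 81)
best = max(lams, key=lambda lam: boxcox_loglik(y, lam))
print(best)  # near 0
```

This is the EDA reading of Box-Cox: the data themselves vote on the transformation, rather than the analyst committing to an additive or multiplicative model in advance.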
