Brandon Behlendorf writes:
I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed.] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomenon divided by the population of the unit of analysis. The problem that I am encountering is that Y is a composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has conducted regressions with either Y or an individual phenomenon [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not distributed across the sample of units in the same proportion, then the use of Y would be biased: as a count of [A+B+C] divided by the population, it would treat units with [2+0.5+1.5] and [4+0+0] as equivalent.
My goal is to find a methodology which allows a researcher to regress Y on a number of Xs, but which accounts for the uneven variation in the distributions of the individual phenomena [A, B, C] that constitute Y. I have thought that it could be treated within a Structural Equation Model as multiple dependent variables, or through a process of joint estimation, but in essence I already know the latent factor (Y), which one usually does not know when trying to measure it through some sort of SEM or Rasch Model. I have also considered weighting [A, B, C] by converting them into percentages of the total count of each phenomenon within the sample (i.e. (A1/sum A(1-100)) + (B1/sum B(1-100)) + (C1/sum C(1-100))), but the result is hard to interpret in terms of the overall relationship between Xs and Y.
My reply: First off, the reason for logging is to model a multiplicative relationship using an additive model. Skewness is typically irrelevant (see the discussion of regression assumptions in chapter 3 or 4 of ARM). No big deal here, I just wanted to get that out of the way. Also, since y is based on a count, you might want to use an overdispersed Poisson regression (with population as the exposure), as discussed in chapter 6.
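To make the point about logging concrete, here is a minimal sketch in plain Python. The data are made up for illustration (each one-unit increase in x multiplies the rate by 1.5), and `fit_line` is a hypothetical helper written for this example, not something from ARM. Regressing the log of the rate on x gives a slope that, once exponentiated, is the multiplicative factor.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: the rate starts at 2.0 and is multiplied by 1.5 per unit of x.
x = [0, 1, 2, 3, 4, 5]
rate = [2.0 * 1.5 ** xi for xi in x]

# On the log scale the multiplicative effect becomes additive in x.
intercept, slope = fit_line(x, [math.log(r) for r in rate])

# Exponentiating the coefficients returns to the multiplicative scale.
multiplier = math.exp(slope)    # factor applied to the rate per unit of x
baseline = math.exp(intercept)  # rate when x is 0

print(f"multiplier per unit of x: {multiplier:.3f}")
print(f"baseline rate at x = 0:   {baseline:.3f}")
```

Nothing here depends on the outcome being skewed; the log is doing its job because the effect of x is multiplicative.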
My main question is, if you have a, b, and c, why not just model them separately? Is it a sample size issue, in that combining a, b, and c into y gives you more stable estimates? If so, that’s ok, and you could always try weighted averages if that makes sense in your application.
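Here is a rough sketch of what modeling the components separately can reveal, again in plain Python with invented numbers. The values of `x`, `a`, `b`, `c`, and the `weights` are all hypothetical, and for simplicity the fits are on the raw scale rather than the log scale. On the raw scale, the least-squares slope for the total a + b + c is just the sum of the component slopes, so opposing trends in a and c partly cancel in y.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# The two units from the question have the same total but different mixes.
unit_1 = [2, 0.5, 1.5]
unit_2 = [4, 0, 0]
same_total = sum(unit_1) == sum(unit_2)

# Invented component rates for five units: a rises with x, b is flat, c falls.
x = [1, 2, 3, 4, 5]
components = {
    "a": [2.0, 2.9, 4.1, 5.0, 6.0],
    "b": [0.5, 0.4, 0.6, 0.5, 0.5],
    "c": [1.5, 1.2, 0.9, 0.6, 0.3],
}

# Separate models: one slope per component.
slopes = {name: fit_line(x, values)[1] for name, values in components.items()}

# Combined model: regress the total y = a + b + c on x.
total = [sum(vals) for vals in zip(*components.values())]
total_slope = fit_line(x, total)[1]

# Weighted average with hypothetical weights chosen by the analyst.
weights = {"a": 0.5, "b": 0.3, "c": 0.2}
weighted = [
    sum(weights[name] * components[name][i] for name in components)
    for i in range(len(x))
]
weighted_slope = fit_line(x, weighted)[1]

for name, s in slopes.items():
    print(f"slope for {name}: {s:+.3f}")
print(f"slope for total:    {total_slope:+.3f}")
print(f"slope for weighted: {weighted_slope:+.3f}")
```

The separate fits show the rising a and the falling c directly, while the total reports only their net effect. The weighted version is only as meaningful as the weights, which is why it has to make sense in your application.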