Variable ordering fallacy: why people continue to disagree

A few debates seem never to end: nature vs. nurture, ability vs. luck, society vs. personal responsibility. The fundamental problem in these discussions is that one group of people considers one of the causes more important than the other, and the other group disagrees. In this entry, I will attempt to explain this problem with my interaction analysis framework.

I have taken the “rodents” dataset. The cases are apartments in New York City; the covariates are the number of defects, the poverty score, and the race of the householder; the outcome is whether rodents were found in the building. The result of the analysis, in the form of an interaction graph, is as follows:

[Interaction graph for the rodents dataset: rodents3.png]

Defects are clearly by far the best predictor of rodents (13.2% of explained variation), followed by race (7.9%) and then by the poverty score (7.1%). What is important is that none of the covariates is explained away by the others. The links between covariates indicate the correction that is necessary because both covariates carry partly the same information about the outcome. In particular, should we predict rodents using poverty and race together, the actual amount of variation explained would be 7.1 + 7.9 - 3.0 = 12.0%.

The trouble is that -3.0 term. If race and poverty weren’t correlated, it would be zero (or positive). But as they are correlated, there is ambiguity about which is primary, race or poverty, in predicting the rodents. For example, one could say that the increased frequency of rodents among minorities is explained by poverty. With this interpretation, we would assign 7.1% of explained variation to poverty and 7.9 - 3.0 = 4.9% to race.
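To make the bookkeeping concrete, here is a minimal sketch of this kind of decomposition, under the assumption that “% of explained variation” means mutual information normalized by the entropy of the outcome. The data below are synthetic; none of the names or numbers refer to the actual dataset:

entropy <- function(x) {
  p <- table(x) / length(x)      # observed category frequencies
  -sum(p * log2(p))
}
# mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)
mutinf <- function(x, y) entropy(x) + entropy(y) - entropy(paste(x, y))

set.seed(1)
n    <- 10000
poor <- rbinom(n, 1, 0.4)                                 # synthetic poverty indicator
race <- ifelse(runif(n) < 0.6, poor, rbinom(n, 1, 0.4))   # deliberately correlated with poor
y    <- rbinom(n, 1, plogis(-1 + 0.8 * poor + 0.6 * race))

hy  <- entropy(y)
i1  <- 100 * mutinf(poor, y) / hy                # % of variation explained by poverty alone
i2  <- 100 * mutinf(race, y) / hy                # % explained by race alone
i12 <- 100 * mutinf(paste(poor, race), y) / hy   # % explained by both jointly

i1 + i2 - i12   # the overlap: the analogue of the -3.0 link in the graph

A positive overlap means the two covariates are partly redundant, which is the situation discussed here; a negative value would indicate synergy instead.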

On the other hand, we could say that minorities have cultural biases, an example of which might be keeping fewer pets, such as cats and dogs, that prey upon rodents. Such cultural biases could explain an increased likelihood of rodents, along with, say, racist landlords who refuse to fix cracks in an apartment of a householder of the wrong race. Poverty could itself be a consequence of these cultural biases (preferring one profession to another) or of race directly, whether in terms of innate ability, discrimination, or the “poverty trap”. With such an interpretation we would allocate 7.9% of explained variation to race, as a proxy for culture, and 7.1 - 3.0 = 4.1% to poverty.

Same data, same model, but two interpretations: because of the correlation between race and poverty, we do not know how to divide the 3% of shared information between the two variables. People will continue to disagree. Sometimes the dilemma can be resolved when one variable completely explains away the other, but that is not the case here. What to do?

If poverty and race were not correlated, this problem would not appear. So one way of remedying it would be a controlled experiment. The trouble is that one cannot change someone else’s race at random.

Another is the “shut up and calculate” approach: just fit a logistic regression and see what the coefficients are:


glm(formula = y ~ z.defects + z.poor + as.factor(race), family = binomial,
    data = nd)
                 coef.est coef.se
(Intercept)      -3.10     0.06
z.defects         1.38     0.04
z.poor            0.60     0.05
as.factor(race)2  1.07     0.06
as.factor(race)3  1.08     0.08
as.factor(race)4  1.34     0.07
as.factor(race)5  0.69     0.09
as.factor(race)6  0.85     0.45
as.factor(race)7  0.60     0.27
  n = 13931, k = 9
  residual deviance = 12427.1, null deviance = 15185.1 (difference = 2758.1)
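For reference, a fit like the one above might be produced along these lines. This is a sketch, assuming nd holds the dataset with columns y, defects, poor, and race (the column names are guesses); display comes from the arm package, whose compact output the table above resembles:

nd$z.defects <- as.vector(scale(nd$defects))   # standardize continuous covariates
nd$z.poor    <- as.vector(scale(nd$poor))
fit <- glm(y ~ z.defects + z.poor + as.factor(race),
           family = binomial, data = nd)
arm::display(fit)   # compact coefficient table, as shown above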

The regression gives specific values assigning importance to each covariate; here, race appears more important than poverty. The trouble is that regression coefficients are sometimes haphazard or even counterintuitive as measures of feature importance. Imagine that y = a*x1 + b*x2 + e and that x1 = x2: only the sum a + b is identified, so any choice of a and b with a + b = c will fit equally well.
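A toy example makes the point (a sketch with made-up numbers): with two identical copies of a predictor, the individual coefficients are not identified at all, and with nearly identical copies they become unstable while their sum stays put.

set.seed(2)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1                          # a perfect copy of x1
y  <- 2 * x1 + rnorm(n)           # true a + b = 2

coef(lm(y ~ x1 + x2))             # x2 comes back NA: only a + b is identified

x2 <- x1 + rnorm(n, sd = 0.01)    # nearly collinear instead of identical
coef(lm(y ~ x1 + x2))             # a and b swing wildly, but a + b stays near 2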

So, given the same data with correlated covariates, people will continue to disagree about how important the individual covariates are. Regression coefficient magnitude can serve as a tiebreaker, but anyone who denies the authority of the best-fitting linear model can question it. In many cases it is impossible to disentangle variables that always tend to stick together; trying to separate them would be artificial.