Average predictive comparisons and the All Else Equal fallacy

Annie Wang writes:

I’m a law student (and longtime reader of the blog), and I’m writing to flag a variant of the “All Else Equal” fallacy in ProPublica’s article on the COMPAS Risk Recidivism Algorithm. The article analyzes how statistical risk assessments, which are used in sentencing and bail hearings, are racially biased. (Although this article came out a while ago, it’s been recently linked to in this and this NYT op-ed.)

ProPublica posts a github repo with the data and replication code. I wanted to flag this part of the analysis:

• The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.
• The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.

The basic method is to build a logistic regression model with the score as outcome and race and a few other demographic variables as the independent variables. (You can also reasonably argue that a logistic regression without any interaction terms is not the best way to analyze this data, but for the moment, I’ll just stick within the authors’ approach.)

Here’s the problem: to arrive at the numbers above, they compare an `intercept-only` model vs. `intercept + African-American Indicator` model. (See Cell 16-17 of the original analysis)

But since it’s a logistic regression, the marginal effect of being African-American isn’t captured by the coefficient alone. Instead, they calculate the marginal effect of being African-American with all the other factors set to 0, i.e., it’s a comparison among White and African-American males, between ages 25 and 45, with zero priors, with zero recidivism within the last two years, and with a particular severity of crime.

Fewer than 5% of the entire dataset meets these specifications in the first analysis and it’s only 7% in the second, so the statistical result reported is really only applicable for a small portion of the population.
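To make the dependence on the baseline concrete: a logistic model holds the odds ratio fixed, but the ratio of probabilities it implies changes with the comparison group's starting probability. The following is a minimal sketch, assuming the odds ratio of about 1.61 mentioned in the comments (the "61% increase in odds") and a hypothetical grid of baseline probabilities. The helper name `risk_ratio` is illustrative and does not come from the ProPublica code.

```python
def risk_ratio(odds_ratio, p0):
    """Ratio of probabilities implied by a fixed odds ratio.

    p0 is the probability for the reference group at some setting of the
    other covariates; the comparison group has odds multiplied by odds_ratio.
    """
    return odds_ratio / (1.0 - p0 + p0 * odds_ratio)


ODDS_RATIO = 1.61  # assumption: taken from the "61% increase in odds" comment

# Hypothetical baseline probabilities, chosen only to show the trend.
for p0 in (0.05, 0.18, 0.40, 0.70):
    print(f"baseline p0 = {p0:.2f} -> risk ratio = {risk_ratio(ODDS_RATIO, p0):.2f}")
```

A baseline probability near 0.18 gives a ratio of about 1.45, which would be consistent with the reported 45 percent figure if the zeroed-out reference group sits in that neighborhood. Higher baselines pull the ratio toward 1.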

If you calculate marginal effects over the entire dataset, taking into account men and women, all ages, and the full distribution of prior crimes, severity, and recidivism, those numbers are more modest:

• The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 20 percent (not 45 percent) more likely to be assigned higher risk scores than white defendants.
• The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 33 percent (not 77 percent) more likely to be assigned higher risk scores than white defendants.
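The difference between the two kinds of comparison can be sketched on simulated data. Everything below is an assumption made for illustration: the covariates are synthetic (not the COMPAS data), the coefficients are hypothetical, and "averaging over the dataset" is taken to mean flipping the race indicator for every row and comparing mean predicted probabilities, which may not match the linked notebook exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic covariates (NOT the COMPAS data).
priors = rng.poisson(3.0, size=n)          # number of prior crimes
female = rng.binomial(1, 0.2, size=n)      # 1 = female, 0 = male
under_25 = rng.binomial(1, 0.2, size=n)    # 1 = under 25, 0 = reference age band
recid = rng.binomial(1, 0.45, size=n)      # 1 = re-offended within two years

# Hypothetical logistic coefficients, chosen only for illustration.
b0, b_race, b_priors, b_female, b_young, b_recid = -1.5, 0.5, 0.25, 0.2, 1.3, 0.7


def predicted_prob(race, priors, female, under_25, recid):
    """Predicted probability of a higher risk score under the assumed model."""
    eta = (b0 + b_race * race + b_priors * priors + b_female * female
           + b_young * under_25 + b_recid * recid)
    return 1.0 / (1.0 + np.exp(-eta))


# Comparison at the baseline: every other covariate set to 0.
baseline_rr = predicted_prob(1, 0, 0, 0, 0) / predicted_prob(0, 0, 0, 0, 0)

# Comparison averaged over the dataset: flip race for each row, keep its covariates.
p_if_black = predicted_prob(1, priors, female, under_25, recid)
p_if_white = predicted_prob(0, priors, female, under_25, recid)
average_rr = p_if_black.mean() / p_if_white.mean()

print(f"risk ratio at baseline:      {baseline_rr:.2f}")
print(f"risk ratio averaged overall: {average_rr:.2f}")
```

In this setup most rows start from a higher probability than the all-zeros baseline, so the averaged ratio comes out smaller than the baseline ratio. That is the same direction as the correction described above.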

This doesn’t change the piece’s overall argument, but some of these claims seem a little misleading in light of the actual comparison being made. My full analysis here (written for an undergraduate who’s taken a first course on statistics): https://github.com/anniejw6/compas-analysis/blob/master/01-regression-correction.ipynb

Curious to get your take here. I emailed the authors of this article, who responded with “Very interesting and informative. We were advised that our way of reporting is standard practice.”

My reply: Without getting into any of the specifics (not because I disagree with the above argument but just because I don’t have the energy to try to evaluate the details), I’ll say that this reminds me a lot of my paper with Iain Pardoe on average predictive comparisons for models with nonlinearity, interactions, and variance components. The key point is that predictive comparisons depend in general on the values of the other variables in the model, and if you want some sort of average number, you have to think a bit about what to average over. I hadn’t thought of the connection to the All Else Equal fallacy but that’s an interesting point.
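One way to write this down, as a simplified reading for a binary input $u$ with the remaining inputs collected in $v$ (the paper's own definition is more general and is not reproduced here; the symbols $\delta$ and $\bar{\delta}$ are notation introduced for this sketch):

```latex
\delta(v) = \Pr(y = 1 \mid u = 1, v) - \Pr(y = 1 \mid u = 0, v),
\qquad
\bar{\delta} = \frac{1}{n} \sum_{i=1}^{n} \delta(v_i)
```

In a logistic regression $\delta(v)$ changes with $v$, so reporting it at a single point such as $v = 0$ and reporting the average $\bar{\delta}$ over the observed $v_i$ can give quite different numbers.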

1. Rodney Sparapani says:

This is very interesting and important. But, if you want a color-blind assessment, then why have race in the model at all?

• I haven’t read it, but I’ve seen this kind of analysis done before. The point usually is that the original algorithm doesn’t have race, but then you analyze the predictions of the algorithm with your own model, and your own model uses race, and find that for your model, race is predictive of *what the other algorithm does*. So the other algorithm is “racist” even though it doesn’t explicitly make decisions based on race…

Not sure if that’s what’s going on here, but it’s a common enough story line. This kind of analysis has been done on credit risk scores, and recidivism, and housing subsidy eligibility or whatever similar type of stuff.

Of course, if that’s what’s going on, this means that some combination of characteristics of the data that is actually put into the algorithm itself is different between black and white criminals. In the linked article they say that *the convicts themselves fill out the COMPAS survey* that is used to score them. So it’s entirely plausible that the two groups answer questions about themselves differently; their baselines and their expectations of what is “normal” may vary a lot. On average, the experience of being Black in the US is nothing like the experience of being White in terms of how people treat you, what your family situation is, education levels of your family members, your interaction with the criminal justice system, etc. Asking the criminals to fill out a survey seems unlikely to lead to unbiased self-analysis.

2. Carlos Ungil says:

She makes a very interesting point, but I wonder what the analysis is supposed to show in the first place.

> Our analysis (…) found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk.

If someone is flagged as high risk but doesn’t commit another offense in the following two years, does that mean that the flag was “incorrect”? Does “high risk” mean “certitude of recidivism”?

> In forecasting who would re-offend, the algorithm correctly predicted recidivism for black and white defendants at roughly the same rate (59 percent for white defendants, and 63 percent for black defendants) but made mistakes in very different ways. It misclassifies the white and black defendants differently when examined over a two-year follow-up period.

It seems that to equilibrate the outcomes the algorithm should raise the flagging threshold for whites or lower it for blacks. That would bring the 59 and 63 closer together. This would also solve the problem with “low” risk ratings, closing the gap between the 29% of recidivism in whites and the 35% in blacks. But this would increase the “bias” they are talking about.

Looking at the contingency table they show, it’s true that blacks are more likely to be classified high risk (59%) than whites (35%). For the sake of the argument, let’s say that the classification is correct and high risk defendants have a 60% probability of committing another offense while for those classified as low risk the probability is 30%. Out of 10 people classified as high risk we expect 6 to commit another offense in the next two years, but that doesn’t mean that 4 of them were misclassified.

It’s kind of unavoidable that, given the higher recidivism rate for blacks, more blacks will be classified as high risk. And if high risk corresponds to 60% probability of recidivism it’s a mathematical fact that there will be more “false positives” in the high prevalence group than in the low prevalence group.
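The arithmetic behind this argument can be checked directly. The sketch below uses only the numbers in the comment: 59% and 35% of each group classified high risk, and the for-the-sake-of-argument recidivism probabilities of 60% (high risk) and 30% (low risk), assumed identical for both groups. The function name `error_rates` is illustrative.

```python
def error_rates(share_high, p_high=0.60, p_low=0.30):
    """Return the ("false positive" rate, "false negative" rate) for a group.

    share_high: fraction of the group classified as high risk.
    p_high, p_low: recidivism probability within each class, assumed correct
    and assumed to be the same in every group.
    """
    share_low = 1.0 - share_high

    # Among people who do NOT re-offend: fraction that had been flagged high risk.
    no_recid_high = share_high * (1.0 - p_high)
    no_recid_low = share_low * (1.0 - p_low)
    false_positive_rate = no_recid_high / (no_recid_high + no_recid_low)

    # Among people who DO re-offend: fraction that had been classified low risk.
    recid_high = share_high * p_high
    recid_low = share_low * p_low
    false_negative_rate = recid_low / (recid_high + recid_low)

    return false_positive_rate, false_negative_rate


for group, share in (("black", 0.59), ("white", 0.35)):
    fpr, fnr = error_rates(share)
    print(f"{group}: 'false positive' rate = {fpr:.1%}, 'false negative' rate = {fnr:.1%}")
```

Under these assumptions the "false positive" rate is about 45% for the group with 59% flagged and about 24% for the group with 35% flagged, even though the classification treats both groups identically by construction. The "false negative" rates go the other way (about 26% versus 48%).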

This is related to the recent discussion on the concept of fairness and whether outcome tests “fail to accurately detect discrimination”: https://andrewgelman.com/2018/03/24/problem-infra-marginality-outcome-tests-discrimination/

• yyw says:

Suppose the classification of low versus high risk is perfect. Then the probability of being classified as high risk conditioned on future recidivism increases monotonically with the true proportion of high risk individuals in a population, unless the high risk category has a 100% recidivism rate. This likely explains their binary logistic regression models.

In their Cox model with interaction terms, they focus on black:medium and black:high to suggest the risk model was unfavorable to blacks, but they ignore the relevant main effect term (black). A different coding of the model will likely lead to different interpretations. The authors in fact stated that the black recidivism rate was higher for every risk category.

• Curious says:

It appears risk is being conflated with causation. What is being described as a mathematical fact is actually a crude coding of behavior into categories. What we know is that equifinality, the reality that there are multiple pathways to the same behavior, and multifinality, the reality that a common behavior can lead to multiple outcomes, are features of virtually all human behavior. Thus, both actual prior crimes and wrongful convictions result in higher risk scores. If wrongful convictions are more likely in one group than another, then this is relevant on both the predictor side and the outcome side, as it will overestimate the relationship between the two.

• Carlos Ungil says:

The more I think about this issue the less defensible I find their assumption that high risk individuals who “survived” were misclassified, especially when they define “high risk” as the union of High (decile scores: 8-10) and Medium (5-7).

Saying that “the algorithm is more likely to misclassify a black defendant as higher risk than a white defendant” or “was more likely to wrongly predict that white people would not commit additional crimes if released compared to black defendants” doesn’t seem appropriate for a score that was a gradual propensity measure which has been forced into a binary variable.

It would make a little more sense if only the highest score was taken as a strong prediction of recidivism (but it’s not supposed to be a perfect prediction anyway). They mention that looking at High scores only (decile scores: 8-10) doesn’t change their conclusion:

> We also tested whether restricting our definition of high risk to include only COMPAS’s high score, rather than including both medium and high scores, changed the results of our analysis. In that scenario, black defendants were three times as likely as white defendants to be falsely rated at high risk (16 percent vs. 5 percent).

78% of blacks and 75% of whites classified as High risk (deciles 8-10) commit another offense in the following two years. Again, one can say that the algorithm performs better for blacks than for whites. But again they find “bias” because 27% of blacks are classified as High risk while only 11% of whites are classified as High risk, resulting in more so-called “false positives” for blacks than for whites.

3. Ian Fellows says:

I think it is important to note that this conversation is possible due to their publishing of the data and analysis in a public forum, so kudos to ProPublica. Their response was also appropriate, and they seem open to your correspondent’s points.

I would say that standard practice would be to either report the odds ratio (61% increase in odds compared to whites), or the risk ratio with means/modes imputed for other covariates (instead of the baseline traits), or the marginal risk ratio (20% higher probability than whites).

4. Elin says:

Also, the fact that the data are really censored survival data adds another bit of complexity to the prediction issue.

5. Matt says:

Sounds like another case where an LPM (linear probability model) would have done just fine and avoided this confusion altogether. The marginal gain from using logistic regression is almost always dominated by the cost when you care about marginal effects and not just prediction.
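A minimal sketch of that point, on synthetic data (not the COMPAS data) with hypothetical coefficients: in a linear probability model fitted by ordinary least squares, flipping the indicator changes the fitted probability by the same amount for every row, so there is no baseline to choose. Note that the quantity is a difference in probabilities (percentage points), not a ratio like the figures discussed in the post.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Synthetic data, for illustration only.
group = rng.binomial(1, 0.5, size=n)       # binary indicator of interest
priors = rng.poisson(2.0, size=n)          # one other covariate
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * group + 0.3 * priors)))
y = rng.binomial(1, p_true).astype(float)  # 0/1 outcome

# Linear probability model: least squares of the 0/1 outcome on the covariates.
X = np.column_stack([np.ones(n), group, priors])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Flip the indicator for every row while keeping the other covariate as observed.
X1 = X.copy()
X1[:, 1] = 1.0
X0 = X.copy()
X0[:, 1] = 0.0
diff = X1 @ beta - X0 @ beta

print(f"coefficient on the indicator: {beta[1]:.3f}")
print(f"min/max change in fitted probability: {diff.min():.3f} / {diff.max():.3f}")
```

The trade-off is that fitted values from a linear probability model are not constrained to lie between 0 and 1.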