Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice

John Snow writes:

Just came across this paper [Human Decisions and Machine Predictions, by Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan] and I’m wondering if you’ve been following the debate/discussion around these criminal justice risk assessment tools.

I haven’t read it carefully or fully digested the details. On the surface, their general critique of the risk assessment tools seems reasonable but what caught my attention are the results of the simulation they report in the abstract:

Even accounting for these concerns, our results suggest potentially large welfare gains: a policy simulation shows crime can be reduced by up to 24.8% with no change in jailing rates, or jail populations can be reduced by 42.0% with no increase in crime rates.

Those numbers seem unrealistic in their size. I’d be curious to hear your take on this paper in the blog.

Ummm, I think that when they said “24.8%” and “42.0%,” they really meant 25% and 42%, as there’s no way they could possibly estimate such things to an accuracy of less than one percentage point. Actually there’s no way they could realistically estimate such things to an accuracy of 10 percentage points, but I won’t demand a further rounding to 20% and 40%.

In all seriousness, I do think it’s misleading for them to be presenting numbers such as “83.2%” and “36.2%” in their paper. The issue is not “sampling error”—they have a huge N—it’s that they’re using past data to make implicit inferences and recommendations for new cases in new places, and of course there’s going to be variation.

In any case, sure, I can only assume their numbers are unrealistic, as almost by definition they’re a best-case analysis, not because of overfitting but because they’re not foreseeing any . . . ummmm, any unforeseen problems. But they seem pretty clear on their assumptions: they explicitly label their numbers as coming from “a policy simulation” and they qualify that whole sentence with a “our results suggest.” I’m cool with that.

In our radon article, my colleagues and I wrote: “we estimate that if the recommended decision rule were applied to all houses in the United States, it would be possible to save the same number of lives as with the current official recommendations for about 40% less cost.” And that’s pretty similar. If we can claim a 40% cost savings under optimal policy, I don’t have a problem with these researchers claiming something similar. Yes, 40% is a lot, but if you have no constraints and can really make the (prospectively) optimal decisions, this could be the right number.

P.S. Parochially, I don’t see why the authors of this paper have to use the term “machine learning” for what I would call “statistical prediction.” For example, they contrast their regularized approach to logistic regression without seeming to recognize that logistic regression can itself be regularized: they write, “An important practical breakthrough with machine learning is that the data themselves can be used to decide the level of complexity to use,” but that’s not new: it’s a standard idea in hierarchical modeling and was already old news ten years ago when Jennifer and I wrote our book.
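
For example, here is a small sketch of that idea with simulated data, using sklearn’s cross-validated ridge-penalized logistic regression (my choice of tool, not anything from the paper), in which the data pick the amount of regularization:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Simulated data; the point is only that the penalty (that is, the model's
# effective complexity) is chosen by cross-validation rather than by hand.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
fit = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000).fit(X, y)
print("regularization strength chosen by the data:", fit.C_[0])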

On the other hand, it may well be that more people consider themselves users of “machine learning” than “statistical prediction,” so maybe I’m the one who should switch. As long as these researchers are using good methods, it’s not so important if we have similar methods under different names that could also solve these problems. They’re the ones who fit a model to this problem, and they deserve the credit for it.

No big deal either way as long as (a) we’re clear on what we’re assuming, what our algorithms are doing, and what data we’re using; and (b) we remember to adjust for bias and variance of measurements, nonrepresentative samples, selection bias, and all the other things we worry about when using data on a sample to draw inferences about a population.

15 thoughts on “Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice”

  1. we apply a machine learning algorithm—specifically, gradient-boosted decision trees—trained on defendant characteristics to predict crime risk

    What implementation of gradient boosting was used? What are the hyper-parameters (e.g., learning rate, max depth)? Etc., etc., to make it reproducible; a sketch of the kind of specification that would help is at the end of this comment. Even more so than usual, if this is going to be used by the justice system, the data and code should be publicly available in perpetuity.

    As is standard, we randomly partition our data into a training and hold-out set to protect against over-fitting. All results for the performance of the algorithm are measured in the hold-out set.

    They have date info; the hold-out should be the most recent data (e.g., predict 2013 using 2008–2012 data). This is closest to how the model will actually be used, and it is a cheap and simple improvement.

    Of the initial sample, 758,027 were subject to a pre-trial release decision and so relevant for our analysis. We then randomly select 203,338 cases to remain in a “lock box” to be used in a final draft of the paper; it is untouched for now. This leaves us with a working dataset for training and preliminary evaluation of our algorithm of 554,689 cases.

    Nice that they have a test set (“lockbox”); it is usual to overfit to the validation set by tuning hyper-parameters, doing feature engineering, etc. I guess they haven’t run the model on that yet, so we don’t know how it turned out.

    The algorithm is trained using the Bernoulli loss function:
    L(y_i, m(x_i)) = -[y_i*log(m(x_i)) + (1 - y_i)*log(1 - m(x_i))]

    This is usually called cross entropy, and a search didn’t turn up anyone else calling this equation by the name the authors use:
    https://en.wikipedia.org/wiki/Cross_entropy

    Finally, I kept looking for a simple confusion matrix (actual vs. predicted “committed crime after release”) but can’t find one.
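
    Here is the kind of specification I mean, with simulated data, illustrative hyper-parameter values, and sklearn’s gradient boosting standing in for whatever implementation the authors actually used; it writes out the hyper-parameters, holds out the most recent year, and prints the confusion matrix:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import confusion_matrix

    # Simulated stand-in data with a fake "year" column for a chronological split.
    X, y = make_classification(n_samples=20000, n_features=15, random_state=0)
    year = np.random.default_rng(0).integers(2008, 2014, size=len(y))

    # Train on 2008-2012 and hold out 2013, closer to how the model would be
    # deployed than a random split would be.
    train, hold = year < 2013, year == 2013

    # Hyper-parameters written out explicitly so the fit is reproducible;
    # the particular values here are illustrative, not the paper's.
    model = GradientBoostingClassifier(
        loss="log_loss",    # recent sklearn's name for the Bernoulli / cross-entropy loss quoted above
        learning_rate=0.05,
        n_estimators=500,
        max_depth=3,
        random_state=0,
    ).fit(X[train], y[train])

    # The simple actual-vs-predicted table I was looking for.
    print(confusion_matrix(y[hold], model.predict(X[hold])))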

      • Saying “Bernoulli loss function” is correct. In stats terms, they would say “we used maximum likelihood for a Bernoulli outcome.” For every distribution, there is a unique likelihood function (which we maximize) and, equivalently, a unique loss function (which we minimize). Bernoulli = cross-entropy, Normal = least squares.
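
        To spell out the equivalence: if y_i ~ Bernoulli(p_i) with p_i = m(x_i), the likelihood of a single observation is p_i^(y_i) * (1 - p_i)^(1 - y_i), so its negative log is exactly the quantity above, -[y_i*log(m(x_i)) + (1 - y_i)*log(1 - m(x_i))]. Maximizing the Bernoulli likelihood and minimizing the cross-entropy / log-loss are the same thing.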

        • Can you give a reference for the term “Bernoulli loss”? What area is calling it that? I am not saying the term is technically incorrect, just that it is not commonly used (as evidenced by searching for it). Perhaps it is a shibboleth: “these authors must come from a stats/economics background rather than data science or ML.”

      • If your likelihood is Bernoulli then surely it is ok to talk about the Bernoulli loss. That seems more appropriate than calling it Cross Entropy (which is more common in machine learning) — the information theory connection isn’t illuminating here.

        • I’m not saying there is anything technically wrong with the term “Bernoulli Loss”, just that searching for that term didn’t yield any explanation for the equation (thus making it a poor choice).

          I’m not sure of the history but I would guess that the term “cross entropy” is used because then we have an appropriate name for a single function/method that can deal with both the 2 outcome and 2+ outcome cases depending on the input. The “Bernoulli loss” is just a special case of the “cross entropy”. Besides cross entropy, you can also look up log-loss: https://www.kaggle.com/wiki/LogLoss

          But Bernoulli Loss does not appear to be widely used.

  2. Authors get big mileage from phrases like “machine learning” and little from “statistical prediction”. Author credibility in much of scientific publication is aesthetic, especially since many authors and reviewers are incapable of explaining exactly what analysis was done.

  3. Yes, “machine learning” is a hot topic in economics at the moment, so the authors are highlighting that to their advantage. Also, for better or for worse, economics convention is to report the center of the confidence interval, but not to report uncertainty except in tables. It can definitely be misleading.

    • I rather like the machine learning methods. I second the need for them to release the data/methodology, particularly if they intend for this to influence policy. Another practice that should be changed is the one you allude to – reporting the center of confidence intervals rather than the uncertainty. But in the case of many machine learning methods there is no confidence interval to report. This is an area that needs more work, in my opinion. How do you quantify the uncertainty in classification models that use these machine learning techniques? I am aware of one method – conformal prediction – but I have not yet figured out exactly what it does and whether I trust it. Predicting classifications is great, and confusion matrices (more generally, AUC and other measures) provide some measures of accuracy. But I would like to assess the uncertainty surrounding the classifications. In the case of the paper under discussion, I would think that is important if it is to be used for parole decisions.
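
      For anyone curious, here is a minimal sketch of the split conformal idea for a binary classifier, with simulated data, an arbitrary 10% error level, and sklearn’s gradient boosting as a stand-in model. Instead of a single label it returns a set of plausible classes, and a set containing both classes flags a case the model is genuinely uncertain about:

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import train_test_split

      # Simulated stand-in data, split into a training set and a calibration set.
      X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
      X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

      model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

      # Nonconformity score: one minus the predicted probability of the true class,
      # computed on calibration data the model never saw.
      cal_probs = model.predict_proba(X_cal)
      scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]

      # Threshold at the finite-sample-corrected (1 - alpha) quantile of the scores.
      alpha = 0.1
      n = len(scores)
      q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

      def prediction_set(x_new):
          # Keep every class whose nonconformity score is within the threshold.
          p = model.predict_proba(x_new.reshape(1, -1))[0]
          return [c for c in model.classes_ if 1.0 - p[c] <= q]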

  4. The paper is very interesting. And quite good IMO. (It can be found here for free: https://www.cs.cornell.edu/home/kleinber/w23180.pdf)

    Figure 2 on page 45 gives some good insight into what is driving the results. The release rate as a function of predicted crime risk declines linearly for low predicted crime risk, but then the function flattens out dramatically and asymptotes to a release rate of 50% for even the highest crime risk defendants. So, almost all defendants with very low crime risk (<10%) are released, but *all* high risk defendants are released 50% of the time.

    This seems to be key to understanding the magnitude of the improvements proposed by the paper. The paper assumes the optimal policy is a 0% release rate for the highest risk defendants and a 100% release rate for all other defendants. I’m guessing the cutoff would be at a predicted crime risk of about 50%. Figure 2 clearly shows how this optimal policy would dramatically improve results.

    Is such an improvement possible? Is it even close to possible? Why are judges releasing 50% of defendants with bright red warning signs that they are high-risk? Why are judges not releasing defendants with very low crime risk? I don’t know. I don’t even have a guess.
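
    A toy version of the threshold rule described above, with entirely made-up numbers, and assuming away the problem that we never observe outcomes for jailed defendants (which the paper spends a lot of effort on):

    import numpy as np

    rng = np.random.default_rng(0)
    risk = rng.uniform(0, 1, size=10000)   # hypothetical predicted crime risk
    crime = rng.binomial(1, risk)          # pretend the predictions are perfectly calibrated

    cutoff = 0.5                           # the roughly 50% cutoff guessed above
    release = risk < cutoff

    print("release rate:", release.mean())
    print("crime rate among those released:", crime[release].mean())
    print("crime rate if everyone were released:", crime.mean())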

  5. I’ve never understood your distaste for reporting decimal places. Sure, 32.808231487091845982304523450124159476987698762345 +- 10 is a general waste of space… but conversely, saying that the reported decimal places should represent certainty isn’t good for a variety of other reasons.

  6. In regression models, there is such a thing as a prediction interval (for an individual observation), which is always wider than the confidence interval (Intro Stats, Richard D. De Veaux, Chapter 27, Inferences for Regression). Since machine learning is a much more advanced form of regression, I think they should also have prediction and confidence intervals.
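
    For what it’s worth, in ordinary linear regression both intervals come out of standard software. A small sketch with simulated data and statsmodels: the mean_ci columns below are the confidence interval for the regression line, and the wider obs_ci columns are the prediction interval for a new observation.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 1 + 2 * x + rng.normal(size=200)

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Predict at a few new x values; summary_frame reports both intervals.
    x_new = np.array([-1.0, 0.0, 1.0])
    pred = fit.get_prediction(sm.add_constant(x_new))
    print(pred.summary_frame(alpha=0.05))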
