## Using output from a fitted machine learning algorithm as a predictor in a statistical model

Fred Gruber writes:

I attended your talk at Harvard where, in answer to the question of how to deal with complex models (trees, neural networks, etc.), you mentioned the idea of taking the output of these models and fitting a multilevel regression model. Is there a paper you could refer me to where I can read about this idea in more detail? At work I deal with ensembles of Bayesian networks in a high-dimensional setting and I’m always looking for ways to improve the understanding of the final models.

I replied that I know of no papers on this; it would be a good thing for someone to write up. In the two examples I was thinking of (from two different fields), machine learning models were used to predict a binary outcome; they gave predictions on a 0-1 scale. We took the logits of these predictions to get continuous scores; call these “z.” Then we ran logistic regressions on the data, using, as predictors, z and some other things. For example,
Pr(y_i = 1) = invlogit(a_j[i] + b*z_i) [that’s a varying-intercept model]
Pr(y_i = 1) = invlogit(a_j[i] + b_j[i]*z_i) [varying intercepts and slopes]
Pr(y_i = 1) = invlogit(a_j[i] + b_j[i]*z_i + X*gamma) [adding some new predictors]
You’d expect the coefficients b to be close to 1 in this model, but adding the varying intercepts/slopes and other structures can help pick up patterns that were missed in the machine learning model, and can be helpful in expanding the predictions, generalizing to new settings.
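
As a rough illustration of the first (varying-intercept) model above, here is a minimal sketch in plain Python on simulated data. Everything in it is an assumption made for the example, not the analysis from either of the two applications: the simulated "machine learning prediction" is just invlogit(1.5*x) and ignores group membership, the names `simulate` and `fit_varying_intercepts` are hypothetical, and the group-level scale `sigma_a` is fixed rather than estimated, so this is a penalized (MAP-style) fit standing in for a real multilevel model, which you would more likely fit in Stan or similar.

```python
import math
import random


def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))


def logit(p, eps=1e-6):
    # Clip so that predictions of exactly 0 or 1 do not give infinite scores.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def simulate(n=1000, seed=1):
    """Fake data: the black-box prediction uses x but misses the group effect."""
    rng = random.Random(seed)
    true_offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed group effects
    y, z, group = [], [], []
    for i in range(n):
        j = i % len(true_offsets)
        x = rng.gauss(0.0, 1.0)
        p_ml = invlogit(1.5 * x)  # stand-in for the machine learning output
        p_true = invlogit(1.5 * x + true_offsets[j])
        y.append(1 if rng.random() < p_true else 0)
        z.append(logit(p_ml))  # continuous score z
        group.append(j)
    return y, z, group, true_offsets


def fit_varying_intercepts(y, z, group, n_groups, sigma_a=1.0, lr=1.0, n_iter=800):
    """Gradient ascent for Pr(y_i = 1) = invlogit(a_j[i] + b*z_i),
    with a normal(0, sigma_a) penalty on each a_j (sigma_a fixed by assumption)."""
    a = [0.0] * n_groups
    b = 1.0
    n = len(y)
    for _ in range(n_iter):
        grad_a = [-a_j / sigma_a**2 for a_j in a]  # gradient of the penalty
        grad_b = 0.0
        for y_i, z_i, j in zip(y, z, group):
            resid = y_i - invlogit(a[j] + b * z_i)
            grad_a[j] += resid
            grad_b += resid * z_i
        a = [a_j + lr * g / n for a_j, g in zip(a, grad_a)]
        b += lr * grad_b / n
    return a, b


y, z, group, true_offsets = simulate()
a_hat, b_hat = fit_varying_intercepts(y, z, group, n_groups=len(true_offsets))
print("slope on z:", round(b_hat, 2))
print("group intercepts:", [round(v, 2) for v in a_hat])
```

In this setup the slope on z should come out near 1, and the estimated intercepts should recover the group pattern that the simulated black box ignored, which is the sort of missed structure the extra layer is meant to pick up.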

Gruber followed up:

It is an interesting approach. My initial thought was different. I have seen some approaches that bring some interpretability to complex models by learning the predictions of the complex model, as in

Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. “Model Compression.” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 535–41. ACM, 2006. http://dl.acm.org/citation.cfm?id=1150464.

Ba, Lei Jimmy, and Rich Caruana. “Do Deep Nets Really Need to Be Deep?” CoRR abs/1312.6184 (2013). http://arxiv.org/abs/1312.6184.

And more recently
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” 1135–44. ACM Press, 2016. doi:10.1145/2939672.2939778.

That’s all fine, it’s good to understand a model. I was thinking of a different question, which was taking predictions from a model and trying to do more with them by taking advantage of other information that had not been used in the original fit.

1. Matt says:

Andrew Crane-Droesch at USDA has been doing some of this:

2. Keith O'Rourke says:

When fitting additional covariates, the work on pre-validation and inference in microarrays by Tibshirani and Efron might be worth being aware of. Besides adding covariates, it used cross-validation in an attempt to get a better sense of the contribution of those additional covariates.

A problem arises with the likelihoods (see last two paragraphs here http://andrewgelman.com/2017/09/20/using-black-box-machine-learning-predictions-inputs-bayesian-analysis/#comment-566482) but apparently (I just noticed now) they were able to permute their way around that http://statweb.stanford.edu/~tibs/ftp/PreValidationArticle.pdf

For anyone wishing to see the gory details – see page 38 here https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf

3. Tom Passin says:

Neural network results can also be used to set weights in a fuzzy logic controller. The fuzzy network can be computed with much less computing power than a neural network, so it’s good for production use.

4. zbicyclist says:

Ungated version of the Ribeiro paper: http://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

5. Wayne says:

This sounds similar to Dark Knowledge, where a large (deep) neural network is trained on data, then a smaller neural network is trained on its (non-binary) outputs.

https://arxiv.org/abs/1312.6184

6. Xi'an says:

In our paper with Jean-Michel Marin and co-authors, “ABC random forests for Bayesian parameter inference,” now recommended by Michael Blum on PCI Evol Biol (the peer community review platform in evolutionary biology), we follow this path of building summaries in random trees before/while running an ABC analysis of the data.

7. Mike says:

This approach has been around for some time, as in using factor scores (PC scores) as input to a cluster analysis; however, the robustness of the new model (e.g. the optimal number of clusters) depends entirely on the validity of the original model (e.g. factor analysis/PCA). In general, this approach works quite nicely, especially on highly correlated variables (to capture the profile of the variables in each cluster and simplify the interpretation of the clusters), but it can also be misleading, because the derived scores/variables may not capture all the information in the original variables that were used in the original model. Thus, we need to be careful when using derived variables for further analysis.