Rachael Meager writes:

We’re working on a policy analysis project. Last year we spoke about individual treatment effects, which is the direction we want to go in. At the time you suggested BART [Bayesian additive regression trees; these are not averages of tree models as are usually set up; rather, the key is that many little nonlinear tree models are being summed; in that sense, Bart is more like a nonparametric discrete version of a spline model. —AG]. But there are 2 drawbacks of using BART for this project. (1) BART predicts the outcome not the individual treatment effect – although those are obviously related and there has been some discussion of this in the econ literature. (2) It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly. We can back out the important individual predictors using the frequency of appearance in the branches, but BART (and Random Forests) don’t have the easy interpretation that Trees give.

Obviously it should be possible to fit Bayesian Trees if one can fit BART. So my questions to you are:

1. Is it kosher to fit BART and also fit a Tree separately? Is there a better way?

2. Our data has a hierarchical structure (villages, implementers, countries) and it looks like trees/BART don’t have any way to declare that structure. Do you know of a way to incorporate it? Any advice/cautions here?

My reply:

– I don’t understand this statement: “BART predicts the outcome not the individual treatment effect.” Bart does predict the outcome, but the individual treatment effect is just the outcome with treatment=1, minus the outcome with treatment=0. So you get this directly. At least, that’s what I took as the message of Jennifer Hill’s 2011 paper. So I don’t see why anything new needs to be invoked here.

– Your second point is that a complicated fitted model is hard to understand: “It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly.” I think you should do this using average predictive comparisons as in my paper with Pardoe. In that paper, we work with linear regressions and glms, but the exact same principle would work with Bart, I think. This might be of general interest so maybe it’s worth writing a paper on it.

– I would strongly *not* recommend “backing out the important individual predictors using the frequency of appearance in the branches.” The whole point of Bart, as I understand it, is that it is a continuous predictive model; it’s just using trees as a way to construct the nonparametric fit. In that way, Bart is like a spline: The particular functional form is a means to an end, just as in splines where what we care about is the final fitted curve, not the particular pieces used to put it together.

– I disagree that trees have an easy interpretation. I mean, sure, they seem easy to interpret, but in general they make no sense, so the apparent easy interpretation is just misleading.

– Jennifer and I have been talking about adding hierarchical structure to Bart. She might have already done it, in fact! Jennifer’s been involved in the development of a new R package that does Bart much faster and, I think, more generally, than the previously existing implementation.
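To make the first point above concrete, here is a minimal sketch of reading an individual treatment effect off an outcome model as the prediction with treatment=1 minus the prediction with treatment=0. This is not BART: the outcome model is just a table of cell means standing in for any fitted model that can predict, and the data and the names `fit_outcome_model` and `individual_effect` are made up for the illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_outcome_model(rows):
    """Stand-in outcome model (assumption): mean outcome per (age_group, treatment) cell."""
    cells = defaultdict(list)
    for age_group, treatment, outcome in rows:
        cells[(age_group, treatment)].append(outcome)
    cell_means = {key: mean(values) for key, values in cells.items()}

    def predict(age_group, treatment):
        return cell_means[(age_group, treatment)]

    return predict


def individual_effect(predict, age_group):
    """Predicted outcome under treatment minus predicted outcome under control."""
    return predict(age_group, 1) - predict(age_group, 0)


# Made-up data: (age_group, treatment, outcome)
rows = [
    ("young", 0, 10.0), ("young", 0, 12.0), ("young", 1, 14.0), ("young", 1, 16.0),
    ("old", 0, 20.0), ("old", 0, 22.0), ("old", 1, 21.0), ("old", 1, 23.0),
]

predict = fit_outcome_model(rows)
effects = {group: individual_effect(predict, group) for group in ("young", "old")}
print(effects)
```

With a real fit, `predict` would be replaced by the fitted model's prediction for the same covariates with the treatment indicator switched, ideally once per posterior draw so that the uncertainty carries through.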

In short, I suspect you can do everything you need to do with Bart already. But the multilevel modeling, there I’m not sure. One approach would be to switch to a nonparametric Bayesian model using Gaussian processes. This could be a good solution but it probably does not make so much sense here, given your existing investment in Bart. Instead I suggest an intermediate approach where you fit the model in Bart and then you fit a hierarchical linear model to the residuals to suck up some multilevel structure there.
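A rough sketch of the residual step, under strong simplifying assumptions: the residuals from the first-stage fit are already in hand, there is a single grouping level, and the within-group variance `sigma2` and between-group variance `tau2` are treated as known. A real multilevel model would estimate both variances; the function name `partial_pool` and the numbers are hypothetical.

```python
def partial_pool(residuals_by_group, sigma2, tau2):
    """Shrink each group's mean residual toward zero.

    sigma2: assumed within-group residual variance
    tau2:   assumed variance of the group effects
    """
    effects = {}
    for group, residuals in residuals_by_group.items():
        n = len(residuals)
        raw_mean = sum(residuals) / n
        # Groups with more data are shrunk less.
        weight = tau2 / (tau2 + sigma2 / n)
        effects[group] = weight * raw_mean
    return effects


# Made-up residuals (observed outcome minus first-stage prediction) by village.
residuals_by_village = {
    "village_a": [1.0, 3.0],   # mean 2.0 from 2 observations
    "village_b": [2.0] * 8,    # mean 2.0 from 8 observations
}

effects = partial_pool(residuals_by_village, sigma2=4.0, tau2=1.0)
print(effects)
```

Both villages have the same raw mean residual, but the smaller village is pulled further toward zero, which is the partial pooling one would want the second-stage model to supply.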

GP, like Bart, can be autotuned. To some extent this is still a research project, but we’ve been making a lot of progress on this recently. So I don’t think this tuning issue is an inherent problem with GP’s; rather, it’s more of a problem with our current state of knowledge, but I think it’s a problem that we’re resolving.

When Jennifer says she doesn’t trust the estimate of the individual treatment effect, I think she’s saying that (a) such an estimate will have a large standard error, and (b) it will be highly model dependent. Inference for an average treatment effect can be stable, even if inferences for individual treatment effects are not.
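A small simulation can illustrate the contrast between noisy individual estimates and a stable average. It is a sketch only: the "estimates" are the true effects plus independent unit-variance errors, which is an assumption made for the example. Real model errors are typically correlated across individuals, so this shows the favorable case.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 1000

# Made-up true individual effects, varying around 1.0.
true_effects = [1.0 + 0.5 * rng.gauss(0, 1) for _ in range(n)]

# Hypothetical model estimates: truth plus independent unit-variance error.
estimates = [t + rng.gauss(0, 1) for t in true_effects]

# Typical size of the error in a single individual's estimate.
individual_rmse = sqrt(
    sum((e - t) ** 2 for e, t in zip(estimates, true_effects)) / n
)

# Error in the estimated average effect over the same people.
average_error = abs(sum(estimates) / n - sum(true_effects) / n)

print(f"individual RMSE: {individual_rmse:.3f}")
print(f"error in average: {average_error:.3f}")
```

The individual errors are about as large as the effects themselves, while the error in the average is much smaller because the individual errors largely cancel.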

I really don’t like the idea of counting the number of times a variable is in a tree, as a measure of importance. There are lots of problems here, most obviously that counting doesn’t give any sense of the magnitude of the prediction. More fundamentally, all variables go into a prediction, and the fact that a variable is included in one tree and not another . . . that isn’t really relevant. Again, it would be like trying to understand a spline by looking at individual components; the only purpose of the individual components is to combine to make that total prediction.
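Here is a toy illustration of why counting can mislead, using a hand-built sum of stumps rather than a fitted BART model. The stumps, data, and function names are all hypothetical. The magnitude summary is a stripped-down relative of the average predictive comparison: it moves one input between two fixed values while holding the others at their observed values, and omits the weighting over pairs of observations and the averaging over posterior draws used in the paper with Pardoe.

```python
from collections import Counter

# Each stump: (variable, threshold, value_if_below, value_if_at_or_above).
# "a" is split on five times but barely moves the prediction;
# "b" is split on once and moves it a lot.
stumps = [("a", t, 0.0, 0.01) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
stumps.append(("b", 0.5, 0.0, 5.0))


def predict(row):
    """Prediction is the sum of all stump outputs."""
    return sum(hi if row[var] >= thr else lo for var, thr, lo, hi in stumps)


def split_counts():
    """How many stumps split on each variable."""
    return Counter(var for var, _, _, _ in stumps)


def average_predictive_change(data, var, low, high):
    """Average change in prediction when `var` moves from low to high,
    holding the other inputs at their observed values."""
    diffs = [
        predict(dict(row, **{var: high})) - predict(dict(row, **{var: low}))
        for row in data
    ]
    return sum(diffs) / len(diffs)


data = [{"a": 0.2, "b": 0.1}, {"a": 0.6, "b": 0.8}, {"a": 0.95, "b": 0.4}]

counts = split_counts()
change_a = average_predictive_change(data, "a", 0.0, 1.0)
change_b = average_predictive_change(data, "b", 0.0, 1.0)
print(counts, change_a, change_b)
```

The count ranks "a" far above "b", while the change in prediction ranks them the other way around, by a factor of about a hundred.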

Why do trees make no sense? It depends on context. In social science, there are occasional hard bounds (for example, attitudes on health care in the U.S. could change pretty sharply around age 65) but in general we don’t expect to see such things. It makes sense for the underlying relationships to be more smooth and continuous, except in special cases where there happen to be real-life discontinuities (and in those cases we’d probably include the discontinuity directly in our model, for example by including an “age greater than 65” indicator). Again, Bart uses trees under the hood in the same way that splines use basis functions: as a tool for producing a smooth prediction surface.
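The difference between one tree and a sum of many small trees can be seen in a toy example. This is not BART (there is no prior and no sampling); it only compares a single stump with a sum of 100 stumps as approximations to a smooth target, and every name here is made up for the sketch.

```python
def make_stump(threshold, step):
    """A one-split tree: 0 below the threshold, `step` at or above it."""
    return lambda x: step if x >= threshold else 0.0


def sum_of_stumps(n_stumps):
    """Sum of n stumps with evenly spaced thresholds, each adding 1/n."""
    stumps = [
        make_stump((k + 1) / n_stumps, 1.0 / n_stumps) for k in range(n_stumps)
    ]
    return lambda x: sum(stump(x) for stump in stumps)


def max_error(model, target, grid):
    """Largest absolute gap between model and target on the grid."""
    return max(abs(model(x) - target(x)) for x in grid)


def target(x):
    """A smooth relationship to approximate (here simply y = x)."""
    return x


grid = [i / 1000 for i in range(1001)]

single_tree_error = max_error(sum_of_stumps(1), target, grid)
many_trees_error = max_error(sum_of_stumps(100), target, grid)
print(single_tree_error, many_trees_error)
```

The single stump misses by almost the full range of the target, while the sum of 100 stumps stays within about 0.01 of it. The summed fit is still a staircase, but one fine enough to behave like a continuous curve.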

**P.S.** More from Jennifer here.

Carlos Carvalho gave a very nice talk about this at Columbia when I was there. He argued, if I remember correctly, that it’s all about where in the model you put the BART. It was an extension of the idea in this paper https://arxiv.org/pdf/1602.02176.pdf.

Hi Daniel, thanks for thinking of our work. Indeed, we’ve been working on mitigating the complications of using BART (and other nonlinear regression methods) for causal inference, building off of Jennifer’s work while trying to address some of the concerns reflected in the OP and our earlier paper that you link to. The paper is one simulation study from being posted, but I’ll get in touch with Rachael Meager by email.

This is the same thing we were discussing the other day[1]:

https://arxiv.org/pdf/1602.02176.pdf

The problem is for many (most, nearly all?) use cases the model is not thought to be even close to correctly specified. This doesn’t matter for predictive skill, but for estimating coefficients or feature importance it can lead to very wrong conclusions. As mentioned in the earlier thread, “omitted variable bias” seems to be a big issue Andrew has with a lot of these papers. How can sensitivity analysis deal with that? That info is just missing.

[1] http://andrewgelman.com/2017/05/15/needed-good-research-hint-not-just-much-weight-given-small-samples-tendency-publish-positive-results-not-negative-results-perhaps-unconscious-bias/#comment-488709

What I’ve seen done is to sum up the “gain” (a metric of the improvement) for each step that includes a given variable. Eg http://stackoverflow.com/questions/33654479/how-is-xgboost-quality-calculated
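As a sketch of that bookkeeping, assuming the fitted trees have already been flattened into a list of (feature, gain) records, one per split (the records below are made up, not output from any library):

```python
from collections import defaultdict

# Hypothetical split records from a boosted-tree fit: (feature, gain).
splits = [
    ("age", 12.0),
    ("sex", 3.0),
    ("age", 5.0),
    ("treatment", 8.0),
    ("age", 1.0),
]


def total_gain(split_records):
    """Sum the gain over every split that uses each feature."""
    totals = defaultdict(float)
    for feature, gain in split_records:
        totals[feature] += gain
    return dict(totals)


def normalized(totals):
    """Rescale the totals so they sum to 1."""
    grand_total = sum(totals.values())
    return {feature: gain / grand_total for feature, gain in totals.items()}


gain = total_gain(splits)
share = normalized(gain)
print(gain)
print(share)
```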

Really such stats mean nothing if collinearity, etc are present so I wouldn’t put much stock into it other than as a sanity check though.

It’s typically good practice to do a spot check to evaluate predictors for multicollinearity before using them to train. Depending on the source of the multicollinearity among the predictors, though, methods like XGBoost will do a better job than random forest by themselves.

The XGBoost tutorial touches on this http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html#measure-feature-importance.

So while the Gain is muddled some by the multicollinearity the better predictor, in this case age handily “wins” relative to the discretized versions. So it may not be perfect but the multicollinearity doesn’t render Gain a useless measure.

Ok, but that is a simple example where information is just being straight up lost by discretizing. Instead let’s pick something else that could be correlated with age: use of OTC painkillers.

Say, for example, that the older you are the less likely you are to also take OTC painkillers along with this treatment. Also, let’s say this is due to cultural reasons. There is no necessary connection with age, ie in 50 years the pattern could be reversed and the OTC painkillers are once again out of fashion in the young (the attitude is to just deal with it), but now the generation that was taking them while young is still doing it.

In that case, assuming the painkillers are at least somewhat effective, there would be more room for improvement in those taking them (ie the baseline pain levels were lower). This would lead you to be more likely to observe “marked” improvement in older patients. So add in the painkiller-use feature and now age will drop below treatment and sex features in importance.

I don’t think it is necessarily useless though. You can compare relative importance within the set of features fed into the model, and maybe see a feature with high importance that shouldn’t be there (eg patient ID number).

Well that’s not just multicollinearity: you’re offering a far more complex causal structure. Which to be fair, I think is way more interesting and way more of a problem and needs to be studied. I’ve done some stuff before using DAGs to generate “nasty” causal structures and tested different ML strategies against them. Granted in those tests I wasn’t looking at variable importance specifically so it might be worth revisiting some of the cases to see how Gain and other variable importance measures perform.

I did just come across this recent arXiv https://arxiv.org/abs/1701.05306 in that vein that might be worth skimming regarding confounder bias … and now that I started searching there’s also this article from 07 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25

Thanks, but let me give this feedback. I think all concerns about regularization, bias, etc will end up being dealt with as they were before statistics/math/logic came to replace the scientific method rather than serve only as tools.

It will be with empirical heuristics based on “old school” scientific schemes like waiting for a priori predictions and independent replication. Some examples of this happening: for CARTs, use an early stopping heuristic and grid search to tune all the hyperparameters; similarly for NNs, use dropout.

If you could figure out a better way and demonstrate the new method convincingly, I will be the first to adopt. But until then you need to find a way to test on something closer to real world implementations. For an example of how, see here: http://andrewgelman.com/2016/12/05/best-algorithm-ever/

I’m not sure I understand the point of the link. DAGs are just directed acyclic graphs. They are useful for describing causal structures generally and provide a language for describing causal structures that give rise to things like Simpson’s paradox (http://dagitty.net/learn/simpson/). They have nothing to do with whatever the “best algorithm EVER !!!!!!!!” is. Or is there a specific comment thread that is relevant on that page?
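For readers who have not seen it, a small numeric sketch of Simpson’s paradox. The counts are illustrative only, and the assumed causal structure is that severity affects both which treatment is given and the outcome:

```python
# Illustrative counts: (successes, trials) by severity stratum and treatment arm.
counts = {
    "mild": {"A": (81, 87), "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, trials):
    """Success rate for one cell."""
    return successes / trials


def pooled_rate(arm):
    """Success rate for an arm after collapsing over severity."""
    successes = sum(counts[stratum][arm][0] for stratum in counts)
    trials = sum(counts[stratum][arm][1] for stratum in counts)
    return successes / trials


# Does A beat B within each stratum?
a_better_within = {
    stratum: rate(*counts[stratum]["A"]) > rate(*counts[stratum]["B"])
    for stratum in counts
}

pooled_a = pooled_rate("A")
pooled_b = pooled_rate("B")
print(a_better_within, round(pooled_a, 3), round(pooled_b, 3))
```

A does better than B within each severity stratum, yet B looks better once the strata are pooled, because A was given mostly to the severe cases.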

I’ve used DAGs to create causal structures that I believe could reflect real world situations like the one you describe. And then just like any other Monte Carlo simulation approach, I use models like XGBoost, bartMachine, MARS, Gaussian process smooths, etc to see how they perform.

If you want an interpretable tree with a different conditional average effect in each leaf, this paper by Athey and Imbens may be of interest (nothing about hierarchical structure though): https://arxiv.org/pdf/1504.01132.pdf.

For visualizing the tree or perhaps more appropriately visualizing the dependencies uncovered by a tree, I’ve seen mixed results with the following approaches.

https://cran.r-project.org/web/packages/ICEbox/index.html

http://forestfloor.dk/

https://cran.r-project.org/web/packages/ExplainPrediction/index.html

https://cran.r-project.org/web/packages/visreg/index.html

So a quick reply that 1) Anyone interested in BART should be using Vince Dorie’s dbarts package in R, 2) Vince and I are going to be working on the MLM version of BART (as well as some other extensions for causal inference) over the next few months so stay tuned, 3) I’ve yet to see any “auto-tuned” version of GP that works anywhere close to as well as BART with its default prior (though this is potentially a cool research topic and one that Andrew and I have funding to look at), and 4) re trusting individual effects, another way of saying that such estimates are highly model dependent is that I think average treatment effect estimation often benefits from bias cancellation … that advantage evaporates when estimating at the individual level. Thanks for the post Andrew!

> Anyone interested in BART should be using Vince Dorie’s dbarts package in R,

How does it compare to bartMachine? dbarts doesn’t seem to handle missingness, for example.

I’d be curious to hear about this as well. I’ve found bartMachine to have a much nicer feature set than the other packages, although it uses a slightly simplified version of the algorithm.

This recent paper: proceedings.mlr.press/v54/mueller17a.html

uses Gaussian Processes to estimate individual & population treatment effects.

I found it interesting that they are able to efficiently find the best individual treatment through gradient ascent since the predicted outcomes depend smoothly on the covariates in GP unlike in BART.

Also, for an interesting view on estimating individual treatment effects with neural networks, see:

http://www.cs.nyu.edu/~shalit/cfr_cam2.pdf