Daniel Kapitan writes:

We are in the process of writing a paper on the outcome of cataract surgery. A (very rough!) draft can be found here, to provide you with some context: https://www.overleaf.com/read/wvnwzjmrffmw.

Using standard classification methods (Python sklearn, with synthetic oversampling to address the class imbalance), we are able to predict a poor outcome with sufficient sensitivity (> 60%) and specificity (>95%) to be of practical use at our clinics as a clinical decision support tool. As we are writing up our findings and methodology, we have an interesting debate on how to interpret what the most relevant features (i.e. patient characteristics) are.

My colleagues, who are trained as epidemiologists/doctors, have been taught to do standard univariate testing, using a threshold p-value to identify statistically significant features.

Those of us who come from machine learning (including myself) are more inclined to just feed all the data into an algorithm (we’re comparing logistic regression and random forest), and then evaluate feature importance a posteriori.

The results from the two approaches are substantially different. Comparing the first approach (using sklearn SelectKBest) and the second (using sklearn Random Forest), for example, the variable ‘age’ ends up somewhere halfway in the ranking (p-value 0.005 with f_classif) vs. in the top 6 (feature importance from the random forest).
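For readers who want to reproduce this kind of disagreement, here is a minimal sketch of the two rankings being compared, on synthetic data (the cataract dataset is not public, so sklearn's make_classification stands in):

```python
# Sketch of the two feature-ranking approaches on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)

# Approach 1: univariate F-test (the score SelectKBest uses by default)
F, pvals = f_classif(X, y)
rank_univariate = np.argsort(pvals)            # lowest p-value first

# Approach 2: impurity-based importance from a random forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rank_forest = np.argsort(rf.feature_importances_)[::-1]  # most important first

print("univariate ranking:", rank_univariate)
print("forest ranking:    ", rank_forest)
```

The two orderings will typically agree on the strongest features and disagree in the middle, which is the pattern described above for ‘age’.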

As a regular reader of your blog, I am aware of the ongoing debate regarding p-values, reproducible science etc. Although I get the gist of it, my understanding of statistics is too limited to convincingly argue for or against the two approaches. Googling the subject, I come across some (partial) answers:

https://stats.stackexchange.com/questions/291210/is-it-wrong-to-choose-features-based-on-p-value.

I would appreciate it if you could provide some feedback and/or suggestions on how to address this question. It would help us gain confidence in applying machine learning in day-to-day clinical practice.

My reply:

First, I think it would help to define what you mean by “most relevant features” in a predictive model. That is, before deciding on your procedure to *estimate* relevance, to *declare based on the data* what are the most relevant features, first figure out how you would *define* relevance. As Rubin puts it: What would you do if you had all the data?

I don’t mind looking at classification error etc., but I think it’s hard to make any progress at all here without some idea of your goals.

Why do you want to evaluate the importance of predictors in your model?

You might have a ready answer to this question, and that’s fine—it’s not supposed to be a trick. Once we better understand the goals, it might be easier to move to questions of estimation and inference.

Kapitan replied:

My aim in understanding the importance of predictors is to support clinical reasoning. Ideally, the results of the predictor should be ‘understandable’ such that the surgeon can explain why a patient is classified as a high-risk patient. I.e., I would like to combine clinical reasoning (inference, as evidenced in ‘classical’ clinical studies) with the observed patterns (correlation). Perhaps this is a tall order, but I think it is worth trying. This is one of the reasons why I prefer tree-based algorithms (rather than neural networks): they are less of a black box.

To give a specific example: patients with multiple ocular co-morbidities are expected to have a high risk of poor outcome. Various clinical studies have tried to ‘prove’ this, but never in relation to patterns (i.e. feature importance) obtained from machine learning. Now, the current model tells us that co-morbidities are not that important (relative to the other features).

Another example: laterality ends up as the second most important feature in the random forest model. Looking at the data, it may be the case that left eyes have a higher risk of poor outcome. Talking to doctors, this could be explained by the fact that, given most doctors are right-handed, operating on a left eye is slightly more complex. But looking at the data naively (histograms on subpopulations), the difference does not seem significant. Laterality ends up in the bottom range with univariate testing.

I understand that the underlying statistics are different (linear vs non-linear) and intuitively I tend to ‘believe’ the results from random forest more. What I’m looking for is sound arguments and reasoning if and why this is indeed the case.

My reply:

To start with, you should forget about statistical significance and start thinking about uncertainty. For example, if your estimated coefficient is 200 with a standard error of 300, and on a scale where 200 is a big effect, then all you can say is that you’re uncertain: maybe it’s a good predictor in the population, maybe not.

Next, try to answer questions as directly as possible. For example, “patients with multiple ocular co-morbidities are expected to have high risk of poor outcome.” To start with, look at the data. Look at the average outcome as a function of the number of ocular co-morbidities. It should be possible to look at this directly. Here’s another example: “it may be the case that left-eyes have a higher risk of poor outcome.” Can you look at this directly? A statement such as “Laterality ends up in the bottom range with univariate testing,” does not seem interesting to me; it’s an indirect question framed in statistical terms (“the bottom range,” “univariate testing”), and I think it’s better to try to ask the question more directly.
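In code, the direct look suggested here can be as simple as a group-by (a sketch with hypothetical column names and toy values; the real data would come from the clinic's records):

```python
# Average outcome as a function of the number of ocular co-morbidities.
# Column names and values are illustrative, not the actual dataset.
import pandas as pd

df = pd.DataFrame({
    "n_comorbidities": [0, 0, 1, 1, 2, 2, 3, 3],
    "poor_outcome":    [0, 0, 0, 1, 0, 1, 1, 1],
})

# fraction of poor outcomes within each co-morbidity count
rate = df.groupby("n_comorbidities")["poor_outcome"].mean()
print(rate)
```

The same one-liner, grouping on laterality instead, answers the left-eye question directly.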

Another tip is that different questions can require different analyses. Instead of fitting one model and trying to tell a story with each coefficient, list your questions one at a time and try to answer each one using the data. Kinda like Bill James: he didn’t throw all his baseball data into a single analysis and then sit there reading off conclusions; no, he looked at his questions one at a time.

Epidemiologists in general love to have interpretable models (at the level of the coefficients), which isn’t a bad thing, but it’s a different goal than “creating the best predictive model.” It seems like these two goals lead to two very different modelling strategies pretty often. I wonder how much of a hit to the sensitivity/specificity of a model we should be willing to accept in the name of allowing the surgeon to “explain why a patient is classified as a high risk patient”…

I don’t think it is a good thing in the simple case, just highly misleading. How many of these people really understand that the coefficient values are conditional on the model being “correctly specified”? E.g.:

https://andrewgelman.com/2017/01/04/30805/

And I’m sure if you include some new features that age is being used as a proxy for (e.g. blood pressure, healing rate, etc.), then it will drop, or if you remove some it will rise. I don’t think this is meaningful either.

For making predictions a statistical/ml model is fine. For interpreting coefficients you will need to derive the model from some theory you think is approximately correct.

Does anyone really believe these statistical models are a near approximation to the process that generated the data? This is really annoying to explain to people, btw: “Everyone else is interpreting their coefficients without a care in the world, why can’t I? Something is wrong here.” It would be so much easier to just go along with the scam…

I’m not sure what you’re saying with the example about age. If our goal is prediction and age is the best proxy we have for those things, we use age, right? Keeping in mind that age is acting through pathways that we haven’t directly measured, and that our model would be better if we did directly measure those things. Maybe I’m misunderstanding the original post, but I thought that they weren’t really going for causal interpretations of their coefficients, but rather to be able to explain what patient characteristics result in a poor outcome being predicted (which should be fine to do?).

I’m saying that age may be a really “important” predictor of negative outcome in model A, but not important or even predictive of positive outcome in a different model B.

Doesn’t this seem problematic:

– You choose model A based on cross validation (or whatever), and tell the patient “You are high risk because of your advanced age”.

– I choose model B based on hold out performance (or whatever), and tell the patient “You are low risk because of your advanced age”.

For the example I used there would need to be two different patients. I didn’t mean for the models to make vastly different predictions, just use the features differently.

When there are a number of variables that are related, the relative influence of any particular variable is likely to vary considerably according to the presence or absence of the other related variables. While it may seem good enough to know that age (e.g.) is associated with higher risk, it may well be that BMI, cholesterol, or other factors show similar associations, depending on the particular model that is run. It is hard to imagine that causative interpretations can be avoided. The physician can try to say that we have observed an association between age and increased risk – without saying age is causing the increased risk – but they alternatively might say it is BMI or cholesterol that is associated with higher risk. All they are really saying is that someone “like you” has been observed to have higher risk. The natural question will be “why?” Is it my age, my BMI, my cholesterol, etc.? The answer “it is none of these, they are merely indicators for some unknown factor(s)” doesn’t seem like it will accomplish much.

It seems to me that (after training your model) you could build an interactive tool that the doctor could just put in the patient’s observed characteristics and the model spits out the predicted probability of outcome X. Then the doctor can say, “Because of the patient’s age, weight, comorbidities, etc., you are at risk for outcome X.” If a doctor told me that because of my age, I was at risk for something, I would go to another doctor because I know that there is more to health risk than just age. And given “importance” doesn’t necessarily mean the variable with the highest correlation with the outcome (importance might just be due to the complex interactions one variable has with other variables), it seems a bit strange to even use that as a criterion when discussing how a patient’s characteristics might affect the outcome.

Kapitan writes, “Using standard classification methods (Python sklearn, with synthetic oversampling to address the class imbalance), we are able to predict a poor outcome with sufficient sensitivity (> 60%) and specificity (>95%) to be of practical use at our clinics as a clinical decision support tool.” I’m not sure whether that classification method is the logistic or the random forest, but given he says >60% for sensitivity, it makes me think that sensitivity is around 60%, which doesn’t seem anywhere close to being of “practical use”.

I’m not deeply involved in the literature on machine learning, but I have not seen this proposed (though again, I don’t think I can be the only one who has thought of it): create a fake dataset that holds everything constant except for the variable you want to analyze. Run your model on the fake dataset, for which only the variable of interest changes in value. The output will be the predicted probability of the outcome. Then graph the values of your variable of interest on the x with the predicted probability on the y, and you can see how changes in that variable affect the predicted probability of the outcome. If you want to restrict to various subpopulations, then you can vary the values of those variables (that’s some pretty good alliteration), and then create different plots for different subpopulations (e.g. male vs. female).
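This proposal is essentially what sklearn ships as a partial dependence plot; a hand-rolled sketch on synthetic data (the feature index and grid size are arbitrary choices):

```python
# Hold every feature at its mean, sweep only the feature of interest,
# and read off predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

feature = 2                                       # variable of interest
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

X_fake = np.tile(X.mean(axis=0), (len(grid), 1))  # everything else constant
X_fake[:, feature] = grid                         # only this column varies
probs = model.predict_proba(X_fake)[:, 1]
# plotting grid (x) against probs (y) shows how the feature moves the prediction
```

Subpopulation plots fall out of the same trick: fix the held-constant columns at the subgroup's typical values instead of the overall mean.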

Like this, where column three holds the feature of interest (I’m sure the formatting will be mangled…)?

     [,1]  [,2]   [,3]   [,4]  [,5]
[1,] -1.26 1.763  0.439 -1.066 0.862
[2,] -1.26 1.763  0.053 -1.066 0.862
[3,] -1.26 1.763  0.980 -1.066 0.862
[4,] -1.26 1.763 -0.395 -1.066 0.862

The problem is that the effect of feature three is also going to depend on the (constant) values in the other columns, so how do you choose those?

I agree that the problem is choosing which variables to hold constant. This is where the researchers’ knowledge would come in. They should know which variables are of interest and what categories might be most relevant to not hold constant when examining that variable of interest. Let’s say you wanted to see the effect of diabetes on medical expenditure. There are lots of other variables that obviously affect medical expenditure. But many policy analysts are interested in outcomes by race and poverty status. And age seems like a pretty big factor in medical expenditure, too. So you could present your results holding everything constant but diabetes, race, poverty status, and age. You’d have separate bar graphs (with predicted medical expenditure on the y), say by age (maybe in 3 categories) and race (4 categories, or whatever floats your boat), and then you present two groups of bars for poverty status (one group for diabetes = yes, the other for no). So you have 12 bar graphs, each with 2 groups of poverty-status bars. But like I said, this is just a suggestion when the alternative for presenting machine learning results is a list of variables ordered by some importance measure.

I don’t think I’m as funny as abuuu thinks I am, but it’s not clear why you couldn’t present results from machine learning algorithms like this. Like I mention in my reply to his comment, this is basically what regression output is.

No, I mean the effect of age could be “positive” for rich people in California but “negative” for poor people in Texas. There is no single value that is independent of the other features.

Why plot “predicted expenditures” here? Just plot the actual data on expenditures segmented by these groups, the data that the predictions are based on… The whole point of these ML algos is to give you answers at an individual level, reaggregating afterwards makes no sense.

“No, I mean the effect of age could be “positive” for rich people in California but “negative” for poor people in Texas. There is no single value that is independent of the other features.” I get what you meant. So if you were interested in how age and geographic region contributed to final outcome, then you would vary those features while holding the others constant.

‘The whole point of these ML algos is to give you answers at an individual level, reaggregating afterwards makes no sense.’ Well… if you want an individual-level prediction, there really is no point in presenting results. So let’s say you learn that age is important (as defined by the importance measure the sklearn package uses). So what? If you just care about individual-level predictions, then just plug and chug. But if you want to explain something to a patient (which the researcher seems to want to do), or if you are doing some kind of policy analysis, then you might want to aggregate up a bit. In some cases you don’t want to aggregate (say, with Google ads); in other cases (say, in explaining something to a lay person, or if you want to know whether your HR software is going to get you into trouble with the EEOC) you might want to aggregate up.

‘Why plot “predicted expenditures” here?’ Because if you just plotted the data you already have, what have you learned? You could have done that without any analysis at all, and presumably you would have already done that. By plotting predicted values, you are controlling for (holding constant) all those other factors.

Basically it seems that you could use the method I describe as giving something akin to an interpretable coefficient (obviously not the same but something like it) when you take differences in predicted outcomes. So if you just wanted to know the impact of sex on the outcome, the only thing you vary in your fake dataset is sex, and the difference in the predicted outcome is something akin to a regression coefficient.

I really don’t know why holding all else constant is something controversial. Obviously, there are limitations and caveats, but this is standard practice for regression analysis. If Andrew Gelman wanted to tell us the effect of income on vote choice, then he would hold a lot of other variables constant to give us that answer.

Ok, perhaps we need something more concrete so I played around a bit and made this: https://pastebin.com/9wsMcfWv

It generates fake data that looks like:

  weight state age income
1 187.38    IL  54 88.070
2 174.42    NJ  76 40.363
3 183.63    DE  46 58.060
4 166.62    MN  70 41.555
5 165.14    PA  51 56.808
6 169.94    NV  56 65.522

There is a binary outcome class (0-1) that is a function of the four variables plus some noise. E.g., the outcome has a 95% chance of being one if:

weight < 170 &
state %in% state.abb[1:20] &
age > 50 &
income < 70

Using xgboost I got AUC ~0.78, which corresponds to a false positive rate of ~0.35 and a false negative rate of ~0.2. Here is a plot of the results:

https://image.ibb.co/coihs8/classifier.png

Can you show what you mean with the dataset generated by that script?

I don’t see how any insight can be gained by holding some columns constant and plugging numbers into the model, since the model is using *all* the columns for each prediction. E.g., it doesn’t make sense to say “you are classified as high risk due to your age”, because a person of the same age could be classified as low risk if they came from a different state or had a different income, or whatever.

Ok, this seems to be what you are suggesting to do. I ran this code at the end of what I posted earlier: https://pastebin.com/NmSxfK5Q

Here is what the probability of “class 1” looks like by age for the first 25 different combinations of weight, state, and income I generated. The x axis is age and the y axis is Pr(class 1). The red line is the same in each panel; it’s the overall average:

https://image.ibb.co/et0vUo/byage.png

We see sometimes the probability increases with age, other times not. The effect of age depends on the other variables.

I don’t have access to a computer at the moment, but let’s say instead this was just a regression where the estimated equation is risk = b1*weight + b2*income + b3*age + state effects. How would you interpret any of those coefficients? The betas are the change in risk given a one-unit change in the x variable, *holding all other x variables constant*.

Now let’s say you are interested in how age affects risk independent of the other variables. For convenience, do it for Illinois. So fake_data has the values for weight and income at their medians and state held constant at IL. Let your age variable range between 40 and 60 (1 standard deviation on either side of the mean). Get predicted probabilities: predict(bst, fake_data). Then graph the predicted probabilities on the y with age on the x. That gives you the relationship between age and risk in Illinois. If you want, you could make it a line graph and put 50 lines on it, one for each state. If there isn’t much difference between the lines, then while state might be important in building the model, it might not be important as a determinant of risk (though in this example it is by construction).
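A sketch of this procedure in Python, with a sklearn gradient-boosting model standing in for the R/xgboost script from the pastebin (the columns mimic the fake data shown earlier; states are coded 0–49 for brevity, with 12 arbitrarily playing the role of IL):

```python
# Sweep age while holding weight and income at their medians and state fixed.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "weight": rng.normal(175, 10, n),
    "state":  rng.integers(0, 50, n),   # states coded 0..49 for brevity
    "age":    rng.normal(50, 10, n),
    "income": rng.normal(60, 15, n),
})
# toy outcome rule, loosely echoing the one in the generated data above
y = ((df.weight < 170) & (df.age > 50)).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(df, y)

# fake data: weight and income at their medians, state fixed, age swept 40..60
fake = pd.DataFrame({
    "weight": df.weight.median(),
    "state":  12,                       # stand-in for "IL"
    "age":    np.arange(40, 61),
    "income": df.income.median(),
})
p_by_age = model.predict_proba(fake)[:, 1]   # graph against fake.age
```

Repeating the sweep for each state code and overlaying the curves gives the 50-line plot described above.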

If I get a chance I’ll toy around with it and post later.

>The betas are the change in risk given a one unit change in the x variable, *holding all other x variables constant*.

I hear this a lot, and it’s misleading. A one-unit change in x can bring you well outside the region in which your model is applicable. For example, suppose x is income divided by $1,000,000/yr; one unit is $1M, which, if you have a model for regular everyday people, is well outside all of your data… I’m just making this up to make a point; there are other more reasonable examples.

The better statement is that beta is the partial derivative with respect to x, and so beta * dx is approximately the change in risk for a “small” change dx.

If x is our income/($1M/yr), then a small change in x of 0.001 is $1000/yr, which is a small fraction of the range over which we expect normal incomes to vary.

Jfa:

You write, “If Andrew Gelman wanted to tell us the effect of income on vote choice . . .”

I’ve written a lot on the correlation of income and vote choice, and on how income predicts vote choice, but not on the *effect* of income on vote choice. The work I’ve done compares people of different incomes. The *effect* of income has to do with changes in income within people; that’s an entirely different thing.

Andrew, you are correct. I was a little loose with my language, but presumably you identify the correlation by holding other things constant.

Daniel, yes you are correct, but my main point in all of this discussion is the idea of holding things constant is a (widespread) way of interpreting the data (or at least getting a grasp on what one’s model is actually saying).

Sure, understanding the numerical value of partial derivatives is helpful, but I think it’s best to understand what you are doing and how the value can be legitimately used. I suspect many social science researchers have never taken a multivariable calculus class, and they probably should; it would help them understand multivariate stats models as well as Stan’s HMC fitting issues.

@Daniel: Good points.

Agreed

…assuming the model includes all relevant variables and no irrelevant ones, in general it is specified correctly.

Was this a joke? I’m sorry, but seriously?

This is the same as just running a univariate regression (or, for whatever model you are using, just including 1 variable). What’s the point of generating the fake data? The variable you aren’t holding constant is still correlated in the true DGP with all those things you’ve artificially held constant… so plotting the predicted outcome as a function of your variable of interest is just nonsense; not interpretable.

No, this wasn’t a joke. I’m not a comedian. Many people want to know the risk of X if someone is male or female, black or white, etc. I really don’t see how training a random forest to build your model then feeding fake data to see how certain variables affect the outcome is anywhere close to running a univariate regression. Of course the variables you are holding constant are part of the true DGP, but that is the case in any regression analysis.

Regressions tend to give relatively easily interpretable results on how individual variables affect the y (have you ever seen regression lines plotted? researchers do it all the time… you know what those results are: a plot of the outcome as a function of the variable of interest). Machine learning algorithms tend not to give easily interpretable results. You can use the fake data to essentially get your regression line out of the algorithm. I don’t see how that is anything like running a univariate regression or how it is any different from plotting a regression line from any simple regression. Maybe you think that plotting regression lines to display results is completely bogus because the variable being held constant “is still correlated in the true DGP with all those things you’ve artificially held constant” and that those regression plots are “not interpretable”. That’s a perfectly fine position to hold. But I really don’t see how doing the same thing with the model built using random forest is so out of the realm of possibility.

I was as confused as abuuu. I suspect I’m misreading your advice. If you create a fake dataset where the data values for some predictors are all the same, how are you supposed to regress the response on them? There has to be variability in your predictors.

Here’s how I interpreted your suggestion:

x1 = rnorm(30)
x2 = rep(1, 30)
y = x1 + rnorm(30)
summary(lm(y ~ x1))
summary(lm(y ~ x1 + x2))

The 2nd model is no different from the 1st, except for the error message.

You’re not regressing the response on the fake data. You have already built your model. ML algorithms tend not to have interpretable output. Once you build your model, you put your fake data into it. Then it spits out the predicted values for that data.

If you are interested in, say, how sex affects the outcome, create a dataset with 2 observations where all the variables have the same value (maybe held at the mean, median, or mode of the observed distribution) except for the sex variable (for which one observation will be male, the other female). If the output of the model is the probability of a particular outcome, your model will spit out 2 predicted probabilities. The difference in the predicted probabilities could be interpreted as the marginal effect of being a woman (if male takes the value of 0).

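A minimal sketch of this two-row trick, with a logistic regression standing in for the already-trained model and made-up features (age, BMI, sex):

```python
# Two-row fake dataset: everything at the median except the sex indicator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))              # columns: age, bmi, sex
X[:, 2] = rng.integers(0, 2, 500)          # sex coded 0 (male) / 1 (female)
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

base = np.median(X, axis=0)                # hold age and bmi at their medians
male, female = base.copy(), base.copy()
male[2], female[2] = 0, 1
p = model.predict_proba(np.vstack([male, female]))[:, 1]
marginal_effect = p[1] - p[0]              # "effect" of being female
```

As noted elsewhere in the thread, this difference is evaluated at one point (the medians), and for a nonlinear model it can change sign at other points.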

This is a kind of numerical partial derivative evaluated at some point. The relevance of that partial derivative at that point may or may not be much, it depends a lot on the model and the choice of evaluation point, and the accuracy of the numerical partial, which depends on the size of the delta you put in.

Not sure what’s going on, but 60% can be of huge practical use, depending on the scenario. Suppose for example you have a surgery that works for 90% of people. Now suppose, given your patient characteristics, the output of the algorithm is a yes or no to “will this person have serious complications”. Suppose that 60% of the people who get complications are detected by the screening (yes), p(yes | complications) = 0.60, and 95% of the people who have no complications are correctly identified by the screening (no), p(no | no complications) = 0.95.

p(no complications | without screening) = 0.9 by assumption, but

p(no complications | screening = NO) = p(NO | no complications) p(no complications) / p(screening = NO)

= 0.95 * 0.9 / (0.95 * 0.9 + 0.4 * 0.1) = 0.955

so now your previous 10% complication rate has been cut to a 4.5% complication rate.
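The arithmetic above can be checked directly:

```python
# Screening arithmetic: sensitivity 0.60, specificity 0.95, baseline
# complication rate 0.10.
sens, spec = 0.60, 0.95
p_comp = 0.10

# P(screen says NO) = P(NO | no comp) P(no comp) + P(NO | comp) P(comp)
p_no = spec * (1 - p_comp) + (1 - sens) * p_comp

# P(no complications | screen says NO), by Bayes' rule
p_no_comp_given_no = spec * (1 - p_comp) / p_no
print(round(p_no_comp_given_no, 3))       # → 0.955
print(round(1 - p_no_comp_given_no, 3))   # → 0.045, i.e. the 4.5% rate
```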

I’m not a big fan of this calculation. Bayesian decision theory is more appropriate here: choose surgery or not to maximize the expected outcome for each patient… considering what *each patient* values (maybe the complications from surgery are still better than what they’re dealing with today… in that case you should ALWAYS do the surgery, for example).

But deciding whether a thing is of “practical use” depending entirely on some kind of sensitivity and specificity numbers *in isolation* is basically always wrong.

“But deciding whether a thing is of “practical use” depending entirely on some kind of sensitivity and specificity numbers *in isolation* is basically always wrong.”

I completely agree. You want to know what the cost of not going through with the procedure might be, along with many other things. I might have been too hasty in saying it wasn’t clinically relevant, but it doesn’t seem like the author is thinking about excluding any patient from screening (i.e. it seems he wants to use the tool to give a prediction to everyone), so I don’t see how your calculations are relevant. With 60% sensitivity, 40% of the people who will have complications are missed by the prediction (1 – sensitivity). If you think that being told there will be complications will keep people from having a procedure, then (depending on the number of people who are predicted to have a complication) several people who should have gotten the procedure will forgo it. There are a lot of assumptions that go into that previous sentence, but my main point is that if something has a false negative rate of 40%, that might limit its clinical practicality.

The block quote ends too soon

The question of “which effect is more important?” comes up over and over again. I guess in a certain sense it’s reasonable to argue that, say, diet has a greater effect on the risk of diabetes than color preference. More precisely, preferring the color blue over the color green may increase the risk of diabetes by a tiny amount, while drinking 6 gallons of Mr. Pibb a day may increase the risk by a large amount. However, who ever said that “green vs. blue” has any interpretable equivalency to “no Mr. Pibb vs. 6 gallons of Mr. Pibb”?

> Another tip is that different questions can require different analyses. Instead of fitting one model and trying to tell a story with each coefficient, list your questions one at a time and try to answer each one using the data. Kinda like Bill James: he didn’t throw all his baseball data into a single analysis and then sit there reading off conclusions; no, he looked at his questions one at a time.

I thought your whole thing was “put everything into a full probability model”. Could you explain how these two positions go together?

“I thought your whole thing was “put everything into a full probability model”.”

My impression is that this is Kapitan’s goal, but that Andrew is saying that this is not a good goal in this circumstance.

From my perspective (as someone who has cataracts but has not had cataract surgery), Andrew’s proposal makes much more sense.

Speaking as a patient, I would like to see a good database of patient characteristics and outcomes so that a particular (out-of-sample) patient’s characteristics can be plugged in, and then an estimate of the patient’s risk obtained just using patients with those characteristics. (I.e., don’t develop a “full model” and then plug my characteristics into the model; instead, calculate the estimate of my risk from the subsample with my characteristics.)

Also, “number of comorbidities” is not useful for practical purposes, since different comorbidities (or combinations thereof) may have different risks.

Also, my experience (including that of friends) is that doctors don’t say “now you need cataract surgery”. Instead, they tell you when they first observe cataracts; they might say then that you don’t need surgery yet; and they say that the decision is up to you as to when the cataracts are causing you enough difficulty that you think the surgery is worthwhile.

Also (based on friends’ and relatives’ experiences), eye doctors don’t seem to tell you in advance that sometimes you may need laser adjustments after the cataract surgery — perhaps just for fine tuning, or perhaps later if you start developing macular degeneration later (or possibly for other reasons I haven’t yet heard about).

Let’s say you have a bunch of variables, and you want to estimate the causal effect of each one on a given outcome. For each variable, it will be appropriate to adjust for a different subset of the other variables. Therefore, you cannot estimate the causal effects of all the variables at once by fitting one regression model. What you can do, and what I feel confident saying Andrew would recommend, is fit a *hierarchical* model positing that the individual causal effects estimated by separate regressions (or maybe by methods other than simple regression) come from some common distribution. This regularizes the effect estimates so that you do not need to worry that effects that pop out were only due to multiple comparisons.

If you’re not estimating causal effects and just want to know what predicts best conditional on “everything else”, I’m skeptical that you’ve really thought out how your substantive aims map onto your statistical analysis since the results depend heavily on what “everything else” comprises and yet I’d bet not much thought went into defining “everything else”.

The LOCO method described in this paper may be of interest:

https://arxiv.org/abs/1604.04173

> Another tip is that different questions can require different analyses. Instead of fitting one model and trying to tell a story with each coefficient, list your questions one at a time and try to answer each one using the data.

Great quote.

I have found that a large share of end users’ problems with, or misunderstandings of, statistical tools is due to confusion about what the fundamental question should have been. Too often the analysts start playing with the data and the tools they know, rather than with the questions.

Note that if the test you’re using is the Wilcoxon rank-sum, the p-value is just a monotonic transformation of the area under the ROC curve (since the number of cases and controls is fixed across tests). In that case, ranking predictors in terms of p-values is the same as ranking them in terms of AUC—that is, (one measure of) their marginal discrimination. Whether that’s a good idea depends on whether the AUC is capturing the predictive value you care about, and whether you care only about their standalone value, as opposed to their value in combination with other predictors.
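To make that equivalence concrete: the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum statistic) divided by the product of the group sizes is exactly the AUC of the raw predictor, so with the group sizes fixed, ranking by p-value and ranking by AUC coincide. A quick check on simulated data:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.repeat([0, 1], [50, 50])       # 50 controls, 50 cases
x = rng.normal(loc=y, scale=1.0)      # predictor shifted upward for cases

u, p = mannwhitneyu(x[y == 1], x[y == 0], alternative="greater")
auc = roc_auc_score(y, x)

# U / (n_cases * n_controls) is exactly the AUC of the raw predictor
assert np.isclose(u / (50 * 50), auc)
```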

Ram:

You can rank however you want. It’s not clear why ranking is a good idea in the first place. I think it’s better to go back to first principles, and from there I doubt there’s any reason to be ranking anything in these sorts of problems.

In many problems I work on, we have a long, long list of candidates from which we want to select a promising subset for further study. Ranking is one way to decide which to include in the subset and which to drop. What you base the ranking on depends on what you’re after, of course, but I see plenty of applications for ranking where it’s not clear a more complicated exercise would add much value in practice.

Ram,

I want to push back against this idea just a bit.

What do you believe are the unstated assumptions about the data generating process of the variables when one chooses to use simple ranking? When you say that a more complicated exercise would not add value, what assumptions are you making about the decision rules and the data at hand?

I suppose I’m not assuming anything in particular about the DGP. I’m simply trying to reduce a large set of candidate variables to a smaller set of follow-up variables, where what I want in the small set is to preserve most of the predictive value in the larger set while keeping the set small. It’s possible that some variables don’t have much predictive value marginally but have a good deal conditionally (this is analogous to “unfaithfulness” in the causal inference context), but absent this, screening variables in terms of their marginal predictive value seems like a procedure that will more or less give me what I’m looking for. I don’t mean to say that there are no DGPs where a more complicated approach would do better in practice, just that it isn’t usually clear what we’re missing by doing things this way. If your goal is to build a big, well-specified probability model, I agree this is not a good way to do that, but that isn’t what I’m trying to do in these types of problems.
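Concretely, the kind of marginal screening described here might look like the following sketch (the choice of |AUC − 0.5| as the score, and all variable names and the synthetic data, are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def marginal_screen(X, y, k):
    """Rank each column of X by its standalone (marginal) AUC against
    the binary outcome y, and keep the indices of the top k columns."""
    # |AUC - 0.5| treats discrimination in either direction as useful
    scores = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 20))
X[:, 3] += 1.5 * y            # column 3 is genuinely predictive
X[:, 7] -= 1.0 * y            # column 7 predicts in the other direction

keep = marginal_screen(X, y, k=2)
print(sorted(int(j) for j in keep))   # expect columns 3 and 7 to survive
```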

This is what I read in your statement:

1. You are not explicitly assuming anything (though you are operating as if the data generating process does not influence the observed variation, and as if that knowledge would not be useful in understanding the uncertainty in that variation).

2. You are uncertain about how a different method may provide more useful information (though without understanding the cause of variation in your variables and how much uncertainty is associated with it, you are unlikely to be able to design an effective selection or intervention process).

3. You believe that those who build models do so simply to build models, and not because the model building process produces better information with which to make decisions (this I could not disagree with more: the model building process not only provides information about the uncertainty under which one is operating, it also helps one identify the missing information that could improve one’s understanding of the causal processes at work).

I encourage you not only to identify the assumptions that are implicit in your method, but also to think about the value that can be derived from understanding the causal processes which underlie your problem and whether they are adequately captured by the data you have. It is most fruitful to focus on a solution to “these types of problems,” not simply on finding a quick and dirty way to reduce the amount of data at hand.

1. What I described is a procedure. Procedures do not, by themselves, make any assumptions. We make assumptions when we interpret the output of these procedures as enjoying specific statistical properties. As I said, the goal in these problems is often simply to shrink a list of candidate variables to a more manageable set of interest for further study. Certainly dropping a bunch of variables is going to accomplish the shrinking part! As for the “of interest” part, whether a given ranking procedure retains the desired variables and drops the undesired ones depends on what the ranking score is, what the threshold is, and what the desiderata for these variables are. My claim is that such ranking procedures, based on appropriate predictive scores, will often give us something close to what we want.

2. What I would like to see is a demonstration of the superiority of a (substantially) more complex method in problems such as these. I understand all the conceptual reasons why one might imagine it is possible to do better than something so simple, such as by building a well-specified model, but I’m not convinced that building a well-specified model with orders of magnitude more variables than observations is so easy, and so the gains of doing so may be harder to secure than they are worth. It’s fine to do a simulation showing that, if the DGP looks like such-and-so, and we use a model-based procedure that correctly captures the important features of that DGP, we do better than something relatively atheoretical like what I’m talking about; but it’s a different story to show consistently better performance in settings where the true DGP is some complex, high-dimensional, unwieldy black box, while the question we’re asking is relatively straightforward.

3. I build models all the time! I love models. I’m just saying I don’t see it helping much in this type of problem, unless we’re assuming we’ve got something close to the true model, which is usually seriously underidentified in this type of problem even in principle.

4. I’m all for utilizing as much background information and expert knowledge as we can. Usually in these exercises I’m discussing what we’re going to do in great detail with the subject matter expert, who is making detailed suggestions about how to refine this sort of thing in view of those considerations. It’s often possible to utilize prior information without fitting a big model that encodes it, instead incorporating it into your data pre-processing, choices of metrics, etc.

1. The world is causal. Causal processes generate events with more or less certainty. Data are typically generated as a result of multiple causal processes. A procedure used to reduce these uncertain data that ignores the causal processes does in fact come with assumptions whether you acknowledge them or not.

2. Pretending that a crude sifting of variables, when the underlying causal processes are not well understood, will consistently produce information that yields precise knowledge strikes me as a bit naive.

3. The problem is that the question itself is a pathway to convincing ourselves that things are true with evidence that cannot possibly support it.

4. Experts are great, but susceptible to the same biases that all humans are. We convince ourselves that we understand how things work at a deep level, when our knowledge is only superficial. Thinking creatively about the causal processes is how understanding is deepened. Models don’t have to be big, but they do need to be causal to be truly useful.

Curious,

I pretty much agree with what you say in points 1 – 4. However, I suggest some slight changes:

In 1: I’d amend the second sentence to read, “Data are typically generated as a result of multiple causal processes, including processes that produce uncertainty.”

In 4: I’d replace “Thinking creatively about the causal processes is how understanding is deepened.” by something more like, “Understanding is deepened by a process combining thinking creatively and thinking critically.”

But I also think it’s important to consider what is important to the end user, and what methods best address those things, taking into account the nature of the data. (See my comment https://andrewgelman.com/2018/06/21/answering-question-predictors-important-going-beyond-p-value-thresholding-ranking/#comment-769377 above for an example of one potential end-user’s perspective on the particular question discussed in the original post.)

“Procedures do not, by themselves, make any assumptions.”

I think this is backwards: but maybe I haven’t been paying attention. Which would be typical – for me (SMH – again, no doubt). W. V. O. Quine … or maybe not.

In science I think this is usually true, but there are vast realms where causality isn’t so important. If you’re trying to correctly predict the quantity of fidget spinners to produce, to maximize your profit, knowing that they will be a short-lived fad… do you care about the psychology and network effects behind why children’s fads have certain durations? Or do you just want to know you won’t be stuck with multiple shipping containers’ worth of useless trinkets, while also ensuring you won’t have far more orders than you can fulfill?

Daniel:

Sure, if you are simply the owner of a fidget spinner company that has licensed production through a third-party manufacturer of trinkets, then you just want a realistic estimate of the low and high ends of demand. But if you are that same owner and you are paying a consultant to advise you on production quantities, then if it were me I would want that consultant to understand how these types of fads typically cycle up, down, and out. Why should I pay a consultant to calculate a simple forecast that has very little chance of helping me make a good decision because it is based only on the last 3 months of demand?

To expand slightly.

I agree that one does not need to understand every aspect of the production process to generate a good forecast, but having historical data from other trinket fads and even some information on upcoming changes that may affect demand either positively or negatively can certainly improve our ability to model the future.

Doesn’t this look like some causal inference work?

If you go question by question to answer your problem, wouldn’t you be better off creating a causal graph for it?

I don’t know if this thread is still open, but:

1. RF importance measures are known to be biased, and they measure something like the “conditional independence” of variables.

2. Classical tests, like the F-test, only check for certain parametric differences (like the first two moments), whereas RF uses information from all moments.

Based on that, I’d recommend:

1. Use a nonparametric test, such as https://cran.r-project.org/web/packages/mgc/index.html (which is mine), or HHG, Dcorr, or HSIC; probably HHG if you are interested only in low-dimensional stuff. Using whichever one of these, rank the variables in terms of importance with regard to your variable of interest.

2. Use RF, but with a stepwise-regression-type strategy, where you include only one additional feature at a time. Keep adding features until predictive accuracy stops increasing (much).

3. At that point, you can use RF feature importance to determine which features are conditionally most important.
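The stepwise strategy in step 2 could be sketched as a greedy forward selection (the tolerance, cross-validation settings, and synthetic data are illustrative choices, not part of the original suggestion):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, tol=0.005, max_features=None):
    """Greedy forward selection with a random forest: at each step, add
    the single feature that most improves cross-validated accuracy, and
    stop when the improvement drops below tol."""
    n_features = X.shape[1]
    max_features = max_features or n_features
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {}
        for j in candidates:
            cols = selected + [j]
            rf = RandomForestClassifier(n_estimators=100, random_state=0)
            scores[j] = cross_val_score(rf, X[:, cols], y, cv=3).mean()
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < tol:
            break                 # accuracy stopped increasing (much)
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
selected, score = forward_select(X, y)
print(selected, round(score, 3))
```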

Email me if you have any questions, since I don’t get email alerts to responses here for some reason.