Haynes Goddard writes:

I have been slowly working my way through the grad program in stats here, and the latest course was a biostats course on categorical and survival analysis. I noticed in the semi-parametric and parametric material (Wang and Lee is the text) that they use stepwise regression a lot.

I learned in econometrics that stepwise is poor practice, as it defaults to the “theory of the regression line”, that is no theory at all, just the variation in the data.

I don’t find the topic on your blog, and wonder if you have addressed the issue.

My reply:

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once.

To address the issue more directly: the motivation behind stepwise regression is that you have a lot of potential predictors but not enough data to estimate their coefficients in any meaningful way. This sort of problem comes up all the time; for example, here’s one from my research, a meta-analysis of the effects of incentives in sample surveys.

The trouble with stepwise regression is that, at any given step, the model is fit using unconstrained least squares. I prefer methods such as factor analysis or lasso that group or constrain the coefficient estimates in some way.
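To make the contrast between unconstrained and constrained estimates concrete, here is a minimal sketch. It assumes the special case of orthonormal predictors, where the lasso estimate is the least-squares coefficient pulled toward zero by a fixed amount (soft thresholding), while keep-or-drop selection leaves each surviving coefficient at its unconstrained least-squares value (hard thresholding). The coefficient values and the cutoff `lam` are made up for illustration, and using the same cutoff for both rules is a simplification.

```python
def soft_threshold(b, lam):
    """Lasso-style estimate for one coefficient under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def hard_threshold(b, lam):
    """Keep-or-drop selection: a kept coefficient is not shrunk at all."""
    return b if abs(b) > lam else 0.0


# Hypothetical unconstrained least-squares estimates (illustration only).
ols = [3.0, -1.2, 0.4, -0.1]
lam = 0.5  # hypothetical cutoff / penalty

lasso = [soft_threshold(b, lam) for b in ols]
subset = [hard_threshold(b, lam) for b in ols]

print("least squares:", ols)
print("keep-or-drop :", subset)
print("lasso-style  :", [round(b, 3) for b in lasso])
```

Both rules set the two small coefficients to zero, but only the lasso-style rule also pulls the large ones toward zero.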

“The trouble with stepwise regression is that, at any given step, the model is fit using unconstrained least squares. I prefer methods such as factor analysis or lasso that group or constrain the coefficient estimates in some way.”

As a wanna-be statistician, I’d be greatly indebted if you could explicate this further, or provide a reference where I can read more.

Thanks.

lasso: http://statweb.stanford.edu/~tibs/lasso.html

Much obliged.

I found the free online course by Hastie and Tibshirani both edifying and entertaining. It ran from January to April 2014, but will presumably run again sometime. Lasso and Ridge Regression are covered, among many other topics. Course materials included access to the books “An Introduction to Statistical Learning, with Applications in R” and “The Elements of Statistical Learning” as well as access to R code implementing numerous examples. Video lectures and slides presented the course materials. Quizzes/quiz-like questions assessed comprehension. May you find it enriching too! Here is a link to the course: https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

Why is outlier detection a joke?

Google sums it up quite nicely; search “outlier detection” and see how long it takes you to find the phrase “data mining.”

Is this a matter of phrasing? For example, outlier detection has reasonable uses – statistical process control leaps to mind as an area where, at least conceptually, what we’re doing is trying to detect events that don’t fit a model. Seems pretty legitimate to me, but I’m not a pro statistician.

That’s what I was thinking. Flagging outliers seems fairly common and quite useful practically in a lot of settings.

Rahul:

Outlier detection can be a good thing. The problem is that non-statisticians seem to like to latch on to the word “outlier” without trying to think at all about the process that creates the outlier. Also, some textbooks have rules that look stupid to statisticians such as myself, rules such as labeling something as an outlier if it is more than some number of sd’s from the median, or whatever. The concept of an outlier is useful but I think it requires context: if you label something as an outlier, you want to try to get some sense of why you think that.

I think the distinction is outlier detection vs outlier rejection. In my work, I use exactly the method you describe – I check whether servers are healthy by frequently checking response times for certain actions, and if I get several readings in a row that are more than 5 standard deviations from the mean of recent values, I send an alert message. It’s intentionally crude, and it’s just supposed to be a tool to grab attention and help push for investigation of what the heck is going on.

The point is, this is very literally outlier *detection*. No automatic action is taken other than an alarm going off, since it could be anything: new code is deployed that’s not performant, or a power outage in a data center across the country that we depend on, or a momentary blip due to everyone starting a House of Cards episode at the same time.

I realize this is basically the general point you were making, but it feels worth fleshing out in defense of outlier detection :)
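The alerting rule described in this comment could be sketched roughly as follows. This is not the commenter's actual code: the window size, the minimum baseline length, the run length of 3 and the choice to keep flagged readings out of the baseline are all assumptions made for illustration. Only the "several readings in a row more than 5 standard deviations from the mean of recent values" idea comes from the comment.

```python
from collections import deque
import statistics


def make_monitor(window=50, n_sd=5.0, run_length=3, min_baseline=10):
    """Return a function that takes one reading and says whether to alert."""
    recent = deque(maxlen=window)  # baseline of recent "normal" readings
    streak = 0                     # consecutive readings far from the baseline

    def observe(x):
        nonlocal streak
        alert = False
        if len(recent) >= min_baseline:
            mu = statistics.mean(recent)
            sd = statistics.stdev(recent)
            if sd > 0 and abs(x - mu) > n_sd * sd:
                streak += 1
            else:
                streak = 0
            alert = streak >= run_length
        if streak == 0:
            # Assumption: flagged readings are kept out of the baseline so
            # they do not inflate the mean and sd used for later checks.
            recent.append(x)
        return alert

    return observe


observe = make_monitor()
# Hypothetical response times: a steady baseline, then a sustained jump.
readings = [100 + (i % 5) for i in range(30)] + [500, 500, 500]
alerts = [observe(x) for x in readings]
print(alerts[-3:])  # only the third consecutive extreme reading alerts
```

The only output is a flag asking someone to look; nothing is deleted or corrected automatically.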

Ironically, even what seems like a stupid rule to you, the “labeling something as an outlier if it is more than some number of sd’s from the median” prescription: in practice I can think of very few processes where that’s a terrible rule. I.e., even that crude rule of outlier detection seems to work fine for many real-world examples, e.g. QC/QA or log correlation.

…except that the sd is really not a good statistic for doing this, because it is itself heavily affected by outliers.
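A small sketch of that point, using made-up readings: one wild value inflates the standard deviation so much that a 3-sd rule fails to flag it, whereas a rule built on the median and the median absolute deviation (MAD) does. The data, the cutoff `k=3` and the function names are illustrative assumptions. The factor 1.4826 is the usual constant that puts the MAD on the same scale as the sd for normal data.

```python
import statistics


def sd_rule(values, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sd]


def mad_rule(values, k=3.0):
    """Flag values more than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # comparable to the sd under normality
    return [v for v in values if abs(v - med) > k * scale]


# Hypothetical readings with one wild value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]

print("sd rule flags: ", sd_rule(readings))   # []
print("MAD rule flags:", mad_rule(readings))  # [1000]
```

With ten observations a single point can never be more than about 2.85 sample standard deviations from the mean, so the 3-sd rule cannot fire here no matter how extreme that point is.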

Yes, I agree with both of you. The problem is not with outlier detection, it is what is done after applying such a rule.

My gripe is mainly with automatic methods of detection or labelling that are not sensitive to context and have no understanding of what the researcher is trying to do.

Andrew, I think your original statement was correct. The problem with outlier detection is when people don’t see the gap between what they conceive an outlier to be (“a bad point”, “a suspicious transaction”, etc) and what outlier tests do: determine how likely a point is with respect to a particular model.

People who stop at detection are implicitly acknowledging that their model may be useful but it’s not well-motivated. That’s why you as a statistician can tolerate that approach: who can argue with a kind of tripwire that simply says, “You might want to look more closely at this.” But “outlier” is still a loaded term and one of the more dangerous ones in statistics, and “outlier test” without further explanation should raise a red flag in your mind.

(It’s more than just the definition of “outlier” and the key concept of probability under a model. As someone else pointed out, some outlier tests are themselves influenced by outliers, leading to epicycle-like kludges. Then there is the baseline period which is considered “normal” for establishing your limits, then obsessions with intervals in models that do not reflect tolerance or prediction…)

Agree with this point and all others related to it.

I work in behavioral statistics, and I have yet to hear of a really striking instance in which outlier testing was the lynchpin that people seem to think it is. It actually effectively tells me that the person doesn’t know what they’re doing.

That being said, as Wayne pointed out, there’s clearly a definitive difference between outlier detection and outlier testing and/or outlier rejection. I would be wary of one who doesn’t practice outlier detection because, at least in my field, it’s often the outliers that we’re supposed to be paying attention to. But that doesn’t beget any sort of data manipulation necessarily, and I think people fail to realize this.

Rather, because “outlier detection” seems to be a term that would be defined in a vocabulary section of a statistics textbook, I think people end up conflating that with other such terms, which are typically tests or calculations, thereby implying that “outlier detection” is a test or method, rather than simply a behavior of acknowledgement. I’ve had countless students ask me for the formula for outlier detection when I ask them questions about the topic in assignments or tests.

Haynes:

Since another motivation of stepwise regression is to produce a simpler model in terms of number of coefficients, discussion of the bet on sparsity principle may be of interest:

http://andrewgelman.com/2013/12/16/whither-the-bet-on-sparsity-principle-in-a-nonsparse-world/

There are many things not to like about stepwise. One not mentioned in the post is that it doesn’t even necessarily do a good job at what it purports to do. Given a set of predictors, there is no guarantee that stepwise will find the “best” combination of predictors (defined as, say, the highest adjusted R^2); it can get stuck in local optima. Example here:

http://stats.stackexchange.com/questions/29851/does-a-stepwise-approach-produce-the-highest-r2-model
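A minimal sketch of how forward selection can miss the best subset. The setup is an assumption chosen to make the failure obvious: the outcome is exactly `x1 + x2`, and a third predictor `x3` is a noisy copy of that sum. Forward selection grabs `x3` first because it is the best single predictor, and can then never reach the pair `{x1, x2}` that fits perfectly. For a fixed number of predictors, minimizing the residual sum of squares is equivalent to maximizing (adjusted) R^2. All names and the data-generating process are hypothetical.

```python
import itertools
import random


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on columns + intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [u - f * v for u, v in zip(A[r], A[k])]
    beta = [A[k][p] / A[k][k] for k in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def forward_stepwise(cols, y, size):
    """Greedily add the predictor that lowers RSS the most, `size` times."""
    chosen = []
    while len(chosen) < size:
        remaining = [j for j in range(len(cols)) if j not in chosen]
        best = min(remaining,
                   key=lambda j: rss([cols[k] for k in chosen + [j]], y))
        chosen.append(best)
    return chosen


def best_subset(cols, y, size):
    """Exhaustive search over all subsets of the given size."""
    return min(itertools.combinations(range(len(cols)), size),
               key=lambda s: rss([cols[k] for k in s], y))


rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [a + b + 0.5 * rng.gauss(0, 1) for a, b in zip(x1, x2)]
y = [a + b for a, b in zip(x1, x2)]
cols = [x1, x2, x3]

forward = forward_stepwise(cols, y, 2)
best = best_subset(cols, y, 2)
rss_forward = rss([cols[k] for k in forward], y)
rss_best = rss([cols[k] for k in best], y)
print("forward picks", forward, "with RSS", round(rss_forward, 3))
print("best subset  ", list(best), "with RSS", round(rss_best, 3))
```

Exhaustive search is only feasible here because there are three candidate predictors; the point is the gap between the two answers, not a recommendation to search all subsets.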

On an unrelated note, I wonder if Andrew or someone else could say a little more about why outlier detection is considered “a bit of a joke” to statisticians. Do you just mean they take a dim view of simple, thoughtless rules like “delete any observation with Cook’s D above a certain threshold,” or that they view the entire enterprise of identifying and dealing with outlying observations as fundamentally dubious? The former view is certainly understandable, I’m just wondering if you’re actually thinking of the second sort of view.

I’m curious about the outlier thing, too. Developing influence statistics for multilevel models seemed to me to be a recent area of applied statistical research, even. I think I’ve seen some stuff from Snijders and Berkhof and from Loy and Hofmann.

I was primarily looking for diagnostics and solutions for heteroscedasticity issues, though*, and spotted the influence statistics stuff only by accident.

*To be more precise, I wondered if there is any work on multilevel models where heteroscedasticity is not seen as something you have to correct for but as something which is of substantive interest.

How do you equate stepwise regression and Lasso with something like BMA?

Aren’t these all a form of model selection procedures which if not useful for theory testing, can be legitimate for forecasting?

I was also skeptical of stepwise regression as a Biology major with an emphasis on Ecology and Molecular Ecology. Interestingly, I have seen this practice creep into both the climate and ecological literature, and it seems to be gaining popularity in fields with “messy data”. For now I’m avoiding using these methods (stepwise regression, quantile regression, etc.), but keeping my eye on them to see if they gain a broader acceptance. I do agree though that these methods will tend to produce statistically significant results that might not actually be “biologically relevant”, i.e. a trend that can actually be applied and developed into a usable model.

I’m curious, why are you lumping stepwise and quantile regression together here? While I can see the problems of stepwise regression, I can imagine settings in which quantile regression may be reasonable.

Andrew doesn’t mention this piece, but here is a nice little review of the problems with stepwise: http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf

Issues are (paraphrased from Harrell, 2001):

1. R^2 values are biased high.

2. The F and χ^2 test statistics do not have the claimed distribution.

3. The standard errors of the parameter estimates are too small.

4. Consequently, the confidence intervals around the parameter estimates are too narrow.

5. p-values are too low, due to multiple comparisons, and are difficult to correct.

6. Parameter estimates are biased high in absolute value.

7. Collinearity problems are exacerbated.

There is yet another problem with Stepwise Regression; a big one. It encourages you not to think.

Doing stepwise using significance values of the parameters is definitely a bit of a joke, but I wouldn’t necessarily say so when using a criterion such as AIC. There is at least a theoretical justification for finding the ‘optimal’ model as measured by AIC. Stepwise does not necessarily find this optimum, but it does do approximate optimization.

The real sin though is when p-values are reported with a stepwise regression (shudder).

People such as Frank Harrell would, and do, argue differently on this point. Frank has often said that (me paraphrasing) using the AIC in this manner is just the same as stepwise using p-values because the AIC is just a restatement of the p-value. They don’t give the same result but the process is the same; you’re just (potentially) using a different threshold than say p <= 0.05 when you use AIC.

Anything that imposes a hard selection threshold will fall foul of at least:

6. Parameter estimates are biased high in absolute value.

from the list above.
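A quick simulation sketch of why a hard threshold inflates the estimates that survive it. The numbers are assumptions for illustration: a true coefficient of 0.2, a standard error of 0.1, and a "keep it if |z| > 1.96" rule. The estimator is unbiased before selection, but the average of the estimates that pass the cutoff is noticeably larger than the truth.

```python
import random
import statistics

rng = random.Random(1)
true_beta = 0.2   # hypothetical true coefficient
se = 0.1          # hypothetical standard error of its estimate

# Repeated unbiased estimates of the same coefficient.
estimates = [rng.gauss(true_beta, se) for _ in range(10_000)]

# Hard selection: keep the estimate only when it looks "significant".
kept = [b for b in estimates if abs(b / se) > 1.96]

mean_all = statistics.mean(estimates)
mean_kept = statistics.mean([abs(b) for b in kept])

print("true coefficient:           ", true_beta)
print("mean of all estimates:      ", round(mean_all, 3))
print("mean |estimate| when kept:  ", round(mean_kept, 3))
print("share of estimates kept:    ", round(len(kept) / len(estimates), 3))
```

The same mechanism applies whatever the cutoff is based on (p-value, AIC or anything else), since the bias comes from conditioning on the estimate being large.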

I love stepwise regression.

It is a very simple, effective way to do variable selection.

The lasso and stepwise are approximately the same (as shown in the LARS paper by Efron et al.).

There are results by Andrew Barron et al. that show that stepwise achieves optimal risk.

See: Barron, Andrew R., et al. “Approximation and learning by greedy algorithms.” The Annals of Statistics (2008): 64-94.

Of course one should not use the output of this (or any selection method) for inference.

But for prediction it is great.

Larry

I think the LARS paper is actually pretty critical of stepwise; Efron et al find it to be too greedy. The paper shows that LASSO and STAGEwise are approximately the same, and have better properties than stepwise regression. Stagewise takes smaller steps than stepwise, and as such allows multiple collinear variables into the model in a way that might be better for predictive accuracy.

-Brad

Yes, it is stagewise that is closer.

But in practice, when the dimension is large, I find they are almost always very similar. There is little practical difference.

And stepwise is easier to implement and easier to explain.

And the risk bounds, as I mentioned, are the same as those derived by Greenshtein-Ritov for the lasso. So from that perspective they are the same.

Larry

Larry: “Of course one should not use the output of this (or any selection method) for inference.” Of course? What are the AIC and other people doing? Just black box prediction? This distinction between inference/prediction is coming up on my current post (on Potti and Duke), and if what you say is true, then it seems problematic to be using any of these model selection techniques in recommending treatments for patients.

I meant things like p-values after selecting variables by stepwise

If anyone is serious about reliably calling out poor statistical practices rather than cherry picking pet conflicts with other professors, how about taking a look at things like this:

http://blogs.discovermagazine.com/d-brief/2014/06/02/hurricanes-with-female-names-are-deadlier-than-masculine-ones/

Journalists being nothing but parrots with large amounts of salt-and-pepper noise is understandable. In general tenured researchers being completely incompetent at statistics is less so.

If you want to criticise that research, why are you not doing it yourself? What are your problems with the approach? Etc.

In a manuscript review I performed last year, I criticized the use of stepwise regression and recommended the authors select covariates for their model based on their knowledge of the field (in which they are experts). I also referenced Frank Harrell’s criticisms of stepwise regression.

The reply to this criticism: “This is a standard method in the field”

(Not an exact quote but it went something like that.)

Oh, and the assigned statistical reviewer did not criticize the use of stepwise regression, but noted that the study may have been underpowered. The dataset was approximately the same size as the three previous datasets used to study the effect of interest (by now, to confirm that the effect was probably not present).

Yes, I am still a little annoyed by this…

You win some and you lose some.

In an earlier version of this paper -

Intravenous immunoglobulin therapy for streptococcal toxic shock syndrome: a comparative observational study -

http://scholar.google.ca/citations?view_op=view_citation&hl=en&user=R064zwoAAAAJ&citation_for_view=R064zwoAAAAJ:2osOgNQ5qMEC

– I was originally displaced from the research group by a well-known biostats department that one of the co-authors was associated with. The department had convinced that co-author that only the one best adjustment model (found by a search over all possible models) should be presented in the paper.

The reviewers at the initial journals the research was submitted to stepped on them hard enough that the co-author I was associated with was able to reinstate me. Both a propensity score analysis and a summary of all possible linear adjusted estimates were given along with, I think, a clear discussion of uncertainties that could not be further refined.

Stepwise regression has two massive advantages over the more advisable alternatives. One, it’s intuitive – unlike even lasso, it’s simple to explain to a non-statistician why some variables enter the model and others do not. Two, it’s implemented in an easy-to-use way in most modern statistical packages, which the alternatives are not. Would I publish a paper with it or advise its inclusion in a statistical plan? No way. Am I okay with folks using it to explore their own data sets, with all the necessary caveats? Yep.

So, please consider the alternative hypothesis that the researchers who use stepwise regression are aware of the problems in a general sense, but perhaps don’t know a better option. Not unlike the problem with overreliance on p-values, actually.

And… to the frequently-repeated assertion by statisticians that clinicians/scientists should ‘use their domain expertise to select variables manually, rather than relying on the computer’: close your eyes and imagine reading that sentence in a manuscript or a grant. Now imagine just how quickly it would be shot down for lack of rigor, or suspected of data-dredging.

cassowary37: if the clinicians/scientists don’t give just a vague statement about domain expertise, but instead say “we will adjust for X and V because of plausible confounding as described in figure Z” with some relevant citations to show they know what they’re talking about, I think getting shot down would be harsh.

However, to successfully argue one is using domain expertise, there has to be a very specific goal in mind for the analysis – a specific aim of the grant, say. When that’s not available (and it may not be) I agree stepwise approaches may have some merit as exploratory tools, although other tools are – these days – easy to use and should at least be considered.

cassowary37, you suggest that using domain expertise to select variables would be shot down as data dredging, but stepwise regression would not. You may be right that some reviewers would react that way, but those reviewers would have it backwards. It is stepwise regression that is “data dredging”, and explicitly so: the procedure tries to identify the set of explanatory variables with the most power, whether or not they make any sense whatsoever. If you throw in a bunch of random vectors of explanatory ‘data’, some of them will be selected by the stepwise regression procedure for inclusion in the model, whereas no educated human would make that mistake.
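The random-vectors point can be checked with a small simulation. This sketch looks only at the first step of a forward selection driven by significance tests, under assumed settings: 100 observations, 200 candidate predictors, and an outcome that is pure noise unrelated to any of them. The cutoff 1.98 is approximately the two-sided 5% critical value for a t statistic with 98 degrees of freedom.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for the slope in a one-predictor regression."""
    return r * math.sqrt((n - 2) / (1 - r * r))


rng = random.Random(2)
n, p = 100, 200  # hypothetical sample size and number of candidates

# The outcome and every candidate predictor are independent noise.
y = [rng.gauss(0, 1) for _ in range(n)]
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

cors = [correlation(col, y) for col in X]
t_stats = [t_statistic(r, n) for r in cors]

# Step one of forward selection: take the strongest single predictor.
first = max(range(p), key=lambda j: abs(cors[j]))
n_significant = sum(abs(t) > 1.98 for t in t_stats)

print("first variable entered:", first, "with t =", round(t_stats[first], 2))
print("noise predictors with |t| > 1.98:", n_significant, "of", p)
```

By construction none of these predictors has anything to do with the outcome, yet the procedure would admit the strongest one with a nominally significant coefficient.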

I consider stepwise regression to be a useful tool for exploratory data analysis — here are a bunch of variables that I think might be predictive, show me which ones actually are — but for going beyond the exploratory stage it can easily lead you down the garden path.

When I was a postdoc, I was working on analyzing indoor radon data (radon is a naturally occurring radioactive gas that is present in high concentrations in a small fraction of houses). I had analyzed some survey data using Bayesian techniques; you can search on this blog and find a couple of mentions of the work in the unlikely event that you want more detail. I met with our funders to present my results, and a consultant was at the same meeting to present his analysis of the same data. One of the datasets was from the National Residential Radon Survey, which included…I forget, I think about 6000 houses from a stratified sample of census tracts around the country. There was a lot of information on each home: does it have a garage, does it have a carport, does it have slab-on-grade construction, does it have a gas water heater, etc. Most of these could plausibly be associated with indoor radon concentrations in some way. The consultant had thrown them all into the hopper and performed stepwise regression, even including lots of interaction terms (like carport x basement). The result was a model with a fair amount of explanatory power that was utter nonsense. I wasn’t sure how to handle it…should I publicly shame the guy with some pointed questions, or quietly approach our funders later, or ask the guy some gentle questions that would gradually reveal that he didn’t know what he was talking about, or what? As it turned out, I didn’t have to say anything: the review committee asked some questions that exposed the silliness of the whole thing. It was a bit painful to watch.

I’m curious, did your alternative model have more or less explanatory power than the consultant’s brute force model?

Further, isn’t what you are criticizing essentially the over-fitting aspect? If an ad hoc model performs with as good an explanatory power on out-of-sample data, can you still apply the “nonsense model” critique? I.e., is a validation step what’s needed? At some degree of performance doesn’t one have to concede the superior explanatory power of a model notwithstanding how silly one thinks the model structure is?

Explanatory power without causal understanding can be dangerous because you don’t know when the correlations that make the model work will be broken. You end up with black swan types of failure modes.

It also becomes unclear how to move forward with improving the model when you don’t understand why it works.

There are niche applications where you don’t care, for example, image editing / texturing software. But in cases where the goal is scientific, then no – out-of-sample prediction is not the be all end all.

Yes, but can I dismiss a model with bad causal structure ( & excellent explanatory power ) because my alternative has an appealing causal structure yet crappy explanatory power?

e.g. in Phil’s example I think it’s too easy to make fun of the consultant but is a predictively crappy alternative any better, no matter how enticing its causal structure?

that’s not the right dichotomy. both a model that’s poorly predictive and a model that can’t be interpreted are fairly useless for scientific purposes.

i would however, generally make more use of a model with a plausible causal interpretation with reasonable predictive power (relative to measurement error) than one that’s slightly more predictive of the available data but uninterpretable.

keep in mind our estimates of generalization error tend to be very crude.

Rahul, the consultant’s model performed better on the data at hand but everyone realized it would perform much worse when applied to new data. Therein lies the whole problem, or at least most of the problem.

Out of sample validation solves that though right? Just hold some data back & validate on it?

no it’s not ‘solved’. cross validation error != generalizability.

Interesting. I didn’t know that. Thanks.

So what’s an objective way to evaluate the generalizability of a model?

roughly speaking there’s two aspects to generalizability – the bias variance tradeoff (which encompasses “overfitting”) and heterogeneity.

You can think of reality as a mixture distribution. Often cross validation error won’t translate into a real out of sample error because your sample underestimates the variance of the variance of the variance of the variance, etc. I tend to be wary of machine learners who do a single train/test/validation error and think they’re done. What’s the credible interval on one’s estimate of a prediction error? Guess what, that’s going to depend on a model assumption (most of the time researchers don’t even provide one).

It’s important to realize that cross validation is relying on modeling assumptions which are just as subject to modeling failures as anything else.

The best case scenario for characterizing generalization error is probably when you are doing a timeseries prediction with bounded outcomes (eg election prediction).

There’s some interesting debate of this here http://andrewgelman.com/2012/07/23/examples-of-the-use-of-hierarchical-modeling-to-generalize-to-new-settings/

If the consultant had used stepwise regression to find a model based on data from half of the sampling units, he would have come up with a different model from the one he came up with. It would have performed well on the data being fit, and poorly in cross-validation. What that _should_ tell you is not to use stepwise regression, or at least not for constructing your final model.
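A rough sketch of that half-sample argument, under deliberately extreme assumptions: the outcome is pure noise, there are 100 candidate predictors, and "stepwise" is reduced to its first step (pick the single predictor that fits the fitting half best). The fit is measured by the squared correlation, averaged over 20 simulated datasets. None of this is the consultant's actual analysis.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


rng = random.Random(3)
n, p, reps = 100, 100, 20  # hypothetical sizes
half = n // 2

fit_scores, holdout_scores = [], []
for _ in range(reps):
    y = [rng.gauss(0, 1) for _ in range(n)]
    X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

    # Select the best-looking predictor using the first half only.
    chosen = max(range(p), key=lambda j: r_squared(X[j][:half], y[:half]))

    fit_scores.append(r_squared(X[chosen][:half], y[:half]))
    holdout_scores.append(r_squared(X[chosen][half:], y[half:]))

mean_fit = sum(fit_scores) / reps
mean_holdout = sum(holdout_scores) / reps
print("mean R^2 on the half used for selection:", round(mean_fit, 3))
print("mean R^2 on the held-out half:          ", round(mean_holdout, 3))
```

The held-out half is only an honest check if it is looked at once; repeating the split until a selection happens to pass turns the holdout into one more thing being fit.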

If, instead, you keep doing different random selections and testing them, you will eventually find one that works well on both the fitted dataset and the cross-validation set. But it will generate nonsense if applied to new data.

I think there is a much bigger problem with how many people like to interpret the results of whatever variable selection procedure than with any specific one including stepwise. People need to understand that many things they would like to identify cannot be identified from the data, particularly “variable A has an effect on Y whereas variable B hasn’t”. I don’t think that there is anything more to like about such interpretations if they use a result of Lasso or something Bayesian than of stepwise.

“Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke.”

Tibshirani and Hastie in their recent Statistical Learning MOOC were quite positive about stepwise regression, in particular forward stepwise selection for variable selection. Their book also covers these topics. http://statweb.stanford.edu/~tibs/ElemStatLearn/

(Of course they also emphasized lasso, but were not critical of forward stepwise by any means.)

A reasonable use case for stepwise regression in variable selection would be simply to rank order the theoretical ‘importance’ of each variable to the model. But I consider the outputs of a fwd stepwise regression a mere guide on which variables to begin with, not a viable model. In fact, I will use fwd stepwise iteratively.

Let’s say I’m trying to develop a reasonably well parameterized, well fit, minimal deviance and information loss, and variable interaction accounted model which captures the probability of some event y occurring. And let’s say I have 200-300 variables to examine for candidacy in the training model (and that 200-300 may have been a reduction from several thousand other dimensions). Where do I reasonably begin in that case?

I find that fwd stepwise helps streamline the process in this regard. I may take these variables and simply output an initial rank ordered list of variables the stepwise may be inclined to include (examining aic/bic, deviance of the residuals that theoretically may be reduced, and a few GoF measures). I may begin examining each variable as I add them, one by one, into the model.

What fwd stepwise allows me to do is determine a ‘stopping point’, which may be reached in the list when a further reduction in deviance becomes insignificant. Once I’ve reached that stopping point, I stop the manual model dev and will then run another fwd stepwise, this time having it consider a new model wherein the variables that made it successfully thus far are known. This generates a new rank ordered list, and I then return to variable testing in the model with that. I cycle through this until I’ve run through the entire list of possible candidates.

At that point I search for possible interactions and go through the process of examining each. Of course this is still quite a raw model and candidate interactions should be somewhat intuitive (and that is an admitted source of bias, but there is little perfect about ‘explanatory/predictive’ output). There is inevitably some subject domain expertise (the ‘art’ of this entire process) that comprises selection bias on what interactions are reasonable to test. I haven’t really experimented yet with how this might be improved (reduce selection bias, choice of degree of interactions [n-way] to consider).

Once that is completed, I will then use backward stepwise to examine what variables, if any, may now add little value to the model once new within variable associations have been discovered.

For the test set, I apply the training model to examine how it accepts new data. I then conduct the entire process again, and then perform a set of model comparison tests (for error noted between training/test application, what would explain this? By having both the applied model and independent model outputs generated, diagnosing potential issues I believe is aided immensely.).

So yeah, this can be quite laborious. But independent model development, particularly if what is being modeled is all rather ‘new’ (a ‘first run’), is I think a valuable added bit of ‘insurance’ that the resultant model is sound.

For certain applications, such as in certain types of risk, where a single event’s maximum severity is in the scale of things rather low in margin, I often find the more stable generalizable model over time is the one which is slightly *underfit* in the grand scheme of things. For examining what is behind more severe risk events, I don’t believe this may be sufficient, but then I don’t generally use this modeling paradigm (ie GLM, etc) for those varieties of problems anyway, unless ‘intuition’ is all that is requested. The controversy over the importance of model parsimony and stability vs accuracy is truly context dependent.

Right, but they aren’t very complimentary about such methods in their Elements of Statistical Learning book, which whilst not the main text for that course was suggested reading for more savvy participants. There was a distinct focus of the Stanford StatLearn course on prediction, so they weren’t specifically using it for inference either.

This outlier detection is performed a lot by the neuroscience community. I must admit, I am one of those non-statisticians. Could you elaborate why it is a bad idea? Is it because of bias introduction?

Luca:

Speaking generally, we want to understand where the outliers are coming from. There’s a big difference between an observation that happens to have a high value, and a data recording error, for example. Automatic rules for removing outliers can’t really handle that. Beyond this, the concept of an “outlier” seems in many cases to be a crude substitute for the more valuable concept of a “distribution.” It disturbs me that, of all the statistics jargon, the term “outlier” is so popular.
