The robust beauty of improper linear models in decision making

Andreas Graefe writes (see here, here, and here):

The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value of weighting all predictors equally and including all relevant variables in the model. Across the ten elections from 1976 to 2012, equally weighted predictors reduced the forecast error of the original regression models on average by four percent. An equal-weights model that includes all variables provided well-calibrated forecasts that reduced the error of the most accurate regression model by 29 percent.

I haven’t actually read the paper, but I have no reason to disbelieve it. I assume that you could get even better performance using a Bayesian approach that puts a strong prior distribution on the coefficients being close to each other. This can be done, for example, in a multiplicative model like this:

Suppose your original model is y = b_0 + b_1*x_1 + b_2*x_2 + . . . + b_10*x_10, and suppose you want the coefficients b_3,…,b_10 to be close to each other. Then you can write b_j = a*g_j, for j=3,…,10. The equal-weighting model sets g_j=1 for all j. A Bayesian version could set g_j ~ N(1,s^2), where s is some small value such as 0.2. Or something like that. This is just an idea I’ve had; I’ve never actually tried it out.
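To make the idea concrete, here is a minimal sketch in Python (NumPy only). It is not a full Bayesian fit, and it rests on several assumptions that are not in the post: the predictors are already rescaled and sign-aligned, every column passed in belongs to the pooled group (the post pools only b_3,…,b_10), the residual standard deviation sigma is treated as known, and b_0 and a are plugged in from the equal-weights regression rather than estimated jointly. Under those assumptions b_j = a*g_j with g_j ~ N(1,s^2) means b_j ~ N(a,(a*s)^2), so the posterior mode is a ridge regression that shrinks toward a instead of toward zero. The function names are hypothetical.

```python
import numpy as np

def equal_weights_fit(X, y):
    """Least-squares fit of y = b0 + a * (x_1 + ... + x_k); returns (b0, a)."""
    design = np.column_stack([np.ones(len(y)), X.sum(axis=1)])
    b0, a = np.linalg.lstsq(design, y, rcond=None)[0]
    return b0, a

def pooled_fit(X, y, s=0.2, sigma=1.0):
    """Posterior-mode coefficients under b_j ~ N(a, (a*s)^2).

    b0 and a are plugged in from the equal-weights fit and sigma is
    assumed known: a shortcut for illustration, not full Bayes.
    """
    b0, a = equal_weights_fit(X, y)
    k = X.shape[1]
    tau = abs(a) * s                       # prior sd of b_j = a * g_j
    if tau == 0.0:                         # degenerate prior: exactly equal weights
        return b0, np.full(k, a)
    lam = (sigma / tau) ** 2               # ridge penalty on deviations from a
    resid = y - b0 - a * X.sum(axis=1)     # what the equal-weights model misses
    # Ridge solution for delta_j = b_j - a.
    delta = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ resid)
    return b0, a + delta
```

As s goes to zero the coefficients collapse to the equal-weights value a, and as s grows they move toward the unpenalized least-squares deviations, so s controls how far the fit may depart from equal weighting.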

1. Jonathan says:

It should be noted that this is among the best-named papers in decision research. It even caught the attention of a musician, who named one of his albums after coming across the paper title.

http://www.allmusic.com/album/the-robust-beauty-of-improper-linear-models-in-decision-making-mw0000616698

2. Anonymous says:

I take it the title is a reference to the Robin Dawes paper? http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.188.5825

John D. Cook commented on this previously (well, commented on Kahneman’s commentary on it):

http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/

I previously posted in his comments wondering if the effect was simply an example of the counterintuitive benefits of shrinkage estimation (persisting even when taken to an absurd extreme). Another way to think about this is the good old bias-variance tradeoff.

• Anonymous says:

whoops – Robyn Dawes

• Andrew says:

Anon:

You write, “the counterintuitive benefits of shrinkage estimation.” But from a Bayesian perspective it’s not counterintuitive at all!

• K? O'Rourke says:

Stigler argued that it was simply regression to the mean – that Galton understood very well.

Meng provided a rigorous version of that in Meng, Xiao-Li. “From unit root to Stein’s estimator to Fisher’s k statistics: If you have a moment, I can tell you more.” Statistical Science 20.2 (2005): 141-162. (See Stigler’s Galtonian perspective revisited.)

• Andrew says:

Keith:

Yes, regression to the mean in classical statistics is considered a prediction problem, i.e. the values to be predicted are considered random variables, so Bayesian methods are used (even if they are called predictive rather than Bayesian inference).

• K? O'Rourke says:

OK Andrew, I’ll risk stepping into it further.

I thought regression to mean occurred simply because of the presence of a bivariate distribution, perhaps for an unknown parameter and outcome (Bayes) or just for two outcomes, say x and y (regression). Admittedly (I believe) Galton did get Bayes first and regression second and there are always implicit priors involved, but Bayes should mean making those priors explicit and that is not needed for regression to the mean.

Now for hierarchical modeling, the between location differences are being considered _as_ being random like (technically exchangeable) but the uncertainty there is aetiological rather than epistemological. Some (e.g. Sprott and Godambe) have argued that means it is not necessarily Bayes – though I would rather think of it as Bayes. Now I think Fisher thought it was Bayes and that may explain why he disliked hierarchical modeling so much and likely got _upset_ when Cochran and Yates started promoting it for the analysis of repeated agricultural studies in 1937 & 38.

• Z says:

What’s the Bayesian intuition? I fancy myself a Bayesian but it still seems weird to me… I understand that if “real” weights do tend to be close to each other then shrinking should help. But is this a natural intuition to have? Shrinking to 0 on the assumption that most real weights are small seems more natural, but clearly the problem could be with my intuition.

• Andrew says:

Z,

It depends on the context. If you have 100 predictors and you think that only 1 or 2 matter and the others don’t, then a model such as lasso makes sense. But in social science prediction it seems to me to make much more sense that all 100 predictors will have some predictive power, which suggests a model that partially pools the coefficients together.

• Anonymous says:

I think this is a common misconception – that there must be some domain-specific justification for shrinkage. However, Stein’s paradox applies even if the parameters being estimated have absolutely no relation to one another at all.

It’s a much deeper issue than most people realize. It’s unfortunate that the takeaway a lot of people seem to get from the Dawes paper was “This regression stuff doesn’t work, why bother?”.

• David Shor says:

If you ran BayesGLM’s default in a situation with lots of predictors and not many variables, and then rescaled the variables and coefficients as needed so that you can have “weights” in a traditional context, wouldn’t you have roughly equal weights? Is it really necessary/correct to specify additional structure?

• Anonymous says:

Do you mean “not that many observations”?

I’m not familiar with bayesGLM. I believe it has a prior on the coefficients, so in the limit where n is small, I assume the prior will dominate and the coefficients will be similar, if not exactly equal.

Using a prior already imposes some “structure”, but you could do better if you batched regression coefficients in a way that mimics their underlying correlation structure. i.e. the less mis-specified the latent priors on the coefficients are, the better your model should be prediction and efficiency wise. So yes, introducing a prior already introduces some necessary structure to get some shrinkage, but why stop there?

Ultimately the objective here shouldn’t be to pursue “all parameters are equal” as the ideal model, but to understand why the “all parameters are equal” model outperforms a standard regression.

• David Shor says:

Sorry, I meant “Many variables, not many observations”.

I just suspect that the reason why the “all parameters are equal” model outperforms a standard regression is because an all parameters are equal model is closer to “shrink the parameters to zero” model than a standard regression is.

The interesting thing to me is whether the kind of prior Gelman specified would produce superior outcomes in general to simply shrinking the coefficients to zero, which in practice, will give you pretty equal weights in a sparse data setting.

• David Shor says:

“Shrinking to 0 on the assumption that most real weights are small seems more natural, but clearly the problem could be with my intuition.”

My intuition on this is that shrinking to zero automatically will make the coefficients closer to each other, since the coefficients further away from zero get shrunken more than the ones that are closer to zero, standard errors held equal.
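A small worked example of that intuition, as a sketch under assumptions the comment does not state: each coefficient estimate b_hat is normal with standard error se, and the prior is N(0, tau^2). The posterior mean is then b_hat * tau^2 / (tau^2 + se^2). The function name is hypothetical.

```python
import numpy as np

def shrink_to_zero(b_hat, se, tau):
    """Posterior mean of a coefficient under a N(0, tau^2) prior,
    given a normal estimate b_hat with standard error se."""
    b_hat = np.asarray(b_hat, dtype=float)
    return b_hat * tau**2 / (tau**2 + se**2)

# Two estimates with equal standard errors.
shrunk = shrink_to_zero([0.5, 2.0], se=1.0, tau=1.0)   # -> [0.25, 1.0]
```

With equal standard errors every coefficient is multiplied by the same factor, so the larger one moves further in absolute terms and the gap narrows (from 1.5 to 0.75 here), but the ratio between the two coefficients is unchanged.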

“But in social science prediction it seems to me to make much more sense that all 100 predictors will have some predictive power”

Sure. But why would they have the same sign?

I suspect that if you took your 100 variables and did a bunch of bivariate regressions (post-standardized) and set a prior shrinking the betas to zero, and then “flipped” the variables so that the signs would always be positive, your posterior distributions of your weights would be pretty close to equal.

• Andrew says:

David:

Prior knowledge is used to assign the directions of the predictors. For example, if predicting incumbent party’s vote share, GDP growth would have a positive coefficient and unemployment rate would have a negative coefficient.

3. K? O'Rourke says:

One way to make important variables less important is to include variables that are unimportant but correlated.

Perhaps simply including all variables makes them almost “all equal”.

4. Pretty sure you want g_j ~ N(1,s^2) not N(0,s^2).

One thing I’m confused about though is that the only way this kind of equally weighted model can make any sense is if you’ve rescaled the variables; otherwise, depending on the units you measure things in, you will get different results. For example, suppose you’re predicting test scores and one of your predictor variables is how long the pencil is, because you figure that gripping a short pencil will cause fatigue. Another predictor is GPA on a 4.0 scale. Do you measure the pencil in millimeters or meters? I think the right answer is something like

Score ~ GPA/4 + PencilLength/LengthOfNewPencil

In other words, if you don’t at least rescale the variables into dimensionless groups that are all O(1) of each other, you’re just hopelessly lost.

• Andrew says:

Daniel:

1. Typo fixed; thanks.

2. Yes, scaling is necessary. This is a minimal form of prior information. Without knowing how to scale the predictors, you’re dead.

• bxg says:

But if I am understanding the paper correctly, they are z-scaling so that stdev = 1. This scaling is determined by objective properties of the predictor, so it seems strange to describe it as “prior information” – at least if that is supposed to suggest Bayesianism. You don’t need any prior information about an individual predictor to follow this rule, and moreover if you have it there’s no obvious way to apply it. (Unless the prior information is the meta-belief “good weights should shrink towards 1/stddev” but why?)

• In this formulation, you’re forcing all the predictors to have the same scale, possibly you’re also removing the average value?? So the only thing that’s different about them is the *shape* of the distribution. Basically if you assume that all of the predictors *do* have some information about the process, then by rescaling according to standard deviation you’re just preventing any one of them from dominating the prediction, since on average all of them will be O(1) in size.

• bxg says:

That seems like a nice description of what may be going on. In effect, with uniform weights you are reducing it to something like a vote, and normalization (which yes, also removes the mean and also must take care that the sign of the effect is correct) makes sure it’s a tolerably democratic vote.

I’m still not convinced that a Bayesian lens, or the concept of “prior information”, is a particularly appropriate one.

• Well, there are two possibilities for rescaling. One is to rescale based on the data, i.e. create predictors x from raw data y by doing x = d*(y-ybar)/sd(y), where d (for direction) is -1 or 1 depending on whether we expect the predictor to be anticorrelated or correlated with the outcome.

Another option would be d*(y-yhat)/s_y, where yhat and s_y are picked based on some general scientific knowledge; this general scientific knowledge could be considered prior information.
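The two rescaling options can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the input is assumed to be a NumPy array with one column per predictor, and the center and scale values for the second option must come from the analyst (nothing here chooses them).

```python
import numpy as np

def rescale_data_based(raw, d):
    """Option 1: x = d * (y - ybar) / sd(y), column by column.

    raw: array of shape (n, k); d: array of k directions, each -1 or +1.
    """
    raw = np.asarray(raw, dtype=float)
    return d * (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

def rescale_prior_based(raw, d, center, scale):
    """Option 2: x = d * (y - yhat) / s_y, with yhat (center) and s_y (scale)
    supplied from outside knowledge rather than computed from the data."""
    raw = np.asarray(raw, dtype=float)
    return d * (raw - center) / scale

def equal_weight_score(x):
    """Equal-weights composite: the plain sum of the rescaled predictors."""
    return x.sum(axis=1)
```

With the data-based option, measuring pencil length in millimeters or in meters gives identical rescaled values, which is the units problem raised earlier in the thread. With the prior-based option, the units issue is handled by the choice of scale, for instance dividing GPA by 4.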

In building differential equation models, something a lot like this goes under the general heading of “non-dimensionalization”, and there the yhat and s_y are picked based on prior knowledge for the purpose of making important terms O(1) in the new scaling.

The question is, does one approach generally work better than the other? The data based approach is nice because it’s very straightforward to implement, the “prior” based approach might work well when we know that some things are of higher “quality” as predictors than others but not exactly how much.

In general, I’ve found that rewriting models like this improves MCMC procedures because it’s easier to choose jumping scales and to specify priors that work well. I should really blog about it but I’ve been having trouble with my blog. I think I’ve identified a work-around so maybe I will put up something tomorrow.

• One thing that might address the question is if you change from using all the coefficients as 1 to all the coefficients as, say, exponentially distributed with average value 1 (or more generally maybe gamma distributed). Random perturbations to the coefficients (retaining the sign and the order of magnitude) may not make that much difference when there are quite a few predictors. In that context, the Bayesian “prior” version could be considered something like a random perturbation to the uniform coefficient version, and when the priors are “good” it might improve things, and when they’re “not that good” it might not hurt things much.

• Manoel Galdino says:

In the paper, the author also argues that they must have the same sign with respect to the dependent variable, which makes sense to me.
Anyway, I found this idea quite counterintuitive. And I’m wondering what the effect is with the wrong model. In any case, it seems to me that it only makes sense when there is more noise in the data than signal. In this case, fitting the same mean to all coefficients avoids over-fitting.

• Manoel Galdino says:

Ah, I forgot to ask. One doubt: why g_j ~ N(1,s^2) instead of g_j ~ N(a,s^2)?

• Phil says:

Manoel, you can either put the factor of a inside the distribution as you are doing (b_j = g_j, where g_j ~ N(a,s^2)) or put it outside (b_j = a*g_j, where g_j ~ N(1,s^2)). Those are equivalent.
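A quick simulation can check how the two parameterizations line up. One detail worth noting: if g_j ~ N(1,s^2), then a*g_j has mean a and standard deviation |a|*s, so the match with N(a,s^2) is exact once the s inside that second distribution is read as |a|*s. The sketch below assumes a fixed, known a; the function name and the numbers are illustrative only.

```python
import numpy as np

def simulate_b(a, s, n, seed=0):
    """Draw b_j = a * g_j with g_j ~ N(1, s^2)."""
    rng = np.random.default_rng(seed)
    g = rng.normal(loc=1.0, scale=s, size=n)
    return a * g

b = simulate_b(a=2.5, s=0.2, n=200_000)
# b.mean() should be close to a = 2.5,
# and b.std() close to |a| * s = 0.5 (not s = 0.2).
```

So the outside form b_j = a*g_j expresses the spread of the coefficients relative to their common size, while the inside form expresses it in absolute units.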

• Manoel Galdino says:

Thanks, Phil. At first glance I imagined so, but on second thought it seemed that it wasn’t right.

5. […] Gelman linked to and commented on a paper arguing that it is better (in terms of predictive ability) to estimate a regression model with […]