Standardization and an implicit hierarchical model

As I’ve discussed here on occasion, I like to standardize continuous regression inputs by dividing by two standard deviations. That way the rescaled variables each have sd of 1/2, which is approximately the same sd as any binary predictor, allowing the coefficients to be interpreted together.
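
For concreteness, here's a minimal sketch in base R (the function and the example variables are just illustrative, not code from any package):

    # center and divide by two standard deviations, so the rescaled input has sd 1/2
    rescale_2sd <- function(x) (x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE))

    # illustrative data: a continuous input and a 0/1 indicator
    height <- rnorm(100, 170, 10)
    female <- rbinom(100, 1, 0.5)

    sd(rescale_2sd(height))   # 0.5 by construction
    sd(female)                # roughly 0.5 when the split is near 50/50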

Standardizing is often thought of as a stupid sort of low-rent statistical technique, beneath the attention of “real” statisticians and econometricians, but I actually like it, and I think this 2 sd thing is pretty cool.

As Aleks pointed out, however, standardizing based on the data is not strictly Bayesian, because the interpretation of the model parameters then depends on the sample data. As we discussed, a more fully Bayesian approach would be to think of the scale used for standardization as an unknown parameter that is itself estimated from the data.
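
To spell out one way of reading that point (this gloss is mine, not Aleks's wording): if an input x is rescaled to z = (x - mean(x)) / (2*s_x), where s_x is the sample sd, then the coefficient on z is just 2*s_x times the coefficient on x. That sample sd is really an estimate of an unknown population scale sigma_x, so a fully Bayesian treatment would give sigma_x a prior and estimate it jointly with the regression coefficients, which is the implicit hierarchical model of the title.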

P.S. Recall that “inputs” are not the same as “predictors.”

P.P.S. I scale by 2 sd to be consistent with 0/1 predictors. In retrospect, I wish I’d scaled by 1 sd and then coded binary predictors as -1 and 1 to be consistent. This would’ve been simpler overall. But I think it’s too late now.
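
For what it's worth, here is what that alternative would look like, again as an illustrative base R sketch rather than code from any package:

    # scale continuous inputs by one sd, and recode 0/1 binary predictors to -1/+1
    rescale_1sd <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

    height <- rnorm(100, 170, 10)
    female <- rbinom(100, 1, 0.5)

    sd(rescale_1sd(height))   # exactly 1
    sd(2 * female - 1)        # roughly 1 when the split is near 50/50

Both kinds of predictors would then sit on a scale with sd near 1, so the coefficients would still be directly comparable.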

8 thoughts on “Standardization and an implicit hierarchical model”

  1. > Recall that "inputs" are not the same as "predictors."

    Could you clarify this for the slow-witted amongst us?

  2. That's interesting. I have been taught instead to recode all continuous and pseudo-continuous variables to range from 0 to 1, with 0 being the lowest value in the data set (or the lowest possible) and 1 being the highest value in the data set (or the highest possible).

    The idea was to make interpretation of the coefficients easy: for any variable, going from one end of the scale to the other corresponds to a change in the dependent variable equal to the size of the coefficient.
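
    In code, that recoding is something like this (illustrative base R, with made-up names):

        # map a variable to [0, 1] based on its observed (or, if known, possible) range
        to_unit_range <- function(x, lo = min(x, na.rm = TRUE), hi = max(x, na.rm = TRUE)) {
          (x - lo) / (hi - lo)
        }

        income <- rlnorm(100, 10, 1)      # illustrative right-skewed variable
        income01 <- to_unit_range(income)
        # the coefficient on income01 is then the change in the outcome
        # associated with going from the lowest to the highest observed income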

  3. Stealing directly from the man himself:

    "The set of input variables is not, in general, the same as the set of predictors. For example, in a regression of earnings on height, sex, and their interaction, there are four predictors (the constant term, height, sex, and height × sex), but just two inputs: height and sex".

  4. Never too late!
    (coded binary predictors as -1 and 1)

    Just add it as an option to the R procedure and make it the option used in the first example shown.

    Really liked this: "We recommend it as an automatic adjunct to displaying coefficients on the original scale." But archiving the data would be even better.

    Keith

  5. A couple of comments from an applied perspective. I've recommended this technique to analysts and used it in presentations with clients. It's very helpful.

    From my point of view, you have it right as it is, by the KISS criterion. Clients understand 1/0 dummy coding; 1/-1 would take extra time to explain, although that would probably be tolerable.

    They also understand dividing by the standard deviation (or by 2 sd). Presentations like this usually have a lot of statistical results, and doing as Aleks suggested would be just one more thing to explain. I have quite enough things to explain already.

  6. "I have been taught to rather recode all continuous and pseudo-continuous variables to range from 0 to 1 (0 being the lowest variable in the data set (or possible) and 1 being the highest variable in the data set (or possible)."

    That might have interesting properties for variables like income or C-reactive protein (CRP) that have a lot of informative large values. I suppose you could transform and then rescale, but I find the interpretation of the coefficients gets harder the more I alter them.

    People understand a change of X per Y dollars. I can explain a change of Z per standard deviation of income. But anything more than that and this epidemiologist's head starts to hurt.

  7. I tend to agree with Zbicyclist: 0/1 codings are generally more natural than -1/1. Moreover, much of the time continuous covariates have natural codings (dollars, years, whatever) that non-quantitative people prefer to preserve. (This is why most political scientists are taught not to standardize variables — and sometimes not even to rescale them — and consequently why our journals are lousy with estimates like "-2.21e06").

  8. Coding to -1/1 with 1 sd normalization has long been the standard in many fraud-modeling shops, for exactly the reasons stated.

    To my own knowledge, its use as standard practice in fraud modeling dates back to the late '80s.

    It is also pretty common in the neural network community, where they prefer to separate bias terms from normal inputs.
