How to use lasso etc. in political science?

Tom Swartz writes:

I am a graduate student at Oxford with a background in economics and on the side am teaching myself more statistics and machine learning. I’ve been following your blog for some time and recently came across this post on lasso. In particular, the more I read about the machine learning community, the more I realize how little of this work is incorporated into the majority of economics research.

I was wondering if you could give some advice on how to use techniques such as lasso, which retain a certain degree of interpretability, in a setting like economics or political science. Given that the goal is largely to describe, rather than just to optimize an uninterpretable model, how would you use such techniques in a way that reduces variance and the overestimation of point estimates while still interpreting particular coefficients in a meaningful way?

My reply:

I don’t really buy the idea that lasso gives more interpretability; I think of it as a way to regularize inferences from regression. In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model. Think of it this way: the coefficient of some predictor x represents a comparison of two items that differ in x while being identical in all other predictors of the model. Typically this only has a clear interpretation if x is the “last” predictor in the model, so that all the other predictors come “before” it.

Regularization is great, I just think the way to think of lasso is as a way of regularizing a regression model. The model is what’s important. What’s good about lasso and other regularizers is that they allow you to fit a regression model with lots of predictors. But the interpretability, or lack thereof, is a property of the regression model, not of the regularization.
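
To make this concrete, here's a minimal R sketch with simulated data and the glmnet package: the same linear regression, fit with and without the lasso penalty.

```r
library(glmnet)

set.seed(123)
n <- 100; p <- 40
x <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))    # a few real effects, many nulls
y <- as.vector(x %*% beta + rnorm(n))

ols   <- lm(y ~ x)                      # the unregularized regression
lasso <- cv.glmnet(x, y, alpha = 1)     # the same model, lasso-regularized (alpha = 1)

coef(lasso, s = "lambda.min")           # coefficients shrunk, some exactly to zero
coef(lasso, s = "lambda.1se")           # heavier regularization, sparser fit
```

Either way, any question about what a single coefficient means is a question about the regression model itself, not about the penalty used to fit it.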

28 thoughts on “How to use lasso etc. in political science?”

  1. “Typically this only has a clear interpretation if x is the “last” predictor in the model, so that all the other predictors come “before” it”

    What about mediation analysis?

  2. One of the great things about Bayesian modeling is that it gives, in principle, a single *method* that applies to all problems. This makes it obvious that the real issue is in building the model appropriately, since there’s nothing to be had in tuning or choosing the method. (Well, at least not the theoretical method; there’s plenty to be had in the computational step, though, as Stan shows us!)

    • But we miss all those weeks/months of tricky math and higher order asymptotic expansions every time we make minor modifications!

      (Of course sampling from posteriors can be tricky too.)

  3. Sure, we penalize the likelihood function (or use informative priors in Bayes land) for the purpose of regularization. But then why use the Lasso? A double-exponential prior is just weird.

    I’m sure most people use the Lasso for bad reasons — e.g. someone told them parsimony is always a good thing, or they interpret it as a big hypothesis test (‘the coefficient of variable A is shrunk to zero, so it must not have any effect’), or because it gives the illusion of interpretability (high-dimensional models are just uncomfortable to think about). If we cannot really interpret individual coefficient estimates (and I tend to agree with that), then what’s the point of inducing sparsity? A hundred nonzero coefficients are no less interpretable than ten.

    In fact, the only good reason I can think of for using the Lasso is laziness: smaller models with fewer variables are just more convenient. But I don’t remember seeing anyone use that as a justification.

    • Suppose you plan to use the fitted model prospectively for some purpose. Shrinking trivial coefficients to zero preserves most of the predictive utility of the model while reducing the number of variables one needs to measure in order to use it. Lasso may not solve any one decision problem of this form optimally, but in lots of practical situations it’s probably roughly optimal.
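
      A quick way to see this with fake data and glmnet in R: compare held-out prediction error and the number of variables you would actually have to measure to use each fit.

      ```r
      library(glmnet)

      set.seed(1)
      n <- 300; p <- 60
      x <- matrix(rnorm(n * p), n, p)
      beta <- c(rep(1, 5), rep(0.05, 10), rep(0, p - 15))   # a few big effects, many trivial ones
      y <- as.vector(x %*% beta + rnorm(n))
      train <- 1:200; test <- 201:300

      full   <- lm(y ~ ., data = data.frame(y = y, x)[train, ])
      sparse <- cv.glmnet(x[train, ], y[train], alpha = 1)

      mean((y[test] - predict(full, data.frame(x)[test, ]))^2)          # full-model test error
      mean((y[test] - predict(sparse, x[test, ], s = "lambda.1se"))^2)  # sparse-model test error
      sum(as.vector(coef(sparse, s = "lambda.1se"))[-1] != 0)           # variables you'd need to measure
      ```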

    • What’s weird about a double-exponential prior? It’s the maxent prior for something whose mean absolute deviation from a given value is known (in the lasso case I guess it’s mean absolute deviation from 0).

      I often use exponential priors for positive values where I suspect I know the approximate size of the thing. If I know it can’t be zero, I’ll sometimes go to a gamma(1+epsilon,1/x) which is exponential for epsilon=0.

      Doesn’t seem weird to me.

      • Daniel:

        I think lasso can be useful but actually I agree about the weirdness of the prior. This is perhaps worth a research paper . . . the basic idea is that parameters of interest typically have additive and multiplicative interactions. Random additive interactions will give the density a certain smoothness—adding a random variable is a convolution of the density—and so any reasonable model should be continuously differentiable. That is, the “corner” of the double-exponential density at 0 doesn’t make sense.

        Here’s a model where the double-exponential could make sense: if the sign of the parameter is fixed and there is only multiplicative interaction. It’s my impression that a lot of researchers think this way, but to me it makes sense for there to be additive as well as multiplicative variation, and this puts constraints on how sharp the variation can be in the prior.

        Again, that doesn’t make the model useless. But I do agree that for the problems I work on, it can’t really be the right model.

        • Ok, so what if you took the double exponential and convolved it with a normal with trivially small standard deviation (epsilon = 0.000001 for a problem where you’re using double-exponential with scale 1 for example). The density would be trivially different everywhere, except that it would be infinitely differentiable at the corner. Does the smoothness now actually affect the output of your model really?

          I mean, would you get meaningfully different samples out of Stan?

          I think if the only thing you think is wrong with the double exponential is the corner… then that’s not really a strong indictment by itself.

          I think it’s particularly true that the double exponential is a decent prior when you really do expect the parameter to be somewhere about a certain size away from 0 but you don’t know which direction.
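
          For what it's worth, this is easy to check numerically in base R: convolve a unit-scale double exponential with a normal with sd 0.01 and see how far the density actually moves.

          ```r
          dlaplace <- function(x, b = 1) exp(-abs(x) / b) / (2 * b)

          eps <- 0.01
          xs  <- seq(-4, 4, by = 0.01)
          # numerical convolution of Laplace(0, 1) with Normal(0, eps)
          smoothed <- sapply(xs, function(x0)
            integrate(function(u) dlaplace(x0 - u) * dnorm(u, sd = eps),
                      lower = -10 * eps, upper = 10 * eps)$value)

          max(abs(smoothed - dlaplace(xs)))   # largest change, right at the corner at 0
          ```

          With eps that small the change in the density is tiny and concentrated at the corner, so the smoothed version is, for practical purposes, the same prior.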

  4. > Regularization is great, I just think the way to think of lasso is as a way of regularizing a regression model. The model is what’s important. What’s good about lasso and other regularizers is that they allow you to fit a regression model with lots of predictors. But the interpretability, or lack thereof, is a property of the regression model, not of the regularization.

    +1

  5. There’s a very good reason we don’t use regularisation/machine learning techniques in econometrics: (outside of financial econometrics) we don’t build predictive models. Machine learning techniques are very good at estimating E[Y|X]. But in econometrics, we’re almost exclusively interested in estimating the (causal) parameters of an economic model; we’re interested in E[Y|do(X)]. Our structural errors are never orthogonal to our RHS variables. We always have endogeneity. Better predictive models will just result in more confidence about a wrong-er result.

    That said, I’m a big believer that we can use some machine-learning techniques to help in econometrics. Here’s a paper I’m currently writing on using random forest proximity scores to give weight to analogous histories (so that we wind up with parameters that are relevant to today, and so that the posterior predictive distribution blows up when our history is not analogous):
    https://github.com/khakieconomics/Thesis_work/blob/week_t_minus_2/Outline%20of%20thesis.Rmd

    I’m also working on a paper with Tan and Miller using the random forest’s proximity matrix to build synthetic control groups by matching. It gets around the dimensionality issues of exact matching while still effectively matching on the Xs. And because of the construction of OOB proximity scores, matches are robust; Smith/Todd is less relevant. A huge improvement, and extremely simple!
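
    For anyone curious what the proximity idea looks like mechanically, here is a stripped-down R sketch with fake data and the randomForest package (not the thesis code): grow a forest of treatment on covariates and use the proximity matrix to find each treated unit's closest control.

    ```r
    library(randomForest)

    set.seed(42)
    n <- 500
    X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5))
    treat <- rbinom(n, 1, plogis(X$x1 - X$x2))

    rf <- randomForest(x = X, y = factor(treat), proximity = TRUE)
    prox <- rf$proximity            # n x n: share of trees in which units i and j share a leaf

    treated  <- which(treat == 1)
    controls <- which(treat == 0)
    # for each treated unit, the control it most often shares a leaf with
    matches <- controls[apply(prox[treated, controls], 1, which.max)]
    head(data.frame(treated = treated, matched_control = matches))
    ```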

  6. I’m not sure the MV double-exponential is a weird prior. The argument about convolutions and continuous differentiability was too advanced for me to follow, but intuitively the MVDE seems reasonable as a heavier-tailed alternative to the MV normal, even if we’re not primarily motivated to achieve sparsity. Unlike the MVN, it doesn’t tend to make all the parameter estimates become more similar to each other as the amount of shrinkage increases, which might be desirable. MVN and MVDE (and elastic net) seem to be good reference priors for regularisation.

    On the other hand, I have concerns about your suggestion to use a product of t/cauchy distributions as a default regularisation prior. There are two potential problems with this. First, t-distributions do not have an equal gradient at all points of the influence function: the amount of shrinkage applied reaches a peak and then declines to zero. This means that the relative shrinkage of the different parameters will depend on the scale chosen, which makes it a bit more fiddly to use as a default prior.

    The second concern is that, whilst the MV normal is strictly convex, and the DE is weakly convex, the contours of the MV t/cauchy are strongly concave. I could be wrong, but doesn’t this mean that highly correlated parameters will lead to multi-modal posteriors, with modes at each axis? In other words, when two continuous covariates are strongly positively correlated, the MVT prior would give a posterior indicating two sparse solutions, one for each parameter. In contrast, strictly convex priors such as MVN/elasticnet have an averaging and grouping effect.

    • Mat:

      My comment about the prior being weird refers to the behavior of the prior around 0 where it is not differentiable. Your first paragraph above refers to the tail. The double-exponential prior can be reasonable at the tails but not right at 0. In practice, the behavior right at 0 might not matter so much, which is why I wrote that the prior could be ok in practice. Speaking from first principles: if there is some tail behavior you want, there is no reason this needs to drive the behavior at 0.

      Regarding your second and third paragraphs: there seem to be two issues here. One issue is what model better captures the distribution of the underlying population of parameters. The other issue is what can be comfortable to compute. In any case, I think it can be fine that the proportion of shrinkage depends on where you are: that’s in some sense the point of using these non-normal models. To put it another way, the normal model is computationally convenient and we will use it where possible—but when it is not giving reasonable answers, we might need to go to something more realistic even though it is harder to compute.
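
      One way to see both points at once: for a single observation y ~ N(theta, 1), compute the posterior mode of theta across a range of y under normal, double-exponential, and Cauchy priors. A rough base-R sketch:

      ```r
      post_mode <- function(y, log_prior)
        optimize(function(th) dnorm(y, th, 1, log = TRUE) + log_prior(th),
                 interval = c(-20, 20), maximum = TRUE)$maximum

      lp_normal  <- function(th) dnorm(th, 0, 1, log = TRUE)
      lp_laplace <- function(th) -abs(th)                 # unit-scale double exponential, up to a constant
      lp_cauchy  <- function(th) dcauchy(th, 0, 1, log = TRUE)

      ys <- seq(0, 6, by = 0.5)
      round(data.frame(y = ys,
                       normal  = sapply(ys, post_mode, log_prior = lp_normal),
                       laplace = sapply(ys, post_mode, log_prior = lp_laplace),
                       cauchy  = sapply(ys, post_mode, log_prior = lp_cauchy)), 2)
      ```

      The normal prior shrinks proportionally, the double exponential subtracts a roughly constant amount (soft thresholding, with the kink at zero), and the Cauchy shrinks small estimates but largely leaves big ones alone — the scale-dependent shrinkage profile discussed above.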

  7. Thanks for the explanation. The logistic is similar in the tails to the DE but differentiable at zero; has anyone considered this as a regularisation prior?

    I’m interested in your thoughts on my second point, which is that t/cauchy priors might have computational/identification issues with correlated covariates. You say you are not interested in sparse solutions, but my intuition with, say, two correlated covariates is that you would get multi-modal posteriors, where each mode represents a “sparse” solution (i.e. one parameter very large and the other near zero) with very different substantive implications. A ridge prior, or any prior with a similarly convex penalty, would produce a uni-modal posterior, and the lasso would produce a posterior with a ridge. These known issues with the lasso were what prompted the development of the elastic net, but it seems to me that t/cauchy might have this problem to an even greater degree.

    The examples in your paper used binary covariates, where this problem might not apply, but you advocated the t/cauchy as a general default prior, and it’s worth looking into its properties with correlated continuous covariates.
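
    This is straightforward to poke at with fake data in base R: simulate two highly correlated covariates, put independent Cauchy priors on their coefficients, and evaluate the log posterior on a grid to see whether it stays unimodal or starts piling up near the axes.

    ```r
    set.seed(7)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- 0.995 * x1 + sqrt(1 - 0.995^2) * rnorm(n)     # corr(x1, x2) around 0.995
    y  <- x1 + x2 + rnorm(n)

    log_post <- function(b1, b2)
      sum(dnorm(y, b1 * x1 + b2 * x2, 1, log = TRUE)) +
      dcauchy(b1, 0, 1, log = TRUE) + dcauchy(b2, 0, 1, log = TRUE)

    grid <- expand.grid(b1 = seq(-1, 3, by = 0.05), b2 = seq(-1, 3, by = 0.05))
    grid$lp <- mapply(log_post, grid$b1, grid$b2)

    # inspect the surface (e.g. with contour()) and the top grid points to see
    # how the shape changes as the correlation and the prior scale are varied
    head(grid[order(grid$lp, decreasing = TRUE), ], 10)
    ```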

    • Yeah, fair point. In the causal inference social science literature, the idea is both there and still imperfectly formulated (in my opinion – maybe Judea or Guido or someone else way better at this will come school me). But if you think of the “important” covariates in a model as being those related to the identifying variation (so, the pre/post T/C dummies in a diff-in-diff, or state and year fixed effects in a state-year panel), then the variable of interest (the interaction of post*T in the diff-in-diff, or the covariate of interest in the panel) is sort of “last” in that it latches on to the remaining variation.

      You see this kind of thinking behind regression tables where people start with just the fixed effects, and then add in more and more covariates and see what happens to the coefficient of interest. The idea is that most covariates (the X’s in the model that we are not interested in interpreting but include to soak up variation) don’t matter, and your treatment variable is (or should be) orthogonal to them (conditional on the fixed effects) – hence their inclusion shouldn’t change the coefficient of interest.

      So I think of it this way: there are two kinds of covariates in the model that come “before” the covariate of interest (the one we want to causally interpret): variation-isolating fixed effects (period, unit, region), such that given those, everyone in some group has the same value of treatment T; and characteristics/Xs that are included to decrease standard errors and check robustness.

      But I agree that the terminology is bad and my reasoning is at best under-developed. Jonah Gelbach has an interesting paper on thinking about the X’s and how “last” doesn’t really make sense*, but that is a different context. And I’ve never seen a really good paper describing the difference between the kind of “fixed-effects” I’m discussing and other X’s, but I think there is something to that.

      *http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1425737
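
      The “add covariates and watch the coefficient of interest” exercise, in a bare-bones R simulation with fake data and base lm, looks like this:

      ```r
      set.seed(99)
      n <- 1000
      d <- data.frame(treat = rbinom(n, 1, 0.5), post = rbinom(n, 1, 0.5),
                      x1 = rnorm(n), x2 = rnorm(n))
      d$y <- d$treat + 0.5 * d$post + 2 * d$treat * d$post +   # true effect of interest = 2
             0.8 * d$x1 + rnorm(n)

      coef(lm(y ~ treat * post,           data = d))["treat:post"]
      coef(lm(y ~ treat * post + x1,      data = d))["treat:post"]
      coef(lm(y ~ treat * post + x1 + x2, data = d))["treat:post"]
      ```

      If the treatment really is (conditionally) orthogonal to the X’s, adding them should mostly tighten the standard error rather than move the coefficient on treat:post.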

  8. People in genetic studies like lasso because it gives a small set of non-zero predictors. Then they interpret this (wrongly) as THE set of predictors, and almost as THE set of causal variants. They do not realize that removing 5% of the data will change the selected predictors, and that there are many possible sets. This also comes from the deterministic nature of the lasso algorithm, which gives one answer (compared to equivalent Bayesian MCMC solutions).

    Of course this is wrong, but lasso gives more, wrong, interpretability.

    The question is: are there methods that lend themselves easily to wrong interpretations?

    • I think lots of biology people have strong priors that only a few genes are typically involved in some process, but when they are involved they will have a biggish effect, and under that assumption the output of a maximum penalized likelihood with double-exponential penalties isn’t a bad place to start necessarily. Though I agree that it’s probably not as cut-and-dried as a naive user might want to believe. Another way to interpret this is as a kind of asymptotic model, where you’re pulling out the biggest terms and just setting the others to zero provided they’re small enough. It can be useful to use such models, though I agree it’s worthwhile to know what you’re doing when doing so.
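
    The instability is easy to demonstrate with simulated data and glmnet in R: refit the lasso on random 95% subsamples and tabulate how often each variable lands in the selected set.

    ```r
    library(glmnet)

    set.seed(2024)
    n <- 150; p <- 100
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(1, 1, rep(0.25, 8), rep(0, p - 10))   # two strong and eight modest true effects
    y <- as.vector(x %*% beta + rnorm(n))

    selected <- replicate(50, {
      keep <- sample(n, size = round(0.95 * n))     # drop 5% of the data
      fit  <- cv.glmnet(x[keep, ], y[keep], alpha = 1)
      as.vector(coef(fit, s = "lambda.1se"))[-1] != 0
    })
    round(rowMeans(selected)[1:20], 2)              # selection frequency of the first 20 variables
    ```

    Variables with modest effects tend to drift in and out of the “chosen” set from one subsample to the next, which is exactly the over-interpretation trap described above.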
