The king must die

“And then there was Yodeling Elaine, the Queen of the Air. She had a dollar sign medallion about as big as a dinner plate around her neck and a tiny bubble of spittle around her nostril and a little rusty tear, for she had lassoed and lost another tipsy sailor.” – Tom Waits

(2021 edit: The equations on this post have died, so I’ve posted a lightly edited version on my distill blog. Enjoy.)

It turns out I turned thirty-two and became unbearable. Some of you may feel, with an increasing sense of temporal dissonance, that I was already unbearable. (Fair point.) Others will wonder how I can look so good at my age. (Answer: Black Metal.) None of that matters to me because all I want to do is talk about the evils of marketing like the 90s were a vaguely good idea. (Narrator: “They were not. The concept of authenticity is just another way for the dominant culture to suppress more interesting ones.”)

The thing is, I worry that the real problem in academic statistics in 2017 is not a reproducibility crisis, so much as that so many of our methods just don’t work. And to be honest, I don’t really know what to do about that, other than suggest that we tighten our standards and insist that people proposing new methods, models, and algorithms work harder to sketch out the boundaries of their creations. (What a suggestion. Really. Concrete proposals for concrete change. But it’s a blog. If ever there was a medium to be half-arsed in, it’s this one. It’s like Twitter for people who aren’t pithy.)

Berätta för mig om det är sant att din hud är doppad i honung (“Tell me if it’s true that your skin is dipped in honey”)

So what is the object of my impotent ire today? Well, nothing less storied than the Bayesian Lasso.

It should be the least controversial thing in this, the year of our lord two thousand and seventeen, to point out that this method bears no practical resemblance to the Lasso. Or, in the words of Law and Order: SVU, “The [Bayesian Lasso] is fictional and does not depict any actual person or event”.

Who do you think you are?

The Bayesian Lasso is a good example of what’s commonly known as the Lupita Nyong’o fallacy, which goes something like this: Lupita Nyong’o had a breakout role in Twelve Years a Slave; she also had a heavily disguised role in one of the Star Wars films (the specific Star Wars film is not important. I haven’t seen it and I don’t care). Hence Twelve Years a Slave exists in the extended Star Wars universe.

The key point is that the (classical) Lasso plays a small part within the Bayesian Lasso (it’s the MAP estimate) in the same way that Lupita Nyong’o played a small role in that Star Wars film. But just as the presence of Ms Nyong’o does not turn Star Wars into Twelve Years a Slave, the fact that the classical Lasso can be recovered as the MAP estimate of the Bayesian Lasso does not make the Bayesian Lasso useful.

And yet people still ask if they can be fit in Stan. In that case, Andrew answered the question that was asked, which is typically the best way to deal with software enquiries. (It’s usually a fool’s game to try to guess why people are asking particular questions. It probably wouldn’t be hard for someone to catalogue the number of times I’ve not followed my advice on this, but in life as in statistics, consistency is really only a concern if everything else is going well.) But I am brave and was not asked for my opinion, so I’m going to talk about why the Bayesian Lasso doesn’t work.

Hiding all away

So why would anyone not know that the Bayesian Lasso doesn’t work? Well, I don’t really know. But I will point out that all of the results that I’ve seen in this direction (not that I’ve been looking hard) have been published in prestigious but obtuse places like the Annals of Statistics, the journal we publish in when we either don’t want people without a graduate degree in mathematical statistics to understand us or when we want to get tenure.

By contrast, the original paper is very readable and published in JASA, where we put papers when we are ok with people who do not have a graduate degree in mathematical statistics being able to read them, or when we want to get tenure.

To be fair to Park and Casella, they never really say that the Bayesian Lasso should be used for sparsity. Except for one sentence in the introduction, where they say the median gives approximately sparse estimators, and the title, which links it to the most prominent and popular method for estimating a sparse signal. Marketing, eh? (See, I’m Canadian now.)

The devil has designed my death and is waiting to be sure

So what is the Bayesian Lasso? (And why did I spend 600 words harping on about something before defining it? The answer will shock you. Actually, the answer will not shock you: it’s because it’s kinda hard to do equations on this thing.)

For data observed with Gaussian error, the Bayesian Lasso takes the form

$latex \mathbf{y} \mid \boldsymbol{\beta} \sim N( \mathbf{X} \boldsymbol{\beta}, \boldsymbol{\Sigma})$

where, instead of putting a Normal prior on $latex \boldsymbol{\beta}$, we put independent Laplace priors

$latex p(\beta_i) = \frac{\lambda}{2} \exp(-\lambda | \beta_i|).$

Here the tuning parameter $latex \lambda = c(n,p,s_0,\mathbf{X})\tilde{\lambda}$ where $latex p$ is the number of covariates, $latex n$ is the number of observations, $latex s_0$ is the number of “true” non-zero elements of $latex \boldsymbol{\beta}$, $latex \boldsymbol{\Sigma}$ is known, and $latex \tilde{\lambda}$ is an unknown scaling parameter that should be $latex \mathcal{O}(1)$.
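
To make the setup concrete, here is a minimal numpy sketch of this generative model. The dimensions, the simplification $latex \boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, and the value of $latex \lambda$ are all my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 200      # illustrative dimensions (my choice)
sigma = 1.0         # assume Sigma = sigma^2 * I for simplicity
lam = 10.0          # Laplace rate; see the scaling discussion below

X = rng.standard_normal((n, p))

# Independent Laplace priors p(beta_i) = (lam / 2) * exp(-lam * |beta_i|),
# i.e. numpy's Laplace distribution with scale 1 / lam.
beta = rng.laplace(loc=0.0, scale=1.0 / lam, size=p)

# Gaussian observation model: y | beta ~ N(X beta, sigma^2 * I).
y = X @ beta + sigma * rng.standard_normal(n)
```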

Important side note: This isn’t exactly the same model as Park and Casella used, as they didn’t use the transformation

$latex \lambda = c(n,p,s_0,\mathbf{X}) \tilde{\lambda}$

but rather just dealt with $latex \lambda$ as the parameter. From a Bayesian viewpoint, it’s a much better idea to put a prior on $latex \tilde{\lambda}$ than on $latex \lambda$ directly. Why? Because a prior on $latex \lambda$ needs to depend on $latex n$, $latex p$, $latex s_0$, and $latex \mathbf{X}$, and hence needs to be changed for each problem, while a prior on $latex \tilde{\lambda}$ can be used for many problems. One possible option is $latex c(n,p,s_0,\mathbf{X}) = 2\|\mathbf{X}\|\sqrt{\log p}$, which is a rate-optimal parameter for the (non-Bayesian) Lasso. Later, we’ll do a back-of-the-envelope calculation that suggests we probably don’t need the square root around the logarithmic term.
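
In code, that scaling is a one-liner. I’m reading $latex \|\mathbf{X}\|$ as the operator norm (largest singular value), which is an assumption on my part:

```python
import numpy as np

def rate_optimal_lambda(X, lambda_tilde=1.0):
    """lambda = c(n, p, s0, X) * lambda_tilde with c = 2 ||X|| sqrt(log p)."""
    p = X.shape[1]
    c = 2.0 * np.linalg.norm(X, ord=2) * np.sqrt(np.log(p))  # ||X|| = largest singular value
    return c * lambda_tilde

X = np.random.default_rng(1).standard_normal((50, 200))
print(rate_optimal_lambda(X))  # an order-1 lambda_tilde gives a problem-scaled lambda
```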

Edit: I’ve had some questions about this scaling, so I’m going to try to explain it a little better. The idea here is that the Bayesian Lasso uses the i.i.d. Laplace priors with scaling parameter $latex \lambda$ on $latex \beta_j$ to express the substantive belief that the signal is approximately sparse. The reason for scaling the prior is that not every value of $latex \lambda$ is consistent with this belief. For example, $latex \lambda = 1$ will not give an approximately sparse signal. While we could just use a prior for $latex \lambda$ that has a very heavy right tail (something like an inverse gamma), this is at odds with the good-practice principle of making sure all of the parameters in your models are properly scaled to be order 1. Why do we do this? Because it makes it much, much easier to set sensible priors.

Other important side note: Some of you may have noticed that the scaling $latex c(n,p,s_0,\mathbf{X})$ can depend on the unknown sparsity $latex s_0$. This seems like cheating. People who do asymptotic theory call this sort of value for $latex \lambda$ an oracle value, mainly because people studying Bayesian asymptotics are really, really into databasing software. The idea is that this is the value of $latex \lambda$ that gives the model the best chance of working. When maths-ing, you work out the properties of the posterior with the oracle value of $latex \lambda$ and then use some sort of smoothness argument to show that the actual method being used to select (or average over) the parameter gives almost the same answer.

Only once in Sheboygan. Only once.

So what’s wrong with the Bayesian Lasso? Well, the short version is that the Laplace prior doesn’t have enough mass near zero, relative to the mass in its tails, to allow for a posterior that has a lot of entries that are almost zero and some entries that are emphatically not zero. Because the Bayesian Lasso prior does not have a spike at zero, none of the entries will be a priori exactly zero, so we need some sort of rule to separate the “zero” entries from the “non-zero” entries. The way we’re going to do this is to choose a cutoff $latex \epsilon$ and assume that if $latex |\beta_j| < \epsilon$, then $latex \beta_j = 0$.
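
In code, the rule is just a componentwise threshold (the vector and the value of $latex \epsilon$ below are arbitrary illustrations):

```python
import numpy as np

def effective_nonzeros(beta, eps):
    """Entries with |beta_j| >= eps are treated as 'non-zero'; the rest count as zero."""
    return int(np.sum(np.abs(beta) > eps))

beta = np.array([0.001, -0.002, 1.3, 0.0005, -2.1])
print(effective_nonzeros(beta, eps=0.01))  # -> 2
```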

So how do we know that the Lasso prior doesn’t put enough mass in the important parts of the parameter space? Well, there are two ways. I learnt it during the exciting process of writing a paper that the reviewers insisted should have an extended section about sparsity (although this was at best tangential to the rest of the paper), so I suddenly needed to know about Bayesian models of sparsity. So I read those Annals of Statistics papers. (That’s why I know I should be scaling $latex \lambda$!)

What are the key references? Well, all the knowledge that you seek is here and here.

But a much easier way to work out that the Bayesian Lasso is bad is to do some simple maths. Because the $latex \beta_j$ are a priori independent, we get a prior on the effective sparsity $latex s_\epsilon$:

$latex s_\epsilon \sim \text{Bin}(p, \Pr(|\beta_j| > \epsilon)).$

For the Bayesian Lasso, that probability can be computed as

$latex \Pr ( | \beta_j | > \epsilon ) = e^{- \lambda \epsilon}$.
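
Both facts are easy to sanity-check by Monte Carlo; the particular $latex \lambda$, $latex \epsilon$, and $latex p$ below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, eps, p = 20.0, 0.05, 1000

exact = np.exp(-lam * eps)                    # Pr(|beta_j| > eps) = e^{-lam * eps}
draws = rng.laplace(scale=1.0 / lam, size=(10_000, p))
s_eps = (np.abs(draws) > eps).sum(axis=1)     # one Bin(p, exact) draw per prior sample

print(exact, (np.abs(draws) > eps).mean())    # both ~ 0.368
print(p * exact, s_eps.mean())                # binomial mean ~ p * e^{-lam * eps}
```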


Ideally, the distribution of this effective sparsity would be centred on the true sparsity.  That is, we’d like to choose $latex \lambda$ so that

$latex \mathbb{E}(s_\epsilon)= p e^{- \lambda \epsilon}= s_0 $.

A quick re-arrangement suggests that

$latex \lambda = \epsilon^{-1} \log(p) - \epsilon^{-1} \log(s_0)$.
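
Plugging in some illustrative numbers (entirely my choices) shows just how big $latex \lambda$ has to be:

```python
import numpy as np

p, s0 = 10_000, 10
eps = 1.0 / p                        # threshold shrinking like p^{-1} (see below)
lam = (np.log(p) - np.log(s0)) / eps

print(lam)                           # ~ 69,000: lambda needs to be enormous
print(p * np.exp(-lam * eps))        # recovers E(s_eps) = s0 = 10, as designed
```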

Now, we are interested in signals with $latex s_0 = o(p)$, i.e. where only a very small number of the $latex \beta_j$ are non-zero. This suggests we can safely ignore the second term. Some other theory suggests that we need to take $latex \epsilon$ to depend on $latex p$ in such a way that it goes to zero faster than $latex p^{-1}$ as $latex p$ gets large. (Edit: Oops – this paper used $latex \mathbf{X} = \mathbf{I}$ and $latex p = n$. For the general case, their reasoning leads to $latex \epsilon = o(\| \mathbf{X} \|^{-1})$.)

This means that we need to take $latex \lambda = \mathcal{O}(p \log(p))$ in order to ensure that we have our prior centred on sparse vectors (in the sense that the prior mean for the number of non-zero components is always much less than $latex p$).

Show some emotion

So for the Bayesian Lasso, a sensible parameter is $latex \lambda = p\log p$, which will usually give a large number of components less than the threshold $latex \epsilon$ and a small number that are larger. But this is still no good. To see this, let’s consider the prior probability of seeing a $latex \beta_j$ larger than one:

$latex \Pr(|\beta_j| > 1) = p^{-p} \rightarrow 0$.

This is the problem with the Bayesian Lasso: in order to have a lot of zeros in the signal, you are also forcing the non-zero elements to be very small. Even for very small values of $latex p$, this probability is infinitesimally small.
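
For a sense of scale, here is a quick computation on the log scale (to avoid underflow):

```python
import numpy as np

for p in [5, 10, 20, 50]:
    # Pr(|beta_j| > 1) = p^{-p}; report the base-10 exponent
    print(p, f"p^-p = 10^{-p * np.log10(p):.1f}")
# p = 10 already gives 10^-10; p = 50 gives 10^-84.9
```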

Basically, the Bayesian Lasso can’t give enough mass to both small and large signals simultaneously. Other Bayesian models (such as the horseshoe and the Finnish horseshoe) can support both simultaneously, and this type of calculation can show that (although it’s harder; see Theorem 6 here).
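
As a rough illustration of that contrast (a Monte Carlo sketch, not a substitute for the theorem; the global scale $latex \tau$, the Laplace rate, and $latex \epsilon$ are all invented for the example), compare the tail mass of a horseshoe-type prior, $latex \beta_j = \tau \lambda_j z_j$ with $latex \lambda_j$ half-Cauchy and $latex z_j$ standard normal, against a Laplace prior that concentrates a similar amount of mass near zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000_000
eps = 0.01

# Horseshoe-type draws: beta = tau * |Cauchy| * Normal, with small global scale tau
tau = 0.01
hs = tau * np.abs(rng.standard_cauchy(N)) * rng.standard_normal(N)

# Laplace draws with a rate high enough to concentrate most mass near zero
lap = rng.laplace(scale=1.0 / 100.0, size=N)

for name, b in [("horseshoe", hs), ("laplace", lap)]:
    print(f"{name}: P(|b| < eps) = {(np.abs(b) < eps).mean():.2f}, "
          f"P(|b| > 1) = {(np.abs(b) > 1).mean():.1e}")
# Laplace tail mass past 1 is e^{-100}, so no draws appear there;
# the horseshoe keeps visible mass past 1 while still piling mass near zero.
```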

Side note: The scaling that I derived in the previous section is a little different to the standard Lasso scaling of $latex \lambda = \mathcal{O} (p \sqrt{\log p})$, but the same result holds: for large $latex p$ the probability of seeing a large signal is vanishingly small.

Maybe I was mean, but I really don’t think so

Now, obviously this is not what we see with real data when we fit the Bayesian Lasso in Stan with an unknown $latex \lambda$. What happens is that the model tries to strike a balance between the big signals and the small signals, shrinking the former too much and letting the latter stray too far from zero. You will see this if you try to fit the Bayesian Lasso in Stan.

Update: I’ve put some extra notes in the text above to make it hopefully a little clearer in one place and a little more correct in another. Please accept this version of the title track by way of an apology.

(Also this stunning version, done wearing even more makeup)

30 thoughts on “The king must die”

  1. People who want sparsity do regular Lasso. If you want to be fully Bayesian, you’re not going to get sparse results (I mean you can if you set small coefficients to zero, but this seems sort of like a hack to me). Bayesian Lasso should be done when the Laplace prior is reasonable for the coefficients of the specific problem. If Bayesian Lasso shrinks big signals too much and allows small signals to be too far from zero, then the Laplace prior is a poor choice of prior. I imagine there are some cases where it works well; there are even some cases where the Gaussian prior works well.

    • You can get sparsity with Bayesian methods (like the horseshoe and the Finnish horseshoe), although what you really are doing is using a sensible high-dimensional prior that doesn’t artificially inflate signals (like the Gaussian will) and then you do a decision analysis to “sparsify” the output.

      • I am not really sure I understand what you are getting at. In a Bayesian context the idea of sparsity cannot be thought of as getting posterior estimates (all of them) to be zero. This does not make sense, and is not possible with any method including the horseshoe. There will need to be some decision rule, or procedure, to impose sparsity on the estimates by either using some interval or, as Aki suggested below, decision analysis.

  2. > we tighten our standards and insist that people proposing new methods, models, and algorithms work harder to sketch out the boundaries of their creations
    So those lawyer-like limitations often in the discussion section that seem to just say “don’t use when not appropriate” are not enough?

    You want to do away with caveat emptor?

  3. “a prior on \lambda needs to depend on n, p, s_0, and X ”

    Why? This strikes me as nonsensical. Neither the number of observations (n) nor the points at which you took the observations (X) influence the effect sizes, nor do n and X *alone* generally give you any information about the effect sizes. Put another way: does your opinion of likely effect sizes change after someone tells you how many observations they took, or even the predictor values for these observations, without telling you anything about the values for the outcome variable?

    • This is an interesting question and is one of the key things I didn’t bring up (because the post was long enough). Our substantive prior knowledge is that our signal is sparse. We are attempting to implement this substantive knowledge using independent Laplace prior distributions with a common scale parameter. The prior on the scale parameter needs to also reflect this substantive knowledge. In the “Sheboygan” section, I derived a simple scaling that depends on p and s_0. (The true sparsity dropped out because the term containing it was much smaller than the other term, but there’s a discussion of the value of oracle parameters somewhere in there.)

      In general, you also need to know something about the *precision* of the experiment, which is encoded in n and X. Why do you need this? Because we need to choose the cutoff epsilon, which will depend on how well the individual beta can be resolved, which is a function of n and X.

      So the scaling is needed to reflect our substantive prior knowledge of sparsity. The extra $latex \tilde{\lambda}$ reflects the fact that we only have an “order of magnitude” idea of the scaling, so we still need to learn the exact value from the data. But with this scaling, we know that $latex \tilde{\lambda}$ should be $latex \mathcal{O}(1)$.

      • The illegitimate step is when you use n and X to choose the threshold epsilon below which you consider a regression coefficient to be effectively zero. A prior is information you have before collecting the data, so epsilon must be chosen without reference to n or X. A legitimate prior doesn’t change as you add new observations: in Bayes’ rule, the factor for the prior is the same regardless of what data you observe or how much data you collect.

        • Kevin:

          Regarding your comment, “A legitimate prior doesn’t change as you add new observations: in Bayes’ rule, the factor for the prior is the same regardless of what data you observe or how much data you collect”:

          Not necessarily. See this recent article with Simpson and Betancourt, “The prior can often only be understood in the context of the likelihood.”

        • A short version of an answer to this: the fact that $latex \epsilon$ needs to depend on $latex \mathbf{X}$ is really an ugly consequence of using a continuous approximation to a spike and slab prior. If $latex \epsilon$ was fixed, then a p-dimensional vector that was supposed to approximate zero could have a norm as large as $latex \epsilon p$. Now if we want this to be negligible (because this is supposed to approximate a vector of zeros), we need $latex \epsilon$ to decrease with $latex p$.

          One way to get out of this particular problem is to throw the “sparsity” part of the model to some sort of utility function and treat sparsity as a decision analysis. While I’m completely and totally happy with that (as I said in a previous comment), that’s not the soil on which the Bayesian Lasso was built. And I think it’s very important to meet the proposed method where it is.

          This is not the only case where parameters in the prior need to be scaled based on outside information in order for them to have the interpretation that they are intended to have. But that discussion is too long for a comment, which probably means I’ll blog about it at some point.

        • Interestingly, of course, in Bayesian decision theory you choose an action to minimize the expected loss. The expected loss is more or less

          integrate(P(Data | Parameters) P(Parameters) U(Outcome(Parameters)), Parameters)

          The p(Data|Parameters) is a description of how you think the world works… but the p(Parameters) U(Outcome(Parameters)) is a function of parameters, and the two components aren’t “identifiable”

          So, the right way to think about this issue is I think to realize that you’re not “picking a prior” but rather choosing a risk function which is both a prior and a utility multiplied together… And if your Utility has a strange form like “the number of nonzero parameters needs to be exactly N” then you’ll expect all of it to depend on lots of things that a pure inference about the truth wouldn’t depend on.

    • Maybe a prior should be thought of as how much background/external information is best to bring to bear to learn from this data set.

      The analysis should stand on the data set as much as is reasonable – not any more or less. The goal being to get a purposeful and convincing analysis rather than truly representing one’s prior and truly updating that with all the data. I don’t believe anyone can truly state their prior, and it’s almost never the case that all the data is used/usable.

      This may sound like “don’t take any wooden nickels” but maybe the meta-statistics of discerning fully adequate and separate priors and data generating models is just a poor meta-statistics.

      • Oh absolutely. That’s basically my current obsession: working out how to tell if/when your prior/model is doing what you built it to do. For super easy cases like sparsity, the goal of the inference is sufficiently simple (have lots of zeros and some big things) that you can say quite precise things.

        In general, it’s not nearly as straightforward.

  4. Distinctly off topic but still on topic somehow:

    I don’t know anything about lassos. But since you weirdly enough have quoted two Swedish singers, you should also check out Anna Järvinen. Like with Säkert and Frida Hyvönen, it’s much better if you understand the lyrics. Not that that has stopped you before. Oh yeah, also Nina/Nino Ramsby.
