Comments on: The king must die

By: Daniel Lakeland

Daniel Lakeland — Mon, 06 Nov 2017 21:07:01 +0000

Interestingly of course, in a Bayesian decision theory you choose an action to minimize the expected loss. The expected loss is more or less

integrate(P(Data | Parameters) P(Parameters) U(Outcome(Parameters)), Parameters)

The P(Data | Parameters) is a description of how you think the world works… but the P(Parameters) U(Outcome(Parameters)) is a function of parameters, and its two factors, the prior and the utility, aren’t separately “identifiable”.

So, the right way to think about this issue is, I think, to realize that you’re not “picking a prior” but rather choosing a risk function which is both a prior and a utility multiplied together… And if your Utility has a strange form like “the number of nonzero parameters needs to be exactly N” then you’ll expect all of it to depend on lots of things that a pure inference about the truth wouldn’t depend on.
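A minimal sketch of the non-identifiability point, assuming a discrete three-point parameter grid in place of the integral. All names and numbers here (`likelihood`, `prior_a`, `utility_a`, and so on) are made up for illustration: two different prior/utility pairs with the same pointwise product give the same expected loss, so the data cannot tell them apart.

```python
# Hypothetical three-point parameter grid; every number is invented for illustration.
likelihood = [0.2, 0.5, 0.3]  # P(Data | theta) at each grid point


def expected_loss(likelihood, prior, utility):
    """Discrete stand-in for integrate(P(Data|theta) P(theta) U(theta), theta)."""
    return sum(l * p * u for l, p, u in zip(likelihood, prior, utility))


# Pair A: one prior and one utility.
prior_a = [0.5, 0.25, 0.25]
utility_a = [1.0, 2.0, 4.0]

# Pair B: a different prior, with the utility rescaled so prior * utility is unchanged.
prior_b = [0.25, 0.25, 0.5]
utility_b = [2.0, 2.0, 2.0]

loss_a = expected_loss(likelihood, prior_a, utility_a)
loss_b = expected_loss(likelihood, prior_b, utility_b)

# The two pairs are indistinguishable from the expected loss alone.
print(loss_a, loss_b)
```

Only the product `prior * utility` enters the sum, which is why it makes sense to treat that product as the single object being chosen.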

By: Dan Simpson

Dan Simpson — Sun, 05 Nov 2017 21:04:27 +0000

In reply to Kevin S Van Horn.

A short version of an answer to this: the fact that $latex \epsilon$ needs to depend on $latex \mathbf{X}$ is really an ugly consequence of using a continuous approximation to a spike and slab prior. If $latex \epsilon$ was fixed, then a p-dimensional vector that was supposed to approximate zero could have a norm as large as $latex \epsilon p$. Now if we want this to be negligible (because this is supposed to approximate a vector of zeros), we need $latex \epsilon$ to decrease with $latex p$.

One way to get out of this particular problem is to throw the “sparsity” part of the model to some sort of utility function and treat sparsity as a decision analysis. While I’m completely and totally happy with that (as I said in a previous comment), that’s not the soil on which the Bayesian Lasso was built. And I think it’s very important to meet the proposed method where it is.

This is not the only case where parameters in the prior need to be scaled based on outside information in order for them to have the interpretation that they are intended to have. But that discussion is too long for a comment, which probably means I’ll blog about it at some point.
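The scaling argument above can be checked numerically. This is a rough sketch, assuming the worst case where every entry of the "zero" vector sits exactly at the threshold; the comment does not say which norm is meant, and the $latex \epsilon p$ figure matches the l1 norm (the l2 norm grows like $latex \epsilon \sqrt{p}$ instead). The function name and the choice `eps = c / p` are illustrative assumptions.

```python
import math


def worst_case_norms(eps, p):
    """Norms of a p-vector whose entries all sit at the threshold eps."""
    beta = [eps] * p
    l1 = sum(abs(b) for b in beta)            # equals eps * p
    l2 = math.sqrt(sum(b * b for b in beta))  # equals eps * sqrt(p)
    return l1, l2


# Fixed eps: the "negligible" vector gets larger as p grows.
for p in (10, 1000, 100000):
    l1, l2 = worst_case_norms(0.01, p)
    print("fixed eps", p, l1, l2)

# eps shrinking like c / p (an assumed scaling): the l1 norm stays near c.
c = 0.1
for p in (10, 1000, 100000):
    l1, _ = worst_case_norms(c / p, p)
    print("scaled eps", p, l1)
```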

By: Andrew

Andrew — Sat, 04 Nov 2017 18:38:58 +0000

In reply to Kevin S Van Horn. Kevin: Regarding your comment, "A legitimate prior doesn’t change as you add new observations: in Bayes’ rule, the factor for the prior is the same regardless of what data you observe or how much data you collect": Not necessarily. See this recent article with Simpson and Betancourt, "The prior can often only be understood in the context of the likelihood."

By: Kevin S Van Horn

Kevin S Van Horn — Sat, 04 Nov 2017 18:28:12 +0000

In reply to Dan Simpson. The illegitimate step is when you use n and X to choose the threshold epsilon below which you consider a regression coefficient to be effectively zero. A prior is information you have before collecting the data, so epsilon must be chosen without reference to n or X. A legitimate prior doesn't change as you add new observations: in Bayes' rule, the factor for the prior is the same regardless of what data you observe or how much data you collect.

By: Dan Simpson

Dan Simpson — Fri, 03 Nov 2017 13:22:16 +0000

In reply to Jens Åström. I will take women singing about big feelings in any language.

By: Dan Simpson

Dan Simpson — Fri, 03 Nov 2017 13:19:16 +0000

In reply to Keith O'Rourke. Oh absolutely. That’s basically my current obsession: working out how to tell if/when your prior/model is doing what you built it to do. For super easy cases like sparsity, the goal of the inference is sufficiently simple (have lots of zeros and some big things) that you can say quite precise things. In general, it’s not nearly as straightforward.

By: Keith O'Rourke

Keith O'Rourke — Fri, 03 Nov 2017 12:43:17 +0000

In reply to Kevin S Van Horn.

Maybe a prior should be thought of as how much background/external information is best to bring to bear to learn from this data set.

The analysis should stand on the data set as much as is reasonable – not any more or less. The goal being to get a purposeful and convincing analysis rather than truly representing one’s prior and truly updating that with all the data. I don’t believe anyone can truly state their prior and it’s almost never the case that all the data is used/usable.

This may sound like “don’t take any wooden nickels” but maybe the meta-statistics of discerning fully adequate and separate priors and data generating models is just a poor meta-statistics.

By: Jens Åström

Jens Åström — Fri, 03 Nov 2017 07:53:12 +0000

Distinctly off topic but still on topic somehow:

I don’t know anything about lassos. But since you weirdly enough have quoted two Swedish singers you should also check out Anna Järvinen. Like with Säkert and Frida Hyvönen it’s much better if you understand the lyrics. Not that that has stopped you before. Oh yeah, also Nina/Nino Ramsby.

By: Patrick B

Patrick B — Fri, 03 Nov 2017 03:32:11 +0000

In reply to Wyman. +1

By: Donny Williams

Donny Williams — Fri, 03 Nov 2017 01:16:57 +0000

In reply to Donny Williams. My comment was meant for a different comment. Please disregard!

By: Donny Williams

Donny Williams — Fri, 03 Nov 2017 01:03:26 +0000

In reply to Dan Simpson. I am not really sure I understand what you are getting at. In a Bayesian context the idea of sparsity cannot be thought of as getting posterior estimates (all of them) to be zero. This does not make sense, and is not possible with any method including the horseshoe. There will need to be some decision rule, or procedure, to impose sparsity on the estimates by either using some interval or, as Aki suggested below, decision analysis.
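As a rough sketch of the interval-based option mentioned above: report zero for any coefficient whose central posterior interval covers zero, and the posterior mean otherwise. This is one hypothetical rule among many, not a recommendation from anyone in this thread. The posterior draws below are simulated stand-ins, not output from a real model, and the function names are made up.

```python
import random


def central_interval(draws, level=0.95):
    """Empirical central interval from a list of posterior draws."""
    s = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(s) - 1))
    hi_idx = int((1 + level) / 2 * (len(s) - 1))
    return s[lo_idx], s[hi_idx]


def sparsify(draws_by_coef, level=0.95):
    """Return 0.0 where the interval covers zero, else the posterior mean."""
    out = []
    for draws in draws_by_coef:
        lo, hi = central_interval(draws, level)
        mean = sum(draws) / len(draws)
        out.append(0.0 if lo <= 0.0 <= hi else mean)
    return out


random.seed(0)
# Stand-in posterior draws: one coefficient centred near zero, one far from it.
draws_by_coef = [
    [random.gauss(0.05, 0.5) for _ in range(4000)],
    [random.gauss(3.0, 0.5) for _ in range(4000)],
]
estimates = sparsify(draws_by_coef)
print(estimates)
```

The posterior itself puts no mass on exact zeros here; the zeros only appear once the rule is applied to the draws.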

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 21:41:08 +0000

In reply to Bob Carpenter. Bob> Dan’s unmistakeable style means he never has to write, “This post by Dan!” As you can probably guess, this is not an accident.

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 21:40:22 +0000

In reply to Kevin S Van Horn.

This is an interesting question and is one of the key things I didn’t bring up (because the post was long enough). Our substantive prior knowledge is that our signal is sparse. We are attempting to implement this substantive knowledge using independent Laplace prior distributions with a common scale parameter. The prior on the scale parameter needs to also reflect this substantive knowledge. In the “Sheboygan” section, I derived a simple scaling that depends on p and s_0. (The true sparsity dropped out because the term containing it was much smaller than the other term, but there’s a discussion of the value of oracle parameters somewhere in there.)

In general, you also need to know something about the *precision* of the experiment, which is encoded in n and X. Why do you need this? Because we need to choose the cut off epsilon, which will depend on how well the individual beta can be resolved, which is a function of n and X.

So the scaling is needed to reflect our substantive prior knowledge of sparsity. The extra $latex \tilde{\lambda}$ reflects the fact that we only have an “order of magnitude” idea of the scaling, so we still need to learn the exact value from the data. But with this scaling, we know that $latex \tilde{\lambda}$ should be $latex \mathcal{O}(1)$.
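A small sketch of what "how well the individual beta can be resolved" could look like in numbers. It assumes the simplest possible case, a no-intercept regression with one predictor and known noise scale, where the least-squares standard error of the slope is sigma / sqrt(sum of x squared). That choice of resolution measure and the function name are assumptions for illustration, not the quantity used in the post.

```python
import math


def coef_standard_error(x, sigma=1.0):
    """Standard error of the slope in y = beta * x + noise (no intercept)."""
    return sigma / math.sqrt(sum(xi * xi for xi in x))


# Same predictor values repeated: more observations, finer resolution.
x_small = [1.0, -1.0] * 5      # n = 10
x_large = [1.0, -1.0] * 500    # n = 1000

# Same n as x_small, but predictor values ten times closer to zero: coarser resolution.
x_narrow = [0.1, -0.1] * 5     # n = 10

se_small = coef_standard_error(x_small)
se_large = coef_standard_error(x_large)
se_narrow = coef_standard_error(x_narrow)

print(se_small, se_large, se_narrow)
```

Nothing about the outcome variable enters this calculation, yet the smallest effect that could be told apart from zero changes with both n and the predictor values.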

By: Aki Vehtari

Aki Vehtari — Thu, 02 Nov 2017 20:35:47 +0000

In reply to Dan Simpson. Dan> and then you do a decision analysis to “sparsify” the output. as discussed here and demonstrated, e.g., here and here.

By: Corey Yanofsky

Corey Yanofsky — Thu, 02 Nov 2017 20:16:04 +0000

In reply to Dan Simpson. The worthwhile content outweighs the pain of reading on balance -- that doesn't mean I like pain.

By: Kevin S Van Horn

Kevin S Van Horn — Thu, 02 Nov 2017 19:58:38 +0000

“a prior on \lambda needs to depend on n, p, s_0, and X ”

Why? This strikes me as nonsensical. Neither the number of observations (n) nor the points at which you took the observations (X) influence the effect sizes, nor do n and X *alone* generally give you any information about the effect sizes. Put another way: does your opinion of likely effect sizes change after someone tells you how many observations they took, or even the predictor values for these observations, without telling you anything about the values for the outcome variable?

By: Bob Carpenter

Bob Carpenter — Thu, 02 Nov 2017 19:54:43 +0000

In reply to Wyman. Dan's unmistakeable style means he never has to write, "This post by Dan!" Sort of like my favorite RPG blogger, The Angry GM. Only without the cussing and with cabaret instead of fantasy fiction and statistics instead of role-playing games. I also laughed out loud at his summary of the whole blog endeavour:

But it’s a blog. If ever there was a medium to be half-arsed in it’s this one. It’s like twitter for people who aren’t pithy.

By: Keith O'Rourke

Keith O'Rourke — Thu, 02 Nov 2017 19:49:41 +0000

In reply to Dan Simpson.

> :p
A typo?

Or maybe http://consc.net/misc/proofs.html

By: B D McCullough

B D McCullough — Thu, 02 Nov 2017 19:10:49 +0000

In reply to Wyman. +1

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 19:07:21 +0000

In reply to Keith O'Rourke. It really doesn't feel like it's too much to ask for that :p

By: Keith O'Rourke

Keith O'Rourke — Thu, 02 Nov 2017 19:05:00 +0000

In reply to Dan Simpson. Providing a better sense of when something would and would not work for what purposes would also be nice.

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 18:41:50 +0000

In reply to Keith O'Rourke. Ideally I'd like us to come up with ways to check if a model or method will work for our type of problem before even trying it on data. But it would also be nice to not publish things that don't work.

By: Keith O'Rourke

Keith O'Rourke — Thu, 02 Nov 2017 18:23:06 +0000

> we tighten our standards and insist that people proposing new methods, models, and algorithms work harder to sketch out the boundaries of their creations
So those lawyer-like limitations often in the discussion section that seem to just say “don’t use when not appropriate” are not enough?

You want to do away with caveat emptor?

By: Wyman

Wyman — Thu, 02 Nov 2017 18:18:51 +0000

In reply to Corey Yanofsky. That's funny, because this is not a tool I've ever used or plan to, but I read this entire post just because I enjoy Dan's writing.

By: Andrew

Andrew — Thu, 02 Nov 2017 18:12:07 +0000

In reply to Dan Simpson. +1

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 17:46:47 +0000

In reply to Bryan. You can get sparsity with Bayesian methods (like the horseshoe and the Finnish horseshoe), although what you really are doing is using a sensible high-dimensional prior that doesn't artificially inflate signals (like the Gaussian will) and then you do a decision analysis to "sparsify" the output.

By: Dan Simpson

Dan Simpson — Thu, 02 Nov 2017 17:43:41 +0000

In reply to Corey Yanofsky. It's just so easy not to read things you don't like. So very easy.

By: Corey Yanofsky

Corey Yanofsky — Thu, 02 Nov 2017 17:02:21 +0000

Prolixity is a bug, not a feature…

By: Eh2406

Eh2406 — Thu, 02 Nov 2017 16:47:59 +0000

This was really really helpful! Thank you! I need to go experiment with the horseshoe some more. :-)

By: Bryan

Bryan — Thu, 02 Nov 2017 15:56:18 +0000

People who want sparsity do regular Lasso. If you want to be fully Bayesian, you’re not going to get sparse results (I mean you can if you set small coefficients to zero, but this seems sort of like a hack to me). Bayesian Lasso should be done when the Laplace prior is reasonable for the coefficients of the specific problem. If Bayesian Lasso shrinks big signals too much and allows small signals to be too far from zero, then the Laplace prior is a poor choice of prior. I imagine there are some cases where it works well; there are even some cases where the Gaussian prior works well.