2 new thoughts on Cauchy priors for logistic regression coefficients

Aki noticed this paper, On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression, by Joyee Ghosh, Yingbo Li, and Robin Mitra, which begins:

In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove two theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. For full Bayesian inference, we develop a Gibbs sampler based on Polya-Gamma data augmentation . . .
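
To see what separation means in practice, here's a minimal sketch in R with toy data I just made up (not from the paper): once some cutoff in x classifies the sample perfectly, the likelihood keeps increasing as the slope grows, so no finite maximum likelihood estimate exists.

x <- c(1, 2, 3, 4)
y <- c(0, 0, 1, 1)        # y is 0 below x = 2.5 and 1 above it: complete separation
fit <- glm(y ~ x, family = binomial())
# R warns that fitted probabilities numerically 0 or 1 occurred and reports a
# huge slope with an enormous standard error; the optimizer has just stopped
# partway up an unbounded likelihood.
coef(fit)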

It’s good to see research on this. Statistics is the science of defaults, and an important part of statistical theory at its best is the study of how defaults work on a range of problems. It’s a good idea to study the frequency properties of statistical methods—any methods, including Bayesian methods.

I have not read through the paper, but based on the above abstract I have two quick comments:

1. We no longer recommend Cauchy as our first-choice default. Cauchy can be fine as a weakly informative prior, but in the recent applications I’ve seen, I’m not really expecting to get huge coefficients, and so a stronger prior such as normal(0,1) can often make sense. See, for example, section 3 of this recent paper. I guess I’m saying that, even for default priors, I recommend a bit of thought into the expected scale of the parameters.

2. I assume that any of the computations can be done in Stan, no need for all these Gibbs samplers (a quick sketch follows below). I’m actually surprised that anyone is writing Gibbs samplers anymore in 2015!
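
Here's a hedged sketch of what I mean by both points, using the stan_glm interface we've been working on (see the comments below; it isn't released yet, so the exact arguments are an assumption) and a hypothetical data frame d with a binary outcome y:

library(rstanarm)

# Weakly informative normal(0, 1) priors on the coefficients:
fit_normal <- stan_glm(y ~ ., data = d, family = binomial(link = "logit"),
                       prior = normal(0, 1), prior_intercept = normal(0, 1))

# The Gelman et al. (2008) Cauchy defaults, for comparison:
fit_cauchy <- stan_glm(y ~ ., data = d, family = binomial(link = "logit"),
                       prior = cauchy(0, 2.5), prior_intercept = cauchy(0, 10))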

18 thoughts on “2 new thoughts on Cauchy priors for logistic regression coefficients”

  1. That wonderful paper by Nicolas Chopin and James Ridgeway would suggest that MCMC in general is the wrong approach for this type of problem.

    Too often the first resort when it should be the last!

      • Because a medium-sized (<100k observations) logistic regression should be solved in a blink, not in a few minutes. Mainly because you often need to run a lot of them, or to get some reasonable idea of model sensitivity, etc.

        I'm obviously not suggesting that everyone should write their own complex code – Nicolas and James' paper has an attached R package, for example.

        (I just remembered that STAN has some non-MCMC things in it, which may make it appropriate for this task – sorry, I've only ever used the MCMC part! If the non-MCMC code for logistic regression has an R wrapper that makes it look like the glm() function, then that's even better!)

        • Dan:

          Yes, my point is that Stan should be the first resort because it can be implemented in about 2 minutes. Writing a custom program takes a lot more than 2 minutes! But if running the Stan program takes too long, or if it is embedded in a big loop, then, sure, take the next step and write your own program. That would be the second resort. The third resort, I suppose, would be coming up with a new algorithm: worth doing if resorts #1 and 2 don’t do the job.

          And, yes, Stan can compute the MLE or the posterior mode, along with the curvature. It’s faster than the glm program in R. We have the stan_glm function in R that can do this point estimation as an option; we’re planning to release it soon.
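
          A hedged sketch of that option, again with a hypothetical data frame d and the not-yet-released stan_glm interface (so the arguments are an assumption):

          fit_map <- stan_glm(y ~ ., data = d, family = binomial(),
                              prior = normal(0, 1), algorithm = "optimizing")
          # algorithm = "optimizing" finds the posterior mode (a penalized MLE when
          # the priors are proper) plus a normal approximation based on the
          # curvature, instead of running full MCMC.
          coef(fit_map)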

        • Sorry Andrew – I just don’t see where you’re coming from here. Is it that people can only learn one Bayesian inference tool, so it should be Stan?

          Because for logistic regression there are just so many other decent options. (Coding it yourself is not one of them)

          EPGLM is fast. INLA is fast (it’s just a convenient interface to a pure Laplace approximation in this case, which is decent, but not as good as the EP in EPGLM. Logistic regression is too easy for us too!). They both also have very easy interfaces (one line of code; a sketch appears at the end of this comment).

          There are probably more, but I don’t keep a list of logistic regression solvers.

          Sampling-based strategies would be where I went if a deterministic method didn’t work for some reason. Not because I don’t think Stan is fabulous (I do), but because logistic regression is far too simple a problem to crack with that nuclear bomb. Computational strategies should scale to fit the problem at hand.

          So yes. Stan will solve this. But it’s not the easiest way, or the fastest way. It’s amazing but it isn’t (and shouldn’t be!) the only tool in town.

          If this was a mixed effects or GP logistic regression or something else, then it may be a different story.

          [There are caveats – if you have a tiny amount of data, then you will need full MCMC, but for the cases in the linked paper, the asymptotic approximations should be good enough. (As an aside, it’s difficult to get excited about missing, say, the third decimal place with one of these “approximate” methods – the statistical error will be much larger than that most of the time. So “good enough” isn’t necessarily a very high barrier)]
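
          Here's the sketch of those one-line interfaces, assuming a data frame d with binary outcome y and a single predictor x; the EPGLM call in particular is an assumed signature, so check the package documentation:

          library(INLA)
          # Laplace-type approximation to the posterior; no MCMC involved.
          fit_inla <- inla(y ~ x, family = "binomial", Ntrials = 1, data = d)
          summary(fit_inla)

          # EPGLM's EP approximation is a similarly short call (assumed signature):
          # fit_ep <- EPGLM::EPlogit(X = model.matrix(~ x, data = d), Y = d$y, s = 2.5^2)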

        • Dan:

          The abstract says, “For full Bayesian inference, we develop a Gibbs sampler based on Polya-Gamma data augmentation . . .”

          “We develop” implies a bit of work! It’s possible that existing programs did not solve the problem, in which case “we develop” was necessary. But if the problem could be solved using Epglm or Inla or Stan or whatever, then I’d recommend one of those more direct approaches.

          To put it another way, my problem was not with “Gibbs sampler,” it was with “we develop.” Using custom-developed software requires more effort, is in general less trustworthy, and is in general less generalizable, compared to using existing software.

        • Dan — logistic regression is certainly easy, but what if you want to generalize it at all? Even small perturbations to the model will render many once-fast algorithms terribly slow. I don’t want to speak for Andrew, but my motivation for a tool like Stan is that it separates modeling from computation. Build the model you want _first_ and then see if you can fit it; don’t just use a logistic regression because it could be specified from a small list of models in some computational software. It forces you to think, so it might not be appropriate for everyone :-p, but ultimately that thinking is critical to any good analysis!

  2. “I assume that any of the computations can be done in Stan, no need for all these Gibbs samplers. I’m actually surprised that anyone is writing Gibbs samplers anymore in 2015!” I’m not a STAN user, so I’m not sure I understand what you mean by that, or whether it applies to STAN in particular (STAN has general estimation functions for Bayesian statistics) or in general (Gibbs samplers have been superseded by something else…).

  3. I wonder if the differences in coefficients between using a normal(0,1) prior and a Cauchy prior are statistically significant. I seem to recall reading a paper along those lines somewhere by someone whose name I can’t remember.

    • Hand-waving away the logistic regression likelihood for a moment (everything is approximately Gaussian, right?), you’ll get a difference in the posteriors when the point of maximum likelihood (or in the case of separation, the set of parameter values leading to perfect separation) is far from the prior mode. For the normal prior, the shrinkage has the same “prior weight” no matter where parameter values favored by the data are in parameter space; for the Cauchy, as the set of parameter values favored by the data moves away from the prior mode past some radius, the amount of the shrinkage decreases.
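
      To put the same point in symbols (a sketch, with sigma the normal scale and s the Cauchy scale): the pull toward zero exerted by each prior is the gradient of its negative log density,

      $$ \frac{d}{d\beta}\left[\frac{\beta^2}{2\sigma^2}\right] = \frac{\beta}{\sigma^2}, \qquad \frac{d}{d\beta}\left[\log\left(s^2 + \beta^2\right)\right] = \frac{2\beta}{s^2 + \beta^2}, $$

      so the normal pull grows linearly in beta without bound, while the Cauchy pull peaks at |beta| = s and decays toward zero as the data push the coefficient farther out, which is exactly the vanishing shrinkage described above.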
