Hierarchical logistic regression in Stan: The untold story

Corey Yanofsky pointed me to a paper by Neal Beck, Estimating grouped data models with a binary dependent variable and fixed effects: What are the issues?, which begins:

This article deals with a very simple issue: if we have grouped data with a binary dependent variable and want to include fixed effects (group specific intercepts) in the specification, is Ordinary Least Squares (OLS) in any way superior to a (conditional) logit form? . . . it can be argued that OLS is not “too bad” and so can be used when its use simplifies other matters (such as endogeneity).

I replied that this all seems pretty pointless in the Stan era.

Corey wrote back:

Hmm . . . maybe the right fodder for a blog post is Chris Blattman touting the paper as justifying his loathing of logistic regression? As in:

I think the reasons economists often go the linear route is because it generates very simple to interpret estimates, which is not the case with logit.

I know you’ve written about interpreting logistic regression coefficients before; can’t remember if it was on the blog or in ARM or both (or neither?)… IIRC a factor of 4 was involved…

To which I replied:

1. Hey, what could be simpler to interpret than a predicted probability of 1.2? [That’s sarcasm. The point is that if you fit linear regression to binary data, then you can get predicted probabilities that are lower than 0 or more than 1. —ed.]

2. Yes, the factor of 4 is in chapter 5 of ARM. When predicted probabilities are not far from 0.5, you can divide the logistic regression coefficient by 4 to get the approximate coefficient on the probability scale.
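
[Ed.: For readers who want to check the divide-by-4 shortcut numerically, here is a minimal R sketch on simulated data; the coefficients, seed, and variable names are made up for illustration.]

    # Simulate binary data whose probabilities stay near 0.5, where the rule works best.
    set.seed(123)
    x <- rnorm(1000)
    y <- rbinom(length(x), size = 1, prob = plogis(0.2 + 0.5 * x))

    fit <- glm(y ~ x, family = binomial)
    coef(fit)["x"] / 4   # divide-by-4 rule: approximate effect of x on the probability scale
    # Compare with the change in predicted probability for a one-unit increase in x near its mean:
    diff(predict(fit, data.frame(x = mean(x) + c(0, 1)), type = "response"))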

Seriously though, I don’t think it’s quite right to describe Chris as “touting” Neal’s paper. Chris just seemed to be saying it was a cute idea. And I don’t think Neal is saying that we should stop fitting logistic regression; he’s just giving interpretation tips for people who are currently not able to handle logistic regression.

In any case, as noted above, a lot of this sort of thing is becoming obsolete now that we can so easily fit multilevel models in Stan.

48 thoughts on “Hierarchical logistic regression in Stan: The untold story”

  1. When presenting model estimates, I aim to tell a story about a lot of numbers (data) with a few numbers (statistics). Often different disciplines are more or less comfortable with certain models (economists seem more comfortable with linear probability models than other disciplines). Do you have suggestions on how to present results from fitting multilevel models in Stan to audiences that are more comfortable with other models? Do you think model choice may be audience specific?

    • If only it weren’t so common. I work on issues related to fish passage and survival in the federal Columbia River hydropower system. Probably not a big deal to you East Coasters, but the various dam owners out here make billions of dollars off the river and spend millions of it on trying to improve fish passage and survival past these dams. The federal agencies that make critical decisions on how this money is spent and on how the whole Columbia River is operated regularly use a model (a variant of the Cormack-Jolly-Seber model) which will generate survival probabilities greater than 1. That’s right, dams don’t kill fish, they resurrect them! The reason for this is that the person who came up with the model uses a maximum likelihood method that doesn’t use the logit transformation and is unconstrained on the upper end.

      Bob Carpenter wrote a version of the CJS model in Stan that simulations show does a better job than the maximum likelihood method being used now. There’s a paper waiting to be written. Bob, if you’re listening, how’d you like a fisheries paper on your CV?

        • I thought so too, but this is an example where Stan actually makes a difference. I’ve done direct simulations against Jeff Laake’s marked R package and replicated the MLE model I mentioned in my original comment in R. The problem is really at the boundary conditions, when you’re trying to estimate survival in excess of 90% or so. marked, using the logit transform, breaks down in a large proportion of the simulations because the variance estimates blow up. The MLE method implemented in SURPH and used by NOAA and other agencies will yield estimates greater than one, with the bonus side effect that standard error estimates increase with the survival parameter estimate.

          You might think that’s not such a big deal but most of the projects on the Columbia are required to hit targets well above 90% for adult and juvenile survival. Sure, we could go ahead and fit a model in BUGS or JAGS but there are hundreds of thousands of fish tagged every year in the river, and 12 million plus fish have been tagged since they’ve adopted passive integrated transponder technology. JAGS/BUGS simply take too long to produce estimates.

        • To echo Dalton, there is a whole genre of papers in quantitative ecology that mistakenly criticizes models because those models are estimated with MLE. I’ve seen it for CJS, as he notes, as well as occupancy models and even plain old negative-binomial (gamma-Poisson). The models are obviously not the problem here; it’s MLE that’s the problem.

        • That makes sense, I helped with some CJS models in Ben Letcher’s lab for a large mark-recapture data set and abusing JAGS worked for that analysis. I tried the same for a larger data set as part of my dissertation and ended up re-doing everything in Stan just because of the convergence problems (well, and wanting to do multi-dimensional splines). I don’t think it’s impossible in JAGS (especially in the formulation that marginalizes out the state) but I would be surprised if it did better than Stan.

          Bumping something like SURPH would probably require a nice R wrapper along the lines of rstanarm and some fisheries-data-specific simulation studies. I don’t think it’s out of the question since a number of people on the Stan users list (me included) have CJS implementations.

      • D’oh! That indeed sounds like a poor modeling decision.

        Let’s separate the modeling from the inference and then the computation of that inference.

        The ecologists have decent logistic and probit models for these kinds of things up their sleeves, where they have predictors for animal weights and also for seasonal effects like temperature and rainfall. You can see such a model, for example, in Fränzi Korner-Nievergelt et al.’s new book on Bayesian ecological modeling.

        Traditionally, maximum likelihood estimates were generated by marginalizing out the latent discrete parameters and optimizing the rest. That’s what you see in the 1980s literature. This is a bit tricky with hierarchical (mixed effects) models, where you need to apply max marginal likelihood. But you could still do it, I would think.
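
        [Ed.: To make the marginalization concrete, here is a minimal R sketch of the CJS likelihood for detection histories with constant survival phi and detection p, with the latent alive/dead states summed out via the usual “never detected again” recursion. The function and the toy data are illustrative, not taken from any of the packages discussed.]

          # Log likelihood of one 0/1 detection history y (first 1 = release),
          # assuming constant survival phi and detection probability p.
          cjs_loglik <- function(y, phi, p) {
            n_occ <- length(y)
            first <- min(which(y == 1))
            last  <- max(which(y == 1))
            ll <- 0
            if (last > first) {
              for (t in (first + 1):last)   # known to be alive between first and last detection
                ll <- ll + log(phi) + dbinom(y[t], 1, p, log = TRUE)
            }
            chi <- 1                        # chi_t = Pr(never detected after t); chi_T = 1
            if (last < n_occ) {
              for (t in (n_occ - 1):last)   # chi_t = (1 - phi) + phi * (1 - p) * chi_{t+1}
                chi <- (1 - phi) + phi * (1 - p) * chi
            }
            ll + log(chi)
          }

          # The traditional MLE route: optimize the marginalized likelihood directly.
          Y <- rbind(c(1, 1, 0, 1, 0),      # a few made-up detection histories
                     c(1, 0, 1, 1, 1),
                     c(1, 1, 1, 0, 0),
                     c(1, 0, 0, 0, 0))
          nll <- function(par) -sum(apply(Y, 1, cjs_loglik, phi = plogis(par[1]), p = plogis(par[2])))
          plogis(optim(c(0, 0), nll)$par)   # MLE of (phi, p) back on the probability scale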

        Then BUGS let you explicitly model the latent discrete parameters and sample over them using generalized Gibbs. This was where the big jump in expressiveness for users who didn’t want to code their own MCMC happened. And I agree that it’s very easy to think about modeling in terms of the latent discrete parameters. There are several books based on building ecology models in BUGS and as far as I can tell, they’re at least partly responsible for there being so many Bayesians in ecology. Once you’ve started coding in something like BUGS, it’s hard to go back to packages offering you a fixed palette of models or building one-offs yourself—that’s certainly what hooked me.

        Stan goes back to marginalizing out the latent discrete parameters, but samples using HMC (NUTS, specifically). You could, of course, compute the penalized MLE with Stan, too.

        So there’s MLE (or MML if we have a hierarchical model) vs. full Bayes on the one hand, and Gibbs vs. HMC on the other.

        I can believe full Bayes would win a bakeoff on inference, especially with smaller data sets and hierarchical modeling of uncertainty. I’m pretty confident that Stan would win the bakeoff vs. BUGS or JAGS for computation (our experiments showed about a factor of 100 speedup — it’s tricky to condense this to a single number because there are multiple parameters being estimated).

        P. S. Please send me email to talk about a paper at [email protected].

        • I really got into BUGS/JAGS for the ability to explicitly model the latent discrete parameters. It’s a lot easier to think and write the model in terms of the conditional data likelihood. For more complicated hierarchical models it’s still my go-to (dcat how do I love you, let me count the ways). Given all the variations you can have in study designs for mark-recapture studies, I don’t know why you wouldn’t go Bayesian for the flexibility of matching a model to a study design.

          What I was talking about in the comment above is plain vanilla CJS though, no predictors, just a single group-level survival estimate for a given reach of the river. I’ve already played around with the code you wrote for some simulations and it has worked beautifully. I think the ability to do simulations so quickly is reason enough to use Stan. One of the things on my spare-time to-do list is to figure out how to take your code and hierarchical-ize it. I’ll shoot you an email, even if just to bounce the idea off of you.

        • Dalton – did you ever hierarchical-ize this? Creepers of this convo like me would very much like to see any example code associated with CJS models in Stan that have covariates incorporated, hierarchical or not. I’ve been having difficulty deciding on how to code the indexing of the covariates (i.e. some covariates for tagged fish data are at the reach level, while some are at the individual fish level), and would really appreciate seeing how other folks are currently implementing this.

  2. Wait, economists, who I think of as “mathy-er than thou” in the social sciences, don’t like logistic regression because… too mathy?

    Logistic regression is just pushing a line through a nonlinear transform so that it can’t get out of the range (0, 1).

    • Re factor of 4:

      the derivative of 1/(1+exp(-x)) at x = 0 is 1/4, and the graph of this function is basically a line from about -1 to 1; when x = 0 the function equals 1/2 (i.e., p = 0.5). That’s pretty much the story on interpretation, right?

    • Daniel:

      Economists optimize.

      They are willing to trade off theoretical elegance for simplicity and tractability, especially in interpreting results.

      Surely most of these economists find OLS welfare-maximizing in most applied scenarios they deal with.

    • Not just that it can’t escape that range, but that predictions are increasingly pushed towards 0 and 1 if the model is better (the derivative at the mean is greater).
      What’s odd to me is that even in the early 1980s when I was in grad school we were doing logit and probit using software that I believe was developed by an economist (Peck).

      • @Elin

        There is a difference between the marginal effect and the average marginal effect. In most economic applications researchers are looking at the latter. The ATE with a binary outcome is just a difference in proportions. OLS is fine for that.

        Here is Wooldridge (2002), Econometric Analysis of Cross Section and Panel Data, p. 455:

        “If the main purpose is to estimate the partial effect of x_j on the response probability, averaged across the distribution of X, then the fact that some predicted values are outside the unit interval may not be very important. The LPM need not provide very good estimates of partial effects at extreme values of X”.
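
        [Ed.: A quick R sketch of the point being made here, with made-up data: for a binary treatment, the OLS coefficient is exactly the difference in proportions, and it comes out close to the logit-based average effect.]

          set.seed(1)
          n <- 5000
          treat <- rbinom(n, 1, 0.5)
          x     <- rnorm(n)
          y     <- rbinom(n, 1, plogis(-0.5 + 0.8 * treat + x))

          # OLS / linear probability model: the coefficient on treat is the difference in proportions.
          coef(lm(y ~ treat))["treat"]
          mean(y[treat == 1]) - mean(y[treat == 0])

          # Logit route: average the difference in predicted probabilities over the sample.
          fit <- glm(y ~ treat + x, family = binomial)
          mean(predict(fit, data.frame(x = x, treat = 1), type = "response") -
               predict(fit, data.frame(x = x, treat = 0), type = "response"))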

        • so, you’re averaging d/dx (p(Outcome | X)) p(X) ???

          (by marginal effect I assume you mean the effect of a marginal increase in X on p(Outcome))

          If you’re using a linear model, you’re just averaging a constant. If you don’t mean the d/dx thing, then you’re averaging the prediction, so then it’s the population mean probability of outcome you’re looking for?

          It’s true that if the prediction is wildly off for things that have almost no probability ( P(X) small ) then it won’t matter for the average. But I gotta say, this all seems like more of a headache than it’s worth. Why not just fit the logistic regression and calculate the thing you’re interested in?

          I don’t disagree that under some circumstances it can be an OK approximation to use a linear model, but it seems to me that checking that your linear approximation is valid is about the same level of difficulty as just doing what you “ought” to be doing in the first place and then extracting the quantity of interest from it.

          I still believe that the likely reason people consider “doing it wrong” a reasonable choice is that the software they’re using has a push-button nature that doesn’t make it easy to work with the model. For example, in Stan you’d fit the logistic regression, and then you’d use generated quantities to randomly sample according to the population frequency and average the effect. In R, you’d do something similar, probably just fit the logistic using glm but then randomly sample from the population for X and use “predict” to get an averaged probability of occurrence.
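
          [Ed.: The R version described above, as a minimal sketch with simulated data: fit the logit with glm, then average the derivative of the predicted probability over the observed X’s as a stand-in for the population.]

            set.seed(2)
            x <- rnorm(2000, mean = 1)
            y <- rbinom(length(x), 1, plogis(-2 + 0.7 * x))

            fit  <- glm(y ~ x, family = binomial)
            phat <- predict(fit, type = "response")
            # d/dx Pr(y = 1 | x) = beta * p(x) * (1 - p(x)); average it over the sample.
            mean(coef(fit)["x"] * phat * (1 - phat))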

          But in SAS you’d most likely be SOL if you don’t have some serious SAS macro skills.

        • Daniel:

          I agree with you on the latter point. People are making trade-offs. They have a specific quantity of interest in mind. They find that taking an expedient shortcut gets them there faster than the kosher route, so they take the shortcut.

          Now, if economists are rational, they will go kosher as doing so becomes cheaper in terms of time, effort, and interpretability.

          If I am being customer-oriented, I’d say this is not a failure of economists so much as a failure of software providers. Doing the right thing is still too costly for many people.

        • I mean it’s fine if you don’t care about what it means and you aren’t going to use it for decision making or prediction, sure.

        • @Elin

          Just to be clear: this is not about me but about economists. I have chosen to defend them because I don’t think they are stupid. Most are very well trained. So much so that they feel confident departing from the rule book and doing a hack when convenient.

          As for decision making, experiments are used all the time to make decisions. But often those decisions are made on the basis of the average. Most bureaucracies cannot engage in personalized medicine. Typically they find it hard to do one thing well across the board.

  3. Also, OLS is a fitting method (minimizing squared error), whereas logistic regression is a modeling technique and can be fit in various ways (maximum likelihood, maximum penalized likelihood, least squares on the probability scale, Bayesian methods, whatever).

    So, basically I agree with the pointlessness of this kind of thing. What’s needed is to teach people who are going to need to use math how to use it.

    • I do find it odd that so many people seem to confuse modeling and inference/estimation. Is that because it just all comes together as a procedural package for them?

      And yes, derivatives are the way to calculate linear approximations. In fact, I started thinking of derivatives that way after reading the Magnus and Neudecker matrix calc book. I had never understood Taylor series as an undergrad. Of course, it helps to have Michael Betancourt on hand to answer questions at the blackboard!

      • I worked at a finance company back in the ’90s and they used SAS a lot. One of my colleagues described SAS as the world’s biggest pocket calculator. If you wanted to “push a button” to get a canned analysis on a big dataset, it was great. If you wanted to do ANYTHING else… UGH.

        Core SAS itself is not Turing complete, but it contains a Turing-complete macro language (though it’s a textual macro language; it doesn’t operate on parse-tree type structures). It makes me shudder to think about it. But I think it explains a lot of people’s perspectives on statistics.

        I started using R at version 0.xx back in about 1998 as an alternative to having to learn SAS and never looked back…

        Amusingly, one guy at the company once sent out something like a 1000-line macro package that would grub through a directory and delete files older than a certain age. I sent out the line “find . -mtime +5 | xargs rm” and I think I alienated the whole department.

        SAS etc. is like learning to ride a bike with training wheels… Sure, it teaches you to do it all wrong, but it’s not as scary and gives you a nice path to eventually bolting on a 15-ton dump truck to act as a side-car to your bicycle.

        • A long time ago my thesis advisor told me that competency in SAS guarantees a job. I’m well into my career and I think he’s right, especially now as our research group is expanding. However, I’d really like to hire people with competency in R as well. Necessary and sufficient conditions and all that.

  4. I use hierarchical logistic regression all the time (or at least used to, during my PhD). And this was before Stan! Yep, good old days of JAGS and BUGS, or my own R code.
    Having said that, the best argument I see for economists to use the linear probability model is that it makes it easier to use instrumental variables, DD, RDD, etc. I never found any simple paper on using logistic regression with instrumental variables and the like. I know IV is a bit (a lot?) discredited nowadays, but still…
    Any pointers to the literature on identification strategies for logistic multilevel models?

  5. In most cases economists want to know the marginal effects of policies on average, not the effect on each individual. The linear probability model is pretty good at that (if the conditional expectation function is non-linear, the LPM approximates it). Besides, how do we make sure we use the right non-linear model? If we use the wrong non-linear model, we’ll get the wrong marginal effects.

    • Rap:

      It may be that economists have traditionally been interested only in averages but I think this is changing, certainly in microeconomics which is ultimately all about the individual. In any case, the average is only defined relative to some population which is not in general the same as the sample being studied. To put it another way: if you use the wrong linear model, you’ll get the wrong marginal effect. Restricting yourself to linearity does not get rid of misspecification bias.

      • Andrew,

        Most, if not all, recent empirical papers published in AER, JPE, QJE, Econometrica, REStud, REStat just have estimates of average effects, maybe because estimating average effects is difficult enough.

        I think it is good practice to use logit and probit, and others, in addition to the LPM, to show the results are robust (if the effects are real, the results are usually similar).

        • Rap:

          The average effects being estimated are only averages of interest under certain assumptions. That is, they can be interpreted as special cases of interaction models that are more relevant to the underlying questions of interest. As I discussed in my recent talk at Princeton econ (see here), one way that economists (and researchers more generally) attain robust, statistically significant results is to pool across space and time. But when patterns of interest actually vary a lot by group or by time, this can be a mistake. See here for a simple cautionary tale.

    • If all the economists care about is the slope, then just fit a logistic regression and take the derivative after fitting.

      If you fit a line to a binary outcome using least squares, you are doing it wrong, period. The reason I say this with such confidence is that the logistic function is quite linear in the vicinity of x=0, so if you’re fitting to a moderate probability (say 20% to 80%) then you’re basically fitting a line already. And, if you’re fitting to a more extreme probability, then the linear model is going to bust right through p = 0 or 1 and lead to bad results including all kinds of bias etc.

      The logistic curve is just an easy CDF to work with. You could use the CDF of any common unimodal distribution, and do a better job than a line.
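
      [Ed.: For instance, swapping in the normal CDF, i.e. a probit link, gives essentially the same fitted curve as the logit; a minimal R sketch with simulated data.]

        set.seed(3)
        x <- rnorm(500)
        y <- rbinom(length(x), 1, plogis(0.3 + 1.2 * x))

        logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
        probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
        # Fitted probabilities from the two links track each other closely,
        # and both stay inside [0, 1], unlike a least-squares line.
        max(abs(fitted(logit_fit) - fitted(probit_fit)))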

      I think this is a situation where Bob Carpenter’s diagnosis is correct. People who can’t interpret logistic regression are not doing “math” in the statistical modeling, they’re pressing buttons: they’ve got one button for “do a logistic regression” and another for “do a linear regression,” and they aren’t really sure what math is hidden behind the “do a logistic regression” button, so they aren’t really sure how to interpret the output.

      I think this is made worse by the weird stuff they teach people about regression in classical stats textbooks.

      It’s virtually impossible to understand the typical exposition of Generalized Linear Models (https://en.wikipedia.org/wiki/Generalized_linear_model) because it usually talks a lot about the “distribution of the errors,” for which you’re supposed to have a FREQUENTIST concept of the distribution (i.e., that you somehow know the frequency of a given error). Pretty much any beginner can see that they don’t have such a thing. It’s only the “experts” who have duped themselves into thinking that they’re modeling the frequency of errors.

      As soon as you have a flexible Bayesian modeling language and a Bayesian concept of the error distributions, it’s suddenly so much easier to interpret the model, because you’re CREATING the model, not using a push-button canned thing.

      • Daniel:

        I think you underestimate economists. All this is discussed in standard econometrics textbooks, including Greene, Cameron & Trivedi, and Wooldridge.

        But especially see the relevant discussion in Angrist and Pischke.

        Most economists use the linear model to generate an estimate, not to make predictions. So predicting a probability of 120% is a red herring.

        • I’m not even talking about predicting (by which I think you mean extrapolating).

          Suppose you have the following simple situation: a predictor ranges from 0 to 20, the binary outcome occurs with probability equal to some CDF that isn’t the logistic one (here a gamma CDF), and you want to estimate the probability of occurrence when the predictor equals 5:

          library(ggplot2)

          preds <- rep(0:20, 10)                             # predictor values
          pvals <- pgamma(preds, shape = 2, rate = 1/1.5)    # true occurrence probabilities (a gamma CDF, not logistic)
          data  <- data.frame(preds = preds,
                              pvals = pvals,
                              outcomes = rbinom(length(pvals), size = 1, prob = pvals))

          coefs    <- lm(outcomes ~ preds, data = data)                      # linear probability model
          logcoefs <- glm(outcomes ~ preds, family = binomial, data = data)  # logistic regression

          ggplot(data) +
            geom_point(aes(preds, outcomes)) +
            geom_point(aes(preds, pvals), col = "red") +                                # truth
            geom_line(aes(preds, predict(coefs)), col = "blue") +                       # linear fit
            geom_line(aes(preds, predict(logcoefs, type = "response")), col = "green")  # logistic fit

          You’re fitting a line to something that logically HAS to have a horizontal asymptote. It just doesn’t work. In this problem the estimate of p you get from the linear model is wrong almost *everywhere*.

          If your predictor spans a small range, in this problem say from 3 to 4, then yes, you can get away with it, but even if it spans a small range in the region where the nonlinearity is nontrivial, such as, say, 6 to 7, you’ll potentially have trouble.

          I’m not just talking about economists, but anyone who might read this paper and think “gee, I guess I’ll just fit lines because they’re good enough.”

        • Daniel:

          No one is disputing what you are saying. It is not news.

          At issue is whether, in the practical situations economists commonly deal with, the gains to be had from more realistic modeling are worth the (slight) hassle of working out average marginal effects, which is what they care about.

        • Fernando:

          I will say, though, that it’s not so clear that average marginal effects (whatever exactly that means) are what economists should be caring about. It seems like more a matter of convention: this is what people say they want, so they go do it, so they say they want it, etc. There are some good reasons, sure, but I think we have to be careful about circular reasoning here, assuming this is the right thing to be looking at just because it’s what is done. Just consider if, a few years ago, someone had looked at the psychology literature and said that psychologists for good reasons just care about p-values! My point is not that average marginal effects and p-values are the same thing, just that ultimately we have to justify these methods choices based on first principles.

        • Agreed.

          But certainly in the RCT literature it is very common, as researchers there are interested in the ATE. Besides, sample sizes are often not large enough to get into heterogeneity. And even then, people often do experiments because they don’t want to make strong assumptions. Ergo, I doubt these people would want to make predictions for the few cases at the tail ends of the distribution, where much depends on the model relative to the data.

          Now, if I were using a model to predict risk of readmission for individual patients at discharge, then sure, I would not fit a linear model.

        • @Elin

          If you are referring to readmission, in practice what matters is 30-day readmission, for regulatory reasons. So a typical database would have binary data indicating whether each patient was readmitted within 30 days or not.

          Put differently: when you write code, do you sometimes do hacks, or load up on technical debt?

          This is what they are doing. In part because they only care about average effects, and also because they are not going to put the model into production to regulate a dam or whatever. If they were, I am sure they would be more careful.

          I would say we all do hacks all the time. It is risky but optimal in many situations (optimal here in terms of effort/payoff, not technical correctness).
