Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which is the varying effect of income on Republican preference by state. Each state had its own slope and intercept, related through a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ prior is what they would have used had Ben Goodrich created it by then.
Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.
I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, simulations, and blog post, and I’m the one who designed the language! This would’ve been much faster if I hadn’t been trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.
The original Stan model
Here’s Will Hipson’s model:
data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}
parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector<lower=0>[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real<lower=0> sigma; // population sigma
}
transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p;
}
model {
  vector[N_obs] mu;
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  // likelihood
  for (i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}
generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}
Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].
My Stan model
The documentation for the model is at the top of the Stan code; after that, the Stan code has only a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).
/**
* Varying slopes and intercept hierarchical linear regression.
* N observations organized into J groups, with jj[n] being the group
* and x[n, 1:K] the covariates for observation n. The covariate
 * matrix x should include a column of 1s to include an intercept.
*
* The slopes and intercept per group have a multivariate normal prior
* and the scale has an exponential prior. The location of the
* multivariate normal prior has a standard normal hyperprior and its
* covariance is decomposed into a correlation matrix with an LKJ
* hyperprior and a scale vector with an exponential hyperprior. In
* symbols:
*
 * Likelihood:
* y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
*
* Priors:
* sigma ~ exponential(1)
* beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
*
* Hyperpriors:
* nu ~ normal(0, 1)
* scale(Sigma) ~ exponential(1)
* corr(Sigma) ~ lkj(2)
*
* where scale(Sigma) is the scale vector and corr(Sigma) is the
* correlation matrix of Sigma.
*
* For efficiency and numerical stability, the covariance and
* correlation matrices are Cholesky factored.
*/
data {
  int<lower=0> J; // number of groups
  int<lower=0> N; // number of observations
  array[N] int<lower=1, upper=J> jj; // group per observation
  int<lower=1> K; // number of covariates
  matrix[N, K] x; // data matrix
  vector[N] y; // observations
}
parameters {
  vector[K] nu; // location of beta[ , j]
  vector<lower=0>[K] tau; // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega; // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std; // standard beta (beta - nu) / Sigma
  real<lower=0> sigma; // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1); // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma // covariance of beta[ , j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}
(WordPress is really annoying in its mishandling of angle brackets in pre environments.)
I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows_dot_product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, such as multiple recalculations of log(sigma), that is the time waster in non-vectorized code.)
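The rows-dot-product trick is easy to sanity-check in NumPy. Here’s a minimal sketch (the variable names are mine, chosen to mirror the Stan model) showing that the per-observation loop and the vectorized version agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, J = 8, 3, 4
x = rng.normal(size=(N, K))      # covariate matrix, N x K
beta = rng.normal(size=(K, J))   # coefficients per group, K x J
jj = rng.integers(0, J, size=N)  # group index per observation

# Loop version: one dot product per observation
mu_loop = np.array([x[n] @ beta[:, jj[n]] for n in range(N)])

# Vectorized version: the analogue of rows_dot_product(x, beta[ , jj]')
mu_vec = np.sum(x * beta[:, jj].T, axis=1)

assert np.allclose(mu_loop, mu_vec)
```

In NumPy the vectorization is about clarity; in Stan it also avoids redundant autodiff work.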
One could make the case for suffixing all the parameters of the prior for beta using _beta. You’d probably want to do that in a context with multiple groupings. It gets very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.
I put all the comments to the right at the ends of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The exception is the one true inline code comment that I’d keep: the one spelling out the implication of the non-centered parameterization for the distribution of beta.
Let’s take it for a spin
I actually wrote the model first. I would generally recommend building up and debugging in stages, but I’ve built this model many times, so I decided to roll the dice.
What I hadn’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).
WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate-level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (d) cognitive science including psycholinguistics and cognitive anthropology, (e) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (f) philosophy including 20th century analytic philosophy and philosophy of mind.
HOW: Respond in a concise and neutral academic style without embellishments.
Please call me “Bob”.
This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.
BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.
Then I want to set N = 1000 observations. For each observation n in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.
GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]
import numpy as np
import scipy.linalg
# Set parameters
K = 5 # number of covariates
J = 10 # number of groups
N = 1000 # number of observations
sigma = 1.5 # scale of errors
# Generate nu and Sigma
nu = np.random.normal(0, 1, K) # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T) # create a positive semi-definite matrix for covariance
# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T # K x J matrix
# Generate observations
jj = np.random.randint(0, J, N) # random group for each observation
x = np.random.normal(0, 1, (N, K)) # N x K standard normal matrix
x[:, 0] = 1 # set first column to all 1s for intercept
# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
y[:5] # Show the first 5 values of y as a sample output
The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:
−2.27
8.86
8.20
1.30
7.27
These values are based on the simulation with K=5 covariates,
J=10 groups, N=1000 observations, and an error scale σ=1.5.
It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.
Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with simulated data and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion from a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.
import numpy as np
import scipy.linalg
def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)
It takes Stan about 20 seconds to fit this data set, with R-hats all less than 1.01, ESS in the thousands from a sample of size 4000, and all but a couple of parameters recovered within their 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!
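The interval check I describe can be scripted. Here’s a sketch of a hypothetical helper (a name of my own choosing, not from my actual session) that computes the fraction of true values landing in central posterior intervals. I demonstrate it on fake, well-calibrated draws rather than an actual CmdStanPy fit, where the draws would come from something like fit.stan_variable('beta'):

```python
import numpy as np

def coverage(draws, truth, prob=0.95):
    """Fraction of true values inside central posterior intervals.

    draws: (num_draws, ...) array of posterior draws
    truth: array of true parameter values with shape draws.shape[1:]
    """
    alpha = (1 - prob) / 2
    lo = np.quantile(draws, alpha, axis=0)
    hi = np.quantile(draws, 1 - alpha, axis=0)
    return np.mean((lo <= truth) & (truth <= hi))

# Demo on fake "posterior draws" centered at the truth (hypothetical data)
rng = np.random.default_rng(1)
truth = rng.normal(size=(5, 10))                # e.g., a simulated K x J beta
draws = truth + rng.normal(size=(4000, 5, 10))  # well-calibrated fake draws
print(coverage(draws, truth))                   # should be near 0.95
```

With perfectly calibrated draws, coverage should be close to the nominal probability; with only 50 parameters, expect noticeable sampling noise around 0.95.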
Appendix on doc
Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this example triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.
First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.
vector[N] mu; // declaring a mu vector
You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,
sigma ~ exponential(1); // using an exponential prior on sigma
And this is what I meant by documenting the language.
mu[i] = x[i] * beta; // * is matrix multiplication in this context
and
cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.
The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.
Continuing with the doc for z: I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe it refers to some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.
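To make the implication of the non-centered parameterization concrete: shifting and scaling standard normal draws by the location and the Cholesky-factored covariance yields exactly the intended multivariate normal. That’s easy to verify numerically; a sketch with constants of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
K, num_draws = 3, 200_000

# Target location, scale vector, and correlation Cholesky factor
nu = np.array([1.0, -1.0, 0.0])
tau = np.array([0.5, 1.0, 2.0])
L_Omega = np.linalg.cholesky(np.array([[1.0, 0.3, 0.1],
                                       [0.3, 1.0, 0.2],
                                       [0.1, 0.2, 1.0]]))

# Non-centered parameterization: standard normal z, then shift and scale
z = rng.normal(size=(K, num_draws))
beta = nu[:, None] + (np.diag(tau) @ L_Omega) @ z

# The implied covariance is diag(tau) * L * L' * diag(tau)
Sigma = np.diag(tau) @ L_Omega @ L_Omega.T @ np.diag(tau)
assert np.allclose(np.cov(beta), Sigma, atol=0.08)
assert np.allclose(beta.mean(axis=1), nu, atol=0.03)
```

That equivalence is exactly what the one comment worth keeping in the model block should say: to_vector(beta_std) ~ normal(0, 1) implies beta[ , j] ~ multi_normal(nu, Sigma).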
Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first, then generalize later if you need to. I coded the general case in my version of the model above.
I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.
This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.