Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Leap Day Special!

The above graph is from a few years ago but is particularly relevant today!

It’s funny that, in leap years, approximately 10% fewer babies are born on 29 Feb. I think it would be cool to have a Leap Day birthday. But I guess most people, not being nerds, would prefer the less-“weird” days before and after.

There’s lots of good stuff at the above link; I encourage you to read the whole thing.

In the years since, we’ve improved Stan so we can fit and improve the birthdays time series decomposition model using full Bayesian inference.

Here’s Aki’s birthday case study which has all the details. This will also be going into our Bayesian Workflow book.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which is the varying effect of income on Republican preference by state. Each state had its own slope and intercept related with a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ is what they would have used if Ben Goodrich had created it at that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current verison of Stan, but I have to confess up front that Andrew’s right—this is still painful. This took me around three hours to put together the model and simulations and blog post and I’m the one who designed the language! This would’ve been much faster if I wasn’t trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code, then the Stan code only has a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include a slope.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihod:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows dot product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma) that are the time wasters in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta using _beta. You’d probably want to do that in a context with multiple groupings. It gets very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at end of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. Except the one true inline code comment that I’d leave, which is the implication of all the non-centered parameterization on the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend debugging the stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to underestand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (c) cognitive science including psycholinguistics and cognitive anthropology, (d) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (e) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation in in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates,
J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with simulations and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set, R-hats all less than 1.01, is low, ESS in the thousands from a sample of size 4000, and all but a couple parameters are all recovered within 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this triggers some of the same issues. I want to say up front that doc is hard and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first. Then generalize later if you need to. I code the general case in the next section.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

Free online book by Bruno Nicenboim, Daniel Schad, and Shravan Vasishth on Bayesian inference and hierarchical modeling using brms and Stan

Shravan points us to these materials:

Hierarchical models are bread and butter stuff for psycholinguists, so we are trying hard to make Stan/brms mainstream through various means. Teaching this stuff feels like the most important work I am doing right now, more important even than the scientific side of things.

We have chapters on hierarchical modeling in our book (to be published soon with CRC Press), we use both brms and Stan:

https://vasishth.github.io/bayescogsci/book/ [edit: made it a live link]

The online version will remain available for free. Comments/corrections are welcome; one can open issues: https://github.com/vasishth/bayescogsci/issues

This summer, I [Shravan] am teaching an intro to Bayes using brms/Stan, with a focus on hierarchical modeling, especially directed at researchers in linguistics who do experimental work:

https://www.mils.ugent.be/courses/module-9-bayesian-data-analysis/

Plus, at Potsdam, for the last seven years I have been running an annual summer school on stats for linguistics and psych, where our focus is on hierarchical modeling using Stan/brms:

https://vasishth.github.io/smlp2024/

Here, we teach both frequentist and Bayesian approaches to hierarchical modeling.

Cool! Good to have these resources out there.

Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson

I was teaching varying-intercept, varying-slope multilevel models, and . . . I can get them to fit in Stan, but the code is kinda ugly, so I was struggling to clean it up, with no success. This will be a real research project, to add appropriate functions and possibly expand the Stan language so that these models can be written at a higher, more intuitive level.

Varying-intercept models aren’t so bad. In lme4 or blme or rstanarm or brms, you write something like:

y ~ 1 | group + x + z + x:z

and that transfers pretty directly into Stan. Just create the X matrix and go from there. Indeed, you can add as many batches of varying coefficients and it’s no biggie to code it up.

But once you get to varying intercepts and slopes, it all changes. In lme4 or blme or rstanarm or brms, you can just write things like:

y ~ (1 + z | group) + x + z + x:z

But if you want to program this directly in Stan, once you have varying intercepts and slopes, you have to deal with covariance-matrix decompositions and arrays of coefficient vectors, and it’s all a hairy mess.

What to do? For this semester’s class, Imma just gonna go with lme4/blme/rstanarm when fitting varying-intercept, varying-slope models. All this Stan coding is a rabbit hole that’s getting us away from the goal, which is to be able to fit, use, and evaluate statistical models for measurement and variation.

I would like to be able to more easily fit these in Stan, though. Why, you might ask? If we can fit them in lme4, or blme for more stability, or rstanarm for including more uncertainty in the inferences, then why bother coding directly in Stan?

The answer for why we want to code directly in Stan is that we’re often wanting to expand our models, for example adding mixture components, measurement error terms, time series or spatial dependence, etc.

For that reason, you will want to be able to code varying-intercept, varying-slope models in Stan—even if I won’t be teaching that in class this semester.

The good news is that I did some googling and found this tutorial by Will Hipson on programming hierarchical regressions in Stan. It’s from 2020 and I have not looked at every line of code there, but it all looks reasonable and there’s lots of explanation of the workflow. So maybe this is the best place to start, if you want to go in this direction, as you should!

Evaluating samplers with reference draws

I started thinking a bit more about what we’re doing when we use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb gives you Stan programs and 10,000 reference draws from their posterior. The reference draws must be generated by another sampler. In some cases, like Neal’s funnel or high-dimensional normals, etc., we can just run forward simulation and get independent draws. In cases of models with data, we don’t actually know the posterior analytically—that’s why we’re running MCMC in the first place! I think what posteriordb did was run NUTS for enough iterations it could be thinned down to 10,000 roughly independent draws.

Standard error in reference

The standard error for $latex N$ independent draws estimating a particular quantity like the posterior mean of a parameter is

$latex \textrm{se} = \frac{\displaystyle \textrm{sd}}{\displaystyle \sqrt{N}}$

where sd is the standard deviation of the quantity in the posterior. Standard Monte Carlo methods scale horribly due to that square root—Markov chain Monte Carlo can be even worse because rather than the number of draws, we have the effective sample size, which can be much lower than the number of iterations for more challenging posteriors due to the correlation in the draws.

Because $latex \sqrt{10,000} = 100$, with 10,000 draws, or standard error is only 1/100th of our standard deviation. In general, we need four times as many draws to cut the error in half, or 100 times as many draws to add a decimal place of accuracy by cutting error by a factor of ten. That means we can’t reasonably expect something like posteriordb to store enough draws to make fine-grained convergence evaluation possible. A slightly more economical alternative would be to store long term averages. There we cold run a million iterations and save parameter and squared parameter estimates. We could even run a hundred million by farming out 100 jobs on a cluster. But that’s still only three or four decimal places of accuracy. After that, maybe we could wring out another decimal place or two with control variates.

That’s a lot of residual error to have to deal with when evaluating an algorithm that asymptotically approaches the correct answer. The only thing that saves us is that the algorithms we’re evaluating are also samplers and are subject to the same inverse square root rate of reduction in standard error.

Report error in evaluations

What we really want to report is the standard error of the residual between our new model’s estimate and the truth. But we don’t usually have the truth. In practice, we usually do one of two things, depending on how compute-intensive it is to fit my model and our compute resources and patience.

Easy model:. Run the new algorithm 100 times or even 1000 times and report something like a histogram of errors. With this many runs, we can directly estimate standard error (assuming our draws are unbiased) and we can estimate it even more tightly and without assumptions if we know the true values.

Hard model:. Run the new algorithm once or twice and use an estimated standard error for the new algorithm and report that instead of an empirical histogram of errors. This suffers from estimation error in the effective sample size, which can be large.

The problem with both of these aproaches is that it’s only reporting variability of the new algorithm, treating the reference draws as true. We should be going one step further and folding the error in posteriordb or another reference database into the uncertainty reporting of the error of the new algorithm. The additional error could be done with simulation (and maybe analytically—normals are nice that way).

That’s easy enough to do reporting single univariate means. I do not know how to do this for evaluating error in something like KL-divergence, which itself has a Monte Carlo integration, or something like Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wassertein distance algorithm. I have to admit I don’t even know how that algorithm works.

Evaluating delayed rejection

I ran into this problem because I want to evaluate our delayed rejection HMC and show that the error goes to zero as I take more draws. Specifically, I want to extend the illustration in the paper that shows that standard HMC and NUTS fails, but DR-HMC succeeds at sampling from multiscale distributions like Neal’s funnel in its centered parameterization.

There are two fundamental problems with this.

Problem One: By the time we get past an effective samle size of 10,000 from the new algorithm, the error we’re estimating is dominated by the standard error of posteriordb. We don’t even know the real error of posteriordb, only an estimated standard error assuming draws are independent.

Problem Two: The second problem is that we have no guarantees that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal’s funnel, there is no step size for HMC that will properly explore the posterior—each step size picks out a range of log scale parameters it can explore (as we demonstrate in the delayed rejection paper). The standard error estimate from the reference draws assumes independent draws from the posterior, not biased draws. If the posteriordb draws are biased, the error reported for a perfect sampler will not go to zero as the number of posteriordb and new sampler draws goes to infinity.

Help?

Does anyone have any suggestions of what we can do in the face of potentially biased and stochastic samples for evaluation? I could wait until the Stan meeting on Thursday and ask Aki, who’s a font of inspiration on topics like this, but I thought I’d ask a wider group and see how that goes.

Mister P and Stan go to Bangladesh . . .

Prabhat Barnwal, Yuling Yao, Yiqian Wang, Nishat Akter Juy, Shabib Raihan, Mohammad Ashraful Haque, and Alexander van Geen ask,

Is the low COVID-19–related mortality reported in Bangladesh for 2020 associated with massive undercounting?

Here’s what they did:

This repeated survey study is based on an in-person census followed by 2 rounds of telephone calls. Data were collected from a sample of 135 villages within a densely populated 350-km2 rural area of Bangladesh. Household data were obtained first in person and subsequently over the telephone. For the analysis, mortality data were stratified by month, age, sex, and household education. Mortality rates were modeled by bayesian multilevel regression, and the strata were aggregated to the population by poststratification. Data analysis was performed from February to April 2021. . . .

Mortality rates were compared for 2019 and 2020, both without adjustment and after adjustment for nonresponse and differences in demographic variables between surveys. Income and food availability reported for January, May, and November 2020 were also compared.

And here’s what they found:

All-cause mortality in the surveyed are was lower in 2020 compared with 2019, but measures to control the COVID-19 pandemic were associated with a reduction in rural income and food availability. These findings suggest that government restrictions designed to curb the spread of COVID-19 may have been effective in 2020 but needed to be accompanied by expanded welfare support.

More specifically:

Enumerators collected data from an initial 16 054 households in January 2020 . . . for a total of 58 806 individuals . . . A total of 276 deaths were reported between February and the end of October 2020 for the subset of the population that could be contacted twice over the telephone, slightly below the 289 deaths reported for the same population over the same period in 2019. After adjustment for survey nonresponse and poststratification, 2020 mortality changed by −8% (95% CI, −21% to 7%) compared with an annualized mortality of 6.1 deaths per 1000 individuals in 2019. However, in May 2020, salaried primary income earners reported a 40% decrease in monthly income (from 17 485 to 10 835 Bangladeshi Taka), and self-employed earners reported a 60% decrease in monthly income (23 083 to 8521 Bangladeshi Taka), with only a small recovery observed by November 2020.

I’ve worked with Lex and Yuling for a long time, and they both know what they’re doing.

Beyond the direct relevance of this work, the above-linked article is a great example of applied statistical analysis with multilevel regression and poststratification using Stan.

Progress in 2023, Leo edition

Following Andrew, Aki, Jessica, and Charles, and based on Andrew’s proposal, I list my research contributions for 2023.

Published:

  1. Egidi, L. (2023). Seconder of the vote of thanks to Narayanan, Kosmidis, and Dellaportas and contribution to the Discussion of ‘Flexible marked spatio-temporal point processes with applications to event sequences from association football’Journal of the Royal Statistical Society Series C: Applied Statistics72(5), 1129.
  2. Marzi, G., Balzano, M., Egidi, L., & Magrini, A. (2023). CLC Estimator: a tool for latent construct estimation via congeneric approaches in survey research. Multivariate Behavioral Research, 58(6), 1160-1164.
  3. Egidi, L., Pauli, F., Torelli, N., & Zaccarin, S. (2023). Clustering spatial networks through latent mixture models. Journal of the Royal Statistical Society Series A: Statistics in Society186(1), 137-156.
  4. Egidi, L., & Ntzoufras, I. (2023). Predictive Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 929-934). Pearson.
  5. Macrì Demartino, R., Egidi, L., & Torelli, N. (2023). Power priors elicitation through Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 923-928). Pearson.

Preprints:

  1. Consonni, G., & Egidi, L. (2023). Assessing replication success via skeptical mixture priorsarXiv preprint arXiv:2401.00257. Submitted.

Softwares:

    CLC estimator

  • free and open-source app to estimate latent unidimensional constructs via congeneric approaches in survey research (Marzi et al., 2023)

   footBayes package (CRAN version 0.2.0)

   pivmet package (CRAN version 0.5.0)

I hope and guess that the paper dealing with the replication crisis, “Assessing replication success via skeptical mixture priors” with Guido Consonni, could have good potential in the Bayesian assesment of replication success in social and hard sciences; this paper can be seen as an extension of the paper written by Leonhard Held and Samuel Pawel entitled “The Sceptical Bayes Factor for the Assessment of Replication Success“.  Moreover, I am glad that the paper “Clustering spatial networks through latent mixture models“, focused on a model-based clustering approach defined in a hybrid latent space, has been finally published in JRSS A.

Regarding softwares, the footBayes package, a tool to fit the most well-known soccer (football) models through Stan and maximum likelihood methods, has been deeply developed and enriched with new functionalities (2024 objective: incorporate CmdStan with VI/Pathfinder algorithms and write a package’s paper in JSS/R Journal format).

Progress in 2023, Charles edition

Following the examples of Andrew, Aki, and Jessica, and at Andrew’s request:

Published:

Unpublished:

This year, I also served on the Stan Governing Body, where my primary role was to help bring back the in-person StanCon. StanCon 2023 took place at the University of Washington in St. Louis, MO and we got the ball rolling for the 2024 edition which will be held at Oxford University in the UK.

It was also my privilege to be invited as an instructor at the Summer School on Advanced Bayesian Methods at KU Leuven, Belgium and teach a 3-day course on Stan and Torsten, as well as teach workshops at StanCon 2023 and at the University of Buffalo.

Progress in 2023, Aki’s software edition

Andrew, I, and Jessica (and I hope we get more) listed papers for progress in 2023, but many papers would be much less useful without software, so I list also software I’m contributing to with the most interesting improvements added in 2023 (in addition there is always huge amount of work that improves the software somehow, but is not that visible)

Stan (including Stan math + Stan core + Stanc + CmdStan)

posterior R package

loo R package

projpred R package

  • augmented-data projection to add support for more model families (Weber et al., 2023)
  • latent projection to add support for more model families (Catalina et al., 2021)
  • enhanced verbose output
  • improved user interface and summary tables

P.S. I was surprised that there were no major updates to priorsense R package or bayesplot R package in 2023, but there are some great usability improvements coming soon to both of these.

Adding intermediate outcomes to an item-response (Bradley-Terry) model

Huib Meulenbelt writes:

Assume we have the following hierarchical Bradley-Terry model:

data {
  int<lower=0> K;                                // players
  int<lower=0> N;                                // number of rallies
  int<lower=1, upper=K> player_a;     // player a
  int<lower=1, upper=K> player_b;     // player b
  int<lower=0> y;                                 // number of rallies won
}
parameters {
  real<lower=0> sigma;
  vector[K] skill;                                   // ability for player K
}
model{
  sigma ~ lognormal(0, 0.5); 
  skill ~ normal(0, sigma);
  y ~ binomial_logit(N, skill[player_a] - skill[player_b]);
}

In this blog post you argue “there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.”

I agree completely. Of each rally I obtained the length of the rally and I would like this variable to influence the skill level of the player and opponent. The skill level of the two players are closer to each other when the game ends in 11-7 and the match lasts 500 seconds than when the game ends in 11-7 and the match only lasts 100 seconds.

So, we move from
p[player A wins over player B] = logistic(skill_A – skill_B)

to

p[player A wins over player B] = logistic(f(rally_length) * (skill_A – skill_B))

How would you define this function?

My reply: This is known in psychometrics as an item-response model with a discrimination parameter. The multiplier, f(rally_length) in the above notation, is called the discrimination: the idea is that the higher it is, the more predictive the skill-level difference is of the outcome. If discrimination is zero, the skill difference doesn’t predict at all, and a negative discrimination corresponds to a prediction that goes in the unexpected direction (the worse player being more likely to win).

My answer to the immediate question above is: sure, try it out. You could start with some simple form for the function f and see how it works. Ultimately I’m not thrilled with this model because it is not generative. I expect you can do better by modeling the length of the rally as an intermediate outcome. You can do this in Stan too. I’d recommend starting with just a single parameter per player, but you might need to add another parameter for each player if the rally length varies systematically by player after adjusting for ability.

But the biggest thing is . . . above you say that you agree with me to model score differential and not just win/loss, but then in your model you’re only including win/loss as an outcome. You’re throwing away information! Don’t do that. Whatever model you use, I strongly recommend you use score differential, not win/loss, as your outcome.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating.This is presenting joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I did my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable stepwise integrators when doing Hamiltonian Monte Carlo. I told him we’d love to do it, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you retry a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the tails in either the mouth or neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory. Which means if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling momentum each iteration, G-HMC resamples a new momentum value then performs a weighted average with the existing momentum with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories. The trick is getting sequences of acceptances together. Usually this doesn’t work because we have to flip momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but with rejections we reverse momentum (this isn’t an issue with HMC because it fully regenerates each time). So to get directed movement, we need steps that are too small. What Radford figured out is that we can solve this problem by replacing the way we generate uniform(0, 1)-distributed probabilities for the Metropolis accept/reject step (we compare the variate generated to the ratio of the density at the proposal to the density at the previous point and accept if it’s lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth generating a sequence of values near zero, the acceptances will cluster together.

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me and Chirag and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (sawtooth pattern of acceptance probability) with the Mira’s trick (delayed rejection). It indeed solves the multi scale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its stepsize within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (posterior covariance or just variance in the diagonal case) and also estimating steps size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond where we finally sample. But we still have a lot of eval to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

Bob Carpenter thinks GPT-4 is awesome.

Bob writes:

GPT4 is even better at Stan than GPT3 and even better at C++. It’s sooo much better at longer chains of reasoning, writing style, and being able to maintain coherence. Check this two turn dialogue to get it to write a non-centered nonlinear mixed effects model. I followed it up with a one-prompt C++ question to get it to write a full N-body simulation, which it nailed. I’m sure it could’ve used other integration method like Euler’s method or Runge-Kutta.

It’s easy at this point to have it start riffing on the model by adding effects, removing effects, switching to varying slope/varying intercept, etc. It can store up to about 5000 words of context when generating the next 1000 words (and there’s a version you can pay even more for with a roughly 25K word context).

The fact that it can keep all the indexes coherent and line everything up is still mind-boggling to me. GPT4 is qualitatively better at this than GPT3 was. This thing was never taught to code Stan per se. It was just taught which kinds of randomly generated code look like real code! Nevertheless, it picked up things that look like the weakly informative priors people use in practice for logistic regression.

Of course, it has trained on everything available about Stan online, which isn’t much. It’s way better at C++, Python, and R. I included a C++ example using physics below that you should appreciate.

I had GPT3 coach me through writing a speech input, speech output interface to GPT. I built it in Python using the Qt GUI toolkit, the Elevenlabs text to speech API, and the Whisper speech to text API. I now have a (push to talk) speech interface to GPT.

BC: I want to write a Stan program to define a nonlinear mixed effects model. I want to use a logistic regression to predict response to an intervention in hospital trauma centers. Each binomial measurement is a number of successes out of number of trials. Each hospital has a state in which it is located (50 states), and a trauma center level (1, 2, or 3), both of which get random effects with independent hierarchical priors in the model. Then there is are fixed effect for the average income of the hospital patients and the average severity of the cases it sees. And there should be an overall intercept. The fixed effects get fixed priors.

GPT4: To write a Stan program that defines a nonlinear mixed-effects model with the described specifications, you can follow the below structure:

data {
 int<lower=0> N;  // Number of observations
 int<lower=0> J;  // Number of hospitals
 int<lower=0> K;  // Number of states
 int<lower=1, upper=3> level[J];  // Trauma center level
 int<lower=1, upper=K> state[J];  // State in which each hospital is located
 int<lower=0> successes[N];  // Number of successes
 int<lower=0> trials[N];  // Number of trials
 real<lower=0> avg_income[N];  // Average income of hospital patients
 real<lower=0> avg_severity[N];  // Average severity of the cases
}

parameters {
 real<lower=0> intercept;  // Overall intercept
 real<lower=0> beta_income;  // Fixed effect of income
 real<lower=0> beta_severity;  // Fixed effect of severity
 real<lower=0> sigma_state;  // State random effect standard deviation
 real<lower=0> sigma_level;  // Trauma center level random effect
standard deviation
 vector[K] state_effect_raw;  // State random effects (non-centered)
 vector[3] level_effect_raw;  // Trauma center level random effects
(non-centered)
}

transformed parameters {
 vector[K] state_effect = sigma_state * state_effect_raw;  // State random effects
 vector[3] level_effect = sigma_level * level_effect_raw;  // Trauma center level random effects
}

model {
 // Priors for fixed effects
 intercept ~ normal(0, 10);
 beta_income ~ normal(0, 10);
 beta_severity ~ normal(0, 10);

 // Priors for random effects standard deviations
 sigma_state ~ cauchy(0, 2.5);
 sigma_level ~ cauchy(0, 2.5);

 // Priors for non-centered random effects
 state_effect_raw ~ normal(0, 1);
 level_effect_raw ~ normal(0, 1);

 // Model
 for (i in 1:N) {
   int hospital_idx = ...; // Index of the hospital corresponding to
the i-th observation
   successes[i] ~ binomial_logit(trials[i], intercept + beta_income *
avg_income[i] + beta_severity * avg_severity[i] +
state_effect[state[hospital_idx]] +
level_effect[level[hospital_idx]]);
 }
}

This Stan program defines a logistic regression model with fixed effects for average income and average severity of cases, and random effects for the state and trauma center level. The program takes in the number of observations, number of hospitals, number of states, trauma center level, and state of each hospital, as well as the number of successes, number of trials, average income, and average severity of the cases. The model parameters include the overall intercept, fixed effects for income and severity, standard deviations for the random effects, and non-centered random effects for state and trauma center level.

The transformed parameters block calculates the actual random effects for the state and trauma center level by scaling the non-centered random effects with their respective standard deviations. In the model block, priors are set for the fixed effects and the standard deviations of the random effects. The non-centered random effects are given standard normal priors.

Finally, the model block iterates through each observation, and for each observation, it identifies the corresponding hospital index. The binomial_logit distribution is used to model the number of successes out of the number of trials, incorporating the fixed effects, the state random effect, and the trauma center level random effect.

BC: That’s really good, but there are still two errors. (1) There should not be lower bounds on the random effects because we are going to put the linear predictor through an inverse logit transform (implicit in the binomial_logit distribution you used). (2) The indexing is wrong in the model block. The number of observations and the number of hospitals is the same and the index can just represent the hospital (e.g., in order to index the random effects). Please fix these errors and give me just the corrected Stan program.

GPT4:

My apologies for the errors in the previous response. I have corrected the errors in the Stan program below:

data {
 int<lower=0> N;  // Number of observations (equals to the number of hospitals)
 int<lower=0> K;  // Number of states
 int<lower=1, upper=3> level[N];  // Trauma center level
 int<lower=1, upper=K> state[N];  // State in which each hospital is located
 int<lower=0> successes[N];  // Number of successes
 int<lower=0> trials[N];  // Number of trials
 real<lower=0> avg_income[N];  // Average income of hospital patients
 real<lower=0> avg_severity[N];  // Average severity of the cases
}

parameters {
 real intercept;  // Overall intercept
 real beta_income;  // Fixed effect of income
 real beta_severity;  // Fixed effect of severity
 real<lower=0> sigma_state;  // State random effect standard deviation
 real<lower=0> sigma_level;  // Trauma center level random effect standard deviation
 vector[K] state_effect_raw;  // State random effects (non-centered)
 vector[3] level_effect_raw;  // Trauma center level random effects
(non-centered)
}

transformed parameters {
 vector[K] state_effect = sigma_state * state_effect_raw;  // State random effects
 vector[3] level_effect = sigma_level * level_effect_raw;  // Trauma center level random effects
}

model {
 // Priors for fixed effects
 intercept ~ normal(0, 10);
 beta_income ~ normal(0, 10);
 beta_severity ~ normal(0, 10);

 // Priors for random effects standard deviations
 sigma_state ~ cauchy(0, 2.5);
 sigma_level ~ cauchy(0, 2.5);

 // Priors for non-centered random effects
 state_effect_raw ~ normal(0, 1);
 level_effect_raw ~ normal(0, 1);

 // Model
 for (i in 1:N) {
   successes[i] ~ binomial_logit(trials[i], intercept + beta_income * avg_income[i] + beta_severity * avg_severity[i] + state_effect[state[i]] + level_effect[level[i]]);
 }
}

I have removed the lower bounds from the random effects in the parameters block and corrected the indexing in the model block, as the number of observations and the number of hospitals are the same.

==================physics time============================

BC: Can you write a C++ program to carry out an N-body problem simulation using the leapfrog integrator for the Hamiltonian dynamics derived from Newtonian mechanics. It should randomly initialize positions and masses and velocities.

GPT4: Here’s a simple C++ program for an N-body simulation using the leapfrog integrator. This program initializes positions, masses, and velocities randomly and then simulates the system using the leapfrog method.

cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <random>

const double G = 6.67430e-11; // Gravitational constant

struct Particle {
   double m;
   double x, y, z;
   double vx, vy, vz;
};

void compute_forces(const std::vector<Particle>& particles,
std::vector<Particle>& forces) {
   for (size_t i = 0; i < particles.size(); ++i) {
       double fx = 0, fy = 0, fz = 0;
       for (size_t j = 0; j < particles.size(); ++j) {
           if (i == j) continue;
           double dx = particles[j].x - particles[i].x;
           double dy = particles[j].y - particles[i].y;
           double dz = particles[j].z - particles[i].z;
           double d = std::sqrt(dx*dx + dy*dy + dz*dz);
           double f = G * particles[i].m * particles[j].m / (d*d);
           fx += f * dx / d;
           fy += f * dy / d;
           fz += f * dz / d;
       }
       forces[i].x = fx;
       forces[i].y = fy;
       forces[i].z = fz;
   }
}

void leapfrog_step(std::vector<Particle>& particles, double dt) {
   std::vector<Particle> forces(particles.size());
   compute_forces(particles, forces);

   for (size_t i = 0; i < particles.size(); ++i) {
       particles[i].vx += forces[i].x / particles[i].m * dt / 2;
       particles[i].vy += forces[i].y / particles[i].m * dt / 2;
       particles[i].vz += forces[i].z / particles[i].m * dt / 2;
       particles[i].x += particles[i].vx * dt;
       particles[i].y += particles[i].vy * dt;
       particles[i].z += particles[i].vz * dt;
   }

   compute_forces(particles, forces);
   for (size_t i = 0; i < particles.size(); ++i) {
       particles[i].vx += forces[i].x / particles[i].m * dt / 2;
       particles[i].vy += forces[i].y / particles[i].m * dt / 2;
       particles[i].vz += forces[i].z / particles[i].m * dt / 2;
   }
}

int main() {
   const int n_particles = 100;
   const double t_total = 1000.0;
   const double dt = 0.01;

   std::random_device rd;
   std::mt19937 gen(rd());
   std::uniform_real_distribution<> mass_dist(1e20, 1e22);
   std::uniform_real_distribution<> pos_dist(-1e11, 1e11);
   std::uniform_real_distribution<> vel_dist(-1e4, 1e4);

   std::vector<Particle> particles(n_particles);

   for (int i = 0; i < n_particles; ++i) {
       particles[i].m = mass_dist(gen);
       particles[i].x = pos_dist(gen);
       particles[i].y = pos_dist(gen);
       particles[i].z = pos_dist(gen);
       particles[i].vx = vel_dist(gen);
       particles[i].vy = vel_dist(gen);
       particles[i].vz = vel_dist(gen);
   }
   int n_steps = static_cast<int>(t_total / dt);
   for (int step = 0; step < n_steps; ++step) {
       leapfrog_step(particles, dt);

   // Print the state of the system at each step (optional)
   /*
   for (const auto& p : particles) {
       std::cout << "Mass: " << p.m << ", ";
       std::cout << "Position: (" << p.x << ", " << p.y << ", " << p.z << "), ";
       std::cout << "Velocity: (" << p.vx << ", " << p.vy << ", " << p.vz << ")\n";
   }
   std::cout << "-----\n";
   */
 }

 // Print final state of the system
 for (const auto& p : particles) {
   std::cout << "Mass: " << p.m << ", ";
   std::cout << "Position: (" << p.x << ", " << p.y << ", " << p.z << "), ";
   std::cout << "Velocity: (" << p.vx << ", " << p.vy << ", " << p.vz << ")\n";
 }
 return 0;
}

This program initializes the positions, masses, and velocities randomly, and then simulates the system using the leapfrog method for a given total time `t_total` and time step `dt`. The state of the system can be printed at each step by uncommenting the optional print statements in the loop. The final state of the system is printed after the simulation is complete.

BC: What is the best integration method to use for an N-body problem of the scale you just coded?

GPT4: For N-body problems, especially when simulating gravitational interactions, the best integration methods are those that are symplectic, time-reversible, and have good long-term energy conservation properties.

One of the popular choices for such problems is the leapfrog integrator, which was used in the code I provided earlier. Leapfrog is a second-order symplectic integrator with good energy conservation properties and is relatively simple to implement. It is suitable for many N-body simulations and provides accurate results for a wide range of problems.

However, when dealing with highly eccentric orbits or close encounters between bodies, higher-order symplectic integrators like the Verlet, Forest-Ruth, or Yoshida integrators might be more appropriate. These methods provide better accuracy and stability for such challenging scenarios, albeit at the cost of increased complexity and computational effort.

I asked Bob if he’d like to blog this, and he said, “Overall, everyone’s fatigued from everyone saying how great GPT is, so I’m not so keen to blog it. But hey, feel free if you’d like to.”

So I decided to post it on lag. By the time the post is appeared, perhaps people won’t be talking about GPT anymore. The only thing I added above were the html tags to make the code more readable on the blog.

My overall reaction is the same as Bob’s: I’m impressed. As we see more and more of these chatbot dialogues, it becomes more clear how they work, in general terms. As Bob and others have explained, they’ve pulled tons of text off the internet and are reconstructing it, monkey-on-typewriter style. Oddly enough, the regurgitating nature of the algorithms becomes clearer when the chatbots work well, as above, than when they fail, as in Gary Smith’s examples we discussed awhile ago. So, yeah, it doesn’t understand Stan; it can just find various Stan codes on the web that are connected to various statistical questions, and put these together. Kinda like the way that, if I’m trying to get a widget working in Latex and I’m too lazy to try to figure out what’s going on, I’ll search the web for some relevant code. But . . . damn, yeah, it’s pretty impressive how well this works, how close it comes to a reasonable answer, even while not following the logic that a human would use to answer these questions. Or, maybe it’s not so far from the logic that a human would use to try to get in the right ballpark.

Also, I don’t like the blank lines between paragraphs of the code. When I’m coding, a key resource is real estate on the screen, and I don’t like to waste space in that way. Every blank line represents a line of code somewhere else that I don’t get to see. But we can’t blame the chatbot—it’s just copying all the code-with-blank-lines that’s already on the web!

Summer School on Advanced Bayesian Methods in Belgium

(this post is by Charles)

This September, the Interuniversity Institute for Biostatistics and statistical Bioinformatics is holding its 5th Summer School on Advanced Bayesian Methods. The event is set to take place in Leuven, Belgium. From their webpage:

As before, the focus is on novel Bayesian methods relevant to the applied statistician. In the fifth edition of the summer school, the following two courses will be organized in Leuven from 11 to 15 September 2023:

The target audience of the summer school are statisticians and/or epidemiologists with a sound background in statistics, but also with a background in Bayesian methodology. In both courses, practical sessions are organized, so participants are asked to bring along their laptop with the appropriate software (to be announced) pre-installed.

I’m happy to do a three-day workshop on Stan: we’ll have ample time to dig into a lot of interesting topics and students will have a chance to do plenty of coding.

I’m also looking forward to the course on spatial modeling. I’ve worked quite a bit on the integrated Laplace approximation (notably its implementation in autodiff systems such as Stan), but I’ve never used the INLA package itself (or one of its wrappers), nor am I very familiar with applications in ecology. I expect this will be a very enriching experience.

The registration deadline is July 31st.

The Economist is hiring a political data scientist to do election modeling!

Dan Rosenheck writes:

Our data-journalism team conducts original quantitative research, deploying cutting-edge statistical methods to ask and answer important, relevant questions about politics, economics and society. We are looking to hire a full-time political data scientist. . . .

The data scientist will oversee all of our poll aggregators and predictive models for elections. This entails learning all of the code and calculations behind our existing politics-related technical projects, such as our forecasting systems for presidential and legislative elections in the United States, France and Germany; proposing and implementing improvements to them; and setting up and maintaining data pipelines to keep them updated regularly once they launch. The data scientist will also have the opportunity to design and build new models and trackers on newsworthy subjects.

It’s a permanent, full-time staff position, to replace Elliott Morris, who worked on the 2020 forecast with Merlin and me (see here for one of our blog posts and here and here for relevant academic articles).

Sounds like a great job for a statistician or a political scientist, and I hope I’ll have the opportunity to work with whoever the Economist’s new data scientist is. We built a hierarchical model and fit it in Stan!

Stan class at NYR Conference in July (in person and virtual)

I (Jonah) am excited to be teaching a 2-day Stan workshop preceding the NYR Conference in July. The workshop will be July 11-12 and the conference July 13-14.  The focus of the workshop will be to introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Over the course of two full days, participants will learn to write models in the Stan language, run them in R, and use a variety of R packages to work with the results.

There are both in-person and remote spots available, so if you can’t make it to NYC you can still participate. For tickets to the workshop and/or conference head to https://rstats.ai/nyr.

P.S. In my original post I forgot to mention that you can use the discount code STAN20 to get 20% off tickets for the workshop and conference!

 

Principal stratification for vaccine efficacy (causal inference)

Rob Trangucci, Yang Chen, and Jon Zelner write:

In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

Sounds important. And they use Stan, which always makes me happy.