GIST: Gibbs self-tuning for HMC

I’m pleased as Punch to announce our new paper,

We followed the mathematician alphabetical author-ordering convention.

The basic idea

The basic idea is so simple that I’m surprised it’s not more popular. The general GIST sampler couples the HMC tuning parameters (step size, number of steps, mass matrix) with the position (model parameters) and momentum. In each iteration of the Markov chain, we resample the momentum using a Gibbs step (as usual in HMC), then sample the tuning parameters in a second Gibbs step, conditioning on the current position and the resampled momentum. The proposal is generated by the leapfrog algorithm using the sampled tuning parameters. The accept/reject step then uses the ratio of the joint densities, which now include the tuning parameters. Of course, we need to specify a conditional distribution over the tuning parameters to make this concrete.
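Here’s a minimal sketch of a single GIST iteration in Python, just to show the structure. The tuning_dist object with its sample and log_prob methods is a stand-in for whatever conditional distribution over tuning parameters you choose, and leapfrog and log_joint are assumed helpers; this is a sketch, not our actual implementation.

import numpy as np

def gist_step(theta, log_joint, leapfrog, tuning_dist, rng):
    rho = rng.normal(size=theta.shape)                 # Gibbs step 1: refresh momentum
    tune = tuning_dist.sample(theta, rho, rng)         # Gibbs step 2: draw tuning parameters
    theta_star, rho_star = leapfrog(theta, rho, tune)  # leapfrog proposal with sampled tuning
    rho_star = -rho_star                               # momentum flip
    log_accept = (log_joint(theta_star, rho_star)
                  + tuning_dist.log_prob(tune, theta_star, rho_star)
                  - log_joint(theta, rho)
                  - tuning_dist.log_prob(tune, theta, rho))
    return theta_star if np.log(rng.uniform()) < log_accept else theta

The accept/reject ratio includes the tuning-parameter conditional evaluated at both the current and the proposed state, which is where the forward-and-backward balancing in the concrete samplers below comes from.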

The same idea could be applied to other Metropolis samplers.

Some prior art

We were inspired by a combination of NUTS and some negative results from Nawaf regarding acceptance bounds in delayed rejection generalized HMC (more on that later).

Radford Neal used the coupling idea for the Metropolis acceptance probability in his recent paper Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions. This is for generalized HMC, which is one-step HMC with partial momentum refresh. The partial refresh means the momentum flips in HMC matter and will typically thwart directed movement. Neal’s approach groups the acceptances together so that generalized HMC can make directed movement like HMC itself. Although this easily fits within the GIST framework, the proposal for the acceptance probability doesn’t depend on the current position or momentum.

We (Chirag Modi, Gilad Turok, and I) are working on a follow-up to my recent paper with Chirag and Alex Barnett on delayed rejection HMC. The goal there was to properly sample multiscale and varying-scale densities (i.e., “stiff” Hamiltonians). In the follow-up, we’re using generalized HMC with delayed rejection rather than Radford’s coupling idea, and show that it can also generate directed exploration. The paper’s drafted and should be out soon. Spoiler alert! DR-G-HMC works better than DR-HMC in its ability to adjust to local changes in scale, and it’s competitive with NUTS in some cases where there are varying scales, but not in most problems.

Algorithms that randomize the number of steps (or stepsize), like those in Changye Wu and Christian Robert’s Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale, can also be seen as an instance of GIST. From this view, they are coupling the number of steps and sampling that number from a fixed uniform distribution each iteration. This doesn’t condition on the current position. The general framework immediately suggests biasing those draws toward longer jumps, for example by sampling uniformly from the second half of the trajectory.
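As a tiny sketch of that suggestion (my notation, not theirs): given a trajectory of L leapfrog steps, sample the number of steps uniformly from the second half.

import numpy as np

rng = np.random.default_rng()
L = 20                       # hypothetical trajectory length
N = rng.integers(L // 2, L)  # uniform over {L // 2, ..., L - 1}, i.e., the second half

As with any GIST instance, the accept/reject step then has to evaluate the matching conditional probability of N in the reverse direction.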

Sam Livingstone was visiting Flatiron Institute for the last couple of weeks, and when I ran the basic idea by him, he suggested I have a look at the paper by Chris Sherlock, Szymon Urbas, and Matthew Ludkin, The apogee to apogee path sampler. This is a really cool idea that uses U-turns in potential energy (negative log density), which is univariate, rather than U-turns in position space. They evolve the Hamiltonian forward and backward in time for a given number of U-turns in log density, and then sample from among the points on the path (like NUTS does). One issue is that a regular normal distribution on momentum can have trouble sampling through the range of log densities (see, e.g., Sam et al.’s paper on kinetic energy choice in HMC). Because we were going to be discussing his paper in our reading group, I wrote to Chris Sherlock, and he told me they have been working on the apogee-to-apogee idea since before NUTS was released! It’s a very similar idea, with similar forward/backward-in-time balancing. The other idea it shares with NUTS is a bias toward longer jumps—this is done in a really clever way that we can borrow for GIST. Milo and Nawaf figured out how NUTS and the apogee-to-apogee sampler can be fit into the GIST framework, which simplifies their correctness proofs. Once you’ve defined a proper conditional distribution for GIST, you’re done. The apogee-to-apogee paper is also nice in evaluating a randomized-step-size version of HMC they call “blurred” HMC (which, of course, fits into the general GIST framework the same way the Wu and Robert sampler does).

If you know of other prior art or examples, we’d love to hear about them.

Our concrete alternative to NUTS

The alternative to NUTS we discuss in the GIST paper proposes a number of steps by iterating the leapfrog algorithm until a U-turn, then randomly sampling along the trajectory (like the improvement to the original NUTS that Michael Betancourt introduced, or the apogee-to-apogee sampler). We then use a crude biasing mechanism (compared to the apogee-to-apogee sampler or NUTS) toward longer paths. That’s it, really. If you look at what that means for evaluating the acceptance probability, we have to run a trajectory backwards in time from the sampled point until a U-turn—you don’t really get away from that forward-and-backward thing in any of these samplers.

We evaluate mean-square jump distance, rejections, and errors on parameter estimates and squared parameter estimates. It’s a little behind NUTS’s performance, but mostly in the same ballpark. In most cases, the variance among NUTS runs was greater than the difference between the mean NUTS and new algorithm run times. The evals demonstrate why it’s necessary to look at both parameter and squared parameter estimates. As draws become more anti-correlated, which happens by maximizing expected square jump distance, estimates for parameters improve, but error goes way up on estimates of squared parameters. I provide an example in this Stan forum post, which was inspired by a visit with Wu and Robert to discuss their randomized steps algorithm. Also, before we submit to a journal, I need to scale all the root mean-square-error calculations to Z scores.
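As a rough sketch of those evaluations (my notation, not the code from the paper), assume draws is a (number of draws) by (number of parameters) NumPy array and we have reference values for both E[theta] and E[theta^2]:

import numpy as np

def eval_metrics(draws, true_mean, true_sq_mean):
    # mean-square jump distance between successive draws
    msjd = np.mean(np.sum(np.diff(draws, axis=0) ** 2, axis=1))
    # root mean-square error of the parameter estimates
    rmse = np.sqrt(np.mean((draws.mean(axis=0) - true_mean) ** 2))
    # root mean-square error of the squared-parameter estimates
    rmse_sq = np.sqrt(np.mean(((draws ** 2).mean(axis=0) - true_sq_mean) ** 2))
    return msjd, rmse, rmse_sq

Scaling to Z scores would divide each componentwise error by the posterior standard deviation of the corresponding quantity before averaging.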

What we’re excited about here is that it’s going to be easy to couple step size adaptation. We might even be able to adapt the mass matrix this way and get a cheap approximation to Riemannian HMC.

Try it out

The evaluation code is up under an MIT License on my GitHub repo adaptive-hmc. I’m about to go add some more user-facing doc on how to run the evaluations. I’m really a product coder at heart, so I always find it challenging to hit the right level of doc/clarity/robustness/readability with research code. To get log densities and gradients from Stan models to develop our algorithm, we used BridgeStan.

I’m always happy to get suggestions on improving my code, so feel free to open issues or send me email.

Thank you!

This project would’ve been much harder if I hadn’t gotten feedback on the basic idea and code from Stan developer and stats prof Edward Roualdes. We also wouldn’t have been able to do this easily if Edward hadn’t developed BridgeStan. This kind of code is so prone to off-by-one, numerator-vs-denominator, negative-vs-positive, and log-vs-exponent mistakes that it makes my head spin. It doesn’t help that I’ve moved from R to Python, where the upper bounds are exclusive! Edward found a couple of dingers in my original code. Thanks also to Chirag for helping me understand the general framework and for helping with the code. Both Edward and Chirag are working on better alternatives to our simple alternative to NUTS, which will be showing up in the same adaptive HMC repo—just keep in mind this is all research code!

What’s next?

Hopefully we’ll be rolling out a bunch of effective GIST samplers soon. Or even better, maybe you will…

P.S. In-person math and computing

The turnaround time on this paper from conception to arXiv is about the fastest I’ve ever been involved with (outside of simple theoretical notes I used to dash out). I think the speed is from two things: (1) the idea is easy, and (2) Milo, Nawaf and I spent four days together at Rutgers about a month ago working pretty much full time on this paper (with some time outs for teaching and talks!). We started with a basic idea, then worked out all the theory and developed the alternative to NUTS and tried some alternatives over those four days. It’s very intense working like that, but it can be super productive. We even triple coded on the big screen as we developed the algorithm and evaluated alternatives. Then we divided the work of writing the paper cleanly among us—as always, modularity is the key to scaling.

“Close but no cigar” unit tests and bias in MCMC

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll open access it when we put the arXiv paper up—hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)                  # steps to one past the U-turn going forward
LB = self.lower_step_bound(L)
N = self._rng.integers(LB, L)               # uniform over {LB, ..., L - 1}
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star                        # flip momentum for reversibility
Lstar = self.uturn(theta_star, rho_star)    # steps to one past the U-turn going back
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):          # N must be proposable in the reverse direction
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, ..., L - 1}. That’s correct, because I want to choose a number of steps that’s at least the lower bound and less than the number of steps that takes me past the U-turn. Let’s just say I got this wrong the first time around.

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I check how many steps it takes to get one past a U-turn coming back (after the momentum flip), which is Lstar. Now I have to grab its lower bound and make sure that I take a number of steps between the lower bound (inclusive) and the upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in positions that are relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar), given the uniform sampling with Lstar excluded and LBstar included. But then I replaced the uniform distribution with a binomial and made the following mistake.

log_accept = (
  self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
  - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)), with a subtraction in place of a division. Luckily, Edward caught this one in code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results, because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7. That’s a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the error in the parameter estimates continued to follow a 1 / sqrt(iterations) trend, but the error in the squared-parameter estimates leveled off at a residual after about 100K iterations. That is, it required 1M iterations and an evaluation of the means of squared parameters to detect this bug.
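For the record, the corrected version just adds the length log probabilities rather than subtracting them (same method names as the snippet above).

log_accept = (
    self.log_joint(theta_star, rho_star) + self.length_log_prob(N, Lstar)
    - (log_joint_theta_rho + self.length_log_prob(N, L))
)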

I then introduced a similar error when I went to a binomial selection of the number of steps. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(log.p = TRUE, ...). What I don’t like about Python is that the discrete uniform doesn’t include its endpoint. The binomial, of course, does include its endpoint as an option, so these two versions need to be coded off by one. Of course, I missed the L - 1. This only introduced a bug because I didn’t make the matching adjustment in the test of whether things were reversible. That’s if not(1 <= N and N < Lstar), to match the Lstar - 1 in the logpmf() call. If I had run it all the way to L, then I would’ve needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.
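Spelled out, the matching pair looks something like the following sketch, with the same hypothetical names as above and the log-probability variable names mine.

log_prob_N = sp.stats.binom.logpmf(N, L - 1, self._success_prob)           # forward: support {0, ..., L - 1}
log_prob_N_star = sp.stats.binom.logpmf(N, Lstar - 1, self._success_prob)  # reverse: support {0, ..., Lstar - 1}
if not(1 <= N and N < Lstar):                                              # reversibility check on the matching support
    ... reject ...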

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations as a test and realizing estimates were off by 0.01 when they should've had smaller error. We reported this on the blog back when it happened, culminating in the post Michael found the bug in Stan's new sampler.

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).

Intelligence is whatever machines cannot (yet) do

I had dinner a few nights ago with Andrew’s former postdoc Aleks Jakulin, who left the green fields of academia for entrepreneurship ages ago. Aleks was telling me he was impressed by the new LLMs, but then asserted that they’re clearly not intelligent. This reminded me of the old saw in AI that “AI is whatever a machine can’t do.”

In the end, the definition of “intelligent” is a matter of semantics. Semantics is defined by conventional usage, not by fiat (the exception seems to be an astronomical organization trying to change the definition of “planet” to make it more astronomically precise). We do this all the time. If you think about what “water” means, it’s incredibly vague. In the simplest case, how many minerals can it contain before we call it “mud” rather than “water”? Does it even have to be made of H2O if we can find a clear liquid on an alternative Earth that will nourish us in the same way (this is a common example in philosophy from Hilary Putnam, I believe)? When the word “water” was first introduced into English, let’s just say that our understanding of chemistry was less developed than it is now. The word “intelligent” is no different. We’ve been using the term since before computers, and now we have to rethink what it means. By convention, we could decide as a group of language users to define “intelligent” however we want. Usually such decisions are guided by pragmatic considerations (or at least I’d like to think so—this is the standard position of pragmatist philosophers of language, like Richard Rorty). For instance, we could decide to exclude GPT because (a) it’s not embodied in the same way as a person, (b) it doesn’t have long-term memory, (c) it runs on silicon rather than cells, etc.

It would be convenient for benchmarking if we could fix a definition of “intelligence” to work with. What we do instead is just keep moving the bar on what counts as “intelligent.” I doubt people 50 years ago (1974) would have said you can play chess without being intelligent. But as soon as Deep Blue beat the human chess champion, everyone changed their tune and the chorus became “chess is just a game” and “it’s finite” and “it has well defined rules, unlike real life.” Then when IBM’s Watson trounced the world champion at Jeopardy!, a language based game, it was dismissed as a parlor trick. Obviously because a machine can play Jeopardy!, the reasoning went, it doesn’t require intelligence.

Here’s the first hit on Google I found searching for something like [what machines can’t do]. This one’s in a popular magazine, not the scientific literature. It’s the usual piece in the genre of “ML is amazing, but it’s not intelligent because it can’t do X”.

Let’s go over Toews’s list of AI’s failures circa 2021 (these are direct quotes).

  1. Use “common sense.” A man went to a restaurant. He ordered a steak. He left a big tip. If asked what the man ate in this scenario, a human would have no problem giving the correct answer—a steak. Yet today’s most advanced artificial intelligence struggles with prompts like this.
     
  2. Learn continuously and adapt on the fly. Today, the typical AI development process is divided into two distinct phases: training and deployment.
     
  3. Understand cause and effect. Today’s machine learning is at its core a correlative tool. It excels at identifying subtle patterns and associations in data. But when it comes to understanding the causal mechanisms—the real-world dynamics—that underlie those patterns, today’s AI is at a loss.
     
  4. “Reason ethically…In 2016, Microsoft debuted an AI personality on Twitter named Tay. The idea was for Tay to engage in online conversations with Twitter users as a fun, interactive demonstration of Microsoft’s NLP technology. It did not go well. Within hours, Internet trolls had gotten Tay to tweet a wide range of offensive messages: for instance, “Hitler was right” and “I hate feminists and they should all die and burn in hell.”

(1) ChatGPT-4 gets these common-sense problems mostly right. But it’s not logic. The man may have ordered a steak, gotten it, sent it back, ordered the fish instead, and still left a big tip. This is a problem with a lot of the questions posed to GPT about whether X follows from Y. It’s not a sound inference, just the most likely thing to happen, or as we used to say, the “default.” Older AIs were typically designed around sound inference and weren’t so much trying to emulate human imprecision (having said that, my grad school admissions essay was about default logics, and my postdoc was funded by a grant on them back in the 1980s!).

(2) You can do in-context learning with ChatGPT, but it doesn’t retain anything long term without retraining/fine-tuning. It will certainly adapt to its task/listener on the fly throughout a conversation (arguably the current systems like ChatGPT adapt to their interlocutor too much—it’s what they were trained to do via reinforcement learning). Long-term memory is perhaps the biggest technical challenge to overcome, and it’s been interesting to see people going back to LSTM/recurrent NN ideas (transformers, the neural net architecture underlying ChatGPT, were introduced in a paper titled “Attention is all you need”, and use long, but finite, memory).

(3) ChatGPT-4 is pretty bad at causal inference. But it’s probably above the bar set by Toews’s complaints. It’ll get simple “causal inference” right the same way people do. In general, humans are pretty bad at causal inference. We are way too prone to jump to causal conclusions based on insufficient evidence. Do we classify baseball announcers as not intelligent when they talk about how a player struggles in high-pressure situations after N = 10 plate appearances in the playoffs? We’re also pretty bad at reasoning about things that go against our preconceptions. Do we think Fisher was not intelligent because he argued that smoking didn’t cause cancer? Do we think all the anthropogenic global warming deniers are not intelligent? Maybe they’re right and it’s just a coincidence that temps have gone up coinciding with industrialization and carbon emissions. Seems like a highly suspicious coincidence, but causation is really hard when you can’t do randomized controlled trials (and even then it’s not so easy because of all the possible mediation).

(4) How you call this one depends on whether you think the front-line fine-tuning of ChatGPT made a reasonably helpful/harmless/truthful bot or not and whether the “ethics” it was trained with are yours. You can certainly jailbreak even ChatGPT-4 to send it spiraling into hate land or fantasy land. You can jailbreak some of my family in the same way, but I wouldn’t go so far as to say they weren’t intelligent. You can find lots of folks who think ChatGPT is too “woke”. This is a running theme on the GPT subreddit. It’s also a running theme among anti-woke billionaires, as reflected in the UK’s Daily Telegraph article title, “ChatGPT may be the next big thing, but it’s a biased woke robot.”

I’ve heard a lot of people say their dog is more intelligent than ChatGPT. I suppose they would argue for a version of intelligence that doesn’t require (1) or (4) and is very tolerant of poor performance in (2) and (3).

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: A Guide to the Theory of NP-Completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even soft clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures? Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

  • “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  • “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  • “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. A problem is NP-hard if it’s at least as hard as every problem in NP (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). A problem in NP can be solved by a non-deterministic algorithm that makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. A problem is NP-complete if it is NP-hard and a member of NP. Some well-known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that there are problems in NP that are not in P (i.e., that P ≠ NP).

In other words, there’s a Turing Award waiting for you if you can actually turn the third response into the second.

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

And yes, that’s the same Tosh who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve gone is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms, and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

ISBA 2024 Satellite Meeting: Lugano, 25–28 June

Antonietta Mira is organizing a satellite workshop before ISBA. It’s free, there is still time to submit a poster, and it’s a great excuse to visit Lugano. Here are the details:

I really like small meetings like this. Mitzi and I are going to be there and then continue on to ISBA.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which is the varying effect of income on Republican preference by state. Each state had its own slope and intercept, related through a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ prior is what they would have used had Ben Goodrich created it by that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, the simulations, and the blog post, and I’m the one who designed the language! This would’ve been much faster if I weren’t trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code, then the Stan code only has a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include an intercept.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihood:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows dot product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma) that are the time wasters in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta using _beta. You’d probably want to do that in a context with multiple groupings. It gets very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at the ends of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The exception is the one true inline code comment that I’d keep, which notes the implication of the non-centered parameterization for the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend building and debugging in stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate-level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (d) cognitive science including psycholinguistics and cognitive anthropology, (e) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (f) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation n in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates, J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with the simulated data and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set. R-hats are all less than 1.01, ESS is in the thousands from a sample of size 4000, and all but a couple of parameters are recovered within their 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this example triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first. Then generalize later if you need to. I code the general case in the next section.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

When all else fails, add a code comment

Another way of saying this is that you should treat inline code comments as a last resort when there is no other way to make your intentions clear.

I used to teach a session of Andrew’s statistical communication class once a year and I’d focus on communicating a computational API. Most of the students hated it because they signed up for the class to hear Andrew talk about stats, not me talk about API design. At least one student just up and walked out every year! So if you’re that student, now’s your chance to bail.

Comments considered harmful

Most academics, before they will share code with me, tell me they have to “clean it up.” I invariably tell them not to bother, and at best, they will dilly dally and shilly shally and apologize for lack of comments. What they don’t realize is that they were on the right track in the first place. The best number of inline code comments is zero. Nada. Zilch. Nil. Naught.

Why are comments so harmful? They lie! Even with the best of intent, they might not match the actual implementation. They often go stale over time. You can write whatever you want in a comment and there’s no consistency checking with the code.

You know what doesn’t lie? Code. Code doesn’t lie. So what do professional programmers do? They don’t trust comments and read the code instead. At this point, comments just get in the way.

What’s a bajillion times better than comments?

Readable code. Why? It’s self documenting. To be self documenting, code needs to be relatively simple and modular. The biggest mistake beginners make in writing code is lack of modularity. Without modularity, it’s impossible to build code bottom up, testing as you go.

It’s really hard to debug a huge program. It’s really easy to debug modules built up piece by piece on top of already-tested modules. So design top down, but build code bottom up. This is why we again and again stress in our writing on Bayesian workflow and in our replies to user questions on forums, that it helps immensely to scaffold up a complicated model one piece at a time. This lets you know when you add something and it causes a failure.

Knowing where to draw lines between modules is, unfortunately, a matter of experience. The best way to get that experience? Read code. In the product coding world, code is read much more often than it’s written. That means much more effort typically goes into production code to make it readable. This is very unlike research code which might be written once and never read again.

There is a tradeoff here. Code is more readable with short variable names and short function names. It’s easier to apprehend the structure of the expression a * b + c**2 than meeting_time * number_of_meetings + participants**2. We need to strike a balance with not too long, but still informative variable names.

And why are beginners so afraid of wasting horizontal space while being spendthrifts with the much more valuable vertical space? I have no explanation. But I see a lot of code from math-oriented researchers that looks like this: ((a*b)/c)+3*9**2+cos(x-y). Please use spaces around operators and no more parens than are necessary to disambiguate how the operators bind.

When should I comment?

Sometimes you’re left with no choice and have to drop in a comment as a last resort. This should be done if you’re doing something non-idiomatic with the language or coding an unusual algorithm or something very involved. In this case, a little note inline about intent and/or algebra can be helpful. That’s why commenting is sometimes called a method of last resort.

But whatever you do, comment for people who know the language better than you. Don’t write a comment that explains what a NumPy function does—that’s what the NumPy doc is for. Nobody wants to see this:

int num_observations = 513;  // declare num_observations as an integer and set equal to 513

But people who feel compelled to comment will write just this kind of thing, thinking it makes their code more professional. If you think this is a caricature, you don’t read enough code.

The other thing you don’t want to do is this:

#####################################################
################## INFERENCE CODE ###################
#####################################################
...
...
...

This is what functions are for. Write a function called inference() and call it. It will also help prevent accidental reuse of global variables, which is always a problem in scripting languages like R and Python. Don’t try to fix hundreds or thousands of lines of unstructured code with structured comments.
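In Python, that’s just something like this (hypothetical names):

def inference(data):
    # the code that used to sit under the INFERENCE CODE banner goes here;
    # working variables stay local instead of leaking into the global scope
    ...

A single call like results = inference(data) then replaces the banner-delimited block.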

Another thing to keep in mind is that vertical space is very precious in coding, because we want to be able to see as much of the code as we can at a time without scrolling. Don’t waste vertical space with useless or even harmful comments.

Do not, and I repeat, do not use /* ... */ style comments inline with code. It’s too easy to get confused when it’s a lot of lines and it’s doubly confusing when nested. Instead, use line comments (// in C++ and Stan, # in Python and R). Use the comment-region command in emacs or whatever does the same in your IDE. With line comments, the commented out code will be very visible, as in the following example.

for (int n = 0; n < N; ++n) {
  // int x = 5;
  // int y = x * x * 3;
  // int z = normal_rng(y, 1);
  z = n * 3;
}

Compare that to what I often see, which is some version of the following.

for (int n = 0; n < N; ++n) {
  /* int x = 5;
  int y = x * x * 3;
  int z = normal_rng(y, 1); */
  z = n * 3;
}

In the first case, it's easy to just scan down the margin and see what's commented out.

After commenting out and fixing everything, please be a good and respectful citizen and delete all the commented-out code before merging or releasing. Dead code makes the live code hard to find, and one always wonders why it's still there---was it a mistake or some future plan or what? When I first showed up at Bell Labs in the mid-1990s, I was handed a 100+ page Tcl/Tk script for running a speech recognizer and told only a few lines were active, but I'd have to figure out which ones. Don't do that!

The golden triangle

What I stressed in Andrew's class is the tight interconnection between three aspects of production code:


$latex \textrm{API Documentation} \leftrightarrow \textrm{Unit tests} \leftrightarrow \textrm{Code}$

 

The API documentation should be functionally oriented and say what the code does. It might include a note as to how it does it if that is relevant to its use. An example might be different algorithms to compute the same thing that are widely known by name and useful in different situations. The API doc should ideally be specific enough to be the basis of both unit testing and coding. So I'm not saying don't document. I'm saying don't document how inline code works, document your API's intent.

The reason I call this the "golden" triangle is the virtuous cycle it imposes. If the API doc is hard to write, you know there's a problem with the way the function has been specified or modularized. With R and Python programmers, that's often because the code is trying to do too many things in a single function, and the input and output types become a mess of dependencies. This leads to what programmers identify as a "bad smell" in the code. If the code or the unit tests are hard to write, you know there's a problem with the API specification.

Clients (human and computational) are going to see and "feel" the API. That's where the "touch" is that designers like to talk about in physical object design. Things need to feel natural for the application, or in the words of UI/UX designers, the API needs to offer affordances (in the past, we might have said it should be intuitive). Design the API first from the client perspective. Sometimes you have to suffer on the testing side to maintain a clean and documentable API, but that clean API is your objective.

What about research code?

Research code is different. It doesn't have to be robust. It doesn't have to be written to be read by multiple people in the future. You're usually writing end-to-end tests rather than unit tests, though that can be dangerous. It still helps to develop bottom-up with testing.

What research code should be is reproducible. There should be a single script to run that generates all the output for a paper. That way, even if the code's ugly, at least the output's reproducible and someone with enough interest can work through it.

And of course, research code needs to be tested that it's doing what it's supposed to be doing. And it needs to be audited to make sure it's not "cheating" (like cross-validating a time-series, etc.).

Notebooks, Quarto, and other things that get in the way of coding and documenting

With all due respect to Donald Knuth (never a good start), literate programming is a terrible way to develop code. (On a related topic, I would totally recommend at least the beginning part of Knuth's notes on how to write math.)

I don't love them, but I use Quarto and Jupyter (née IPython) notebooks for writing reproducible tutorial material. But only after I've sorted out the code. These tools mix text and code and make too many compromises along the way to be good at either task. Arguably the worst sin is that they wind up obfuscating the code with a bunch of text. Jupyter also makes it possible to get into inconsistent states because it doesn't automatically re-run everything. Quarto is just a terrible typesetting platform, inheriting all the flaws of pandoc and citeproc, with the added joy of HTML and LaTeX interoperability and R and Python interoperability. We use it for the Stan docs so that we can easily generate HTML and LaTeX, but it always feels like there should be a better way to do this, as it's a lot of trial and error due to the lack of a spec for markdown.

Evaluating samplers with reference draws

I started thinking a bit more about what we’re doing when we use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb gives you Stan programs and 10,000 reference draws from their posteriors. The reference draws must be generated by another sampler. In some cases, like Neal’s funnel or high-dimensional normals, we can just run forward simulation and get independent draws. For models with data, we don’t actually know the posterior analytically—that’s why we’re running MCMC in the first place! I think what posteriordb did was run NUTS for enough iterations that the draws could be thinned down to 10,000 roughly independent draws.

Standard error in reference

The standard error for $latex N$ independent draws estimating a particular quantity like the posterior mean of a parameter is

$latex \textrm{se} = \frac{\displaystyle \textrm{sd}}{\displaystyle \sqrt{N}}$

where sd is the standard deviation of the quantity in the posterior. Standard Monte Carlo methods scale horribly due to that square root—Markov chain Monte Carlo can be even worse because rather than the number of draws, we have the effective sample size, which can be much lower than the number of iterations for more challenging posteriors due to the correlation in the draws.

Because $latex \sqrt{10,000} = 100$, with 10,000 draws our standard error is only 1/100th of the standard deviation. In general, we need four times as many draws to cut the error in half, or 100 times as many draws to add a decimal place of accuracy by cutting the error by a factor of ten. That means we can’t reasonably expect something like posteriordb to store enough draws to make fine-grained convergence evaluation possible. A slightly more economical alternative would be to store long-term averages. There we could run a million iterations and save the parameter and squared-parameter estimates. We could even run a hundred million by farming out 100 jobs on a cluster. But that’s still only three or four decimal places of accuracy. After that, maybe we could wring out another decimal place or two with control variates.
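In code, the arithmetic above is just the following (a sketch assuming roughly independent draws).

import numpy as np

def std_error(draws):
    # Monte Carlo standard error of a posterior mean estimated from independent draws
    return np.std(draws, ddof=1) / np.sqrt(len(draws))

# with sd = 1: 10,000 draws give se = 0.01; 40,000 give 0.005; 1,000,000 give 0.001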

That’s a lot of residual error to have to deal with when evaluating an algorithm that asymptotically approaches the correct answer. The only thing that saves us is that the algorithms we’re evaluating are also samplers and are subject to the same inverse square root rate of reduction in standard error.

Report error in evaluations

What we really want to report is the standard error of the residual between our new algorithm’s estimate and the truth. But we don’t usually have the truth. In practice, we usually do one of two things, depending on how compute-intensive it is to fit the model and on our compute resources and patience.

Easy model: Run the new algorithm 100 times or even 1000 times and report something like a histogram of errors (see the sketch after these two cases). With this many runs, we can directly estimate the standard error (assuming our draws are unbiased), and we can estimate it even more tightly and without assumptions if we know the true values.

Hard model: Run the new algorithm once or twice and use an estimated standard error for the new algorithm, reporting that instead of an empirical histogram of errors. This suffers from estimation error in the effective sample size, which can be large.
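For the easy-model case, the evaluation loop is something like the following sketch, where fit_new_algorithm is a hypothetical function returning a (number of draws) by (dimension) array of draws and reference_mean comes from something like posteriordb.

import numpy as np

def error_per_run(fit_new_algorithm, reference_mean, num_runs=100):
    errors = []
    for _ in range(num_runs):
        draws = fit_new_algorithm()                         # one full run of the new sampler
        errors.append(draws.mean(axis=0) - reference_mean)  # error of this run's mean estimate
    return np.array(errors)                                 # histogram these or report their sd

Note that this treats reference_mean as the truth, which is exactly the problem discussed next.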

The problem with both of these approaches is that they only report the variability of the new algorithm, treating the reference draws as true. We should be going one step further and folding the error of posteriordb (or another reference database) into the uncertainty reporting for the error of the new algorithm. The additional error could be handled with simulation (and maybe analytically—normals are nice that way).

That’s easy enough to do when reporting single univariate means. I do not know how to do this for evaluating error in something like KL-divergence, which itself involves a Monte Carlo integration, or something like Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wasserstein distance algorithm. I have to admit I don’t even know how that algorithm works.

Evaluating delayed rejection

I ran into this problem because I want to evaluate our delayed rejection HMC and show that the error goes to zero as I take more draws. Specifically, I want to extend the illustration in the paper that shows that standard HMC and NUTS fail, but DR-HMC succeeds, at sampling from multiscale distributions like Neal’s funnel in its centered parameterization.

There are two fundamental problems with this.

Problem One: By the time we get past an effective sample size of 10,000 from the new algorithm, the error we’re estimating is dominated by the standard error of posteriordb. We don’t even know the real error of posteriordb, only an estimated standard error assuming the draws are independent.

Problem Two: We have no guarantee that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal’s funnel, there is no step size for HMC that will properly explore the posterior—each step size picks out a range of log scale parameters it can explore (as we demonstrate in the delayed rejection paper). The standard error estimate from the reference draws assumes independent draws from the posterior, not biased draws. If the posteriordb draws are biased, the error reported for a perfect sampler will not go to zero as the number of posteriordb and new-sampler draws goes to infinity.

Help?

Does anyone have any suggestions of what we can do in the face of potentially biased and stochastic samples for evaluation? I could wait until the Stan meeting on Thursday and ask Aki, who’s a font of inspiration on topics like this, but I thought I’d ask a wider group and see how that goes.

Fun with Dååta: Reference librarian edition

Rasmus Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).

Rasmus writes,

While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).

The search is on.

… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.

The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, the source of the original, gets involved. She replies to Rasmus’s query,

Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.

Rasmus concludes,

She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?

Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmus’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.

I think it’d be fun to organize a class around generating new, small scale and amusing data sets like this one. Maybe it sounds like more fun than it would actually be—data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…

P.S. In a related note, there’s something called a baath cake that’s popular in Goa and confused my web search.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

SIMD, memory locality, vectorization, and branch point prediction

The title of this post lists the four most important considerations for performant code these days (late 2023).

SIMD

GPUs can do a lot of compute in parallel. High-end ($15K to $30K) GPUs like those used in big tech perform thousands of operations in parallel (50 teraflops for the H100). The catch is that they want all of those operations to be done in lockstep on different data. This is called single instruction, multiple data (SIMD). Matrix operations, as used by today’s neural networks, are easy to code with SIMD.

GPUs cannot do 1000s of different things at once. This makes it challenging to write recursive algorithms like the no-U-turn sampler (NUTS) and is one of the reasons people like Matt Hoffman (developer of NUTS) have turned to generalized HMC.

GPUs can do different things in sequence if you keep memory on the GPU (in kernel). This is how deep nets can sequence feedforward, convolution, attention, activation, and GLM layers. Steve Bronder is working on keeping generated Stan GPU code in kernel.

Memory locality

Memory is slow. Really slow. The time it takes to fetch a fresh value from random access memory (RAM) is on the order of 100–200 times slower than the time it takes to execute a primitive arithmetic operation. On a CPU. The problem just gets worse with GPUs.

Modern CPUs are equipped with levels of cache. For instance, a consumer-grade 8-core CPU like an Intel i9 might have a 32 MB L3 cache shared among all the cores, plus a 1 MB L2 cache and an 80 KB L1 cache per core. When memory is read in from RAM, it gets read in blocks, with not only the value you requested, but other values near it. This gets pulled first into L3 cache, then into L2 cache, then into L1 cache, and then into registers on the CPU. This means that if you have laid an array x out contiguously in memory and have just read x[n], then it is really cheap to read x[n + 1] because it’s either in your L1 cache already or being pipelined there. If your code accesses non-local pieces of memory, then you wind up getting cache misses. The farther out the miss goes (L1, L2, L3, or all the way to RAM), the longer the CPU has to wait for the values it needs.

One way to see this in practice is to consider matrix layout in memory. If we use column major order, then each column is contiguous in memory and the columns are laid out one after the other. This makes it much more efficient to traverse the matrix by first looping over the columns, then looping over the rows. Getting the order wrong can be an order of magnitude penalty or more. Matrix libraries will do more efficient block-based transposes so this doesn’t bite users writing naive code.
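To see the effect, here is a rough sketch in Python with NumPy (my own toy benchmark, with sizes picked arbitrarily; note that NumPy defaults to row-major order, so the fast direction here is rows rather than columns):

import numpy as np
import time

n = 4000
a = np.zeros((n, n))   # NumPy arrays are row-major (C order) by default

start = time.perf_counter()
total = sum(a[i, :].sum() for i in range(n))   # rows are contiguous in memory
row_time = time.perf_counter() - start

start = time.perf_counter()
total = sum(a[:, j].sum() for j in range(n))   # columns stride across rows: cache misses
col_time = time.perf_counter() - start

print(row_time, col_time)   # the strided traversal is typically several times slower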

The bottom line here is that even if you have 8 cores, you can’t run 8 parallel chains of MCMC as fast as you can run 1. On my Xeon desktop with 8 cores, I can run 4 chains in parallel, followed by another 4 in parallel, in the same amount of time as I can run 8 in parallel. As a bonus, my fan doesn’t whine as loudly. The slowdown with 8 parallel chains is not due to the CPUs being busy; it’s because the asynchronous execution causes a bottleneck in the cache. This can be overcome with hard work by restructuring parallel code to be more cache sensitive, but it’s a deep dive.

Performant code often recomputes the value of a function if its operands are in cache in order to reduce the memory pressure that would arise from storing the value. Or it explicitly reorders operations to be lazy in order to support recomputation. Stan does this to prioritize scalability over efficiency (i.e., it recomputes values, which means fewer memory fetches but more operations).

Vectorization

Modern CPUs pipeline their operations and provide vector instructions, for example the AVX and SSE instructions on Intel chips. C++ compilers at high levels of optimization will exploit this if the right flags are enabled. This way, CPUs can do on the order of 8 simultaneous arithmetic operations. Writing loops in blocks of 8 so that they can exploit CPU vectorization is critical for performant code. The good news is that calling underlying matrix libraries like Eigen or BLAS will do that for you. The bad news is that if you write your own loops, they are going to be slow compared to vectorized loops; you have to arrange the vectorization yourself in C++ if you want performant code.
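Here is a quick sketch of the payoff from calling into a vectorized library rather than writing your own loop; in Python much of the gap is interpreter overhead, but the library call is also what engages the SIMD units and cache-friendly access (a toy comparison of my own, not a careful benchmark):

import numpy as np
import time

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

start = time.perf_counter()
total = 0.0
for a, b in zip(x, y):          # scalar-at-a-time loop
    total += a * b
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_vec = np.dot(x, y)        # vectorized BLAS dot product
blas_time = time.perf_counter() - start

print(loop_time / blas_time)    # commonly a factor in the hundreds or more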

Another unexpected property of modern CPUs for numerical computing is that integer operations are pretty much free. The CPUs have separate integer and floating point units, and with most numerical computing there is far less pressure on integer arithmetic. So you can often add integer arithmetic to a loop without slowing it down.

Branch-point prediction

When the CPU executes a conditional such as the compiled form of

if (A) then B else C;

it will predict whether A will evaluate to true or false. If it predicts true, then the operations in B will begin to execute “optimistically” at the same time as A. If A does evaluate to true, we have a head start. If A evaluates to false, then we have a branch-point misprediction. We have to backtrack, flush the results from the optimistic evaluation of B, fetch the instructions for C, then continue. This is very, very slow because of memory contention and because it breaks the instruction pipeline. And it’s double trouble for GPUs. Stan includes hints to the compiler as to which branch is more likely in our tight memory-management code for automatic differentiation.

Conclusion

The takeaway message is that for efficient code, our main concerns are memory locality, branch-point prediction, and vectorization. With GPUs, we further have to worry about SIMD. Good luck!

Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1

and

p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.
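To make the evidence concrete, here is a minimal sketch using a toy conjugate model of my own choosing (theta ~ normal(0, tau), y_n ~ normal(theta, sigma) with sigma known), where a Monte Carlo estimate over prior draws can be checked against the exact marginal:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

# Toy conjugate model (my choice, not from the post):
#   theta ~ normal(0, tau),  y_n ~ normal(theta, sigma),  sigma known
rng = np.random.default_rng(1)
tau, sigma, N = 1.0, 1.0, 20
y = rng.normal(0.5, sigma, size=N)

# Monte Carlo estimate of log p(y) = log E[p(y | Theta)] over the prior
M = 100_000
theta = rng.normal(0.0, tau, size=M)                      # prior draws
log_lik = norm.logpdf(y[:, None], loc=theta, scale=sigma).sum(axis=0)
log_evidence_mc = logsumexp(log_lik) - np.log(M)

# Exact marginal for this model: y ~ multi-normal(0, sigma^2 I + tau^2 J)
cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
log_evidence_exact = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov)

print(log_evidence_mc, log_evidence_exact)                # should agree closely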

There are ad-hoc guidelines from Harold Jeffreys of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = y', y'' and then training on y' and testing on y''. The held out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1

and

p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
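For example, with a set of draws standing in for the posterior given y', the two quantities differ by Jensen’s inequality; the posterior predictive densities above are logs of expectations (a toy illustration of my own):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(2)
theta = rng.normal(1.0, 0.8, size=10_000)   # stand-in for posterior draws given y'
y_new = 1.2                                  # a held-out observation
log_p = norm.logpdf(y_new, loc=theta, scale=1.0)

log_expected = logsumexp(log_p) - np.log(len(theta))   # log E[p(y'' | Theta) | y']
expected_log = log_p.mean()                            # E[log p(y'' | Theta) | y']
print(log_expected, expected_log)                      # log_expected >= expected_log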

(Leave one out) cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)

and

\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as an estimate of the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
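Sticking with the same toy conjugate model as in the sketch above (my own choice, made so the leave-one-out predictive density is available in closed form), the leave-one-out average looks like this:

import numpy as np
from scipy.stats import norm

# Toy conjugate model (my choice): theta ~ normal(0, tau), y_n ~ normal(theta, sigma)
rng = np.random.default_rng(3)
tau, sigma = 1.0, 1.0
y = rng.normal(0.5, sigma, size=50)
N = len(y)

log_pred = []
for n in range(N):
    y_minus = np.delete(y, n)
    precision = 1 / tau**2 + (N - 1) / sigma**2    # posterior precision given y_{-n}
    mu = (y_minus.sum() / sigma**2) / precision    # posterior mean given y_{-n}
    pred_sd = np.sqrt(1 / precision + sigma**2)    # posterior predictive sd for y_n
    log_pred.append(norm.logpdf(y[n], loc=mu, scale=pred_sd))

elpd_loo = np.mean(log_pred)                       # the average in the formulas above
print(elpd_loo)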

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.
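Here is a rough numerical illustration of that sensitivity, again with the toy conjugate model from the sketches above so everything is available in closed form: the posterior barely notices the prior scale, while the evidence, and hence the Bayes factor, shifts by several nats.

import numpy as np
from scipy.stats import multivariate_normal

# Toy conjugate model (my choice): theta ~ normal(0, tau), y_n ~ normal(theta, 1)
rng = np.random.default_rng(4)
N = 200
y = rng.normal(1.5, 1.0, size=N)

def log_evidence(y, tau, sigma=1.0):
    n = len(y)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

def posterior_mean_sd(y, tau, sigma=1.0):
    precision = 1 / tau**2 + len(y) / sigma**2
    return (y.sum() / sigma**2) / precision, np.sqrt(1 / precision)

for tau in (1.0, 100.0):
    print(tau, posterior_mean_sd(y, tau), log_evidence(y, tau))
# The posterior mean and sd barely move between tau = 1 and tau = 100, but the
# log evidence drops by several nats under the wide prior, which is a Bayes
# factor in the dozens coming entirely from the prior scale.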

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Simple pseudocode for transformer decoding a la GPT

A number of people have asked me for this, so I’m posting it here:

[Edit: revised original draft (twice) to fix log density context variable and width of multi-head attention values.]

This is a short note that provides complete and relatively simple pseudocode for the neural network architecture behind the current crop of large language models (LLMs), the generative pretrained transformers (GPT). These are based on the notion of (multi-head) attention, followed by feedforward neural networks, stacked in a deep architecture.

I simplified the pseudocode compared to things like Karpathy’s nanoGPT repository in Python (great, but it’s tensorized and batched PyTorch code for GPU efficiency) or Phuong and Hutter’s pseudocode, which is more general and covers encoding and multiple different architectures. I also start from scratch with the basic notions of tokenization and language modeling.

I include the pseudocode for evaluating the objective function for training and the pseudocode for generating responses. The initial presentation uses single-head attention to make the attention stage clearer, with a note afterward with pseudocode to generalize to multi-head attention.
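The note has the actual pseudocode; as a flavor of the attention stage, here is a rough NumPy sketch of a single attention head with a causal mask (my own simplification, not the note’s notation):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """X: (T, d) token embeddings; W_q, W_k, W_v: (d, d_head) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (T, T) similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)              # causal: no attending ahead
    return softmax(scores) @ V                            # (T, d_head) mixed values

rng = np.random.default_rng(0)
T, d, d_head = 5, 16, 8
X = rng.standard_normal((T, d))
out = single_head_attention(X, *(rng.standard_normal((d, d_head)) for _ in range(3)))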

I also include references to other basic presentations, including Daniel Lee’s version coded in Stan.

If this is confusing or you think I got a detail wrong, please let me know—I want to make this as clear and correct (w.r.t. GPT-2) as possible.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. This is joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I did my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you make a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient-based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett, and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or the neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means that if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling momentum each iteration, G-HMC resamples a new momentum value and then takes a weighted average with the existing momentum, with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories. The trick is getting sequences of acceptances together. Usually this doesn’t work because we have to flip momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but with rejections we reverse momentum (this isn’t an issue with HMC because it fully regenerates each time). So to get directed movement, we would need step sizes that are too small to be efficient. What Radford figured out is that we can solve this problem by replacing the way we generate uniform(0, 1)-distributed probabilities for the Metropolis accept/reject step (we compare the variate generated to the ratio of the density at the proposal to the density at the previous point and accept if the variate is lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth generating a sequence of values near zero, the acceptances will cluster together.
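To make the weighted average concrete, the standard form of the partial momentum refresh looks like the following sketch (my paraphrase, assuming a unit mass matrix; see Neal’s paper for the exact scheme he uses):

import numpy as np

def partial_momentum_refresh(p, alpha, rng):
    # Mix the old momentum with fresh standard normal noise; alpha near 1 keeps
    # most of the old momentum, and the update leaves a standard normal momentum
    # distribution invariant (unit mass matrix assumed).
    return alpha * p + np.sqrt(1.0 - alpha**2) * rng.standard_normal(p.shape)

rng = np.random.default_rng(0)
p = rng.standard_normal(10)
p = partial_momentum_refresh(p, alpha=0.98, rng=rng)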

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me, Chirag, and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (sawtooth pattern of acceptance probability) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set the tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (posterior covariance, or just the variance in the diagonal case) and also estimating step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond where we finally sample. But we still have a lot of eval to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

ChatGPT (4) can do 3-digit multiplication

I’ve heard a lot about GPT’s inability to do long multiplication. Most people had either tried giving it a few examples (i.e., “few-shot learning”) or giving it a few examples and asking it to explain its work (i.e., “chain of thought”). I thought I’d just try to teach it the way you might teach a person. Or if you want to get fancy, I used few-shot learning of chain of thought (a very common technique for getting it to do this kind of problem). The bottom line: I could get it to do 3-digit multiplications but not 4-digit. The title originally continued “but not 4-digit multiplication,” but I realized I hadn’t shown that—just that my prompt doesn’t do it.

Bob

BC: Now we are going to do multiplication.  As a refresher, here is a list
of all one-digit multiplications.

0 * 0 = 0
0 * 1 = 0
...
[removed for blog]
...
9 * 8 = 72
9 * 9 = 81

To multiply a two-digit number by a one-digit number, we break the two
digit number apart into its tens place and its one place and add them,
and then distribute through to get simpler additions, and then
multiply,

31 * 2
  = (30 + 1) * 2 
  = (30 * 2) + (1 * 2)
  = 60 + 2

If a number ends in zeros, we multiply the digits before the zero then
add the zeros back in at the end.  So to do 30 * 2 we multiply 3 * 2
to get 6 and then add the zero back to get 60.  Another way to think
of it is that 30 * 2 = 10 * 3 * 2 = 10 * 6 = 60, where we just wind up
multiplying by the tens place.

If we have a bigger number, it'll have more digits in the final
addition, as in

85 * 9
  = (80 + 5) * 9
  = (80 * 9) + (5 * 9)
  = 720 + 45
  = 765

To multiply two digit numbers times each other, we break them both
down, multiply all the factors, then add the resulting terms
together. 

23 * 45
  = (20 + 3) * (40 + 5)
  = (20 * 40) + (3 * 40) + (20 * 5) + (3 * 5)
  = 800 + 120 + 100 + 15
  = 920 + 100 + 15
  = 1020 + 15
  = 1035

We handle products of three digit numbers the same way.

234 * 567
  = (200 + 30 + 4) * (500 + 60 + 7)
  = (200 * 500) + (200 * 60) + (200 * 7) + (30 * 500) + (30 * 60) + (30 * 7) + (4 * 500) + (4 * 60) + (4 * 7)
  = 100000 + 12000 + 1400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 112000 + 1400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 113400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 128400 + 1800 + 210 + 2000 + 240 + 28
  = 130200 + 210 + 2000 + 240 + 28
  = 130410 + 2000 + 240 + 28
  = 132410 + 240 + 28
  = 132650 + 28
  = 132678

ChatGPT (4, Aug 3)

Certainly! To multiply 12 by 3, we can follow the method you outlined.

First, we break 12 down into its tens and ones places:

12 = 10 + 2

Next, we distribute 3 through the sum:

(10 + 2) * 3 = (10 * 3) + (2 * 3)
= 30 + 6
= 36

So 12 * 3 = 36.

Three digits works

Bob

That's really fantastic.  Now we want to see how far you can take this.  Can you do a three digit product?  How about trying

423 * 567
  = 

ChatGPT (4, Aug 3)

Certainly! To multiply 423 by 567, we'll follow the same process: breaking down each number into its individual place values (hundreds, tens, ones), and then using distributive multiplication.

First, the breakdown:
  423 = 400 + 20 + 3
  567 = 500 + 60 + 7

Now, the multiplication:

(400 + 20 + 3) * (500 + 60 + 7) 
 = (400 * 500) + (400 * 60) + (400 * 7) + (20 * 500) + (20 * 60) + (20 * 7) + (3 * 500) + (3 * 60) + (3 * 7) 
 = 200000 + 24000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 200000 + 24000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 224000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 226800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 236800 + 1200 + 140 + 1500 + 180 + 21 
 = 238000 + 140 + 1500 + 180 + 21 
 = 238140 + 1500 + 180 + 21 
 = 239640 + 180 + 21 
 = 239820 + 21 
 = 239841 

So 423 * 567 = 239841.

Generalizing to 4 x 2 digits

It can generalize to different numbers of digits.

Bob

How about a 4 digit number times a 2 digit number?

2357 * 15
  = 

ChatGPT (4, Aug 3)

Certainly! To multiply 2357 by 15, you can use the same distributive multiplication process.

First, break down each number into its place values:
 2357 = 2000 + 300 + 50 + 7 
 15 = 10 + 5 

Now, perform the multiplication:
 (2000 + 300 + 50 + 7) * (10 + 5) 
 = (2000 * 10) + (2000 * 5) + (300 * 10) + (300 * 5) + (50 * 10) + (50 * 5) + (7 * 10) + (7 * 5) 
 = 20000 + 10000 + 3000 + 1500 + 500 + 250 + 70 + 35 
 = 30000 + 3000 + 1500 + 500 + 250 + 70 + 35 
 = 33000 + 1500 + 500 + 250 + 70 + 35 
 = 34500 + 500 + 250 + 70 + 35 
 = 35000 + 250 + 70 + 35 
 = 35250 + 70 + 35 
 = 35320 + 35 
 = 35355 

So  2357 * 15 = 35355.

4 digits fails

That’s nice. Unfortunately, 4-digit multiplication goes off the rails. It generalizes well to breaking the numbers down and calculating the intermediate terms, but it can’t handle adding together 16 numbers. Maybe another style of teaching it multiplication would work, like breaking the problem down into each place of one of the numbers and then adding at the end, the way people do. This seems to be something that could be taught in the reinforcement learning stage of ChatGPT training.

Maybe next time. People have shown various things in this problem: (a) it helps to break the numbers apart with spaces (it’s bad at tokenizing numbers by default), (b) writing numbers with ones place first helps, and (c) training a transformer directly to do arithmetic with numerical coding works better still. Feel free to add more in the comments.

But how about addition

OK, this is backward, but I didn’t want to make this too tedious. I started with addition and it’s good at that, too. Here’s the prompt if you’re curious.

Bob

You are an expert at doing arithmetic by hand.  I am going to
remind you of the protocol to follow so that you can show your
work and teach others.  Here are the rules for adding single digits:

0 + 0 = 0
0 + 1 = 1
...
[snip]
...
9 + 8 = 17
9 + 9 = 18

To add two-digit numbers, we add all of their parts together, e.g.,

62 + 3
  = 60 + 2 + 3
  = 60 + 5
  = 65

First, we reduce the two-digit arithmetic using the rule 2 + 3 = 5.
Then we reduce 60 + 5 = 65 because when the two-digit number ends in
0, we just concatenate the one-digit number.  Here's another example.

86 + 7
  = 80 + 6 + 7 = 80 + 13
  = 80 + 10 + 3
  = 90 + 3
  = 93

To add two two digit numbers that end in zero, just add their
first digits and add a 0 to the result, e.g.,

20 + 30 = 50   (because 2 + 3 = 5, followed by 0)
90 + 40 = 130  (because 9 + 4 = 13, followed by 0)

I would like you to do some two digit addition for me and show your
work.

To add two digit numbers, we just continue to break them apart.

12 + 43
  = 10 + 2 + 40 + 3
  = 10 + 40 + 2 + 3
  = 50 + 5
  = 55

Here, we had to break both numbers apart, 12 = 10 + 2, and 43 = 40 +
3.  We then rearranged the order of terms so that we could reduce 10 +
40 = 50 and 2 + 3 = 5.  Then we just added the result.  Another
example

59 + 82
  = 50 + 9 + 80 + 2
  = 50 + 80 + 9 + 2
  = 130 + 9 + 2
  = 130 + 11
  = 100 + 30 + 10 + 1
  = 100 + 40 + 1
  = 100 + 41
  = 141

We always reduce the smallest terms first and then break apart again
if we need to.  In the final step, we use the rule that if the last
two digits of a three-digit number are zero, we can add to a two-digit
number by concatenation.

Now let's see if you can do some.  I want you to answer in the form
above with a sequence of equalities until you get to a single number
for the answer.

12 + 3
  = 

It generalizes nicely from two to three digit examples.

Bob


And now three digit numbers,

123 + 45
  =

ChatGPT

Certainly, let's add 123 and 45 by breaking them down into their hundreds, tens, and ones components and proceed step-by-step:

123 + 45
= 100 + 20 + 3 + 40 + 5
= 100 + 20 + 40 + 3 + 5
= 100 + 60 + 8
= 100 + 60 + 8
= 160 + 8
= 168

I started by breaking apart both 123 and 45 into their respective components. For 123, the components are 100, 20, and 3; for 45, the components are 40 and 5. Next, I rearranged the terms to make it easier to add the tens together (20 + 40 = 60) and the ones together (3 + 5 = 8). After that, I combined the hundreds, tens, and ones sums together: first combining 100 and 60 to get 160, and then adding the remaining 8 to arrive at 168.

Report on the large language model meeting at Berkeley

I’m not the only one who thinks GPT-4 is awesome. I just got back from an intense week of meetings at the Large language models and transformers workshop at Berkeley’s Simons Institute for the Theory of Computing. Thanks to Umesh Vazirani for organizing and running it so calmly.

Here are the videos of the talks.

Human feedback models

I gave a short talk Friday afternoon on models of data annotation.

  • Bob Carpenter. Softening human feedback improves classifier calibration

The step from the language model GPT-4 to the chatbot ChatGPT involves soliciting human feedback to rank potential outputs. This is typically done by converting the human feedback to a “gold” standard and retraining the baseline GPT-4 neural network.

Chris Manning (who introduced me to statistical natural language processing when we were both professors at CMU) provides a nice high-level overview of how OpenAI uses reinforcement learning with human feedback to try to align ChatGPT to the goals of being helpful, harmless, and truthful.

  • Chris Manning. Towards reliable use of large language models: better detection, consistency, and instruction-tuning.

Humans rank potential ChatGPT output and their feedback is used as input for a Bradley-Terry model of conversational reward. This is then used to retrain the network. Chris suggests a much simpler approach than the one they use.

While at the workshop, John Thickstun, a Stanford CS postdoc, pointed me to the following (and also filled me in on a bunch of technical details in between sessions).

  • Chen Cheng, Hilal Asi, and John Duchi. 2022. How many labelers do you have? A close look at gold-standard labels. arXiv.

It makes some simplifying assumptions to prove results including the bias of majority voting. I show similar things through simulation in a case study I posted on the Stan forums a while back,

  • Bob Carpenter. For probabilistic prediction, full Bayes is better than point estimators.

More on that soon when Seong Han and I finish our recent paper on annotation models.

LLMs and copyright

The highlight of the entire event for me was a captivating talk by a brilliant professor of intellectual property law at Berkeley,

  • Pamela Samuelson. Large language models meet copyright law.

If you’re at all interested in copyright and AI, you should watch this. She very clearly explains what copyright is, how the law sees works of artistic expression differently than functional works, and hence how it treats code and (other) artistic works differently. She also covers the basis for the cases currently being litigated. She was also masterly at handling the rather unruly crowd. I’ve never been to an event with so many interruptions by the audience members. It was almost like the audience was practicing to be DARPA program managers (a notoriously interruption-prone group).

Is ChatGPT just a stochastic parrot?

The other talk I’d encourage everyone to watch is

  • Steven Piantadosi. Meaning in the age of large language models.

He goes over a lot of the cognitive science and philosophy of language necessary to understand why ChatGPT is not just a “stochastic parrot.” He focuses on the work of, wait for it… Susan Gelman (Andrew’s sister). Susan works in my favorite area of cognitive science—concept development.

I can’t recommend this one highly enough, and I’ll be curious what people get out of it. This one’s closest to my own background (my Ph.D. is joint in cognitive science/computer science and I taught semantics, philosophy of language, and psycholinguistics as well as NLP at CMU), so I’m curious how understandable it’ll be to people who haven’t studied a lot of cognitive anthropology, philosophy of mind, and cognitive development.

Sanjeev Arora gave a talk about combining skills, describing a simple combinatorial experiment of combining five “skills” among thousands of skills (not defining what a skill was drove the audience into a frenzy of interruptions that’s quite something to watch). This is behavior that “emerged” in GPT-4 and isn’t so great in the less powerful models.

Speaking of parrots, the West Coast Stats Views blog (which Andrew often cites) is parroting mainstream chatbot FUD (fear, uncertainty, and doubt); see, e.g., Thursday tweets. I say “parrot” because the blog’s Thursday posts just point to things we used to call “tweets” before a certain someone decided to throw away a brand name that’s become a verb. The irony, of course, is that they accuse GPT of being a parrot!

Scaling laws

There were a couple of nice talks by Yasaman Bahri (DeepMind) on Understanding the origins and taxonomy of neural scaling laws and Sasha Rush (Cornell/Hugging Face) on Scaling data-constrained language models. These are important as they’re what allows you to decide how much data to use and how large a model to build for your compute budget. It’s what gave companies the incentive to invest hundreds of millions of dollars in infrastructure to fit these models. Sasha’s talk also discusses the roles researchers who don’t have access to big-tech compute power can take.

Watermarking LLMs

Scott Aaronson (UT Austin, on sabbatical at OpenAI) gave a really interesting talk,

  • Scott Aaronson. Watermarking of large language models.

The talk’s a masterclass in distilling a problem to math, explaining why it’s difficult, and considering several solutions and their implications. I felt smarter after watching this one.

You might also want to check out the competition from John Thickstun, Robust distortion-free watermarks for language models, which takes an encryption key-based approach.

In-context learning

“In-context learning” is what people call an LLM’s ability to be given zero or more examples and then to complete the pattern. For example, if I say “translate to English. oui: ”, we get what’s called “zero-shot learning”. If I give it an example, then it’s called “one-shot learning”, for example, “translate to English. notre: our, oui: ”, and so on. ChatGPT can manage all sorts of nuanced language tasks given only a few examples. It’s so good that it’s competitive with most custom solutions to these problems.

Everyone kept pointing out that in-context learning does not learn in the sense of updating model weights. Of course it doesn’t. That’s because it’s conditioning, not learning. The whole process is Markovian, returning a draw from Pr[continuation | context]. The issue is whether you can do a good job of this prediction without being AI-complete (i.e., a general artificial intelligence, whatever that means).

A whole lot of attention was given to ChatGPT’s poor performance on arithmetic problems coded as characters like “123 * 987”, with a couple of different talks taking different approaches. One trained a transformer with the actual digits and showed it could be made to do this, pointing to the problem being the encoding of math in language. Perhaps the most insightful observation is that if you give GPT access to an API (with in-context learning, no less), it can call on that API to do arithmetic and the problem goes away. The final talk on this was during the lightning sessions, where Nayoung Lee (a Ph.D. student from Wisconsin-Madison) showed that if you reverse the digits in the output (so that they are least significant first, the way we usually do arithmetic), transformers can be trained to do arithmetic very well; here’s a link to her arXiv paper, Teaching arithmetic to small transformers.

Sparks of artificial general intelligence

Yin Tat Lee kicked off the program talking about the Microsoft paper, Sparks of artificial general intelligence. If you haven’t read the paper, it’s a fascinating list of things that ChatGPT can do with relatively simple prompting.

One of the interesting aspects of Yin Tat’s talk is his description of how they treated ChatGPT (4) as an evolving black box. To me, this and a bunch of the other talks probing GPT’s abilities point out that we need much better evaluation methods.

Federated learning and task specialization

For those interested in hierarchical modeling and the idea of a foundation model that can be adapted to different tasks, Colin Raffel (UNC/Hugging Face) gave an interesting talk on federated learning and deployment.

  • Colin Raffel. Build an ecosystem, not a monolith.

This was not unrelated to Sasha’s talk (perhaps not surprising as they’re both affiliated with Hugging Face). He also talks about the ecosystem of image models sprouting up around Stable Diffusion and the ability to fine-tune them using low-rank methods.

OpenAI is closed

Ilya Sutskever, chief scientist of OpenAI, gave a talk that I can best describe as adversarial. It was the only talk that filled the room to standing room only. He said he couldn’t talk about anything computational or anything related to LLMs, so he spent an hour talking about the duality between probability and compression and Kolmogorov complexity.

Slides on large language models for statisticians

I was invited by David Banks to give an introductory talk at the regional American Statistical Association meeting on large language models. Here are the slides:

Most usefully, it has complete pseudocode up to but not including multi-head attention, plus an annotated bibliography of the main papers if you want to catch up. After the talk, I added a couple of slides on scaling laws and the annotated bibliography, which I didn’t have time to get to before the talk. I also added a slide describing multi-head attention, but without pseudocode.

P.S. The meeting was yesterday at Columbia and I hadn’t been to the stats department since the pandemic started, so it felt very strange.

P.P.S. GPT-4 helped me generate the LaTeX TikZ code to the point where I did zero searching through documentation or the web. It also generates all of my pandas and plotnine code (Python clones of R’s data frames and ggplot2) and a ton of my NumPy, SciPy, and general Python code. It can explain the techniques it uses, so I’m learning a lot, too. I almost never use Stack Overflow any more!

Scientific software research faculty award

Simons Foundation (the parent institution of Flatiron Institute, where I work) has just announced grants to support professors working on scientific software, with a plan to support 6 new fellows per year. From the call for proposals:

A Scientific Software Fellowship provides five years of 50 percent salary support of the awardee’s academic-year salary and fringe benefits, whether normally paid over 9 or 12 months, along with a yearly $50,000 research allowance for the awardee…

Letters of intent are due 8 December 2023.

It’s clear these are not career transition awards and will instead go to people already involved in scientific software. While this call is aimed at physics, astrophysics, and mathematics professors, stats might qualify (you should check if you’re interested). I’m not involved in reviewing the grants—that’s the folks across the street.

R and OOP anti-patterns

Thomas Lumley just dropped a blog post, Blank cheque inheritance and statistical objects, which begins as follows.

One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way.

In base R, glm for generalised linear models is a generalisation of lm for linear models, but the glm class inherits from the lm class, …

This isn’t a problem with inheritance; it’s a problem with how R uses it.

Fundamentals of OOP and inheritance

The fundamental rule for inheritance in object-oriented programming (OOP) is that a class X should inherit from class Y only if every X is a Y. That means you should be able to use an X wherever the code calls for a Y. For instance, a Poodle class might inherit from the Dog class, because every poodle is a dog. Any function you can apply to dogs, you can apply to poodles. Every function that is defined to return a poodle will also return a dog. Behavior as arguments and returns is governed by the concepts of covariance and contravariance in programming language theory. Inheritance must respect these relations for a coherent OOP design.

A classic blunder is to define a class for real numbers and one for complex numbers and have the complex numbers inherit from the real numbers. Every real number is a complex number, but not every complex number is a real number. So doing this will break standard OOP implementations. The reason beginners in OOP make this mistake is that it’s natural to think of the implementation of a complex number as taking a real number and adding an imaginary component. If you want to start with real numbers, a better way to define a complex number would be using composition to include two real components. That is, it contains two real numbers rather than inheriting from real numbers to get its real component. This is exactly how std::complex is defined in C++, with a constructor that takes the real and imaginary components as two objects of type T, where T might be double for double-precision complex numbers or it might be an autodiff type like stan::math::var.
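In Python, the composition approach looks roughly like this (a toy sketch, not a drop-in replacement for std::complex):

class Complex:
    """A complex number has two real components (composition) rather than
    being a real number (inheritance)."""

    def __init__(self, re: float, im: float = 0.0):
        self.re = re
        self.im = im

    def __add__(self, other: "Complex") -> "Complex":
        return Complex(self.re + other.re, self.im + other.im)

    def __mul__(self, other: "Complex") -> "Complex":
        return Complex(self.re * other.re - self.im * other.im,
                       self.re * other.im + self.im * other.re)

    def __repr__(self) -> str:
        return f"Complex({self.re}, {self.im})"

print(Complex(1.0, 2.0) * Complex(3.0, -1.0))   # Complex(5.0, 5.0)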

The God object anti-pattern

I’m also not fond of how lm returns a god object. God objects are a widely cited anti-pattern in OOP, largely because they’re so frequently seen in the wild. The inclination that leads to this is to have something like a “blackboard” into which all kinds of state can be written so the user can get it all in one place. A common example is including all the input in a function’s output. There’s no need to do that, because the user already had all of that information; they couldn’t have called the function otherwise. God objects are usually a terrible idea as it’s nearly impossible to ensure consistency of such an object without defeating its purpose, because its purpose is to behave like a global variable repository. R doesn’t even try—you can take an lm fit object and change various aspects of it and leave it in an inconsistent state without warning, e.g.,

> fit <- lm(dist ~ speed, data = cars)

> fit$coefficients
(Intercept)       speed 
 -17.579095    3.932409 

> fit$coefficients = c(1, 2, 3, 4)

> fit$coefficients
[1] 1 2 3 4

The Stan interfaces in Python and R also return god objects. I lost the design argument to the other developers, who argued, “That’s the way it’s done in R and Python.”

R’s argument chaining vs. OOP method chaining

Speaking of OOP, chaining with pipes in R follows the object-oriented pattern of method chaining, but instead of returning an object whose class defines the next method in the chain, it just passes along the return value to use as the first argument in the next chained function. It’s no longer object-oriented. This doesn’t break any OO patterns, per se. It might be awkward if you need to pack enough into a return to go on to the next function. In OOP, developers often break long method chains into groups of coherent calls with named returns when the returns are not all instances of the same class. The reason to break up long chains is, ironically given how they’re motivated in R, to help with readability and self-documentation. Code readability is the single best thing you can do to make code maintainable, because code will be read much more often than it gets written. You can bridge the gap between what R does with chaining and the standard way to do method chaining in OOP by looking at how Python classes are defined with an explicit self argument (like the this pointer to the class instance in C++, but C++ doesn’t require it as an explicit argument on methods).
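For contrast, here is a minimal Python sketch of the OOP version: each method returns an object (here just self) whose class defines the next method in the chain, whereas an R pipe simply forwards the previous return value as the next function’s first argument (my own toy example).

class Pipeline:
    def __init__(self, data):
        self.data = list(data)

    def keep(self, predicate):
        self.data = [x for x in self.data if predicate(x)]
        return self    # returning self is what enables the chaining

    def transform(self, f):
        self.data = [f(x) for x in self.data]
        return self

result = Pipeline(range(10)).keep(lambda x: x % 2 == 0).transform(lambda x: x * x).data
print(result)    # [0, 4, 16, 36, 64]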

P.S. I tried commenting on Lumley’s blog but was defeated by Disqus. I thought this might be of general interest, so I’m including it here.

Rosenthal’s textbook: A First Look at Rigorous Probability Theory

I’ve never taken a stats class. I was a math major, but dropped stats after I got appendicitis because I didn’t want to drop abstract algebra or computability theory. So here I am 40 years later trying to write up some more formal notes on probability theory and Markov chain Monte Carlo methods (MCMC) and finding myself in need of a gentle intro to probability theory that defines things rigorously. Jeffrey S. Rosenthal’s 2006 book A First Look at Rigorous Probability Theory is just what I needed. Despite not being very good at continuous math as an undergrad, I would have loved this book, as it’s largely algebraic, topological, and set-theoretic in approach rather than relying on in-depth knowledge of real analysis or matrix algebra. Even the examples are chosen to be simple rather than being designed for Ph.D.s in math, and they include a lot of basic analysis results like constructing uncountable, non-measurable sets.

This is not the final book you’ll need on your way to becoming a measure theorist. For example, it never discusses Radon-Nikodym derivatives to unify the theory of pdfs. It doesn’t really get into stochastic processes beyond a high-level discussion of Brownian motion. It does cover the basic theory of general state-space, time-homogeneous Markov chains in a few pages (which is why I was reading it), but that’s just scratching the surface of Rosenthal and Roberts’ general state-space MCMC paper, which is dozens of pages long in much finer print.

One of the best features of the book is that it’s both short and packed with meaningful examples. Rosenthal was quite selective in eliminating unnecessary fluff and sticking to the through line of introducing the main ideas. Most of the texts in this area just can’t resist all kinds of side excursions and examples that bring in all sorts of Ph.D.-level math. That can be good if you need breadth and depth, but Rosenthal’s approach is superior for a first, broad introduction to the field.

In summary: 5/5 stars.