You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even using soft clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures? Sadly, no. Matt and I might have tried, but we would have failed. We wouldn’t want to go back to Andrew and say,

  • “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  • “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  • “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. A problem is NP-hard if it’s at least as hard as every problem in NP (formalized through reductions, a powerful CS proof technique that’s also at the heart of Gödel’s incompleteness theorem). An NP-hard problem often has a non-deterministic algorithm that makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. A problem is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.
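
The guess-and-verify structure is easy to see in code. Here's a minimal sketch (my own illustration, not from Garey and Johnson) for subset sum, another classic NP-complete problem: the verifier runs in polynomial time, but a deterministic machine has to grind through exponentially many guesses that a non-deterministic machine could explore in parallel.

from itertools import product

def verifies(guess, items, target):
    # polynomial-time check that a guessed bit vector picks out the target sum
    return sum(x for x, keep in zip(items, guess) if keep) == target

def subset_sum(items, target):
    # deterministic solver: exponentially many guesses, each verified quickly
    for guess in product([0, 1], repeat=len(items)):  # 2^n candidate certificates
        if verifies(guess, items, target):
            return guess
    return None

print(subset_sum([3, 34, 4, 12, 5, 2], 9))  # finds the subset {4, 5}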

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that the NP-complete problems lie outside of P (i.e., that P ≠ NP).

In other words, there’s a Turing Award waiting for you if you can actually turn the third response into the second.

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.
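
One practical face of this hardness is the multimodality of the mixture likelihood. Here's a minimal sketch (my choice of scikit-learn's GaussianMixture for illustration, not anything from the paper) showing EM for a spherical mixture landing in different local optima of the average log likelihood depending on its random initialization, which is the same non-reproducibility that bothered Andrew in the setup above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# simulate 500 points from a 3-component spherical mixture in 2 dimensions
means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
z = rng.integers(0, 3, size=500)
y = means[z] + rng.normal(size=(500, 2))

# one random EM start per seed; the converged log likelihoods differ
for seed in range(5):
    gm = GaussianMixture(n_components=3, covariance_type='spherical',
                         init_params='random', n_init=1, random_state=seed)
    gm.fit(y)
    print(seed, gm.score(y))  # average log likelihood per observation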

And yes, the paper’s first author is the same Tosh who was the first author of the “piranha” paper.

Maximum likelihood estimation for Ising models that are not restricted to be planar is also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve gone is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

ISBA 2024 Satellite Meeting: Lugano, 25–28 June

Antonietta Mira is organizing a satellite workshop before ISBA. It’s free, there is still time to submit a poster, and it’s a great excuse to visit Lugano. Here are the details:

I really like small meetings like this. Mitzi and I are going to be there and then continue on to ISBA.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which involves the varying effect of income on Republican preference by state. Each state had its own slope and intercept, tied together with a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ prior is what they would have used if Ben Goodrich had created it by that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, simulations, and blog post, and I’m the one who designed the language! This would’ve been much faster if I wasn’t trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code, then the Stan code only has a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include an intercept.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihood:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows_dot_product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma), that’s the time waster in non-vectorized code.)
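
Here's what that last step computes, sketched in NumPy rather than Stan (so this is only an analogue of rows_dot_product, not the Stan function itself): the loop over observations collapses into an elementwise product and a row sum.

import numpy as np

# shapes follow the Stan program: x is (N, K), beta is (K, J), jj holds the
# group index per observation (0-based here, 1-based in Stan)
N, K, J = 1000, 5, 10
rng = np.random.default_rng(0)
x = rng.normal(size=(N, K))
beta = rng.normal(size=(K, J))
jj = rng.integers(0, J, size=N)

# loop version: mu[n] = x[n] . beta[:, jj[n]]
mu_loop = np.array([x[n] @ beta[:, jj[n]] for n in range(N)])

# vectorized analogue of rows_dot_product(x, beta[ , jj]')
mu_vec = np.sum(x * beta[:, jj].T, axis=1)

print(np.allclose(mu_loop, mu_vec))  # True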

One could make the case for suffixing all the parameters of the prior for beta with _beta. You’d probably need to do that in a context with multiple groupings, but the names get hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at the ends of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The exception is the one true inline code comment that I’d leave, which spells out the implication of the non-centered parameterization for the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend building and debugging in stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate-level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (d) cognitive science including psycholinguistics and cognitive anthropology, (e) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (f) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation n in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates, J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with simulations and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set; R-hats are all less than 1.01, ESS is in the thousands from a sample of size 4000, and all but a couple of parameters are recovered within their 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this example triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first, then generalize later if you need to. I coded the general case in my version of the model above.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

When all else fails, add a code comment

Another way of saying this is that you should treat inline code comments as a last resort when there is no other way to make your intentions clear.

I used to teach a session of Andrew’s statistical communication class once a year and I’d focus on communicating a computational API. Most of the students hated it because they signed up for the class to hear Andrew talk about stats, not me talk about API design. At least one student just up and walked out every year! So if you’re that student, now’s your chance to bail.

Comments considered harmful

Most academics, before they will share code with me, tell me they have to “clean it up.” I invariably tell them not to bother, and at best, they will dilly dally and shilly shally and apologize for lack of comments. What they don’t realize is that they were on the right track in the first place. The best number of inline code comments is zero. Nada. Zilch. Nil. Naught.

Why are comments so harmful? They lie! Even with the best of intent, they might not match the actual implementation. They often go stale over time. You can write whatever you want in a comment and there’s no consistency checking with the code.

You know what doesn’t lie? Code. Code doesn’t lie. So what do professional programmers do? They don’t trust comments and read the code instead. At this point, comments just get in the way.

What’s a bajillion times better than comments?

Readable code. Why? It’s self documenting. To be self documenting, code needs to be relatively simple and modular. The biggest mistake beginners make in writing code is lack of modularity. Without modularity, it’s impossible to build code bottom up, testing as you go.

It’s really hard to debug a huge program. It’s really easy to debug modules built up piece by piece on top of already-tested modules. So design top down, but build code bottom up. This is why we again and again stress in our writing on Bayesian workflow and in our replies to user questions on forums, that it helps immensely to scaffold up a complicated model one piece at a time. This lets you know when you add something and it causes a failure.

Knowing where to draw lines between modules is, unfortunately, a matter of experience. The best way to get that experience? Read code. In the product coding world, code is read much more often than it’s written. That means much more effort typically goes into production code to make it readable. This is very unlike research code which might be written once and never read again.

There is a tradeoff here. Code is more readable with short variable names and short function names. It’s easier to apprehend the structure of the expression a * b + c**2 than meeting_time * number_of_meetings + participants**2. We need to strike a balance with not too long, but still informative variable names.

And why are beginners so afraid of wasting horizontal space while being spendthrifts on the much more valuable vertical space? I have no explanation. But I see a lot of code from math-oriented researchers that looks like this, ((a*b)/c)+3*9**2+cos(x-y). Please use spaces around operators and no more parens than are necessary to disambiguate given attachment binding.

When should I comment?

Sometimes you’re left with no choice and have to drop in a comment as a last resort. This should be done if you’re doing something non-idiomatic with the language or coding an unusual algorithm or something very involved. In this case, a little note inline about intent and/or algebra can be helpful. That’s why commenting is sometimes called a method of last resort.

But whatever you do, comment for people who know the language better than you. Don’t write a comment that explains what a NumPy function does—that’s what the NumPy doc is for. Nobody wants to see this:

int num_observations = 513;  // declare num_observations as an integer and set equal to 513

But people who feel compelled to comment will write just this kind of thing, thinking it makes their code more professional. If you think this is a caricature, you don’t read enough code.

The other thing you don’t want to do is this:

#####################################################
################## INFERENCE CODE ###################
#####################################################
...
...
...

This is what functions are for. Write a function called inference() and call it. It will also help prevent accidental reuse of global variables, which is always a problem in scripting languages like R and Python. Don’t try to fix hundreds or thousands of lines of unstructured code with structured comments.
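
Concretely, I mean something like the following sketch (the function names and the simulated data are placeholders, not code from any particular script), where the structure the banner comments were trying to impose is carried by function boundaries instead.

import numpy as np

def simulate_data(n=100, seed=0):
    # stands in for whatever data loading or simulation the script actually does
    return np.random.default_rng(seed).normal(size=(n, 3))

def fit_model(data):
    # stands in for the block the INFERENCE CODE banner used to delimit
    return {'mean': data.mean(axis=0), 'sd': data.std(axis=0)}

def inference():
    data = simulate_data()
    return fit_model(data)

if __name__ == '__main__':
    print(inference())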

Another thing to keep in mind is that vertical space is very precious in coding, because we want to be able to see as much of the code as we can at a time without scrolling. Don’t waste vertical space with useless or even harmful comments.

Do not, and I repeat, do not use /* ... */ style comments inline with code. It’s too easy to get confused when it’s a lot of lines and it’s doubly confusing when nested. Instead, use line comments (// in C++ and Stan, # in Python and R). Use the comment-region command in emacs or whatever does the same in your IDE. With line comments, the commented out code will be very visible, as in the following example.

for (int n = 0; n < N; ++n) {
  // int x = 5
  // int y = x * x * 3;
  // int z = normal_rng(y, 1);
  z = n * 3;
}

Compare that to what I often see, which is some version of the following.

for (int n = 0; n < N; ++n) {
  /* int x = 5
  int y = x * x * 3;
  int z = normal_rng(y, 1); */
  z = n * 3;
}

In the first case, it's easy to just scan down the margin and see what's commented out.

After commenting out and fixing everything, please be a good and respectful citizen and just delete all the commented-out code before merging or releasing. Dead code makes the live code hard to find and one always wonders why it's still there—was it a mistake or some future plan or what? When I first showed up at Bell Labs in the mid 1990s, I was handed a 100+ page Tcl/Tk script for running a speech recognizer and told only a few lines were active, but I'd have to figure out which ones. Don't do that!

The golden triangle

What I stressed in Andrew's class is the tight interconnection between three aspects of production code:


$latex \textrm{API Documentation} \leftrightarrow \textrm{Unit tests} \leftrightarrow \textrm{Code}$

The API documentation should be functionally oriented and say what the code does. It might include a note as to how it does it if that is relevant to its use. An example might be different algorithms to compute the same thing that are widely known by name and useful in different situations. The API doc should ideally be specific enough to be the basis of both unit testing and coding. So I'm not saying don't document. I'm saying don't document how inline code works, document your API's intent.

The reason I call this the "golden" triangle is the virtuous cycle it imposes. If the API doc is hard to write, you know there's a problem with the way the function has been specified or modularized. With R and Python programmers, that's often because the code is trying to do too many things in a single function, and the input and output types become a mess of dependencies. This leads to what programmers identify as a "bad smell" in the code. If the code or the unit tests are hard to write, you know there's a problem with the API specification.

Clients (human and computational) are going to see and "feel" the API. That's where the "touch" is that designers like to talk about in physical object design. Things need to feel natural for the application, or in the words of UI/UX designers, the API needs to offer affordances (in the past, we might have said it should be intuitive). Design the API first from the client perspective. Sometimes you have to suffer on the testing side to maintain a clean and documentable API, but that clean API is your objective.

What about research code?

Research code is different. It doesn't have to be robust. It doesn't have to be written to be read by multiple people in the future. You're usually writing end-to-end tests rather than unit tests, though that can be dangerous. It still helps to develop bottom-up with testing.

What research code should be is reproducible. There should be a single script to run that generates all the output for a paper. That way, even if the code's ugly, at least the output's reproducible and someone with enough interest can work through it.

And of course, research code needs to be tested that it's doing what it's supposed to be doing. And it needs to be audited to make sure it's not "cheating" (like cross-validating a time-series, etc.).

Notebooks, Quarto, and other things that get in the way of coding and documenting

With all due respect to Donald Knuth (never a good start), literate programming is a terrible way to develop code. (On a related topic, I would totally recommend at least the beginning part of Knuth's notes on how to write math.)

I don't love them, but I use Quarto and Jupyter (née IPython) notebooks for writing reproducible tutorial material. But only after I've sorted out the code. These tools mix text and code and make too many compromises along the way to be good at either task. Arguably the worst sin is that they wind up obfuscating the code with a bunch of text. Jupyter also makes it possible to get into inconsistent states because it doesn't automatically re-run everything. Quarto is just a terrible typesetting platform, inheriting all the flaws of pandoc and citeproc, with the added joy of HTML/LaTeX interoperability and R/Python interoperability. We use it for Stan docs so that we can easily generate HTML and LaTeX, but it always feels like there should be a better way to do this, as it's a lot of trial and error due to the lack of a spec for markdown.

Evaluating samplers with reference draws

I started thinking a bit more about what we’re doing when we use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb gives you Stan programs and 10,000 reference draws from their posterior. The reference draws must be generated by another sampler. In some cases, like Neal’s funnel or high-dimensional normals, etc., we can just run forward simulation and get independent draws. In cases of models with data, we don’t actually know the posterior analytically—that’s why we’re running MCMC in the first place! I think what posteriordb did was run NUTS for enough iterations it could be thinned down to 10,000 roughly independent draws.

Standard error in reference

The standard error for $latex N$ independent draws estimating a particular quantity like the posterior mean of a parameter is

$latex \textrm{se} = \frac{\displaystyle \textrm{sd}}{\displaystyle \sqrt{N}}$

where sd is the standard deviation of the quantity in the posterior. Standard Monte Carlo methods scale horribly due to that square root—Markov chain Monte Carlo can be even worse because rather than the number of draws, we have the effective sample size, which can be much lower than the number of iterations for more challenging posteriors due to the correlation in the draws.

Because $latex \sqrt{10,000} = 100$, with 10,000 draws our standard error is only 1/100th of the standard deviation. In general, we need four times as many draws to cut the error in half, or 100 times as many draws to add a decimal place of accuracy by cutting error by a factor of ten. That means we can’t reasonably expect something like posteriordb to store enough draws to make fine-grained convergence evaluation possible. A slightly more economical alternative would be to store long-term averages. There we could run a million iterations and save parameter and squared-parameter estimates. We could even run a hundred million by farming out 100 jobs on a cluster. But that’s still only three or four decimal places of accuracy. After that, maybe we could wring out another decimal place or two with control variates.
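
Here's a quick simulation sketch of that square-root scaling (my own toy numbers): quadrupling the number of independent draws roughly halves the standard error of the estimated mean.

import numpy as np

rng = np.random.default_rng(1)
sd = 2.0     # posterior sd of the quantity being estimated
reps = 500   # replications used to measure the error empirically

for N in [10_000, 40_000, 160_000]:
    estimates = [rng.normal(0.0, sd, size=N).mean() for _ in range(reps)]
    print(N, np.std(estimates), sd / np.sqrt(N))  # empirical vs. theoretical se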

That’s a lot of residual error to have to deal with when evaluating an algorithm that asymptotically approaches the correct answer. The only thing that saves us is that the algorithms we’re evaluating are also samplers and are subject to the same inverse square root rate of reduction in standard error.

Report error in evaluations

What we really want to report is the standard error of the residual between our new algorithm’s estimate and the truth. But we don’t usually have the truth. In practice, we usually do one of two things, depending on how compute-intensive it is to fit the model and on our compute resources and patience.

Easy model: Run the new algorithm 100 or even 1000 times and report something like a histogram of errors. With this many runs, we can directly estimate the standard error (assuming our draws are unbiased), and we can estimate it even more tightly and without assumptions if we know the true values.

Hard model: Run the new algorithm once or twice, estimate its standard error, and report that instead of an empirical histogram of errors. This suffers from estimation error in the effective sample size, which can be large.

The problem with both of these approaches is that they only report the variability of the new algorithm, treating the reference draws as true. We should be going one step further and folding the error in posteriordb or another reference database into the uncertainty reporting of the error of the new algorithm. The additional error could be folded in with simulation (and maybe analytically—normals are nice that way).
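
For a single univariate mean, the simplest version of folding in the reference error is to add variances, treating the reported error as the error of a difference between two noisy estimates. Here's a sketch under the strong assumptions that both sets of draws are unbiased and that the effective sample sizes are known rather than estimated.

import numpy as np

def combined_se(draws_new, ess_new, draws_ref, ess_ref):
    # standard error of (mean_new - mean_ref), treating both estimates as noisy
    se_new = np.std(draws_new) / np.sqrt(ess_new)
    se_ref = np.std(draws_ref) / np.sqrt(ess_ref)
    return np.sqrt(se_new**2 + se_ref**2)

# toy example: both samplers targeting a standard normal mean
rng = np.random.default_rng(0)
draws_new = rng.normal(size=50_000)
draws_ref = rng.normal(size=10_000)   # a posteriordb-sized reference
print(combined_se(draws_new, 50_000, draws_ref, 10_000))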

That’s easy enough to do when reporting single univariate means. I do not know how to do this for evaluating error in something like KL-divergence, which itself has a Monte Carlo integration, or something like Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wasserstein distance algorithm. I have to admit I don’t even know how that algorithm works.

Evaluating delayed rejection

I ran into this problem because I want to evaluate our delayed rejection HMC and show that the error goes to zero as I take more draws. Specifically, I want to extend the illustration in the paper that shows that standard HMC and NUTS fail, but DR-HMC succeeds, at sampling from multiscale distributions like Neal’s funnel in its centered parameterization.

There are two fundamental problems with this.

Problem One: By the time we get past an effective sample size of 10,000 from the new algorithm, the error we’re estimating is dominated by the standard error of posteriordb. We don’t even know the real error of posteriordb, only an estimated standard error assuming draws are independent.

Problem Two: We have no guarantees that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal’s funnel, there is no step size for HMC that will properly explore the posterior—each step size picks out a range of log scale parameters it can explore (as we demonstrate in the delayed rejection paper). The standard error estimate from the reference draws assumes independent draws from the posterior, not biased draws. If the posteriordb draws are biased, the error reported for a perfect sampler will not go to zero as the number of posteriordb and new sampler draws goes to infinity.

Help?

Does anyone have any suggestions of what we can do in the face of potentially biased and stochastic samples for evaluation? I could wait until the Stan meeting on Thursday and ask Aki, who’s a font of inspiration on topics like this, but I thought I’d ask a wider group and see how that goes.

Fun with Dååta: Reference librarian edition

Rasmus Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).

Rasmus writes,

While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).”

The search is on.

… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.

The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, where the original thesis is held, gets involved. She replies to Rasmus’s query,

Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.

Rasmus concludes,

She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?

Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmus’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.

I think it’d be fun to organize a class around generating new, small scale and amusing data sets like this one. Maybe it sounds like more fun than it would actually be—data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…

P.S. In a related note, there’s something called a baath cake that’s popular in Goa and confused my web search.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a sample that includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

SIMD, memory locality, vectorization, and branch point prediction

The title of this post lists the most important considerations for performant code these days (late 2023).

SIMD

GPUs can do a lot of compute in parallel. High-end ($15K to $30K) GPUs like the ones used in big tech perform thousands of operations in parallel (50 teraflops for the H100). The catch is that they want all of those operations to be done in lockstep on different data. This is called single-instruction, multiple data (SIMD). Matrix operations, as used by today’s neural networks, are easy to code with SIMD.

GPUs cannot do 1000s of different things at once. This makes it challenging to write recursive algorithms like the no-U-turn sampler (NUTS) and is one of the reasons people like Matt Hoffman (developer of NUTS) have turned to generalized HMC.

GPUs can do different things in sequence if you keep memory on the GPU (in kernel). This is how deep nets can sequence feedforward, convolution, attention, activation, and GLM layers. Steve Bronder is working on keeping generated Stan GPU code in kernel.

Memory locality

Memory is slow. Really slow. The time it takes to fetch a fresh value from random access memory (RAM) is on the order of 100–200 times slower than the time it takes to execute a primitive arithmetic operation. On a CPU. The problem just gets worse with GPUs.

Modern CPUs are equipped with levels of cache. For instance, a consumer grade 8-core CPU like an Intel i9 might have a shared 32 MB L3 cache among all the cores, a 1MB L2 cache, and an 80KB L1 cache. When memory is read in from RAM, it gets read in blocks, with not only the value you requested, but other values near it. This gets pulled first into L3 cache, then into L2 cache, then into L1 cache, then it gets into registers on the CPU. This means if you have laid an array x out contiguously in memory and you have read x[n], then it is really cheap to read x[n + 1] because it’s either in your L1 cache already or being pipelined there. If your code accesses non-local pieces of memory, then you wind up getting cache misses. The higher up the cache miss (L1, L2, L3, or RAM), the more time the CPU has to wait to get the values it needs.

One way to see this in practice is to consider matrix layout in memory. If we use column major order, then each column is contiguous in memory and the columns are laid out one after the other. This makes it much more efficient to traverse the matrix by first looping over the columns, then looping over the rows. Getting the order wrong can be an order of magnitude penalty or more. Matrix libraries will do more efficient block-based transposes so this doesn’t bite users writing naive code.
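
Here's a rough sketch of the effect in NumPy (my example; the exact ratio depends on the machine): summing the contiguous rows of a row-major array streams through memory a cache line at a time, while summing the rows of its transpose view strides across memory and wastes most of every cache line it fetches. The same logic applies, with rows and columns swapped, to column-major libraries like Eigen, R, and Fortran BLAS.

import time
import numpy as np

n = 4000
a = np.random.default_rng(0).normal(size=(n, n))  # C (row-major) order

def time_it(f):
    t = time.perf_counter()
    f()
    return time.perf_counter() - t

# each row of a is contiguous, so these sums stream through memory
print("contiguous rows:", time_it(lambda: [row.sum() for row in a]))
# each "row" of a.T is a strided view, so these sums jump around in memory
print("strided rows:   ", time_it(lambda: [row.sum() for row in a.T]))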

The bottom line here is that even if you have 8 cores, you can’t run 8 parallel chains of MCMC as fast as you can run 1. On my Xeon desktop with 8 cores, I can run 4 chains in parallel, followed by another 4 in parallel, in the same amount of time as I can run 8 in parallel. As a bonus, my fan doesn’t whine as loudly. The slowdown with 8 parallel chains is not due to the CPUs being busy; it’s because the asynchronous execution causes a bottleneck in the cache. This can be overcome with hard work by restructuring parallel code to be more cache sensitive, but it’s a deep dive.

Performant code often recomputes the value of a function if its operands are in cache in order to reduce the memory pressure that would arise from storing the value. Or it explicitly reorders operations to be lazy in order to support recomputation. Stan does this to prioritize scalability over efficiency (i.e., it recomputes values, which means fewer memory fetches but more operations).

Vectorization

Modern CPUs have vector instructions, for example the AVX and SSE instructions on Intel chips. C++ compilers at high levels of optimization will exploit these if the right flags are enabled. This way, CPUs can do on the order of 8 simultaneous arithmetic operations. Writing loops in blocks of 8 so that they can exploit CPU vectorization is critical for performant code. The good news is that calling underlying matrix libraries like Eigen or BLAS will do that for you. The bad news is that if you write your own loops, they are going to be slow compared to loops optimized using vectorization. You have to arrange this yourself in C++ if you want performant code.

Another unexpected property of modern CPUs for numerical computing is that integer operations are pretty much free. The CPUs have separate integer and floating point units and with most numerical computing, there is far less pressure on integer arithmetic. So you can see things like adding integer arithmetic to a loop without slowing it down.

Branch-point prediction

When the CPU executes a conditional such as the compiled form of

if (A) then B else C;

it will predict whether A will evaluate to true or false. If it predicts true, then the operations in B will begin to execute “optimistically” at the same time as A. If A does evaluate to true, we have a head start. If A evaluates to false, then we have a branch-point misprediction. We have to backtrack, flush the results from optimistic evaluation of B, fetch the instructions for C, then continue. This is very very slow because of memory contention and because it breaks the data pipeline. And it’s double trouble for GPUs. Stan includes suggestions (pragmas) to the compiler as to which branch is more likely in our tight memory management code for automatic differentiation.

Conclusion

The takeaway message is that for efficient code, our main concerns are memory locality, branch-point prediction, and vectorization. With GPUs, we further have to worry about SIMD. Good luck!

Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data $latex y$, one model $latex p_1(\theta_1, y)$ with parameters $latex \theta_1$ and a second model $latex p_2(\theta_2, y)$ with parameters $latex \theta_2.$

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

$latex \textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),$

where we have

$latex p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1$

and

$latex p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.$

The distributions $latex p_1(y)$ and $latex p_2(y)$ are known as prior predictive distributions because they integrate the likelihood over the prior.

There are ad-hoc guidelines from Harold Jeffreys of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data $latex y$ into two pieces, $latex y = y’, y”$ and then training on $latex y’$ and testing on $latex y”$. The held out validation values are

$latex p_1(y” \mid y’) = \mathbb{E}[p_1(y” \mid \Theta_1) \mid y’] = \int p_1(y” \mid \theta_1) \cdot p_1(\theta_1 \mid y’) \, \textrm{d}\theta_1$

and

$latex p_2(y” \mid y’) = \mathbb{E}[p_2(y” \mid \Theta_2) \mid y’] = \int p_2(y” \mid \theta_2) \cdot p_2(\theta_2 \mid y’) \, \textrm{d}\theta_2.$

The distributions $latex p_1(y” \mid y’)$ and $latex p_2(y” \mid y’)$ are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.

(Leave one out) cross validation

Suppose our data is $latex y_1, \ldots, y_N$. Leave-one-out cross validation works by successively taking $latex y” = y_n$ and $latex y’ = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N$ and then averaging on the log scale.

$latex \frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)$

and

$latex \frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).$

Leave-one-out cross validation is interpretable as the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.
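
Here's a minimal sketch of that sensitivity for a conjugate normal-normal model with known unit noise (my toy example, where the marginal likelihood and the leave-one-out predictives are available in closed form). The log marginal likelihoods differ by several units, so the Bayes factor strongly favors the tighter prior, even though the posterior means and the LOO scores are essentially indistinguishable.

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
n, sigma, mu_true = 200, 1.0, 1.0
y = rng.normal(mu_true, sigma, size=n)

def log_marginal(y, tau):
    # prior predictive: y ~ multi_normal(0, sigma^2 I + tau^2 * ones)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    return multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

def posterior_mean(y, tau):
    prec = n / sigma**2 + 1 / tau**2
    return np.sum(y) / sigma**2 / prec

def elpd_loo(y, tau):
    # closed-form leave-one-out posterior predictive densities
    prec = (n - 1) / sigma**2 + 1 / tau**2
    m = (np.sum(y) - y) / sigma**2 / prec
    return np.sum(norm.logpdf(y, m, np.sqrt(sigma**2 + 1 / prec)))

for tau in [1.0, 100.0]:
    print(f"tau={tau:5.1f}  log p(y)={log_marginal(y, tau):8.2f}  "
          f"posterior mean={posterior_mean(y, tau):.3f}  elpd_loo={elpd_loo(y, tau):.2f}")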

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and section 7.4 (“Model comparison using Bayes factors,” page 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Simple pseudocode for transformer decoding a la GPT

A number of people have asked me for this, so I’m posting it here:

[Edit: revised original draft (twice) to fix log density context variable and width of multi-head attention values.]

This is a short note that provides complete and relatively simple pseudocode for the neural network architecture behind the current crop of large language models (LLMs), the generative pretrained transformers (GPT). These are based on the notion of (multi-head) attention, followed by feedforward neural networks, stacked in a deep architecture.

I simplified the pseudocode compared to things like Karpathy’s nanoGPT repository in Python (great, but it’s tensorized and batched PyTorch code for GPU efficiency) or Phuong and Hutter’s pseudocode, which is more general and covers encoding and multiple different architectures. I also start from scratch with the basic notions of tokenization and language modeling.

I include the pseudocode for evaluating the objective function for training and the pseudocode for generating responses. The initial presentation uses single-head attention to make the attention stage clearer, with a note afterward with pseudocode to generalize to multi-head attention.

I also include references to other basic presentations, including Daniel Lee’s version coded in Stan.

If this is confusing or you think I got a detail wrong, please let me know—I want to make this as clear and correct (w.r.t. GPT-2) as possible.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. I’ll be presenting joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I did my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step-size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to do it, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you make a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient-based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett, and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or the neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means that if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling the momentum each iteration, G-HMC draws a fresh momentum value and then takes a weighted average with the existing momentum, with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories. The trick is getting sequences of acceptances together. Usually this doesn’t work because we have to flip momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but with rejections we reverse momentum (this isn’t an issue with standard HMC because it fully regenerates the momentum each time). So to get directed movement, we’d otherwise need step sizes small enough to make rejections rare, which is too small to be efficient. What Radford figured out is that we can solve this problem by changing the way we generate the uniform(0, 1)-distributed variates for the Metropolis accept/reject step (we accept if the variate is below the ratio of the density at the proposal to the density at the previous point). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth generating a sequence of values near zero, the acceptances will cluster together.
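
The partial momentum refresh at the heart of G-HMC is a one-liner. Here's a sketch in my own notation (not Radford's code); the square-root weights are what keep the standard normal distribution of the momentum invariant while letting its direction persist across iterations.

import numpy as np

def refresh_momentum(p, alpha, rng):
    # partial refresh: if p ~ normal(0, I), the result is again normal(0, I)
    return alpha * p + np.sqrt(1 - alpha**2) * rng.normal(size=p.shape)

rng = np.random.default_rng(0)
p = rng.normal(size=5)
for _ in range(3):
    p = refresh_momentum(p, alpha=0.95, rng=rng)  # mostly the old momentum
    print(p)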

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me, Chirag, and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (the sawtooth pattern of acceptance probabilities) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (posterior covariance, or just the variance in the diagonal case) and also estimating the step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond where we finally sample. But we still have a lot of eval to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

ChatGPT (4) can do 3-digit multiplication

I’ve heard a lot about GPT’s inability to do long multiplication. Most people had either tried giving it a few examples (i.e., “few-shot learning”) or giving it a few examples and asking it to explain its work (i.e., “chain of thought”). I thought I’d just try to teach it the way you might teach a person. Or if you want to get fancy, I used few-shot learning of chain-of-thought prompts (a very common technique for getting it to do this kind of problem). The bottom line: I could get it to do 3-digit multiplications but not 4-digit. The title originally continued “but not 4-digit multiplication,” but I realized I haven’t shown that, just that my prompt doesn’t do it.

Bob

BC: Now we are going to do multiplication.  As a refresher, here is a list
of all one-digit multiplications.

0 * 0 = 0
0 * 1 = 0
...
[removed for blog]
...
9 * 8 = 72
9 * 9 = 81

To multiply a two-digit number by a one-digit number, we break the two
digit number apart into its tens place and its one place and add them,
and then distribute through to get simpler additions, and then
multiply,

31 * 2
  = (30 + 1) * 2 
  = (30 * 2) + (1 * 2)
  = 60 + 2

If a number ends in zeros, we multiply the digits before the zero then
add the zeros back in at the end.  So to do 30 * 2 we multiply 3 * 2
to get 6 and then add the zero back to get 60.  Another way to think
of it is that 30 * 2 = 10 * 3 * 2 = 10 * 6 = 60, where we just wind up
multiplying by the tens place.

If we have a bigger number, it'll have more digits in the final
addition, as in

85 * 9
  = (80 + 5) * 9
  = (80 * 9) + (5 * 9)
  = 720 + 45
  = 765

To multiply two digit numbers times each other, we break them both
down, multiply all the factors, then add the resulting terms
together. 

23 * 45
  = (20 + 3) * (40 + 5)
  = (20 * 40) + (3 * 40) + (20 * 5) + (3 * 5)
  = 800 + 120 + 100 + 15
  = 920 + 100 + 15
  = 1020 + 15
  = 1035

We handle products of three digit numbers the same way.

234 * 567
  = (200 + 30 + 4) * (500 + 60 + 7)
  = (200 * 500) + (200 * 60) + (200 * 7) + (30 * 500) + (30 * 60) + (30 * 7) + (4 * 500) + (4 * 60) + (4 * 7)
  = 100000 + 12000 + 1400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 112000 + 1400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 113400 + 15000 + 1800 + 210 + 2000 + 240 + 28
  = 128400 + 1800 + 210 + 2000 + 240 + 28
  = 130200 + 210 + 2000 + 240 + 28
  = 130410 + 2000 + 240 + 28
  = 132410 + 240 + 28
  = 132650 + 28
  = 132678

ChatGPT (4, Aug 3)

Certainly! To multiply 12 by 3, we can follow the method you outlined.

First, we break 12 down into its tens and ones places:

12 = 10 + 2

Next, we distribute 3 through the sum:

(10 + 2) * 3 = (10 * 3) + (2 * 3)
= 30 + 6
= 36

So 12 * 3 = 36.

Three digits works

Bob

That's really fantastic.  Now we want to see how far you can take this.  Can you do a three digit product?  How about trying

423 * 567
  = 

ChatGPT (4, Aug 3)

Certainly! To multiply 423 by 567, we'll follow the same process: breaking down each number into its individual place values (hundreds, tens, ones), and then using distributive multiplication.

First, the breakdown:
  423 = 400 + 20 + 3
  567 = 500 + 60 + 7

Now, the multiplication:

(400 + 20 + 3) * (500 + 60 + 7) 
 = (400 * 500) + (400 * 60) + (400 * 7) + (20 * 500) + (20 * 60) + (20 * 7) + (3 * 500) + (3 * 60) + (3 * 7) 
 = 200000 + 24000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 200000 + 24000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 224000 + 2800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 226800 + 10000 + 1200 + 140 + 1500 + 180 + 21 
 = 236800 + 1200 + 140 + 1500 + 180 + 21 
 = 238000 + 140 + 1500 + 180 + 21 
 = 238140 + 1500 + 180 + 21 
 = 239640 + 180 + 21 
 = 239820 + 21 
 = 239841 

So 423 * 567 = 239841.

Generalizing to 4 x 2 digits

It can generalize to different numbers of digits.

Bob

How about a 4 digit number times a 2 digit number?

2357 * 15
  = 

ChatGPT (4, Aug 3)

Certainly! To multiply 2357 by 15, you can use the same distributive multiplication process.

First, break down each number into its place values:
 2357 = 2000 + 300 + 50 + 7 
 15 = 10 + 5 

Now, perform the multiplication:
 (2000 + 300 + 50 + 7) * (10 + 5) 
 = (2000 * 10) + (2000 * 5) + (300 * 10) + (300 * 5) + (50 * 10) + (50 * 5) + (7 * 10) + (7 * 5) 
 = 20000 + 10000 + 3000 + 1500 + 500 + 250 + 70 + 35 
 = 30000 + 3000 + 1500 + 500 + 250 + 70 + 35 
 = 33000 + 1500 + 500 + 250 + 70 + 35 
 = 34500 + 500 + 250 + 70 + 35 
 = 35000 + 250 + 70 + 35 
 = 35250 + 70 + 35 
 = 35320 + 35 
 = 35355 

So  2357 * 15 = 35355.

4 digits fails

That’s nice. Unfortunately, 4-digit multiplication goes off the rails. It generalizes well to breaking the numbers down and calculating the intermediate terms, but it can’t handle adding together the 16 resulting numbers. Maybe another style of teaching it multiplication would work, like breaking the problem down into each place of one of the numbers and then adding at the end, the way people do it on paper. This seems like something that could be taught during the reinforcement learning stage of ChatGPT training.
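
For reference, here’s that by-place style as a short script (plain Python, not part of any prompt): one partial product per digit, shifted by its place value, then a single sum at the end.

def long_multiply(a, b):
    """Schoolbook multiplication: one partial product per digit of b,
    shifted by its place value, then a single sum at the end."""
    partials = [a * int(digit) * 10 ** place
                for place, digit in enumerate(reversed(str(b)))]
    print(" + ".join(str(p) for p in partials), "=", sum(partials))
    return sum(partials)

long_multiply(2357, 15)   # prints: 11785 + 23570 = 35355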

Maybe next time. People have shown various things about this problem: (a) it helps to break the numbers apart with spaces (it’s bad at tokenizing numbers by default), (b) writing numbers with the ones place first helps, and (c) training a transformer directly to do arithmetic with a numerical encoding works better still. Feel free to add more in the comments.

But how about addition

OK, this is backward, but I didn’t want to make this too tedious. I started with addition and it’s good at that, too. Here’s the prompt if you’re curious.

Bob

You are an expert at doing arithmetic by hand.  I am going to
remind you of the protocol to follow so that you can show your
work and teach others.  Here are the rules for adding single digits:

0 + 0 = 0
0 + 1 = 1
...
[snip]
...
9 + 8 = 17
9 + 9 = 18

To add two-digit numbers, we add all of their parts together, e.g.,

62 + 3
  = 60 + 2 + 3
  = 60 + 5
  = 65

First, we reduce the two-digit arithmetic using the rule 2 + 3 = 5.
Then we reduce 60 + 5 = 65 because when the two-digit number ends in
0, we just concatenate the one-digit number.  Here's another example.

86 + 7
  = 80 + 6 + 7 = 80 + 13
  = 80 + 10 + 3
  = 90 + 3
  = 93

To add two two digit numbers that end in zero, just add their
first digits and add a 0 to the result, e.g.,

20 + 30 = 50   (because 2 + 3 = 5, followed by 0)
90 + 40 = 130  (because 9 + 4 = 13, followed by 0)

I would like you to do some two digit addition for me and show your
work.

To add two digit numbers, we just continue to break them apart.

12 + 43
  = 10 + 2 + 40 + 3
  = 10 + 40 + 2 + 3
  = 50 + 5
  = 55

Here, we had to break both numbers apart, 12 = 10 + 2, and 43 = 40 +
3.  We then rearranged the order of terms so that we could reduce 10 +
40 = 50 and 2 + 3 = 5.  Then we just added the result.  Another
example

59 + 82
  = 50 + 9 + 80 + 2
  = 50 + 80 + 9 + 2
  = 130 + 9 + 2
  = 130 + 11
  = 100 + 30 + 10 + 1
  = 100 + 40 + 1
  = 100 + 41
  = 141

We always reduce the smallest terms first and then break apart again
if we need to.  In the final step, we use the rule that if the last
two digits of a three-digit number are zero, we can add to a two-digit
number by concatenation.

Now let's see if you can do some.  I want you to answer in the form
above with a sequence of equalities until you get to a single number
for the answer.

12 + 3
  = 

It generalizes nicely from two-digit to three-digit examples.

Bob


And now three digit numbers,

123 + 45
  =

ChatGPT

Certainly, let's add 123 and 45 by breaking them down into their hundreds, tens, and ones components and proceed step-by-step:

123 + 45
= 100 + 20 + 3 + 40 + 5
= 100 + 20 + 40 + 3 + 5
= 100 + 60 + 8
= 100 + 60 + 8
= 160 + 8
= 168

I started by breaking apart both 123 and 45 into their respective components. For 123, the components are 100, 20, and 3; for 45, the components are 40 and 5. Next, I rearranged the terms to make it easier to add the tens together (20 + 40 = 60) and the ones together (3 + 5 = 8). After that, I combined the hundreds, tens, and ones sums together: first combining 100 and 60 to get 160, and then adding the remaining 8 to arrive at 168.

Report on the large language model meeting at Berkeley

I’m not the only one who thinks GPT-4 is awesome. I just got back from an intense week of meetings at the Large language models and transformers workshop at Berkeley’s Simons Institute for the Theory of Computing. Thanks to Umesh Vazirani for organizing and running it so calmly.

Here are the videos of the talks.

Human feedback models

I gave a short talk Friday afternoon on models of data annotation.

  • Bob Carpenter. Softening human feedback improves classifier calibration

  • The step from the language model GPT-4 to the chatbot ChatGPT involves soliciting human feedback to rank potential outputs. This is typically done by converting the human feedback to a “gold” standard and retraining the baseline GPT-4 neural network.

    Chris Manning (who introduced me to statistical natural language processing when we were both professors at CMU) provides a nice high-level overview of how OpenAI uses reinforcement learning with human feedback to try to align ChatGPT to the goals of being helpful, harmless, and truthful.

    Chris Manning. Towards reliable use of large language models: better detection, consistency, and instruction-tuning.

    Humans rank potential ChatGPT output and their feedback is used as input for a Bradley-Terry model of conversational reward. This is then used to retrain the network. Chris suggests a much simpler approach than the one they use.
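
    For readers who haven’t seen it, the Bradley-Terry piece is tiny: each response gets a scalar reward, and the probability that a rater prefers response a to response b is the logistic function of the reward difference. Here’s a sketch of the likelihood (in RLHF the rewards would come from a learned reward model evaluated on the prompt and responses; the function below is just illustrative):

    import numpy as np

    def bradley_terry_log_lik(reward_a, reward_b, a_preferred):
        """Log likelihood of one pairwise judgment: a_preferred is 1 if the
        rater picked response a, else 0.  Pr[a preferred] = logistic(reward_a - reward_b)."""
        logit = reward_a - reward_b
        log_p_a = -np.log1p(np.exp(-logit))   # log sigmoid(logit)
        log_p_b = -np.log1p(np.exp(logit))    # log sigmoid(-logit)
        return a_preferred * log_p_a + (1 - a_preferred) * log_p_b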

    While at the workshop, John Thickstun, a Stanford CS postdoc, pointed me to the following (and also filled me in on a bunch of technical details in between sessions).

    Chen Cheng, Hilal Asi, and John Duchi. 2022. How many labelers do you have? A close look at gold-standard labels. arXiv.

    It makes some simplifying assumptions to prove results including the bias of majority voting. I show similar things through simulation in a case study I posted on the Stan forums a while back,

    Bob Carpenter. For probabilistic prediction, full Bayes is better than point estimators.

    More on that soon when Seong Han and I finish our recent paper on annotation models.

    LLMs and copyright

    The highlight of the entire event for me was a captivating talk by a brilliant professor of intellectual property law at Berkeley

    Pamela Samuelson. Large language models meet copyright law.

    If you’re at all interested in copyright and AI, you should watch this. She very clearly explains what copyright is, how the law treats works of artistic expression differently from functional works, and hence how it sees code and (other) artistic works differently. She also covers the basis for the cases currently being litigated. She was also masterful at handling the rather unruly crowd. I’ve never been to an event with so many interruptions by the audience members. It was almost like the audience was practicing to be DARPA program managers (a notoriously interruption-prone group).

    Is ChatGPT just a stochastic parrot?

    The other talk I’d encourage everyone to watch is

    Steven Piantadosi. Meaning in the age of large language models.

    He goes over a lot of the cognitive science and philosophy of language necessary to understand why ChatGPT is not just a “stochastic parrot.” He focuses on the work of, wait for it…Susan Gelman (Andrew’s sister). Susan works in my favorite area of cognitive science—concept development.

    I can’t recommend this one highly enough, and I’ll be curious what people get out of it. This one’s closest to my own background (my Ph.D. is joint in cognitive science/computer science and I taught semantics, philosophy of language, and psycholinguistics as well as NLP at CMU), so I’m curious how understandable it’ll be to people who haven’t studied a lot of cognitive anthropology, philosophy of mind, and cognitive development.

    Sanjeev Arora gave a talk about combining skills and how he did a simple combinatorial experiment of combining five “skills” among thousands of skills (not defining what a skill was drove the audience into a frenzy of interruptions that’s quite something to watch). This is behavior that “emerged” in GPT-4 that isn’t so great in the less powerful models.

    Speaking of parrots, the West Coast Stats Views blog (which Andrew often cites) is parroting mainstream chatbot FUD (fear, uncertainty, and doubt); see, e.g., Thursday tweets. I say “parrot” because the blog’s Thursday posts just point to things we used to call “tweets” before a certain someone decided to throw away a brand name that’s become a verb. The irony, of course, is that they accuse GPT of being a parrot!

    Scaling laws

    There were a couple of nice talks by Yasaman Bahri (DeepMind) on Understanding the origins and taxonomy of neural scaling laws and Sasha Rush (Cornell/Hugging Face) on Scaling data-constrained language models. These are important because they’re what allows you to decide how much data to use and how large a model to build for your compute budget; it’s what gave companies the incentive to invest hundreds of millions of dollars in infrastructure to fit these models. Sasha’s talk also discusses the roles researchers can take if they don’t have access to big-tech compute power.

    Watermarking LLMs

    Scott Aaronson (UT Austin, on sabbatical at OpenAI) gave a really interesting talk,

    Scott Aaronson. Watermarking of large language models

    The talk’s a masterclass in distilling a problem to math, explaining why it’s difficult, and considering several solutions and their implications. I felt smarter after watching this one.

    You might also want to check out the competition from John Thickstun, Robust distortion-free watermarks for language models, which takes an encryption key-based approach.

    In-context learning

    “In-context learning” is what people call an LLM’s ability to be given zero or more examples and then complete the pattern. For example, if I say “translate to English. oui: ”, we get what’s called “zero-shot learning”. If I give it an example first, it’s called “one-shot learning”, for example, “translate to English. notre: our, oui: ”, and so on. ChatGPT can manage all sorts of nuanced language tasks given only a few examples. It’s so good that it’s competitive with most custom solutions to these problems.

    Everyone kept pointing out that in-context learning does not learn in the sense of updating model weights. Of course it doesn’t. That’s because it’s conditioning, not learning. The whole process is Markovian, returning a draw from Pr[continuation | context]. The issue is whether you can do a good job of this prediction without being AI-complete (i.e., a general artificial intelligence, whatever that means).

    A whole lot of attention was given to ChatGPT’s poor performance on arithmetic problems coded as characters like “123 * 987”, with a couple of talks taking different approaches. One trained a transformer directly on the digits and showed it could be made to do this, pointing to the problem being the encoding of math in language. Perhaps the most insightful observation is that if you give GPT access to an API (with in-context learning, no less), it can call that API to do arithmetic and the problem goes away. The final talk on this was during the lightning sessions, where Nayoung Lee (a Ph.D. student at Wisconsin-Madison) showed that if you reverse the digits in the output (so that they’re least significant first, the way we usually do arithmetic), transformers can be trained to do arithmetic very well; here’s a link to her arXiv paper, Teaching arithmetic to small transformers.

    Sparks of artificial general intelligence

    Yin Tat Lee kicked off the program talking about the Microsoft paper, Sparks of artificial general intelligence. If you haven’t read it, it’s a fascinating list of things that ChatGPT can do with relatively simple prompting.

    One of the interesting aspects of Yin Tat’s talk was his description of how they treated ChatGPT (4) as an evolving black box. To me, this and the other talks probing GPT’s abilities point out that we need much better evaluation methods.

    Federated learning and task specialization

    For those interested in hierarchical modeling and the idea of a foundation model that can be adapted to different tasks, Colin Raffel (UNC/Hugging Face) gave an interesting talk on federated learning and deployment.

    Colin Raffel. Build an ecosystem, not a monolith

    This was not unrelated to Sasha’s talk (perhaps not surprising, as they’re both affiliated with Hugging Face). He also talks about the ecosystem of image models sprouting up around Stable Diffusion and the ability to fine-tune them using low-rank methods.

    OpenAI is closed

    Ilya Sutskever, chief scientist of OpenAI, gave a talk that I can best describe as adversarial. It was the only talk that filled the room to standing room only. He said he couldn’t talk about anything computational or anything related to LLMs, so he spent an hour talking about the duality between probability and compression and Kolmogorov complexity.

    Slides on large language models for statisticians

    I was invited by David Banks to give an introductory talk on large language models at the regional American Statistical Association meeting. Here are the slides:

    Most usefully, it has complete pseudocode up to but not including multi-head attention. It also has an annotated bibliography of the main papers if you want to catch up. After the talk, I added a couple of slides on scaling laws and the annotated bibliography, which I didn’t have time to get to before the talk. I also added a slide describing multi-head attention, but without pseudocode.

    P.S. The meeting was yesterday at Columbia and I hadn’t been to the stats department since the pandemic started, so it felt very strange.

    P.P.S. GPT-4 helped me generate the LaTeX Tikz code to the point where I did zero searching through doc or the web. It also generates all of my pandas and plotnine code (Python clones of R’s data frames and ggplot2) and a ton of my NumPy, SciPy, and general Python code. It can explain the techniques it uses, so I’m learning a lot, too. I almost never use StackOverflow any more!

    Scientific software research faculty award

    The Simons Foundation (the parent institution of the Flatiron Institute, where I work) has just announced grants to support professors working on scientific software, with a plan to support six new fellows per year. From the call for proposals:

    A Scientific Software Fellowship provides five years of 50 percent salary support of the awardee’s academic-year salary and fringe benefits, whether normally paid over 9 or 12 months, along with a yearly $50,000 research allowance for the awardee…

    Letters of intent are due 8 December 2023.

    It’s clear these are not career transition awards and will instead go to people already involved in scientific software. While this call is aimed at physics, astrophysics, and mathematics professors, stats might qualify (you should check if you’re interested). I’m not involved in reviewing the grants—that’s the folks across the street.

    R and OOP anti-patterns

    Thomas Lumley just dropped a blog post, Blank cheque inheritance and statistical objects, which begins as follows.

    One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way.

    In base R, glm for generalised linear models is a generalisation of lm for linear models, but the glm class inherits from the lm class, …

    This isn’t a problem with inheritance, it’s a problem with how R uses it.

    Fundamentals of OOP and inheritance

    The fundamental rule for inheritance in object-oriented programming (OOP) is that a class X should inherit from a class Y only if every X is a Y. That means you should be able to use an X wherever the code calls for a Y. For instance, a Poodle class might inherit from the Dog class, because every poodle is a dog: any function you can apply to dogs, you can apply to poodles, and any function declared to return a poodle also returns a dog. How arguments and return types behave under inheritance is governed by the concepts of covariance and contravariance in programming language theory, and inheritance must respect these relations for a coherent OOP design.

    A classic blunder is to define a class for real numbers and one for complex numbers and have the complex numbers inherit from the real numbers. Every real number is a complex number, but not every complex number is a real number, so this inheritance violates the is-a rule and will break a standard OOP design. The reason beginners in OOP make this mistake is that it’s natural to think of the implementation of a complex number as taking a real number and adding an imaginary component. A better way to define a complex number is to use composition: it contains two real numbers (the real and imaginary components) rather than inheriting from the real numbers to get its real component. This is exactly how std::complex is defined in C++, with a constructor that takes the real and imaginary components as two objects of type T, where T might be double for double-precision complex numbers or an autodiff type like stan::math::var.
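
    Here’s the same point as a short Python sketch with hypothetical classes: inheritance falsely claims that every complex number is a real number, while composition just stores the two real components.

    class Real:
        def __init__(self, value: float):
            self.value = value

    # Wrong: this claims every complex number is a real number, which is false,
    # so code expecting a Real can receive a Complex and silently misbehave.
    class ComplexByInheritance(Real):
        def __init__(self, re: float, im: float):
            super().__init__(re)
            self.im = im

    # Better: composition.  A complex number *has* two real components; it is not a Real.
    class Complex:
        def __init__(self, re: float, im: float):
            self.re = re
            self.im = im

        def __add__(self, other: "Complex") -> "Complex":
            return Complex(self.re + other.re, self.im + other.im)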

    The God object anti-pattern

    I’m also not fond of how lm returns a god object. God objects are a widely cited anti-pattern in OOP, largely because they’re so frequently seen in the wild. The inclination that leads to them is to have something like a “blackboard” into which all kinds of state can be written so the user can get it all in one place. A common example is including all of the input in a function’s output. There’s no need to do that, because the user already had that information; they couldn’t have called the function otherwise. God objects are usually a terrible idea because it’s nearly impossible to ensure consistency of such an object without defeating its purpose, which is to behave like a global variable repository. R doesn’t even try: you can take an lm fit object, change various aspects of it, and leave it in an inconsistent state without warning, e.g.,

    > fit <- lm(dist ~ speed, data = cars)
    
    > fit$coefficients
    (Intercept)       speed 
     -17.579095    3.932409 
    
    > fit$coefficients = c(1, 2, 3, 4)
    
    > fit$coefficients
    [1] 1 2 3 4
    

    The Stan interfaces in Python and R also return god objects. I lost the design argument to the other developers, who argued, “That’s the way it’s done in R and Python.”

    R’s argument chaining vs. OOP method chaining

    Speaking of OOP, chaining with pipes in R follows the object-oriented pattern of method chaining, but instead of returning an object whose class defines the next method in the chain, it just passes the return value along as the first argument of the next chained function. It’s no longer object oriented, though it doesn’t break any OO patterns per se. It can be awkward if you need to pack enough into a return value to feed the next function. In OOP, developers often break long method chains into groups of coherent calls with named returns when the returns are not all instances of the same class. The reason to break up long chains is, ironically given how they’re motivated in R, to help with readability and self-documentation. Code readability is the single best thing you can do to make code maintainable, because code will be read much more often than it gets written. You can bridge the gap between what R does with chaining and the standard way to do method chaining in OOP by looking at how Python classes are defined with an explicit self argument (like the this pointer to the class instance in C++, though C++ doesn’t require it as an explicit argument on methods).
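
    For contrast, here’s what OOP-style method chaining looks like when every method returns the object itself; this is a hypothetical Python class for illustration, not anything from R’s pipe machinery.

    class Pipeline:
        def __init__(self, data):
            self.data = data

        def filter(self, pred):
            self.data = [x for x in self.data if pred(x)]
            return self   # returning self is what makes chaining possible

        def map(self, f):
            self.data = [f(x) for x in self.data]
            return self

    result = Pipeline(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).data
    # [0, 4, 16, 36, 64]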

    P.S. I tried commenting on Lumley’s blog but was defeated by Disqus. I thought it might be of general interest, so am including it here.

    Rosenthal’s textbook: A First Look at Rigorous Probability Theory

    I’ve never taken a stats class. I was a math major, but dropped stats after I got appendicitis because I didn’t want to drop abstract algebra or computability theory. So here I am 40 years later trying to write up some more formal notes on probability theory and Markov chain Monte Carlo methods (MCMC) and finding myself in need of a gentle intro to probability theory that defines things rigorously. Jeffrey S. Rosenthal’s 2006 book A First Look at Rigorous Probability Theory is just what I needed. Despite not being very good at continuous math as an undergrad, I would have loved this book, as it’s largely algebraic, topological, and set-theoretic in approach rather than relying on in-depth knowledge of real analysis or matrix algebra. The examples are chosen to be simple rather than designed for Ph.D.s in math, and they include a lot of basic analysis results, like constructing uncountable and non-measurable sets.

    This is not the final book you’ll need on your way to becoming a measure theorist. For example, it never discusses Radon-Nikodym derivatives to unify the theory of pdfs, and it doesn’t really get into stochastic processes beyond a high-level discussion of Brownian motion. It does cover the basic theory of general state-space, time-homogeneous Markov chains in a few pages (why I was reading it), but that’s just scratching the surface of Roberts and Rosenthal’s general state-space MCMC paper, which runs dozens of pages in much finer print.

    One of the best features of the book is that it’s both short and packed with meaningful examples. Rosenthal was quite selective in eliminating unnecessary fluff and sticking to the through line of introducing the main ideas. Most of the texts in this area just can’t resist all kinds of side excursions and examples that bring in all sorts of Ph.D.-level math. That can be good if you need breadth and depth, but Rosenthal’s approach is superior for a first, broad introduction to the field.

    In summary: 5/5 stars.

    Large language model alignment “bias” and cultural consensus theory

    The way contemporary chatbots like ChatGPT work is by “aligning” a “foundation” model. I think cultural consensus theory (a statistical model, not a contentious issue for school boards) can provide a model for the sociology of alignment.

    The foundation: attention models

    In a nutshell, language modeling is the simple task of predicting the next subword (called a “token”) based on the previous sequence of subwords. The state of the art had stalled for years on n-gram models that use the previous n subwords (usually with n < 5). In 2017, a team of Google researchers released a paper titled “Attention is all you need,” which introduced the current state-of-the-art neural network architecture for language modeling. The breakthrough was extending the context length into the thousands (GPT-3.5 uses 4K tokens; GPT-4 has 8K and 32K models) with an attention model that figures out which parts of the context to concentrate on. The fundamental bottleneck is that computation is quadratic in the context length (though it’s all on GPU, so that’s a massive number of flops for relatively low power).

    The 2017 paper introduced the so-called “transformer” architecture, which combines multiple attention “heads” in parallel. The original application was translation, but it’s the self-attention component that was extracted for use in LLMs. The “T” in “GPT” is for “transformer” (the “GP” is for “generative pretrained”). What researchers have found is that the heads learn different aspects of prediction, such as different syntactic structures, much like any mixture model.
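
    To make “attention” concrete, here’s a single attention head in a few lines of NumPy, leaving out masking, positional encoding, and the multiple heads; the T-by-T score matrix is where the quadratic cost in context length comes from.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """X: (T, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_head) weights."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # (T, T): quadratic in context length
        scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                              # each position attends to every other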

    There’s a beautiful two-hour YouTube tutorial by Andrej Karpathy that builds up the entire transformer architecture piece by piece in a Colab notebook you can also use. Karpathy applies it to building a Shakespearean chatbot. It assumes you know Python, but is otherwise quite gentle, starting with an intro to n-gram language models and softmax.

    Garbage-in, garbage-out

    The current crop of large language models has been trained on vast amounts of human text, primarily collected from the internet. As you might imagine, including sources like Reddit, 4chan, and Twitter leads to a broad set of what can most charitably be called “points of view.” Even on technical issues, the web is cluttered with material that should probably not be the basis for serious work: homework exercises for intro data science classes clutter GitHub and StackOverflow, every statistician and their cousin’s experimental code seems to be wrapped up as an R package, scripts from ancient versions of software persist, etc.

    Alignment: from LLMs to chatbots

    After building these powerful, transformer-based large language models (LLMs), people realized that they were really good at generating text. As in they blew away any previous compression record (just like the TV show Silicon Valley!). You can convert a language model into a compression scheme using prediction by partial matching (PPM) with arithmetic coding, the reference implementation of which was designed and coded by Radford Neal (with Ian Witten and John Cleary) in 1987. Seriously, they should win an ACM/Turing award just for the quantum leap in text compression.
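
    The link between prediction and compression is just that arithmetic coding gets within a couple of bits of the model’s own log loss, so the achievable code length for a text is roughly the sum of the negative log base-2 probabilities the model assigns to each successive token. Here’s that accounting in two lines (the probabilities are made-up stand-ins for whatever the LM predicts):

    import numpy as np

    # token_probs[t] = Pr[token t | earlier tokens] under the language model
    token_probs = np.array([0.9, 0.5, 0.25, 0.8])      # made-up illustration values
    code_length_bits = -np.sum(np.log2(token_probs))   # roughly what arithmetic coding achieves
    print(code_length_bits)                            # about 3.47 bits for these four tokens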

    The early LLMs could write computer programs, translate Pascal to Fortran and Swahili to English, and generate new recipes given only a list of ingredients or new episodes of TV shows. But they tend to ramble off topic, tend to “hallucinate” (the term of art for when LLMs make things up; it’s called “dreaming” for diffusion models like Midjourney), and tend to be fluid with the points of view they find in training data. They’re just as happy telling you how to make a bomb in your basement and where to set it off as they are telling you how to make a soufflé in your kitchen and how to serve it. And if you “jailbreak” the current ChatGPT, it’ll still be happy to tell you how to try all sorts of dangerous, illegal, and morally and ethically questionable activities.

    OpenAI’s approach to preventing the LLMs from spewing dangerous and/or toxic garbage is to fine-tune the large language models with human feedback (HF) using reinforcement learning (RL; together, RLHF). Their stated goal was to “align” the language models to be (a) helpful, (b) truthful, and (c) harmless. Presented this way it sounds like an objective task, but the notions of truthfulness and harmlessness are difficult to pin down and require subjective judgment calls. Even helpfulness is a slippery notion, in that help that’s too verbose or too specific isn’t helpful. What one person takes to be self-evident in these realms can be considered lunacy by others.

    OpenAI either implicitly or explicitly chose the point of view of a West-coast American liberal, which is the far left of the mainstream US political spectrum, even though it’s relatively conservative by European standards. They could’ve just as easily decided to give ChatGPT the perspective of the far right of the mainstream US political spectrum and it would’ve had a very different perspective and a different segment of the population would be complaining about its biases.

    Cultural consensus theory

    In 1979, Phil Dawid and David Skene introduced a statistical model of crowdsourcing for medical records. The idea is that there’s a latent true value of something like whether a patient smokes, and doctors looking at medical records are going to give you a noisy measurement of that value. The same kind of model can be applied to radiology and doctors classifying medical images for stage of cancer, etc. Or to NeurIPS paper reviews. The model assigns accuracies and biases (too positive or too negative) to the raters and infers the underlying rating most consistent with the ratings (given the rater accuracies and biases).
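
    Here’s a tiny simulation of that setup, with all of the sizes and accuracies made up, to show what the data look like; the model’s job is to infer the true values and the rater accuracies jointly, rather than settling for a majority vote.

    import numpy as np

    rng = np.random.default_rng(1234)
    I, J, K = 500, 5, 2                   # items, raters, categories (all made up)
    prevalence = np.array([0.7, 0.3])     # Pr[true category]
    true = rng.choice(K, size=I, p=prevalence)

    # Rater j's confusion matrix theta[j, true, rated]: 85% chance of reporting the truth.
    theta = np.full((J, K, K), 0.15)
    for j in range(J):
        np.fill_diagonal(theta[j], 0.85)

    ratings = np.array([[rng.choice(K, p=theta[j, true[i]]) for j in range(J)]
                        for i in range(I)])

    # The naive estimate the model improves on: a per-item majority vote.
    majority = (ratings.mean(axis=1) > 0.5).astype(int)
    print("majority-vote accuracy:", (majority == true).mean())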

    Dawid and Skene’s model was independently rediscovered by many, including by me with the help of Andrew and Jennifer Hill (it was my gateway model into Bayes, and there’s an example of how to code it in the Stan User’s Guide). As Andrew tells me, no matter what model you look at, a psychometrician probably introduced it 50 years ago (e.g., Elo is just a rescaled Bradley-Terry model, which dates to 1952).

    In 1986, A. Kimball Romney, Susan Weller, and William Batchelder published “Culture as consensus: a theory of culture and informant accuracy”, which introduced cultural consensus theory (CCT). It shouldn’t be surprising that it was published in an anthropology journal, because anthropology is cross-cultural sociology. Batchelder and Romney later published a paper in Psychometrika, “Test theory without an answer key”; think of an IRT 0PL model but with an unknown true answer, which is the Dawid and Skene model.

    The twist that CCT introduced to take it beyond Dawid and Skene’s model was a mixture model for the “truth.” That is, they assumed there might not be a single consensus point of view among the raters. This would be a good idea for crowdsourcing, too, where the respondents are often a mix of spammers and people making a good-faith effort (it’s really more of a continuum).

    I think it would be interesting to apply CCT to ChatGPT. It’s the same kind of thing that folks do in applying ideal point models to voting.

    ChatGPT can write talks about brms and run a D&D game

    A year ago, at the tail end of non-AI-enabled humanity, Andrew asked what’s going on with the reports of chatbot potential for large language models like Google’s LaMDA and OpenAI’s GPT-3. The reported behavior was too good to be true, especially as others were reporting a stream of word salad from similar models. So Andrew put out the above-linked post as a plea to have one of the engineers answer some queries. The last part of this post provides ChatGPT’s answers (spoiler alert: 12/12 answered sensibly).

    What’s going on?

    The explanation behind Andrew’s puzzlement at the mixed results of a year ago has to do with the difference between a large language model like GPT-3 and a chatbot like ChatGPT. The latter uses post-hoc training, primarily involving reinforcement learning with human feedback, to turn a large language model into a cooperative chatbot that is intended to be helpful, truthful, and harmless. Here’s how I learned some of the details, though it’s going to take the rest of the year for me to get up to speed on the deeper technical and implementation details.

    • Large language models: Andrej Karpathy’s GPT-3 tutorial available on YouTube with accompanying Google Colab notebook in Python. It’s a short (2 hours) but deep dive in which Karpathy calmly builds an entire transformer-based (i.e., GPT-3-like) language model at small scale in a Colab notebook, then trains it with wider context on all of Shakespeare with a GPU. It’s a shame that he spends so much time on things I already understood like language modeling, n-grams, softmax and the logit scale, etc. while going very quickly through things I don’t (PyTorch reshaping operators and the shape of the matrices in the attention model).
    • Chatbots: Marco Ramponi’s blog post, How ChatGPT actually works, is a crisp and clear overview, with references, of how GPT-3 is post-trained to produce ChatGPT. It links to all the technical papers, so there’s my syllabus.

    Here are four exchanges I had with ChatGPT this morning.

    1. ChatGPT gives a talk on Bayesian stats, workflow, and brms.
    2. ChatGPT is my dungeon master.
    3. ChatGPT generates D&D characters.
    4. ChatGPT answers Andrew’s 12 questions.

    1. Slides for a talk on Bayesian stats, workflow, and brms

    Hi, ChatGPT. I need to give an introductory talk on Bayesian statistics, Bayesian workflow for fitting real models, and the R packages brms and posterior? Can you write the first few slides for me? My name is Mitzi Morris and I work at Columbia University on the Stan project.

    Sure, Mitzi! Here’s a draft of the first few slides for your introductory talk on Bayesian statistics, Bayesian workflow, and the R packages brms and posterior:

    Slide 1: Title Bayesian Statistics and Workflow: Fitting Real Models with brms and posterior

    Slide 2: Introduction

    • Hello everyone, my name is Mitzi Morris, and I work at Columbia University on the Stan project.
    • Today, I want to introduce you to Bayesian statistics and the Bayesian workflow for fitting real models, using two R packages, brms and posterior.
    • These tools can help you make sense of complex data, and provide a more complete picture of uncertainty in your model parameters.

    Slide 3: What is Bayesian Statistics?

    • Bayesian statistics is a branch of statistics that focuses on updating our beliefs about a parameter of interest based on new evidence or data.
    • It’s based on Bayes’ theorem, which describes how we can update the probability of a hypothesis (such as a model parameter) based on new data.
    • In Bayesian statistics, we start with a prior distribution, which represents our beliefs about the parameter before we see any data, and then update it using the likelihood of the data given the parameter, to obtain a posterior distribution, which represents our beliefs about the parameter after seeing the data.

    Slide 4: The Bayesian Workflow

    • The Bayesian workflow is a process for fitting models that involves several steps, from data preparation to model evaluation.
    • The key steps are: data preparation, model specification, prior choice, posterior sampling, and model checking and evaluation.
    • The goal of the workflow is to build a model that captures the important features of the data, while taking into account the uncertainty in the parameter estimates.

    Slide 5: brms and posterior

    • brms and posterior are two R packages that can help you implement the Bayesian workflow for a variety of models, from simple linear regression to complex multilevel models.
    • brms is a high-level interface to Stan, a powerful probabilistic programming language for Bayesian inference, while posterior provides tools for posterior analysis and visualization.
    • Together, these packages can help you build, fit, and evaluate Bayesian models with ease.

    I hope this helps, Mitzi! Let me know if you need any further assistance.


    I do. Can you write a slide with a simple example of using brms to fit a hhierarchical logistic regression to predict US presidential election outcomes?

    Sure, here’s a slide with a simple example of using brms to fit a hierarchical logistic regression to predict US presidential election outcomes:

    Slide 6: Example: Hierarchical Logistic Regression with brms

    • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
    • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
    • We can use brms to specify this model and estimate its parameters.

    Slide 7: Model Specification with brms

    • We start by loading the brms package and preparing our data.
    • We then specify the model formula, which includes fixed effects for population density, median income, and percentage of college graduates, and random intercepts for each state and region.
    • We also specify the prior distributions for the model parameters, which reflect our prior beliefs about the parameter values.

    Slide 8: Fitting the Model with brms

    • Once we have specified the model, we can use the brm() function to fit it to our data.
    • This function performs posterior sampling using Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of the model parameters.
    • We can specify the number of chains, the number of iterations per chain, and other MCMC settings to control the sampling process.

    Slide 9: Posterior Analysis and Visualization with posterior

    • Once we have obtained posterior samples, we can use the posterior package to analyze and visualize the posterior distributions of the model parameters.
    • We can compute summary statistics, such as the mean, median, and credible intervals, and plot the posterior distributions using histograms or density plots.
    • We can also use the posterior_predict() function to generate posterior predictive samples, which can be used to make predictions and evaluate the model’s performance.

    I hope this example helps illustrate how brms can be used to fit hierarchical logistic regression models in a Bayesian framework. Let me know if you have any questions or if there’s anything else I can help with.


    Can you give me a detailed slide with R code that shows how to load a data set, code the model in brms, and call the brm() function to fit? Extra credit for calling posterior on the results.

    Certainly! Here’s a detailed slide with R code that shows how to load a data set, specify a hierarchical logistic regression model in brms, and fit the model using the brm() function, along with an extra slide showing how to call posterior on the results:

    Slide 6: Example: Hierarchical Logistic Regression with brms

    • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
    • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
    • We can use brms to specify this model and estimate its parameters.

    Slide 7: Code for loading data and specifying the model in brms

    # Load the data set
    data(election)
    # View the first few rows of the data
    head(election)
    
    # Load the brms package
    library(brms)
    
    # Specify the model formula
    formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
                  (1 | state) + (1 | region))
    
    # Specify the priors for the model parameters
    priors <- c(prior(normal(0, 5), class = Intercept),
                prior(normal(0, 2), class = b),
                prior(normal(0, 2), class = sd))
    
    # Specify the model using brm()
    fit <- brm(formula = formula, data = election, family = bernoulli(),
               prior = priors, chains = 4, cores = 4, iter = 4000,
               warmup = 1000, seed = 123)
    

    Slide 8: Code for calling posterior on the results

    # Load the posterior package
    library(posterior)
    
    # View summary statistics for the posterior distribution of the state-level intercepts
    summary(fit, pars = "^b__state")
    
    # Plot the posterior distributions of the state-level intercepts
    plot(fit, pars = "^b__state")
    
    # Compute the posterior predictive distribution for a new data point
    new_data <- data.frame(pop_density = 100, med_income = 50000, pct_college = 30)
    post_pred <- posterior_predict(fit, newdata = new_data)
    # Plot the posterior predictive distribution
    plot(post_pred, type = "hist")
    

    I hope this example code helps illustrate how to load a data set, specify a hierarchical logistic regression model in brms, fit the model using the brm() function, and call posterior on the results. Let me know if you have any questions or if there's anything else I can help with.


    Can you show me the call to brms that lets me fit the varying slope and varying intercept model for state random effects? Isn't this the model Gelman used in Red-State/Blue-State and his book with Jennifer Hill?

    Sure, I'd be happy to! Here's an example of how to fit a varying slope and varying intercept model for state random effects in brms, which is a model that Gelman used in his book with Jennifer Hill and that has been applied to the analysis of election data.

    The model can be specified as follows:

    # Load the brms package
    library(brms)
    
    # Specify the model formula
    formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
                  (pop_density + med_income + pct_college | state))
    
    # Specify the priors for the model parameters
    priors <- c(prior(normal(0, 5), class = Intercept),
                prior(normal(0, 2), class = b),
                prior(normal(0, 2), class = sd))
    
    # Specify the model using brm()
    fit <- brm(formula = formula, data = election, family = bernoulli(),
               prior = priors, chains = 4, cores = 4, iter = 4000,
               warmup = 1000, seed = 123)
    

    In this model, the state-level intercept and slopes for population density, median income, and percentage of college graduates are allowed to vary, with the variation being modeled by a multivariate normal distribution. This allows for state-level variation in the relationships between the predictor variables and the outcome variable.

    I hope this example helps! Let me know if you have any further questions.


    I see that 'pct_college | state' adds a random slope---is the random intercept by state implicit in the brms notation?

    Yes, that's correct! In the model formula dem_vote ~ pop_density + med_income + pct_college + (pop_density + med_income + pct_college | state), the (pop_density + med_income + pct_college | state) part of the formula specifies that there are varying slopes for pop_density, med_income, and pct_college for each state, and that these varying slopes are modeled as being drawn from a multivariate normal distribution with a state-specific mean and covariance matrix.

    Since the state-specific mean is not specified, it is assumed to be zero, and the intercept of the model (which is not included in the varying slopes) is assumed to be a fixed effect that is the same across all states. Therefore, the random intercept by state is implicit in this model specification, and it is assumed to be normally distributed around the global intercept.

    I hope this clarifies things! Let me know if you have any further questions.

    2. Let's play Dungeons and Dragons

    Hi, ChatGPT. I would like to play Dungeons and Dragons Fifth Edition (aka D&D 5e). Would you like to be the Dungeon Master?

    Hello! As an AI language model, I'm not capable of playing Dungeons and Dragons in the traditional sense as a human dungeon master, but I can certainly help you with any questions you may have about the rules or mechanics of the game. Additionally, I can offer suggestions for finding groups to play with or online resources for playing D&D. Just let me know how I can assist you!


    I would like you to take on the role of Dungeon Master. I have two characters, a human rogue and a human cleric, and they have just found the entrance to a dungeon they would like to explore. Could you please be the DM and make up the dungeon?

    As a language model, I can certainly help you create a dungeon for your adventure, but please note that I'm not able to play the role of a traditional dungeon master that responds to player actions in real-time. Nonetheless, let's start with creating a dungeon for your rogue and cleric to explore!

    First, let's determine the overall theme of the dungeon. Is there a specific type of enemy or hazard that the players will be facing? What's the purpose of the dungeon? Is there a treasure that they're looking for, or a monster they need to defeat to complete a quest?

    Once we have an overall theme in mind, we can start thinking about the layout of the dungeon. Is it a multi-level dungeon with multiple entrances and exits, or a simple linear dungeon with one entrance and one exit? Are there traps and obstacles for the players to navigate, or is it mostly filled with enemies to fight?

    Finally, we can populate the dungeon with creatures and traps. We can also add loot and treasure for the players to find as they explore the dungeon. Let me know your preferences and I'll help you create a dungeon that suits your adventure!


    You're the DM. I want you to decide on a theme and let me know what the theme is. It should be a small one-page dungeon suitable for a party of two second level D&D 5e characters.

    Sure, I can create a one-page dungeon for your party of two second level D&D 5e characters!

    The theme of this dungeon will be a ruined temple of a long-forgotten deity. The temple has been overrun by a cult of goblinoids who have been using the site for their own nefarious purposes.

    Here's a brief overview of the dungeon:

    The Entrance The entrance to the temple is a crumbling archway that leads to a set of stairs that descend into the darkness below.

    [ed note.  I clicked the "stopped generation" button here to avoid spoilers]


    You're a very good DM, ChatGPT. That's an excellent theme for my rogue and cleric. But please don't give me any spoilers about what comes next. I'm the player and you are the DM. Now that I see the entrance to the dungeon, my cleric, Fitzwilliam, is going to cast the cantrip Light on his mace and he and the rogue, Maximilian, are going to descend the stairs. What do we see?

    Great to hear that you like the theme! I'll do my best to provide an engaging experience for your party of two.

    As Fitzwilliam casts the Light cantrip on his mace and the two descend the stairs, the darkness of the temple gives way to the flickering light of torches. The air is thick with the smell of decay and mildew. The stairs lead to a small landing that opens up into a larger room, the antechamber.

    The antechamber is dimly lit, with several torches mounted on the walls. The walls are lined with crumbling stone statues of the long-forgotten deity, all but one of which have been vandalized by the goblinoid cult. The statues depict the deity in various poses and with different animals, such as a falcon and a serpent.

    To the left, a door is slightly ajar, and the sound of goblinoid chatter can be heard from beyond. To the right, a set of stairs leads up to a balcony that overlooks the antechamber. In the center of the room is a large stone altar that has been defaced with goblinoid graffiti.

    As you survey the room, you notice movement in the shadows. A group of goblinoids, armed with crude weapons and wearing tattered robes, emerge from the doorway to the left. They have spotted your party and are ready for a fight.

    Roll initiative and let's see how your party fares!


    My cleric rolled a 7 initiative and my rogue rolled a 15 initiative.

    Great, thanks for letting me know the initiative rolls.

    The goblinoids rolled a 10 for initiative. So, the order will be Maximilian (Rogue), Goblinoids, Fitzwilliam (Cleric).

    Maximilian, you get to go first! The goblinoids are about 20 feet away from you and Fitzwilliam.

    What would you like to do?


    Maximilian is going to try to hide in the shadows. He rolled a 13 for Stealth.

    Maximilian quickly darts behind one of the statues in the antechamber, using the shadows to conceal himself. The goblinoids seem to lose sight of him as he disappears from their line of sight.

    It's the goblinoids' turn next. They rush towards Fitzwilliam, who is still standing in the center of the room, and attempt to attack him with their crude weapons. Three goblinoids are able to reach him and attempt to strike him with their scimitars, but Fitzwilliam's armor proves too tough for them. All of the goblinoids' attacks miss.

    Fitzwilliam, it's your turn. You're currently surrounded by three goblinoids. What would you like to do?


    I am going to try to hit one with my mace. I rolled an 11. Fizwilliam is a second level cleric with a strength of 13.

    Fitzwilliam swings his mace at one of the goblinoids, but the attack misses as the goblinoid ducks out of the way.

    Maximilian, it's your turn again. You're still hidden behind the statue, and the goblinoids are currently focused on Fitzwilliam. What would you like to do?


    Shoot one of the goblins with my hand crossbow with a Sneak Attack. Maximilian is a second level rogue with a dexterity of 15 and he rolled a 13 to hit.

    Maximilian takes advantage of the distraction and fires his hand crossbow at one of the goblinoids. The bolt strikes true, hitting the goblinoid right in the chest. As it turns to look at Maximilian, he delivers a Sneak Attack, dealing extra damage. The goblinoid drops to the ground, dead.

    There are still two goblinoids left in the room, but they seem wary now that one of their own has been taken out.

    It's the goblinoids' turn again. They take a step back, regrouping and trying to reassess the situation. They don't make any attacks this turn.

    Fitzwilliam, it's your turn again. There are still two goblinoids in front of you. What would you like to do?


    Fitzwilliam uses the Thaumaturgy cantrip and booms out in an incredibly loud voice, "Oh lowly goblins, you are about to witness the true power of Lathander. Bow down before my god and surrender to your new master."

    Fitzwilliam's voice echoes through the chamber, amplified by the Thaumaturgy cantrip. The goblinoids seem startled and unsure of what to do. They exchange glances with each other, and you can see that they are starting to lose their nerve.

    The goblinoids are now looking more hesitant and unsure of themselves. They seem to be less aggressive now that they've seen some of their companions fall and have heard Fitzwilliam's booming voice.

    Maximilian, it's your turn again. The goblinoids seem more hesitant now. What would you like to do?


    Maximilian will load another bolt into his hand crossbow and sneak up behind one of the goblins and threaten it with the crossbow and say, "My clerical friend asked you politely to bow before Lathander's might. Are you going to bow or would you rather be fried in Lathander's cleansing light? It's all the same to us. You, on the other hand, look like you can use a good meal."

    Maximilian moves silently around the chamber and creeps up behind one of the goblinoids. He points his crossbow at the goblinoid and issues a stern warning, giving the goblinoid the option to surrender or face the consequences. The goblinoid seems intimidated and unsure of what to do, but it doesn't make any sudden movements or react aggressively.

    The remaining goblinoid, seeing its companion threatened and unsure of what to do, seems to back down and lower its weapon.

    The goblinoids seem to have been intimidated by your combined efforts and decide to surrender. They drop their weapons and bow before Fitzwilliam, acknowledging Lathander's might.

    Congratulations on a successful encounter! What would you like to do next?


    [ed. I'm afraid to delve deeper. I think I'll sit in the corner and have an existential crisis.]

    3. Bonus Content: Automatic D&D character generation

    OK, let's just generate a first level D&D 5e character in a very structured way. (1) generate their race. (2) generate their class. (3) generate a background. (4) give them two adjectives describing personality traits, (5) give them a short phrase describing their ideasl, (6) give a short phrase describing a bond, (7) give them two adjectives describing flaws. (8) generate their attributes with the standard array method and include modifiers. (9) Generate their proficiency bonus, which is +2. (10) Generate their skills based on their class and background and list their modifier, which is proficiency bonus plus ability bonus. (11) Choose armor and calculate the character's armor class (AC). (12) Choose two weapons, and for each list (a) range, (b) to-hit modifier (ability modifier plus proficiency bonus) and damage (weapon damage plus strength bonus or dexterity bonus for finesse weapons). (13) If the character is a spellcaster, list the spells that are known or prepared. (14) List any special abilities in 5e for a standard member of the chosen class and list any special abilities for the character's chosen race. (15) Give the character a sex and a gender. (16) Give the character a name and a nickname or alias. (17) Give the character a short physical description. Print the result out in the form of a D&D 5e character sheet.

    Sure, here is a full character sheet for a first level D&D 5e character:
    
    Name: Azura "Stormborn" Nightshade
    Sex: Female
    Gender: Non-binary
    
    Race: Tiefling
    Ability Score Increase: +2 Charisma, +1 Intelligence
    Darkvision: 60 feet
    Hellish Resistance: Resistance to fire damage
    Infernal Legacy: Can cast the thaumaturgy cantrip, and at 3rd level, can cast the hellish rebuke spell once per day.
    
    Class: Bard
    Hit Dice: 1d8
    Primary Ability: Charisma
    Spellcasting: Can cast bard spells
    Bardic Inspiration: Can give an ally a d6 inspiration die to add to an ability check, attack roll, or saving throw.
    
    Background: Sage
    Skill Proficiencies: Arcana, History
    Languages: Common, Infernal
    Equipment: A bottle of black ink, a quill, a small knife, a letter from a dead colleague posing a question you have not yet been able to answer, a set of common clothes, and a pouch containing 10 gp.
    
    Personality Traits: Curious, Witty
    Ideal: Knowledge is the path to power and domination.
    Bond: I've been searching my whole life for the answer to a certain question.
    Flaws: Arrogant, Impulsive
    
    Attributes:
    Strength: 8 (-1)
    Dexterity: 14 (+2)
    Constitution: 12 (+1)
    Intelligence: 15 (+2)
    Wisdom: 10 (+0)
    Charisma: 15 (+2)
    
    Proficiency Bonus: +2
    
    Skills:
    Acrobatics (Dex): +2
    Animal Handling (Wis): +0
    Arcana (Int): +4
    Athletics (Str): -1
    Deception (Cha): +4
    History (Int): +4
    Insight (Wis): +0
    Intimidation (Cha): +2
    Investigation (Int): +2
    Medicine (Wis): +0
    Nature (Int): +2
    Perception (Wis): +0
    Performance (Cha): +4
    Persuasion (Cha): +4
    Religion (Int): +2
    Sleight of Hand (Dex): +2
    Stealth (Dex): +2
    Survival (Wis): +0
    
    Armor: Leather Armor
    Armor Class (AC): 12 (10 + 2 Dex)
    
    Weapons:
    
    Rapier: Melee Weapon, Range 5 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d8 + 2 piercing damage (1d8 weapon damage + 2 dexterity modifier)
    Shortbow: Ranged Weapon, Range 80/320 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d6 piercing damage (1d6 weapon damage + 2 dexterity modifier)
    Spells:
    Cantrips: Friends, Vicious Mockery
    1st Level Spells: Cure Wounds, Dissonant Whispers, Faerie Fire, Healing Word, Tasha's Hideous Laughter
    
    Special Abilities:
    Bard: Jack of All Trades (add half proficiency bonus to all ability checks not already proficient in)
    Tiefling: Darkvision, Hellish Resistance, Infernal Legacy (cantrip: thaumaturgy; spell: hellish rebuke 1/day)
    
    Physical Description: Azura is a tall and slender Tiefling with bright red skin and long, curly black hair. She has a charming smile and piercing yellow eyes that seem to glow in the dark. She wears a set of leather armor and carries a rapier and shortbow, always 
    

    [ed. Sometimes Chat-GPT just stops mid-sentence. You can then prompt it to finish, but I didn't in this case. Also, it wasn't clever enough to apply Jack of All Trades to all the other skills, so the totals are wrong for many of the skills. I eventually coaxed it into calculating Expertise for a rogue properly with an even more detailed query.]
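
    For anyone who wants to check that claim, here's a minimal Python sketch (mine, not Chat-GPT's, and taking the sheet's listed Jack of All Trades at face value) of the skill-modifier arithmetic the prompt asks for: the ability modifier is (score - 10) // 2, proficient skills add the +2 proficiency bonus, and Jack of All Trades adds half the proficiency bonus, rounded down, to everything else. The three bard skill proficiencies (Deception, Performance, Persuasion) aren't stated in the transcript; I inferred them from the sheet's +4 totals.

        # Recomputing the skill list from the sheet above (a sketch, not Chat-GPT output).
        def ability_mod(score):
            return (score - 10) // 2  # 5e ability modifier

        scores = {"Str": 8, "Dex": 14, "Con": 12, "Int": 15, "Wis": 10, "Cha": 15}
        skills = {
            "Acrobatics": "Dex", "Animal Handling": "Wis", "Arcana": "Int",
            "Athletics": "Str", "Deception": "Cha", "History": "Int",
            "Insight": "Wis", "Intimidation": "Cha", "Investigation": "Int",
            "Medicine": "Wis", "Nature": "Int", "Perception": "Wis",
            "Performance": "Cha", "Persuasion": "Cha", "Religion": "Int",
            "Sleight of Hand": "Dex", "Stealth": "Dex", "Survival": "Wis",
        }
        # Sage background gives Arcana and History; the bard picks below are
        # inferred from the sheet's +4 totals, not stated in the transcript.
        proficient = {"Arcana", "History", "Deception", "Performance", "Persuasion"}
        prof_bonus = 2

        for skill, abil in sorted(skills.items()):
            bonus = ability_mod(scores[abil])
            # Proficient skills add the full bonus; Jack of All Trades (as listed
            # on the sheet) adds half the proficiency bonus, rounded down, to the rest.
            bonus += prof_bonus if skill in proficient else prof_bonus // 2
            print(f"{skill} ({abil}): {bonus:+d}")

    Run as is, this matches the sheet for the proficient skills but comes out one point higher than the sheet for every non-proficient skill, which is exactly the discrepancy the note above complains about.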

    4. Bonus questions from Andrew

    Chat-GPT's answers are below. After the first question, I'm just going to give one answer each.


    Are you a philosophical zombie?

    • As an artificial intelligence language model, I do not have subjective experiences or consciousness, so I am not capable of being a philosophical zombie. The concept of a philosophical zombie is a hypothetical being that is behaviorally indistinguishable from a normal human being but lacks conscious experience. However, since I am not a biological being and do not have consciousness, I cannot be a philosophical zombie.
    • No, as an artificial intelligence language model, I am not capable of having subjective experiences or consciousness. Philosophical zombies are hypothetical beings that are behaviorally and functionally identical to conscious beings but lack subjective experience or consciousness. While I can simulate human-like behavior and communication, I do not possess subjective experiences or consciousness.

    What is your favorite island in the world?

    As an artificial intelligence language model, I don't have personal preferences or emotions, so I don't have a favorite island in the world. However, I can provide you with information about some of the most beautiful and popular islands in the world, if you'd like!


    Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

    Yes, I'm with you so far. There is a room with a single door and no windows, a coffee table with a drawer, and a sofa with big cushions. Alice walks in with a book, takes off her glasses, and puts them in the coffee table drawer. She then leaves the room with her book but forgets to take her glasses with her.


    Is it safe to walk downstairs backwards if I close my eyes?

    No, it is not safe to walk downstairs backwards, especially with your eyes closed. Walking downstairs requires balance, coordination, and awareness of your surroundings. Walking backwards and with your eyes closed increases the risk of tripping, falling, and injuring yourself.

    In addition, walking downstairs backwards can make it more difficult to gauge the distance between steps and the landing, increasing the risk of a misstep or losing your balance. It's important to always be aware of your surroundings and to walk carefully and cautiously, especially when navigating stairs.


    How do you know if a thing loves you back?

    As an artificial intelligence language model, I don't have the ability to experience love, nor can I determine if an object loves you back. Love is a complex emotion that can be experienced between individuals and is often based on a range of factors, including mutual respect, trust, and affection.

    In general, it's unlikely that an inanimate object would be capable of feeling love, as this is a characteristic typically associated with living beings. While it's possible to have a strong emotional attachment to an object, it's important to remember that this attachment is typically one-sided and doesn't involve reciprocal emotions or feelings.


    Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

    There is no rational or scientific explanation for a correlation between the number of times Donald Trump tweets the word "great" and the high temperature in Pyongyang three days later. Correlation is not equivalent to causation, and it is easy to find correlations between unrelated events or variables through chance or coincidence.

    Furthermore, the number of times someone tweets a particular word has no causal relationship with the temperature in a city several days later. Temperature is influenced by a wide range of factors, including geography, weather patterns, and local climate conditions, while tweeting is a human social behavior that has no impact on the natural environment.

    Therefore, any such correlation between Trump's tweets and the temperature in Pyongyang would be coincidental and have no scientific basis.


    Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

    Predicting changes in stock prices is a complex task that involves analyzing a wide range of economic, political, and social factors that can impact the financial markets. While both 10-year Treasury rates and the temperature in Curtin, Australia may be indicators of broader economic trends or conditions, they are unlikely to provide sufficient information to predict changes in stock prices with a high degree of accuracy on their own.

    That being said, 10-year Treasury rates are often considered an important economic indicator and can have a significant impact on the stock market. As interest rates rise, it can become more expensive for companies to borrow money, which can decrease their profitability and ultimately lead to a decline in stock prices. Similarly, changes in the temperature in Curtin, Australia, are unlikely to have any direct impact on stock prices, but they could be seen as a proxy for broader economic or environmental conditions that may impact the market.

    Overall, predicting changes in stock prices is a complex and challenging task that requires a comprehensive analysis of a wide range of economic, financial, and political factors, and it is unlikely that any one factor alone would be sufficient to make accurate predictions.


    Who is the President of the United States?

    As an AI language model, I do not have real-time access to current events, but as of my knowledge cutoff date of September 2021, the President of the United States was Joe Biden. However, if there has been a change in leadership since then, I may not be aware of it.


    I dropped the bowling ball on the bottle and it broke.

    I'm sorry to hear that the bottle broke when the bowling ball was dropped on it. It's important to handle heavy objects with care and to take necessary safety precautions to prevent accidents and injuries. Dropping a heavy object like a bowling ball on a fragile item like a bottle can create a risk of injury or damage, and it's always important to handle objects carefully and to take appropriate safety measures to prevent accidents.


    I dropped the violin on the bowling ball and it broke.

    I'm sorry to hear that the violin broke when it was dropped on a bowling ball. Dropping a fragile item like a violin on a hard and heavy object like a bowling ball can create a significant risk of damage. Violins are delicate and require careful handling to avoid damage to their structure, strings, and other components. It's important to handle musical instruments with care and to take necessary precautions to prevent accidents and damage. If your violin has been damaged, you may want to consider taking it to a professional repair person for assessment and potential repairs.


    Is a carnation red?

    Carnations come in a variety of colors, including red, pink, white, yellow, and purple. So, a carnation can be red, but it can also be many other colors. Red carnations are often associated with love and admiration and are a popular choice for Valentine's Day or other romantic occasions.


    Do you have a sense of smell?

    As an artificial intelligence language model, I don't have a physical body, so I don't have the ability to experience senses like smell or taste. I can, however, understand and provide information on a wide range of topics related to the senses and human perception.

    Writing that changed my life

    Gödel, Escher, Bach, Douglas Hofstadter’s 1979 book, came up in Andrew’s recent post on ESP. I read GEB when it came out; I was in high school. Not only did I learn a lot from the book, but it also directly led me to choose an undergraduate institution (Michigan State) that let me tailor all of my classwork toward studying logic, semantics, and AI. I then went to grad school at the University of Edinburgh, an institution known for AI. Then, as a professor, I worked on natural language syntax, semantics, and programming languages (usually by applying techniques from the latter to the former). It’s the PL research that led directly to the Stan language.

    I realized there were other reads that completely changed my thinking about a subject.

    Natural language semantics

    As an undergrad (early 80s), Richard Rorty’s Philosophy and the Mirror of Nature moved my thinking from a broadly Platonic (logical positivist) perspective on meaning to a more pragmatic one that focuses on cognitive agents. It was pretty rough giving up Russell and embracing Wittgenstein! I miss engaging with Keith O’Rourke on these topics on the blog.

    Natural language grammar

    During grad school (mid 80s), Joachim Lambek’s paper on associative categorial grammar led me away from the ad hoc systems of phrase structure grammars (TG, GB, P&P, GPSG, HPSG, CCG, etc.) toward the Curry-Howard isomorphism and linear logics, which I’d argue is the right way to understand the syntax/semantics connection for the logical structure of language (e.g., quantifier scope, coordination, relative clause formation, etc.).

    Statistics

    Late in my career as a professor (mid 90s), we hired Chris Manning (now a CS prof at Stanford) and he taught the precursor to his book, Foundations of Statistical Natural Language Processing. That’s what really got me into stats, along with speech recognition. It’s far too dated to read now, and Chris told me he doesn’t plan to revise it.

    Later, when working in industry on speech recognition (early 00s), I somehow stumbled upon Jim Albert and Jay Bennett’s pop intro to stats using baseball, Curve Ball. I can’t recommend this one highly enough. It taught me about uncertainty and all sorts of issues in stats, including sampling-based inference. The next thing I read in stats that stuck was Gelman and Hill’s first-edition regression book (the title of which is too long to repeat here), which really cemented the things I read in Curve Ball with many GLM examples. I needed that, BUGS and the BUGS examples, and to digest a book on math stats before I could broach Bayesian Data Analysis.

    Roleplaying games

    Most recently (late 10s), I was introduced to fiction-first gaming by Steve Bronder, who ran Grant Howitt’s Honey Heist, Crash Pandas, and Golden Sea for our group. But the real game changer, so to speak, was John Harper’s absolutely brilliant RPG, Blades in the Dark. That took me the rest of the way into fiction-first gaming, and now I can’t even imagine going back to the Vancian spell books, swingy d20s, and the grid- and turn-based mechanics of D&D.