You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even soft clustering (i.e., fitting the statistical model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures? Sadly, no. Matt and I might have tried, but we would have failed. We wouldn’t want to go back to Andrew and say,

  • “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  • “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  • “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time by a deterministic algorithm. A problem is in the class NP if it can be solved in polynomial time by a non-deterministic (i.e., infinitely parallel) algorithm. A problem is NP-hard if it’s at least as hard as every problem in NP (formalized through reductions, a powerful CS proof technique that’s also the basis of Gödel’s incompleteness theorem). An NP-hard problem often has a non-deterministic algorithm that makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each guess verifying whether or not it is a solution. A problem is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.
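To make the guess-and-verify structure concrete, here is a toy Python sketch (my own example, not anything from Garey and Johnson): checking a proposed truth assignment against a propositional formula in CNF takes linear time, even though finding a satisfying assignment is NP-hard.

def satisfies(assignment, clauses):
    """Check a satisfiability certificate: each clause is a list of literals,
    where k means variable k is true and -k means variable k is false.
    The formula holds if every clause contains at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
clauses = [[1, -2], [2, 3], [-1, -3]]
print(satisfies({1: True, 2: True, 3: False}, clauses))  # prints True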

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years to prove that the class of NP-complete problems is distinct from the class P (i.e., that P ≠ NP).

In other words, there’s a Turing Award waiting for you if you can actually turn the third response above into the second.

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

And yes, that’s the same Tosh who was the first author of the “piranha” paper.

Inference for Ising models that are not restricted to be planar is also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve gone is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms, and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

Leap Day Special!

The above graph is from a few years ago but is particularly relevant today!

It’s funny that, in leap years, approximately 10% fewer babies are born on 29 Feb. I think it would be cool to have a Leap Day birthday. But I guess most people, not being nerds, would prefer the less-“weird” days before and after.

There’s lots of good stuff at the above link; I encourage you to read the whole thing.

In the years since, we’ve improved Stan so we can fit and improve the birthdays time series decomposition model using full Bayesian inference.

Here’s Aki’s birthday case study which has all the details. This will also be going into our Bayesian Workflow book.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which estimates the varying effect of income on Republican preference by state. Each state has its own slope and intercept, tied together by a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ prior is what they would have used had Ben Goodrich created it by then.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, the simulations, and the blog post, and I’m the one who designed the language! It would’ve been much faster if I hadn’t been trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code, then the Stan code only has a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include an intercept.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihood:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows_dot_product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma), that is the time waster in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta with _beta. You’d probably want to do that in a context with multiple groupings. But the names get very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at the ends of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The exception is the one true inline comment I’d leave, which notes the implication of the non-centered parameterization for the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend debugging the stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate-level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (d) cognitive science including psycholinguistics and cognitive anthropology, (e) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (f) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation n in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates, J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with the simulated data and the simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set. R-hats are all less than 1.01, ESS is in the thousands from a sample of size 4000, and all but a couple of parameters are recovered within their central 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of the covariance too seriously!
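Here’s a rough sketch of the kind of recovery check I mean, reusing the fit, beta, and friends from the script above; cmdstanpy’s stan_variable returns the posterior draws as a NumPy array.

import numpy as np

# posterior draws of beta have shape (num_draws, K, J)
beta_draws = fit.stan_variable('beta')
beta_hat = beta_draws.mean(axis=0)
print('max abs error in posterior means of beta:', np.max(np.abs(beta_hat - beta)))

# fraction of true values inside central 95% posterior intervals
lo = np.quantile(beta_draws, 0.025, axis=0)
hi = np.quantile(beta_draws, 0.975, axis=0)
print('95% interval coverage for beta:', np.mean((lo <= beta) & (beta <= hi)))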

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this example triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i], despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first, then generalize later if you need to. I coded the general case above.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

Bayesian Analysis with Python

Osvaldo Martin writes:

The third edition of Bayesian Analysis with Python serves as an introduction to the basic concepts of applied Bayesian modeling. It adopts a hands-on approach, guiding you through the process of building, exploring and expanding models using PyMC and ArviZ. The field of probabilistic programming is in a different place today than it was when the first edition was devised in the middle of the last decade. The journey from its first publication to this current edition mirrors the evolution of Bayesian modeling itself – a path marked by significant advancements, growing community involvement, and an increasing presence in both academia and industry. Consequently, this updated edition also includes coverage of additional topics and libraries such as Bambi, for flexible and easy hierarchical linear modeling, PyMC-BART, for flexible non-parametric regression; PreliZ, for prior elicitation; and Kulprit, for variable selection.

Whether you’re a student, data scientist, researcher, or developer aiming to initiate Bayesian data analysis and delve into probabilistic programming, this book provides an excellent starting point. The content is introductory, requiring little to no prior statistical knowledge, although familiarity with Python and scientific libraries like NumPy is advisable.

By the end of this book, you will possess a functional understanding of probabilistic modeling, enabling you to design and implement Bayesian models for your data science challenges. You’ll be well-prepared to delve into more advanced material or specialized statistical modeling if the need arises.

See more at the book website

Osvaldo spent one year at Aalto in Finland (unfortunately, during the pandemic), so I know he knows what he’s writing about. Bambi is an rstanarm/brms-style interface for building models with PyMC in the Python ecosystem, and Kulprit is the Python version of projpred (in R) for projective predictive model selection (which is one of my favorite research topics).

When all else fails, add a code comment

Another way of saying this is that you should treat inline code comments as a last resort when there is no other way to make your intentions clear.

I used to teach a session of Andrew’s statistical communication class once a year and I’d focus on communicating a computational API. Most of the students hated it because they signed up for the class to hear Andrew talk about stats, not me talk about API design. At least one student just up and walked out every year! So if you’re that student, now’s your chance to bail.

Comments considered harmful

Most academics, before they will share code with me, tell me they have to “clean it up.” I invariably tell them not to bother, and at best, they will dilly dally and shilly shally and apologize for lack of comments. What they don’t realize is that they were on the right track in the first place. The best number of inline code comments is zero. Nada. Zilch. Nil. Naught.

Why are comments so harmful? They lie! Even with the best of intent, they might not match the actual implementation. They often go stale over time. You can write whatever you want in a comment and there’s no consistency checking with the code.

You know what doesn’t lie? Code. Code doesn’t lie. So what do professional programmers do? They don’t trust comments and read the code instead. At this point, comments just get in the way.

What’s a bajillion times better than comments?

Readable code. Why? It’s self documenting. To be self documenting, code needs to be relatively simple and modular. The biggest mistake beginners make in writing code is lack of modularity. Without modularity, it’s impossible to build code bottom up, testing as you go.

It’s really hard to debug a huge program. It’s really easy to debug modules built up piece by piece on top of already-tested modules. So design top down, but build code bottom up. This is why we again and again stress in our writing on Bayesian workflow and in our replies to user questions on forums, that it helps immensely to scaffold up a complicated model one piece at a time. This lets you know when you add something and it causes a failure.

Knowing where to draw lines between modules is, unfortunately, a matter of experience. The best way to get that experience? Read code. In the production coding world, code is read much more often than it's written. That means much more effort typically goes into production code to make it readable. This is very unlike research code, which might be written once and never read again.

There is a tradeoff here. Code is more readable with short variable names and short function names. It’s easier to apprehend the structure of the expression a * b + c**2 than meeting_time * number_of_meetings + participants**2. We need to strike a balance with not too long, but still informative variable names.

And why are beginners so afraid of wasting horizontal space while being spendthrifts on the much more valuable vertical space? I have no explanation. But I see a lot of code from math-oriented researchers that looks like this, ((a*b)/c)+3*9**2+cos(x-y). Please use spaces around operators and no more parens than are necessary to disambiguate given attachment binding.
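For instance, here's a small before-and-after in Python (the variable values are arbitrary placeholders I made up); both lines compute the same value, but the second is far easier to scan.

import math

a, b, c, x, y = 1.0, 2.0, 3.0, 0.5, 0.25  # arbitrary placeholder values

ugly = ((a*b)/c)+3*9**2+math.cos(x-y)
clean = a * b / c + 3 * 9**2 + math.cos(x - y)

assert ugly == clean  # same value, very different readability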

When should I comment?

Sometimes you’re left with no choice and have to drop in a comment as a last resort. This should be done if you’re doing something non-idiomatic with the language or coding an unusual algorithm or something very involved. In this case, a little note inline about intent and/or algebra can be helpful. That’s why commenting is sometimes called a method of last resort.

But whatever you do, comment for people who know the language better than you. Don’t write a comment that explains what a NumPy function does—that’s what the NumPy doc is for. Nobody wants to see this:

int num_observations = 513;  // declare num_observations as an integer and set equal to 513

But people who feel compelled to comment will write just this kind of thing, thinking it makes their code more professional. If you think this is a caricature, you don’t read enough code.

The other thing you don’t want to do is this:

#####################################################
################## INFERENCE CODE ###################
#####################################################
...
...
...

This is what functions are for. Write a function called inference() and call it. It will also help prevent accidental reuse of global variables, which is always a problem in scripting languages like R and Python. Don’t try to fix hundreds or thousands of lines of unstructured code with structured comments.
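Here's a minimal sketch of what I mean, with hypothetical function names rather than code from any real analysis:

import numpy as np

def simulate_data(n, slope=2.0, seed=1234):
    """Simulate a toy regression data set y = slope * x + noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)
    return x, y

def inference(x, y):
    """Least-squares slope estimate for a regression through the origin."""
    return np.dot(x, y) / np.dot(x, x)

def main():
    x, y = simulate_data(1000)
    print('estimated slope:', inference(x, y))

if __name__ == '__main__':
    main()

Each chunk that would have been fenced off with banner comments becomes a named function with its own local variables, so nothing leaks into the global scope.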

Another thing to keep in mind is that vertical space is very precious in coding, because we want to be able to see as much of the code as we can at a time without scrolling. Don’t waste vertical space with useless or even harmful comments.

Do not, and I repeat, do not use /* ... */ style comments inline with code. It’s too easy to get confused when it’s a lot of lines and it’s doubly confusing when nested. Instead, use line comments (// in C++ and Stan, # in Python and R). Use the comment-region command in emacs or whatever does the same in your IDE. With line comments, the commented out code will be very visible, as in the following example.

for (int n = 0; n < N; ++n) {
  // int x = 5
  // int y = x * x * 3;
  // int z = normal_rng(y, 1);
  z = n * 3;
}

Compare that to what I often see, which is some version of the following.

for (int n = 0; n < N; ++n) {
  /* int x = 5
  int y = x * x * 3;
  int z = normal_rng(y, 1); */
  z = n * 3;
}

In the first case, it's easy to just scan down the margin and see what's commented out.

After commenting out and fixing everything, please be a good and respectful citizen and just delete all the edited out code before merging or releasing. Dead code makes the live code hard to find and one always wonders why it's still there---was it a mistake or some future plan or what? When I first showed up at Bell Labs in the mid 1990s, I was handed a 100+ page Tcl/Tk script for running a speech recognizer and told only a few lines were active, but I'd have to figure out which ones. Don't do that!

The golden triangle

What I stressed in Andrew's class is the tight interconnection between three aspects of production code:


$latex \textrm{API Documentation} \leftrightarrow \textrm{Unit tests} \leftrightarrow \textrm{Code}$

 

The API documentation should be functionally oriented and say what the code does. It might include a note as to how it does it if that is relevant to its use. An example might be different algorithms to compute the same thing that are widely known by name and useful in different situations. The API doc should ideally be specific enough to be the basis of both unit testing and coding. So I'm not saying don't document. I'm saying don't document how inline code works, document your API's intent.

The reason I call this the "golden" triangle is the virtuous cycle it imposes. If the API doc is hard to write, you know there's a problem with the way the function has been specified or modularized. With R and Python programmers, that's often because a single function is trying to do too many things and the input types and output types become a mess of dependencies. This leads to what programmers identify as a "bad smell" in the code. If the code or the unit tests are hard to write, you know there's a problem with the API specification.

Clients (human and computational) are going to see and "feel" the API. That's where the "touch" is that designers like to talk about in physical object design. Things need to feel natural for the application, or in the words of UI/UX designers, the API needs to offer affordances (in the past, we might have said it should be intuitive). Design the API first from the client perspective. Sometimes you have to suffer on the testing side to maintain a clean and documentable API, but that clean API is your objective.

What about research code?

Research code is different. It doesn't have to be robust. It doesn't have to be written to be read by multiple people in the future. You're usually writing end-to-end tests rather than unit tests, though that can be dangerous. It still helps to develop bottom-up with testing.

What research code should be is reproducible. There should be a single script to run that generates all the output for a paper. That way, even if the code's ugly, at least the output's reproducible and someone with enough interest can work through it.

And of course, research code needs to be tested that it's doing what it's supposed to be doing. And it needs to be audited to make sure it's not "cheating" (like cross-validating a time-series, etc.).

Notebooks, Quarto, and other things that get in the way of coding and documenting

With all due respect to Donald Knuth (never a good start), literate programming is a terrible way to develop code. (On a related topic, I would totally recommend at least the beginning part of Knuth's notes on how to write math.)

I don't love them, but I use Quarto and Jupyter (née IPython) notebooks for writing reproducible tutorial material. But only after I've sorted out the code. These tools mix text and code and make too many compromises along the way to be good at either task. Arguably the worst sin is that they wind up obfuscating the code with a bunch of text. Jupyter also makes it possible to get into inconsistent states because it doesn't automatically re-run everything. Quarto is just a terrible typesetting platform, inheriting all the flaws of pandoc and citeproc, with the added joy of HTML and LaTeX interoperability and R and Python interoperability. We use it for Stan docs so that we can easily generate HTML and LaTeX, but it always feels like there should be a better way to do this, as it's a lot of trial and error due to the lack of a spec for markdown.

Evaluating samplers with reference draws

I started thinking a bit more about what we’re doing when we use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb gives you Stan programs and 10,000 reference draws from their posteriors. The reference draws must be generated by another sampler. In some cases, like Neal’s funnel or high-dimensional normals, we can just run forward simulation and get independent draws. For models with data, we don’t actually know the posterior analytically—that’s why we’re running MCMC in the first place! I think what posteriordb did was run NUTS for enough iterations that the result could be thinned down to 10,000 roughly independent draws.

Standard error in reference

The standard error for $latex N$ independent draws estimating a particular quantity like the posterior mean of a parameter is

$latex \textrm{se} = \frac{\displaystyle \textrm{sd}}{\displaystyle \sqrt{N}}$

where sd is the standard deviation of the quantity in the posterior. Standard Monte Carlo methods scale horribly due to that square root—Markov chain Monte Carlo can be even worse, because what matters is not the number of draws but the effective sample size, which can be much lower than the number of iterations for more challenging posteriors due to correlation in the draws.

Because $latex \sqrt{10,000} = 100$, with 10,000 draws our standard error is only 1/100th of the posterior standard deviation. In general, we need four times as many draws to cut the error in half, or 100 times as many draws to add a decimal place of accuracy by cutting the error by a factor of ten. That means we can’t reasonably expect something like posteriordb to store enough draws to make fine-grained convergence evaluation possible. A slightly more economical alternative would be to store long-term averages. There we could run a million iterations and save the parameter and squared-parameter estimates. We could even run a hundred million by farming out 100 jobs on a cluster. But that’s still only three or four decimal places of accuracy. After that, maybe we could wring out another decimal place or two with control variates.
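As a quick numerical illustration of that square-root scaling (nothing here is specific to posteriordb; I’m assuming a posterior standard deviation of 1):

import numpy as np

sd = 1.0  # assumed posterior standard deviation of the quantity
for n in [10_000, 1_000_000, 100_000_000]:
    print(f'N = {n:>11,}   se = sd / sqrt(N) = {sd / np.sqrt(n):.6f}')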

That’s a lot of residual error to have to deal with when evaluating an algorithm that asymptotically approaches the correct answer. The only thing that saves us is that the algorithms we’re evaluating are also samplers and are subject to the same inverse square root rate of reduction in standard error.

Report error in evaluations

What we really want to report is the standard error of the residual between our new algorithm’s estimate and the truth. But we don’t usually have the truth. In practice, we usually do one of two things, depending on how compute-intensive it is to fit the model and on our compute resources and patience.

Easy model: Run the new algorithm 100 times or even 1000 times and report something like a histogram of errors. With this many runs, we can directly estimate the standard error (assuming our draws are unbiased), and we can estimate it even more tightly and without assumptions if we know the true values.

Hard model: Run the new algorithm once or twice, use an estimated standard error for the new algorithm, and report that instead of an empirical histogram of errors. This suffers from estimation error in the effective sample size, which can be large.

The problem with both of these approaches is that they only report variability of the new algorithm, treating the reference draws as true. We should go one step further and fold the error in posteriordb or another reference database into the uncertainty reporting for the error of the new algorithm. The additional error could be handled with simulation (and maybe analytically—normals are nice that way).
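For a single posterior mean, a minimal version of that extra accounting might look like this; it’s my sketch, assuming the new sampler’s estimate and the reference estimate are roughly unbiased and independent, so their errors add in quadrature.

import numpy as np

sd = 1.0            # posterior sd of the quantity (assumed)
ess_new = 50_000    # effective sample size of the new sampler
n_ref = 10_000      # number of reference draws in posteriordb

se_new = sd / np.sqrt(ess_new)
se_ref = sd / np.sqrt(n_ref)
se_diff = np.sqrt(se_new**2 + se_ref**2)  # se of (new estimate - reference estimate)

print(f'se_new = {se_new:.4f}, se_ref = {se_ref:.4f}, se of difference = {se_diff:.4f}')

Once the new sampler’s effective sample size passes the number of reference draws, the combined error is dominated by the reference error, which is the issue I run into below.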

That’s easy enough to do when reporting single univariate means. I do not know how to do this for evaluating error in something like KL-divergence, which itself involves a Monte Carlo integration, or something like Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wasserstein distance algorithm. I have to admit I don’t even know how that algorithm works.

Evaluating delayed rejection

I ran into this problem because I want to evaluate our delayed rejection HMC and show that the error goes to zero as I take more draws. Specifically, I want to extend the illustration in the paper that shows that standard HMC and NUTS fail, but DR-HMC succeeds, at sampling from multiscale distributions like Neal’s funnel in its centered parameterization.

There are two fundamental problems with this.

Problem One: By the time we get past an effective sample size of 10,000 from the new algorithm, the error we’re estimating is dominated by the standard error of posteriordb. We don’t even know the real error of posteriordb, only an estimated standard error assuming the draws are independent.

Problem Two: We have no guarantees that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal’s funnel, there is no step size for HMC that will properly explore the posterior—each step size picks out a range of log scale parameters it can explore (as we demonstrate in the delayed rejection paper). The standard error estimate from the reference draws assumes independent draws from the posterior, not biased draws. If the posteriordb draws are biased, the error reported for a perfect sampler will not go to zero as the number of posteriordb and new-sampler draws goes to infinity.

Help?

Does anyone have any suggestions of what we can do in the face of potentially biased and stochastic samples for evaluation? I could wait until the Stan meeting on Thursday and ask Aki, who’s a font of inspiration on topics like this, but I thought I’d ask a wider group and see how that goes.

“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?”

Alexey Guzey asks:

How much have you thought about AI and when will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?

My first reply: I guess that AI can already do better science than Matthew “Sleeplord” Walker, Brian “Pizzagate” Wansink, Marc “Schoolmarm” Hauser, or Satoshi “Freakonomics” Kanazawa. So some humans are already obsolete, when it comes to producing science.

OK, let me think a bit more. I guess it depends on what kind of scientific research we’re talking about. Lots of research can be automated, and I could easily imagine an AI that can do routine analysis of A/B tests better than a human could. Indeed, thinking of how the AI could do this is a good way to improve how humans currently do things.

For bigger-picture research, I don’t see AI doing much. But a big problem now with human research is that human researchers want to take routine research and promote it as big-picture (see Walker, Wansink, Kanazawa, etc.). I guess that an AI could be programmed to do hype and create Ted talk scripts.

Guzey’s response:

What’s “routine research”? Would someone without a college degree be able to do it? Is routine research simply defined as such that can be done by a computer now?

My reply: I guess the computer couldn’t really do the research, as that would require filling test tubes or whatever. I’m thinking that the computer could set up the parameters of an experiment, evaluate measurements, choose sample sizes, write up the analysis, etc. It would have to be some computer program that someone writes. If you just fed the scientific literature into a chatbot, I guess you’d just get millions more crap papers, basically reproducing much of what is bad about the literature now, which is the creation of articles that give the appearance of originality and relevance while actually being empty in content.

But, now that I’m writing this, I think Guzey is asking something slightly different: he wants to know when a general purpose “scientist” computer could be written, kind of like a Roomba or a self-driving car, but instead of driving around, it would read the literature, perform some sort of sophisticated meta-analyses, and come up with research ideas, like “Run an experiment on 500 people testing manipulations A and B, measure pre-treatment variables U and V, and look at outcomes X and Y.” I guess the first step would be to try to build such a system in a narrow environment such as testing certain compounds that are intended to kill bacteria or whatever.

I don’t know. On one hand, even the narrow version of this problem sounds really hard; on the other hand, our standards for publishable research are so low that it doesn’t seem like it would be so difficult to write a computer program that can fake it.

Maybe the most promising area of computer-designed research would be in designing new algorithms, because there the computer could actually perform the experiment; no laboratory or test tubes required, so the experiments can be run automatically and the computer could try millions of different things.

They solved the human-statistical reasoning interface back in the 80s

Eytan Adar pointed me to a video, Reasoning Under Uncertainty, from the historical ACM SIGCHI (computer-human interaction) Video Project. Beyond fashionable hairstyles, it demos interfaces from a software curriculum to teach high school students statistical reasoning in 1989. They’ve got stretchy histograms, where you can adjust bars to see how the moments change (reminding me of more recent work on sketch-based and other graphical elicitation interfaces, including some of my own), and shifty lines, where you drag a line trying to find the best fit, as in more recent work on regression by eye. Ben Shneiderman’s interface design guidelines came up recently on the blog, and both are nice early examples of interacting with data through direct manipulation, one of the ideas he pioneered. They were also using hypothetical outcome plots years before we attempted to make them cool! 

Sometimes I suspect the majority of the good ideas for statistical software were already out there back in the 80s and 90s, at least in pedagogical form. I’m thinking about work like Andreas Buja and colleagues’ “Elements of a viewing pipeline for data analysis” (from 1988) and their early mentions of graphical inference techniques like the line-up in the 90s. There’s also the ASA Statistical Graphics video library which includes a lot of cool early work.

It’s too bad that many of the mainstream GUI-based interactive visualization systems we see today dropped many of the ideas related to sampling and uncertainty in favor of what we’ve called  “data exposure” (making it as immediate as possible for the user to see and interact with the data itself). Much of the research too, at least in the more computer-science oriented data visualization community, seems more interested in pursuing variants of an exposure philosophy like “behavior-driven optimization”, where the system is trying to learn what data or queries the user might want to see next so as to serve those up. The idea of GUI-based visualization tools that let you simulate processes, for example, has never really caught on (though we’re still trying, e.g., here). 

P.S. I haven’t watched the video that follows it, but it promises to describe ‘The Andrew System’, a computing system that can offer the benefits of many diverse and useful systems within a common environment. Maybe they were on to something there too!

Progress in 2023, Leo edition

Following Andrew, Aki, Jessica, and Charles, and based on Andrew’s proposal, I list my research contributions for 2023.

Published:

  1. Egidi, L. (2023). Seconder of the vote of thanks to Narayanan, Kosmidis, and Dellaportas and contribution to the Discussion of ‘Flexible marked spatio-temporal point processes with applications to event sequences from association football’. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5), 1129.
  2. Marzi, G., Balzano, M., Egidi, L., & Magrini, A. (2023). CLC Estimator: a tool for latent construct estimation via congeneric approaches in survey research. Multivariate Behavioral Research, 58(6), 1160-1164.
  3. Egidi, L., Pauli, F., Torelli, N., & Zaccarin, S. (2023). Clustering spatial networks through latent mixture models. Journal of the Royal Statistical Society Series A: Statistics in Society, 186(1), 137-156.
  4. Egidi, L., & Ntzoufras, I. (2023). Predictive Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 929-934). Pearson.
  5. Macrì Demartino, R., Egidi, L., & Torelli, N. (2023). Power priors elicitation through Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 923-928). Pearson.

Preprints:

  1. Consonni, G., & Egidi, L. (2023). Assessing replication success via skeptical mixture priors. arXiv preprint arXiv:2401.00257. Submitted.

Software:

    CLC estimator

  • free and open-source app to estimate latent unidimensional constructs via congeneric approaches in survey research (Marzi et al., 2023)

   footBayes package (CRAN version 0.2.0)

   pivmet package (CRAN version 0.5.0)

I hope and guess that the paper dealing with the replication crisis, “Assessing replication success via skeptical mixture priors” with Guido Consonni, could have good potential in the Bayesian assessment of replication success in the social and hard sciences; this paper can be seen as an extension of the paper written by Leonhard Held and Samuel Pawel entitled “The Sceptical Bayes Factor for the Assessment of Replication Success”. Moreover, I am glad that the paper “Clustering spatial networks through latent mixture models”, focused on a model-based clustering approach defined in a hybrid latent space, has finally been published in JRSS A.

Regarding software, the footBayes package, a tool to fit the most well-known soccer (football) models through Stan and maximum likelihood methods, has been substantially developed and enriched with new functionality (2024 objective: incorporate CmdStan with VI/Pathfinder algorithms and write a package paper in JSS/R Journal format).

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, and then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

Progress in 2023, Charles edition

Following the examples of Andrew, Aki, and Jessica, and at Andrew’s request:

Published:

Unpublished:

This year, I also served on the Stan Governing Body, where my primary role was to help bring back the in-person StanCon. StanCon 2023 took place at Washington University in St. Louis, MO, and we got the ball rolling for the 2024 edition, which will be held at Oxford University in the UK.

It was also my privilege to be invited as an instructor at the Summer School on Advanced Bayesian Methods at KU Leuven, Belgium and teach a 3-day course on Stan and Torsten, as well as teach workshops at StanCon 2023 and at the University of Buffalo.

Progress in 2023, Aki’s software edition

Andrew, Jessica, and I (and I hope we get more) have listed papers for progress in 2023, but many papers would be much less useful without software, so I’m also listing the software I contribute to, with the most interesting improvements added in 2023. (In addition, there is always a huge amount of work that improves the software somehow but is not that visible.)

Stan (including Stan math + Stan core + Stanc + CmdStan)

posterior R package

loo R package

projpred R package

  • augmented-data projection to add support for more model families (Weber et al., 2023)
  • latent projection to add support for more model families (Catalina et al., 2021)
  • enhanced verbose output
  • improved user interface and summary tables

P.S. I was surprised that there were no major updates to priorsense R package or bayesplot R package in 2023, but there are some great usability improvements coming soon to both of these.

Prediction isn’t everything, but everything is prediction


This is Leo.

Explanation, or explanatory modeling, can be considered the use of statistical models for testing causal hypotheses or associations, e.g., between a set of covariates and a response variable. Prediction, or predictive modeling, on the other hand, is (supposedly) the act of using a model—or device, or algorithm—to produce values of new, existing, or future observations. A lot has been written about the similarities and differences between explanation and prediction, for example by Breiman (2001), Shmueli (2010), Billheimer (2019), and many more.

These are often thought to be separate dimensions of statistics, but Jonah and I have been discussing for a long time the idea that, in some sense, there may actually be no such thing as explanation without prediction. Basically, although prediction itself is not the only goal of inferential statistics, everything—every objective—in inferential statistics can be reframed through the lens of prediction.

Hypothesis testing, ability estimation, hierarchical modeling, treatment effect estimation, causal inference problems, etc., can all, in our opinion, be described from an (inferential) predictive perspective. So far we have not found an example that cannot be reframed as a prediction problem. So I ask you: is there any inferential statistical ambition that cannot be described in predictive terms?

P.S. Like Billheimer (2019) and others, we think that inferential statistics should be considered inherently predictive and should focus primarily on probabilistic predictions of observable events and quantities, rather than on statistical estimates of unobservable parameters that do not exist outside of our highly contrived models. Similarly, we also feel that the goal of Bayesian modeling should not be taught to students as finding the posterior distribution of unobservables, but rather as finding the posterior predictive distribution of the observables (with finding the posterior as an intermediate step); even when we don’t care only about predictive accuracy and we still care about understanding how a model works (model checking, GoF measures), we think the predictive modeling interpretation is generally more intuitive and effective.

Ben Shneiderman’s Golden Rules of Interface Design

The legendary computer science and graphics researcher writes:

1. Strive for consistency.

Consistent sequences of actions should be required in similar situations; identical terminology should be used in prompts, menus, and help screens; and consistent color, layout, capitalization, fonts, and so on, should be employed throughout. Exceptions, such as required confirmation of the delete command or no echoing of passwords, should be comprehensible and limited in number.

2. Seek universal usability.

Recognize the needs of diverse users and design for plasticity, facilitating transformation of content. Novice to expert differences, age ranges, disabilities, international variations, and technological diversity each enrich the spectrum of requirements that guides design. Adding features for novices, such as explanations, and features for experts, such as shortcuts and faster pacing, enriches the interface design and improves perceived quality.

3. Offer informative feedback.

For every user action, there should be an interface feedback. For frequent and minor actions, the response can be modest, whereas for infrequent and major actions, the response should be more substantial. Visual presentation of the objects of interest provides a convenient environment for showing changes explicitly.

4. Design dialogs to yield closure.

Sequences of actions should be organized into groups with a beginning, middle, and end. Informative feedback at the completion of a group of actions gives users the satisfaction of accomplishment, a sense of relief, a signal to drop contingency plans from their minds, and an indicator to prepare for the next group of actions. For example, e-commerce websites move users from selecting products to the checkout, ending with a clear confirmation page that completes the transaction.

5. Prevent errors.

As much as possible, design the interface so that users cannot make serious errors; for example, gray out menu items that are not appropriate and do not allow alphabetic characters in numeric entry fields. If users make an error, the interface should offer simple, constructive, and specific instructions for recovery. For example, users should not have to retype an entire name-address form if they enter an invalid zip code but rather should be guided to repair only the faulty part. Erroneous actions should leave the interface state unchanged, or the interface should give instructions about restoring the state.

6. Permit easy reversal of actions.

As much as possible, actions should be reversible. This feature relieves anxiety, since users know that errors can be undone, and encourages exploration of unfamiliar options. The units of reversibility may be a single action, a data-entry task, or a complete group of actions, such as entry of a name-address block.

7. Keep users in control.

Experienced users strongly desire the sense that they are in charge of the interface and that the interface responds to their actions. They don’t want surprises or changes in familiar behavior, and they are annoyed by tedious data-entry sequences, difficulty in obtaining necessary information, and inability to produce their desired result.

8. Reduce short-term memory load.

Humans’ limited capacity for information processing in short-term memory (the rule of thumb is that people can remember “seven plus or minus two chunks” of information) requires that designers avoid interfaces in which users must remember information from one display and then use that information on another display. It means that cellphones should not require reentry of phone numbers, website locations should remain visible, and lengthy forms should be compacted to fit a single display.

Wonderful, wonderful stuff. When coming across this, I saw that Shneiderman taught at the University of Maryland . . . checking his CV, it turns out that he taught there back when I was a student. I could’ve taken his course!

It would be interesting to come up with similar sets of principles for statistical software, statistical graphics, etc. We do have 10 quick tips to improve your regression modeling, so that’s a start.

Progress in 2023

Published:

Unpublished:

Enjoy.

Hey wassup Detroit Pistons? What’s gonna happen for the rest of the season? Let’s get (kinda) Bayesian. With graphs and code (but not a lot of data; sorry):

Paul Campos points us to this discussion of the record of the Detroit professional basketball team:

The Detroit Pistons broke the NBA record for most consecutive losses in a season last night, with their 27th loss in a row. . . . A team’s record is, roughly speaking, a function of two factors:

(1) The team’s quality. By “quality” I mean everything about the team’s performance that isn’t an outcome of random factors, aka luck — the ability of the players, individually and collectively, the quality of the coaching, and the quality of the team’s management, for example.

(2) Random factors, aka luck.

The relative importance of luck and skill?

The above-linked post continues:

How do we disentangle the relative importance of these two factors when evaluating a team’s performance to some point in the season? . . . The best predictor ex ante of team performance is the evaluation of people who gamble on that performance. I realize that occasionally gambling odds include significant inefficiencies, in the form of the betting public making sentimental rather than coldly rational wagers, but this is very much the exception rather than the rule. . . . the even money over/under for Detroit’s eventual winning percentage this season was, before the first game was played, a winning percentage of .340. To this point, a little more than third of the way through the season, Detroit’s winning percentage has been .0666. . . .

To the extent that the team has had unusually bad luck, then one would expect the team’s final record to be better. But how much better? Here we can again turn to the savants of Las Vegas et. al., who currently set the even money odds of the team’s final record on the basis of the assumption that it will have a .170 winning percentage in its remaining games.

Campos shares a purported Bayesian analysis and summarizes, “if we have just two pieces of information — a prior assumption of a .340 team, and the subsequent information of a .066 performance through thirty games — the combination of these two pieces of information yields a posterior prediction of a .170 winning percentage going forward, which remarkably enough is exactly what the current gambling odds predict! . . . it appears that the estimate being made by professional gamblers is that about two-thirds of Detroit’s worse than expected record is a product of an ex ante overestimate of the team’s quality, while the other third is assumed to be accounted for by bad luck.”

I think that last statement is coming from the fact that (1/3)*0.340 + (2/3)*0.067 = 0.158, which is close to 0.170.
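
As a quick check on that arithmetic (a back-of-the-envelope sketch; the 0.340, 0.067, and 0.170 figures are the ones quoted above):

# back out the implied weight on the preseason line, assuming the
# rest-of-season prediction is a weighted average of prior and data
prior <- 0.340                   # preseason over/under winning percentage
obs <- 2/30                      # record so far, about 0.067
post <- 0.170                    # implied rest-of-season winning percentage
(1/3)*prior + (2/3)*obs          # 0.158, in the neighborhood of 0.170
(post - obs)/(prior - obs)       # implied weight on the prior, about 0.38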

I don’t quite follow his Bayesian logic. But never mind about that for now.

As I said, I didn’t quite follow the Bayesian logic shared by Campos. Here’s my problem. He posts this graph:

I think I understand the “No_Prior_Info” curve in the graph: that’s the y ~ binomial(n, p) likelihood for p, given the data n=30, y=2. But I don’t understand where the “Prior” and “Posterior” curves come from. I guess the Prior distribution has a mean of 0.340 and the Posterior distribution has a mean of 0.170, but where are the widths of these curves coming from?

Part of the confusion here is that we’re dealing with inference for p (the team’s “quality,” as summarized by the probability that they’d win against a randomly-chosen opponent on a random day) and also with predictions of outcomes. For the posterior mean, there’s no difference: under the basic model, the posterior expected proportion of future games won is equal to the posterior mean of p. It gets trickier when we talk about uncertainty in p.
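
To illustrate that distinction, here is a little simulation sketch using a made-up beta distribution for p (none of these numbers come from the betting data):

# the expected share of the remaining 52 games won equals the posterior
# mean of p, but the predictive sd of that share exceeds the sd of p alone
set.seed(1)
p_draws <- rbeta(1e5, 10, 40)           # hypothetical posterior for p
future_share <- rbinom(1e5, 52, p_draws)/52
c(mean(future_share), mean(p_draws))    # both about 0.20
c(sd(future_share), sd(p_draws))        # predictive sd is larger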

How, then, could we take the beginning-of-season and current betting lines–which we will, for the purposes of our discussion here, identify as the prior and posterior means of p, ignoring systematic biases of bettors–and extract implied prior and posterior distributions? There’s surely enough information here to do this, if we use information from all 30 teams and calibrate properly.

Exploratory analysis

I started by going to the internet, finding various sources on betting odds, team records, and score differentials, and entering the data into this file. The latest Vegas odds I could find on season records were from 19 Dec; everything else came from 27 Dec.

Next step was to make some graphs. First, I looked at point differential and team records so far:

nba <- read.table("nba2023.txt", header=TRUE, skip=1)
nba$ppg <- nba$avg_points
nba$ppg_a <- nba$avg_points_opponent
nba$ppg_diff <- nba$ppg - nba$ppg_a
nba$record <- nba$win_fraction
nba$start_odds <- nba$over_under_beginning/82
nba$dec_odds <- nba$over_under_as_of_dec/82
nba$sched <- - (nba$schedule_strength - mean(nba$schedule_strength)) # signed so that positive value implies a more difficult schedule so far in season
nba$future_odds <- (82*nba$dec_odds - 30*nba$record)/52

pdf("nba2023_1.pdf", height=3.5, width=10)
par(mfrow=c(1,2), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, rng, xlab="Points per game allowed", ylab="Points per game scored", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$ppg, nba$team, col="blue")
#
par(pty="m")
plot(nba$ppg_diff, nba$record, xlab="Point differential", ylab="Won/lost record so far", bty="l", type="n")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Points per game and won-lost record as of 27 Dec", line=.5, side=3, outer=TRUE)
dev.off()

Here's a question you should always ask yourself: What do you expect to see?

Before performing any statistical analysis it's good practice to anticipate the results. So what do you think these graphs will look like?
- Ppg scored vs. ppg allowed. What do you expect to see? Before making the graph, I could have imagined it going either way: you might expect a negative correlation, with some teams doing the run-and-gun and others the physical game, or you might expect a positive correlation, because some teams are just much better than others. My impression is that team styles don't vary as much as they used to, so I was guessing a positive correlation.
- Won/lost record vs. point differential. What do you expect to see? Before making the graph, I was expecting a high correlation. Indeed, if I could only use one of these two metrics to estimate a team's ability, I'd be inclined to use point differential.

Aaaand, here's what we found:

Hey, my intuition worked on these! It would be interesting to see data from other years to see if I just got lucky with that first one.

Which is a better predictor of won-loss record: ppg scored or ppg allowed?

OK, this is a slight distraction from Campos's question, but now I'm wondering, which is a better predictor of won-loss record: ppg scored or ppg allowed? From basic principles I'm guessing they're about equally good.

Let's do a couple of graphs:

pdf("nba2023_2.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, range(nba$record), xlab="Points per game scored", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(rng, range(nba$record), xlab="Points per game allowed", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(range(nba$ppg_diff), range(nba$record), xlab="Avg score differential", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Predicting won-loss record from ppg, ppg allowed, and differential", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

So, about what we expected. To round it out, let's try some regressions:

library("rstanarm")
print(stan_glm(record ~ ppg, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg_a, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg + ppg_a, data=nba, refresh=0), digits=3)

The results:

            Median MAD_SD
(Intercept) -1.848  0.727
ppg          0.020  0.006

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.162  0.021 
------
            Median MAD_SD
(Intercept)  3.192  0.597
ppg_a       -0.023  0.005

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.146  0.019 
------
            Median MAD_SD
(Intercept)  0.691  0.335
ppg          0.029  0.002
ppg_a       -0.030  0.002

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.061  0.008

So, yeah, points scored and points allowed are about equal as predictors of won-loss record. Given that, it makes sense to recode as ppg differential and total points:

print(stan_glm(record ~ ppg_diff + I(ppg + ppg_a), data=nba, refresh=0), digits=3)

Here's what we get:

               Median MAD_SD
(Intercept)     0.695  0.346
ppg_diff        0.029  0.002
I(ppg + ppg_a) -0.001  0.001

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.062  0.009

Check. Once we include ppg_diff as a predictor, the average total points doesn't do much of anything. Again, it would be good to check with data from other seasons, as 30 games per team does not supply much of a sample.

Now on to the betting lines

Let's now include the Vegas over-unders in our analysis. First, some graphs:

pdf("nba2023_3.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$start_odds, nba$record)
plot(rng, rng, xlab="Betting line at start", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$record, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$record, nba$dec_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$record, nba$dec_odds, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$start_odds, nba$dec_odds)
plot(rng, rng, xlab="Betting line at start", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$dec_odds, nba$team, col="blue")
#
mtext("Won-lost record and over-under at start and in Dec", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

Oops--I forgot to make some predictions before looking. In any case, the first graph is kinda surprising. You'd expect to see an approximate pattern of E(y|x) = x, and we do see that--but not at the low end. The teams that were predicted to do the worst this year are doing even worse than expected. It would be interesting to see the corresponding graph for earlier years. My guess is that this year is special, not only in the worst teams doing so badly, but in them underperforming their low expectations.

The second graph is as one might anticipate: Bettors are predicting some regression toward the mean. Not much, though! And the third graph doesn't tell us much beyond the first graph.

Upon reflection, I'm finding the second graph difficult to interpret. The trouble is that "Betting line in Dec" is the forecast win percentage for the year, but 30/82 of that is the existing win percentage. (OK, not every team has played exactly 30 games, but close enough.) What I want to do is just look at the forecast for their win percentage for the rest of the season:

pdf("nba2023_4.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$record, nba$dec_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line of record for rest of season", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
fit <- coef(stan_glm(future_odds ~ record, data=nba, refresh=0))
print(fit)
abline(fit, lwd=.5, col="blue")
text(nba$record, nba$future_odds, nba$team, col="blue")
#
dev.off()

Here's the graph:

The fitted regression line has a slope of 0.66:

            Median MAD_SD
(Intercept) 0.17   0.03  
record      0.66   0.05  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.05   0.01 

Next step is to predict the Vegas prediction for the rest of the season given the initial prediction and the team's record so far:

print(stan_glm(future_odds ~ start_odds + record, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) -0.02   0.03 
start_odds   0.66   0.10 
record       0.37   0.06 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00  

It's funny--everywhere we look, we see this 0.66. And 30 games is 37% of the season!

Now let's add into the regression the points-per-game differential, as this should include additional information beyond what was in the won-loss so far:

print(stan_glm(future_odds ~ start_odds + record + ppg_diff, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) 0.06   0.06  
start_odds  0.67   0.09  
record      0.20   0.11  
ppg_diff    0.01   0.00  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00 

Hard to interpret this one, as ppg_diff is on a different scale from the rest. Let's quickly standardize it to be on the same scale as the won-lost record so far:

nba$ppg_diff_std <- nba$ppg_diff * sd(nba$record) / sd(nba$ppg_diff)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.67   0.09  
record       0.20   0.11  
ppg_diff_std 0.17   0.10  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00  

OK, not enough data to cleanly disentangle won-lost record and point differential as predictors here. My intuition would be that, once you have point differential, the won-lost record tells you very little about what will happen in the future. The above fitted model is consistent with that intuition, but it's also consistent with the two predictors being equally important, and indeed with point differential being irrelevant conditional on won-lost record.

What we'd want to do here--and I know I'm repeating myself--is to repeat the analysis using data from previous years.

Interpreting the implied Vegas prediction for the rest of the season as an approximate weighted average of the preseason prediction and the current won-lost record

In any case, the weighting seems clear: approx two-thirds from starting odds and one-third from the record so far, which at least on a naive level seems reasonable, given that the season is about one-third over.

Just for laffs, we can also throw in difficulty of schedule, as that could alter our interpretation of the teams' records so far.

nba$sched_std <- nba$sched * sd(nba$record) / sd(nba$sched)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.68   0.09  
record       0.21   0.11  
ppg_diff_std 0.17   0.10  
sched_std    0.04   0.03 

So, strength of schedule does not supply much information. This makes sense, given that 30 games is enough for the teams' schedules to mostly average out.

The residuals

Now that I've fit the regression, I'm curious about the residuals. Let's look:

fit_5 <- stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0)
fitted_5 <- fitted(fit_5)
resid_5 <- resid(fit_5)
#
pdf("nba2023_5.pdf", height=5, width=8)
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
plot(fitted_5, resid_5, xlab="Vegas prediction of rest-of-season record", ylab="Residual from fitted model", bty="l", type="n")
abline(0, 0, lwd=.5, col="gray")
text(fitted_5, resid_5, nba$team, col="blue")
#
dev.off()

And here's the graph:

The residual for Detroit is negative (-0.05*52 = -2.6, so the Pistons are expected to win about 3 fewer games than their regression prediction based on prior odds and the outcome of the first 30 games). Cleveland and Boston are also expected to do a bit worse than the model would predict. In the other direction, Vegas is predicting that Memphis will win about 4 games more than predicted from the regression model.

I have no idea whassup with Memphis. The quick generic answer is that the regression model is crude, and bettors have other information not included in the regression.

Reverse engineering an implicit Bayesian prior

OK, now for the Bayesian analysis. As noted above, we aren't given a prior for team j's average win probability, p_j; we're just given a prior point estimate of each p_j.

But we can use the empirical prior-to-posterior transformation, along with the known likelihood function, under the simplifying assumption that the 30 win-loss outcomes for each team j are independent with constant probability p_j. This assumption is obviously wrong, given that teams are playing each other, but let's just go with it here, recognizing that with full data it would be straightforward to extend to an item-response model with an ability parameter for each team (as here).

To continue, the above regression models show that the Vegas "posterior Bayesian" prediction of p_j after 30 games is approximately a weighted average of 0.65*(prior prediction) + 0.35*(data won-loss record). From basic Bayesian algebra (see, for example, chapter 2 of BDA), this tells us that the prior has about 65/35 as much information as data from 30 games. So, informationally, the prior is equivalent to the information from (65/35)*30 = 56 games, about two-thirds of a season worth of information.
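
Spelled out in code (the 0.65/0.35 split is the approximate regression estimate from above; the "equivalent games" framing is just the usual precision-weighting algebra):

# if the posterior mean puts weight w on the prior and 1 - w on data from
# n games, and weights are proportional to information, then the prior is
# worth n*w/(1 - w) games of data
w <- 0.65
n <- 30
n*w/(1 - w)   # about 56 games, roughly two-thirds of an 82-game season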

Hey--what happened??

But, wait! That approximate 2/3 weighting for the prior and 1/3 weighting of the data from 30 games is the opposite of what Campos reported, which was a 1/3 weighting of the prior and 2/3 of the data. Recall: prior estimated win probability of 0.340, data win rate of 0.067, take (1/3)*0.340 + (2/3)*0.067 and you get 0.158, which isn't far from the implied posterior estimate of 0.170.

What happened here is that the Pistons are an unusual case, partly because the Vegas over-under for their season win record is a few percentage points lower than the linear model predicted, and partly because when the probability is low, a small percentage-point change in the probability corresponds to a big change in the implicit weights.

Again, it would be good to check all this with data from other years.

Skill and luck

There's one more loose end, and that's Campos taking the weights assigned to data and prior and characterizing them as "skill" and "luck" in prediction errors. I didn't follow that part of the reasoning at all so I'll just let it go for now. Part of the problem here is that in one place Campos seems to be talking about skill and luck as contributors to the team's record, and in another place he seems to be considering them as contributors to the difference between preseason predictions and actual outcomes.

One way to think about skill and luck in a way that makes sense to me is within an item-response-style model in which the game outcome is a stochastic function of team abilities and predictable factors. For example, in the model,

score differential = ability of home team - ability of away team + home-field advantage + error,

the team abilities are in the "skill" category and the error is in the "luck" category, and, ummm, I guess home-field advantage counts as "skill" too? OK, it's not so clear that the error in the model should all be called "luck." If a team plays better against a specific opponent by devising a specific offensive/defensive plan, that's skill, but it would pop up in the error term above.

In any case, once we've defined what is skill and what is luck, we can partition the variance of the total to assign percentages to each.

Another way of looking at this is to consider the extreme case of pure luck. If outcomes were determined only by luck, then each game is a coin flip, and we'd see this in the data because the team win proportions after 30 games would follow a binomial distribution with n=30 and p=0.5. The actual team win proportions have mean 0.5 (of course) and sd 0.18, as compared to the theoretical mean of 0.5 and sd of 0.5/sqrt(30) = 0.09. That simple calculation suggests that the variance in teams' records is about (0.18/0.09)^2 = 4 times what luck alone would produce, so skill accounts for roughly three times as much variance as luck in determining the outcome of 30 games.
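
Here's a quick simulation version of that benchmark (the 0.18 is the observed sd of the 30 teams' win proportions, as stated above):

# under pure luck, each team's 30-game record is binomial(30, 0.5), so the
# sd of win proportions across teams should be about 0.5/sqrt(30) = 0.09
set.seed(1)
luck_sd <- replicate(1e4, sd(rbinom(30, 30, 0.5)/30))
mean(luck_sd)                       # about 0.09
observed_sd <- 0.18
(observed_sd/0.09)^2                # total variance is about 4 times the luck-only variance
(observed_sd^2 - 0.09^2)/0.09^2     # so the skill variance is roughly 3 times the luck variance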

And maybe I'm just getting this all tangled up myself. The first shot at any statistical analysis often will have some mix of errors in data, modeling, computing, and general understanding, with that last bit corresponding to the challenge of mapping from substantive concepts to mathematical and statistical models. Some mixture of skill and luck, I guess.

Summary

1. Data are king. In the immortal words of Hal Stern, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use. I could do more than Campos did, not so much because of my knowledge of Bayesian statistics but because I was using data from all 30 teams.

2. To continue with that point, you can do lots better than me by including data from other years.

3. Transparency is good. All my data and code are above. I might well have made some mistakes in my analyses, and, in any case, many loose ends remain.

4. Basketball isn't so important (hot hand aside). The idea of backing out an effective prior by looking at information updating, that's a more general idea worth studying further. This little example is a good entry point into the potential challenge of such studies.

5. Models can be useful, not just for prediction but also for understanding, as we saw for the problem of partitioning outcomes into skill and luck.

P.S. Last week, when the Pistons were 2-25 or something like that, I was talking with someone who's a big sports fan but not into analytics, the kind of person who Bill James talked about when he said that people interpret statistics as words that describe a situation rather than as numbers that can be added, subtracted, multiplied, and divided. The person I was talking with predicted that the Pistons would win no more than 6 games this year. I gave the statistical argument why this was unlikely: (a) historically there's been regression to the mean, with an improving record among the teams that have been doing the worst and an average decline among the teams at the top of the standings, (b) if a team does unexpectedly poorly, you can attribute some of that to luck. Also, 2/30 = 0.067, and 5/82 = 0.061, so if you bet that the Pistons will win no more than 6 games this season, you're actually predicting they might do worse in the rest of the season. All they need to do is get lucky in 5 of the remaining games. He said, yeah, sure, but they don't look like they can do it. Also, now all the other teams are trying extra hard because nobody wants to be the team that loses to the Pistons. OK, maybe. . . .

Following Campos, I'll just go with the current Vegas odds and give a point prediction the Pistons will end the season with about 11 wins.

P.P.S. Also related is a post from a few years back, “The Warriors suck”: A Bayesian exploration.

P.P.P.S. Unrelatedly, except for the Michigan connection, I recommend these two posts from a couple years ago:

What is fame? The perspective from Niles, Michigan. Including an irrelevant anecdote about “the man who invented hedging”

and

Not only did this guy not hold the world record in the 100 meter or 110-yard dash for 35 years, he didn’t even run the 110-yard dash in 10.8 seconds, nor did he see a million patients, nor was he on the Michigan football and track teams, nor did Michigan even have a track team when he attended the university. It seems likely that he did know Jack Dempsey, though.

Enjoy.

Explainable AI works, but only when we don’t need it

This is Jessica. I went to NeurIPS last week, mostly to see what it was like. While waiting for my flight home at the airport I caught a talk that did a nice job of articulating some fundamental limitations with attempts to make deep machine learning models “interpretable” or “explainable.”

It was part of an XAI workshop. My intentions in checking out the XAI workshop were not entirely pure, as it’s an area I’ve been skeptical of for a while. Formalizing aspects of statistical communication is very much in line with my interests, but I tried and failed to get into XAI and related work on interpretability a few years ago when it was getting popular. The ML contributions have always struck me as more of an academic exercise than a real attempt at aligning human expectations with model capabilities. When human-computer interaction people started looking into it, there started to be a little more attention to how people actually use explanations, but the methods used to study human reliance on explanations there have not been well grounded (e.g., ‘appropriate reliance’ is often defined as agreeing with the AI when it’s right and not agreeing when it’s wrong, which can be shown to be incoherent in various ways).

The talk, by Ulrike Luxburg, which gave a sort of impossibility result for explainable AI, was refreshing. First, she distinguished two very different scenarios for explanation: the cooperative ones where you have a principal with a model furnishing the explanations and a user using them who both want the best quality/most accurate explanations, versus adversarial scenarios where you have a principal whose best interests are not aligned with the goal of accurate explanation. For example, a company that needs to explain why it denied someone a loan has little motivation to explain the actual reason behind that prediction, because it’s not in its best interest to give people fodder to then try to minimally change their features to push the prediction to a different label. Her first point was that there is little value in trying to guarantee good explanations in the adversarial case, because existing explanation techniques (e.g., for feature attribution like SHAP or LIME) give very different explanations for the same prediction, and the same explanation technique is often highly sensitive to small differences in the function to be explained (e.g., slight changes to parameters in training). There are too many degrees of freedom in selecting among inductive biases, so the principal can easily produce something faithful by some definition while hiding important information. Hence laws guaranteeing a right to explanation miss this point.

In the cooperative setting, maybe there is hope. But, it turns out, something like the anthropic principle of statistics operates here: we have techniques that we can show work well in the simple scenarios where we don’t really need explanations, but when we really do need them (e.g., deep neural nets over high dimensional feature spaces), anything we can guarantee is not going to be of much use.

There’s an analogy to clustering: back when unsupervised learning was very hot, everyone wanted guarantees for clustering algorithms, but providing them required working in settings where the assumptions were so strong that the clusters would be obvious upon inspecting the data. In explainable AI, we have various feature attribution methods that describe which features led to the prediction on a particular instance. SHAP, which borrows Shapley values from game theory to allocate credit among features, is very popular. Typically SHAP provides the marginal contribution of each feature, but Shapley Interaction Values have been proposed to allow for local interaction effects between pairs of features. Luxburg presented a theoretical result from this paper, which extends Shapley Interaction Values to n-Shapley Values: explanations of individual predictions with interaction terms up to order n, given some number of total features d. They are additive in that they always sum to the output of the function we’re trying to explain over all subsets of variables of order less than or equal to n. Starting from the original Shapley values (where n=1), n-Shapley Values successively add higher-order variable interactions to the explanations.

The theoretical result shows that n-Shapley Values recover generalized additive models (GAMs), which are GLMs where the outcome depends linearly on smoothed functions of the inputs: g(E[Y]) = B_0 + f_1(x_1) + f_2(x_2) + … + f_m(x_m). GAMs are considered inherently interpretable, but are also undetermined. For n-Shapley to recover a faithful representation of the function as a GAM, the order of the explanation just needs to be at least as large as the order of the maximum variable interaction in the model.
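
To make the order-1 case concrete, here is a toy R sketch (my own illustration, not code from the talk or the paper): for a purely additive function with independent features, brute-force Shapley values just recover each feature’s centered additive component.

# toy illustration: exact Shapley values for an additive function with
# independent features recover the centered additive components, i.e.,
# the order-1 "GAM" case discussed above; all names here are made up
set.seed(1)
d <- 3
X <- matrix(rnorm(2000*d), ncol=d)              # background sample
f <- function(x) sin(x[1]) + x[2]^2 - 0.5*x[3]  # additive test function

# value of a coalition S at a target point: average f over the background
# sample with the features in S clamped to the target's values
v <- function(S, x_star) {
  Xs <- X
  for (j in S) Xs[,j] <- x_star[j]
  mean(apply(Xs, 1, f))
}

all_subsets <- function(idx) {
  out <- list(integer(0))
  for (k in seq_along(idx)) out <- c(out, combn(idx, k, simplify=FALSE))
  out
}

shapley <- function(x_star) {
  phi <- numeric(d)
  for (i in 1:d) {
    for (S in all_subsets(setdiff(1:d, i))) {
      w <- factorial(length(S))*factorial(d - length(S) - 1)/factorial(d)
      phi[i] <- phi[i] + w*(v(c(S, i), x_star) - v(S, x_star))
    }
  }
  phi
}

x_star <- c(1, 2, -1)
round(shapley(x_star), 2)
# compare to the centered additive components f_j(x_j) - E[f_j(x_j)]
round(c(sin(1) - mean(sin(X[,1])),
        2^2 - mean(X[,2]^2),
        0.5 - mean(-0.5*X[,3])), 2)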

However, GAMs lose their interpretability as we add interactions. When we have large numbers of features, as is typically the case in deep learning, what is the value of the explanation? We would need to look at interactions between all combinatorial subsets of the features. So when simple explanations like standard SHAP are applied to complex functions, you’re getting an average over billions of interaction terms, and there’s no reduction to be made that would give you something meaningful. The fact that in the simple setting of a GAM of order 1 we can prove SHAP does the right thing does not mean we’re anywhere close to having “solved” explainability.

The organizers of the workshop obviously invited this rather negative talk on XAI, so perhaps the community is undergoing self-reflection that will temper the overconfidence I associate with it. Although, the day before the workshop I also heard someone complaining that his paper on calibration got rejected from the same workshop, with an accompanying explanation that it wasn’t about LIME or SHAP. Something tells me XAI will live on.

I guess one could argue there’s still value in taking a pragmatic view, where if we find that explanations of model predictions, regardless of how meaningless, lead to better human decisions in scenarios where humans must make the final decision regardless of the model accuracy (e.g., medical diagnoses, loan decisions, child welfare cases), then there’s still some value in XAI. But who would want to dock their research program on such shaky footing? And of course we still need an adequate way of measuring reliance, but I will save my thoughts on that for another post.

 Another thing that struck me about the talk was a kind of tension around just trusting one’s instincts that something is half-baked versus taking the time to get to the bottom of it. Luxburg started by talking about how her strong gut feeling as a theorist was that trying to guarantee AI explainability was not going to be possible. I believed her before she ever got into the demonstration, because it matched my intuition. But then she spent the next 30 minutes discussing an XAI paper. There’s a decision to be made sometimes, about whether to just trust your intuition and move on to something that you might still believe in versus to stop and articulate the critique. Others might benefit from the latter, but then you realize you just spent another year trying to point out issues with a line of work you stopped believing in a long time ago. Anyway, I can relate to that. (Not that I’m complaining about the paper she presented – I’m glad she took the time to figure it out as it provides a nice example). 

I was also reminded of the kind of awkward moment that happens sometimes where someone says something rather final and damning, and everyone pauses for a moment to listen to it. Then the chatter starts right back up again like it was never said. Gotta love academics!

Mixtures: A debugging story

I’m not a great programmer. The story that follows is not intended to represent best programming practice or even good programming practice. It’s just something that happened to me, and it’s the kind of thing that’s happened to me many times before, so I’m sharing it with you here.

The problem

For a research project I needed to fit a regression model with an error term that is a mixture of three normals. I googled *mixture model Stan* and came to this page with some code:

data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}

I should’ve just started by using this code as is, but instead I altered it in a couple of ways:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector<lower=0>[M] sigma;
}
model {
  lambda ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 2);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}

The main thing was adding the linear predictor, X*beta, but I also renamed a couple of variables to make the code line up with the notation in the paper I was writing. In addition, I added a prior on the mixture component sizes, and I removed some of the code from the Stan User’s Guide that increased computational efficiency but seemed to me to make the code harder for newcomers to read.

I fit this model setting M=3, and . . . it was really slow! I mean, stupendously slow. My example had about 1000 data points and it was taking, oh, I dunno, close to an hour to run?

This just made no sense. It’s not just that it was slow; also, the slowness was just not right, which made me concerned that something else was going wrong.

Also there was poor convergence. Some of this was understandable, as I was fitting the model to data that had been simulated from a linear regression with normal errors, so there weren’t actually three components. But, still, the model has priors, and I don’t think the no-U-turn sampler (NUTS) algorithm used by Stan should have so much trouble traversing this space.

Playing with priors

It was time to debug. My first thought was that the slowness was caused by poor geometry: if NUTS is moving poorly, it can take up to 1024 steps per iteration, and then each iteration will take a long time. Poor mixing and slow convergence go together.

A natural way to fix this problem is to make the priors stronger. With strong priors, the parameters are restricted to fall in a small, controlled zone, and the geometry should be better.

I tried a few things with priors, the most interesting of which was to set up a hierarchical model for the scales of the mixture components. I added this line to the parameters block:

  real log_sigma_0;

And then, in the model block, I replaced “sigma ~ lognormal(0, 2);” with:

  sigma ~ lognormal(log_sigma_0, 1);

One thing that made the priors relatively easy to set up here was that the data are on unit scale: they’re the logarithms of sampling weights, and the sampling weights were normalized to have a sample mean of 1. Also, we’re not gonna have weights of 1000, so they have a limited range on the log scale.

In any case, these steps of tinkering with the prior weren’t helping. The model was still running ridiculously slowly.

Starting from scratch

OK, what’s going on? I decided to start from the other direction: Instead of starting with my desired model and trying to clean it up, I started with something simple and built up.

The starting point is the code in the Stan User’s Guide, given above. I simulated some fake data and fitted it:

library("cmdstanr")
set.seed(123)

# Simulate data
N <- 100
K <- 3
lambda <- c(0.5, 0.3, 0.2)
mu <- c(-2, 0, 2)
sigma <- c(1, 1, 1)
z <- sample(1:K, N, replace=TRUE, prob=lambda)
v <- rnorm(N, mu[z], sigma[z])

# Fit model
mixture <- cmdstan_model("mixture_2.stan")
mixture_data <- list(y=v, N=N, K=3)
v_mixture_fit <- mixture$sample(data=mixture_data, seed=123, chains=4, parallel_chains=4)
print(v_mixture_fit)

And it worked just fine, ran fast, no convergence problems. It also worked fine with N=1000; it just made sense to try N=100 first in case any issues came up.

Then I changed the Stan code to my notation, using M rather than K and a couple other things, and still no problems.

Then I decided to make it a bit harder by setting the true means of the mixture components to be identical, changing "mu <- c(-2, 0, 2)" in the above code to

mu <- c(0, 0, 0)

Convergence was a bit worse, but it was still basically ok. So the problem didn't seem to be the geometry.

Next step was to add predictors to the model. I added them to the R code and the Stan code . . . and then the problem returned.
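
For reference, the R side of that step looked something like this (a rough reconstruction rather than the exact script; the Stan file name, the coefficient value, and the data dimensions here are placeholders):

# reconstruction: simulate regression data with a single normal error term,
# then fit the mixture-error regression model shown earlier
N <- 1000
K <- 1                                    # one predictor, no constant term
X <- matrix(rnorm(N), ncol=K)
beta_true <- -0.2                         # made-up true coefficient
v <- as.vector(X %*% beta_true) + rnorm(N, 0, 1)

mixture_reg <- cmdstan_model("mixture_regression.stan")   # placeholder file name
reg_data <- list(v=v, X=X, N=N, K=K, M=3)
reg_fit <- mixture_reg$sample(data=reg_data, seed=123, chains=4, parallel_chains=4)
print(reg_fit)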

So I'd isolated the problem. It was with the regression predictors. But what was going on? One problem could be nonidentification of the constant term in the regression with the location parameters mu in the mixture model---but I'd been careful not to include a constant term in my regression, so it wasn't that.

Finding the bug

I stared at the code harder and found the problem! It was in this line of the Stan program:

      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);

The problem is that the code is doing one data point at a time, but "X*beta" has all the data together! So I fixed it. I changed the above line to:

      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);

and added the following line to the beginning of the model block:

  vector[N] Xbeta = X*beta;

Now it all works. I found the bug.

What next?

The code runs and does what it is supposed to do. Great. Now I have to go back to the larger analysis and see whether everything makes sense.

Here was the output of the simple normal linear regression fit to the data I'd simulated:

            Median MAD_SD
(Intercept)  0.57   0.10 
x           -0.16   0.01 

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.02   0.02

And here was the result of fitting the regression model in Stan with error term being a mixture of 3 normals:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 lp__        -1340.21 -1339.91 2.85 2.77 -1345.21 -1335.96 1.00      681      458
 beta[1]        -0.16    -0.16 0.01 0.01    -0.18    -0.14 1.00     2139     2357
 lambda[1]       0.32     0.14 0.34 0.19     0.01     0.94 1.02      419      977
 lambda[2]       0.39     0.27 0.35 0.36     0.01     0.95 1.01      506      880
 lambda[3]       0.28     0.11 0.32 0.15     0.01     0.92 1.01      543      879
 mu[1]          -0.14    -0.02 0.61 0.57    -1.43     0.57 1.02      385      281
 mu[2]           0.58     0.56 0.51 0.33    -0.25     1.56 1.01      474      536
 mu[3]           1.53     1.37 1.56 0.67     0.60     2.42 1.01      562     1025
 sigma[1]        0.77     0.85 0.30 0.22     0.21     1.09 1.01      522      481
 sigma[2]        0.85     0.93 0.32 0.16     0.27     1.16 1.00      644     1028
 sigma[3]        0.77     0.79 0.50 0.29     0.25     1.11 1.00      910     1056
 log_sigma_0    -0.35    -0.34 0.64 0.65    -1.42     0.68 1.00     1474     1605

The estimate for the slope parameter, beta, seems fine, but it's hard to judge mu and sigma. Ummm, we can take a weighted average for mu, 0.32*(-0.14) + 0.39*0.58 + 0.28*1.53 = 0.61, which seems kind of ok although a bit off from the 0.57 we got from the linear regression. What about sigma? It's harder to tell. We can compute the weighted variance of the mu's plus the weighted average of the sigma^2's, but it's getting kinda messy, also really we want to account for the posterior uncertainty---as you can see from above, these uncertainty intervals are really wide, which makes sense given that we're fitting a mixture of 3 normals to data that were simulated from a single normal distribution . . . Ok, this is getting messy. Let's just do it right.

I added a generated quantities block to the Stan program to compute the total mean and standard deviation of the error term:

generated quantities {
  real mu_total = sum(lambda.*mu)/sum(lambda);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total)^2 + sigma^2))/sum(lambda));
}

And here's what came out:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 mu_total        0.56     0.56 0.11 0.11     0.39     0.74 1.00     2172     2399
 sigma_total     1.02     1.02 0.03 0.02     0.98     1.06 1.00     3116     2052

Check. Very satisfying.

The next step is to continue on with the research on the problem that motivated all this. Fitting a regression with mixture model for errors, that was just a little technical thing I needed to do. It's annoying that it took many hours (not even counting the hour it took to write this post!) and even more annoying that I can't do much with this---it's just a stupid little bug, nothing that we can even put in our workflow book, I'm sorry to say---but now it's time to move on.

Sometimes, cleaning the code, or getting the model to converge, or finding a statistically-significant result, or whatever, takes so much effort that when we reach that ledge, we just want to stop and declare victory right there. But we can't! Or, at least, we shouldn't! Getting the code to do what we want is just a means to an end, not an end in itself.

P.S. I cleaned the code some more, soft-constraining mu_total to 0 so I could include the intercept back into the model. Here's the updated Stan program:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector<lower=0>[M] sigma;
  real log_sigma_0;
}
model {
  vector[N] Xbeta = X*beta;
  lambda ~ lognormal(log(1./M), 1);
  mu ~ normal(0, 10);
  sigma ~ lognormal(log_sigma_0, 1);
  sum(lambda.*mu) ~ normal(0, 0.1);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}
generated quantities {
  real mu_total = sum(lambda.*mu);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total)^2 + sigma^2)));
}

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.