Hiring postdocs and research scientists at Flatiron Institute’s Center for Computational Mathematics

Hiring season is officially open! We posted our job ad for postdocs today at the Center for Computational Mathematics (CCM) at the Flatiron Institute. Applications are due 15 December 2021 for positions starting summer/fall 2022. Initial appointments are for two years with a baked-in one-year extension available. Here’s the link with full details of how to apply.

Postdoc job ad: Flatiron Research Fellow, ML and Comp Stats

We’re also hiring for permanent research scientist positions.

Research scientist job ad: Flatiron Research Scientist, Open Rank

Our mission statement sums up what we do pretty well.

CCM’s mission is to create new mathematical approaches, algorithms and software to advance scientific research in multiple disciplines, often in collaboration with other Flatiron Centers.

More specifically, members of our center work on methodology, applications, and theory for computational stats and machine learning, as well as more traditional scientific computing in the form of optimization, differential equation solvers, non-uniform FFTs, phase retrieval, etc. I (Bob Carpenter) am here full time, as is Brian Ward, a software engineer who’s been working on the Stan language, CmdStan and CmdStanPy. Yuling Yao, a former Ph.D. student of Andrew’s, has joined as a postdoc (he’s the obvious person to ask about the postdoc experience). I’m sharing an office with Dave Blei while he’s here on sabbatical and we have other great visitors tentatively lined up (if you’d like to visit for a leave or a sabbatical, please drop me a line).

Although most of our permanent staff researchers and software engineers work on open-source software, we realize postdocs need to get a job, so we give them the flexibility to concentrate on research. We also understand that research is an international collaborative effort and we encourage postdocs and research scientists to maintain and develop academic collaborations outside of Flatiron Institute.

My favorite part of this place is that it’s packed to the rafters with scientists working on deep and interesting problems across the biological and physical sciences (our other centers are for computational astronomy, computational biology, computational neuroscience, and computational quantum physics).

I’m afraid we don’t do social science or humanities computing and are unlikely to hire anyone who’s concentrated mainly in these areas. The centers run popular science talks for each other on a weekly basis, which are really wonderful because they’re pitched at mathematically- and computationally-oriented scientists (so they assume you know basic physics and differential equations). Another great benefit is our computing clusters, which have state of the art GPU support and support from our high-performance computing experts who can help get your C++, Fortran, or even Python code running efficiently on the cluster.
Like continental Europe and the old Bell Labs, we have a very pleasant lunch culture where we eat together (the picture above is our cafeteria).
We also have really great admin support at all levels. Even our security guards are friendly and helpful. We’re one of the few places that has sufficient meeting rooms for our staff size (as well as great professional AV support). We’re also one of those places with chalkboards and white boards in all the halls, and it feels very welcoming to be surrounded by mathematical scribbling (that’s my office you can see behind Christian Mueller writing on the blackboard below).

Flatiron Institute is part of the Simons Foundation, which is one of the largest non-governmental science funders. Simons Foundation also runs all kinds of popular outreach programs around math and science education and engagement, including Quanta Magazine and Sandbox Films (science documentaries), as well as a large autism research institute dedicated to open-access data collection and curation.

We have very nice digs in the Flatiron neighborhood of NYC. The office is right on Fifth Avenue, only a few blocks from Greenwich Village (NYU and subsidized postdoc housing) and Chelsea (Google and the Hudson River).

We’re pretty much 100% back to the office. Simons Foundation requires vaccinations to get into the building and all employees have to get weekly swab PCR tests for Covid.

If you’re interested in a postdoc or research scientist position on the computational stats side of CCM, please send me an email ([email protected]).

Postdoc in machine learning and computational statistics for cosmology

Uroš Seljak is looking to hire a postdoc to work on variational inference with applications to cosmology at Lawrence Berkeley National Lab (LBNL). Here’s a link to the full job posting.

The position requires a recent Ph.D. and a background in optimization, sampling, or uncertainty quantification (i.e., computational statistics). The non-boilerplate parts of the job ad are as follows.

Lawrence Berkeley National Lab’s (LBNL) Computational Research Division has an opening for a Postdoctoral Scholar (Computational Research) in applying machine learning tools to optimization, sampling and uncertainty quantification.

…The successful applicant will develop, test, and benchmark new algorithms for high-dimensional inference and data analysis, using machine learning models to accelerate computations. …

Uroš suggested reading the following three papers for background.

P.S. I’m posting this on the recommendation of Chirag Modi, who was a Ph.D. student of Uroš’s at Berkeley and is now a postdoc at Flatiron Institute. Chirag assures me Uroš is a great mentor in addition to being a leading cosmologist.

“Hello, World!” with Emoji (Stan edition)

Brian Ward sorted out our byte-level string escaping in Stan in this stanc3 pull request and the corresponding docs pull request.

In the next release of Stan (2.28), we’ll be able to include UTF-8 in our strings and write “Hello, World!” without words.

transformed data {
  print("🙋🏼 🌎❗");
}

I’m afraid we still don’t allow Unicode identifiers. So if you want identifiers like α or ℵ or 平均数 then you’ll have to use Julia.

logit and normal are special functions, too!

I find that when people write $latex \displaystyle \frac{\exp(x)}{1 + \exp(x)}$ instead of writing $latex \textrm{logit}^{-1}(x)$ or when they write $latex \displaystyle \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)$ instead of $latex \textrm{normal}(y \mid \mu, \sigma)$, I wind up just having to parse the formulas back into common forms I can understand as a unit.

So my question is the following. What’s the process for making a function an “official” special function? Is there a committee I need to petition somewhere? Or can I just beg here on the blog?

EU proposing to regulate the use of Bayesian estimation

The European Commission just released their Proposal for a Regulation on a European approach for Artificial Intelligence. They finally get around to a definition of “AI” on page 60 of the report (link above):

‘artificial intelligence system’ (AI system) means software that is developed with one or more of the techniques and approaches listed in Annex I and can, for a given set of human-defined objectives, generate outputs such as content, predictions, recommendations, or decisions influencing the environments they interact with

We don’t even have to wonder if they mean us. They do. Here’s the full text of Annex 1 from page 1 of “Laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts” (same link, different doc).

ANNEX I
ARTIFICIAL INTELLIGENCE TECHNIQUES AND APPROACHES
referred to in Article 3, point 1

(a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;

(b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;

(c) Statistical approaches, Bayesian estimation, search and optimization methods.

This feels hopelessly vague with phrasing like “wide variety of methods” and the inclusion of “statistical approaches”. Also, I don’t see what’s added by “Bayesian estimation” given that it’s an instance of a statistical approach. At least it’s nice to be noticed.

Annex I looks like my CV rearranged. In the ’80s and ’90s, I worked on logic programming, linguistics, and knowledge representation (b). I’m surprised that’s still a going concern. I spent the ’00s working on ML-based natural language processing, speech recognition, and search (a). And ever since the ’10s, I’ve been working on Bayesian stats (c).

The Folk Theorem, revisited

It’s time to review the folk theorem, an old saw on this blog, on the Stan forums, and in all of Andrew’s and my applied modeling.

Folk Theorem

Andrew uses “folk” in the sense of being folksy as opposed to rigorous.

The Folk Theorem of Statistical Computing (Gelman 2008): When you have computational problems, often there’s a problem with your model.

Isn’t computation often the culprit?

Better samplers like the no-U-turn sampler (NUTS) are able to fit models to data sets that were previously unfittable. ML computation is accelerating even faster than MCMC tech, fitting ever deeper, wider, and more highly structured neural networks.

This seems to imply the folk theorem is backwards. If we had better computation, maybe we could fit our data with our preferred model.

A second folk theorem

Here’s another folk theorem, stemming from decades of experience in statistical computing.

Folk Theorem II: When you have computational problems, often there’s a problem with your computation.

Here’s the rub. That problem in the computation might take years or decades or even centuries to get solved.

Now what?

In the meanwhile, you have data to analyze. In these cases, I take Andrew’s folk theorem as pragmatic advice for users. Someone trying to fit a problematic model in Stan is unlikely to solve their problem by building a better sampler (though they might solve it by using a different sampler, such as INLA). But they can fit simpler models (where unfortunately for Stan, simpler is largely the geometric issue of the condition of the posterior curvature, so it can involve things like tightening the tails, standardizing scales, and reducing posterior correlation).

But sometimes it really is the model

For example, if you have an unidentifiable likelihood, you probably don’t want to just sort that out in the prior. You’re better off reparameterizing with one fewer degree of freedom (e.g., by pinning a value or enforcing a sum-to-zero constraint). This should always be the first recourse—trying to figure out whether the model does in fact make sense. Bad computation is still a useful clue to this.

Bottom line

The frontier at which you run into computational problems is a feature of the fitting tools you have at your command. What we’ve been trying to do with Stan is push that frontier. In the meanwhile, you can heed Andrew’s folk theorem and try to simplify your model.

Summer research jobs at Flatiron Institute

If you’re an undergrad or grad student and work in applied math, stats, or machine learning, you may be interested in our summer research assistant and associate positions at the Flatiron Institute’s Center for Computational Mathematics:

There is no deadline, but we’ll start reviewing applications and making offers in early March and we only have a limited number of positions.

If you’re applying because you’d like to work on Stan, please mail me directly at [email protected] and let me know.

Edit: I should clarify that we have capacity at CCM for quite a few summer researchers in applied math, stats, and ML. I’m particularly interested in hearing from people who are interested in working on the core Stan language, Stan math library, or on challenging applied projects. Here’s a rundown of my current projects.

Jordana Cepelewicz on “The Hard Lessons of Modeling the Coronavirus Pandemic”

Here’s a long and thoughtful article on issues that have come up with Covid modeling.

Jordana’s a staff writer for Quanta, a popular science magazine funded by the Simons Foundation, which also funds the Flatiron Institute, where I now work. She’s a science reporter, not a statistician or machine learning specialist. A lot of Simons Foundation funding goes to community outreach and math and science education. Quanta aims to be more like the old Scientific American than the new Scientific American; but it also has the more personal angle of a New Yorker article on science (my favorite source of articles on science because the writing is so darn good).

There’s also a film that goes along with Jordana’s article:

I found the comments on YouTube fascinating. Not as off the wall as replies to newspaper articles, but not the informed stats readership of this (Andrew’s) blog, either.

How many infectious people are likely to show up at an event?

Stephen Kissler and Yonatan Grad launched a Shiny app,

Effective SARS-CoV-2 test sensitivity,

to help you answer the question,

How many infectious people are likely to show up to an event, given a screening test administered n days prior to the event?

Here’s a screenshot.



The app is based on some modeling they did with Stan followed by simulation-based predictions. Here’s the medRxiv paper.

Stephen M. Kissler et al. 2020. SARS-CoV-2 viral dynamics in acute infections. medRxiv.

Users input characteristics of the test taken, the event size, the time the tests are taken before the event, the duration of the event itself, etc. The tool then produces plots of expected number of infectious people at your event and even the projected viral load at your event with uncertainties.

This obviously isn’t a tool for amateurs. I don’t even understand the units Ct for the very first input slider; the authors describe it as “the RT-qPCR cycle threshold”. They said they’d welcome feedback in making this more usable.

How to describe Pfizer’s beta(0.7, 1) prior on vaccine effect?

Now it’s time for some statistical semantics. Specifically, how do we describe the prior that Pfizer is using for their COVID-19 study? Here’s a link to the report.

Way down on page 101–102, they say (my emphasis),

A minimally informative beta prior, beta (0.700102, 1), is proposed for θ = (1-VE)/(2-VE). The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic. The prior allows considerable uncertainty; the 95% interval for θ is (0.005, 0.964) and the corresponding 95% interval for VE is (-26.2, 0.995).

I think “VE” stands for vaccine effect. Here’s the definition from page 92 of the report.

VE = 100 × (1 – IRR). IRR is calculated as the ratio of first confirmed COVID-19 illness rate in the vaccine group to the corresponding illness rate in the placebo group. In Phase 2/3, the assessment of VE will be based on posterior probabilities of VE1 > 30% and VE2 > 30%.

VE1 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in participants without evidence of infection before vaccination, and VE2 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in all participants after vaccination.

I’m unclear on why they’d want to impose a prior on (1 - VE) / (2 - VE), or even how to interpret that quantity, but that’s not what I’m writing about. But the internet’s great and Sebastian Kranz walks us through it in a blog post, A look at Biontech/Pfizer’s Bayesian analysis of their COVID-19 vaccine trial. It turns out that the prior is on the quantity $latex \theta = \frac{\displaystyle \pi_v}{\displaystyle \pi_v + \pi_c},$ where $latex \pi_v, \pi_c \in (0, 1)$ are, in Kranz’s words, “population probabilities that a vaccinated subject or a subject in the control group, respectively, fall ill to Covid-19.” I’m afraid I still don’t get it. Is the time frame restricted to the trial? What does “fall ill” mean, a positive PCR test or something more definitive? (The answers may be in the report—I didn’t read it.)

What is a weakly informative prior?

It’s the description “minimally informative” and subsequent write-ups calling it “weakly informative” that got my attention. For instance, in Ian Fellows’s post, The Pfizer-Biontech vaccine may be a lot more effective than you think (which Andrew summarized in his own post here), Fellows calls it “a Bayesian analysis using a beta binomial model with a weakly-informative prior.”

What we mean by weakly informative is that the prior determines the scale of the answer. For example, a standard normal prior, normal(0, 1), imposes a unit scale, whereas normal(0, 100) would impose a scale of 100 (like Stan and R, I’m using a scale, i.e., standard deviation, parameterization of the normal so that the two parameters have the same units).

Weakly informative in which parameterization?

Thinking about proportions is tricky, because they’re constrained to fall in the interval (0, 1). The standard deviation of a beta distribution approaches its maximum of 0.5 as alpha and beta go to 0, whereas a uniform distribution on (0, 1) has standard deviation 0.29 and a beta(100, 100) has standard deviation about 0.035.

It helps to transform using logit so we can consider the log odds, mapping a proportion $latex \theta$ to $latex \textrm{logit}(\theta) = \log \left( \theta / (1 - \theta) \right).$ A uniform distribution on theta in (0, 1) results in a standard logistic(0, 1) distribution on logit(theta) in (-inf, inf). So even a uniform distribution on the proportion leads to a unit-scale distribution on the log odds. In that sense, a uniform distribution is weakly informative in the way we mean it when we recommend weakly informative priors in Stan. All on its own, it’ll control the scale of the unconstrained parameter. (By the way, I think transforming theta in (0, 1) to logit(theta) in (-inf, inf) is the easiest way to get a handle on Jacobian adjustments—it’s easy to see the transformed variable no longer has a uniform distribution, and it’s the Jacobian of the inverse transform that defines the logistic distribution’s density.)
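
Here’s a quick R simulation check of that claim (nothing to do with the Pfizer analysis itself): pushing uniform(0, 1) draws through the logit yields draws that match the standard logistic in location and scale.

# logit of a uniform(0, 1) variate is standard logistic; check by simulation
set.seed(1234)
theta <- runif(1e6)
eta <- log(theta / (1 - theta))    # logit(theta)
mean(eta)                          # approximately 0
sd(eta)                            # approximately pi / sqrt(3) =~ 1.81
sd(theta)                          # approximately 1 / sqrt(12) =~ 0.29

# compare the empirical CDF of eta to plogis, the standard logistic CDF
grid <- seq(-5, 5, 0.1)
max(abs(ecdf(eta)(grid) - plogis(grid)))   # small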

Fellows is not alone. In the post, Warpspeed confidence — what is credible?, which relates Pfizer’s methodology to more traditional frequentist methods, Chuck Powell says, “For purposes of this post I’m going to use a flat, uninformed prior [beta(1, 1)] in all cases.” Sure, it’s flat on the (0, 1) scale, but not on the log odds scale. Flat is relative to parameterization. If you work with a logistic prior on the log odds scale and then transform with inverse logit, you get exactly the same answer with a prior that is far from flat—it’s centered at 0 and has a standard deviation of pi / sqrt(3), or about 1.8.

How much information is in a beta prior?

It helps to reparameterize the beta with a mean $latex \phi \in (0, 1)$ and “count” $latex \kappa > 0,$

$latex \textrm{beta2}(\theta \mid \phi, \kappa) = \textrm{beta}(\theta \mid \phi \cdot \kappa, (1 – \phi) \cdot \kappa).$
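
In R that’s a one-liner wrapper around dbeta (dbeta2 is just my name for it here, matching the formula above, not a built-in function):

# beta parameterized by mean phi in (0, 1) and count kappa > 0
dbeta2 <- function(theta, phi, kappa, log = FALSE) {
  dbeta(theta, shape1 = phi * kappa, shape2 = (1 - phi) * kappa, log = log)
}

# uniform(0, 1) is beta2(0.5, 2); Pfizer's prior is beta2(0.7 / 1.7, 1.7)
dbeta2(0.3, 0.5, 2)            # = dbeta(0.3, 1, 1) = 1
dbeta2(0.3, 0.7 / 1.7, 1.7)    # = dbeta(0.3, 0.7, 1)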

The beta distribution is conjugate to the Bernoulli (and more generally, the binomial), which is what makes it a popular choice. What this means in practice is that it’s an exponential-family distribution whose parameters can be treated as pseudodata (prior counts) for a Bernoulli likelihood.

Because beta(1, 1) is a uniform distribution, we think of that as having no prior data, or a total of zero pseudo-observations. From this perspective, beta(1, 1) really is uninformative in the sense that it’s equivalent to starting uniform and seeing no prior data.

In the beta2 parameterization, the uniform distribution on (0, 1) is beta2(0.5, 2). This corresponds to pseudodata with count 0, not 2—we need to subtract 2 from $latex \kappa$ to get the pseudocount!

Where does that leave us with the beta(0.7, 1)? Using our preferred parameterization, that’s beta2(0.4117647, 1.7). That’s a prior pseudocount of -0.3 observations! We start with negative pseudodata whenever the prior count parameter kappa is less than 2. Spoiler alert—that negative pseudocount is going to be swamped by the actual data.

What about Pfizer’s beta(0.700102, 1) prior? That’s beta2(0.4118, 1.700102). If you plot beta(theta | 0.7, 1) vs. theta, you’ll see that the log density tends to infinity as theta goes to 0. That makes it look like it’s going to be somewhat or maybe even highly informative. There’s a nice density plot in Kranz’s post.

Of course, the difference between beta(0.700102, 1) and beta(0.7, 1) is negligible—roughly a ten-thousandth of a patient in prior pseudocount and even less on the prior mean. They must’ve derived the number from a formula somehow and then didn’t want to round. The only harm in using 0.700102 rather than 0.7 or even 1 is that someone may assume a false sense of precision.

Let’s look at the effect of the prior in terms of how it affects the posterior. That is, differences between beta(n + 0.7, N - n + 1) vs. beta(n + 1, N - n + 1) for a trial with n successes out of N. I’m really surprised they’re only looking at N = 200 and expecting something like n = 30. Binomial data is super noisy, so N = 200 is a small data size unless the effect is huge.

Is that 0.000102 in prior pseudocount going to matter? Of course not. Is the difference between beta(1, 1) and beta(0.7, 1) going to matter? Nope. If we compare the posteriors beta(30 + 0.7, 170 + 1) and beta(30 + 1, 170 + 1), their posterior 95% central intervals are (0.107, 0.206) and (0.106, 0.205).
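
Those intervals are easy to reproduce with qbeta (using the hypothetical n = 30 out of N = 200 from above and a uniform beta(1, 1) prior for the comparison):

# posterior 95% central intervals under the two priors, with n = 30, N = 200
qbeta(c(0.025, 0.975), 30 + 0.7, 170 + 1)   # beta(0.7, 1) prior
qbeta(c(0.025, 0.975), 30 + 1, 170 + 1)     # beta(1, 1) prior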

So I guess it’s like Andrew’s injunction to vote. It might make a difference on the edge if we impose a three-digit threshold somewhere and just manage to cross it in the last digit.

Beta-binomial and Jeffreys priors

I’ll leave it to the Bayes theory wonks to talk about why beta(0.5, 0.5) is the Jeffreys prior for the binomial model. I’ve never dug into the theory enough to understand why anyone cares about these priors other than their invariance to reparameterization.

UX issues around voting

While Andrew’s worrying about how to measure calibration and sharpness on small N probabilistic predictions, let’s consider some computer and cognitive science issues around voting.

How well do elections measure individual voter intent?

What is the probability that a voter who tries to vote has their intended votes across the ballot registered? Spoiler alert. It’s not 100%.

We also want to know if the probability of having your vote recorded depends on the vote. Or on the voter. To put it in traditional statistical terms, if we think of the actual vote count as an estimate of voter intent, what is the error and bias of the estimator?

Not very well, it turns out

User interface (UX) errors are non-negligible. For a concrete analysis around the 2000 U.S. presidential election, see the following page summarizing some of the findings of a big study partly coordinated by our collaborator Steve Ansolabehere.*

No surprise here

There’s nothing at all surprising here from a UX point of view. Everyone who’s ever worked on UX knows that UX errors are the norm, not the exception. The point of building a good UX is to minimize errors to the extent possible.

I also want to point out that banks seem to manage handing out cash through their ATMs with low enough error that it’s still profitable. Now I’m curious about the error rate.

Simple Example

In case you don’t want to click through to see the real example, an example of a classic UX blunder in a voting context is the following ballot.

[ ] CANDIDATE 1 [ ] CANDIDATE 2 [ ] CANDIDATE 3

This kind of layout violates the basic UX principle of putting checkboxes distinctively closer to their associated items than to other items. With the layout above, a voter who intends to vote for candidate 2 might accidentally vote for candidate 3 because the two boxes are equally close to the name “CANDIDATE 2”.

It’s better with more whitespace so that the boxes are visually identified with their associated item.

[ ] CANDIDATE 1      [ ] CANDIDATE 2      [ ] CANDIDATE 3

A vertical layout can solve some problems, but as the example in the article I linked above shows, it can introduce other ones if done poorly.

This is just one of the many blunders that are quite common in user interfaces.

Anecdote about my own ballot this year

Personally, I had a question about the fine print of the NYC ballots because there were a bunch of judge candidates and it wasn’t clear to me how many I could vote for. I actually flipped to the instructions, which said the number would be at the top. It wasn’t. I went and asked the poll worker out of curiosity (I’m fascinated by UX issues). Turns out they moved the number to relatively fine print to the left of the column of checkboxes. Now this particular vote didn’t matter as far as I can tell because there were only four candidates and you could choose up to four. It’s just an example of the kind of confusion you can run into.


* Steve’s the one who introduced me to Andrew 25 years ago after Andrew and I both moved to NYC. The second-to-last grant Andrew and I got before I moved to the Flatiron Institute was with Steve to work on his Cooperative Congressional Election Study data collection and analysis. That project’s still ongoing and the data, models, etc. are all open access.

The Economist not hedging the predictions

Andrew’s hedge that he’s predicting vote intent and not accounting for any voting irregularities either never made it to the editorial staff at The Economist or they chose to ignore it. Their headline still reports that they’re predicting the electoral college, not voter intent:



Predicting voter intent is a largely academic exercise in that all this hand-wringing over uncertainty in intent could be dominated by voting irregularities. For example, I applied for an absentee ballot in NYC on October 6. They have a tracker, so I can see it was approved on October 7, but never mailed. My inquiry to their help desk went unanswered. Now I have to decide if it’s worth voting in person in NYC amidst the pandemic. Mitzi applied a week before me at the end of September and received her absentee ballot two weeks later. Given the apparent gap in belief in Covid’s seriousness between the supporters of the two parties and the increased danger to the elderly from Covid, this feels like it could have a non-trivial effect on the election. At the end of September, erroneous ballots were shipped to Brooklyn. The absentee ballots themselves are apparently super confusing, but I wouldn’t know, since I didn’t get one.

Hiring at all levels at Flatiron Institute’s Center for Computational Mathematics

We’re hiring at all levels at my new academic home, the Center for Computational Mathematics (CCM) at the Flatiron Institute in New York City.

We’re going to start reviewing applications January 1, 2021.

A lot of hiring

We’re hoping to hire many people for each of the job ads. The plan is to grow CCM from around 30 people to around 60, which will involve hiring 20 or 30 more postdocs and permanent research staff over the next few years!!! Most of those hires are going to be on the machine learning and stats side.

What we do

I’m still working on Stan and computational stats and would like to hire some more people to work on computational stats and probabilistic programming. There’s also lots of other fascinating work going on in the center, including equivariant neural networks for respecting physical constraints, spike sorting for neural signal classification, Markov chain Monte Carlo for molecular dynamics, phase retrieval (Fourier inverse problem) for cryo electron microscopy, and lots of intricate partial differential equation solvers being developed for challenging problems like cellular fluid dynamics or modeling light flow in the visual system of a wasp.

Plus, there’s a lot of communication across centers. There are Stan users in both the astro and bio centers, working on things like the LIGO project for gravitational waves, ocean biome factor modeling, and protein conformation estimation. If you like science, stats, and computation, it’s a great place to hang out.

The mission

The Flatiron Institute is unusual for an academic institution in that it’s focused squarely on computation.

The mission of the Flatiron Institute is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation.

The motivation remains to fill a gap in scientific software development that’s not supported well by research grants, academia, or industry.

The job listings

The job listings are for two postdoc-level positions (called “fellows”) and an open-rank faculty-level position (your choice of a “research scientist” or “data scientist” title). Please keep in mind that we’re hiring for all of these positions in bulk over the next few years across the range of center interests.

If you’re interested in comp stats and are going to apply, please drop me a line directly at [email protected].

More on absolute error vs. relative error in Monte Carlo methods

This came up again in a discussion from someone asking if we can use Stan to evaluate arbitrary integrals. The integral I was given was the following:

$latex \displaystyle \alpha = \int_{y \in R\textrm{-ball}} \textrm{multinormal}(y \mid 0, \sigma^2 \textrm{I}) \, \textrm{d}y$

where the $latex R$-ball is assumed to be in $latex D$ dimensions so that $latex y \in \mathbb{R}^D$.

(MC)MC approach

The textbook Monte Carlo approach (Markov chain or plain old) to evaluating such an integral is to evaluate

$latex \displaystyle \frac{1}{M} \sum_{m=1}^M \textrm{normal}(y^{(m)} \mid 0, \sigma)$,

where the marginal distribution of each $latex y^{(m)}$ is

$latex \displaystyle y^{(m)} \sim \textrm{uniform}(R\textrm{-ball})$.

I provide a Stan program for doing this below.

If you want to do this without Stan, you can rejection sample points in the ball very easily in low dimensions. In higher dimensions, you can generate a random angle and radius (the random radius needs to be drawn non-uniformly in dimensions higher than 1 to account for increasing area/volume as the radius increases). Stan could’ve also used an angle/radius parameterization, but I’m too lazy to work out the trig.
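
Here’s a sketch in R of that angle/radius approach (just an illustration of the sampling scheme, not the Stan program below; runif_ball is my own helper name): normalize a standard normal draw to get a uniform direction, then draw the radius with density proportional to r^(D - 1), which is R times a uniform variate raised to the power 1/D.

# draw M points uniformly from the R-ball in D dimensions
runif_ball <- function(M, D, R) {
  z <- matrix(rnorm(M * D), M, D)     # directions from a spherical normal
  u <- z / sqrt(rowSums(z^2))         # uniform on the unit sphere
  r <- R * runif(M)^(1 / D)           # radius density proportional to r^(D - 1)
  u * r                               # scale each direction by its radius
}

y <- runif_ball(1e5, D = 5, R = 3)
max(sqrt(rowSums(y^2)))               # all draws inside radius 3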

Most draws can contribute zero

If the volume of the $latex R$-ball is large compared to the volume containing the draws you get from $latex \textrm{normal}(0, \sigma)$, then most of the draws from the $latex R$-ball will have roughly zero density in $latex \textrm{normal}(0, \sigma)$. The effect of dimensionality is exponential on the difference in volume.

Absolute error

Even if the volume of the $latex R$-ball is large compared to the volume containing the draws you get from $latex \textrm{normal}(0, \sigma)$, the (MCMC) central limit theorem still controls error. It’s just that the error it controls is squared error. Thus if we measure absolute error, we’re fine. We’ll get an estimate near zero which is right to several decimal places.

Relative error

If the value of the integral is 0.01 or something like that, then to get 10% relative error, we need our estimate to fall within 0.001 of the true answer. 0 doesn’t do that.

What we have to do with a naive application of Monte Carlo methods is make a whole lot of draws to get an estimate between 0.009 and 0.011.

Relation to discrete sampling

This is related to the discrete sampling problem we ran into when trying to estimate the probability of winning the lottery by buying tickets and computing a Monte Carlo estimate. See my previous post, Monte Carlo and winning the lottery. The setup was very hard for readers to swallow when I first posted about it (my experience is that statisticians don’t like thought experiments or simplified hello world examples).
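
Here’s a toy version of that problem in R (made-up numbers, purely to illustrate the absolute vs. relative error point): estimating a probability of one in ten thousand with a million draws gives tiny absolute error and sloppy relative error.

# Monte Carlo estimate of a small probability
set.seed(42)
p <- 1e-4
M <- 1e6
est <- mean(runif(M) < p)   # indicator draws
abs(est - p)                # absolute error: on the order of 1e-5
abs(est - p) / p            # relative error: often around 10%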

Stan program

The complexity comes from the constraining transform to the $latex R$-ball and the corresponding Jacobian adjustment to make the distribution in the ball uniform.

For the transform, I used a stick-breaking procedure very similar to how Stan transforms simplex variables and the rows of a correlation matrix Cholesky factor. The transform requires a Jacobian, which is calculated on the fly. An alternative that would produce uniform draws purely through a transform would be to draw an angle (uniformly) and a radius (non-uniformly, based on distance from the origin); this would probably be more efficient.

/**
 * Sampling for this program computes
 *
 *   INTEGRAL_{theta in R ball in N dimensions} normal(theta | 0, sigma) d.theta
 */
data {
  int N;
  real R;
  real sigma;
}
parameters {
  vector<lower = -1, upper = 1>[N] alpha;
}
transformed parameters {
  // transform alpha to the unit ball
  vector[N] theta;

  // log_abs_det_J accumulates the change-of-variables correction
  real log_abs_det_J = 0;
  {
    real sum_sq = 0;
    for (n in 1:N) {
      real mult = sqrt(1 - sum_sq);
      theta[n] = alpha[n] * mult;
      sum_sq += theta[n]^2;
      log_abs_det_J += log(mult);
    }      
  }
  
  // scale from unit ball to R-ball;  no Jacobian because R is const.
  theta *= R;
}  
model {
  // the Jacobian's triangular, so the log determinant is the sum of
  // the log of the absolute derivatives of the diagonal of J, i.e.,
  // sum(log(d.theta[n] / d.alpha[n])), as computed above

  target += log_abs_det_J;

  // theta is now uniform on R-ball
}
generated quantities {
  // posterior mean of E will be:
  //    INTEGRAL_{y in R-ball} normal(y | 0,  sigma) d.y
  //    =approx= 1/M SUM_{m in 1:M}  normal(theta(m) | 0, sigma)
  //             where theta(m) drawn uniformly from R-ball
  real E = exp(normal_lpdf(theta | 0, sigma));
}

As noted in the program inline documentation, the posterior mean of E is the result of the integral of interest.

Probabilities for action and resistance in Blades in the Dark

Later this week, I’m going to be GM-ing my first session of Blades in the Dark, a role-playing game designed by John Harper. We’ve already assembled a crew of scoundrels in Session 0 and set the first score. Unlike most of the other games I’ve run, I’ve never played Blades in the Dark, I’ve only seen it on YouTube (my fave so far is Jared Logan’s Steam of Blood x Glass Cannon play Blades in the Dark!).

Action roll

In Blades, when a player attempts an action, they roll a number of six-sided dice and take the highest result. The number of dice rolled is equal to their action rating (a number between 0 and 4 inclusive) plus modifiers (0 to 2 dice). The details aren’t important for the probability calculations. If the total of the action rating and modifiers is 0 dice, the player rolls two dice and takes the worst. This is sort of like disadvantage and (super-)advantage in Dungeons & Dragons 5e.

A result of 1-3 is a failure with a consequence, a result of 4-5 is a success with a consequence, and a result of 6 is an unmitigated success without a consequence. If there are two or more 6s in the result, it’s a success with a benefit (aka a “critical” success).

The GM doesn’t roll. In a combat situation, you can think of the player roll encapsulating a turn of the player attacking and the opponent(s) counter-attacking. On a result of 4-6, the player hits; on a result of 1-5, the opponent hits back or the situation becomes more desperate in some other way, like the character being disarmed or losing their footing. On a critical result (two or more 6s in the roll), the player succeeds with a benefit, perhaps cornering the opponent away from their flunkies.
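
Here’s a little R simulation of the action roll as just described (my own sketch, with action_roll just an illustrative helper and 0 dice meaning roll two and take the lowest); it approximately reproduces the action columns of the table further down.

# simulate one action roll with a pool of n dice and classify the result
action_roll <- function(n) {
  if (n == 0) {
    high <- min(sample(1:6, 2, replace = TRUE))   # 0 dice: roll 2, take lowest
    crit <- FALSE
  } else {
    dice <- sample(1:6, n, replace = TRUE)
    high <- max(dice)
    crit <- sum(dice == 6) >= 2                   # two or more 6s is a critical
  }
  if (crit) "66" else if (high == 6) "6" else if (high >= 4) "4-5" else "1-3"
}

# estimated result probabilities for a 2-die pool (compare the 2d row below)
table(replicate(1e5, action_roll(2))) / 1e5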

Resistance roll

When a player suffers a consequence, they can resist it. To do so, they gather a pool of dice for the resistance roll and spend an amount of stress equal to six minus the highest result. Again, if they have zero dice in the pool, they roll two dice and take the worst. If the player rolls a 6, the character takes no stress. If they roll a 1, the character takes 5 stress (which would very likely take them out of the action). If the player has multiple dice and rolls two or more 6s, they actually reduce 1 stress.

For resistance rolls, the value between 1 and 6 matters, not just whether it’s in 1-3, in 4-5, equal to 6, or if there are two 6s.
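
And a matching R sketch for resistance rolls, where stress is six minus the highest die and a critical (two or more 6s) clears one stress (resistance_stress is again just an illustrative helper):

# stress taken on a resistance roll with a pool of n dice
resistance_stress <- function(n) {
  if (n == 0)
    return(6 - min(sample(1:6, 2, replace = TRUE)))   # 0 dice: roll 2, take lowest
  dice <- sample(1:6, n, replace = TRUE)
  if (sum(dice == 6) >= 2) -1 else 6 - max(dice)      # two or more 6s: -1 stress
}

# estimated stress distribution for a 4-die pool
table(replicate(1e5, resistance_stress(4))) / 1e5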

Probabilities

Resistance rolls are rank statistics for pools of six-sided dice. Action rolls just group those. Plus a little sugar on top for criticals. We could do this the hard way (combinatorics) or we could do this the easy way. That decision was easy.

Here’s a plot of the results for action rolls, with dice pool size on the x-axis and line plots of results 1-3 (fail plus a complication), 4-5 (succeed with complication), 6 (succeed) and 66 (critical success with benefit). This is based on 10M simulations.



You can find a similar plot from Jasper Flick on AnyDice, in the short note Blades in the Dark.

I find the graph pretty hard to scan, so here’s a table in ASCII format, which also includes the resistance roll probabilities. The 66 result (at least two 6 rolls in the dice pool) is a possibility for both a resistance roll and an action roll. Both decimal places should be correct given the 10M simulations.

DICE   RESISTANCE                      ACTION           BOTH

DICE    1    2    3    4    5    6     1-3  4-5    6      66
----  ----------------------------     -------------    ----
 0d   .31  .25  .19  .14  .08  .03     .75  .22  .03     .00

 1d   .17  .17  .17  .17  .17  .17     .50  .33  .17     .00
 2d   .03  .08  .14  .19  .25  .28     .25  .44  .28     .03

 3d   .01  .03  .09  .17  .29  .35     .13  .45  .35     .07
 4d   .00  .01  .05  .14  .29  .39     .06  .42  .39     .13

 5d   .00  .00  .03  .10  .27  .40     .03  .37  .40     .20
 6d   .00  .00  .01  .07  .25  .40     .02  .32  .40     .26

 7d   .00  .00  .01  .05  .22  .39     .01  .27  .39     .33
 8d   .00  .00  .00  .03  .19  .38     .00  .23  .38     .39

One could go for more precision with more simulations, or resort to working them all out combinatorially.

The hard way

The hard way is a bunch of combinatorics. These aren’t too bad because of the way the dice are organized. For the highest value when throwing N dice, the probability that the value is less than or equal to k is the probability that a single die is less than or equal to k, raised to the N-th power, i.e., (k/6)^N. It’s just that there are a lot of cells in the table. And then the differences would be required. Too error prone for me. Criticals can be handled Sherlock Holmes style by subtracting the probability of a non-critical from one. A non-critical either has no sixes (5^N possibilities with N dice) or exactly one six ((N choose 1) * 5^(N - 1) possibilities). That’s not so bad. But there are a lot of entries in the table. So let’s just simulate.
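
For the record, here’s the closed-form calculation for action rolls in a few lines of R (exact, no simulation; action_probs is just my name for the helper), using the fact that the probability the highest of N dice is at most k is (k/6)^N.

# exact action roll probabilities for a pool of N >= 1 dice
action_probs <- function(N) {
  p_max_le <- (0:6 / 6)^N                                    # P(highest <= k), k = 0..6
  p_crit <- 1 - (5 / 6)^N - N * (1 / 6) * (5 / 6)^(N - 1)    # two or more 6s
  c("1-3" = p_max_le[4],                  # highest is 1, 2, or 3
    "4-5" = p_max_le[6] - p_max_le[4],    # highest is 4 or 5
    "6" = 1 - p_max_le[6] - p_crit,       # exactly one 6
    "66" = p_crit)
}

round(action_probs(3), 2)   # matches the 3d row of the table above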

Edit: Cumulative Probability Tables

I really wanted the cumulative probability tables of a result or better (I suppose I could’ve also done it as result or worse). I posted these first on the Blades in the Dark forum. It uses Discourse, just like Stan’s forum.

Action Rolls

Here’s the cumulative probabilities for action rolls.



And here’s the table of cumulative probabilities for action rolls, with 66 representing a critical, 6 a full success, and 4-5 a partial success:

ACTION ROLLS, CUMULATIVE
        probability of result or better
 dice   4-5+     6+     66
    0  0.250  0.028  0.000
    1  0.500  0.167  0.000
    2  0.750  0.306  0.028
    3  0.875  0.421  0.074
    4  0.938  0.518  0.132
    5  0.969  0.598  0.196
    6  0.984  0.665  0.263
    7  0.992  0.721  0.330
    8  0.996  0.767  0.395

Resistance Rolls

And here are the basic probabilities for resistance rolls.



Here’s the table of stress probabilities based on dice pool size.

RESISTANCE ROLLS

             Probability of Stress
Dice    5    4    3    2    1    0   -1
   0  .31  .25  .19  .14  .08  .03  .00 
   1  .17  .17  .17  .17  .17  .17  .00
   2  .03  .08  .14  .19  .25  .28  .03
   3  .00  .03  .09  .17  .28  .35  .07
   4  .00  .01  .05  .13  .28  .39  .13
   5  .00  .00  .03  .10  .27  .40  .20
   6  .00  .00  .01  .07  .25  .40  .26
   7  .00  .00  .01  .05  .22  .39  .33
   8  .00  .00  .00  .03  .19  .37  .40

Here’s the plot for the cumulative probabilities for resistance rolls.



Here’s the table of cumulative resistance rolls.

RESISTANCE ROLLS, CUMULATIVE

             Probability of Stress or Less
Dice       5      4     3     2    1    0     -1
   0    1.00    .69   .44   .25   .11  .03   .00 
   1    1.00    .83   .67   .50   .33  .17   .00
   2    1.00    .97   .89   .75   .56  .31   .03
   3    1.00   1.00   .96   .87   .70  .42   .07
   4    1.00   1.00   .99   .94   .80  .52   .13
   5    1.00   1.00  1.00   .97   .87  .60   .20
   6    1.00   1.00  1.00   .98   .91  .67   .26
   7    1.00   1.00  1.00   .99   .94  .72   .33
   8    1.00   1.00  1.00  1.00   .96  .77   .40

For example, with 4 dice (the typical upper bound for resistance rolls), there’s an 80% chance that the character takes 1, 0, or -1 stress, and 52% chance they take 0 or -1 stress. With 0 dice, there’s a better than 50-50 chance of taking 4 or more stress because the probability of 3 or less stress is only 44%.

Finally, here’s the R code for the resistance and cumulative resistance.

# row = dice, col = c(1:6, 66)
resist <- matrix(0, nrow = 8, ncol = 7)
resist[1, 1:6] <- 1/6
for (d in 2:8) {
  for (result in 1:5) {
    resist[d, result] <-
      sum(resist[d - 1, 1:result]) * 1/6 +
      resist[d - 1, result] *  (result -1) / 6
  }
  resist[d, 6] <- sum(resist[d - 1, 1:5]) * 1/6 +
                  sum(resist[d - 1, 6]) * 5/6
  resist[d, 7] <- resist[d - 1, 7] + resist[d - 1, 6] * 1/6
}

cumulative_resist <- resist  # just for sizing
for (d in 1:8) {
  for (result in 1:7) {
    cumulative_resist[d, result] <- sum(resist[d, result:7])
  }
}

library('reshape')
library('ggplot2')


zero_dice_probs <-  c(11, 9, 7, 5, 3, 1, 0) / 36
zero_dice_cumulative_probs <- zero_dice_probs
for (n in 1:7)
  zero_dice_cumulative_probs[n] <- sum(zero_dice_probs[n:7])

z <- melt(cumulative_resist)  # X1 = dice, X2 = result, value = prob
stress <- 6 - z$X2
df <- data.frame(dice = z$X1, stress = as.factor(stress), prob = z$value)
df <- rbind(df, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7), prob = zero_dice_cumulative_probs))

cumulative_plot <- ggplot(df, aes(x = dice, y = prob,
                   colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress or less") +
  scale_x_continuous(breaks = 0:8)
cumulative_plot
ggsave('cumulative-resistance.jpg', plot = cumulative_plot, width = 5, height = 4)


z2 <- melt(resist)  # X1 = dice, X2 = result, value = prob
stress2 <- 6 - z2$X2
df2 <- data.frame(dice = z2$X1, stress = as.factor(stress2), prob = z2$value)
df2 <- rbind(df2, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7),
                             prob = zero_dice_probs))

plot <- ggplot(df2, aes(x = dice, y = prob,
               colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress") +
  scale_x_continuous(breaks = 0:8)
plot
ggsave('resistance.jpg', plot = plot, width = 5, height = 4)

Drunk-under-the-lamppost testing

Edit: Glancing over this again, it struck me that the title may be interpreted as being mean. Sorry about that. It wasn’t my intent. I was trying to be constructive and I really like that analogy. The original post is mostly reasonable other than on this one point that I thought was important to call out.

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at the very least the author and one trusted team member. Sampling-based code review that’s biased toward where the group thinks errors may be has the drunks-under-the-lamppost problem of not covering a large chunk of the code. Software developers obsess over test coverage, but full coverage is very challenging and we haven’t been able to get there with Stan. If we were developing flight control or pacemaker or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.

It’s also critical to keep the three key components of software development in synch: documenting (i.e., design), testing, and coding. In larger projects, features of any size always start with a functional spec outlining how it works from the client point of view—that’s usually written the way the eventual documentation will be written, because the documentation is what says what the code does. Even at the doc stage, the key is to make sure the API being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments: shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With rate and scale being redundant inverses of each other, the tests need to cover behavior for all 8 possible patterns of supplying or omitting the three arguments. The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.
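
For what it’s worth, here’s a quick R check of the rate/scale convention (I’m deliberately not passing both arguments at once, since the behavior in that case is exactly what’s under-documented):

# rate and scale are inverses: these two calls give identical densities
x <- c(0.5, 1, 2, 5)
dgamma(x, shape = 2, rate = 4)
dgamma(x, shape = 2, scale = 1 / 4)   # same values as the line above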

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because it’s much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code, I’m just saying the best practices there would be different than for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend maybe one day/month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.

Super-duper online matrix derivative calculator vs. the matrix normal (for Stan)

I’m implementing the matrix normal distribution for Stan, which provides a multivariate density for a matrix with covariance factored into row and column covariances.

The motivation

A new colleague of mine at Flatiron’s Center for Comp Bio, Jamie Morton, is using the matrix normal to model the ocean biome. A few years ago, folks in Richard Bonneau’s group at Flatiron managed to extend the state of the art from a handful of dimensions to a few dozen dimensions using Stan, as described here.

That’s a couple orders of magnitude short of the models Jamie would like to fit. So instead of Stan, which he’s used before and is using for other smaller-scale problems, he’s turning to the approach in the following paper:

I’d just like to add matrix normal to Stan and see if we can scale up Äijö et al.’s results a bit.

The matrix normal

The density is defined for an

  • $latex N \times P$ observation matrix $latex Y$

with parameters

  • $latex N \times P$ mean matrix $latex M$,
  • positive-definite $latex N \times N$ row covariance matrix $latex U$, and
  • positive-definite $latex P \times P$ column covariance matrix $latex V$

as
$latex \displaystyle \textrm{matrix\_normal}(Y \mid M, U, V) = \frac{\displaystyle \exp\left(-\frac{1}{2} \cdot \textrm{tr}\left(V^{-1} \cdot (Y - M)^{\top} \cdot U^{-1} \cdot (Y - M)\right)\right)}{\displaystyle (2 \cdot \pi)^{{N \cdot P} / {2}} \cdot \textrm{det}(V)^{{N} / {2}} \cdot \textrm{det}(U)^{{P} / {2}}}.$

It relates to the multivariate normal through vectorization (stacking the columns of a matrix) and Kronecker products as

$latex \textrm{matrix\_normal}(Y \mid M, U, V) = \textrm{multi\_normal}(\textrm{vec}(Y) \mid \textrm{vec}(M), V \otimes U).$
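
As a quick sanity check, and only a sketch of the math rather than the eventual Stan code, here’s R verifying the trace form of the log density against the vectorized multivariate normal (this assumes the mvtnorm package for dmvnorm):

# matrix normal log density (trace form) vs. multi_normal of vec(Y)
# with covariance kron(V, U); U is N x N (rows), V is P x P (columns)
library(mvtnorm)
set.seed(1)
N <- 3; P <- 2
Y <- matrix(rnorm(N * P), N, P)
M <- matrix(0, N, P)
U <- crossprod(matrix(rnorm(N * N), N, N)) + diag(N)   # random SPD row covariance
V <- crossprod(matrix(rnorm(P * P), P, P)) + diag(P)   # random SPD column covariance

E <- Y - M
lp_trace <- -0.5 * (sum(diag(solve(V) %*% t(E) %*% solve(U) %*% E))
                    + N * P * log(2 * pi) + N * log(det(V)) + P * log(det(U)))
lp_vec <- dmvnorm(as.vector(Y), as.vector(M), kronecker(V, U), log = TRUE)
lp_trace - lp_vec   # zero up to rounding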

After struggling with tensor differentials for the derivatives of the matrix normal with respect to its covariance matrix parameters long enough to get a handle on how to do it but not long enough to get good at it or trust my results, I thought I’d try Wolfram Alpha. I could not make that work.

matrixcalculus.org to the rescue

After a bit more struggling, I entered the query [matrix derivative software] into Google and the first hit was a winner:

This beautiful piece of online software has a 1990s interface and 2020s functionality. I helped out by doing the conversion to log scale and dropping constant terms,

$latex \begin{array}{l}\displaystyle\log \textrm{matrix\_normal}(Y \mid M, U, V) \\[8pt] \qquad = -\frac{1}{2} \cdot \left(\textrm{tr}\!\left[V^{-1} \cdot (Y - M)^{\top} \cdot U^{-1} \cdot (Y - M)\right] + N \cdot \log \textrm{det}(V) + P \cdot \log \textrm{det}(U) \right) + \textrm{const}.\end{array}$

Some of these terms have surprisingly simple derivatives, like $latex \frac{\partial}{\partial U} \log \textrm{det}(U) = U^{-1}$. Nevertheless, when you push the differential through, you wind up with tensor derivatives inside that trace which are difficult to manipulate without guesswork at my level of matrix algebra skill.
Unlike me, the matrix calculus site chewed this up and spit it out in neatly formatted LaTeX in a split second:



Despite the simplicity, this is a really beautiful interface. You enter a formula and it parses out the variables and asks you for their shape (scalar, vector, row vector, or matrix). Another win for declaring data types: it lets the interface resolve all the symbols. There may be a way to do that in Wolfram Alpha, but it’s not baked into their interface.

The paper

There’s a very nice paper behind the tool that explains what they did and how it relates to autodiff.

I haven’t digested it all, but as you may suspect, they implement a tensor algebra for derivatives. Here’s the meaty part of the abstract.

The framework can be used for symbolic as well as for forward and reverse mode automatic differentiation. Experiments show a speedup of up to two orders of magnitude over state-of-the-art frameworks when evaluating higher order derivatives on CPUs and a speedup of about three orders of magnitude on GPUs.

As far as I can tell, the tool is only first order. But that’s all we need for specialized forward and reverse mode implementations in Stan. The higher-order derivatives start from reverse mode and then nest one or more forward-mode instances. I’m wondering if this will give us a better way to specialize fvar<var> in Stan (the type used for second derivatives).

Some suggestions for the tool

It wouldn’t be Andrew’s blog without suggestions about improving interfaces. My suggestions are to

  1. check the “Common subexpressions” checkbox by default (the alternative is nice for converting to code),
  2. stick to a simple checkbox interface indicating with a check that common subexpression sharing is on (as is, the text says “On” with an empty checkbox when it’s off and “Off” with an empty checkbox when it is on),
  3. get rid of the extra vertical whitespace output at the bottom of the return box,
  4. provide a way to make the text entry box bigger and multi-line (as is, I composed in emacs and cut-and-pasted into the interface), and
  5. allow standard multi-character identifiers (it seems to only allow single-character variable names).

I didn’t try the Python output, but that’s a great idea if it produces code to compute both the function and partials efficiently.

Translating to the Stan math library

With the gradient in hand, it’s straightforward to define efficient forward-mode and reverse-mode autodiff in Stan using our general operands-and-partials builder structure. But now that I look at our code, I see our basic multivariate-normal isn’t even using that efficient code. So I’ll have to fix that, too, which should be a win for everyone.

What I’m actually going to do is define the matrix normal in terms of Cholesky factors. That has the huge advantage of not needing to put a whole covariance matrix together only to factor it (we need a log determinant). Sticking to Cholesky factors the whole way is much more arithmetically stable and requires only quadratic-time triangular solves rather than a cubic-time factorization.

In terms of using the matrix derivative site, just replace U and V with (L_U * L_U’) and (L_V * L_V’) in the formulas and it should be good to go. Actually, it won’t work, because the parser apparently requires single-letter variables. So I wound up just using (U * U’), which does work.

Here’s the formula I used for the log density up to a constant:

-1/2 * (N * log(det(V * V')) + P * log(det(U * U')) + tr([inv(V * V') * (Y - M)'] * [inv(U * U') * (Y - M)]))

Because there’s no way to tell it that U and V are Cholesky factors, and hence that U * U’ is symmetric and positive definite, I had to do some reduction by hand, such as

inv(U * U') * U = inv(U')

That yields a whole lot of common subexpressions between the function and its gradient, and I think I can go a bit further by noting (U * U’) = (U * U’)’.
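
A quick numerical check of that reduction in R, with U playing the role of a lower-triangular Cholesky factor:

# verify inv(U * U') * U == inv(U') for a lower-triangular Cholesky factor U
set.seed(1)
A <- crossprod(matrix(rnorm(16), 4, 4)) + diag(4)   # random SPD matrix
U <- t(chol(A))                                     # lower-triangular Cholesky factor
max(abs(solve(U %*% t(U)) %*% U - solve(t(U))))     # zero up to rounding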

matrix_normal_lpdf(Y | M, U, V)
= -1/2 * (N * log(det(V * V'))
  + P * log(det(U * U'))
  + tr([inv(V * V') * (Y - M)'] * [inv(U * U') * (Y - M)]))
  + const

d/d.Y: -[inv(U * U') * (Y - M)] * inv(V * V')

d/d.M: [inv(U * U') * (Y - M)] * inv(V * V')

d/d.U: -P * inv(U')
       + [inv(U * U') * (Y - M)] * [inv(V * V') * (Y - M)'] * inv(U')

d/d.V: -N * inv(V')
       + [inv(V * V') * (Y - M)'] * [inv(U * U') * (Y - M)] * inv(V')

This is totally going in the book

Next up, I’d like to add all the multivariate densities to the following work in progress.

There’s a draft up on GitHub with all the introductory material and a reference C++ implementation and lots of matrix derivatives and even algebraic solvers and HMMs. Lots still to do, though.

We haven’t done the main densities yet, so we can start with multivariate normal and T with all four parameterizations (covariance, precision and Cholesky factors of same), along with the (inverse) Wisharts and LKJ we need for priors. This matrix calculus site’s going to make it easy to deal with all the Cholesky-based parameterizations.

If you’d like to help implement these in Stan or join the effort on the handbook, let me know. I recently moved from Columbia University to the Flatiron Institute and I’m moving to a new email address: [email protected]. I didn’t change GitHub handles. Flatiron is great, by the way, and I’m still working with the Stan dev team, including Andrew.

Make Andrew happy with one simple ggplot trick

By default, ggplot expands the space above and below the x-axis (and to the left and right of the y-axis). Andrew has made it pretty clear that he thinks the x axis should be drawn at y = 0. To remove the extra space around the axes when you have continuous (not discrete or log scale) axes, add the following to a ggplot plot,

plot <-
  plot + 
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0))

Maybe it could even go in a theme.

Hats off to A5C1D2H2I1M1N2O1R2T1 (I can't make these handles up) for posting the solution on Stack Overflow.

Prior predictive, posterior predictive, and cross-validation as graphical models

I just wrote up a bunch of chapters for the Stan user’s guide on prior predictive checks, posterior predictive checks, cross-validation, decision analysis, poststratification (with the obligatory multilevel regression up front), and even bootstrap (which has a surprisingly elegant formulation in Stan now that we have RNGs in transformed data).
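In case you’re curious, the bootstrap trick is roughly this: RNG functions are now allowed in the transformed data block, so you can resample the data with replacement once per run and fit the model to the resampled copy. Here’s a minimal sketch with made-up names, not the chapter’s version:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  // resample (x, y) with replacement, once per run
  vector[N] x_boot;
  vector[N] y_boot;
  vector[N] unif = rep_vector(1.0 / N, N);
  for (n in 1:N) {
    int i = categorical_rng(unif);  // index drawn uniformly from 1:N
    x_boot[n] = x[i];
    y_boot[n] = y[i];
  }
}
// ... the rest of the program fits the model to x_boot and y_boot

Run it repeatedly and each run gets its own resampled data set.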

Andrew then urged me to look at figure 1 of Gelman, Meng, and Stern’s 1996 paper on posterior predictive assessment (I would’ve included the diagram if the journal didn’t own the copyright). I looked at figure 1 and got really confused about why y and y_rep were on the same side of theta. Then Andrew said it was like a probabilistic program, which I took to mean you could write it in a BUGS-style directed graphical modeling language. Below, I write down how I’d code prior and posterior predictive checks and cross-validation in a graphical modeling language.

Let me present all this in a unified way, the way I think is clearest, which does away with arrows and boxes and just uses graphical modeling notation (like BUGS or Stan supports). Then Andrew can fill all of us in on what I’m missing.

A simple regression example

Nothing interesting here, just a simple regression of a univariate outcome given a univariate predictor. I’ll use explicit indexing to make it clear where there are multivariate quantities. I’m also just throwing down an arbitrary prior for completeness.

a ~ normal(0, 2)  
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:N] ~ normal(a + b * x[1:N], s)

The variables a, b, and s are parameters, whereas y and x are the observed outcomes and predictors. y and x will be known, and we’ll run something like Stan to get posterior draws

a, b, s ~ p(a, b, s | x, y)

Posterior predictive checks

To get posterior predictive draws, we add a single line to the graphical model,

a ~ normal(0, 2)  
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:N] ~ normal(a + b * x[1:N], s)
y_rep[1:N] ~ normal(a + b * x[1:N], s)

Here y_rep is declared as a parameter in Stan, because it’s not observed. Also note that the same x values are used for both y and y_rep.
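In Stan, that looks roughly like this (a sketch just to show where y_rep lives):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> s;
  vector[N] y_rep;  // unobserved, so it goes in parameters
}
model {
  a ~ normal(0, 2);
  b ~ normal(0, 2);
  s ~ lognormal(0, 1);
  y ~ normal(a + b * x, s);
  y_rep ~ normal(a + b * x, s);
}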

We observe y and x as before, but now get posterior draws for y_rep in addition to the regression parameters,

a, b, s, y_rep ~ p(a, b, s, y_rep | x, y)

We just throw away the draws for the parameters and get draws from the posterior predictive distribution

y_rep ~ p(y_rep | x, y)

Monte Carlo methods are so much easier than calculus.

Prior predictive checks

This one just drops the line with the data, but continues to use the same predictor vector x for the replications. The graphical model is

a ~ normal(0, 2)  
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y_rep[1:N] ~ normal(a + b * x[1:N], s)

Our posterior draws in a system like Stan now look like

a, b, s, y_rep ~ p(a, b, s, y_rep | x)

and we throw away the parameters again to get prior predictive draws,

y_rep ~ p(y_rep | x)
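In Stan terms, it’s the same sketch as before with the two lines involving y dropped; y doesn’t even need to be passed in as data:

data {
  int<lower=0> N;
  vector[N] x;  // no y needed
}
parameters {
  real a;
  real b;
  real<lower=0> s;
  vector[N] y_rep;
}
model {
  a ~ normal(0, 2);
  b ~ normal(0, 2);
  s ~ lognormal(0, 1);
  y_rep ~ normal(a + b * x, s);
}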

Held-out evaluation and cross-validation

Suppose we divide our N data items up into a training set of size M and test set of size N – M. We’ll train on the training set then predict for the test set.

a ~ normal(0, 2)  
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:M] ~ normal(a + b * x[1:M], s)
y[M + 1:N] ~ normal(a + b * x[M + 1:N], s)

We’ll provide y[1:M] and x[1:N] as data (that is, the training set of y and all of x). Then we get joint posterior draws for the parameters and the held-out outcomes:

a, b, s, y[M + 1:N] ~ p(a, b, s, y[M + 1:N] | x[1:N], y[1:M])

and we again just drop the parameters to get posterior predictive draws for evaluation

y[M + 1:N] ~ p(y[M + 1:N] | x[1:N], y[1:M])

For cross-validation, you just provide different slices, or random slices; I show how to do that in the forthcoming user’s guide chapters. I also show how to use the generated quantities block so that the predictive draws are pure Monte Carlo, which cuts down on computation compared to sampling them with MCMC. But that’s just an implementation efficiency detail.
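For concreteness, here’s roughly what the held-out evaluation above looks like with the test-set predictions moved to generated quantities (a sketch only; the variable names are just illustrative):

data {
  int<lower=0> M;            // training set size
  int<lower=M> N;            // total number of items
  vector[N] x;               // all of the predictors
  vector[M] y;               // training outcomes only
}
parameters {
  real a;
  real b;
  real<lower=0> s;
}
model {
  a ~ normal(0, 2);
  b ~ normal(0, 2);
  s ~ lognormal(0, 1);
  y ~ normal(a + b * x[1:M], s);
}
generated quantities {
  // posterior predictive draws for the test set
  vector[N - M] y_test;
  for (n in 1:(N - M))
    y_test[n] = normal_rng(a + b * x[M + n], s);
}

Each posterior draw of (a, b, s) yields a draw of y_test, so the predictive draws are plain Monte Carlo given the MCMC draws of the parameters.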

What about Gelman, Meng, and Stern’s diagram?

I’m still confused. But now my confusion is more about why there are multiple y_rep variables. I only use one in my approach; we then get simulations for it which characterize the relevant predictive distribution. Also, I don’t understand why there’s only one theta in the posterior predictive diagram (1a), whereas there are multiples in the prior predictive diagram. To me, the only difference is that the edge for y doesn’t show up in the prior predictive check. It’s something you have, but not something that’s in the model. I think what Gelman, Meng, and Stern are doing here is trying to include y in the prior predictive model. My guess is that Andrew’s going to say that they know y at the point the prior predictive check is performed, and all data should be on the table. Unfortunately, that requires what’s essentially a BUGS-style cut in the graphical model, where you don’t let information from y bleed into theta. The multiple thetas are an attempt to render that without a cut. At least that’s my guess. Let’s see what Andrew says. (I could just ask via email, but it’s more fun to do my homework in front of a live audience.)

Naming conventions for variables, functions, etc.

The golden rule of code layout is that code should be written to be readable. And that means readable by others, including you in the future.

Three principles of naming follow:

1. Names should mean something.

2. Names should be as short as possible.

3. Use your judgement to balance (1) and (2).

The third one’s where all the fun arises. Do we use “i” or “n” for integer loop variables by convention? Yes, we do. Do we choose “inv_logit” or “inverse_logit”? Stan chose “inv_logit”. Do we choose “complex” or “complex_number”? C++ chose “complex”, as well as choosing “imag” over “imaginary” for the method to pull the imaginary component out.

Do we use names like “run_helper_function”, which is both long and provides zero clue as to what it does? We don’t if we want to do unto others as we’d have them do unto us.

P.S. If the producers of Silicon Valley had asked me, Winnie would’ve dumped Richard after a fight about Hungarian notation, not tabs vs. spaces.