Another physically motivated sampler: Microcanonical HMC

This time it’s astrophysicists rather than quantum physicists getting in on the sampling action.

Microcanonical HMC

Uroš Seljak, an astrophysics professor at UC Berkeley, and collaborators developed a form of Hamiltonian Monte Carlo (HMC) sampling with an alternative approach to energy related to underdamped Langevin dynamics. Here’s the link to the arXiv paper.

Uroš presented a preliminary version last month when he visited Flatiron Institute for our workshop on measure transport, diffusion processes, and sampling (a topic that turned out to be way more popular than my co-organizer Michael Albergo and I anticipated).

Meaningful evaluation vs. NUTS

I like that the microcanonical HMC paper demonstrates an auto-tuning scheme that shows impressive improvements over the no-U-turn sampler from Stan, rather than using vanilla Hamiltonian Monte Carlo as a baseline. Vanilla HMC is very difficult to tune to work at all and even harder to tune for efficiency, especially without jittering the integration time (step size or number of steps).

Langevin diffusions everywhere

I’m seeing Langevin diffusions everywhere these days. Chirag Modi, a joint astro and math postdoc here at Flatiron Institute who did his Ph.D. with Uroš, Alex Barnett, Edward Roualdes, and I are working on mashing up our recent work on delayed rejection HMC for multiscale distributions with Radford Neal’s latest partial momentum refresh Langevin sampler, plus a dash of parallel auto-adaptation courtesy of Matt Hoffman and Pavel Sountsov’s latest sampler, MEADS. This is the project that motivated Edward to start BridgeStan.

The field is heating up

The field of NUTS competitors is continuing to heat up (physical systems pun purely coincidental). Stay tuned for the results of BridgeStan implementations with more extensive evaluations based on posteriordb.

BridgeStan: Stan model log densities, gradients, Hessians, and transforms in R, Python, and Julia

We’re happy to announce the official 1.0.0 release of BridgeStan.

What is BridgeStan?

From the documentation:

BridgeStan provides efficient in-memory access through Python, Julia, and R to the methods of a Stan model, including log densities, gradients, Hessians, and constraining and unconstraining transforms.

BridgeStan should be useful for developing algorithms and deploying applications. It connects to R and Python through low-level foreign function interfaces (.C and ctypes, respectively) and is thus very performant and portable. It is also easy to install and keep up to date with Stan releases. BridgeStan adds the model-level functionality from RStan/PyStan that is not implemented in CmdStanR/CmdStanPy.
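Here’s roughly what that looks like from Python (a minimal sketch; the constructor arguments, file paths, and method names here are my recollection of the 1.0 interface, so check the documentation before copying):

import bridgestan as bs
import numpy as np

# Load a compiled Stan model and bind it to data (paths are hypothetical).
model = bs.StanModel('./bernoulli_model.so', './bernoulli.data.json')

theta_unc = np.array([0.1])                       # unconstrained parameters
lp, grad = model.log_density_gradient(theta_unc)  # log density and its gradient
theta = model.param_constrain(theta_unc)          # map back to the constrained scale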

Documentation and source code

Detailed forum post

Here’s a post on the Stan forums with much more detail:

Among other things, it describes the history and relations to other Stan-related projects. Edward Roualdes started the project in order to access Stan models through Julia, then Brian Ward and I (mostly Brian!) helped Edward finish it, with some contributions from Nicholas Siccha and Mitzi Morris.

OpenAI’s GPT chat bot enters the uncanny valley

The baseline: Shannon (1948)

The first attempt at computer generation of human language of which I’m aware was by Claude Shannon in his seminal 1948 paper, A Mathematical Theory of Communication. Not only did this paper introduce the entire field of information theory, it also applied a notion of posterior predictive checking to language models. That is, it could generate novel language based on bigram statistics (predicting conditional probability of a word given the previous word). Here’s the example from Shannon’s paper:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Computing was pretty much all caps until the 1980s.
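For fun, here’s a minimal sketch of the same idea in Python: estimate bigram transitions by counting and then sample a chain. The toy corpus is just a fragment of Shannon’s own output, so the results are purely illustrative:

import random
from collections import defaultdict

corpus = ('the character of this point is therefore another method '
          'for the letters that the time of who ever told the problem').split()

# Each word's observed continuations; duplicates encode bigram frequencies.
bigrams = defaultdict(list)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev].append(word)

word = random.choice(corpus)
output = [word]
for _ in range(20):
    nexts = bigrams[word]
    word = random.choice(nexts) if nexts else random.choice(corpus)
    output.append(word)

print(' '.join(output).upper())   # all caps, in honor of 1948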

The state of the art: OpenAI GPT-3 (2022)

Fast forward 75 years. I’ve recently been hearing a lot about GPT-3, the language model developed by OpenAI. So I signed up for the OpenAI ChatGPT API and then went to their playground to see what ChatGPT could do.

Now I’m getting that same uneasy feeling I felt watching the lifelike movement and reactions of the Boston Dynamics robots. It’s the uncanny valley effect of watching something that’s almost, but not quite fully lifelike.

Here are the first prompts I gave it (bold) and its responses. I chose the prompts knowing I’d be posting on the blog. Like all the other spell checkers, it broke “poststratification” into two words on its own.

Write an introduction for Andrew Gelman giving a seminar on post stratification to R users.

Good morning everyone. It’s a pleasure to be here today to discuss post stratification with all of you R users. My name is Andrew Gelman and I am a statistician, professor of political science and statistics at Columbia University, and co-director of the Applied Statistics Center. I am here to talk to you about post stratification, a widely used statistical technique. Post stratification is used to adjust survey estimates when they are known to be biased due to the sampling design or nonresponse. In this seminar, I will provide an overview of post stratification, discuss how to perform the calculations in R, and provide examples of how post stratification can be applied. I hope that by the end of this seminar, you will have a better understanding of post stratification and be able to use it to improve the accuracy and reliability of your survey estimates.

I had intended to get an introduction of Andrew, not an introduction by Andrew, but prepositions like “for” are ridiculously overloaded in human languages. That is, I was expecting something like, “Professor Gelman needs no introduction,” or maybe, “Andrew is currently a statistics professor at Columbia University in the City of New York. Many of you may know him through his book, BDA3, or his latest stories about regression.”

But what if Jonah were talking instead of Andrew?

Write an introduction for Jonah Gabry giving a seminar on post stratification to R users.

Good morning everyone! My name is Jonah Gabry and I’m here to give a seminar on post-stratification to R users. Post-stratification is an important statistical technique used to adjust survey results to better reflect the true population. This seminar will cover the basics of post-stratification, such as the different types of post-stratification, how to implement post-stratification in R, and how to interpret the results. At the end of the seminar, I hope that you’ll have a better understanding of post-stratification and how to use it in your own research.

I guess Jonah’s not so well known on the internet, so he doesn’t mention who he is in the intro. Jonah’s also a bit more excited saying good morning (exclamation point!), but GPT doesn’t have him mention his pleasure at being here. Otherwise, it sort of feels canned. But the language varies in “At the end of the seminar, I hope that you’ll …” vs. “I hope that by the end of this seminar, you will …” When you see two of these, they sound like boilerplate, but they’re not just sentences retrieved from the internet.

Going back to Andrew, let’s have him frame the problem a bit before diving in.

How is post stratification different than survey weighting?

Survey weighting is a process which adjusts responses to a survey to reflect the population’s proportions in the sample. This is done by applying weights to individual responses so that the aggregate results of the survey match the known population proportions.

Post stratification, on the other hand, is a process which further refines the results of a survey by taking into account additional information about the population after the survey has been conducted. This involves applying population estimates to the survey results in order to adjust the results to better reflect the population. Post stratification can help to reduce survey bias and improve the accuracy of the survey results.

Now that we have an intro and a body, how about a conclusion?

Write the conclusion to a talk on statistics by Andrew Gelman.

In conclusion, statistics can be a powerful tool to help us better understand our world. By leveraging the power of statistical thinking and analysis, we can gain insight into complex systems, uncover new information, and make decisions with confidence. With the right tools and the right mindset, we can use the power of statistics to make our world a better place. Thank you.

A bit generic, so not really in Andrew’s inimitable, concrete style. But asking this to mimic Andrew’s style is asking too much. It does give what sounds just like the end of a talk. Bravo!

These are, so far, the only sentences I’ve tried. I have to say, it’s pretty amazing. In some ways, it’s a lot like Watson’s approach to Jeopardy: free associating and copying standard information. And while a lot of it seems like boilerplate, it is actually varying a bit.

But is it funny?

Of course, the real test is whether it can do comedy. Let’s try the famous Abbott and Costello sketch, Who’s on First? I’ll feed it Lou Costello’s side of the dialogue and see what we get.

I love baseball.

That’s great! Baseball is a great sport to watch and play.

When we get to St. Louis, will you tell me the guys’ names on the team, so when I go to see them in that St. Louis ballpark, I’ll be able to know those fellas?

Sure! Currently, the Cardinals roster includes Yadier Molina, Paul Goldschmidt, Matt Carpenter, Dexter Fowler, Kolten Wong, Tommy Edman, Paul DeJong, Harrison Bader, Marcell Ozuna, Matt Wieters, Adam Wainwright, Miles Mikolas, Jack Flaherty, Dakota Hudson, Andrew Miller, Giovanny Gallegos, John Brebbia, Genesis Cabrera, Ryan Helsley, and Carlos Martinez.

Cool—it figured out, presumably from the word “ballpark,” that I meant the baseball team. I don’t think it’s storing context. Next line?

Funny names?

1. Salami Sanders
2. Shaggy Hogbottom
3. Wendy Windyweather
4. Daisy Dizzydoodle
5. Rocky Roadrunner
6. Percy Pluperfect
7. Caramel Crunch
8. Ginger Giggles
9. Tofu Twinkles
10. Felix the Flamingo

Maybe I should’ve asked it to be funny?

Not as funny as my name, Sebastian Dinwiddle.

Rate limit reached for default-text-davinci-003-playground in organization org-tbVmbwds9qRPdKPSBoaf6Q0v on requests per min. Limit: 10.000000 / min. Current: 20.000000 / min. Contact [email protected] if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.

😂 Oh well, what do you expect for free?

Hiring at Flatiron Institute (NYC): research scientists, postdoctoral fellows, interns, and software engineers

Edit: Before going into detail, I would like to make it clear up front that I’m specifically looking for people with experience and a publication record in algorithms, open-source software, and directly relevant theory and methodology. Flatiron Institute as a whole is focused on the physical and biological sciences (not education, social science, finance, etc.).

Edit #2: I don’t know anything yet about the new Salary Transparency in Job Advertisements law and what that will mean for our job ads.

Hiring season is officially open! I’m now the group leader for computational statistics and am looking to hire at all levels.

Jobs

  • Postdoctoral Fellows: 2-year appointments with a 1-year extension baked in.
     
  • Open-rank Research Scientists: permanent appointments at all levels; we typically do not hire straight out of graduate school for these positions.
     
  • Software engineers: permanent software engineering positions.
     
  • Summer interns: we bring in a large number of interns across Flatiron Institute and Simons Foundation, who are mentored by research scientists, postdocs, or software engineers.

Official job ads

Here’s the link to the official Flatiron Institute jobs board, where you will find the postdoctoral fellow and open-rank job ads:

Application dates

[photo: Flatiron Institute cafeteria]

Postdoctoral Fellow applications are due 15 December 2022 for positions with flexible starting dates in 2023. We accept open-rank applications for permanent research scientist positions at any time, though we will also be hiring in the usual academic cycle (i.e., early 2023 interviews and job offers). We do not have an ad for software engineers up, but will create one if we find relevant candidates—we are always looking to hire permanent software engineers.

The official announcement for interns is not yet up on our jobs site, but I will post a separate blog entry when it’s up.

Please contact me directly

If you’re interested in one of these positions, please send me an email directly to discuss ([email protected]).

Center for Computational Mathematics Mission

This is the only place I’ve ever worked where the mission statement lines up so well with my own interests:

CCM’s mission is to create new mathematical approaches, algorithms and software to advance scientific research in multiple disciplines, often in collaboration with other Flatiron Centers.

The other centers have similar missions, because the institute was established to fill the gap in scientific funding for algorithms and software. If the mission appeals to you, you will probably like Flatiron Institute. More specifically, members of our center work on methodology, applications, and theory for computational stats and machine learning, as well as more traditional scientific computing in the form of optimization, differential equation solvers, non-uniform FFTs, phase retrieval, etc.

We are all principal investigators in the sense that there are no top-down goals from the administration other than to work toward our mission. Our president, David Spergel, is an astrophysicist, which gives the whole place a more traditional academic feel than the current MBA-led situation in most universities.

Computational Statistics Group

Our group so far consists of

  • Bob Carpenter (Senior Research Scientist), working on computational statistics algorithms, Stan, and applied statistics (crowdsourcing, coherent diffraction imaging, genomics, etc.)
  • Robert Gower (Research Scientist), a specialist in optimization who’s working with us and others on variational inference and with me on optimization for large-scale transcriptomics and biome modeling.
  • Brian Ward (Software Engineer), who’s been working on the Stan language, interfaces, and applications like CDI.
  • Yuling Yao (Postdoctoral Fellow), who’s working on Bayesian methodology, Bayesian inference and beyond, including some really exciting ML-motivated variational inference algorithms.
  • Charles Margossian (Postdoctoral Fellow), who’s working on massively parallel inference and convergence monitoring, adjoint methods for autodiff of implicit functions (like root finders and differential equation solvers), as well as nested Laplace approximation.

Machine Learning Group

[photo: Flatiron Institute roof garden]

The complementary group to ours is machine learning, led by Lawrence Saul, and we’re hiring in machine learning as well as computational statistics (the line is admittedly blurry and purely administrative). Dave Blei is also visiting one day a week after being here for a year on sabbatical.

In addition to Dave, we have a steady stream of visitors. In our group, Justin Domke is visiting for 5 months in early 2023, and Edward Roualdes (a Stan developer and applied mathematician/statistician) will be back to work on massively parallel inference next summer. Robert Gower also has a steady stream of optimization specialists visiting.

We also have a very active internship program, and both postdocs and research scientists can recruit interns, whom we bring to NY and put up for the summer.

Focus on algorithms and computation

Although most of our permanent staff researchers and software engineers work on open-source software, we realize postdocs need to get a job, so we give them the flexibility to concentrate on research. We also understand that research is an international collaborative effort and we encourage postdocs and research scientists to maintain and develop academic collaborations outside of Flatiron Institute.

A highly collaborative environment

My favorite part of this place is that it’s packed to the rafters with scientists working on deep and interesting problems across the biological and physical sciences (our other centers are for computational astronomy, computational biology, computational neuroscience, and computational quantum physics).

[photo: Christian Mueller at blackboard]

I’m afraid we don’t do social science or humanities computing and are unlikely to hire anyone who’s concentrated mainly in these areas. The centers run popular science talks for each other on a weekly basis, which are really wonderful because they’re pitched at mathematically- and computationally-oriented scientists (so they assume you know basic physics and differential equations). We also do in-house training in computational statistics, machine learning, and applied mathematics—I often say it’s a great place to work for people who like math class (from both the teacher and student side). It’s been a wonderful place to learn new subjects and techniques.

Compute and support

Another great benefit of working here is our large-scale, state-of-the-art compute clusters and the Scientific Computing Core staff, who can help you get your C++, Fortran, or even Python code running efficiently on the cluster.

Like continental Europe and the old Bell Labs, we have a very pleasant lunch culture where we eat together almost every day (it’s not required, but most of us show up; one of the pictures above is our cafeteria and another is the roof deck, where we have receptions after events and usually eat during the summer).

We also have truly fantastic admin support at all levels. I don’t think I could’ve imagined admin support this good and it makes a huge difference. Even our security guards are friendly and helpful.

We’re one of the few places that has sufficient meeting rooms for our staff size (as well as great professional AV support). We’re also one of those places with chalkboards and white boards in all the halls, and it feels very welcoming to be surrounded by mathematical scribbling (that’s my office you can see behind Christian Mueller writing on the blackboard below).

Simons Foundation

Flatiron Institute is part of the Simons Foundation, which is one of the largest non-governmental science funders. Jim and Marilyn Simons funded the foundation with a large enough endowment to exist in perpetuity.

[photo: Flatiron Institute auditorium]

The Simons Foundation also runs all kinds of popular outreach programs around math and science education and engagement, including Quanta Magazine and Sandbox Films (science documentaries), Math for America, an oceanographic data collection arm, observatories, and a very large autism research initiative dedicated to open-access data collection and curation. We even fund arXiv.

We are fully funded through the Simons Foundation and cannot apply for grants, though we are free to collaborate with whomever we want.

Offices and in-person work

We are not in the Flatiron Building, but we have very nice digs in the Flatiron neighborhood of NYC (as you can see above), though hardly anyone has an individual office. Flatiron Institute and Simons Foundation have offices in two buildings at Fifth Avenue and 21st St, only a few blocks from Greenwich Village (NYU and subsidized postdoc housing) and Chelsea (Google and the Hudson River).

We’re 100% back to the office and do not hire people for remote positions.

Best Meetup swag: OVHcloud

Mitzi gave a talk last night at the Paris PyData Meetup. It was hosted by OVHcloud, a cloud provider based in Paris. Not only was the setting amazing (a window-ringed conference room during a thunderstorm, with lightning bolts shooting behind Mitzi), they passed out the best swag ever. Check it out:

At first I thought the thing in the lower left was one of those old-school erasers. But then I realized it was a combination beer key and pizza cutter. How cool is that for a meetup? And naturally, this meetup had both beer in bottles and pizza. But there was no chance to try it out, as the staff had supplied beer keys and had cut all the pizza.

Yes, that’s a pair of socks on top. I remember having lunch at a conference with Stu Shieber years ago, when I had plenty of swag from the startup I was at (SpeechWorks). Stu said we should pass out branded sweatpants, because everyone passed out t-shirts and he had to buy his own sweatpants.

Then there’s the quintessential French writing device—the Bic 4-color pen. Bic’s a French company, so you see these everywhere. They were exotic in the States when I was a kid, and I used to get one for Xmas every year. There’s also a pack of biodegradable playing cards, a tote bag, and a really nice notebook.

Thanks, OVHcloud, both for hosting and the super-cool swag.

0 to 100K users in 10 years: how Stan got to where it is

I did a talk at Simons Foundation on how Stan was initially developed and how it got to where it is today:

It was the first time I’d presented this and I was having so much fun I ran out of time near the end. If you’d like to go more slowly, here’s

Also, here’s a direct link to the TV clip that was played near the start of the talk.

Pathfinder, Causality, and SARS-CoV-2 talks in Paris this week

Tuesday, 11 October 2022

Andrew and I both have talks tomorrow, so I’m bummed I have to miss his. I’ll be talking about Pathfinder at the Polytechnique:

Hi! PARIS Seminar – Bob Carpenter, 11 October 2022

As you may recall from his earlier post, Andrew’s giving a talk on causal inference the same day at Marie Curie in the center of Paris.

Thursday, 13 October 2022

On Thursday, I’ll be giving a talk on SARS-CoV-2 diagnostic testing to France Mentré’s group at Paris Diderot / INSERM. Here’s the title and abstract:

Test Site Variation and Sensitivity Time Trends in Covid Diagnostic Tests

I’ll present two relatively simple models of Covid test sensitivity and specificity. In the first, sensitivity and specificity are allowed to vary by test site. We use calibration tests at sites for known positive or known negative cases to develop a hierarchical model to predict accuracy at a new site. I’ll present results for the data from Santa Clara early in the pandemic. This first model can act as a component in any diagnostic study, much like a PK model for a drug can be coupled with a range of PD models. In the second model, sensitivity is allowed to grow from infection to a peak at symptom onset and decay after that. I’ll provide several parameterizations and show results for different cohorts of patients. I’ll discuss coupling the second model with time trends of infection fatality and hospitalization in order to infer population infection levels. Theoretically, these models are easy to combine, but practically speaking, we do not have a good data set to fit a joint model without very strong assumptions.

The first model is joint work with Andrew Gelman (Columbia University). The second model is joint work with Tom Ward (UK Health Security Agency).

The location is

2.30 pm, Oct 13, University Paris Diderot, 16 rue Henri Huchard, room 342

and apparently you have to leave ID with security in order to access floor 3 (it’s in the medical school).

Mira magic—a probabilistic card trick

One of the fun and educational exhibits at Antonietta Mira’s children’s science museum exhibition, Numb3d by Numbers, is a probabilistic card trick. Antonietta didn’t invent the trick and can’t recall who did (nor did a quick internet search reveal the originator). The trick is characteristic of probabilistic card tricks, which are tricks that work with some high probability.

The card trick

Kids divide up a standard deck of 52 playing cards and lay them out end to end in any order. The cards are big enough for kids to stand on, and laying 52 of them out creates a winding path through the exhibition hall. The exhibition is just full of these nice touches—everything is highly interactive and scaled for not just one child, but groups of children.

Next, each kid chooses a starting point from among the first five cards and stands on the card they have chosen. At this point, the science educator leading the tour tells them where they will all end up. Each child then proceeds along the path of cards (down the deck), each time advancing by the value of the card on which they were standing (cards 2–9 have their natural values, whereas ace, 10, jack, queen, and king all count as 1).

How does the educator know where they’ll all end up and what’s the probability that the trick works?

Coupling in Markov chains

The first observation to make about this is that the trick works by coupling. That is, each kid’s path picks out a Markov chain, and the chains will couple with high probability. Coupling here means that they get to the same card at some point before the end of the path. Given that the path is deterministic, as soon as a pair of paths couple, they are guaranteed to wind up in the same location.

Stanislaw Ulam to the rescue

Ulam developed the Monte Carlo method when he was laid up in hospital in 1946, playing solitaire to pass the time. He couldn’t work out the exact probability of winning analytically, so he played a bunch of games and estimated the chance of winning by the empirical win frequency. Fun aside: according to the “A bit of history” section of Manuele Leonelli’s online book Simulation and Modelling to Understand Change, the name “Monte Carlo” was chosen by Metropolis as a code name for von Neumann and Ulam’s work, which was classified (this being during the Cold War).

We’re going to follow in Ulam’s footsteps and figure out the answer by simulation. Mitzi and I wrote the following Python code to do the simulation on the train after visiting Antonietta at Università della Svizzera Italiana (USI) in Lugano.

mira-magic.py:

import random

suit = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 1, 1, 1]  # ace-9 at face value; 10, J, Q, K count as 1
deck = 4 * suit                                 # four suits make the 52-card deck

def walk(deck, start):
    """returns final position for deck from start position"""
    pos = start
    while pos + deck[pos] < len(deck):
        pos += deck[pos]
    return pos

def couples(deck):
    """return True if two random starts in 0:4 of a shuffled deck couple"""
    random.shuffle(deck)
    starts = random.sample(range(5), 2)
    end1 = walk(deck, starts[0])
    end2 = walk(deck, starts[1])
    return end1 == end2

N = 10**6
successes = 0
for _ in range(N):
    successes += couples(deck)

rate = successes / N    
var = rate * (1 - rate) * N
sd = var**0.5
se = sd / N
conf_lb = rate - 1.96 * se
conf_ub = rate + 1.96 * se

print(f'Based on {N} simulations:')
print(f'  probability of coupling: {rate:5.3f}')
print(f'  se: {se:7.5f}')
print(f'  95% confidence interval = ({conf_lb:6.4f}, {conf_ub:6.4f})')

We define one suit of a deck of cards and then repeat it 4 times for a full deck. We then define a function walk() that evaluates the final position given a starting position and the deck. We define a second function couples() that returns True if we shuffle the deck, choose two different starting points among the first five cards, and wind up at the same place. The main loop just does a bunch of simulations and counts the number of successes. This lets us compute the success rate. The variance of a binomial distribution is known and computed as indicated. We then calculate the standard deviation and standard error, which lets us compute the bounds of a confidence interval. We then print the results with precision chosen to match the standard errors.

The results

Here’s what happens when we run the program.

card-coupling$ python3 mira-magic.py 
Based on 1000000 simulations:
  probability of coupling: 0.983
  se: 0.00013
  95% confidence interval = (0.9828, 0.9833)

So that’s our answer: 98.3%. The confidence interval just shows we’ve run enough simulations. The whole simulation takes close to 15s on my new M2-based MacBook Air, because Python is really, really slow for loops. [edit: one of the comments has a faster Julia version.]

I wonder what the science educators do when either (a) kids count wrong, or (b) the trick just fails probabilistically.

Variants

We can also consider variants by slightly tweaking the code.

  • If the kids are forced to choose the first two cards, the probability goes up to 98.8%.
  • If we count 10 as 10, the probability goes down to 94.3%.
  • If we count 10 as 10 and all the face cards as 10, the probability goes way down to 69.5%.
  • If we add two jokers and have them count as 1, the probability goes up to 99.7%.
  • If we were to use two decks instead of one, the probability goes up to 99.98%.
  • If we use two decks and add two jokers, the probability is above 99.99%.

All those cards with value 1 clearly have a strong effect on the probability of coupling, as does the size of the deck.
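For example, the 10-counts-as-10 variant is just a one-line change to the suit definition in the code above, rerunning the same simulation afterward:

suit = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 1, 1]   # 10 now advances 10 cards
deck = 4 * suit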

For more about coupling in Markov chains and applications to shuffling cards and random walks on a hypercube, here's a nice slide deck from Andreas Klappenecker.

[edit: I forgot to say how the educator knew the answer: they just count from card 1 as the kids lay the cards out. This also seems like it might be error-prone, at least for those of us who are mathematically inclined but sloppy at arithmetic.]

Numb3d by Numbers—a must see kid’s science exhibition about measurement, counting, and stats

Mitzi and I just visited Antonietta Mira in Lugano. We went to talk about applying her delayed rejection algorithm to Hamiltonian Monte Carlo and partial momentum refresh Metropolis-adjusted Langevin.

Numb3d by Numbers

We were absolutely blown away when Antonietta took us on a tour of her science museum exhibition on probability, statistics, and simulation (Diamo I Numeri! in Italian, Numb3d by Numbers in English, which are not quite direct translations). If I had seen this when I was a kid, you wouldn’t have been able to tear me away from the exhibits until I’d figured each one out. The kids that were there when we were touring were all enthralled.

Here’s a description of the exhibit (in English): Numb3d by Numbers. And here’s a short video in Italian, which shows how beautiful the exhibits are: Diamo I Numeri!

The exhibits

Among the exhibits Mitzi and I recall (and we were only there for an hour or so!):

  • the coolest, largest-scale Galton board I’ve ever seen
  • a horse race with two six-sided dice and horses numbered 2–12
  • the towers of Hanoi
  • number writing systems that had kids translate their age
  • ancient games of counting
  • a magic trick with a giant deck of cards (see below)
  • population pyramids where people put their number down
  • crowdsourced guesses of jars of regular and irregular items
  • string visualizations of what you’re good and bad at, to make trends visible
  • tangrams and other shape-based puzzles
  • magic squares with all the math
  • combinatorics and growth of possibilities
  • real-time counters, scales, and how they need to be contextualized
  • emergent patterns like dragonfly wings and sunflower illustrations of Fibonacci sequences
  • randomness and patterns
  • parity and how shepherds count goats
  • positive predictive accuracy of diagnostic tests using frequencies rather than probabilities
  • abacuses and calculating methods

The exhibits are aimed at kids, but they also work for adults and some of them even for young kids. It’s set up in a beautiful hillside science museum (all of Lugano is amazingly beautiful, being situated on a crystal clear alpine lake). The exhibition is designed to be led by a science educator, though it’s also possible to do a self-guided tour.

On the road

The great news is that Antonietta and crew just got a grant to translate the exhibit into English and German (it’s now in Italian) and take it on the road. She’s looking for collaborators interested in bringing it to their home towns. If you have any interest in stats education for kids, this is a golden opportunity.

A probabilistic card trick

I was fascinated by the probabilistic card trick (Antonietta’s a mathemagician and did a bunch of other card tricks for us). Kids divide up a deck of cards and lay them out end to end in any order (these were huge cards, so the deck made a winding path through the exhibits). Then the kids choose different starting points among the first five cards and stand on the card they chose. Each then proceeds to move along the path (down the deck), each time advancing by the value of the card on which they were standing (cards 2–9 have their natural values; ace, 10, jack, queen, and king all count as 1). The science educator conducting the tour tells the kids where they’ll land before they start.

So the question is, how does this work? If you’re used to thinking about simulation, you may understand the theory of why this is likely to happen. Mitzi and I had to put on our Stan(islaw Ulam) hats to work out by simulation how likely the trick is to work. I’ll leave it as an exercise for the reader (and provide a solution in a follow-up post).

A horse race

I really liked the horse race because it involved dice and was very graphic and tactile. Kids all get a pair of really big dice and the horses are numbered 2–12. Each time a horse’s number comes up, it advances one step. The first horse to take 20 (?) steps wins. Of course, the 7 horse is the strong favorite, especially with a lot of steps. We didn’t get around to doing the simulation for that one, but it’d be a nice exercise to introduce discrete Monte Carlo methods.
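Here’s a minimal sketch of that simulation in Python, assuming 20 steps to win and one two-dice roll per turn (both details are my assumptions about the exhibit’s rules):

import random

def race(steps_to_win=20):
    """returns the winning horse (2-12) for one race"""
    positions = {horse: 0 for horse in range(2, 13)}
    while True:
        roll = random.randint(1, 6) + random.randint(1, 6)  # two dice
        positions[roll] += 1
        if positions[roll] == steps_to_win:
            return roll

N = 100_000
wins = {horse: 0 for horse in range(2, 13)}
for _ in range(N):
    wins[race()] += 1

for horse in range(2, 13):
    print(f'horse {horse:2d} wins with probability {wins[horse] / N:.3f}')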

Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R.

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking.

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter,

  • Jim Albert and Jay Bennett. 2001. Curve Ball. Copernicus.



Stan downtown intern posters: scikit-stan & constraining transforms

It’s been a happening summer here at Stan’s downtown branch at the Flatiron Institute. Brian Ward and I advised a couple of great interns. Two weeks or so before the end of the internship, our interns present posters. Here are the ones from Brian’s intern Alexey and my intern Meenal.

Alexey Izmailov: scikit-stan

Alexey built a version of the scikit-learn API backed by Stan’s sampling, optimization, and variational inference. It’s plug-and-play with scikit-learn.
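Here’s the flavor of it (a hypothetical usage sketch in the scikit-learn estimator style; the actual class names and arguments live in the scikit-stan docs, and the ones below are assumptions):

import numpy as np
from scikit_stan import GLM   # assumed package and class names

X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(100)

model = GLM(family='gaussian')   # Stan does the inference under the hood
model.fit(X, y)
y_pred = model.predict(X)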

Meenal Jhajharia: unconstraining transforms

Meenal spent the summer exploring constraining transforms and how to evaluate them, with the goal of refining Stan’s transform performance and adding new data structures. This involved figuring out how to evaluate transforms against target distributions (with respect to convexity, conditioning if convex, and sampling behavior in the tail, body, and near the mode of target densities). Results are turning out to be more interesting than we suspected, in that different transforms seem to work better under different conditions. We’re also working with Seth Axen (Tübingen) and Stan devs Adam Haber and Sean Pinkney.
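For context, here’s the simplest instance of the genre (a sketch of the standard log/exp transform for a positive-constrained parameter, not necessarily one of the variants Meenal evaluated):

import numpy as np

def positive_constrain(y):
    """maps unconstrained y in R to x > 0; returns (x, log Jacobian),
    where the log Jacobian adjustment is log |dx/dy| = y"""
    return np.exp(y), y

Evaluating a transform amounts to asking how the transformed target density behaves on the unconstrained scale, which is where the sampler actually runs.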

They don’t make undergrads like they used to

Did I mention they were undergrads? Meenal’s heading back to University of Delhi to finish her senior year and Alexey heads back to Brown to start his junior year! The other interns at the Center for Computational Mathematics, many of whom were undergraduates, have also done some impressive work in everything from using normalizing flows to improve sampler proposals for molecular dynamics to building 2D surface PDE solvers at scale to HPC for large N-body problems. In this case, not making undergrads like they used to is a good thing!

Hiring for next summer

If you’re interested in working on statistical computing as an intern next summer, drop me a line at [email protected]. I’ll announce when applications are open here on the blog.


PPL Workshop this Friday (29 July 2022)

The Computational Abstractions for Probabilistic and Differentiable Programming (CAPP) Workshop is happening this Friday, 29 July 2022. It’s organized out of Cambridge and is running on UK time.

Program and Zoom link

The schedule of speakers is online and the free Zoom link is available under the Location tab of their web site:

Despite the title, the talks sound very practical. The speaker lineup includes developers from BUGS, JAGS, Stan, JAX, and Turing.jl.

My talk

I (Bob) will be talking about our plan to introduce Python-like comprehensions into the Stan language to deal with constant matrix constructions like Gaussian Process covariance matrices. This is motivated by our shift in underlying matrix data structure for autodiff in Stan—what the Stan developers have been calling “mat-var” in meetings and pull requests because it’s an autodiff variable with a matrix value and matrix adjoint rather than our original approach using a matrix of scalar autodiff variables.
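To make the comprehension idea concrete, here’s the kind of construction in question, written as the analogous Python comprehension (an illustration of the concept only, not proposed Stan syntax):

import numpy as np

def gp_cov(x, sigma, rho):
    """squared-exponential GP covariance matrix built elementwise,
    the kind of constant matrix the proposed comprehensions target"""
    return np.array([[sigma**2 * np.exp(-0.5 * ((xi - xj) / rho)**2)
                      for xj in x] for xi in x])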

P.S. This is a big reduction in memory pressure and time for larger-scale matrix ops in Stan. The really clever part is how the C++ devs (Tadej Ciglarič, Steve Bronder, Rok Češnovar, et al.) have managed to abstract the mat-var construct in a way that’s backward compatible with our existing autodiff, with simpler code. If you like C++ and functional template metaprogramming, I urge you to follow along. Steve’s about to drop all the dev-facing doc for all of this—it’s currently in a massive PR undergoing final revisions.

“Unsupervised learning” gets a bad rap

Yesterday, a speaker from the company formerly known as Facebook corrected themselves after mentioning “unsupervised learning,” saying they weren’t supposed to use that term any more. This must be challenging after years of their chief AI scientist promoting unsupervised learning. The reason is apparently that they don’t want the public to worry about the rise of unsupervised AI! Instead, they are using the terms “unlabeled” and “self-supervised.”

If I were the lexical police, I would’ve blacklisted the term “learning,” because we’re really just estimating parameters (aka “weights”) of a statistical model. I felt bad saying “learning” even back when I worked in ML full time, but that’s probably just because I lived through the AI winter during which the quickest route to rejection of a paper or grant was to mention “artificial intelligence.”

Bayes factors measure prior predictive performance

I was having a discussion with a colleague after a talk that was focused on computing the evidence, and I mentioned that I don’t like Bayes factors because they measure prior predictive performance rather than posterior predictive performance. But even after filling up a board, I couldn’t convince my colleague that Bayes factors were really measuring prior predictive performance. So let me try in blog form, and maybe the discussion can help clarify what’s going on.

Prior predictive densities (aka, evidence)

If we have data $latex y$, parameters $latex \theta$, sampling density $latex p(y \mid \theta)$ and prior density $latex p(\theta)$, the prior predictive density is defined as

$latex p(y) = \int p(y \mid \theta) \, p(\theta) \, \textrm{d}\theta.$


The integral computes an average of the sampling density $latex p(y \mid \theta)$ weighted by the prior $latex p(\theta)$. That’s why we call it “prior predictive”.

Bayes factors compare prior predictive densities

Let’s write $latex p_{\mathcal{M}}(y)$ to indicate that the prior predictive density depends on the model $latex \mathcal{M}$. Then if we have two models, $latex \mathcal{M}_1, \mathcal{M}_2$, the Bayes factor for data $latex y$ is defined to be

$latex \textrm{BF}(y) = \frac{p_{\mathcal{M}_1}(y)}{p_{\mathcal{M}_2}(y)}$.

What are Bayes factors measuring? Ratios of prior predictive densities. Usually this isn’t so interesting, because the difference between a weakly informative prior and one an order of magnitude wider doesn’t make much of a difference for posterior predictive inference. There’s more discussion of this, with examples, in Gelman et al.’s Bayesian Data Analysis.

Jeffreys set thresholds for Bayes factors of “barely worth mentioning” (below $latex \sqrt{10}$) to “decisive” (above 100). But we don’t need to worry about that.

Posterior predictive distribution

Suppose we’ve already observed some data $latex y^{\textrm{obs}}$. The posterior predictive distribution is

$latex p(y \mid y^{\textrm{obs}}) = \int p(y \mid \theta) \, p(\theta \mid y^{\textrm{obs}}) \, \textrm{d}\theta.$


The key difference from the prior predictive distribution is that we average our sampling density $latex p(y \mid \theta)$ over the posterior $latex p(\theta \mid y^{\textrm{obs}})$ rather than the prior $latex p(\theta)$.

Cross-validation

In the Bayesian workflow paper, we recommend using cross-validation to compare posterior predictive distributions and we don’t even mention Bayes factors. Stan provides an R package, loo, for efficiently computing approximate leave-one-out cross-validation.

The path from prior predictive to posterior predictive

Introductions to Bayesian inference often start with a very simple beta-binomial model, which can be solved analytically in an online fashion. That is, we can update the posterior by simple counting after each observation, and each posterior is also a beta distribution. We can do this in general by considering our data $latex y = y_1, \ldots, y_N$ as arriving sequentially and updating the posterior each time.

$latex p(y_1, \ldots, y_N) = p(y_1) \, p(y_2 \mid y_1) \, \cdots \, p(y_N \mid y_1, \ldots, y_{N-1}).$


In this factorization, we predict $latex y_1$ based only on the prior, then $latex y_2$ based on $latex y_1$ and the prior, and so on, until the last point is modeled the same way as in leave-one-out cross-validation, as $latex p(y_N \mid y_1, \ldots, y_{N-1})$. We can do this in any order and the result will be the same. As $latex N$ increases, the prior predictive density converges to the posterior predictive density on an average (per observation $latex y_n$) basis. But for finite amounts of data $latex N \ll \infty$, the measures can be very different.
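Here’s a minimal sketch of that sequential updating for the beta-binomial model, computing the prior predictive density (the evidence) as a product of one-step-ahead predictive probabilities:

import math

alpha, beta = 1.0, 1.0   # beta(1, 1), i.e., uniform prior
y = [1, 0, 1, 1, 0]      # binary observations

log_evidence = 0.0
for y_n in y:
    p_one = alpha / (alpha + beta)   # one-step-ahead predictive Pr[y_n = 1]
    log_evidence += math.log(p_one if y_n == 1 else 1 - p_one)
    alpha += y_n                     # conjugate update is just counting
    beta += 1 - y_n

print(math.exp(log_evidence))        # prior predictive density p(y)

Running this with two different priors and taking the ratio of the outputs gives the Bayes factor comparing them.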

Pen and notebook recommendations?

Current state of my art

It’s an all Japanese line up.

  • Maruman Mnemosyne A5, 5mm dot or square pads, side or top bound
  • Sakura Pigma Micron 0.1 pens

I’m curious if there are people that have tried these and have something they prefer.

The pads

I just switched pad brands a few days ago. I used to use Rhodia A5 5mm dot pads with top perforation (it’s a French brand that’s part of Clairefontaine). Before that, I’d found that pencil didn’t show up well enough against dark ruled lines, so Andrew turned me onto the dot pads about 10 years ago. The Mnemosyne pads have off-white instead of white pages (maybe a bit greenish, like old graph paper) and the rules (or dots) are perfectly light. But it’s the smoothness of the paper that’s really incredible. I’ve never felt paper like it and now I’m hooked.

The pens

Matt Hoffman turned me onto the pens when we started working on Stan (though he’s now using Muji pens). They are great at writing fine lines, which is perfect for math. I ran out of pens on a trip to Berlin and found that Staedtler Pigment Liner 0.1 pens are a reasonable substitute (not quite as dark, but the nib’s a bit tougher). Andrew uses some kind of erasable pen—I’ve never cottoned to those.

I have used Pentel Quicker-Clicker pencils since high school (circa 1980). I still use them sometimes, though with B lead as I find the HB too light. I thought I’d dislike doing math in pen, but now I prefer it because pen is more fun to write with and the end result is much easier to read.

Summer internships at Flatiron Institute’s Center for Computational Mathematics

[Edit: Sorry to say this to everyone, but we’ve selected interns for this summer and are no longer taking applications. We’ll be taking applications again at the end of 2022 for positions in summer 2023.]

We’re hiring a crew of summer interns again this summer. We are looking for both undergraduates and graduate students. Here’s the ad.

I’m afraid the pay is low, but to make up for it, we cover travel, room, and most board (3 meals/day, 5 days/week). Also, there’s a large cohort of interns every summer across the five centers at Flatiron (biology, astrophysics, neuroscience, quantum physics, and math), so there are plenty of peers with whom to socialize. Another plus is that we’re in a great location, on Fifth Avenue just south of the Flatiron Building (in the Flatiron neighborhood, which is a short walk to NYU in Greenwich Village and Google in Chelsea, as well as to Times Square and the Hudson River Park).

If you’re interested in working on stats, especially applied Bayesian stats, Bayesian methodology, or Stan, please let me know via email at [email protected] so that I don't miss your application. We have two other Stan devs here, Yuling Yao (postdoc) and Brian Ward (software engineer).

We're also hiring full-time permanent research scientists at both the junior level and senior level, postdocs, and software engineers. For more on those jobs, see my previous post on jobs at Flatiron. That post has lots of nice photos of the office, which is really great. Or check out Google's album of photos.

It’s so hard to compare the efficiency of MCMC samplers

I thought I’d share a comment I made on Perry de Valpine‘s post on the NIMBLE blog,

Perry was posting about a paper that tried to compare the efficiency of Stan, NIMBLE, and JAGS for some linear modeling problems. You can find the link in Perry’s post if you really care, and the paper links to their GitHub. Perry was worried they were misrepresenting NIMBLE and users would get the wrong idea. We decided to take the approach of established companies and simply ignore this kind of criticism. But this time I couldn’t resist, as it’s not really about us.

Here’s my comment (lightly edited):

Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetical sampling (Stan can produce samples whose ESS is higher than the number of iterations; many estimators just truncate at the number of iterations because they don’t understand ESS and its relation to squared error through the MCMC CLT). The problem with this kind of eval is that we want to represent actual practice but also minimize warmup to put systems in as favorable a light as possible. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup, compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, you’ve wasted a lot of time in too much iteration. But in practice, you don’t know in advance whether you can get away with less warmup (though you could use something like iterative deepening algorithms to probe deeper, they’re not built in yet).

One way to get around this sensitivity to warmup is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.

In other words, this is a really hard problem.

I then read Perry’s follow up and couldn’t help myself. I actually looked at their Stan code. Then I had a follow up comment.

I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.

And of course, none of these exploited within-chain parallelization or GPU. And no use of sufficient stats in the conjugate cases like Linear_model_conjugate.stan.
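To illustrate the loop-versus-matrix-multiply point, here’s the same contrast sketched in Python/numpy rather than Stan:

import numpy as np

N, P = 1000, 100
X = np.random.randn(N, P)
beta = np.random.randn(P)

mu_loop = np.array([sum(X[n, p] * beta[p] for p in range(P))
                    for n in range(N)])   # elementwise loop: slow
mu_vec = X @ beta                          # one matrix multiply: fast

assert np.allclose(mu_loop, mu_vec)

In Stan the gap is even bigger, because the matrix version also produces a much smaller expression graph for reverse-mode autodiff.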

Bayesian approaches to differential gene expression

TPM, FPKM, WTF?

I (Bob) am dipping my toes back into differential expression modeling for RNA-seq data. Doesn’t everyone love high-dimensional count data? I just can’t get over their standard units: TPM (transcripts per million) is almost OK, but FPKM (fragments per kilobase of transcript per million fragments mapped)? Seriously? We have scientific notation for a reason.
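For anyone who hasn’t run into these units, here’s a minimal sketch of the standard TPM calculation (just the unit conversion; nothing about modeling):

import numpy as np

def tpm(counts, lengths_kb):
    """transcripts per million: length-normalize counts to reads per
    kilobase, then rescale so the values sum to one million"""
    rpk = counts / lengths_kb
    return rpk / rpk.sum() * 1e6

counts = np.array([100.0, 500.0, 250.0])   # mapped reads per transcript
lengths_kb = np.array([1.5, 3.0, 0.8])     # transcript lengths in kilobases
print(tpm(counts, lengths_kb))             # sums to 1e6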

If you’re thinking about bioinformatics…

If you’re going to be looking at RNA-seq data, I would strongly recommend reading the following clear, insightful, and comprehensive Bayesian take on the subject.

Lewin, Bottolo, and Richardson. 2019. Bayesian methods for gene expression analysis. In Balding, Moltke, and Marioni. Handbook of Statistical Genomics, 4th Edition. Wiley.

Yes, that’s Sylvia Richardson, so you know this one’s going to be good. So good that I really wish I’d found it in 2018 when they apparently put the draft out online. But it doesn’t come up in obvious searches for RNA-seq and Bayes on Google and it’s a sad indictment of scholarship that it has zero citations in Google Scholar. This should be among the first things anyone reads as an introduction to microarrays or RNA-seq in bioinformatics! If you have other suggestions in this category, please let me know in the comments.

Since 2018, Shuonan Chen (Columbia systems biology Ph.D. student), Chaolin Zhang (Columbia systems biology professor), and I have developed a multilevel Bayesian alternative to rMATS (differential isoform expression with replicates). We’re nearly done with the draft and I’ll announce it here when it’s up on arXiv. We hadn’t read Lewin et al. at the time we developed it, but you can view what we did as a specific instance of the kinds of models they describe in their paper. They even used the trick I learned from Andrew of parameterizing the locations of two groups as a mean and a difference from the mean rather than as two locations. Even better, the paper goes over how to handle non-uniquely mapped reads, which systems like rMATS can’t handle; my only concern is that they don’t directly address identifiability issues in all these models, which can be challenging.

P.S. My plan’s to write a case study along the same lines as Lewin et al. with more background on the RNA-seq process and simulated data sets with Stan code illustrating the main points, with an extended discussion of Bayesian approaches to non-identifiability. If anyone’s interested in being involved in this, please let me know via email ([email protected]).

My slides and paper submissions for Prob Prog 2021

Prob Prog 2021 just ended. Prob Prog is the big probabilistic programming conference (250 registered attendees, with as many as 180 actually online at one point). It’s a very broadly scoped conference.

The online version this year went very smoothly. It ran a different schedule every day to accommodate different time zones. So I wound up missing the Thursday talks other than the posters because of the early start. There was a nice amount of space between sessions to hang out in the break rooms and chat.

Given that there’s no publication for this conference, I thought I’d share my slides here. The talks should go up on YouTube at some point.

Slides: What do we need from a PPL to support Bayesian workflow?

There was a lot of nice discussion around bits of workflow we don’t really discuss in the paper or book: how to manage file names for multiple models, how to share work among distributed teammates, how to put models into production and keep them updated for new data. In my talk, I brought up issues others have to deal with like privacy or intellectual property concerns.

My main focus was on modularity. After talking to a bunch of people after my talk, I still don’t think we have any reasonable methodology as a field to test out components of a probabilistic program that are between the level of a density we can unit test and a full model we can subject to our whole battery of workflow tests. How would we go about just testing a custom GP prior or spatio-temporal model component? There’s not even a way to represent such a module in Stan, which was the motivation for Maria Gorinova‘s work on SlicStan. Ryan Bernstein (a Stan developer and Gelman student) is also working on IDE-like tools that provide a new language for expressing a range of models.

Then Eli Bingham (of Pyro fame) dropped the big question: is there any hope we could use something like these PPLs to develop a scalable, global climate model? It turns out that we don’t even know how the climate modelers vet the incredibly complicated components of these models. The soil carbon models alone are more complicated than most of the PK/PD models we fit, and they’re one of the simplest parts of these models.

I submitted two abstracts this year; then they invited me to do a plenary session, and I decided to focus on the first.

Paper submission 1: What do we need from a probabilistic programming language to support Bayesian workflow?

Paper submission 2: Lambdas, tuples, ragged arrays, and complex numbers in Stan

P.S. Andrew: have you considered just choosing another theme at random? It’s hard to imagine it’d be harder to read than this one.

ProbProg 2021 is Oct 20–22 and registration is open

ProbProg 2021, the probabilistic programming conference, is coming up in two weeks and registration is only US$25. It’s entirely online still this year and runs from 10.30 am Wednesday, 20 October, to 2.30 pm Friday, 22 October, EDT (eastern daylight time, aka Boston time).

Here’s the schedule of talks and posters. On Friday at noon, I’ll be talking about Bayesian workflow and what probabilistic programming languages need in order to support it. I’ll survey a range of PPLs, which I conclude all lack key features required for seamless workflow support. Perhaps surprisingly, I conclude that BUGS is the best of a bad bunch for workflow when dealing with problems that can be expressed in its language and where Gibbs sampling is effective. Most other systems, including Stan, PyMC3, Pyro, and all the ML toolkits, require rewriting models for different uses like simulation-based calibration, posterior predictive checks, and cross-validation. Part III of the Stan User’s Guide shows how to code these model variants in Stan. I’ll try to squeeze in some details of how we hope to extend Stan to better support workflow.
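For anyone who hasn’t seen it, here’s the shape of the simulation-based calibration loop I have in mind (a sketch in Python; draw_prior, simulate_data, and posterior_draws are hypothetical stand-ins for whatever the PPL provides):

import numpy as np

def sbc_rank(draw_prior, simulate_data, posterior_draws, M=99):
    """one SBC iteration: the rank of a prior draw among M posterior
    draws; over many iterations the ranks should be uniform on 0..M
    if the inference is calibrated"""
    theta = draw_prior()
    y = simulate_data(theta)
    draws = np.array(posterior_draws(y, M))
    return int(np.sum(draws < theta))

The pain point is that most PPLs make you write draw_prior and simulate_data by hand as a separate model, rather than deriving them from the same program that defines the posterior.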

Hope to see you there!