Summer School on Advanced Bayesian Methods in Belgium

(this post is by Charles)

This September, the Interuniversity Institute for Biostatistics and statistical Bioinformatics is holding its 5th Summer School on Advanced Bayesian Methods. The event is set to take place in Leuven, Belgium. From their webpage:

As before, the focus is on novel Bayesian methods relevant to the applied statistician. In the fifth edition of the summer school, the following two courses will be organized in Leuven from 11 to 15 September 2023:

The target audience of the summer school are statisticians and/or epidemiologists with a sound background in statistics, but also with a background in Bayesian methodology. In both courses, practical sessions are organized, so participants are asked to bring along their laptop with the appropriate software (to be announced) pre-installed.

I’m happy to do a three-day workshop on Stan: we’ll have ample time to dig into a lot of interesting topics and students will have a chance to do plenty of coding.

I’m also looking forward to the course on spatial modeling. I’ve worked quite a bit on the integrated Laplace approximation (notably its implementation in autodiff systems such as Stan), but I’ve never used the INLA package itself (or one of its wrappers), nor am I very familiar with applications in ecology. I expect this will be a very enriching experience.

The registration deadline is July 31st.

HIIT Research Fellow positions in Finland (up to 5 year contracts)

This job post is by Aki

The Helsinki Institute for Information Technology has some funding for Research Fellows, and the research topics can include Bayes, probabilistic programming, ML, AI, etc.

HIIT Research Fellow positions support the career development of excellent advanced researchers who already have some postdoctoral research experience. While HIIT Research Fellows have a designated supervisor at University of Helsinki or Aalto, they are expected to develop their own research agenda and to gain the skills necessary to lead their own research group in the future. HIIT Research Fellows should strengthen Helsinki’s ICT research community either through collaboration or by linking ICT research with another scientific discipline. In either case, excellence and potential for impact are the primary criteria for HIIT Research Fellow funding.

The contract period is up to five years in length.

I (Aki) am one of the potential supervisors, so you could benefit from my help (the other professors are great, too), but as the text says, you would be an independent researcher. This is an awesome opportunity to advance your career in a lovely and lively environment between Aalto University and the University of Helsinki. I can provide further information about the research environment and working in Finland.

The deadline is August 13, 2023.

See more at the HIIT webpage.

Social experiments and the often-neglected role of theory

Jason Collins discusses a paper by Milkman et al. that presented “a megastudy testing 54 interventions to increase the gym visits of 61,000 experimental participants.” Some colleagues and I discussed that paper a while ago—I think we were planning to write something up about it, but I don’t remember what happened with that.

As I recall, the study had two main results. First, researchers had overestimated the effects of various interventions: basically, people thought they had great ideas for increasing fitness participation, but the real world is complicated, and most things don’t work as effectively as you might expect. Second, regarding the interventions themselves, the evidence was mixed: even in a large study it can be hard to detect average effects.

The overestimation of effect sizes is consistent with things we’ve seen before in other areas of policy research. Past literature tends to report inflated effect sizes: the statistical significance filter, both within and between studies, leads to a selection bias which is always a concern but particularly so when improvements are incremental (there is no magic bullet that will get people to the gym). Beyond this, the effects we typically envision are the effects when the treatment is effective. When considering the average treatment effect, we’re also averaging over all those people for whom the effect is near zero, as illustrated in Figure 1d of this paper.

The big problem: Where do the interventions come from?

Collins’s discussion seems reasonable to me. In particular, I agree with him about the big problem with the design of this “mega-study”: there’s all sorts of rigor in the randomization and analysis plan, but no rigor at all when it comes to deciding which interventions to test.

Unfortunately, this is standard practice in policy analysis! Indeed, if you look at a statistics book, including mine, you’ll see lots and lots on causal inference and estimation, but nothing on how to come up with the interventions to study in the first place.

Here’s how Collins puts it:

At first glance, the list of 54 interventions suggests the megastudy has an underlying philosophy of “throw enough things at a wall and surely something will stick”. . . .

Fair enough. But this concession implicitly means the authors have given up on developing an understanding of human decision making that might allow us to make predictions. Each hypothesis or set of hypotheses they tested concern discrete empirical regularities. They are not derived from or designed to test a core model of human decision making. We have behavioural scientists working as technicians, seeking to optimise a particular objective with the tools at hand. . . .

A big problem here is that “tools at hand” are not always so good, especially when these tools are themselves selected based on past noisy and biased evaluations. What are those 54 interventions, anyway? Just some things that a bunch of well-connected economists wanted to try out. Well-connected economists know lots of things, but maybe not so much about motivating people to go to the gym.

A related problem is variation: These treatments, even when effective, are not simply push-button-X-and-then-you-get-outcome-Y. Effects will be zero for most people and will be highly variable among the people for whom effects are nonzero. The result is that the average treatment effect will be much smaller than you expect. This is not just a problem of “statistical power”; it’s also a conceptual problem with this whole “reduced-form” way of looking at the world. To put it another way: Lack of good theory has practical consequences.

“Bayes is guaranteed to overfit”: What does this mean? There’s a factor of 2 here.

Yuling writes, “Bayes is guaranteed to overfit, for any model, any prior, and every data point.” This statement is not literally true. It does not hold for degenerate models with no unknown parameters, for which p(theta|y) is the same as p(theta), both being delta functions at the true, known value of theta. Nor does it hold for degenerate examples of data whose distribution does not depend on unknown parameters; for example, consider the model y_i ~ normal(theta*x_i, 1), independent for i = 1,…,n. For any data point with x_i = 0, the predictive distribution of y_i is fixed, so there’s no “overfitting.”

So, yeah, the statement is not true as written—but I know what Yuling is saying, which is that when there’s uncertainty in predictions, Bayesian fitting will pull the prediction from the prior toward the observed data, so that, in Yuling’s words, the in-sample error is smaller than the out-of-sample error, which is what he is calling “overfitting.”

This should not be surprising: from Akaike (1973) onward, there’s been a whole subfield of statistics devoted to correcting for the difference between within-sample and out-of-sample prediction error.

There’s something interesting here, though, and that is how much does the Bayesian inference “overfit,” in Yuling’s terms?

Example 1: Normal distribution with flat prior.

Let me explain in the simple case of one data point, y|a ~ normal(a, 1), with a flat prior for a. Because we’re also interested in out-of-sample prediction error, we’ll also define y_rep|a ~ normal(a, 1), independent of y|a.

Let’s start with the maximum likelihood estimate, â_mle = y, which gives the prediction y_pred = y. The within-sample prediction squared error is (y_pred – y)^2 = 0. Meanwhile, the expected out-of-sample prediction squared error is E((y_rep – â_mle)^2) = 2. So that’s how much overfitting the mle has: its within-sample squared prediction error overestimates the expected out-of-sample prediction error by 2. This difference corresponds to the “2” in the AIC formula.

In this case with a flat prior, the Bayesian posterior mean is â_bayes = y, so same point prediction with same within-sample prediction squared error of 0.

But Bayes doesn’t just give a point estimate; it gives a posterior distribution, p(a|y). Thus the relevant predictive summary here is the expected within-sample prediction squared error, averaging over the posterior distribution for a; that is, E((a – y)^2 | y). In this very simple example, the posterior distribution is just a ~ normal(y, 1), so the expected within-sample prediction error of the Bayesian inference is 1.

What about the expected squared out-of-sample prediction error? This is E((â_bayes – y_rep)^2 | y) = 2. So the overfitting is 1. Averaging over the posterior distribution has reduced the overfitting by half.
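Here's a quick Monte Carlo check in R (my sketch, not from Yuling's post). The true value of a is set arbitrarily; none of the quantities below depend on it, and under the flat prior the posterior is exactly a | y ~ normal(y, 1):

```r
set.seed(123)
n_sims <- 1e6
a_true <- 0                       # arbitrary; the results below don't depend on it
y     <- rnorm(n_sims, a_true, 1)
y_rep <- rnorm(n_sims, a_true, 1)

# The MLE (and the posterior mean under the flat prior) is y itself:
mean((y - y)^2)                   # within-sample squared error of the point prediction: 0
mean((y_rep - y)^2)               # expected out-of-sample squared error: ~2

# Averaging the within-sample squared error over posterior draws a | y ~ normal(y, 1):
a_draw <- rnorm(n_sims, y, 1)
mean((a_draw - y)^2)              # ~1, so the gap ("overfitting") is 1 rather than 2
```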

Example 2: Normal distribution with informative prior.

Now let’s consider the next most complicated example. Same as above but now our prior is a ~ normal(0, s), and our posterior is a|y ~ normal((s^2/(1 + s^2))*y, s/sqrt(1 + s^2)).

The Bayes posterior mean is now â_bayes = r*y, where r = s^2/(1 + s^2), which gives a within-sample squared prediction error of y^2/(1 + s^2)^2. This depends on y, so we can compute its expectation averaging over the prior predictive distribution (i.e., the marginal distribution of y), which here is y ~ normal(0, sqrt(1 + s^2)), hence the expected within-sample squared prediction error of the posterior mean is 1/(1 + s^2).

What about the expected within-sample squared prediction error of the posterior inference? For this, we have to add the posterior variance, and we get 1/(1 + s^2) + s^2/(1 + s^2) = 1.

Finally, the expected squared out-of-sample prediction error comes to (1 + 2s^2)/(1 + s^2).

We can now take differences. Suppose the goal is to estimate the expected squared out-of-sample prediction error, and you can do that using the squared within-sample prediction error of the Bayes point estimate, or the squared within-sample prediction error of the Bayes posterior distribution. It turns out that both are too optimistic. The Bayes point estimate underestimates the out-of-sample prediction error by 2s^2/(1 + s^2), and the Bayes posterior distribution underestimates the out-of-sample prediction error by s^2/(1 + s^2).

Thus, in this normal-normal example, using the full Bayesian posterior instead of the Bayesian point estimate reduces the overfitting by a factor of 2.
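Here's the corresponding Monte Carlo check for Example 2 (again my sketch, for one particular value of s):

```r
set.seed(456)
n_sims <- 1e6
s <- 1.5
r <- s^2 / (1 + s^2)                  # shrinkage factor for the posterior mean
post_var <- s^2 / (1 + s^2)           # posterior variance (does not depend on y)

a     <- rnorm(n_sims, 0, s)          # draws from the prior
y     <- rnorm(n_sims, a, 1)
y_rep <- rnorm(n_sims, a, 1)
a_hat <- r * y                        # Bayes posterior mean

mean((a_hat - y)^2)                   # ~ 1/(1+s^2): point estimate, within sample
mean((a_hat - y)^2) + post_var        # ~ 1: full posterior, within sample
mean((a_hat - y_rep)^2)               # ~ (1+2s^2)/(1+s^2): out of sample

# The two "overfitting" gaps; the second is half the size of the first:
mean((a_hat - y_rep)^2) - mean((a_hat - y)^2)                # ~ 2s^2/(1+s^2)
mean((a_hat - y_rep)^2) - (mean((a_hat - y)^2) + post_var)   # ~ s^2/(1+s^2)
```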

That factor of 2

Example 2 above is not just a special case; it’s of general interest for well-behaved problems where the large-sample limit kicks in and the likelihood follows an approximate normal curve.

It’s also the basis for information criterion calculations such as AIC, DIC, and WAIC, as discussed in our 2014 article and chapter 7 of BDA3 (see also our followup paper from 2017 focusing on leave-one-out cross validation). The “effective number of parameters” in the model is associated with the difference of within-sample and out-of-sample prediction accuracy.

None of this is new. The goal of this post is to clarify what Yuling wrote about Bayes overfitting. As Yuling said, any fitting procedure will “overfit” in the sense that it is adapting to the data.
– The Bayesian point estimate overfits (on average).
– The Bayesian posterior distribution accounts for uncertainty and reduces the overfitting by a factor of 2 (in the normal-normal case).
– This is not a problem with Bayesian inference; it’s a recognition that any fitting is, in a sense, overfitting; or, to put it another way, fitting in parameter space inevitably leads to overfitting in the space of mean squared error or log predictive density.

P.S. As Aki and Phil point out, the above definition of “overfitting” can be misleading in that, under that definition, all fitting is overfitting; see further discussion here in comments.

The Economist is hiring a political data scientist to do election modeling!

Dan Rosenheck writes:

Our data-journalism team conducts original quantitative research, deploying cutting-edge statistical methods to ask and answer important, relevant questions about politics, economics and society. We are looking to hire a full-time political data scientist. . . .

The data scientist will oversee all of our poll aggregators and predictive models for elections. This entails learning all of the code and calculations behind our existing politics-related technical projects, such as our forecasting systems for presidential and legislative elections in the United States, France and Germany; proposing and implementing improvements to them; and setting up and maintaining data pipelines to keep them updated regularly once they launch. The data scientist will also have the opportunity to design and build new models and trackers on newsworthy subjects.

It’s a permanent, full-time staff position, to replace Elliott Morris, who worked on the 2020 forecast with Merlin and me (see here for one of our blog posts and here and here for relevant academic articles).

Sounds like a great job for a statistician or a political scientist, and I hope I’ll have the opportunity to work with whoever the Economist’s new data scientist is. We built a hierarchical model and fit it in Stan!

Hedging your bets by weighting regressions?

Cody Boyer writes:

I’ve had a question in the back of my mind since I read this article years ago. What I’m curious about is this section, quoted below:

A major challenge is that there are a lot of plausible variables that affect primary and caucus outcomes — but relatively little data to test them out on, especially early in the process when no states have voted and few states have reliable polling. In cases like these, regression analysis is at risk of overfitting, where it may seemingly describe known data very precisely but won’t do a good job of predicting outcomes that it doesn’t know in advance.

Our solution is twofold. First, rather than just one regression, we run a series of as many as 360 regressions using all possible combinations of the variables described below. Although the best-performing regressions receive more weight in the model, all of the regressions receive at least some weight. Essentially, the model is hedging its bets on whether the variables that seemingly describe the results the best so far will continue to do so.

Is this an “actual” statistical method (like something with theory backing it up and a “right” way to do it)? Why do this instead of using something like AIC to select the model to use?

My response:

I’m not sure exactly what they’re doing, but it sounds like stacking based on predictive averaging. Here’s a paper on Bayesian stacking, but the non-Bayesian version’s been around for a while.

Choosing one model using AIC or other approach is a special case where one of the models gets weight 1 and all the rest get weight 0.
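To make the contrast concrete, here's a minimal R sketch. This is not FiveThirtyEight's actual procedure, and it uses simple Akaike-style weights as a stand-in for stacking; the data and models are made up:

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)

fits <- list(lm(y ~ x1, data = d),
             lm(y ~ x2, data = d),
             lm(y ~ x1 + x2, data = d),
             lm(y ~ x1 + x2 + x3, data = d))
aics <- sapply(fits, AIC)

w_select <- as.numeric(aics == min(aics))   # "select with AIC": weight 1 on one model, 0 on the rest
w_hedge  <- exp(-0.5 * (aics - min(aics)))  # spread the bets across all models
w_hedge  <- w_hedge / sum(w_hedge)

newdata <- data.frame(x1 = 0.2, x2 = -0.1, x3 = 0.5)
preds <- sapply(fits, predict, newdata = newdata)
sum(w_select * preds)                       # single-model prediction
sum(w_hedge * preds)                        # hedged, weighted-average prediction
```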

More generally, fully expressing uncertainty in predictions is a challenge. You can do all the model averaging you’d like, but you’re still extrapolating from the data you have. And then you can add extra uncertainty to your prediction to account for what’s not in your model, but when working with a multivariate distribution (for example, forecasting an election in fifty states at once), it can still be hard to get things right. Add uncertainty so that certain marginal forecasts make sense, and this can distort the forecast of the joint distribution (as in that notorious map from fivethirtyeight.com that showed a scenario where Biden won every state except New Jersey). This is not to say that our forecasts were better. We don’t have great tools for capturing multivariate uncertainty. This comes up in missing-data imputation too.

Anyway, I like predictive model averaging, just don’t want to overstate its benefits.

Prior knowledge elicitation: The past, present, and future

Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami write in a paper that recently appeared online in the journal Bayesian Analysis:

Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. In principle, prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.

Why are we not widely using prior elicitation? We analyse the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing under-studied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.

There is a lot we couldn’t include, probably something we missed, and a huge amount of potential work to do in the future. Happy to get comments and pointers to recent related work.

Stan class at NYR Conference in July (in person and virtual)

I (Jonah) am excited to be teaching a 2-day Stan workshop preceding the NYR Conference in July. The workshop will be July 11-12 and the conference July 13-14.  The focus of the workshop will be to introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Over the course of two full days, participants will learn to write models in the Stan language, run them in R, and use a variety of R packages to work with the results.

There are both in-person and remote spots available, so if you can’t make it to NYC you can still participate. For tickets to the workshop and/or conference head to https://rstats.ai/nyr.

P.S. In my original post I forgot to mention that you can use the discount code STAN20 to get 20% off tickets for the workshop and conference!

 

blme: Bayesian Linear Mixed-Effects Models

The problem:

When fitting multilevel models, the group-level variance parameters can be difficult to estimate. Posterior distributions are wide, and point estimates are noisy. The maximum marginal likelihood estimate of a variance parameter can often be zero, which is a problem for computational algorithms, such as those in lme4, that are based on this marginal mode. For models with multiple varying coefficients (varying-intercept, varying-slope models), the bigger the group-level covariance matrix, the more likely it is that its maximum marginal likelihood estimate will be degenerate. This leads to computational problems as well as problems with the estimated coefficients, as they end up not just partially pooled but completely pooled toward the fitted model.

The solution:

Priors: zero-avoiding or boundary-avoiding priors to keep estimates of group-level variances away from zero or degenerate values, and also informative priors to get more reasonable estimates when the number of groups is small.

The research papers:

[2013] A nondegenerate estimator for hierarchical variance parameters via penalized likelihood estimation. Psychometrika 78, 685–709. (Yeojin Chung, Sophia Rabe-Hesketh, Andrew Gelman, Jingchen Liu, and Vincent Dorie)

[2014] Weakly informative prior for point estimation of covariance matrices in hierarchical models. Journal of Educational and Behavioral Statistics 40, 136–157. (Yeojin Chung, Andrew Gelman, Sophia Rabe-Hesketh, Jingchen Liu, and Vincent Dorie)

The R package:

blme: Bayesian Linear Mixed-Effects Models, by Vince Dorie, Doug Bates, Martin Maechler, Ben Bolker, and Steven Walker
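Here's a minimal sketch of the kind of comparison involved, on simulated data with few groups and a small group-level variance, where the max marginal likelihood estimate can collapse to zero. The prior-specification syntax is from my memory of the package, so check ?blmer before relying on it:

```r
library(lme4)
library(blme)

set.seed(42)
J <- 8; n_per <- 5
g <- factor(rep(1:J, each = n_per))
alpha <- rnorm(J, 0, 0.3)                      # small true group-level sd
d <- data.frame(g = g,
                y = 2 + alpha[as.integer(g)] + rnorm(J * n_per, 0, 1))

fit_lmer  <- lmer(y ~ 1 + (1 | g), data = d)   # max marginal likelihood; sd may hit 0
fit_blmer <- blmer(y ~ 1 + (1 | g), data = d,
                   cov.prior = gamma)          # boundary-avoiding gamma prior (see ?blmer)
VarCorr(fit_lmer)                              # group-level sd, possibly exactly 0
VarCorr(fit_blmer)                             # kept away from the boundary
```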

Going forward:

blme is great but we’d also like to have full Bayes. Stan does full Bayes but can be slow if you have a lot of data and a lot of groups. Just for example, suppose you have longitudinal data with 5 observations on each of 100,000 people. Then a hierarchical model will have hundreds of thousands of parameters—that’s a lot! On the other hand, the Bayesian central limit theorem should be working in your favor (see appendix B of BDA, for example). So some combination of approximate and full Bayesian inference should work.

Also, lme4, and even blme, can have trouble when you have lots of variance parameters running around. And lme4 has its own issues (which blme unfortunately inherits) regarding computation with empty groups and the like, problems that should not really arise with Bayesian inference using informative priors.

Right now, though, we don’t have this best-of-both-worlds Bayesian solution that does full Bayes when computationally feasible and uses appropriate approximations otherwise. So blme is part of our toolbox. Thanks to Vince!

Multilevel regression and poststratification (MRP) vs. survey sample weighting

Marnie Downes and John Carlin write:

Multilevel regression and poststratification (MRP) is a model-based approach for estimating a population parameter of interest, generally from large-scale surveys. It has been shown to be effective in highly selected samples, which is particularly relevant to investigators of large-scale population health and epidemiologic surveys facing increasing difficulties in recruiting representative samples of participants. We aimed to further examine the accuracy and precision of MRP in a context where census data provided reasonable proxies for true population quantities of interest. We considered 2 outcomes from the baseline wave of the Ten to Men study (Australia, 2013–2014) and obtained relevant population data from the 2011 Australian Census. MRP was found to achieve generally superior performance relative to conventional survey weighting methods for the population as a whole and for population subsets of varying sizes. MRP resulted in less variability among estimates across population subsets relative to sample weighting, and there was some evidence of small gains in precision when using MRP, particularly for smaller population subsets. These findings offer further support for MRP as a promising analytical approach for addressing participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

This article appeared in 2020 but I just happened to hear about it now.

Here’s the result from the first example considered by Downes and Carlin:

For the dichotomous labor-force outcome, MRP produced very accurate population estimates, particularly at the national level and for the larger states, where the employment rate was estimated within 1% in each case. For the smallest states of ACT and NT, MRP overestimated the employment rate by approximately 5%. Post-hoc analyses revealed that these discrepancies could be explained partly by important interaction terms that were evident in population data but not included in multilevel models due to insufficient data. For example, based on Census data, there was a much higher proportion of Indigenous Australians living in NT (25%) compared with all other states (<5%), but only 3% (n = 2) of the Ten to Men sample recruited from NT identified as Indigenous. There were also differences in the labor-force status of Indigenous Australians by state according to the Census: 90% of Indigenous Australians residing in ACT were employed compared with 79% residing in NT. Due to insufficient data available, it was not possible to obtain a meaningful estimate of this Indigenous status-by-state interaction effect.

And here’s their second example:

For the continuous outcome of hours worked, the performance of MRP was less impressive, with population quantities consistently overestimated by approximately 2 hours at the national level and for the larger states and by up to 4 hours for the smaller states. MRP still however, outperformed both unweighted and weighted estimation in most cases. The inaccuracy of all 4 estimation methods for this outcome likely reflects that the 2011 Census data for hours worked was not a good proxy for the true population quantities being estimated by the Ten to Men baseline survey conducted in 2013–2014. It is entirely plausible that the number of hours worked in all jobs in a given week could fluctuate considerably due to temporal factors and a wide range of individual-level covariates not included in our multilevel model. This was also evidenced by the large amount of residual variation in the multilevel model for this outcome.

Downes and Carlin summarize what they learned from the examples:

The increased consistency among state-level estimates achieved by MRP can be attributed to the partial pooling of categorical covariate parameter estimates toward their mean in multilevel modeling. This was particularly evident in the estimation of labor-force status for the smaller states of TAS, ACT, and NT, where MRP estimates fell part of the way between the unweighted state estimates and the national MRP estimate, with the degree of shrinkage reflecting the relative amount of information available about the individual state and all the states combined.

We did not observe, in this study, the large gains in precision achieved with MRP seen in our previous case study and simulation study. The multilevel models fitted here were more complex, including a larger number of covariates and multiple interaction effects. While we have sacrificed precision, this increased model complexity appears to have achieved increased accuracy. We did see small gains in precision when using MRP, particularly for the smaller states, and we might expect these gains to be larger for smaller sample sizes where the benefits of partial pooling in multilevel modeling would be greater.

Also:

The employment outcome measures considered in this study are not health outcomes per se; rather, they were chosen in the absence of any health outcomes for which census data were available to provide a comparison in terms of accuracy. We have no reason to expect MRP would behave any differently for outcome measures more commonly under investigation in population health or epidemiologic studies.

MRP can often lead to a very large number of poststratification cells. Our multilevel models generated 60,800 unique poststratification cells. With a total population size of 4,990,304, almost three-fourths of these cells contained no population data. This sparseness is not a problem, however, due to the smoothing of the multilevel model and the population cell counts used simply as weights in poststratification.
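For readers less familiar with the mechanics, here's a minimal sketch (not the authors' code, with made-up numbers) of the poststratification step being described: model-based cell estimates are averaged with census counts as weights, so an empty cell simply gets weight zero.

```r
cells <- data.frame(
  N     = c(1200, 850, 0, 430),        # census counts per cell (0 = empty cell)
  theta = c(0.71, 0.64, 0.58, 0.69)    # multilevel-model estimates per cell
)
mrp_estimate <- with(cells, sum(N * theta) / sum(N))
mrp_estimate
# A subgroup estimate works the same way, restricting the sum to that subgroup's cells.
```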

They conclude:

Results of this case-study analysis further support previous findings that MRP provides generally superior performance in both accuracy and precision relative to the use of conventional sample weighting for addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys. Future research could involve the application of MRP to more complex problems such as estimating changes in prevalence over time in a longitudinal study or developing some user-friendly software tools to facilitate more widespread usage of this method.

It’s great to see people looking at these questions in detail. Mister P is growing up!

OK, I was wrong about Paul Samuelson.

Regarding my post from the other day, someone who knows much more about Samuelson than I do provides some helpful background:

It is your emphasis on “realistic” that is wrong. Paul played a significant role in the random walk model for stock prices and knew a huge amount about both theory and practice, as part of the MIT group, including others such as Bob Solow. He had no shades on his eyes – but he knew that the model fit well enough that betting on indexes was for most people as good as any strategy. But how to convey this in an introductory text? Most people would go to simple random walks – coin tossing. But that was far from realistic. Stock prices jumped irregularly and were at that time limited to something like half shares or quarter shares. It is a clever idea to think of a sort of random draw of actual price changes as a device to teach students what was going on. Much more realistic. I cannot believe he ever said this was exactly how it worked. Text books have to have a degree of idealization if they are to make a complex subject understandable. Paul was certainly not a pompous fool. Economics was a no holds barred field in those days and all models were actively criticized from both theoretical and empirical sides. His clever instructional idea was indeed more realistic and effective as well. He did get the Soviet economy wrong, but so did every other economist.

Regarding that last point, I still maintain that Samuelson’s problem was not getting the Soviet economy wrong, but rather that he didn’t wrestle with his error in later editions of his book; see third and fourth paragraph of this comment. But, sure, that’s just one thing. Overall I get my correspondent’s point, and I no longer think the main substance of my earlier post was correct.

The mistake comes when it is elevated from a heuristic to a principle.

Gary Smith pointed me to this post, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” which reminded me of my discussion with Nate Silver a few years ago regarding his mistaken claim that, “the most robust assumption is usually that polling is essentially a random walk, i.e., that the polls are about equally likely to move toward one or another candidate, regardless of which way they have moved in the past.” My post was called, “Politics is not a random walk: Momentum and mean reversion in polling,” and David Park, Noah Kaplan, and I later expanded that into a paper, “Understanding persuasion and activation in presidential campaigns: The random walk and mean-reversion models.”

The random walk model for polls is a bit like the idea that the hot hand is a fallacy: it’s an appealing argument that has a lot of truth to it (as compared to the alternative model that poll movement or sports performance is easily predictable given past data) but is not quite correct, and the mistake comes when it is elevated from a heuristic to a principle.

This mistake happens a lot, no? It comes up in statistics all the time.

P.S. Some discussion in comments on stock market and investing. I know nothing about that topic; the above post is just about the general problem of people elevating a heuristic to a principle.

Principal stratification for vaccine efficacy (causal inference)

Rob Trangucci, Yang Chen, and Jon Zelner write:

In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

Sounds important. And they use Stan, which always makes me happy.

Bayesian Believers in a House of Pain?

This post is by Lizzie

A colleague from U-Mass Amherst sent me this image yesterday. He said he had found ‘multiple paper copies’ in an office and then ruminated on how they might have been used. I suggested they might have been for ‘a group of super excited folks at a conference jam session!’

This leads back to a rumination I have had for a long time: how come I cannot find an MCMC version of ‘Jump Around’? It seems many of the lyrics could be improved upon with an MCMC spin (though I would keep: I got more rhymes than there’s cops at a Dunkin’).

My colleague suggested that there is perhaps a need to host a Bayesian song adaptation contest ….

Stan for Pharmacometrics Day in Paris: 8 June 2023

The one-day workshop is organized by Julie Bertrand and her colleagues at Inserm:

9:30 Presentation of the program
9:40 Andrew Gelman (Columbia University)
10:40 Sebastian Weber (Novartis) – Supporting phase I dose-escalation trials in Oncology
11:00 Tea/coffee break
11:30 Aki Vehtari (Aalto University) – Bayesian Workflow
12:30 Lunch break
14:00 Maxime Beaulieu (INSERM) – Hierarchical Nonlinear Joint Modelling in Oncology – a simulation study
14:20 Stanislas du Ché (Univ Paris Orsay) – Parallelization for Markov chains Monte Carlo with Heterogeneous Runtimes
14:40 Julie Bertrand (INSERM) – Standard Errors at finite distance in Non linear Mixed Effect models
15:00 Céline Brochot (Certara) – Stan for Bioequivalence studies
15:20 Tea/coffee break
15:50 François Mercier and Daniel Sabanes-Bove (Roche) – jmpost: An R package for Bayesian joint tumor growth inhibition and overall survival models using Stan
16:10 Charles Margossian (Flatiron Institute) – Making Bayesian pharmacometrics modeling simpler (but not too simple) with Torsten
16:30 Day wrap-up

If you’ll be in town, you can sign up!

I’m not sure what I’ll speak on, given that Aki seems to have already taken the Bayesian Workflow topic. But I’ll think of something!

PhD or PostDoc position on simulation-based inference with Paul “brms” Bürkner

Hi all, this is Paul. Andrew was so kind to allow me to post a job ad here on his blog.

At the Technical University of Dortmund, Germany, I am currently looking for a PhD Student or PostDoc to work with me on simulation-based Bayesian inference research in the context of our BayesFlow framework.

BayesFlow is a Python library for efficient simulation-based Bayesian Inference. It enables users to create specialized neural networks for amortized Bayesian inference, which repays users with rapid statistical inference after a potentially longer simulation-based training phase. A cornerstone idea of amortized Bayesian inference is to employ generative neural networks for parameter estimation, model comparison, and model validation when working with intractable simulators whose behavior as a whole is too complex to be described analytically.

Both the BayesFlow library itself and its community are quickly growing. Our goal is to make it the gold-standard simulation-based inference library within the next couple of years.

For more details about the position, please see Paul Bürkner – Open Positions

I am looking forward to your applications!

Paul

BDA: Bayesian Dolphin Analysis

Matthieu Authier writes:

Here is a simulation study using regularized regression with post-stratification to estimate dolphin bycatch from non-representative samples. The Stan code is accessible here.

We’ve also used RRP on a case study with samples from France, where we know that independent observers are preferentially allowed on boats when dolphin bycatch is low (a report on that is being written at the moment, and it will be the second part of the dead dolphin duology for RRP). RRP gives more plausible estimates in this case.

For those not familiar with the jargon, “bycatch” is “the non-intentional capture or killing of non-target species in commercial or recreational fisheries.”

It’s great to see MRP and Stan being used for all sorts of real problems.

P.S. Authier has an update:

The article has been published and vastly improved thanks to peer review. The simulation study is here and the case study is here.

Stan was instrumental in both cases to be able to fit the models.

The model we developed is now used to update estimates of bycatch in the Bay of Biscay.

Using the anthropic principle to think about researchers’ priors on effect sizes

Evan Werfel writes:

I’ve been thinking about the problem of setting an appropriate prior when conducting Bayesian analysis. Do you know if anyone has done any work to quantify how much prior-belief-ness undertaking a study to test a particular hypothesis might represent? Part of me wants to say that given all of the effort to design a study, get IRB approval and collect data, there has to be some lower bound on one’s prior before doing the data analysis… because if it were any lower, presumably one would undertake a different study. At the very least, one imagines that the researcher’s prior is greater than 0. I wonder if cognitive psychological researchers studying hypothesis testing could estimate how much prior belief people need, on average, to take action on their beliefs.

My reply: It’s hard to say. For example, a researcher’s prior belief in the efficacy of a proposed new treatment might be low, but if it could benefit millions of people, then it could be worth studying because of the high potential benefit. Even a treatment with negative expected value can be worth studying: if there is a small probability that it has a consistent positive benefit, then you do the experiment to see whether to proceed further. This is basic decision-analytic reasoning. Conversely, if the prior is that the treatment is probably beneficial, it can still be a good idea to do the experiment, just in case it actually has a negative effect, in which case you’d learn that and know not to proceed.
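Here's a toy numerical version of that reasoning, with made-up numbers purely for illustration:

```r
p_good <- 0.1    # prior probability the treatment really helps
gain   <- 100    # payoff from deploying a treatment that helps
loss   <- -20    # payoff from deploying a treatment that doesn't
cost   <- 2      # cost of running the experiment

# Deploy without testing: negative expected value, so we wouldn't act at all.
p_good * gain + (1 - p_good) * loss               # = -8

# Test first, deploy only if it works (idealizing the experiment as decisive):
# the downside is now capped at the cost of the experiment.
p_good * (gain - cost) + (1 - p_good) * (-cost)   # = 8
```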

All that is not even considering issues such as cognitive biases, financial and career incentives, and all sorts of other reasons why we would expect researchers’ priors to be wrong.

Regarding your original question: Sometimes I call this sort of thing “anthropic reasoning” by analogy to the anthropic principle in physics, whereby we can derive some properties of our world, given the information that we exist in it.

Here’s an example from a few years ago where I used anthropic reasoning to answer the question, Should we take measurements at an intermediate design point? I love that paper, and I remain bummed that it’s only been cited 3 times.

Carl Morris: Man Out of Time [reflections on empirical Bayes]

Carl Morris recently passed away, so I will repost this article from 2015:

When Carl Morris came to our department in 1989, I and my fellow students were so excited. We all took his class. The funny thing is, though, the late 1980s might well have been the worst time to be Carl Morris, from the standpoint of what was being done in statistics at that time—not just at Harvard, but in the field in general. Carl has made great contributions to statistical theory and practice, developing ideas which have become particularly important in statistics in the last two decades. In 1989, though, Carl’s research was not in the mainstream of statistics, or even of Bayesian statistics.

When Carl arrived to teach us at Harvard, he was both a throwback and ahead of his time.

Let me explain. Two central aspects of Carl’s research are the choice of probability distribution for hierarchical models, and frequency evaluations in hierarchical settings where both Bayesian calibration (conditional on inferences) and classical bias and variance (conditional on unknown parameter values) are relevant. In Carl’s terms, these are “NEF-QVF” and “empirical Bayes.” My point is: both of these areas were hot at the beginning of Carl’s career and they are hot now, but somewhere in the 1980s they languished.

In the wake of Charles Stein’s work on admissibility in the late 1950s there was an interest, first theoretical but with clear practical motivations, to produce lower-risk estimates, to get the benefits of partial pooling while maintaining good statistical properties conditional on the true parameter values, to produce the Bayesian omelet without cracking the eggs, so to speak. In this work, the functional form of the hierarchical distribution plays an important role—and in a different way than had been considered in statistics up to that point. In classical distribution theory, distributions are typically motivated by convolution properties (for example, the sum of two gamma distributions with a common shape parameter is itself gamma), or by stable laws such as the central limit theorem, or by some combination or transformation of existing distributions. But in Carl’s work, the choice of distribution for a hierarchical model can be motivated based on the properties of the resulting partially pooled estimates. In this way, Carl’s ideas are truly non-Bayesian because he is considering the distribution of the parameters in a hierarchical model not as a representation of prior belief about the set of unknowns, and not as a model for a population of parameters, but as a device to obtain good estimates.

So, using a Bayesian structure to get good classical estimates. Or, Carl might say, using classical principles to get better Bayesian estimates. I don’t know that they used the term “robust” in the 1950s and 1960s, but that’s how we could think of it now.

The interesting thing is, if we take Carl’s work seriously (and we should), we now have two principles for choosing a hierarchical model. In the absence of prior information about the functional form of the distribution of group-level parameters, and in the absence of prior information about the values of the hyperparameters that would underlie such a model, we should use some form with good statistical properties. On the other hand, if we do have good prior information, we should of course use it—even R. A. Fisher accepted Bayesian methods in those settings where the prior distribution is known.

But, then, what do we do in those cases in between—the sorts of problems that arose in Carl’s applied work in health policy and other areas? I learned from Carl to use our prior information to structure the model, for example to pick regression coefficients, to decide which groups to pool together, to decide which parameters to model as varying, and then use robust hierarchical modeling to handle the remaining, unexplained variation. This general strategy wasn’t always so clear in the theoretical papers on empirical Bayes, but it came through in Carl’s applied work, as well as that of Art Dempster, Don Rubin, and others, much of which flowered in the late 1970s—not coincidentally, a few years after Carl’s classic articles with Brad Efron that put hierarchical modeling on a firm foundation that connected with the edifice of theoretical statistics, gradually transforming these ideas from a parlor trick into a way of life.

In a famous paper, Efron and Morris wrote of “Stein’s paradox in statistics,” but as a wise man once said, once something is understood, it is no longer a paradox. In un-paradoxing shrinkage estimation, Efron and Morris finished the job that Gauss, Laplace, and Galton had begun.

So far, so good. We’ve hit the 1950s, the 1960s, and the 1970s. But what happened next? Why do I say that, as of 1989, Carl’s work was “out of time”? The simplest answer would be that these ideas were a victim of their own success: once understood, no longer mysterious. But it was more than that. Carl’s specific research contribution was not just hierarchical modeling but the particular intricacies involved in the combination of data distribution and group-level model. His advice was not simply “do Bayes” or even “do empirical Bayes” but rather had to do with a subtle examination of this interaction. And, in the late 1980s and early 1990s, there wasn’t so much interest in this in the field of statistics. On one side, the anti-Bayesians were still riding high in their rejection of all things prior, even in some quarters a rejection of probability modeling itself. On the other side, a growing number of Bayesians—inspired by applied successes in fields as diverse as psychometrics, pharmacology, and political science—were content to just fit models and not worry about their statistical properties.

Similarly with empirical Bayes, a term which in the hands of Efron and Morris represented a careful, even precarious, theoretical structure intended to capture classical statistical criteria in a setting where the classical ideas did not quite apply, a setting that mixed estimation and prediction—but which had devolved to typically just be shorthand for “Bayesian inference, plugging in point estimates for the hyperparameters.” In an era where the purveyors of classical theory didn’t care to wrestle with the complexities of empirical Bayes, and where Bayesians had built the modeling and technical infrastructure needed to fit full Bayesian inference, hyperpriors and all, there was not much of a market for Carl’s hybrid ideas.

This is why I say that, at the time Carl Morris came to Harvard, his work was honored and recognized as pathbreaking, but his actual research agenda was outside the mainstream.

As noted above, though, I think things have changed. The first clue—although it was not at all clear to me at the time—was Trevor Hastie and Rob Tibshirani’s lasso regression, which was developed in the early 1990s and which has of course become increasingly popular in statistics, machine learning, and all sorts of applications. Lasso is important to me partly as the place where Bayesian ideas of shrinkage or partial pooling entered what might be called the Stanford school of statistics. But for the present discussion what is most relevant is the centrality of the functional form. The point of lasso is not just partial pooling, it’s partial pooling with an exponential prior. As I said, I did not notice the connection with Carl’s work and other Stein-inspired work back when lasso was introduced—at that time, much was made of the shrinkage of certain coefficients all the way to zero, which indeed is important (especially in practical problems with large numbers of predictors), but my point here is that the ideas of the late 1950s and early 1960s again become relevant. It’s not enough just to say you’re partial pooling—it matters _how_ this is being done.

In recent years there’s been a flood of research on prior distributions for hierarchical models, for example the work by Nick Polson and others on the horseshoe distribution, and the issues raised by Carl in his classic work are all returning. I can illustrate with a story from my own work. A few years ago some colleagues and I published a paper on penalized marginal maximum likelihood estimation for hierarchical models using, for the group-level variance, a gamma prior with shape parameter 2, which has the pleasant feature of keeping the point estimate off of zero while allowing it to be arbitrarily close to zero if demanded by the data (a pair of properties that is not satisfied by the uniform, lognormal, or inverse-gamma distributions, all of which had been proposed as classes of priors for this model). I was (and am) proud of this result, and I linked it to the increasingly popular idea of weakly informative priors. After talking with Carl, I learned that these ideas were not new to me, indeed these were closely related to the questions that Carl has been wrestling with for decades in his research, as they relate both to the technical issue of the combination of prior and data distributions, and the larger concerns about default Bayesian (or Bayesian-like) inferences.

In short: in the late 1980s, it was enough to be Bayesian. Or, perhaps I should say, Bayesian data analysis was in its artisanal period, and we tended to be blissfully ignorant about the dependence of our inferences on subtleties of the functional forms of our models. Or, to put a more positive spin on things: when our inferences didn’t make sense, we changed our models, hence the methods we used (in concert with the prior information implicitly encoded in that innocent-sounding phrase, “make sense”) had better statistical properties than one would think based on theoretical analysis alone. Real-world inferences can be superefficient, as Xiao-Li Meng might say, because they make use of tacit knowledge.

In recent years, however, Bayesian methods (or, more generally, regularization, thus including lasso and other methods that are only partly in the Bayesian fold) have become routine, to the extent that we need to think of them as defaults, which means we need to be concerned about . . . their frequency properties. Hence the re-emergence of truly empirical Bayesian ideas such as weakly informative priors, and the re-emergence of research on the systematic properties of inferences based on different classes of priors or regularization. Again, this all represents a big step beyond the traditional classification of distributions: in the robust or empirical Bayesian perspective, the relevant properties of a prior distribution depend crucially on the data model to which it is linked.

So, over 25 years after taking Carl’s class, I’m continuing to see the centrality of his work to modern statistics: ideas from the early 1960s that were in many ways ahead of their time.

Let me conclude with the observation that Carl seemed to us to be a “man out of time” on the personal level as well. In 1989 he seemed ageless to us both physically and in his personal qualities, and indeed I still view him that way. When he came to Harvard he was not young (I suppose he was about the same age as I am now!) but he had, as the saying goes, the enthusiasm of youth, which indeed continues to stay with him. At the same time, he has always been even-tempered, and I expect that, in his youth, people remarked upon his maturity. It has been nearly fifty years since Carl completed his education, and his ideas remain fresh, and I continue to enjoy his warmth, humor, and insights.

Here is a video from his retirement event.

A proposal to build new hardware and thermodynamic algorithms for stochastic computing

Patrick Coles writes:

Modern AI has moved away from the absolute, deterministic procedures of early machine learning models. Nowadays, probability and randomness are fully embraced and utilized in AI. Some simple examples of this are avoiding overfitting by randomly dropping out neurons (i.e., dropout), and escaping local minima during training thanks to noisy gradient estimates (i.e., stochastic gradient descent). A deeper example is Bayesian neural networks, where the network’s weights are sampled from a probability distribution and Bayesian inference is employed to update the distribution in the presence of data . . .

Another deep example is generative modeling with diffusion models. Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint (see figure illustrating this for generating an image of a leaf). These models have been extremely successful not only in image generation, but also in generating molecules, proteins and chemically stable materials . . .

AI is currently booming with breakthroughs largely because of these modern AI algorithms that are inherently random. At the same time, it is clear that AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

Then comes the punchline:

There’s no inherent reason why digital hardware is well suited for modern AI, and indeed digital hardware is handicapping these exciting algorithms at the moment.

For production AI, Bayesianism in particular has been stifled from evolving beyond a relative niche because of its lack of mesh with digital hardware. . . . The next hardware paradigm should be specifically tailored to the randomness in modern AI. Specifically, we must start viewing stochasticity as a computational resource. In doing so, we could build a hardware that uses the stochastic fluctuations produced by nature.

Coles continues:

The aforementioned building blocks are inherently static. Ideally, the state does not change over time unless it is intentionally acted upon by a gate, in these paradigms.

However, modern AI applications involve accidental time evolution, or in other words, stochasticity. This raises the question of whether we can construct a building block whose state randomly fluctuates over time. This would be useful for naturally simulating the fluctuations in diffusion models, Bayesian inference, and other algorithms.

The key is to introduce a new axis when plotting the state space: time. Let us define a stochastic bit (s-bit) as a bit whose state stochastically evolves over time according to a continuous time Markov chain . . .

Ultimately this involves a shift in perspective. Certain computing paradigms, such as quantum and analog computing, view random noise as a nuisance. Noise is currently the biggest roadblock to realizing ubiquitous commercial impact for quantum computing. On the other hand, Thermodynamic AI views noise as an essential ingredient of its operation. . . .
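To make the "s-bit" idea concrete, here's a minimal simulation (my sketch, not code from Coles's report) of a bit whose state flips at random times, i.e., a two-state continuous-time Markov chain:

```r
simulate_sbit <- function(t_max, rate01 = 1, rate10 = 1, state = 0) {
  times <- 0
  states <- state
  t <- 0
  while (t < t_max) {
    rate <- if (state == 0) rate01 else rate10   # flip rate depends on the current state
    t <- t + rexp(1, rate)                       # exponential waiting time until the next flip
    state <- 1 - state
    times <- c(times, t)
    states <- c(states, state)
  }
  data.frame(time = times, state = states)
}

path <- simulate_sbit(t_max = 10, rate01 = 2, rate10 = 1)
head(path)
# The long-run fraction of time spent in state 1 approaches rate01 / (rate01 + rate10).
```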

I think that when Coles says “AI,” he means what we would call “Bayesian inference.” Or maybe AI represents some particularly challenging applications of Bayesian computation.

Analog computing

OK, the above is all background. Coles’s key idea here is to build a computer using new hardware, to build these stochastic bits so that continuous computation gets done directly.

This is reminiscent of what in the 1950s and 1960s was called “analog computation” or “hybrid computation.” An analog computer is something you build with a bunch of resistors and capacitors and op-amps to solve a differential equation. You plug it in, turn on the power, and the voltage tells you the solution. Turn some knobs to change the parameters in the model, or set it up in a circuit with a sawtooth input and plug it into an oscilloscope to get the solution as a function of the input, etc. A hybrid computer mixes analog and digital elements. Coles is proposing something different in that he’s interested in the time evolution of the state (which, when marginalized over time, can be mapped to a posterior distribution), whereas in a traditional analog computer, you just look at the end state and you’re not interested in the transient period that it takes to get there.

Here’s the technical report from Coles. I have not read it carefully or tried to evaluate it. That would be hard work! Could be of interest to many of you, though.