The Current State of Undergraduate Bayesian Education and Recommendations for the Future

This post is by Aki

Mine Dogucu and Jingchen Hu arXived in September the paper The Current State of Undergraduate Bayesian Education and Recommendations for the Future, with the abstract:

With the advances in tools and the rise of popularity, Bayesian statistics is becoming more important for undergraduates. In this study, we surveyed whether an undergraduate Bayesian course is offered or not in our sample of 152 high-ranking research universities and liberal arts colleges. For each identified Bayesian course, we examined how it fits into the institution’s undergraduate curricula, such as majors and prerequisites. Through a series of course syllabi analyses, we explored the topics covered and their popularity in these courses, the adopted teaching and learning tools, such as software. This paper presents our findings on the current practices of Bayesian education at the undergraduate level. Based on our findings, we provide recommendations for programs that may consider offering Bayesian education to their students.

It’s not mentioned in the abstract, but based on the references “U.S. News (2021a), ‘2021 Best national university rankings’” and “U.S. News (2021b), ‘National liberal arts colleges’”, the survey of the current state covers only US research universities and liberal arts colleges.

Anyway, I highly recommend checking out their paper, as in addition to surveying the current state they make recommendations for the future:

  1. Expand the access to Bayesian courses
  2. Make Bayesian courses a part of the majors
  3. Balance statistics with computing
  4. Use a variety of assessments

It’s easy for me to agree with these, as I’ve been doing them for years: 1) I make my course available to everyone; 2) my course is optional in the Data Science BSc major (and some other majors) and required in the Machine Learning / AI MSc major at Aalto University (the course is quite advanced, and we’re missing an intermediate BSc-level course, which is why it’s only optional at the BSc level); 4) I use a variety of assessments. In particular, I like the subpoints of recommendation 3:

  1. Introduce simulation-based learning early in the course.
  2. Encourage students to write self-coded MCMC algorithms for relatively simple multiparameter models
  3. If the course puts equal emphasis on computing and modeling, consider adopting one of the popular probabilistic programming languages for Bayesian model estimation through MCMC (e.g., JAGS and Stan).
  4. If the course has a slightly stronger emphasis on modeling over computing, consider introducing one of the wrapper packages for Stan for its simpler posterior summary procedure (e.g., rstanarm and brms).

I have been including all these in my course (except JAGS), so I’m already in the future :D
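To make recommendation 3 concrete, here is a small R sketch (a toy example of my own, not from the paper) contrasting a self-coded MCMC algorithm with a one-liner from a wrapper package such as rstanarm:

```r
# Self-coded random-walk Metropolis for the mean of a normal model
# (known sd = 1, normal(0, 10) prior on mu) -- cf. subpoint 2.
set.seed(1)
y <- rnorm(30, mean = 1.5, sd = 1)
log_post <- function(mu) sum(dnorm(y, mu, 1, log = TRUE)) + dnorm(mu, 0, 10, log = TRUE)

n_iter <- 5000
mu <- numeric(n_iter)
for (i in 2:n_iter) {
  prop  <- rnorm(1, mu[i - 1], 0.5)              # random-walk proposal
  log_r <- log_post(prop) - log_post(mu[i - 1])  # log acceptance ratio
  mu[i] <- if (log(runif(1)) < log_r) prop else mu[i - 1]
}
mean(mu[-(1:1000)])  # posterior mean after discarding warmup

# The same model via a wrapper package (cf. subpoint 4); Stan's MCMC
# runs under the hood and the posterior summary comes for free.
library(rstanarm)
fit <- stan_glm(y ~ 1, data = data.frame(y = y), family = gaussian())
summary(fit)
```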

I’m also happy to see that the most commonly required or recommended book is still the 8-year-old “Bayesian Data Analysis”.

Dogucu and Hu end their paper with

Last but not least, we invite current and future Bayesian educators to join the undergraduate Bayesian education network, an online community that fosters discussions of undergraduate Bayesian education.

Several postdoc, research fellow, and doctoral student positions in Aalto/Helsinki, Finland

This job ad is by Aki

Aalto University, University of Helsinki, and Finnish Center for Artificial Intelligence have a great probabilistic modeling community, and we’re looking for several postdocs, research fellows and doctoral students with topics including a lot of Bayesian statistics.

I’m looking for a postdoc and a doctoral student to work on Bayesian workflow and a postdoc to work on AI-assisted modeling.

Other Bayesian-flavored topics (I’m also involved in some of these) are:

  • AI-assisted design and decisions: from foundations to practice
  • AI-assisted design of experiments and interventions
  • Advanced user models
  • Virtual atmospheric laboratory
  • Variable selection with missing data with applications to genetics
  • Bayesian machine learning for sensing
  • Deep learning with differential equations
  • Deep generative modeling for precision medicine and future clinical trials
  • Probabilistic modelling and Bayesian machine learning
  • Physics-inspired geometric deep representation learning for drug design
  • Probabilistic modelling for collaborative human-in-the-loop design
  • Machine Learning for Health
  • Statistical Genetics and Machine Learning
  • Bayesian machine learning and differential privacy

More information about the topics and how to apply

And here’s a photo of Finnish summer at 11pm and +25°C: a summer sunset at 11pm in Finland.

Post-doc to work on developing Bayesian workflow tools

I (Aki) am looking for a post-doc to work on developing Bayesian workflow tools at Aalto University, Finland, and Finnish Center for Artificial Intelligence, in collaboration with Andrew, Dan Simpson, Paul Bürkner, Lauren Kennedy, Måns Magnusson, Stan developers, ArviZ developers, and others.

The topic is related to many ideas discussed in Bayesian Workflow paper. You can also watch me talking about the key ideas for the research topic.

A strong background in Bayesian inference and some experience in programming are needed. You will co-supervise several doctoral students. Aalto and the Helsinki region have a strong and active probabilistic modeling community, and you’ll work with many local collaborators, too. There is a possibility to spend time in NYC, Melbourne, or Stuttgart, and the contract time is flexible.

Finland is the happiest country in the world. We pay quite well and the healthcare system is excellent.

You can find my contact information on my web page

Post-stratified longitudinal item response model for trust in state institutions in Europe

This is a guest post by Marta Kołczyńska:

Paul, Lauren, Aki, and I (Marta) wrote a preprint where we estimate trends in political trust in European countries between 1989 and 2019 based on cross-national survey data.

This paper started from the following question: How to estimate country-year levels of political trust with data from surveys that (a) mostly have the same trust questions but measured with ordinal rating scales of different lengths, and (b) mostly have samples that aim to be representative of general adult populations, but where this representativeness is likely reached to different degrees?

Our solution combines:

  1. item response models of responses to trust items that account for the varying scale lengths across survey projects,
  2. splines to model changes over time,
  3. post-stratification by age, sex, and education.

In the paper we try to explain all the modeling decisions, so that the paper may serve as a guide for people who want to apply similar methods or — even better — extend and improve them.
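To give a rough idea of what such a model can look like in brms, here is a highly simplified sketch of my own with hypothetical variable names; the actual model in the preprint handles the varying rating-scale lengths and has more structure:

```r
library(brms)

# surveys: one row per respondent with an ordinal trust response, survey
# year, country, survey project, and demographics (hypothetical names).
fit <- brm(
  trust ~ s(year) + (1 | country) + (1 | project) + age_group + sex + education,
  family = cumulative("probit"),
  data = surveys
)

# Post-stratification: weight cell-level predictions (latent scale here) by
# population counts. poststrat: one row per country-year-demographic cell,
# with the same predictor columns as surveys plus a population count N.
eta <- posterior_linpred(fit, newdata = poststrat)  # draws x cells
w   <- poststrat$N / sum(poststrat$N)
trust_draws <- eta %*% w                            # post-stratified draws
quantile(trust_draws, c(0.05, 0.5, 0.95))
```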

We apply this approach to data from 12 cross-national projects (1663 national surveys) carried out in 27 European countries between 1989 and 2019. We find that (a) political trust is pretty volatile, (b) there has not been any clear downward trend in political trust in Europe in the last 30 years, although trust did decline in many Central-East European countries in the 1990s, and there was a visible dip following the 2008 crisis in countries that were hit most, followed by at least partial recovery. Below are estimated levels of political trust for the 27 countries (see the preprint for more details on differences in political trust by sex, age, and education):

Estimated political trust in Europe

The modeling was done in brms thanks to some special features that Paul wrote, and overall this is one of the projects that would not have been possible without Stan.

One of the main obstacles we faced was the limited availability of population data for post-stratification. In the end we used crude education categories (less than high school, high school or above – also because of the incoherent coding of education in surveys), combined Eurostat data with harmonized census samples from IPUMS International, and imputed values for the missing years.

We think our approach or some of its components can be more broadly applied to modeling attitudes in a way that addresses issues of measurement and sample representativeness.

More limitations of cross-validation and actionable recommendations

This post is by Aki.

Tuomas Sivula, Måns Magnusson, and I (Aki) have a new preprint, Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison, that analyzes one of the limitations of cross-validation.

The normal distribution has been used to represent the uncertainty in cross-validation for a single model and in model comparison at least since the 1980s (Breiman et al., 1984, Ch. 11). Surprisingly, there hasn’t been much theoretical analysis of the validity of the normal approximation (or it’s very well hidden).

We (Vehtari and Lampinen, 2002; Vehtari, Gelman, and Gabry, 2017) have also recommended using the normal approximation, and our loo package reports elpd_loo SE and elpd_diff SE, but we have been cautious about making strong claims about their accuracy.

Starting points for writing this paper were

  1. Shao (1993) showed, in the (non-Bayesian) context of model selection with nested linear models where the true model is included, that model selection based on LOO-CV with squared prediction error as the cost function is asymptotically inconsistent. In other words, if we are comparing models A and B, where B has an additional predictor whose true coefficient is zero, then the difference in predictive performance and the associated uncertainty have similar magnitude asymptotically.
  2. Bengio and Grandvalet (2004) showed that there is no generally unbiased estimator for the variance used in the normal approximation, and that the variance tends to be underestimated. This is due to the dependency between cross-validation folds, as each observation is used once for testing and K-1 times (where in the case of LOO, K=N) for training (or for conditioning the posterior in the Bayesian approach).
  3. Bengio and Grandvalet also demonstrated that in the case of small N or bad model misspecification the estimate tends to be worse. Varoquaux et al. (2017) and Varoquaux (2018) provide additional demonstrations that the variance is underestimated if N is small.
  4. The normal approximation is based on the central limit theorem, but in finite cases the distribution of individual predictive utilities/losses can be very skewed, which can make the normal approximation badly calibrated. Vehtari and Lampinen (2002) had proposed using the Bayesian bootstrap to take the skewness into account, but they didn’t provide a thorough analysis of whether it actually works. (A small sketch of how the normal approximation is computed follows this list.)
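For concreteness, the normal approximation under study is computed from pointwise LOO values roughly like this (a minimal R sketch; in the loo package the pointwise values would come from, e.g., loo(fit)$pointwise[, "elpd_loo"]):

```r
# elpd_A, elpd_B: pointwise LOO log predictive densities for models A and B
# (one value per observation).
normal_approx_diff <- function(elpd_A, elpd_B) {
  d <- elpd_A - elpd_B             # pointwise differences
  n <- length(d)
  c(elpd_diff = sum(d),            # estimated difference in elpd
    se_diff   = sqrt(n) * sd(d))   # normal-approximation SE of the sum
}
```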

What we knew before we started writing this paper:

  1. Shao’s inconsistency result is not that worrying, as asymptotically models A and B are indistinguishable and thus the posterior of that extra coefficient will concentrate on zero and the predictions from the models are indistinguishable. Shao’s result, however, hints that if the relative variance (compared to the difference) doesn’t go to zero, then the central limit theorem is not kicking in as usual and the distribution is not necessarily asymptotically normal. We wanted to learn more about this.
  2. Bengio and Grandvalet focused on variance estimators and didn’t consider skewness; also, when demonstrating with outliers, they missed looking at the possible finite-case bias and the asymptotic behavior. We wanted to learn more about this.
  3. We wanted to learn more about what counts as a small N in the case of well-specified models. When comparing the small-N case and the outlier case, we can think of the outliers as dominating the sum, so a small number of outliers behaves similarly to a small N in the well-specified case, except that we can also get significant bias. We wanted to learn more about this.
  4. Vehtari and Lampinen proposed using the Bayesian bootstrap, but in later experiments there didn’t seem to be much benefit compared to the normal approximation. We wanted to learn more about this.

There were many papers discussing predictive performance estimates for single models, but it turned out that the uncertainty in model comparison behaves quite differently.

Thanks to hard work by Tuomas and Måns, we learned that the uncertainty estimates in model comparison can perform badly, namely when:

  1. the models make very similar predictions,
  2. the models are misspecified with outliers in the data, and
  3. the number of observations is small.

We also learned that the problematic skewness of the distribution of the error of the approximation occurs with models which are making similar predictions, and it is possible that the skewness does not fade away when N grows. We show that considering the skewness of the sampling distribution is not sufficient to improve the uncertainty estimate, as it has a weak connection to the skewness of the distribution of the estimator’s error. This explains why the Bayesian bootstrap can’t improve calibration much compared to the normal approximation.

On Twitter someone considered our results pessimistic, as we mention misspecified models and in real life we can assume that none of the models is the true data generating mechanism. By a misspecified model we mean the opposite of a well-specified model, which doesn’t need to be the true data generating mechanism, and naturally the amount of misspecification matters. The discussion about well-specified and misspecified models holds for any modeling approach and is not unique to cross-validation. Bengio and Grandvalet had used just the term outlier, but we wanted to emphasize that an outlier is not necessarily a property of the data generating mechanism, but rather something that is not well modeled by a given model.

We are happy that we now know better than ever before when we can trust CV uncertainty estimates. The consequences of the above points are:

  1. The bad calibration when models are very similar makes LOO-CV less useful for separating very small effect sizes from zero effect sizes. When the models make similar predictions there is not much difference in the predictive performance, and thus for making predictions it doesn’t matter which model we choose; the bad calibration of the uncertainty estimate doesn’t matter, as the possible error is small anyway. Separating very small effect sizes from zero effect sizes is a very difficult problem anyway, and whatever approach is used probably needs very well-specified and well-identified models (e.g., posterior probabilities of models also suffer from overconfidence) and large N.
  2. Model misspecification in model comparison should be avoided by proper model checking and expansion before using LOO-CV. But this is something we should do anyway (and posterior probabilities of models also suffer from overconfidence in the case of model misspecification).
  3. Small differences in predictive performance cannot reliably be detected by LOO-CV if the number of observations is small. What is small? We write in the paper “small data (say less than 100 observations)”, but of course that is not a sharp change-point in the behavior; the calibration improves gradually as N grows.

Cross-validation is often advocated for the M-open case where we assume that none of the compared models represents the true data generating mechanism. Point 2 doesn’t invalidate the M-open case. If the model misspecification is bad and N is not very big, then the calibration in the comparison gets worse, but cross-validation is still useful for detecting big differences; only when trying to detect small differences do we need well-behaving models. This is true for any modeling approach.

We don’t have the following in the paper, so you can consider this my personal opinion based on what we learned. Based on the paper, we could add to the loo package documentation that (a minimal loo_compare sketch follows the list):

  1. If
    • the compared models are well specified
    • N is not too small (say > 100)
    • and elpd_diff > 4

    then elpd_diff SE is likely to be a good representation of the related uncertainty.

  2. If
    • the compared models are well specified
    • N is not too small (say > 100)
    • elpd_diff < 4
    • and elpd_diff SE < 4

    then elpd_diff SE is not a good representation of the related uncertainty, but the error is likely to be small. Stacking can provide additional insight as it takes into account not only the average difference but also the shape of the predictive distributions, and a combination of models can perform better than a single model.

  3. If
    • the compared models are not well specified

    then elpd_diff and related SE can be useful, but you should improve your models anyway.

  4. If
    • N is small (say < 100)

    then proceed with caution and think harder, as with any statistical modeling approach in the case of small data (or get more observations). (There can’t be an exact rule for when N is too small to make inferences, as sometimes even N=1 can be sufficient to say that what was observed is possible, etc.)
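For reference, a minimal sketch of how these quantities are obtained with the loo package (fit1 and fit2 are placeholders for two already fitted rstanarm or brms models):

```r
library(loo)
loo1 <- loo(fit1)   # fit1, fit2: placeholders for fitted rstanarm/brms models
loo2 <- loo(fit2)
comp <- loo_compare(loo1, loo2)
print(comp)
# The elpd_diff column gives the difference relative to the best model and
# se_diff its normal-approximation SE; the rough guidance above compares
# abs(elpd_diff) to 4 and to se_diff, with the caveats on misspecification
# and small N in mind.
```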

All this is supported with plenty of proofs and experiments (a 24-page article and a 64-page appendix), but naturally we can’t claim that these recommendations are bulletproof, and we’re happy to see counterexamples.

In the paper we intentionally avoid dichotomous testing and focus on what we can say about the error distribution, as it has more information than a yes/no answer.

Extra

We also have another new (much shorter, just 22 pages) paper, Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance, showing that although there is no generally unbiased estimator (as shown by Bengio and Grandvalet (2004)), there can be an unbiased estimator for a specific model. The unbiasedness is not the goal itself, but this paper shows that it may be possible to derive model-specific estimators that have better calibration and smaller error than the naive estimator discussed above.

Aki’s talk about reference models in model selection in Laplace’s demon series

I (Aki) talk about reference models in model selection in the Laplace’s demon series on 24 June at 15 UTC (Finland 18, Paris 17, New York 11). See the seminar series website for a registration link, the schedule of other talks, and the list of recorded talks.

The short summary: 1) Why a bigger model helps inference for smaller models, 2) Bayesian decision theoretic justification, 3) Examples. There’s time for questions and discussion. Yes, it will be recorded.

The abstract:

I discuss and demonstrate the benefits of using a reference model in variable selection. A reference model acts as a noise-filter on the target variable by modeling its data generating mechanism. As a result, using the reference model predictions in the model selection procedure reduces the variability and improves stability leading to improved model selection performance and improved predictive performance of the selected model. Assuming that a Bayesian reference model describes the true distribution of future data well, the theoretically preferred usage of the reference model is to project its predictive distribution to a reduced model leading to projection predictive variable selection approach. Alternatively, reference models may also be used in an ad-hoc manner in combination with common variable selection methods.

The talk is based on work with many co-authors. The list of papers and software with links.

Juho Piironen, Markus Paasiniemi, and Aki Vehtari (2020). Projective inference in high-dimensional problems: prediction and feature selection. Electronic Journal of Statistics, 14(1):2155-2197. https://doi.org/10.1214/20-EJS1711

Federico Pavone, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari (2020). Using reference models in variable selection. arXiv preprint arXiv:2004.13118. https://arxiv.org/abs/2004.13118

Juho Piironen and Aki Vehtari (2016). Projection predictive input variable selection for Gaussian process models. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), doi:10.1109/MLSP.2016.7738829. https://dx.doi.org/10.1109/MLSP.2016.7738829

Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y. https://link.springer.com/article/10.1007/s11222-016-9649-y

Homayun Afrabandpey, Tomi Peltola, Juho Piironen, Aki Vehtari, and Samuel Kaski (2019). Making Bayesian predictive models interpretable: A decision theoretic approach. arXiv preprint arXiv:1910.09358 https://arxiv.org/abs/1910.09358

Donald R. Williams, Juho Piironen, Aki Vehtari, and Philippe Rast (2018). Bayesian estimation of Gaussian graphical models with projection predictive selection. arXiv preprint arXiv:1801.05725. https://arxiv.org/abs/1801.05725

Software:
Juho Piironen, Markus Paasiniemi, Alejandro Catalina and Aki Vehtari (2020). projpred: Projection Predictive Feature Selection. https://mc-stan.org/projpred

Video:
Recorded video of the talk and discussion

Case studies:
Model selection case studies

EDIT: added one paper
EDIT2: video link + case studies

Several post-doc positions in probabilistic programming etc. in Finland

There are several open post-doc positions at Aalto and the University of Helsinki in 1. probabilistic programming, 2. simulator-based inference, 3. data-efficient deep learning, 4. privacy-preserving and secure methods, and 5. interactive AI. All these research programs are connected and collaborating. I (Aki) am the coordinator for project 1 and a contributor to the others. Overall we are developing methods and tools for a big part of the Bayesian workflow. The methods are developed generally, but Stan is one of the platforms used to make the first implementations, and thus some post-docs will also work with the Stan development team. See more details here.

StanCon 2018 Helsinki talk slides, notebooks and code online

StanCon 2018 Helsinki talk slides, notebooks, and code have been available for some time in the StanCon talks repository, but it seems we forgot to announce this. The StanCon 2018 Helsinki talk list also includes links to the videos.

StanCon’s version of conference proceedings is a collection of contributed talks based on interactive notebooks. Every submission is peer reviewed by at least two reviewers. The reviewers are members of the Stan Conference Organizing Committee and the Stan Development Team. This repository contains all of the accepted notebooks as well as any supplementary materials required for building the notebooks. The slides presented at the conference are also included.

Thanks to all the presenters, and see you at StanCon 2019!

Postdocs and Research fellows for combining probabilistic programming, simulators and interactive AI

Here’s a great opportunity for those interested in probabilistic programming and workflows for Bayesian data analysis:

We (including me, Aki) are looking for outstanding postdoctoral researchers and research fellows to work on an exciting new project at the crossroads of probabilistic programming, simulator-based inference, and user interfaces. You will have an opportunity to work with top research groups in the Finnish Center for Artificial Intelligence, at both Aalto University and the University of Helsinki, and to cooperate with several industry partners.

The topics for which we are recruiting are

  • Machine learning for simulator-based inference
  • Intelligent user interfaces and techniques for interacting with AI
  • Interactive workflow support for probabilistic programming based modeling

Find the full descriptions here

Postdoc position: Stan and composite mechanistic and data-driven models of cellular metabolism

A very cool project and a possibility to work for 3 years developing Stan and collaborating with me (Aki) and other members of the Stan development team. The deadline for applications is 22 October.

The Quantitative Modelling of Cell Metabolism (QMCM) group, headed by Professor Lars Keld Nielsen at DTU, Copenhagen, is looking for an experienced Bayesian statistician for a postdoc position.

The group specializes in creating complex composite mechanistic/data-driven models of cellular metabolism.

Such accurate and interpretable models can be used to analyze responses to drugs, search for new pharmaceutical targets, or explore basic biology.

Some of the responsibilities include working with the well-known Stan framework: improving it for HPC usage and fitting large nonlinear models.

These activities are supposed to be carried out with collaborators at Aalto University (Aki Vehtari) and core Stan developers at Columbia University.

All details can be found here.

StanCon 2018 Helsinki tutorial videos online

StanCon 2018 Helsinki tutorial videos are now online on the Stan YouTube channel.

List of tutorials at StanCon 2018 Helsinki

  • Basics of Bayesian inference and Stan, parts 1 + 2, Jonah Gabry & Lauren Kennedy
  • Hierarchical models, parts 1 + 2, Ben Goodrich
  • Stan C++ development: Adding a new function to Stan, parts 1 + 2, Bob Carpenter, Sean Talts & Mitzi Morris
  • Ordinary differential equation (ODE) models in Stan, Daniel Lee
  • Productization of Stan, Eric Novik, Markus Ojala, Tom Nielsen, Anna Kircher
  • Model assessment and selection, Aki Vehtari

Abstracts for the tutorials are available on the conference website.

Talk videos will be edited and divided into individual talks this week.

When LOO and other cross-validation approaches are valid

Introduction

Zacco asked on the Stan Discourse whether leave-one-out (LOO) cross-validation is valid for phylogenetic models. He also referred to Dan’s excellent blog post, which mentioned the iid assumption. Instead of iid it would be better to talk about the exchangeability assumption, but I (Aki) got a bit lost in my Discourse answer (so don’t bother going to read it). I started to write this blog post to clear my thoughts and extend what I’ve said before, and hopefully produce a better answer.

TL;DR

The question is when leave-one-out cross-validation or leave-one-group-out cross-validation is valid for model comparison. The short answer is that we need to think about what the joint data generating mechanism is, what is exchangeable, and what the prediction task is. LOO can be valid or invalid, for example, for time-series and phylogenetic modelling, depending on the prediction task. Everything said about LOO also applies to AIC, DIC, WAIC, etc.

iid and exchangeability

Dan wrote

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.

This is well put, although iid is a stronger assumption than necessary. It would be better to assume exchangeability, which doesn’t imply independence. The exchangeability assumption thus extends the cases in which LOO is applicable. The details of the difference between iid and exchangeability are not that important for this post, but I recommend interested readers see Chapter 5 of BDA3. Instead of iid vs. exchangeability, I’ll focus here on prediction tasks and data generating mechanisms.

Basics of LOO

LOO is easiest to understand in the case where we have a joint distribution $latex p(y|x,\theta)p(x|\phi)$, and $latex p(y|x,\theta)$ factorizes as

$latex p(y|x,\theta) = \prod_{n=1}^N p(y_n|x_n,\theta).$

We are interested in predicting a new data point coming from the data generating mechanism

$latex p(y_{N+1}|x_{N+1},\theta).$

When we don’t know $latex \theta$, we’ll integrate over posterior of $latex \theta$ to get posterior predictive distribution

$latex p(y_{N+1}|x_{N+1},x,y)=\int p(y_{N+1}|x_{N+1},\theta) p(\theta|x,y) d\theta.$

We would like to evaluate how good this predictive distribution is by comparing it to the observed $latex (y_{N+1}, x_{N+1})$. If we have not yet observed $latex (y_{N+1}, x_{N+1})$, we can use LOO to approximate the expectation over different possible values of $latex (y_{N+1}, x_{N+1})$. Instead of making a model $latex p(y,x)$, we re-use the observations as pseudo-Monte Carlo samples from $latex p(y_{N+1},x_{N+1})$, and, in order not to use the same observation both for fitting $latex \theta$ and for evaluation, we use the LOO predictive distributions

$latex p(y_{n}|x_{n},x_{-n},y_{-n})=\int p(y_{n}|x_{n},\theta) p(\theta|x_{-n},y_{-n}) d\theta.$

The usual forgotten assumption is that $latex x_{N+1}$ is unknown with the uncertainty described by a distribution! We have a model for $latex p(y|x)$, but often we don’t build a model for $latex p(x)$. BDA3 Chapter 5 discusses that when we have extra information $latex x_n$ for $latex y_n$, then $latex y_n$ are not exchangeable, but $latex (x_n, y_n)$ pairs are. BDA3 Chapter 14 discusses that if we have a joint model $latex p(x,y|\phi,\theta)$ and a prior which factorizes as

$latex p(\phi,\theta)=p(\phi)p(\theta),$

then the posterior factorizes as

$latex p(\phi,\theta|x,y)=p(\phi|x)p(\theta|x,y)$.

We can analyze the second factor by itself with no loss of information.

For the predictive performance estimate we however need to know how the future $latex x_{N+1}$ would be generated! In LOO we avoid making that model explicit by assuming the observed $latex x$’s implicitly (or non-parametrically) represent the distribution $latex p(x)$. If we assume that the future distribution is different from the past, we can use importance weighting to adjust. Extrapolation (far) beyond the observed data would require an explicit model for $latex p(x_{N+1})$. We discuss different joint data generating processes in Vehtari & Ojanen (2012), but I’ll use more words and examples below.
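In practice, with posterior draws from MCMC, the LOO predictive densities above are usually computed with the PSIS-LOO approximation rather than by refitting the model N times. A minimal sketch with the loo package, assuming log_lik is an S-by-N matrix of pointwise log-likelihoods $latex \log p(y_n|x_n,\theta^{(s)})$ evaluated at S posterior draws and chain_id gives the chain index of each draw:

```r
library(loo)
# log_lik: S x N matrix of pointwise log-likelihoods from the posterior draws
# chain_id: integer vector of length S giving the chain of each draw
r_eff <- relative_eff(exp(log_lik), chain_id = chain_id)  # MCMC efficiency factors
loo_result <- loo(log_lik, r_eff = r_eff)
print(loo_result)  # elpd_loo, its SE, and Pareto k diagnostics
```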

Data generating mechanisms

Dan wrote

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

This is a quite common way to describe the problem, but I hope this can be more clear by discussing more about the data generating mechanisms.

Sometimes $latex p(x)$ does not exist, for example when $latex x$ is fixed, chosen by design, or deterministic! Then $latex p(x,y)$ also does not exist, and exchangeability (or iid) does not apply to $latex (x_n,y_n)$. We can still make a conditional model $latex p(y|x,\theta)$ and analyze posterior $latex p(\theta|x,y)$.

If $latex x$ is fixed or chosen by design, we could think of a prediction task in which we would repeat one of the measurements with the same $latex x$’s. In that case, we might want to predict all the new observations jointly, but in the cross-validation approximation that would lead to leave-all-out cross-validation, which is also far from what we want (and sensitive to prior choices). So we may then use the more stable and easier-to-compute LOO, which is still informative about predictive accuracy for repeated experiments.

LOO can be used in the case of fixed $latex x$ to evaluate the predictive performance of the conditional terms $latex p(\tilde{y}_n|x_n,x,y)$, where $latex \tilde{y}_n$ is a new observation conditional on a fixed $latex x_n$. Taking the sum (or mean) of these terms then weights each fixed $latex x_n$ equally. If we care more about the performance for some fixed $latex x_n$ than for others, we can use different weighting schemes to adjust.

$latex x$ can sometimes be a mix of something with a distribution and something which is fixed or deterministic. For example, in a clinical study we could assume that patient covariates come from some distribution in the past and in the future, but the drug dosage is deterministic. It’s clear that if the drug helps, we don’t assume that we would continue giving a bad dosage in the future, so it’s unlikely we would ever observe the corresponding outcome either.

Time series

We could assume the future $latex x$’s to be fixed, chosen by design, or deterministic, but different from the observed ones. For example, in time series we have observed time points $latex 1,\ldots,N$ and then often we want to predict for $latex N+1,\ldots,N+m$. Here the data generating mechanism for $latex x$ (time) is deterministic and all values of $latex x$ are outside of our observed $latex x$’s. It is still possible that the conditional part for $latex y$ factorizes as $latex \prod_{n=1}^N p(y_n|f_n,\theta)$ given latent process values $latex f$ (and we have a joint prior on $latex f$ describing our time series assumptions), and we may assume the $latex (y_n-f_n)$ to be exchangeable (or iid). What matters more here is the structure of $latex x$. Approximating the prediction task leads to $latex m$-step-ahead cross-validation, where we don’t use the future data to provide information about $latex f_n$ or $latex \theta$ (see a draft case study). Under short-range dependency and stationarity assumptions, we can also use some of the future data in $latex m$-step-ahead leave-a-block-out cross-validation (see a draft case study).
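A schematic of exact $latex m$-step-ahead cross-validation (fit_model and log_pred_density are hypothetical placeholders for fitting the time series model on past data and evaluating the log predictive density of a future observation; in practice one would use approximations to avoid refitting at every step):

```r
# y: time-ordered observations; L: size of the initial fitting window;
# m: number of steps ahead to predict.
msap_elpd <- function(y, L, m, fit_model, log_pred_density) {
  N <- length(y)
  lpd <- numeric(0)
  for (t in L:(N - m)) {
    fit <- fit_model(y[1:t])                        # condition only on the past
    lpd <- c(lpd, log_pred_density(fit, y[t + m]))  # predict m steps ahead
  }
  sum(lpd)
}
```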

We can also have time series where we don’t want to predict for the future, and thus the focus is only on the observed time points $latex 1,\ldots,N$. For example, we might be interested in analyzing whether more or fewer babies are born on some special days during a time period in the past (I would assume there are plenty of more interesting examples in history studies and the social sciences). It’s sensible to use LOO here to analyze whether we have been able to learn relevant structure in the time series in order to better predict the number of births on a left-out day. Naturally, we could also analyze different aspects of the time series model by leaving out one week, one month, one year, or several days around special days to focus on different assumptions in our time series model.

LOO-elpd can fail or not fail for time series depending on your prediction task. LOO is not great if you want to estimate the performance of extrapolation into the future, but having a time series doesn’t automatically invalidate the use of cross-validation.

Multilevel data and models

Dan wrote

Or when your data has multilevel structure, where you really should leave out whole strata.

For multilevel structure, we can start from a simple example with individual observations from $latex M$ known groups (or strata as Dan wrote). The joint conditional model is commonly

$latex \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) p(\theta_m|\psi) \right] p(\psi),$

where $latex y$ are partially exchangeable, so that $latex y_{mn}$ are exchangeable within group $latex m$, and $latex \theta_m$ are exchangeable. But for the prediction we need to also consider how the $latex x$’s are generated. If we also have a model for $latex x$, then we might have the following joint model

$latex p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m)p(\theta_m|\psi) \right] p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m)p(\phi_m|\varphi) \right] p(\varphi).$

Sometimes we assume we haven’t observed all groups and want to make predictions for $latex y_{M+1,n}$ for a new group $latex M+1$. Say we have observed individual students in different schools and we want to predict for a new school. Then it is natural to consider leave-one-school-out cross-validation to simulate the fact that we don’t have any observations yet from that new school. Leave-one-school-out cross-validation will then also implicitly (non-parametrically) model the distribution of $latex x_{mn}$.

On the other hand, if we poll people in all states of the USA, there will be no new states (at least in the near future) and some or all $latex x$ might be fixed or deterministic, but there can be new people to poll in these states, and LOO could be sensible for predicting what the next person would answer. And even if there are more schools, we could focus just on these schools and have fixed or deterministic $latex x$. Thus the validity of LOO in hierarchical models also depends on the data generating mechanism and the prediction task.
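A schematic of leave-one-group-out cross-validation for the school example (fit_model and log_pred_density are hypothetical placeholders; in practice there are also helpers for constructing grouped folds in the loo ecosystem):

```r
# data: data frame with outcome, covariates, and a group column (e.g., school).
# fit_model and log_pred_density are hypothetical placeholders for fitting the
# multilevel model and evaluating the joint log predictive density of the
# held-out group.
logo_elpd <- function(data, fit_model, log_pred_density) {
  groups <- unique(data$group)
  lpd <- numeric(length(groups))
  for (g in seq_along(groups)) {
    train  <- data[data$group != groups[g], ]  # drop the whole group
    test   <- data[data$group == groups[g], ]
    fit    <- fit_model(train)
    lpd[g] <- log_pred_density(fit, test)      # prediction for the unseen group
  }
  sum(lpd)
}
```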

Phylogenetic models and non-factorizable priors

Zacco’s discourse question was related to phylogenetic models.

Wikipedia says

In biology, phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species, or populations). These relationships are discovered through phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology under a model of evolution of these traits. The result of these analyses is a phylogeny (also known as a phylogenetic tree) – a diagrammatic hypothesis about the history of the evolutionary relationships of a group of organisms. The tips of a phylogenetic tree can be living organisms or fossils, and represent the “end”, or the present, in an evolutionary lineage.

This is similar to the hierarchical model above, but now we assume a non-factorizable prior for $latex \theta$

$latex \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi),$

and if we have a model for $latex x$, we may have also a non-factorizable prior for $latex \phi$

$latex p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\phi_1,\ldots,\phi_M|\varphi)p(\varphi).$

Again LOO is valid if we focus on new observations in the current groups (e.g., observed species). Alternatively, we could consider prediction for new individuals in a new group (e.g., a new species) and use leave-one-group-out cross-validation. I would assume we might have extra information about some $latex x$’s for a new species, which would require re-weighting or more modeling. Non-factorizable priors are no problem for LOO or cross-validation in general, although fast LOO computations may be possible only for special non-factorizable forms, and direct cross-validation may require the full prior structure to be included, as demonstrated in the Leave-one-out cross-validation for non-factorizable models vignette.

Prediction in scientific inference

Zacco wrote also

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

It seems quite common in many fields to have a desire to have some quantity to report as an assessment of how good the model is, but not to consider whether that model would generalize to new observations. The observations are really the only thing we have observed and something we might observe more of in the future. Only in toy problems might we be able to also observe the usually unknown model structure and parameters. All model selection methods are inherently connected to the observations, and instead of thinking of model assessment methods as black boxes (*cough* information criteria *cough*) it’s useful to think about how the model can help us predict something.

In hierarchical models there are also different parts we can focus on, and depending on the focus, the benefit of some parts can sometimes be hidden behind the benefits of other parts. For example, in the simple hierarchical model described above, if we have plenty of individual observations in each group, then the hierarchical prior has only a weak effect when we leave one observation out, and LOO is then mostly assessing the lowest-level model part $latex p(y_{mn}|x_{mn},\theta_m)$. On the other hand, if we use leave-one-group-out cross-validation, then the hierarchical prior has a strong effect on the prediction for that group and we are assessing more the higher-level model part $latex p(\theta_1,\ldots,\theta_M|\psi)$. I would guess this is the part of phylogenetic models which should be in focus. If there is no specific prediction task, it’s probably useful to do both LOO and leave-one-group-out cross-validation.

Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set.

This would be sensible if we assume that future locations are spatially disconnected from the current locations, or if the focus is specifically on the non-spatial model part $latex p(y_{mn}|x_{mn},\theta_m)$ and we don’t want spatial information to help that prediction.

Information criteria

Zacco wrote

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability.

It’s unfortunate that we didn’t make it clearer in that paper. I naïvely trusted too much that people would also read the cited 87-page theoretical paper (Vehtari & Ojanen, 2012), but I now understand that some repetition is needed. Akaike’s (1974) original paper made the clear connection to the prediction task of predicting $latex y_{N+1}$ given $latex x_{N+1}$ and $latex \hat{\theta}$. Stone (1977) showed that TIC (Takeuchi, 1976), which is a generalization of AIC, can be obtained from a Taylor series approximation of LOO. The papers introducing RIC (Shibata, 1989), NIC (Yoshizawa and Amari, 1994), and DIC (Spiegelhalter et al., 2002) also make the connection to predictive performance. Watanabe (2009, 2010a,b,c) is very explicit on the equivalence of WAIC and Bayesian LOO. The basic WAIC has exactly the same exchangeability assumptions as LOO and estimates the same quantity, but uses a different computational approximation. It’s also possible to think of DIC and WAIC at different levels of hierarchical models (see, e.g., Spiegelhalter et al., 2002; Li et al., 2016; Merkle et al., 2018).

Unfortunately, most of the time information criteria are presented as fit + complexity penalty without any reference to the prediction task, the assumptions about exchangeability, the data generating mechanism, or which part of the model is in focus. This, combined with the difficult-to-interpret unit of the resulting quantity, has led to information criteria being used as black-box measures. As the assumptions are hidden, people think they are always valid (if you already forgot: WAIC and LOO have the same assumptions).

I prefer LOO over WAIC because of better diagnostics and better accuracy in difficult cases (see, e.g., Vehtari, Gelman, and Gabry, 2017). In Using Stacking to Average Bayesian Predictive Distributions we used LOO, but we can do Bayesian stacking also with other forms of cross-validation, like leave-one-group-out or $latex m$-step-ahead cross-validation.

 

PS. I found this great review of the topic: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Although it completely ignores the Bayesian cross-validation literature and gives some recommendations not applicable to Bayesian modeling, it mostly gives the same recommendations as I discuss above.

Parsimonious principle vs integration over all uncertainties

tl;dr If you have bad models, bad priors, or bad inference, choose the simplest possible model. If you have good models, good priors, and good inference, use the most elaborate model for predictions. To make interpretation easier, you may use a smaller model with similar predictive performance to the most elaborate model.

Merijn Mestdagh emailed me (Aki) and asked

In your blogpost “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” you write that

“My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner”…“In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated”. Could you provide some references or additional information on this claim?

I have to clarify the additional conditions for my recommendation of using the encompassing model in the case of nested models:

  • models are used to make predictions
  • the encompassing model has passed model checking
  • the inference has passed diagnostic checks

Bayesian theory says that we should integrate over the uncertainties. The encompassing model includes the submodels, and if the encompassing model has passed model checking, then the correct thing is to include all the models and integrate over the uncertainties (and I assume that the inference is correct). Passing model checking may require good priors on the model parameters, which may be something many ignore, and then they may get bad performance with more complex models. If the models have similar LOO performance, the encompassing model is likely to have thicker tails in the predictive distribution, meaning it is more cautious about rare events. I think this is good.

The main reasons why it is so common to favor more parsimonious models are:

  • Maximum likelihood inference is common, and it doesn’t work well with more complex models. Favoring simpler models is a kind of regularization.
  • Bad model misspecification. Even Bayesian inference doesn’t work well if the model is bad, and with complex models there are more possibilities for misspecifying the model, and misspecification even in one part can have strange effects in other parts. Favoring simpler models is a kind of regularization.
  • Bad priors. Actually, priors and the model are inseparable, so this is kind of the same as the previous one. It is more difficult to choose good priors for more complex models, because it’s difficult for humans to think about high-dimensional parameters and how they affect the predictions. Favoring simpler models can avoid the need to think harder about priors. See also Dan’s blog post and Mike’s case study.

But when I wrote my comments, I assumed that we are considering sensible models, priors, and inference, so there is no need for parsimony. See also the paper Occam’s razor, illustrating the effect of good priors and increasing model complexity.

In “VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.” https://www.ncbi.nlm.nih.gov/pubmed/29745683 we compare AR models with VAR models (AR models are nested in VAR models). When both models perform equally we prefer, in contrast to your blogpost, the less complex AR models because

– The one standard error rule (“ Hastie et al. (2009) propose to select the most parsimonious model within the range of one standard error above the prediction error of the best model (this is the one standard error rule)”).

Hastie is not using priors or Bayesian inference, and thus he needs the parsimony rule. He may also have an implicit cost function…

– A big problem (in my opinion) in psychology is the over interpretation of small effects which exacerbates by using complex models. I fear that some researchers will feel the need to interpret the estimated value of every parameter in the complex model.

Yes, it is dangerous to over-interpret, especially if the estimates don’t include good uncertainty estimates, and even then it’s difficult to make interpretations in the case of many collinear covariates.

My assumption above was that the model is used for predictions and that I care about the best possible predictive distribution.

The situation is different if we add the cost of interpretation or the cost of measurements in the future. I don’t know of literature analyzing the cost of interpretation for an AR vs. a VAR model, or the cost of interpretation of an additional covariate in a model, but when I’m considering interpretability I favor smaller models with similar predictive performance to the encompassing model. But if we have the encompassing model, then I also favor the projection predictive approach, where the full model is used as the reference so that the selection process does not overfit the data and the inference given the smaller model is conditional on the information from the full model (Piironen & Vehtari, 2017). In the case of a small number of models, LOO comparison or Bayesian stacking weights can also be sufficient (some examples here and here). A minimal sketch of the projection predictive workflow follows.
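A sketch with rstanarm and projpred (df and y are placeholders for the data; the point is that submodels are evaluated against the reference model’s predictions rather than the raw data):

```r
library(rstanarm)
library(projpred)
# Reference model: all candidate covariates with sensible priors
# (df is a placeholder data frame with outcome y and the candidate covariates).
refmodel <- stan_glm(y ~ ., data = df, family = gaussian())
vs <- cv_varsel(refmodel)   # projection predictive search with cross-validation
plot(vs, stats = "elpd")    # predictive performance vs. submodel size
suggest_size(vs)            # heuristic suggestion for a submodel size
```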

Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection

There is a recent pre-print Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection by Quentin Gronau and Eric-Jan Wagenmakers. Wagenmakers asked for comments and so here are my comments.

Short version: They report a known limitation of LOO when it’s used in a non-recommended way for model selection. They report that their experiments show that LOO is worse than expected, as it doesn’t favor the simpler true model strongly enough. I think their experiments show that the specific non-recommended LOO comparison favors the simpler model more than expected. I enjoyed the clarity of the writing and the experiments, but they are missing many limitations of LOO and of alternative methods, and I disagree with the conclusions drawn from the experiments.

Long version:

The paper focuses on model comparison. I’d like to remind readers that cross-validation in general and LOO in particular are also useful for estimating the predictive performance and for checking a single model. You can find some examples in Gabry, Simpson, Vehtari, Betancourt, and Gelman (2018), the loo package vignettes, and my model selection tutorial notebooks, but I’ll focus here on model comparison.

Cross-validation and LOO have many limitations. The paper focuses on a variation of one known limitation: “idealized case where there exist a data set of infinite size that is perfectly consistent with the simple model M_S, LOO will nevertheless fail to strongly endorse M_S.” The paper uses one specific variant of LOO model selection. The further variation in the paper is the use of idealized data in the experiments.

The strongest assumption in the paper is that the true model is one of the models compared. This case is often called the M-closed case (Bernardo and Smith, 1994). Cross-validation is usually recommended for the M-open case, where the true model is not included in the set of models compared (Bernardo and Smith, 1994), and thus we might say that the paper is using cross-validation for something for which it is not recommended.

When we estimate the predictive performance of the model for future data which is not yet available, we need to integrate over the future data distribution p_t(y), which we usually don’t know. Vehtari and Ojanen (2012) provide an extensive survey of alternative scenarios. If we assume the M-closed case, then we know that one of the models is p_t(y), but we don’t know which one. In that case, it is sensible to model p_t(y) as a posterior-weighted average of all the models under consideration, that is, to replace p_t(y) with the Bayesian model averaging (BMA) model as proposed by San Martini and Spezzaferri (1984). This approach has the same model selection consistent behavior as the Bayes factor. Thus if the assumption is that one of the models is true, the BMA reference model approach by San Martini and Spezzaferri could be used, although asymptotic model selection consistency doesn’t guarantee good finite sample performance, and BMA may have worse performance than cross-validation for small sample sizes even if one of the models is true.

LOO is known to be model selection inconsistent in the M-closed case, but it is good to remember that there are also cross-validation hold-out rules which are model selection consistent (Shao, 1992). I don’t think these versions of cross-validation are that important, since there are better options for predictive performance estimation in the M-closed case, as mentioned above and below.

If methods that are model selection consistent in the M-closed case are used in the M-open case, they will eventually give weight 1 to the one model that is closest to the truth, but which can still be arbitrarily far from the truth. In the M-open case it can be better if the model weights stay away from 0 and 1, as demonstrated also by Yao, Vehtari, Simpson, and Gelman (2018).

In the M-open case, we assume that none of the models is the true one, and we don’t trust any of our models to be even very close to the true model. Due to the lack of trust, we make minimal assumptions about p_t(y) and, instead of any explicit model, we re-use the data as pseudo-Monte Carlo draws from the unknown distribution. Alternatively, we can also say that we are using a Dirichlet distribution with unknown probability mass at the observed locations as a non-parametric model of p_t(y). This step assumes that we believe that this simple non-parametric assumption can represent the relevant properties of the true distribution, which is one of the big limitations of cross-validation. The edge case of theta_t being very close to 0 or 1, and respectively all observations being 0 or 1, is known to be very sensitive to prior assumptions in any inference (see, e.g., Bernardo and Juárez, 2003), and using all-1 observations for a Dirichlet distribution is clearly missing information about the possible variability in the future. With this kind of very weak data on some important features of the future distribution, I would not use the Dirichlet distribution, but would instead insist that stronger model information is included in p_t(y). A classic example is rare disease prevalence estimation, where we might observe hundreds of healthy persons and no persons with the disease. Without good prior information, the posterior uncertainty remains large on how far from 0 the probability is, and the results are necessarily sensitive to prior information (as reflected also in experiment 1). For experiments 2 and 3 the Dirichlet distribution approximation is likely to work better.

The non-monotonicity observed in some weight curves is likely due to the fact that the idealized data is symmetric and centered, and when leaving one observation out, this symmetry and centering no longer hold. I guess that in scenario 2 a leave-two-out approach, leaving out a pair of 0 and 1 at the same time, wouldn’t show non-monotonicity in the curves. The same holds for Example 3, where it would be better to generate the idealized data as pairs which have the same absolute value but different signs and leave two out at the same time. This way the examples would be even more idealized and still show the limitations of LOO.

In the M-closed case, it is known that LOO is not model selection consistent (e.g., Shao, 1992), which means that the weight for the true model never goes to 1. Gronau and Wagenmakers write that they were surprised that the model weights stayed so far from 1. I was surprised that the model weights were so close to 1. If asymptotically the simpler and the more complex model give practically the same predictions (except in example 1, where the H_0 model does not give any probability to a 0 event), then I would assume the weights to be closer to 0.5. I can think of two reasons why the weights are not closer to 0.5:

  • The idealized data makes LOO, too, favor the simpler model more, and if the data were generated from the true generating process in examples 2 and 3, I would expect the weights to be closer to 0.5.
  • Gronau and Wagenmakers are computing the weights directly from LOO expectations (Pseudo-BF), although at least since Breiman et al. (1984, Ch. 11) it has been recommended that the uncertainty in cross-validation model comparison should also be taken into account. That is, the model with higher estimated performance should not automatically be selected; the uncertainty in the estimates should be considered, too (see also Vehtari, Gelman, and Gabry, 2017).

My recommendation is that if the LOO comparison, taking into account the uncertainties, says that there is no clear winner, then neither of the models is selected and model averaging is used instead. In the case of two models, if both models give very similar predictions, then it doesn’t matter which one is used. In the case of nested models, I would choose the more complex model to be certain that uncertainties are not underestimated (which happens in the simpler model as some parameters are fixed compared to the more complex model), then do strict model checking and calibration, and then proceed with the M-completed approach to decide whether some parts can be dropped, as discussed below.

Yao, Vehtari, Simpson, and Gelman (2018) propose Pseudo-BMA+ (for model weights this could be called Pseudo-BF+ weights), which takes the relevant uncertainty into account and produces weights which are further away from 0 and 1.

Yao, Vehtari, Simpson, and Gelman (2018) also propose Bayesian stacking, which uses LOO as part of the approach. Pseudo-BMA(+) has the limitation that it sees only a scalar value of the predictive performance, while stacking has the benefit that it also sees how similar or different the predictive distributions of the models are, and thus it can jointly optimize better weights (Yao, Vehtari, Simpson, and Gelman, 2018).
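For reference, both kinds of weights can be computed with the loo package (loo1 and loo2 are placeholders for loo objects of the two models):

```r
library(loo)
loo_list <- list(model1 = loo1, model2 = loo2)                # loo1, loo2: placeholders
loo_model_weights(loo_list, method = "stacking")              # Bayesian stacking weights
loo_model_weights(loo_list, method = "pseudobma", BB = TRUE)  # Pseudo-BMA+ weights
```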

Yuling Yao commented in an email:

A quick note is that in the beta-bernoulli examples, the stacking weights can be derived analytically. It is 1 for H_0 and 0 for H_1. Surprisingly it is independent of both the prior distribution and sample size n. The independence of n may look suspicious. But intuitively when each data point (example 1) or each pair of data point (example 2) uniformly supports H_0 more than H_1, it does not take n \to infinity to conclude H_0 dominates H_1. Also only when n goes to infinity shall we observe perfect 0-1 pair in example 2 and exact sample mean= 0 in example 3, so the stacking weighting/selection depends on sample size implicitly.

Although stacking seems to produce something Gronau and Wagenmakers desire, I would not trust stacking in the edge case with all-ones observations, for the reasons I mentioned above. I guess that in this idealized case, stacking with a symmetric leave-two-out scheme would also converge faster.

Gronau and Wagenmakers focus on the case where the simpler model is true, but to better illustrate the limitations of LOO, it would be good to also consider the case where the more complex model is the true one. Consider the following alternative experiments:

  • In example 1, let the true theta_t = 1 - epsilon, but for the simpler model keep theta_0 = 1.
  • In example 2, let the true theta_t = 1/2 + epsilon, but for the simpler model keep theta_0 = 1/2.
  • In example 3, let the true theta_t = epsilon, but for the simpler model keep theta_0 as it is.

If we choose epsilon very small, but within the limits of floating point accuracy for the experiments, we should see the same weights as in the original experiments as long as we observe the same data; only when we occasionally observe one extra 0 in example 1, one extra 1 in example 2, or an extra positive value in example 3 would we see differences. In example 1, even a single 0 makes theta = 1 have zero probability.

Another experiment to illustrate the limitations of LOO (and of cross-validation and information criteria in general) would be to vary epsilon from very small to much larger, plotting how large epsilon needs to be before we see the more complex model strongly favored. I'm not certain what happens for the pseudo-BF version with no proper uncertainty handling used by Gronau and Wagenmakers, but one of the big limitations of cross-validation is that the uncertainty about the future, when no model assumptions are made, is so big that for making a confident choice between the models epsilon needs to be larger than what would be needed if we just looked at the posterior distribution of theta in these kinds of simple models (see the demonstration in a notebook, and related results by Wang and Gelman, 2014). This is the price we pay for not trusting any model and thus not getting the benefit of reduced variance from properly modeling the future data distribution! This variability makes LOO bad for model selection when the differences between the models are small, and it only gets worse in the case of a large number of models with similar true performance (see, e.g., Piironen and Vehtari, 2017). Even with just two models to compare, cross-validation has the additional limitation that the simple variance estimate tends to be optimistic (Bengio and Grandvalet, 2004).
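
Here is a rough sketch of the varying-epsilon experiment for example 1, under assumptions of my own (data simulated from Bernoulli(1 - epsilon), H_0 fixing theta = 1, H_1 with a uniform Beta(1,1) prior, which may differ from the prior used in the paper); the LOO predictive densities have closed forms here, so no sampling is needed.

```r
set.seed(1)
n <- 100
eps <- 10^seq(-6, -1, length.out = 20)

w0 <- sapply(eps, function(e) {
  y <- rbinom(n, 1, 1 - e)
  s <- sum(y)
  # H0 (theta = 1): log predictive is 0 for y_i = 1 and -Inf if any y_i = 0
  elpd0 <- if (s == n) 0 else -Inf
  # H1 (Beta(1,1) prior): LOO predictive given the other n - 1 observations
  s_loo <- s - y
  elpd1 <- sum(log(ifelse(y == 1, (s_loo + 1) / (n + 1), (n - s_loo) / (n + 1))))
  # pseudo-BF weight of H0, with no uncertainty handling
  1 / (1 + exp(elpd1 - elpd0))
})

plot(eps, w0, log = "x", type = "b",
     xlab = "epsilon", ylab = "pseudo-BF weight of H0")
```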

If we are brave enough to assume the M-closed or M-completed case, then we can reduce the uncertainty in the model comparison by using a reference model for p_t(y), as demonstrated by Piironen and Vehtari (2017) (see also the demonstration in a notebook). In the M-completed case, we assume that none of the models is the true one, but that there is one model which best represents all the uncertainties we can think of (see more in Vehtari and Ojanen, 2012). The M-completed case is close to San Martini and Spezzaferri's M-closed approach, but with the BMA model replaced by some model we trust. For example, in variable selection the reference model in the M-completed case can be a model with all variables and a sensible prior on the coefficients, which has survived model checking and assessment (possibly involving cross-validation for that single model). In such a case we have started with the M-open assumption, but after model checking and assessment we trust one of the models enough that we can switch to the M-completed case for certain model selection tasks. Given a reference model, we can further reduce the variance in model selection by using the projection predictive approach, as demonstrated by Piironen and Vehtari (2017) (with a reference implementation in the projpred package). In the M-closed case with a BMA reference model, the projection predictive approach is also model selection consistent, but more importantly it has very low variance in model selection in the finite case.
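
A minimal projpred sketch of the workflow described above (reference model with all variables, then projection onto smaller submodels); the toy data and default priors are mine, and the exact argument names depend on the projpred version.

```r
library(rstanarm)
library(projpred)

# reference model: all candidate variables (a regularized prior such as
# the horseshoe would often be a better choice than the default)
refm <- stan_glm(mpg ~ ., data = mtcars, refresh = 0)

# cross-validated search over submodels, then the projected posterior
# for a submodel of the suggested size
vs <- cv_varsel(refm)
proj <- project(vs, nterms = suggest_size(vs))
```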

Often the discussion of cross-validation vs. Bayes factors focuses on arguments about whether M-closed is a sensible assumption. I think the M-closed case is rare, but if you insist on M-closed, then you can still use a predictive approach (Vehtari and Ojanen, 2012). If BMA is easy to compute and stable, use it as the reference model and then do the model selection as in San Martini and Spezzaferri (1984), or even better with the projection predictive approach (Piironen and Vehtari, 2017). If BMA is difficult to compute or unstable, I would recommend trying Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018).

Mostly I don't trust any model: I assume the M-open case and that the models may be badly misspecified. Before any model selection I can discard models based on prior, posterior, and cross-validation predictive checking (see, e.g., Gabry, Simpson, Vehtari, Betancourt, and Gelman, 2018). For M-open model selection with a small number of models I use cross-validation with uncertainty estimates (Vehtari, Gelman, and Gabry, 2017). If there is no clear winner, then I recommend model averaging with Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018). In the case of a large number of models, I recommend trying to convert the problem to M-completed and using the reference predictive approach, or even better the projection predictive approach (Vehtari and Ojanen, 2012; Piironen and Vehtari, 2017).

Since the above is a long discussion, here are my final recommendations for how to revise the paper if I were a hypothetical reviewer:

  • Since the paper cites Vehtari, Gelman, & Gabry (2017) and Yao, Vehtari, Simpson, & Gelman (in press), the minimal requirement would be to mention that these papers explicitly recommend computing the uncertainties due to not knowing the future data, and that they do not recommend the LOO weights used in the experiments in the paper. In its current form, the paper gives a misleading impression of the recommended use of LOO.
  • Even better would be to also add experiments using the other LOO-based model weights (Pseudo-BMA+ and Bayesian stacking) introduced in Yao, Vehtari, Simpson, & Gelman (in press), or at least to mention in the discussion that additional experiments with these would be valuable. I would love to see these results.
  • Even better would be to add the small-epsilon and varying-epsilon experiments described above, or at least to mention in the discussion that additional experiments with these would be valuable. I would love to see these results.
  • Even better would be to add projection predictive experiments. I would love to see research on the limitations of the projection predictive approach. I know this one requires much more work, and thus I understand if the authors are not willing to go that far.
  • Minor comment: For an earlier use of Bayesian LOO, see Geisser (1975).

Aki’s favorite scientific books (so far)

A month ago I (Aki) started a series of tweets about "scientific books which have had a big influence on me…". They are partially in time order, but I can't remember the exact order. I may have forgotten some, and some stretch the original idea, but I can recommend all of them.

I have collected all those book tweets below and fixed only some typos. These are my personal favorites, and there are certainly many great books I haven't listed. Please tell us your own favorite books in the comments, with a short description of why you like them.

I start to tweet about scientific books which have had big influence on me…

  1. Bishop, Neural Networks for Pattern Recognition, 1995. The first book where I read about Bayes. I learned a lot about probabilities, inference, model complexity, GLMs, NNs, gradients, Hessian, chain rule, optimization, integration, etc. I used it a lot for many years.

    Looking again at the contents, it is still a great book, although naturally some parts are a bit outdated.

  2. Bishop (1995) referred to Neal, Bayesian Learning for Neural Networks, 1996, from which I learned about sampling in high dimensions, HMC, prior predictive analysis, evaluation of methods and models. Neal’s free FBM code made it easy to test everything in practice.

  3. Jaynes, Probability Theory: The Logic of Science, 1996: I read this because it was freely available online. There is not much for practical work, but plenty of argumentation for why using Bayesian inference makes sense, which I did find useful when I was just learning Bayes.

    15 years later I participated in a reading circle with mathematicians and statisticians going through the book in detail. The book was still interesting, but not that spectacular anymore. The discussion in the reading circle was worth it.

  4. Gilks, Richardson & Spiegelhalter (eds), Markov Chain Monte Carlo in Practice (1996). Very useful introductions to different MCMC topics by Gilks, Richardson & Spiegelhalter Ch1, Roberts Ch3, Tierney Ch4, Gilks Ch5, Gilks & Roberts Ch6, Raftery & Lewis Ch7.

    And with special mentions to Gelman on monitoring convergence Ch8, Gelfand on importance-sampling leave-one-out cross-validation Ch9, and Gelman & Meng on posterior predictive checking Ch11. My copy is worn out from heavy use.

  5. Gelman, Carlin, Stern, and Rubin (1995). I just loved the writing style, and it had so many insights and plenty of useful material. During my doctoral studies I also did about 90% of the exercises as self-study.

    I considered using the first edition when I started teaching Bayesian data analysis, but I thought it was maybe too much for an introductory course, and it didn't have model assessment and selection, which is important for me.

    This book (and its later editions) is the one I have re-read most, and when re-reading I keep finding things I didn't remember being there (I guess I have a bad memory). I still use the latest edition regularly, and I'll get back to the later editions below.

  6. Bernardo and Smith, Bayesian Theory, 1994. Great (though not complete) coverage of the foundations and axioms of Bayesian theory, with the emphasis that actions and utilities are an inseparable part of the theory.

    They admit problems of theory in continuous space (which seem to not have a solution that would please everyone, even if it works in practice) and review general probability theory. They derive basic models from simple exchangeability and invariance assumptions.

    They review utility- and discrepancy-based model comparison and rejection, with definitions of M-open, -complete, and -closed. This book and Bernardo's many papers had a strong influence on how I think about model assessment and selection (see, e.g., http://dx.doi.org/10.1214/12-SS102).

  7. Box and Tiao, Bayesian Inference in Statistical Analysis, 1973. Wonderful book, if you want to see how difficult inference was before MCMC and prob. programming. Includes some useful models, and we used one of them as a prior in a neuromagnetic inverse problem http://becs.aalto.fi/en/research/bayes/brain/lpnorm.pdf

  8. Jeffreys, Theory of Probability, 3rd ed, 1961. Another book with historical interest. The intro and estimation part are sensible. I was very surprised to learn that he wrote about all the problems of Bayes factor, which was not evident from the later literature on BF.

  9. Jensen, An Introduction to Bayesian Networks, 1996. I'm travelling to Denmark, which reminded me of this nice book on Bayesian networks. It's out of print, but Jensen & Nielsen, Bayesian Networks and Decision Graphs, 2007, seems to be a good substitute.

  10. Dale, A History of Inverse Probability: From Thomas Bayes to Karl Pearson, 1991. Back to historically interesting books. Dale has done a lot of great research on the history of statistics. This one helps to understand the Bayesian–frequentist conflict in the 20th century.

    The conflict can be seen, eg, Lindley writing in 1968: “The approach throughout is Bayesian: there is no discussion of this point, I merely ask the non-Bayesian reader to examine the results and consider whether they provide sensible and practical answers”.

    McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked The Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy, 2011 is more recent and entertaining, but based also on much of Dale’s research.

  11. Laplace, Philosophical Essay on Probabilities, 1825. English translation with notes by Dale, 1995. Excellent book. I enjoyed how Laplace justified the models and priors he used. Considering the clarity of the book, it's strange how little these ideas were used before the 20th century.

  12. Press & Tanur, The Subjectivity of Scientists and the Bayesian Approach, 2001. Many interesting and fun stories about progress of science by scientists being very subjective. Argues that Bayesian approach at least tries to be more explicit on assumptions.

  13. Spirer, Spirer & Jaffe, Misused Statistics, 1998. Examples of common misuses of statistics (deliberate or inadvertent) in graphs, methodology, data collection, interpretation, etc. Great and fun (or scary) way to teach common pitfalls and how to do things better.

  14. Howson & Urbach, Scientific Reasoning: The Bayesian Approach, 2nd ed, 1999. Nice book on Bayesianism and philosophy of science: induction, confirmation, falsificationism, axioms, Popper, Lakatos, Kuhn, Cox, Good, and contrast to Fisherian & Neyman-Pearson significance tests.

    There are also 1st ed 1993 and 3rd ed 2005.

  15. Gentle, Random Number Generation and Monte Carlo Methods, 1998, 2.ed 2003. Great if you want to understand or implement: pseudo rng’s, checking quality, quasirandom, transformations from uniform, methods for specific distributions, permutations, dependent samples & sequences.

  16. Sivia, Data Analysis: A Bayesian Tutorial, 1996. I started teaching a Bayesian analysis course in 2002 using this thin, very Jaynesian book, as it had many good things. Afterward I realized that it missed too much of the workflow for students to be able to do their own projects.

  17. Gelman, Carlin, Stern, & Rubin, BDA2, 2003. This hit the spot. Improved model checking, new model comparison, more on MCMC, and new decision analysis made it at that time the best book for the whole workflow. I started using it in my teaching the same year it was published.

    Of course it still had some problems, like using DIC instead of cross-validation, an effective sample size estimate without autocorrelation analysis, etc., but the additional material I needed to introduce in my course was minimal compared to what any other book would have required.

    My course included chapters 1-11 and 22 (with varying emphasis), and I recommended that students read the other chapters.

  18. MacKay, Information Theory, Inference, and Learning Algorithms, 2003. Super clear introduction to information theory and codes. Has also excellent chapters on probabilities, Monte Carlo, Laplace approximation, inference methods, Bayes, and ends up with neural nets and GPs.

    The book is missing the workflow part, but it has many great insights clearly explained. For example, in Monte Carlo chapter, I love how MacKay tells when the algorithms fail and what happens in high dimensions.

    Before the 2003 version, I had been reading also drafts which had been available since 1997.

  19. O'Hagan and Forster, Bayesian Inference, 2nd ed, vol 2B of Kendall's Advanced Theory of Statistics, 2004. A great reference on all the important concepts in Bayesian inference. Fits well between BDA and Bayesian Theory, and is one of my all-time favorite books on Bayes.

    Covers, e.g., inference, utilities, decisions, value of information, estimation, likelihood principle, sufficiency, ancillarity, nuisance, non-identifiability, asymptotics, Lindley’s paradox, conflicting information, probability as a degree of belief, axiomatic formulation, …

    finite additivity, comparability of events, weak prior information, exchangeability, non-subjective theories, specifying probabilities, calibration, elicitation, model comparison (a bit outdated), model criticism, computation (MCMC part is a bit outdated), and some models…

  20. Rasmussen and Williams, Gaussian Processes for Machine Learning, 2006. I was already familiar with GPs through many articles, but this became a much-used handbook and course book for us. The book is exceptional in that it also explains how to implement stable computation.

    It has a nice chapter on the Laplace approximation and expectation propagation conditional on hyperparameters, but only a Type II MAP estimate for the hyperparameters. It has an ML flavor overall, and I know statisticians who have difficulties following the story.

    The book was very useful when writing GPstuff. It’s also available free online.

  21. Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, 2006. I was already familiar with the models and methods in the book, but I loved how it focused on how to think about models, modeling and inference, using many examples to illustrate concepts.

    The book starts from simple linear models and has the patience to progress slowly, not going too early into the details of computation, and it works surprisingly well even though Bayesian inference comes only after 340 pages.

    Gaussian linear model, logistic regression, generalized linear models, simulation, model checking, causal inference, multilevel models, Bugs, Bayesian inference, sample size and power calculations, summarizing models, ANOVA, model comparison, missing data.

    I recommended the book to my students after BDA2 and O’Hagan & Forster, as it seemed to be a good and quick read for someone who knows how to do the computation already, but I couldn’t see how I would use it in teaching as Bayesian inference comes late and it was based on BUGS!

    More recently re-reading the book, I still loved the good bits, but was also shocked to see how much it encouraged wandering around in a garden of forking paths. AFAIK there is a new edition in progress which updates it to use more modern computation and model comparison.

  22. Harville, Matrix Algebra From a Statistician’s Perspective, 1997. 600 pages of matrix algebra with focus on that part of matrix algebra commonly used in statistics. Great book for people implementing computational methods for GPs and multivariate linear models.

    Nowadays, with the Matrix Cookbook online, I use it less often to check simpler matrix algebra tricks, but my students still find it useful as it goes deeper and has more derivations on many topics.

  23. Gelman and Nolan, Teaching Statistics: A Bag of Tricks, 2002 (2.ed 2017). A large number of examples, in-class activities, and projects to be used in teaching concepts in intro stats course. I’ve used ideas from different parts and especially from decision analysis part.

  24. Abrams, Spiegelhalter & Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation, 2004. This was a helpful book for learning the basic statistical issues in clinical trials and health-care evaluation, and how to replace "classic" methods with Bayesian ones.

    Medical trials, sequential analysis, randomised controlled trials, ethics of randomization, sample-size assessment, subset and multi-center analysis, multiple endpoints and treatments, observational studies, meta-analysis, cost-effectiveness, policy-making, regulation, …

  25. Ibrahim, Chen & Sinha, Bayesian Survival Analysis, 2001. The book goes quickly into the details of models and inference and thus is not an easy one. There has been a lot of progress in models and inference since, but it's still a very valuable reference on survival analysis.

  26. O'Hagan et al, Uncertain Judgements: Eliciting Experts' Probabilities, 2006. A great book on the very important but too often ignored topic of eliciting prior information. A must-read for anyone considering using (weakly) informative priors.

    The book reviews psychological research that shows, e.g., how the form of the questions affects the experts' answers. The book also provides recommendations on how to do better elicitation and how to validate the results of the elicitation.

    Uncertainty & the interpretation of probability, aleatory & epistemic, what is an expert?, elicitation process, the psychology of judgment under uncertainty, biases, anchoring, calibration, representations, debiasing, elicitation, evaluating elicitation, multiple experts, …

  27. Bishop, Pattern Recognition and Machine Learning, 2006. It’s quite different from 1995 book, although it covers mostly the same models. For me there was not much new to learn, but my students have used it a lot as a reference, and I also enjoyed the explanations of VI and EP.

    Based on the contents and the point of view, the name of the book could also be “Probabilistic Machine Learning”

Due to the theme "influence on me", it happened that all the books I listed were published in 2006 or earlier. Since then I've seen great books, but they have had less influence on me. I may later make a longer list of more recent books I can recommend, but here are some as a bonus:

  • McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, 2012. Entertaining book about history of Bayes theory.

  • Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis, 3rd ed, 2013. Obviously a great update of the classic book.

  • Särkkä, Bayesian Filtering and Smoothing, 2013. A concise introduction to non-linear Kalman filtering and smoothing, particle filtering and smoothing, and to the related parameter estimation methods from the Bayesian point of view.

  • McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan, 2015. Easier than BDA3 and well written. I don’t like how the model comparison is presented, but after reading this book, just check my related articles which were mostly published after this book.

  • Goodfellow, Bengio & Courville, Deep Learning, 2016. This deep learning introduction has enough of a probabilistic view that I can also recommend it.

  • Stan Development Team, Stan Modeling Language: User's Guide and Reference Manual, 2017. It's not just the Stan language manual; it's also full of well written text about Bayesian inference and models. There is a plan to divide it into parts, and one part would make a great textbook.

I've read more than these, and this list is just the ones I enjoyed most. I think people, myself included, read fewer books now that it's easier to find articles, case studies, and blog posts on the internet. Someday I'll make a similar list of the top papers I've enjoyed.

The curse of dimensionality and finite vs. asymptotic convergence results

Related to our (Aki, Andrew, Jonah) Pareto smoothed importance sampling paper, I (Aki) have received a few times the comment that why bother with Pareto smoothing, as you can always choose the proposal distribution so that the importance ratios are bounded and then the central limit theorem holds. The curse of dimensionality here is that the papers they refer to used low-dimensional experiments, and the results do not work so well in high dimensions. Readers of this blog should not be surprised that things do not look the same in high dimensions. In high dimensions the probability mass is far from the mode; it's spread thinly on the surface of a high-dimensional sphere. See, e.g., Mike's paper, Bob's case study, and a related blog post.

In importance sampling, one working solution in low dimensions is to use a mixture of two proposals: one component tries to match the mode, and the other makes sure that the tails go down more slowly than the tails of the target, ensuring bounded ratios. In the following I only look at the behavior with a single component which has thicker tails, so that the importance ratios are bounded (but I have made a similar experiment with a mixture, too).

The target distribution is a multidimensional normal with zero mean and unit covariance matrix. In the first case the proposal distribution is also normal, but with scale 1.1 in each dimension. The proposal scale is just slightly larger than that of the target, and we are often lucky if we can guess the scale of the target with 10% accuracy. I draw 100,000 draws from the proposal distribution.
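
Here is a minimal R sketch of this setup (using fewer draws than the 100,000 in the text, just to keep it light); the khat and effective sample size diagnostics come from the psis() function in the loo package.

```r
library(loo)

S <- 1e4    # number of draws (smaller than in the text, for speed)
D <- 1024   # number of dimensions

# draws from the proposal N(0, 1.1^2 I) and the log importance ratios
# log p_target(x) - log p_proposal(x)
x <- matrix(rnorm(S * D, sd = 1.1), nrow = S)
log_ratios <- rowSums(dnorm(x, sd = 1, log = TRUE) -
                      dnorm(x, sd = 1.1, log = TRUE))

psis_fit <- psis(log_ratios, r_eff = 1)
psis_fit$diagnostics   # pareto_k (khat) and n_eff
```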

The following figure shows what happens when the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=512, the 100,000 importance weighted draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=1024 the convergence rate has dropped dramatically, and getting any improvement in the accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic (see the paper for details). The dashed line is k=0.5, which is the limit for the variance being finite, and the dotted line is our suggestion for practically useful performance when using PSIS. But how can khat be larger than 0.5 when we have bounded weights? The central limit theorem has not failed here; we just have not yet reached the asymptotic regime where the CLT kicks in!

The next plot shows in more detail what happens with D=1024.

Since humans are lousy at looking at 1024-dimensional plots, the top subplot shows the one-dimensional marginal densities of the target (blue) and the proposal (red) for the distance from the origin, r = sqrt(sum_{d=1}^D x_d^2). The proposal has only 1.1 times the scale of the target, but most of the draws from the proposal are away from the typical set of the target! The vertical dashed line shows the 1e-6 quantile of the proposal, i.e., when we make 100,000 draws, 90% of the time we don't get any draws below it. The middle subplot shows the importance ratio function, and we can see that the highest value is at r = 0, and that value is larger than 2*10^42! That's a big number. The bottom subplot scales the y-axis so that we can see the importance ratios near that 1e-6 quantile. Check the y-axis: it still goes from 0 to 1e6. So if we are lucky we may get a draw below the dashed line, but then it's likely to get all the weight. The importance ratio function is practically as steep everywhere we can get draws within the age of the universe; the 1e-80 quantile is at 21.5 (1e80 is the estimated number of atoms in the visible universe), and it's still far away from the region where the boundedness of the importance ratio function starts to matter.

I have more similar plots with thick-tailed Student's t, mixtures of proposals, etc., but I'll spare you from more plots. As long as there is some difference between the target and the proposal, taking the number of dimensions high enough breaks IS and PSIS (PSIS gives a slight improvement in performance and, more importantly, can diagnose the problem and improve the Monte Carlo estimate).

In addition to taking into account that many methods which work in low dimensions can break in high dimensions, we need to focus more on finite-case performance. As seen here, it doesn't help us that the CLT holds if we can never reach that asymptotic regime (for the same reason, the Metropolis algorithm in high dimensions may require close to infinite time to produce useful results). The Pareto khat diagnostic has been empirically shown to provide very good finite-case convergence rate estimates which also match some theoretical bounds.

PS. There has been a lot of discussion in the comments about the typical set vs. the high probability set. In the end Carlos Ungil wrote

Your blog post seems to work equally well if you change
“most of the draws from the proposal are away from the typical set of the target!”
to
“most of the draws from the proposal are away from the high probability region of the target!”

I disagree with this, and to show evidence for it I add here a couple more plots I didn't include before (I hope this blog post will not eventually have as many plots as the PSIS paper).

If the target is multivariate normal, we get bounded weights by using a Student-t distribution with a finite number of degrees of freedom nu, even if the scale of the Student-t is smaller than the scale of the target distribution. In the next example the target is the same as above, and the proposal distribution is multivariate Student-t with degrees of freedom nu=7 and variance 1.
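
The corresponding change to the sketch above is only in how the proposal draws and the proposal log density are computed; a t_nu variable has variance nu/(nu - 2), so scaling by sqrt((nu - 2)/nu) gives unit variance.

```r
library(loo)

S <- 1e4; D <- 512; nu <- 7
s <- sqrt((nu - 2) / nu)                        # scale giving variance 1

x <- matrix(rt(S * D, df = nu), nrow = S) * s   # draws from the proposal
# log N(0, 1) density minus log density of the scaled Student-t proposal
log_ratios <- rowSums(dnorm(x, log = TRUE) -
                      (dt(x / s, df = nu, log = TRUE) - log(s)))

psis_fit <- psis(log_ratios, r_eff = 1)
psis_fit$diagnostics
```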

The following figure shows what happens when the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=64, the 100,000 importance weighted draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=256 the convergence rate has dropped dramatically, and getting any improvement in the accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic, which predicts the finite-case convergence rate well (even though asymptotically, with bounded ratios, k < 1/2).

The next plot shows in more detail what happens with D=512.

The top subplot shows the one-dimensional marginal densities of the target (blue) and the proposal (red) for the distance from the origin, r = sqrt(sum_{d=1}^D x_d^2). The proposal density has the same variance but a thicker tail. Most of the draws from the proposal are away from the typical set of the target, in this case towards higher densities than the density in the typical set. The middle subplot shows the importance ratio function, and we can see that the highest value is at r close to 47.5, after which the ratio function starts to decrease again. The highest value of the ratio function is larger than 10^158. The region with the highest value is far away from the typical set of the proposal distribution. The bottom subplot scales the y-axis so that we can see the importance ratios near the proposal and the target distributions. The importance ratio goes from very small values up to 10^30 over a very narrow range, and thus it's likely that the largest draw from the proposal will get all the weight. It's unlikely that in practical time we would get enough draws for the asymptotic benefits of bounded ratios to kick in.

PPS. The purpose of this post was to illustrate that bounded ratios and asymptotic convergence results are not enough for practically useful IS performance. There are also special cases where, due to the special structure of the posterior, we can get practically good performance with IS, and especially with PSIS, also in high-dimensional cases (the PSIS paper has a 400-dimensional example with khat < 0.7).

 

Stacking and multiverse

It’s a coincidence that there is another multiverse posting today.

Recently Tim Disher asked a question in the Stan discussion forum: "Multiverse analysis – concatenating posteriors?"

Tim refers to a paper “Increasing Transparency Through a Multiverse Analysis” by Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. The abstract says

Empirical research inevitably includes constructing a data set by processing raw data into a form ready for statistical analysis. Data processing often involves choices among several reasonable options for excluding, transforming, and coding data. We suggest that instead of performing only one analysis, researchers could perform a multiverse analysis, which involves performing all analyses across the whole set of alternatively processed data sets corresponding to a large set of reasonable scenarios. Using an example focusing on the effect of fertility on religiosity and political attitudes, we show that analyzing a single data set can be misleading and propose a multiverse analysis as an alternative practice. A multiverse analysis offers an idea of how much the conclusions change because of arbitrary choices in data construction and gives pointers as to which choices are most consequential in the fragility of the result.

In that paper the focus is on looking at the possible results from the multiverse of forking paths, but Tim asked whether it would "make sense at all to combine the posteriors from a multiverse analysis in a similar way to how we would combine multiple datasets in multiple imputation"?

After I (Aki) thought about this, my answer is:

  • In multiple imputation the different data sets are posterior draws from the missing data distribution and are thus usually equally weighted.
  • I think multiverse analysis is similar to the case of having a set of models with different variables, variable transformations, interactions, and non-linearities, as in our stacking paper (Yao, Vehtari, Simpson, Gelman), where we have different models for the arsenic well data (section 4.6). Then stacking would be a sensible way to combine *predictions* (as we may have different model parameters for differently processed data) with non-equal weights; a minimal sketch is given after this list. Stacking is a good choice for model combination here as
    1. we don’t need to assign prior probabilities for different forking paths
    2. stacking favors paths which give good predictions
    3. it avoids the "prior dilution problem" if some processed datasets happen to be very similar to each other (see fig 2c in the stacking paper)
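
Here is a minimal sketch of what that combination could look like, under assumptions of my own: fit_list holds one fitted rstanarm or brms model per forking path (each fitted to its own processed data set, with the same number of posterior draws), and newdata is where predictions are wanted. Stacking weights from the loo package are used to draw a weighted mixture of posterior predictive draws.

```r
library(loo)

loos <- lapply(fit_list, loo)                      # PSIS-LOO for each forking path
w <- loo_model_weights(loos, method = "stacking")  # stacking weights

# weighted mixture of posterior predictive draws: for each draw index,
# pick which path it comes from with probability given by the weights
preds <- lapply(fit_list, posterior_predict, newdata = newdata)
S <- nrow(preds[[1]])
k <- sample(seq_along(preds), size = S, replace = TRUE, prob = as.numeric(w))
y_rep <- preds[[1]]
for (s in seq_len(S)) y_rep[s, ] <- preds[[k[s]]][s, ]
```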

StanCon 2018 Helsinki, 29-31 August 2018

Photo of Helsinki (c) Visit Helsinki / Jussi Hellsten

StanCon 2018 Asilomar was so much fun that we are organizing StanCon 2018 Helsinki August 29-31, 2018 at Aalto University, Helsinki, Finland (location chosen using antithetic sampling).

Full information is available at StanCon 2018 Helsinki website

Summary of the information

What: One day of tutorials and two days of talks, open discussions, and statistical modeling in beautiful Helsinki.

When: August 29-31, 2018

Where: Aalto University, Helsinki, Finland

Invited speakers

  • Richard McElreath, Max Planck Institute for Evolutionary Anthropology
  • Maggie Lieu, European Space Astronomy Centre
  • Sarah Heaps, Newcastle University
  • Daniel Simpson, University of Toronto

Call for contributed talks

StanCon's version of conference proceedings is a collection of contributed talks based on interactive, self-contained notebooks (e.g., knitr, R Markdown, Jupyter, etc.). For example, you might demonstrate a novel modeling technique, or a (possibly simplified version of a) novel application, etc. There is no minimum or maximum length, and anyone using Stan is welcome to submit a contributed talk.

More details are available on the StanCon submissions web page and examples of accepted submissions from StanCon 2017 are available in our stancon_talks repository on GitHub.

Contributed posters

We will accept poster submissions on a rolling basis until July 31st. One page exclusive of references is the desired format but anything that gives us enough information to make a decision is fine. See the conference web page for submission instructions.

Sponsors

If you're interested in sponsoring StanCon 2018 Helsinki, please reach out to [email protected]. Your generous contributions will ensure that our registration costs are kept as low as possible and will allow us to subsidize attendance for students who would otherwise be unable to come.

Custom Distribution Solutions


I (Aki) recently made a case study that demonstrates how to implement user-defined probability functions in the Stan language (case study, git repo). As an example I use the generalized Pareto distribution (GPD) to model extreme values of geomagnetic storm data from the World Data Center for Geomagnetism. Stan has had support for user-defined functions for a long time, but there wasn't a full practical example of how to implement all the functions that built-in distributions have (_lpdf (or _lpmf), _cdf, _lcdf, _lccdf, and _rng). Having the full set of functions makes it easy to implement models, censoring, posterior predictive checking, and loo. The most interesting things I learned while making the case study were:

  • How to replicate the behavior of Stan's internal distribution functions as closely as possible (due to the lack of overloading for user-defined functions, we have to make some compromises).
  • How to write tests for the user-defined distribution functions.

By using this case study as a template, it should be easier and faster to implement and test new custom distributions for your Stan models.
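
As a small illustration of the testing part, here is a plain R version of the generalized Pareto log-density (my own naming and parameterization, not necessarily that of the case study) that a user-defined Stan _lpdf can be checked against.

```r
# generalized Pareto log-density with lower bound ymin, scale sigma > 0,
# and shape k; the k -> 0 limit is the (shifted) exponential distribution
gpd_lpdf <- function(y, ymin, sigma, k) {
  z <- (y - ymin) / sigma
  if (abs(k) > 1e-12) {
    -log(sigma) - (1 / k + 1) * log1p(k * z)
  } else {
    -log(sigma) - z
  }
}

# quick check: inverse-cdf draws with ymin = 1, sigma = 0.5, k = 0.1
u <- runif(5)
y <- 1 + 0.5 * (u^(-0.1) - 1) / 0.1
gpd_lpdf(y, ymin = 1, sigma = 0.5, k = 0.1)
```
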

Postdoc in Finland and NY to work on probabilistic inference and Stan!

I (Aki) got two-year funding to hire a postdoc to work on the validation of probabilistic inference approaches and model selection in Stan. The work would be done with the Stan team at Aalto (Helsinki) and Columbia (New York). We probably have PhD positions, too.

The funding is part of a joint project with Antti Honkela and Arto Klami at the University of Helsinki, and together we are hiring three postdocs. The two other postdocs would work in Helsinki and part of the time at DTU, Denmark (Ole Winther) or Cambridge, UK (Zoubin Ghahramani).

The project is about theory and methods for assessing the quality of distributional approximations, improving inference accuracy by targeting the approximation towards the eventual application goal, and making better use of the available data, e.g., when the data have privacy constraints.

More information about the position and how to apply is here.

You can manage very well in Finland with English, and you don't need to learn any Finnish for the job. Helsinki has been selected many times among the world's top 10 most liveable cities: https://yle.fi/uutiset/osasto/news/helsinki_again_among_worlds_top_10_liveable_cities/9781098