When LOO and other cross-validation approaches are valid

Introduction

Zacco asked on the Stan Discourse whether leave-one-out (LOO) cross-validation is valid for phylogenetic models. He also referred to Dan’s excellent blog post which mentioned the iid assumption. Instead of iid it would be better to talk about the exchangeability assumption, but I (Aki) got a bit lost in my Discourse answer (so don’t bother going to read it). I started to write this blog post to clear my thoughts and extend what I’ve said before, and hopefully produce a better answer.

TL;DR

The question is when leave-one-out cross-validation or leave-one-group-out cross-validation is valid for model comparison. The short answer is that we need to think about what the joint data generating mechanism is, what is exchangeable, and what the prediction task is. LOO can be valid or invalid, for example, for time-series and phylogenetic modelling, depending on the prediction task. Everything said about LOO applies also to AIC, DIC, WAIC, etc.

iid and exchangeability

Dan wrote

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.

This is well put, although iid is a stronger assumption than necessary. It would be better to assume exchangeability, which doesn’t imply independence. The exchangeability assumption thus extends the cases in which LOO is applicable. The details of the difference between iid and exchangeability are not that important for this post, but I recommend that interested readers see Chapter 5 in BDA3. Instead of iid vs exchangeability, I’ll focus here on prediction tasks and data generating mechanisms.

Basics of LOO

LOO is easiest to understand in the case where we have a joint distribution $latex p(y|x,\theta)p(x|\phi)$, and $latex p(y|x,\theta)$ factorizes as

$latex p(y|x,\theta) = \prod_{n=1}^N p(y_n|x_n,\theta).$

We are interested in predicting a new data point coming from the data generating mechanism

$latex p(y_{N+1}|x_{N+1},\theta).$

When we don’t know $latex \theta$, we integrate over the posterior of $latex \theta$ to get the posterior predictive distribution

$latex p(y_{N+1}|x_{N+1},x,y)=\int p(y_{N+1}|x_{N+1},\theta) p(\theta|x,y) d\theta.$

We would like to evaluate how good this predictive distribution is by comparing it to the observed $latex (y_{N+1}, x_{N+1})$. If we have not yet observed $latex (y_{N+1}, x_{N+1})$, we can use LOO to approximate the expectation over different possible values of $latex (y_{N+1}, x_{N+1})$. Instead of making a model $latex p(y,x)$, we re-use the observations as pseudo-Monte Carlo samples from $latex p(y_{N+1},x_{N+1})$, and, in order not to use the same observation both for fitting $latex \theta$ and for evaluation, we use the LOO predictive distributions

$latex p(y_{n}|x_{n},x_{-n},y_{-n})=\int p(y_{n}|x_{n},\theta) p(\theta|x_{-n},y_{-n}) d\theta.$
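To make this concrete, here is a minimal brute-force sketch in Python for a toy conjugate normal model without covariates (the model, priors, and all names such as mu0, tau0, and sigma are my illustrative assumptions, not from the post): each LOO predictive density is computed by refitting the posterior without the held-out point.

```python
# Brute-force exact LOO for a toy model: y_n ~ Normal(theta, sigma),
# theta ~ Normal(mu0, tau0), with sigma known. All names are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=50)        # observed data
sigma, mu0, tau0 = 1.0, 0.0, 10.0        # known sd and prior for theta

def post_params(y_obs):
    """Posterior mean and sd of theta in the normal-normal conjugate model."""
    prec = 1 / tau0**2 + len(y_obs) / sigma**2
    mean = (mu0 / tau0**2 + y_obs.sum() / sigma**2) / prec
    return mean, np.sqrt(1 / prec)

# LOO predictive log density for each held-out point:
# p(y_n | y_-n) = Normal(y_n | post_mean_-n, sqrt(post_sd_-n^2 + sigma^2))
loo_lpd = np.empty(len(y))
for n in range(len(y)):
    m, s = post_params(np.delete(y, n))
    loo_lpd[n] = stats.norm.logpdf(y[n], m, np.sqrt(s**2 + sigma**2))

elpd_loo = loo_lpd.sum()                 # estimate of the expected log predictive density
print(elpd_loo)
```

In practice refitting $latex N$ times is expensive, which is why approximations such as PSIS-LOO are used; the point here is only the structure of the computation.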

The usually forgotten assumption is that $latex x_{N+1}$ is unknown, with the uncertainty described by a distribution! We have a model for $latex p(y|x)$, but often we don’t build a model for $latex p(x)$. BDA3 Chapter 5 discusses how, when we have extra information $latex x_n$ for $latex y_n$, the $latex y_n$ are not exchangeable, but the $latex (x_n, y_n)$ pairs are. BDA3 Chapter 14 discusses how, if we have a joint model $latex p(x,y|\phi,\theta)$ and a prior which factorizes as

$latex p(\phi,\theta)=p(\phi)p(\theta),$

then the posterior factorizes as

$latex p(\phi,\theta|x,y)=p(\phi|x)p(\theta|x,y)$.

We can analyze the second factor by itself with no loss of information.

For the predictive performance estimate we, however, need to know how the future $latex x_{N+1}$ would be generated! In LOO we avoid making that model explicit by assuming that the observed $latex x$’s implicitly (or non-parametrically) represent the distribution $latex p(x)$.
If we assume that the future distribution of $latex x$ is different from the past, we can use importance weighting to adjust. Extrapolation (far) beyond the observed data would require an explicit model for $latex p(x_{N+1})$. We discuss different joint data generating processes in Vehtari & Ojanen (2012), but I’ll use more words and examples below.
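As a hedged sketch of the reweighting idea (the functions q_density and p_density are stand-ins for densities we would have to specify ourselves; nothing here comes from the post):

```python
# Reweight pointwise LOO terms when the assumed future covariate distribution
# q(x) differs from the empirical distribution p(x) of the observed x's.
import numpy as np

def weighted_elpd_loo(loo_lpd, x, q_density, p_density):
    """Weight each pointwise LOO log predictive density by q(x_n) / p(x_n)."""
    w = q_density(x) / p_density(x)          # importance ratios at observed x's
    w = w * (len(loo_lpd) / w.sum())         # normalize so the weights sum to N
    return np.sum(w * loo_lpd)
```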

Data generating mechanisms

Dan wrote

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

This is quite a common way to describe the problem, but I hope to make it clearer by discussing the data generating mechanisms in more detail.

Sometimes $latex p(x)$ does not exist, for example when $latex x$ is fixed, chosen by design, or deterministic! Then $latex p(x,y)$ also does not exist, and exchangeability (or iid) does not apply to $latex (x_n,y_n)$. We can still make a conditional model $latex p(y|x,\theta)$ and analyze posterior $latex p(\theta|x,y)$.

If $latex x$ is fixed or chosen by design, we could think of a prediction task in which we would repeat one of the measurements with the same $latex x$’s. In that case, we might want to predict all the new observations jointly, but approximating that with cross-validation would lead to leave-all-out cross-validation, which is far from what we want (and sensitive to prior choices). So we may then use the more stable and easier-to-compute LOO, which is still informative about predictive accuracy for repeated experiments.

LOO can be used in the case of fixed $latex x$ to evaluate the predictive performance of the conditional terms $latex p(\tilde{y}_n|x_n,x,y)$, where $latex \tilde{y}_n$ is a new observation conditional on a fixed $latex x_n$. Taking the sum (or mean) of these terms then weights each fixed $latex x_n$ equally. If we care more about the performance for some fixed $latex x_n$ than for others, we can use a different weighting scheme to adjust.

$latex x$ can sometimes be a mix of something with a distribution and something which is fixed or deterministic. For example, in a clinical study, we could assume that patient covariates come from some distribution in the past and in the future, but that the drug dosage is deterministic. It’s clear that if the drug helps, we don’t assume that we would continue giving a bad dosage in the future, so it’s unlikely we would ever observe the corresponding outcome either.

Time series

We could assume the future $latex x$’s to be fixed, chosen by design, or deterministic, but different from the observed $latex x$’s. For example, in time series we have observed time points $latex 1,\ldots,N$ and then often we want to predict for $latex N+1,\ldots,N+m$. Here the data generating mechanism for $latex x$ (time) is deterministic and all the future values of $latex x$ are outside of our observed $latex x$’s. It is still possible that the conditional part for $latex y$ factorizes as $latex \prod_{n=1}^N p(y_n|f_n,\theta)$ given latent process values $latex f$ (and we have a joint prior on $latex f$ describing our time series assumptions), and we may assume $latex (y_n-f_n)$ to be exchangeable (or iid). What matters more here is the structure of $latex x$. Approximating the prediction task leads to $latex m$-step-ahead cross-validation, where we don’t use the future data to provide information about $latex f_n$ or $latex \theta$ (see a draft case study). Under short-range dependency and stationarity assumptions, we can also use some of the future data in $latex m$-step-ahead leave-a-block-out cross-validation (see a draft case study).
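A minimal sketch of how the $latex m$-step-ahead splits could be constructed (the function name and the minimum training length L are my illustrative assumptions, not from the case study):

```python
# m-step-ahead cross-validation splits for a time series observed at t = 1..N:
# fit on the first k points, evaluate on the next m points.
def m_step_ahead_splits(N, m, L):
    """Yield (train, test) index lists; L is the smallest training length used."""
    for k in range(L, N - m + 1):
        yield list(range(k)), list(range(k, k + m))

# Example: 10 observations, 2-step-ahead predictions, at least 5 training points
for train, test in m_step_ahead_splits(10, 2, 5):
    print(len(train), test)
```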

We can also have a time series where we don’t want to predict for the future, and thus the focus is only on the observed time points $latex 1,\ldots,N$. For example, we might be interested in analysing whether more or fewer babies are born on some special days during a time period in the past (I would assume there are plenty of more interesting examples in history studies and the social sciences). It’s sensible to use LOO here to analyze whether we have been able to learn relevant structure in the time series in order to better predict the number of births on a left-out day. Naturally we could also analyze different aspects of the time series model by leaving out one week, one month, one year, or several days around special days, to focus on different assumptions in our time series model.

LOO-elpd can fail or not fail for time series depending on your prediction task. LOO is not great if you want to estimate the performance of extrapolation to the future, but having a time series doesn’t automatically invalidate the use of cross-validation.

Multilevel data and models

Dan wrote

Or when your data has multilevel structure, where you really should leave out whole strata.

For multilevel structure, we can start from a simple example with individual observations from $latex M$ known groups (or strata as Dan wrote). The joint conditional model is commonly

$latex \prod_{m=1}^M \left\{\left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_m|\psi)\right\} p(\psi),$

where $latex y$ are partially exchangeable, so that $latex y_{mn}$ are exchangeable within group $latex m$, and $latex \theta_m$ are exchangeable. But for the prediction we also need to consider how the $latex x$’s are generated. If we also have a model for $latex x$, then we might have the following joint model

$latex p(y,x)= \prod_{m=1}^M \left\{\left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m)\right] p(\theta_m|\psi)\right\} p(\psi) \prod_{m=1}^M \left\{\left[ \prod_{n=1}^N p(x_{mn}|\phi_m)\right] p(\phi_m|\varphi)\right\} p(\varphi).$

Sometimes we assume we haven’t observed all groups and want to make predictions for $latex y_{M+1,n}$ for a new group $latex M+1$. Say we have observed individual students in different schools and we want to predict for a new school. Then it is natural to consider leave-one-school-out cross-validation to simulate the fact that we don’t have any observations yet from that new school. Leave-one-school-out cross-validation will then also implicitly (non-parametrically) model the distribution of $latex x_{mn}$.
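A minimal sketch of leave-one-group-out (here: leave-one-school-out) splits; the names and the toy group labels are illustrative, not from the post:

```python
# Leave-one-group-out splits: hold out all observations from one group at a time.
import numpy as np

def leave_one_group_out_splits(group):
    """Yield (group label, train indices, test indices) for each group."""
    group = np.asarray(group)
    for g in np.unique(group):
        yield g, np.where(group != g)[0], np.where(group == g)[0]

group = np.array([0, 0, 1, 1, 1, 2, 2])   # e.g. school label of each student
for g, train, test in leave_one_group_out_splits(group):
    print(g, train.tolist(), test.tolist())
```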

On the other hand, if we poll people in all states of the USA, there will be no new states (at least in the near future) and some or all $latex x$ might be fixed or deterministic, but there can be new people to poll in these states, and LOO could be sensible for predicting what the next person would answer. And even if there are more schools, we could focus on just these schools and have fixed or deterministic $latex x$. Thus the validity of LOO in hierarchical models also depends on the data generating mechanism and the prediction task.

Phylogenetic models and non-factorizable priors

Zacco’s discourse question was related to phylogenetic models.

Wikipedia says

In biology, phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species, or populations). These relationships are discovered through phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology under a model of evolution of these traits. The result of these analyses is a phylogeny (also known as a phylogenetic tree) – a diagrammatic hypothesis about the history of the evolutionary relationships of a group of organisms. The tips of a phylogenetic tree can be living organisms or fossils, and represent the “end”, or the present, in an evolutionary lineage.

This is similar to the hierarchical model above, but now we assume a non-factorizable prior for $latex \theta$

$latex \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi),$

and if we have a model for $latex x$, we may also have a non-factorizable prior for $latex \phi$

$latex p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\phi_1,\ldots,\phi_M|\varphi)p(\varphi).$

Again LOO is valid if we focus on new observations in the current groups (e.g. observed species). Alternatively we could consider prediction for new individuals in a new group (e.g. a new species) and use leave-one-group-out cross-validation. I would assume we might have extra information about some $latex x$’s for a new species, which would require re-weighting or more modeling. Non-factorizable priors are no problem for LOO or cross-validation in general, although fast LOO computations may be possible only for special non-factorizable forms, and direct cross-validation may require the full prior structure to be included, as demonstrated in the Leave-one-out cross-validation for non-factorizable models vignette.
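As an illustration of why special forms help, here is a hedged sketch of the standard multivariate-normal identity that underlies such fast LOO computations, conditioning on fixed hyperparameters (in practice these terms would still be averaged over posterior draws, as in the vignette; all names are illustrative assumptions):

```python
# For y ~ N(mu, Sigma) with precision Q = Sigma^{-1}, the conditional y_n | y_-n
# is normal with variance 1/Q_nn and mean y_n - [Q (y - mu)]_n / Q_nn.
import numpy as np
from scipy import stats

def loo_lpd_mvn(y, mu, Sigma):
    """Pointwise log p(y_n | y_-n) for a multivariate normal with fixed mu, Sigma."""
    Q = np.linalg.inv(Sigma)             # precision matrix
    r = Q @ (y - mu)
    var = 1.0 / np.diag(Q)               # conditional variances
    mean = y - r * var                    # conditional means
    return stats.norm.logpdf(y, mean, np.sqrt(var))
```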

Prediction in scientific inference

Zacco wrote also

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

It seems quite common in many fields to have a desire for some quantity to report as an assessment of how good the model is, but not to consider whether the model would generalize to new observations. The observations are really the only thing we have observed and the only thing we might observe more of in the future. Only in toy problems might we also be able to observe the usually unknown model structure and parameters. All model selection methods are inherently connected to the observations, and instead of thinking of model assessment methods as black boxes (*cough* information criteria *cough*) it’s useful to think about how the model can help us predict something.

In hierarchical models there are also different parts we can focus on, and depending on the focus, the benefit of some parts can sometimes be hidden behind the benefits of other parts. For example, in the simple hierarchical model described above, if we have plenty of individual observations in each group, then the hierarchical prior has only a weak effect when we leave one observation out, and LOO then mostly assesses the lowest-level model part $latex p(y_{mn}|x_{mn},\theta_m)$. On the other hand, if we use leave-one-group-out cross-validation, then the hierarchical prior has a strong effect on the prediction for the left-out group, and we assess more the higher-level model part $latex p(\theta_1,\ldots,\theta_M|\psi)$. I would guess this is the part of phylogenetic models which should be in focus. If there is no specific prediction task, it’s probably useful to do both LOO and leave-one-group-out cross-validation.

Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set.

This would be sensible if we assume that future locations are spatially disconnected from the current locations, or if the focus is specifically on the non-spatial model part $latex p(y_{mn}|x_{mn},\theta_m)$ and we don’t want spatial information to help that prediction.

Information criteria

Zacco wrote

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability.

It’s unfortunate that we didn’t make it clearer in that paper. I naïvely trusted too much that people would also read the cited 87-page theoretical paper (Vehtari & Ojanen, 2012), but I now understand that some repetition is needed. Akaike’s (1974) original paper made the clear connection to the prediction task of predicting $latex y_{N+1}$ given $latex x_{N+1}$ and $latex \hat{\theta}$. Stone (1977) showed that TIC (Takeuchi, 1976), which is a generalization of AIC, can be obtained from a Taylor series approximation of LOO. The papers introducing RIC (Shibata, 1989), NIC (Yoshizawa and Amari, 1994), and DIC (Spiegelhalter et al., 2002) also make the connection to predictive performance. Watanabe (2009, 2010a,b,c) is very explicit about the equivalence of WAIC and Bayesian LOO. The basic WAIC has exactly the same exchangeability assumptions as LOO and estimates the same quantity, but uses a different computational approximation. It’s also possible to consider DIC and WAIC at different levels of hierarchical models (see, e.g., Spiegelhalter et al., 2002; Li et al., 2016; Merkle et al., 2018).
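To emphasize that WAIC and LOO start from the same ingredients, here is a hedged sketch computing both from the same $latex S \times N$ matrix of pointwise log-likelihoods evaluated at posterior draws (plain importance-sampling LOO is shown instead of PSIS; the names and shapes are my assumptions, not any particular package’s API):

```python
# Both estimators below take an S x N matrix of log p(y_n | theta_s),
# draws in rows, observations in columns.
import numpy as np
from scipy.special import logsumexp

def elpd_waic(log_lik):
    """WAIC estimate of elpd."""
    S = log_lik.shape[0]
    lppd = logsumexp(log_lik, axis=0) - np.log(S)   # log pointwise predictive density
    p_waic = np.var(log_lik, axis=0, ddof=1)        # pointwise complexity penalty
    return np.sum(lppd - p_waic)

def elpd_loo_is(log_lik):
    """Plain importance-sampling LOO; PSIS adds weight stabilization on top of this."""
    lw = -log_lik                                   # log weights proportional to 1 / p(y_n | theta_s)
    lw = lw - logsumexp(lw, axis=0)                 # self-normalize per observation
    return np.sum(logsumexp(lw + log_lik, axis=0))
```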

Unfortunately, most of the time information criteria are presented as fit + complexity penalty without any reference to the prediction task, the assumptions about exchangeability, the data generating mechanism, or which part of the model is in focus. This, combined with the difficult-to-interpret units of the resulting quantity, has led to information criteria being used as black-box measures. As the assumptions are hidden, people think they are always valid (if you already forgot: WAIC and LOO have the same assumptions).

I prefer LOO over WAIC because of better diagnostics and better accuracy in difficult cases (see, e.g., Vehtari, Gelman, and Gabry, 2017). In Using Stacking to Average Bayesian Predictive Distributions we used LOO, but we can do Bayesian stacking also with other forms of cross-validation, like leave-one-group-out or $latex m$-step-ahead cross-validation.

 

PS. I found this great review of the topic: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Although it completely ignores the Bayesian cross-validation literature and gives some recommendations not applicable to Bayesian modeling, it mostly gives the same recommendations as I discuss above.

5 thoughts on “When LOO and other cross-validation approaches are valid”

  1. I saw this had no comments and I wanted to say thank you for writing this! The correspondence between having an explicit parametric model for the covariates versus using LOO as an implicit model makes good sense, but I hadn’t thought of it that way before.

  2. Aki, it seems to me, naively, that LOO and other cross validation strategies are designed to answer the question: how much of my fit is due to the information in the specific dataset, vs structural information in the model.

    LOO error will be small when the model’s predictions are relatively accurate even without the one (or N) data points left out, usually because the structure of the model is inflexible to localized errors in fit and also relatively accurate. Whereas when the model is very flexible and can represent many behaviors, then the constraints on behavior imposed by the data are important for prediction of that data point. Consider the difference between, say, fitting a specific two-term function y = a*f(x) + b*g(x) where f and g are chosen for theoretical properties, vs fitting a 32-term Fourier series or the like.

    From this perspective, I think we should try to interpret LOO and similar measures as letting us know how sensitive our model is to the specific information in the dataset, and with that generalized framework in mind, we could potentially pick specific alternative schemes to support asking the appropriate questions for our particular scientific application.

  3. Would it be too hard to translate acronyms that some people might not be familiar with the first time that they are used? LOO=???

    Sorry, I’m just not familiar with this particular acronym.
