## Don’t call it a bandit

Here’s why I don’t like the term “multi-armed bandit” to describe the exploration-exploitation tradeoff of inference and decision analysis.

First, and less importantly, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.
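For readers who want the "play lots of machines, then concentrate on the best" strategy spelled out, here is a minimal sketch of an explore-then-commit rule. The function name, the Bernoulli payouts, and the fixed exploration budget are all assumptions made for illustration. Nothing here comes from a particular bandit library, and practical algorithms usually adapt their exploration rather than fixing it in advance.

```python
import random


def explore_then_commit(payout_probs, n_explore, n_total, seed=0):
    """Play each machine n_explore times, then commit to the best one.

    payout_probs: hypothetical Bernoulli win probabilities, one per machine.
    Returns (index of the machine committed to, total number of wins).
    """
    k = len(payout_probs)
    if n_total < k * n_explore:
        raise ValueError("n_total must cover the exploration phase")
    rng = random.Random(seed)
    wins = [0] * k
    total = 0
    # Exploration phase: sample every machine equally often.
    for arm in range(k):
        for _ in range(n_explore):
            reward = 1 if rng.random() < payout_probs[arm] else 0
            wins[arm] += reward
            total += reward
    # Commit phase: spend all remaining plays on the empirically best machine.
    best = max(range(k), key=lambda a: wins[a])
    for _ in range(n_total - k * n_explore):
        total += 1 if rng.random() < payout_probs[best] else 0
    return best, total


best_arm, total_wins = explore_then_commit([0.1, 0.9, 0.2], n_explore=200, n_total=2000)
```

Note that the sketch takes it for granted that all `n_total` plays must be made, which is exactly presupposition (a) above.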

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

So, no deep principles here, the “bandit” name just doesn’t work for me. I agree with Bob Carpenter about the iid part (although I do remember a certain scene from The Grapes of Wrath with a non-iid slot machine), but other than that, the analogy seems a bit off.

## The replication crisis and the political process

Jackson Monroe writes:

I thought you might be interested in an article [by Dan McLaughlin] in NRO that discusses the replication crisis as part of a broadside against all public health research and social science. It seemed as though the author might be twisting the nature of the replication crisis toward his partisan ends, but I was curious as to your thoughts.

The social-science problem is that “public health” studies — like that NRA-convention study — can be highly subjective and ungoverned by the rigors of hard sciences that seek to test a hypothesis with results that can be replicated by other researchers. Indeed, the social sciences in general today suffer from a systemic “replication crisis,” a bias toward publishing only results that support the researcher’s hypothesis, and chronic problems with errors remaining uncorrected.

NRO is the website of the National Review, a conservative magazine, and the NRA (National Rifle Association) convention study is something we discussed in this space recently.

I think McLaughlin is probably correct that studies that seek to advance a political agenda are likely to have serious methodological problems. I’ve seen this in both the left and the right, and I don’t really know what to do about it, except to hope that all important research has active opposition. By “opposition,” I mean, ideally, honest opposition, and I don’t mean gridlock. It’s just good if serious claims are evaluated seriously, and not just automatically believed because they are considered to be on the side of the angels.

## When LOO and other cross-validation approaches are valid

Introduction

Zacco asked in Stan discourse whether leave-one-out (LOO) cross-validation is valid for phylogenetic models. He also referred to Dan’s excellent blog post, which mentioned the iid assumption. Instead of iid it would be better to talk about the exchangeability assumption, but I (Aki) got a bit lost in my discourse answer (so don’t bother to go read it). I started to write this blog post to clear my thoughts and extend what I’ve said before, and hopefully produce a better answer.

TL;DR

The question is when leave-one-out cross-validation or leave-one-group-out cross-validation is valid for model comparison. The short answer is that we need to think about what is the joint data generating mechanism, what is exchangeable, and what is the prediction task. LOO can be valid or invalid, for example, for time-series and phylogenetic modelling depending on the prediction task. Everything said about LOO applies also to AIC, DIC, WAIC, etc.

iid and exchangeability

Dan wrote

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.

This is well put, although iid is a stronger assumption than necessary. It would be better to assume exchangeability, which doesn’t imply independence. The exchangeability assumption thus extends the cases in which LOO is applicable. The details of the difference between iid and exchangeability are not that important for this post, but I recommend that interested readers see Chapter 5 in BDA3. Instead of iid vs exchangeability, I’ll focus here on prediction tasks and data generating mechanisms.

Basics of LOO

LOO is easiest to understand in the case where we have a joint distribution $p(y|x,\theta)p(x|\phi)$, and $p(y|x,\theta)$ factorizes as

$p(y|x,\theta) = \prod_{n=1}^N p(y_n|x_n,\theta).$

We are interested in predicting a new data point coming from the data generating mechanism

$p(y_{N+1}|x_{N+1},\theta).$

When we don’t know $\theta$, we integrate over the posterior of $\theta$ to get the posterior predictive distribution

$p(y_{N+1}|x_{N+1},x,y)=\int p(y_{N+1}|x_{N+1},\theta) p(\theta|x,y) d\theta.$

We would like to evaluate how good this predictive distribution is by comparing it to observed $(y_{N+1}, x_{N+1})$. If we have not yet observed $(y_{N+1}, x_{N+1})$, we can use LOO to approximate the expectation over different possible values for $(y_{N+1}, x_{N+1})$. Instead of making a model $p(y,x)$, we re-use the observations as pseudo-Monte Carlo samples from $p(y_{N+1},x_{N+1})$, and, so as not to use the same observation both for fitting $\theta$ and for evaluation, we use LOO predictive distributions

$p(y_{n}|x_{n},x_{-n},y_{-n})=\int p(y_{n}|x_{n},\theta) p(\theta|x_{-n},y_{-n}) d\theta.$
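To make the LOO predictive distribution concrete, here is a minimal sketch for a case where the integral has a closed form. It assumes a toy model that is not in the post: no covariates, $y_n \sim N(\theta, \sigma^2)$ with known $\sigma$, and a conjugate prior $\theta \sim N(\mu_0, \tau_0^2)$. The function names are hypothetical. For each $n$ the posterior is recomputed without $y_n$ and the log predictive density of $y_n$ is evaluated.

```python
import math


def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def exact_loo_elpd(y, prior_mean, prior_sd, sigma):
    """Sum over n of log p(y_n | y_{-n}) for the assumed conjugate normal model."""
    prior_prec = 1.0 / prior_sd ** 2
    lik_prec = 1.0 / sigma ** 2
    total_sum = sum(y)
    elpd = 0.0
    for y_n in y:
        # Conjugate posterior for theta given every observation except y_n.
        post_prec = prior_prec + (len(y) - 1) * lik_prec
        post_mean = (prior_prec * prior_mean + lik_prec * (total_sum - y_n)) / post_prec
        # Predictive variance: posterior uncertainty in theta plus observation noise.
        pred_var = 1.0 / post_prec + sigma ** 2
        elpd += normal_logpdf(y_n, post_mean, pred_var)
    return elpd


elpd = exact_loo_elpd([0.3, -1.2, 0.8, 2.1], prior_mean=0.0, prior_sd=10.0, sigma=1.0)
```

In models without a closed-form posterior the same quantity would be estimated from posterior draws, by refitting or by importance sampling.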

The usually forgotten assumption is that $x_{N+1}$ is unknown with the uncertainty described by a distribution! We have a model for $p(y|x)$, but often we don’t build a model for $p(x)$. BDA3 Chapter 5 discusses that when we have extra information $x_n$ for $y_n$, then $y_n$ are not exchangeable, but $(x_n, y_n)$ pairs are. BDA3 Chapter 14 discusses that if we have a joint model $p(x,y|\phi,\theta)$ and a prior which factorizes as

$p(\phi,\theta)=p(\phi)p(\theta),$

then the posterior factorizes as

$p(\phi,\theta|x,y)=p(\phi|x)p(\theta|x,y)$.

We can analyze the second factor by itself with no loss of information.
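The factorization can be checked in one line, assuming (as in the setup above) that the joint model splits as $p(x,y|\phi,\theta)=p(x|\phi)p(y|x,\theta)$:

$p(\phi,\theta|x,y) \propto p(\phi)p(\theta)\,p(x|\phi)\,p(y|x,\theta) = \left[p(\phi)p(x|\phi)\right]\left[p(\theta)p(y|x,\theta)\right] \propto p(\phi|x)\,p(\theta|x,y).$

The first bracket involves only $\phi$ and $x$, the second only $\theta$, $x$, and $y$, so each normalizes separately.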

For the predictive performance estimate, however, we need to know how the future $x_{N+1}$ would be generated! In LOO we avoid making that model explicit by assuming that the observed $x$’s implicitly (or non-parametrically) represent the distribution $p(x)$.
If we assume that the future distribution is different from the past, we can use importance weighting to adjust. Extrapolation (far) beyond the observed data would require an explicit model for $p(x_{N+1})$. We discuss different joint data generating processes in Vehtari & Ojanen (2012), but I’ll use more words and examples below.
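As a rough illustration of the importance-weighting adjustment, the sketch below reweights pointwise LOO log densities by the ratio of an assumed future density of $x$ to an assumed past density. Both densities are taken to be known normals purely for the example, and the function names are hypothetical. In practice the density ratio would itself have to be estimated.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def reweighted_mean_lpd(loo_lpd, x, past, future):
    """Self-normalized importance-weighted mean of pointwise LOO log densities.

    loo_lpd[n] is log p(y_n | x_n, x_{-n}, y_{-n}); past and future are
    (mean, sd) pairs for the assumed normal densities of x.
    """
    weights = [normal_pdf(x_n, *future) / normal_pdf(x_n, *past) for x_n in x]
    return sum(w * l for w, l in zip(weights, loo_lpd)) / sum(weights)


# Observations near x = 1 count for more if future x's are centered there.
shifted = reweighted_mean_lpd([-1.0, -3.0], [0.0, 1.0], past=(0.0, 1.0), future=(1.0, 1.0))
```

When the past and future densities coincide, all weights are equal and the estimate reduces to the ordinary LOO mean.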

Data generating mechanisms

Dan wrote

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

This is quite a common way to describe the problem, but I hope it can be made clearer by discussing the data generating mechanisms in more detail.

Sometimes $p(x)$ does not exist, for example when $x$ is fixed, chosen by design, or deterministic! Then $p(x,y)$ also does not exist, and exchangeability (or iid) does not apply to $(x_n,y_n)$. We can still make a conditional model $p(y|x,\theta)$ and analyze posterior $p(\theta|x,y)$.

If $x$ is fixed or chosen by design, we could think of a prediction task in which we would repeat one of the measurements with the same $x$’s. In that case, we might want to predict all the new observations jointly, but in the cross-validation approximation that would lead to leave-all-out cross-validation, which is also far from what we want (and sensitive to prior choices). So we may then use the more stable and easier-to-compute LOO, which is still informative about predictive accuracy for repeated experiments.

LOO can be used in the case of fixed $x$ to evaluate the predictive performance of the conditional terms $p(\tilde{y}_n|x_n,x,y)$, where $\tilde{y}_n$ is a new observation conditional on a fixed $x_n$. Taking the sum (or mean) of these terms then weights each fixed $x_n$ equally. If we care more about the performance for some fixed $x_n$ than for others, we can use different weighting schemes to adjust.

$x$ can sometimes be a mix of something with a distribution and something which is fixed or deterministic. For example, in a clinical study, we could assume that patient covariates come from some distribution in the past and in the future, but the drug dosage is deterministic. It’s clear that if the drug helps, we would not continue giving a bad dosage in the future, so it’s unlikely we would ever observe the corresponding outcome either.

Time series

We could assume the future $x$ to be fixed, chosen by design, or deterministic, but different from the observed $x$. For example, in time series we have observed time points $1,\ldots,N$ and then often we want to predict for $N+1,\ldots,N+m$. Here the data generating mechanism for $x$ (time) is deterministic and all future values of $x$ are outside of our observed $x$’s. It is still possible that the conditional part for $y$ factorizes as $\prod_{n=1}^N p(y_n|f_n,\theta)$ given latent process values $f$ (and we have a joint prior on $f$ describing our time series assumptions), and we may assume $(y_n-f_n)$ to be exchangeable (or iid). What matters more here is the structure of $x$. Approximating the prediction task leads to $m$-step-ahead cross-validation where we don’t use the future data to provide information about $f_n$ or $\theta$ (see a draft case study). Under short range dependency and stationarity assumptions, we can also use some of the future data in $m$-step-ahead leave-a-block-out cross-validation (see a draft case study).
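The index bookkeeping for $m$-step-ahead cross-validation can be sketched as follows. This is only one possible convention (0-based indices, an expanding training window, and a hypothetical `min_train` parameter for the smallest training set), not the scheme from the case study. Each split fits on the first `t` observations and predicts the observation `m` steps after the last training point.

```python
def m_step_ahead_splits(n_obs, m, min_train):
    """Yield (train_indices, test_index) pairs for m-step-ahead cross-validation.

    The training set is always a prefix of the series, so no future data
    informs the fit that is used to predict the held-out time point.
    """
    if m < 1 or min_train < 1:
        raise ValueError("m and min_train must be positive")
    for t in range(min_train, n_obs - m + 1):
        yield list(range(t)), t + m - 1


splits = list(m_step_ahead_splits(n_obs=6, m=2, min_train=3))
```

With six time points, `m=2`, and `min_train=3`, this gives two splits: fit on points 0 to 2 and predict point 4, then fit on points 0 to 3 and predict point 5.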

We can also have time series where we don’t want to predict for the future, and thus the focus is only on the observed time points $1,\ldots,N$. For example, we might be interested in analysing whether more or fewer babies are born on some special days during a time period in the past (I would assume there are plenty of more interesting examples in history studies and social sciences). It’s sensible to use LOO here to analyze whether we have been able to learn relevant structure in the time series in order to better predict the number of births on a left-out day. Naturally, we could also analyze different aspects of the time series model by leaving out one week, one month, one year, or several days around special days to focus on different assumptions in our time series model.

LOO-elpd may or may not fail for time series, depending on your prediction task. LOO is not great if you want to estimate the performance of extrapolation to the future, but having a time series doesn’t automatically invalidate the use of cross-validation.

Multilevel data and models

Dan wrote

Or when your data has multilevel structure, where you really should leave out whole strata.

For multilevel structure, we can start from a simple example with individual observations from $M$ known groups (or strata as Dan wrote). The joint conditional model is commonly

$\prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) p(\theta_m|\psi) \right] p(\psi),$

where $y$ are partially exchangeable, so that $y_{mn}$ are exchangeable in group $m$, and $\theta_m$ are exchangeable. But for the prediction we need to consider also how $x$’s are generated. If we have a model also for $x$, then we might have the following joint model

$p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m)p(\theta_m|\psi) \right] p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m)p(\phi_m|\varphi) \right] p(\varphi).$

Sometimes we assume we haven’t observed all groups and want to make predictions for $y_{M+1,n}$ for a new group $M+1$. Say we have observed individual students in different schools and we want to predict for a new school. Then it is natural to consider leave-one-school-out cross-validation to simulate the fact that we don’t have any observations yet from that new school. Leave-one-school-out cross-validation will then also implicitly (non-parametrically) model the distribution of $x_{mn}$.
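A minimal sketch of how leave-one-school-out folds could be built from a vector of group labels is shown below. The function name and the toy labels are assumptions for illustration. Each fold holds out every observation from one group, so the model never sees data from the group it is asked to predict.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group.

    groups[i] is the group label (e.g. school) of observation i. Groups are
    visited in order of first appearance.
    """
    for held_out in dict.fromkeys(groups):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


folds = list(leave_one_group_out(["a", "a", "b", "c", "b"]))
```

Ordinary LOO on the same data would instead produce five folds, each holding out a single student while keeping that student's schoolmates in the training set.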

On the other hand, if we poll people in all states of the USA, there will be no new states (at least in the near future) and some or all $x$ might be fixed or deterministic, but there can be new people to poll in these states, and LOO could be sensible for predicting what the next person would answer. And even if there are more schools, we could focus just on these schools and have fixed or deterministic $x$. Thus the validity of LOO in hierarchical models also depends on the data generating mechanism and the prediction task.

Phylogenetic models and non-factorizable priors

Zacco’s discourse question was related to phylogenetic models.

Wikipedia says

In biology, phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species, or populations). These relationships are discovered through phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology under a model of evolution of these traits. The result of these analyses is a phylogeny (also known as a phylogenetic tree) – a diagrammatic hypothesis about the history of the evolutionary relationships of a group of organisms. The tips of a phylogenetic tree can be living organisms or fossils, and represent the “end”, or the present, in an evolutionary lineage.

This is similar to the hierarchical model above, but now we assume a non-factorizable prior for $\theta$

$\prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi),$

and if we have a model for $x$, we may have also a non-factorizable prior for $\phi$

$p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\phi_1,\ldots,\phi_M|\varphi)p(\varphi).$

Again, LOO is valid if we focus on new observations in the current groups (e.g. observed species). Alternatively, we could consider prediction for new individuals in a new group (e.g. species) and use leave-one-group-out cross-validation. I would assume we might have extra information about some $x$’s for a new species, which would require re-weighting or more modeling. Non-factorizable priors are no problem for LOO or cross-validation in general, although fast LOO computations may be possible only for special non-factorizable forms, and direct cross-validation may require the full prior structure to be included, as demonstrated in the Leave-one-out cross-validation for non-factorizable models vignette.

Prediction in scientific inference

Zacco wrote also

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

It seems quite common in many fields to have a desire to have some quantity to report as an assessment of how good the model is, but not to consider whether that model would generalize to new observations. The observations are really the only thing we have observed and something we might observe more of in the future. Only in toy problems might we also be able to observe the usually unknown model structure and parameters. All model selection methods are inherently connected to the observations, and instead of thinking of the model assessment methods as black boxes (*cough* information criteria *cough*) it’s useful to think about how the model can help us to predict something.

In hierarchical models there are also different parts we can focus on, and depending on the focus, the benefit of some parts can sometimes be hidden behind the benefits of other parts. For example, in the simple hierarchical model described above, if we have plenty of individual observations in each group, then the hierarchical prior has only a weak effect if we leave one observation out, and LOO is then assessing mostly the lowest level model part $p(y_{mn}|x_{mn},\theta_m)$. On the other hand, if we use leave-one-group-out cross-validation, then the hierarchical prior has a strong effect on the prediction for that group and we are assessing more the higher level model part $p(\theta_1,\ldots,\theta_M|\psi)$. I would guess this is the part that should be in focus in phylogenetic models. If there is no specific prediction task, it’s probably useful to do both LOO and leave-one-group-out cross-validation.

Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set.

This would be sensible if we assume that future locations are spatially disconnected from the current locations, or if the focus is specifically on the non-spatial model part $p(y_{mn}|x_{mn},\theta_m)$ and we don’t want spatial information to help that prediction.

Information criteria

Zacco wrote

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability.

It’s unfortunate that we didn’t make it more clear in that paper. I naïvely trusted that people would also read the cited 87-page theoretical paper (Vehtari & Ojanen, 2012), but I now understand that some repetition is needed. Akaike’s (1974) original paper made a clear connection to the prediction task of predicting $y_{N+1}$ given $x_{N+1}$ and $\hat{\theta}$. Stone (1977) showed that TIC (Takeuchi, 1976), which is a generalization of AIC, can be obtained from a Taylor series approximation of LOO. The papers introducing RIC (Shibata, 1989), NIC (Yoshizawa and Amari, 1994), and DIC (Spiegelhalter et al., 2002) also make the connection to predictive performance. Watanabe (2009, 2010a,b,c) is very explicit on the equivalence of WAIC and Bayesian LOO. The basic WAIC has exactly the same exchangeability assumptions as LOO and estimates the same quantity, but uses a different computational approximation. It’s also possible to think of DIC and WAIC at different levels of hierarchical models (see, e.g., Spiegelhalter et al., 2002; Li et al., 2016; Merkle et al. 2018).
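To see that WAIC and LOO target the same quantity from the same inputs, here is a minimal sketch that computes both from a matrix of pointwise log-likelihoods evaluated at posterior draws. The function names are hypothetical and the layout `log_lik[s][n]` is an assumption. The LOO estimate uses plain importance sampling, not the Pareto smoothed version we use in practice, so it lacks the diagnostics and stability of that method. Both results are on the elpd scale (not multiplied by $-2$).

```python
import math
import statistics


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def elpd_waic(log_lik):
    """WAIC estimate of elpd; log_lik[s][n] = log p(y_n | theta_s)."""
    total = 0.0
    for n in range(len(log_lik[0])):
        column = [draw[n] for draw in log_lik]
        lppd_n = log_mean_exp(column)            # log pointwise predictive density
        p_waic_n = statistics.variance(column)   # variance-based complexity term
        total += lppd_n - p_waic_n
    return total


def elpd_is_loo(log_lik):
    """Plain importance-sampling LOO: p(y_n | y_{-n}) ~ 1 / mean_s(1 / p(y_n | theta_s))."""
    total = 0.0
    for n in range(len(log_lik[0])):
        negated = [-draw[n] for draw in log_lik]
        total += -log_mean_exp(negated)
    return total


toy_log_lik = [[-1.0, -2.0], [-1.2, -1.7], [-0.9, -2.4]]
waic_value, loo_value = elpd_waic(toy_log_lik), elpd_is_loo(toy_log_lik)
```

Neither function knows anything about how the data were generated. Both take pointwise terms $p(y_n|\theta)$ as given, which is where the shared exchangeability assumptions enter.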

Unfortunately, most of the time information criteria are presented as fit + complexity penalty without any reference to the prediction task, assumptions about exchangeability, the data generating mechanism, or which part of the model is in focus. This, combined with the difficult-to-interpret unit of the resulting quantity, has led to information criteria being used as a black-box measure. As the assumptions are hidden, people think they are always valid (if you already forgot: WAIC and LOO have the same assumptions).

I prefer LOO over WAIC because of better diagnostics and better accuracy in difficult cases (see, e.g., Vehtari, Gelman, and Gabry, 2017; Using Stacking to Average Bayesian Predictive Distributions). In the stacking paper we used LOO, but we can also do Bayesian stacking with other forms of cross-validation, like leave-one-group-out or $m$-step-ahead cross-validation.

PS. I found this great review of the topic Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Although it completely ignores all Bayesian cross-validation literature and gives some recommendations not applicable for Bayesian modeling, it mostly gives the same recommendations as what I discuss above.

## China air pollution regression discontinuity update

Avery writes:

There is a follow up paper for the paper “Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy” [by Yuyu Chen, Avraham Ebenstein, Michael Greenstone, and Hongbin Li] which you have posted on a couple times and used in lectures. It seems that there aren’t much changes other than newer and better data and some alternative methods. Just curious what you think about it.

The paper is called, “New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy” [by Avraham Ebenstein, Maoyong Fan, Michael Greenstone, Guojun He, and Maigeng Zhou].

The cleanest summary of my problems with that earlier paper is this article, “Evidence on the deleterious impact of sustained use of polynomial regression on causal inference,” written with Adam Zelizer.

Here’s the key graph, which we copied from the earlier Chen et al. paper:

The most obvious problem revealed by this graph is that the estimated effect at the discontinuity is entirely the result of the weird curving polynomial regression, which in turn is being driven by points on the edge of the dataset. Looking carefully at the numbers, we see another problem which is that life expectancy is supposed to be 91 in one of these places (check out that green circle on the upper right of the plot)—and, according to the fitted model, the life expectancy there would be 5 years higher, that is, 96 years!, if only they hadn’t been exposed to all that pollution.

As Zelizer and I discuss in our paper, and I’ve discussed elsewhere, this is a real problem, not at all resolved by (a) regression discontinuity being an identification strategy, (b) high-degree polynomials being recommended in some of the econometrics literature, and (c) the result being statistically significant at the 5% level.

Indeed, items (a), (b), (c) above represent a problem, in that they gave the authors of that original paper, and the journal reviewers and editors, a false sense of security which allowed them to ignore the evident problems in their data and fitted model.

We’ve talked a bit recently about “scientism,” defined as “excessive belief in the power of scientific knowledge and techniques.” In this case, certain conventional statistical techniques for causal inference and estimation of uncertainty have led people to turn off their critical thinking.

That said, I’m not saying, nor have I ever said, that the substantive claims of Chen et al. are wrong. It could be that this policy really did reduce life expectancy by 5 years. All I’m saying is that their data don’t really support that claim. (Just look at the above scatterplot and ignore the curvy line that goes through it.)

You can make of this what you will. One thing that’s changed is that the two places with life expectancy greater than 85 have disappeared. So that seems like progress. I wonder what happened? I did not read through every bit of the paper—maybe it’s explained there somewhere?

Anyway, I still don’t buy their claims. Or, I should say, I don’t buy their statistical claim that their data strongly support their scientific claim. To flip it around, though, if the public-health experts find the scientific claim plausible, then I’d say, sure, the data are consistent with this claimed effect on life expectancy. I just don’t see distance north or south of the river as a key predictor, hence I have no particular reason to believe that the data pattern shown in the above figure would’ve appeared, without the discontinuity, had the treatment not been applied.

I feel like kind of a grinch saying this. After all, air pollution is a big problem, and these researchers have clearly done a lot of work with robustness studies etc. to back up their claims. All I can say is: (1) Yes, air pollution is a big problem so we want to get these things right, and (2) Even without the near-certainty implied by these 95% intervals excluding zero, decisions can and will be made. Scientists and policymakers can use their best judgment, and I think they should do this without overrating the strength of any particular piece of evidence. And I do think this new paper is an improvement on the earlier one.

P.S. If you want to see some old-school ridiculous regression discontinuity analysis, check this out:

## Continuous tempering through path sampling

Yuling prepared this poster summarizing our recent work on path sampling using a continuous joint distribution. The method is really cool and represents a real advance over what Xiao-Li and I were doing in our 1998 paper. It’s still gonna have problems in high or even moderate dimensions, and ultimately I think we’re gonna need something like adiabatic Monte Carlo, but I think that what Yuling and I are doing is a step forward in that we’re working with HMC and path sampling together, and our algorithm, while not completely automatic, is closer to automatic than other tempering schemes that I’ve seen.

## “Seeding trials”: medical marketing disguised as science

Paul Alper points to this horrifying news article by Mary Chris Jaklevic, “how a medical device ‘seeding trial’ disguised marketing as science.”

I’d never heard of “seeding trials” before. Here’s Jaklevic:

## What makes Robin Pemantle’s bag of tricks for teaching math so great?

It’s here, and he even calls it a “bag of tricks”!

Robin’s suggestions are similar to what Deb and I recommend, but Robin’s article is a crisp 25 pages and is purely focused on general advice for getting things to go well in the classroom, whereas we spend most of our book on specific activities related to statistics. Robin’s article would fit in well as a chapter in our book. Considered in that context, I’d say it’s better than the corresponding material in our book (in the second edition, this is Chapter 12, “How to do it”).

I prefer Robin’s article to our chapter because Robin’s article is more focused on what the teacher should do to maintain 100% student involvement during the entire class period.

Here are the sections of the article:

Introduction
Basics
Philosophy
Typical classroom mechanics
Highly recommended procedures
Class composition and small group dynamics
Doing the rounds
Help we’re stuck
Getting groups to work together
Free riders
Students who are behind
Managing Socratic discussion
How to listen
Staging
Order versus chaos
Curriculum
A vision
Pacing
Resources
Organization
Records
The bell

He uses examples from mathematics rather than statistics examples, but the general principles should be clear to all.

The actual teaching strategy that Pemantle suggests is very close to what I do in class, which is no surprise: I suppose we’ve both seen the same literature on the importance of active learning, and I suppose we’ve both had frustrating experiences with students not learning in passive, lecture-style classes. What I particularly like about Robin’s document is that he gives advice to handle various important situations; see list above.

I’m not sure when Robin’s document was written. I recommend that Robin take a couple days and do the following steps:

1. Go through it and see if any of the advice has changed. Rewrite where necessary. Or, maybe better: Add a new section at the end, “What have I learned since 1997?” [or whatever year it was that this document was prepared].

2. Remove section 4.3 which is specific to the institution where this material was first used. You’re sitting on dynamite, a wonderful article that can benefit thousands of teachers and students if it’s disseminated widely. In the intro you can say that this article derived from notes for a certain course, that’s enough.

3. Reformat it single-spaced and not using the ugly LaTeX defaults. Here’s an example using my currently-preferred template. You’ll have your own preferences; my point here is just that you might as well have a document that is more compact and readable.

4. Put the table of contents in the pdf document and also add some references at the end. You want a self-contained article that people can point to and pass around. It’s also fine to have an html page; two formats can reach more audiences.

5. Post it on arxiv, and if you happen to have a friend with a statistics blog, get him to post something on all this.

## Awesome MCMC animation site by Chi Feng! On Github!

Sean Talts and Bob Carpenter pointed us to this awesome MCMC animation site by Chi Feng. For instance, here’s NUTS on a banana-shaped density.

This is indeed super-cool, and maybe there’s a way to connect these with Stan/ShinyStan/Bayesplot so as to automatically make movies of Stan model fits. This would be great, both to help understand what the sampler is doing, and to demonstrate to outsiders what Stan does.

## How to think about an accelerating string of research successes?

While reading this post by Seth Frey on famous scientists who couldn’t let go of bad ideas, I followed a link to this post by David Gorski from 2010 entitled, “Luc Montagnier: The Nobel disease strikes again.” The quick story is that Montagnier endorsed some dubious theories. Here’s Gorski:

He only won the Nobel Prize in 2008, and it only took him two years to endorse homeopathy-like concepts. He’s also made a name for himself, such as it is, by appearing in the HIV/AIDS denialist film House of Numbers stating that HIV can be cleared naturally through nutrition and supplements. This he did after publishing a paper in a journal for which he himself is the editor . . .

But that’s just the beginning:

From there it only took Montagnier a few months more to turn his eye to applying that “knowledge” to autism . . . Unfortunately, the pseudoscience that Montagnier appears to have embraced with respect to autism is combined with a highly unethical study . . . The trial is sponsored by the Autism Treatment Trust (ATT) and the Autism Research Institute (ARI), both institutions that are–shall we say?–not exactly known for their scientific rigor. Apparently Montagnier has teamed up with a Dr. Corinne Skorupka, who is a DAN! practitioner from France . . . Whenever you see an “investigator” charge patients to undergo an experimental protocol, be very very wary. . . . here we have Montagnier and colleagues charging the parents of autistic children . . . Perhaps even worse than that, check out how badly designed this experimental protocol is . . . there are no convincing preclinical data . . . Based on an unsupported hypothesis that bacterial infections cause autism, Montagnier will be subjecting autistic children to blood draws and treatment with antibiotics. The former will cause unnecessary pain and suffering, and the latter has the potential to cause the complications that can occur due to long term antibiotic use over several months. . . . The study proposed is poorly designed even for a pilot study. There is no control group . . . Moreover, because the selection criteria for the study are not specified, there is no way of knowing how much selection bias might be operative there.

I’ve wondered how some Nobel Laureates, after having achieved so much at science, proving themselves at the highest levels by making fundamental contributions to our understanding of science that rate the highest honors, somehow end up embracing dubious science . . . or even outright pseudoscience . . . Does the fame go to their head? . . .

I’m guessing the story is a bit different, maybe not for the particular case of Montagnier but for this general “Nobel prize disease” thing—the pattern of celebrated scientists embracing wacky ideas. It’s not so much that these scientists get drunk by fame; rather, it’s that the prize attracted more attention to the wacky ideas they were susceptible to in the first place.

And then the feedback loop comes in. Scientist expresses wacky idea; then because of the Nobel prize, his wacky pronouncements get attention; scientist enjoys being in the limelight (maybe it’s been a bit disappointing after the Nobel publicity fades and his life is pretty much the same as always) so he makes more pronouncements; these pronouncements get more attention; scientist realizes that to continue to get in the news, he needs to make grander and grander claims; etc.

The apparent research progress comes in, faster and faster with stronger and stronger results

But I actually want to talk about something else, not the Nobel disease or anything like that, but the following pattern which I’ve seen from time to time.

The pattern goes like this: A researcher studies some topic, and after lots of effort and many false starts, he makes some progress. After that, progress comes faster and faster, and more and more research results come in.

This escalating pattern can arise legitimately: you develop a new tool and then find applications for it everywhere. For example, it took us a few years to write the Red State Blue State paper, but from there it only took a year to write the book, which had tons of empirical results.

Other times, though, it seems that what’s happening is escalating overconfidence, exacerbated by whatever echo chambers happen to be nearby. Luc Montagnier, for example, will have no problem finding yes-men, with that Nobel prize hanging in the corner. Another echo chamber is the science publication and grants system: if you have a track record of success, you’re likely to have figured out ways of presenting your results so they’re publishable and grant-worthy.

But the example I have in mind is my friend Seth Roberts, who spent about 10 years on his self-treatment for depression and then a few years more on his weight-loss method. At this point he spent a few years working on that, writing it up, and becoming a bit of a culture hero. And then he started to let his ambition get ahead of him, using self-experimentation to conclude that eating a stick of butter a day improved his brain functioning, among other things. I’m not saying that Seth was wrong—who knows? Maybe eating a stick of butter a day does improve brain functioning—but I’m skeptical of the idea that he came up with some trick for scientific discovery, so that what took him 5 or 10 years in the past could now be done, routinely, every couple of months.

Beware the escalating pattern of research results.

P.S. Gorski’s post also has a Herbalife connection.

P.P.S. Frey’s post is interesting too, but does he really think that all those people on his list are “way way smarter than everyone I know”? Does Frey really not know anyone as smart as Trofim Lysenko?

P.P.P.S. I came across this other post where Frey remarks that he used to work for Marc “Evilicious” Hauser!

Frey’s Hauser-related post is interesting but he makes one common mistake when he defines exploratory data analysis as “what you do when you suspect there is something interesting in there but you don’t have a good idea of what it might be, so you don’t use a hypothesis.” No! Exploratory data analysis is all about finding the unexpected, which is defined (explicitly or implicitly) relative to the expected, that is, a hypothesis or model. See this paper from 2004 for further discussion of this point.

## Parsimonious principle vs integration over all uncertainties

tl;dr If you have bad models, bad priors, or bad inference, choose the simplest possible model. If you have good models, good priors, and good inference, use the most elaborate model for predictions. To make interpretation easier you may use a smaller model with similar predictive performance to the most elaborate model.

Merijn Mestdagh emailed me (Aki) and asked

In your blogpost “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” you write that

“My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner”…“In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated”. Could you provide some references or additional information on this claim?

I have to clarify the additional conditions for my recommendation to use the encompassing model in the case of nested models:

• models are used to make predictions
• the encompassing model has passed model checking
• the inference has passed diagnostic checks

The Bayesian theory says that we should integrate over the uncertainties. The encompassing model includes the submodels, and if the encompassing model has passed model checking, then the correct thing is to include all the models and integrate over the uncertainties (and I assume that inference is correct). Passing the model checking may require good priors on the model parameters, which may be something many ignore, and then they may get bad performance with more complex models. If the models have similar LOO performance, the encompassing model is likely to have thicker tails of the predictive distribution, meaning it is more cautious about rare events. I think this is good.
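To spell out what "integrate over the uncertainties" means for nested models, here is a small sketch in notation that is my own assumption, not taken from the original exchange: the encompassing model has parameters $\theta = (\theta_1, \theta_2)$, and the submodel is the special case where $\theta_2$ is fixed at a value $\theta_2^0$ (often zero).

```latex
% Encompassing model: average predictions over the joint posterior
p(\tilde{y} \mid y) = \int\!\!\int p(\tilde{y} \mid \theta_1, \theta_2)\,
    p(\theta_1, \theta_2 \mid y)\, d\theta_1\, d\theta_2

% Submodel: \theta_2 is fixed at \theta_2^0, so its uncertainty is ignored
p_{\mathrm{sub}}(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta_1, \theta_2^0)\,
    p(\theta_1 \mid y, \theta_2 = \theta_2^0)\, d\theta_1
```

The first expression averages over every value of $\theta_2$ that the posterior supports, while the second conditions on a single value. That extra averaging is one way to see why the encompassing model's predictive distribution would tend to be more spread out.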

The main reasons why it is so common to favor more parsimonious models are:

• The maximum likelihood inference is common and it doesn’t work well with more complex models. Favoring the simpler models is a kind of regularization.
• Model misspecification. Even Bayesian inference doesn’t work well if the model is bad, and with complex models there are more possibilities for misspecifying the model, and misspecification even in one part can have strange effects in other parts. Favoring the simpler models is a kind of regularization.
• Bad priors. Actually priors and model are inseparable, so this is kind of the same as the previous one. It is more difficult to choose good priors for more complex models, because it’s difficult for humans to think about high dimensional parameters and how they affect the predictions. Favoring the simpler models can avoid the need to think harder about priors. See also Dan’s blog post and Mike’s case study.

But when I wrote my comments, I assumed that we are considering sensible models, priors, and inference, so there is no need for parsimony. See also the paper Occam’s razor, which illustrates the effect of good priors and increasing model complexity.

In “VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.” https://www.ncbi.nlm.nih.gov/pubmed/29745683 we compare AR models with VAR models (AR models are nested in VAR models). When both models perform equally we prefer, in contrast to your blogpost, the less complex AR models because

– The one standard error rule (“ Hastie et al. (2009) propose to select the most parsimonious model within the range of one standard error above the prediction error of the best model (this is the one standard error rule)”).

Hastie is not using priors or Bayesian inference, and thus he needs the parsimony rule. He may also have an implicit cost function…
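For readers who haven't seen the one standard error rule written out, here is a minimal Python sketch of it. The `Candidate` class, the use of parameter count as the complexity measure, and all the numbers are hypothetical and made up for illustration; they are not from the AR vs VAR paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str        # hypothetical model label
    n_params: int    # complexity measure (assumed here: number of parameters)
    cv_error: float  # mean cross-validated prediction error
    cv_se: float     # standard error of that mean


def one_standard_error_choice(candidates):
    """Return the simplest candidate whose error is within one SE of the best."""
    best = min(candidates, key=lambda c: c.cv_error)
    threshold = best.cv_error + best.cv_se
    eligible = [c for c in candidates if c.cv_error <= threshold]
    return min(eligible, key=lambda c: c.n_params)


# Made-up numbers, for illustration only
candidates = [
    Candidate("AR(1)", n_params=3, cv_error=1.05, cv_se=0.04),
    Candidate("VAR(1)", n_params=9, cv_error=1.02, cv_se=0.05),
]
chosen = one_standard_error_choice(candidates)
print(chosen.name)
```

Note that the rule always breaks near-ties in favor of the smaller model. That preference is built in as a fixed convention rather than derived from a prior or an explicit cost of complexity.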

– A big problem (in my opinion) in psychology is the over interpretation of small effects which exacerbates by using complex models. I fear that some researchers will feel the need to interpret the estimated value of every parameter in the complex model.

Yes, it is dangerous to over-interpret especially if the estimates don’t include good uncertainty estimates, and even then it’s difficult to make interpretations in case of many collinear covariates.

My assumption above was that the model is used for predictions and I care about best possible predictive distribution.

The situation is different if we add the cost of interpretation or the cost of measurements in the future. I don’t know of literature analysing the cost of interpretation for an AR vs a VAR model, or the cost of interpreting an additional covariate in a model, but when I’m considering interpretability I favor smaller models with similar predictive performance to the encompassing model. But if we have the encompassing model, then I also favor the projection predictive approach, where the full model is used as the reference so that the selection process is not overfitting to the data and the inference given the smaller model is conditional on the information from the full model (Piironen & Vehtari, 2017). In case of a small number of models, LOO comparison or Bayesian stacking weights can also be sufficient (some examples here and here).
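As a rough illustration of what "LOO comparison taking into account the uncertainties" can look like, here is a minimal Python sketch. It assumes you already have pointwise LOO log predictive densities for two models on the same observations. The function name, the toy numbers, and the 2-standard-error cutoff are all assumptions made for this example, not a recommended procedure.

```python
import math


def elpd_difference(pointwise_a, pointwise_b):
    """Sum of pointwise elpd differences (A minus B) and its standard error."""
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("both models must be evaluated on the same observations")
    n = len(pointwise_a)
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    total = sum(diffs)
    mean = total / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return total, math.sqrt(n * var)


# Made-up pointwise log predictive densities, for illustration only
elpd_full = [-1.10, -0.95, -1.30, -1.05, -0.90, -1.20]
elpd_small = [-1.12, -0.93, -1.35, -1.04, -0.94, -1.22]

diff, se = elpd_difference(elpd_full, elpd_small)
no_clear_winner = abs(diff) < 2 * se  # the 2 SE cutoff is an assumption
print(diff, se, no_clear_winner)
```

With only six observations the normal approximation behind the standard error is not trustworthy, so treat this purely as a demonstration of the arithmetic. In practice the pointwise values would come from a LOO implementation in an existing package rather than being typed in by hand.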

Julia Hirschberg sent this along to the natural language processing mailing list at Columbia:

here are some slides from last spring’s CRA-W Grad Cohort and previous years that might be of interest. all sorts of topics such as interviewing, building confidence, finding a thesis topic, preparing your thesis proposal, publishing, entrepreneurialism, and a very interesting panel on human-human interaction skills.

I took a look at a couple of these and they look like useful advice for grad students. I wrote back to Julia that I used to have a hard time as a professor convincing students their evening would be better spent having dinner with the invited seminar speaker than revising their homework.

Advice to grad students is a longstanding, niche writing genre. There are even widely-cited classics, like the transcription of Richard Hamming’s talk You and Your Research, which has been recommended to me more times than I can count. There’s also a later YouTube presentation, which I haven’t seen (yet!).

Hamming’s advice on lunchtime behavior reminds me of one of the reasons I like places like Google—they still have active round table tech lunches. We try to do that once per week after the Stan meetings. If you’re in town and want to join us, drop me or Andrew a line.

P.S. I got the term “soft skills” from a friend of mine who’s going through soft skills training at Amazon before they’ll let him speak to customers—things like giving presentations and staying on topic.

## Journals and refereeing: toward a new equilibrium

As we’ve discussed before (see also here), one of the difficulties of moving from our current system of review of scientific journal articles, to a new model of post-publication review, is that any major change seems likely to break the current “gift economy” system in which thousands of scientists put in millions of hours providing free reviews. And these reviews can be pretty good. Doing pre-publication reviews at the request of a journal editor: that seems like an obligation, and it’s helped along by pressure from all those associate editors. Remove that system of social obligation, just tell people they can do post-publication reviews when they want, and you’ll see a lot less reviewing. And the reviewing that does get done will be disproportionately by people with an axe to grind. So that could be a problem.

So what’s the new equilibrium, if we move away from I-give-free-labor-to-Elsevier-by-reviewing-random-papers-for-their-journals-at-the-behest-of-equally-uncompensated-editors to open post-publication review? Is it just that zillions of things get published and a few of them get reviewed in an unsystematic manner?

I don’t know.

## Recently in the sister blog

In the article, “Testing the role of convergence in language acquisition, with implications for creole genesis,” Marlyse Baptista et al. write:

The main objective of this paper is to test experimentally the role of convergence in language acquisition (second language acquisition specifically), with implications for creole genesis. . . . Our experiment is unique on two fronts as it is the first to use an artificial language to test the convergence hypothesis by making it observable, and it is also the first experimental study to clarify the notion of similarity by varying the levels and types of similarity that are expressed. We report an experiment with 94 English-speaking adults . . . A miniature artificial language was created that included morphological elements to express negation and pluralization. Participants were randomly assigned to one of three conditions: congruent (form and function of novel grammatical morphemes were highly similar to those in English), reversed (negative grammatical morpheme was highly similar to that of English plural, and plural grammatical morpheme was highly similar to that of English negation), and novel (form and function were highly dissimilar to those of English).

A miniature artificial language! Cool.

And here’s what the researchers found:

Participants in the congruent condition performed best, indicating that features that converge across form and function are learned most fully. More surprisingly, results showed that participants in the reversed condition acquired the language more readily than those in the novel condition, contrary to expectation.

It’s funny they find this result surprising. Speaking as an outsider to this field of research, based only on introspection and experience, I’d’ve thought that the novel condition would be more difficult than the reversed condition. To me, reversal’s pretty trivial; but an entirely new system, that sounds hard. For example, in our apartment we have a backwards clock (the gear is flipped so the hands go counter-clockwise), a clock with the minute and hour hands switched, and a 24-hour clock. I’ve never had any problem reading the backwards clock (once you realize it’s reversed, reading it is automatic), but the other two clocks still give me difficulty and I have to consciously work it out each time I read them.

Sure, language is verbal and clock-reading is visual. Still, as Phil would say, I’m surprised that Baptista et al. are surprised that learning something reversed is easier than learning something new.

If the result really is a surprise, I’d like to see a replication study. Also some graphs of the data.