
Don’t call it a bandit

Here’s why I don’t like the term “multi-armed bandit” to describe the exploration-exploitation tradeoff of inference and decision analysis.

First, and less importantly, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.
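For concreteness, here is a minimal sketch of the "try everything, then concentrate on the best" strategy described above, usually called explore-then-commit. The Bernoulli payout probabilities, the function name, and the parameters are all assumptions made for illustration. Note that the code takes for granted that you keep playing, which is exactly the presupposition questioned above.

```python
import random


def explore_then_commit(win_probs, explore_rounds, total_rounds, seed=0):
    """Play every arm explore_rounds times, then commit to the best-looking arm.

    win_probs are hypothetical Bernoulli payout probabilities (an assumption
    for illustration only). Returns (committed_arm, total_reward).
    """
    rng = random.Random(seed)
    n_arms = len(win_probs)
    wins = [0] * n_arms
    total_reward = 0
    plays = 0

    # Exploration phase: sample each arm the same number of times.
    for arm in range(n_arms):
        for _ in range(explore_rounds):
            reward = 1 if rng.random() < win_probs[arm] else 0
            wins[arm] += reward
            total_reward += reward
            plays += 1

    # Commit phase: spend the remaining plays on the arm with the most wins.
    best = max(range(n_arms), key=lambda a: wins[a])
    for _ in range(total_rounds - plays):
        total_reward += 1 if rng.random() < win_probs[best] else 0
    return best, total_reward
```

If every entry of win_probs corresponds to a negative expected payoff, the function will still happily commit to the least bad machine.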

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

So, no deep principles here, the “bandit” name just doesn’t work for me. I agree with Bob Carpenter about the iid part (although I do remember a certain scene from The Grapes of Wrath with a non-iid slot machine), but other than that, the analogy seems a bit off.

The replication crisis and the political process

Jackson Monroe writes:

I thought you might be interested in an article [by Dan McLaughlin] in NRO that discusses the replication crisis as part of a broadside against all public health research and social science. It seemed as though the author might be twisting the nature of the replication crisis toward his partisan ends, but I was curious as to your thoughts.

From the linked article:

The social-science problem is that “public health” studies — like that NRA-convention study — can be highly subjective and ungoverned by the rigors of hard sciences that seek to test a hypothesis with results that can be replicated by other researchers. Indeed, the social sciences in general today suffer from a systemic “replication crisis,” a bias toward publishing only results that support the researcher’s hypothesis, and chronic problems with errors remaining uncorrected.

NRO is the website of the National Review, a conservative magazine, and the NRA (National Rifle Association) convention study is something we discussed in this space recently.

I think McLaughlin is probably correct that studies that seek to advance a political agenda are likely to have serious methodological problems. I’ve seen this in both the left and the right, and I don’t really know what to do about it, except to hope that all important research has active opposition. By “opposition,” I mean, ideally, honest opposition, and I don’t mean gridlock. It’s just good if serious claims are evaluated seriously, and not just automatically believed because they are considered to be on the side of the angels.

When LOO and other cross-validation approaches are valid


Zacco asked on the Stan Discourse forum whether leave-one-out (LOO) cross-validation is valid for phylogenetic models. He also referred to Dan’s excellent blog post, which mentioned the iid assumption. Instead of iid it would be better to talk about the exchangeability assumption, but I (Aki) got a bit lost in my Discourse answer (so don’t bother to go read it). I started to write this blog post to clear my thoughts and extend what I’ve said before, and hopefully produce a better answer.


The question is when leave-one-out cross-validation or leave-one-group-out cross-validation is valid for model comparison. The short answer is that we need to think about what is the joint data generating mechanism, what is exchangeable, and what is the prediction task. LOO can be valid or invalid, for example, for time-series and phylogenetic modelling depending on the prediction task. Everything said about LOO applies also to AIC, DIC, WAIC, etc.

iid and exchangeability

Dan wrote

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.

This is well put, although iid is a stronger assumption than necessary. It would be better to assume exchangeability, which doesn’t imply independence. The exchangeability assumption thus extends the set of cases in which LOO is applicable. The details of the difference between iid and exchangeability are not that important for this post, but I recommend that interested readers see Chapter 5 in BDA3. Instead of iid vs exchangeability, I’ll focus here on prediction tasks and data generating mechanisms.

Basics of LOO

LOO is easiest to understand in the case where we have a joint distribution p(y|x,\theta)p(x|\phi), and p(y|x,\theta) factorizes as

p(y|x,\theta) = \prod_{n=1}^N p(y_n|x_n,\theta).

We are interested in predicting a new data point coming from the data generating mechanism

p(y_{N+1}|x_{N+1},\theta).

When we don’t know \theta, we’ll integrate over posterior of \theta to get posterior predictive distribution

p(y_{N+1}|x_{N+1},x,y)=\int p(y_{N+1}|x_{N+1},\theta) p(\theta|x,y) d\theta.

We would like to evaluate how good this predictive distribution is by comparing it to observed (y_{N+1}, x_{N+1}). If we have not yet observed (y_{N+1}, x_{N+1}), we can use LOO to approximate the expectation over different possible values for (y_{N+1}, x_{N+1}). Instead of making a model p(y,x), we re-use the observations as pseudo-Monte Carlo samples from p(y_{N+1},x_{N+1}), and, in order not to use the same observation for both fitting \theta and evaluation, we use the LOO predictive distributions

p(y_{n}|x_{n},x_{-n},y_{-n})=\int p(y_{n}|x_{n},\theta) p(\theta|x_{-n},y_{-n}) d\theta.
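As a small illustration of the LOO predictive distribution, here is a sketch for a model where the integral is available in closed form. Everything about the model is an assumption chosen for convenience: there are no covariates, y_n is normal with unknown mean \mu and known standard deviation sigma, and \mu has a normal prior. The posterior is refit exactly with each observation left out in turn.

```python
import math


def normal_logpdf(y, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def posterior_mu(ys, sigma, prior_mean, prior_sd):
    """Conjugate posterior (mean, variance) of mu given observations ys."""
    precision = 1 / prior_sd**2 + len(ys) / sigma**2
    mean = (prior_mean / prior_sd**2 + sum(ys) / sigma**2) / precision
    return mean, 1 / precision


def loo_elpd(ys, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Sum over n of log p(y_n | y_{-n}), refitting exactly for each n."""
    total = 0.0
    for n, y_n in enumerate(ys):
        rest = ys[:n] + ys[n + 1:]  # leave observation n out
        mean, var = posterior_mu(rest, sigma, prior_mean, prior_sd)
        # Predictive variance = posterior variance of mu + observation noise.
        total += normal_logpdf(y_n, mean, var + sigma**2)
    return total
```

Here ys is a plain Python list. In realistic models the refit is not available in closed form, which is why approximations to LOO are used in practice.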

The usually forgotten assumption is that x_{N+1} is unknown, with the uncertainty described by a distribution! We have a model for p(y|x), but often we don’t build a model for p(x). BDA3 Chapter 5 discusses that when we have extra information x_n for y_n, then y_n are not exchangeable, but (x_n, y_n) pairs are. BDA3 Chapter 14 discusses that if we have a joint model p(x,y|\phi,\theta) and a prior which factorizes as

p(\phi,\theta)=p(\phi)p(\theta),

then the posterior factorizes as

p(\phi,\theta|x,y)=p(\phi|x)p(\theta|x,y).

We can analyze the second factor by itself with no loss of information.

For the predictive performance estimate we however need to know how the future x_{N+1} would be generated! In LOO we avoid making that explicit model by assuming that the observed x’s implicitly (or non-parametrically) represent the distribution p(x).
If we assume that the future distribution is different from the past, we can use importance weighting to adjust. Extrapolation (far) beyond the observed data would require an explicit model for p(x_{N+1}). We discuss different joint data generating processes in Vehtari & Ojanen (2012), but I’ll use more words and examples below.
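One simple way to picture the importance weighting adjustment is to reweight the pointwise LOO terms. The sketch below assumes we already have a pointwise log predictive density for each observation and a weight for each x_n; the weights stand for a hypothetical density ratio between the future and observed distributions of x, and how to obtain them is left open.

```python
def reweighted_elpd(pointwise_elpd, weights):
    """Self-normalized importance-weighted sum of pointwise LOO terms.

    weights[n] is a hypothetical ratio p_future(x_n) / p_observed(x_n);
    estimating it is outside the scope of this sketch.
    """
    if len(pointwise_elpd) != len(weights):
        raise ValueError("need one weight per observation")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    n = len(pointwise_elpd)
    weighted_mean = sum(w * e for w, e in zip(weights, pointwise_elpd)) / total_weight
    # Scale back to a sum so that equal weights reproduce the usual LOO total.
    return n * weighted_mean
```

With equal weights this reduces to the ordinary sum, and putting all the weight on a few observations makes the estimate depend only on those, with correspondingly higher variance.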

Data generating mechanisms

Dan wrote

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

This is quite a common way to describe the problem, but I hope to make it clearer by discussing the data generating mechanisms in more detail.

Sometimes p(x) does not exist, for example when x is fixed, chosen by design, or deterministic! Then p(x,y) also does not exist, and exchangeability (or iid) does not apply to (x_n,y_n). We can still make a conditional model p(y|x,\theta) and analyze posterior p(\theta|x,y).

If x is fixed or chosen by design, we could think of a prediction task in which we would repeat one of the measurements with the same x’s. In that case, we might want to predict all the new observations jointly, but as a cross-validation approximation that would lead to leave-all-out cross-validation, which is far from what we want (and sensitive to prior choices). So we may then use the more stable and easier-to-compute LOO, which is still informative about predictive accuracy for repeated experiments.

LOO can be used in the case of fixed x to evaluate the predictive performance of the conditional terms p(\tilde{y}_n|x_n,x,y), where \tilde{y}_n is a new observation conditional on a fixed x_n. Taking the sum (or mean) of these terms then weights each fixed x_n equally. If we care more about the performance for some fixed x_n than for others, we can use different weighting schemes to adjust.

x can sometimes be a mix of something with a distribution and something which is fixed or deterministic. For example, in a clinical study we could assume that patient covariates come from some distribution in the past and in the future, but the drug dosage is deterministic. It’s clear that if the drug helps, we don’t assume that we would continue giving a bad dosage in the future, so it’s unlikely we would ever observe the corresponding outcome either.

Time series

We could assume the future x to be fixed, chosen by design, or deterministic, but different from the observed x. For example, in time series we have observed time points 1,\ldots,N and then often we want to predict for N+1,\ldots,N+m. Here the data generating mechanism for x (time) is deterministic and all future values of x are outside of our observed x’s. It is still possible that the conditional part for y factorizes as \prod_{n=1}^N p(y_n|f_n,\theta) given latent process values f (and we have a joint prior on f describing our time series assumptions), and we may assume (y_n-f_n) to be exchangeable (or iid). What matters more here is the structure of x. Approximating the prediction task leads to m-step-ahead cross-validation where we don’t use the future data to provide information about f_n or \theta (see a draft case study). Under short range dependency and stationarity assumptions, we can also use some of the future data in m-step-ahead leave-a-block-out cross-validation (see a draft case study).
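To make the m-step-ahead idea concrete, here is a sketch of how the training and test indices could be generated. The function name, the zero-based indexing, and the convention of scoring only the m-th point ahead (rather than a whole block of m points) are assumptions for illustration.

```python
def m_step_ahead_splits(n_obs, min_train, m):
    """Yield (train_idx, test_idx) pairs for m-step-ahead cross-validation.

    For each cut point t >= min_train, the model is fit on time points
    0..t-1 and evaluated on the m-th point ahead, index t + m - 1.
    No observation at or after the cut point enters the training set.
    """
    for t in range(min_train, n_obs - m + 1):
        yield list(range(t)), [t + m - 1]
```

For example, with six time points, at least three training points, and m = 2, the splits are ([0, 1, 2], [4]) and ([0, 1, 2, 3], [5]). Compare this with LOO, where the training set for point n would include points after n.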

We can also have time series where we don’t want to predict the future, and thus the focus is only on the observed time points 1,\ldots,N. For example, we might be interested in analysing whether more or fewer babies are born on some special days during a time period in the past (I would assume there are plenty of more interesting examples in history studies and social sciences). It’s sensible to use LOO here to analyze whether we have been able to learn relevant structure in the time series in order to better predict the number of births on a left-out day. Naturally we could also analyze different aspects of the time series model by leaving out one week, one month, one year, or several days around special days, to focus on different assumptions in our time series model.

LOO-elpd may or may not fail for time series, depending on your prediction task. LOO is not great if you want to estimate the performance of extrapolation to the future, but having a time series doesn’t automatically invalidate the use of cross-validation.

Multilevel data and models

Dan wrote

Or when your data has multilevel structure, where you really should leave out whole strata.

For multilevel structure, we can start from a simple example with individual observations from M known groups (or strata as Dan wrote). The joint conditional model is commonly

\prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) p(\theta_m|\psi) \right] p(\psi),

where y are partially exchangeable, so that y_{mn} are exchangeable within group m, and \theta_m are exchangeable. But for the prediction we also need to consider how the x’s are generated. If we also have a model for x, then we might have the following joint model

p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m)p(\theta_m|\psi) \right] p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m)p(\phi_m|\varphi) \right] p(\varphi).

Sometimes we assume we haven’t observed all groups and want to make predictions for y_{M+1,n} for a new group M+1. Say we have observed individual students in different schools and we want to predict for a new school. Then it is natural to consider leave-one-school-out cross-validation to simulate the fact that we don’t have any observations yet from that new school. Leave-one-school-out cross-validation will then also implicitly (non-parametrically) model the distribution of x_{mn}.
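The leave-one-school-out idea can be sketched as index bookkeeping. The function below is a hypothetical helper: it takes one group label per observation and, for each group in turn, holds out all of that group’s observations.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group label.

    groups[i] is the group (e.g. school) of observation i.
    """
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield g, train, test
```

Each fit then sees no data at all from the held-out school, which mimics predicting for a new school. Ordinary LOO would instead hold out one student and keep that student’s schoolmates in the training set.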

On the other hand, if we poll people in all states of the USA, there will be no new states (at least in the near future) and some or all x might be fixed or deterministic, but there can be new people to poll in these states, and LOO could be sensible for predicting what the next person would answer. And even if there are more schools, we could focus just on these schools and have fixed or deterministic x. Thus the validity of LOO in hierarchical models also depends on the data generating mechanism and the prediction task.

Phylogenetic models and non-factorizable priors

Zacco’s discourse question was related to phylogenetic models.

Wikipedia says

In biology, phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species, or populations). These relationships are discovered through phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology under a model of evolution of these traits. The result of these analyses is a phylogeny (also known as a phylogenetic tree) – a diagrammatic hypothesis about the history of the evolutionary relationships of a group of organisms. The tips of a phylogenetic tree can be living organisms or fossils, and represent the “end”, or the present, in an evolutionary lineage.

This is similar to the hierarchical model above, but now we assume a non-factorizable prior for \theta

\prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi),

and if we have a model for x, we may have also a non-factorizable prior for \phi

p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\phi_1,\ldots,\phi_M|\varphi)p(\varphi).

Again, LOO is valid if we focus on new observations in the current groups (e.g. observed species). Alternatively, we could consider prediction for new individuals in a new group (e.g. species) and use leave-one-group-out cross-validation. I would assume we might have extra information about some x’s for a new species, which would require re-weighting or more modeling. Non-factorizable priors are no problem for LOO or cross-validation in general, although fast LOO computations may be possible only for special non-factorizable forms, and direct cross-validation may require the full prior structure to be included, as demonstrated in the Leave-one-out cross-validation for non-factorizable models vignette.

Prediction in scientific inference

Zacco wrote also

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

It seems quite common in many fields to have a desire to have some quantity to report as an assessment of how good the model is, but not to consider whether that model would generalize to new observations. The observations are really the only thing we have observed and something we might observe more in the future. Only in toy problems might we also be able to observe the usually unknown model structure and parameters. All model selection methods are inherently connected to the observations, and instead of thinking of the model assessment methods as black boxes (*cough* information criteria *cough*) it’s useful to think about how the model can help us to predict something.

In hierarchical models there are also different parts we can focus on, and depending on the focus, the benefit of some parts can sometimes be hidden behind the benefits of other parts. For example, in the simple hierarchical model described above, if we have plenty of individual observations in each group, then the hierarchical prior has only a weak effect if we leave one observation out, and LOO is then assessing mostly the lowest level model part p(y_{mn}|x_{mn},\theta_m). On the other hand, if we use leave-one-group-out cross-validation, then the hierarchical prior has a strong effect on prediction for that group, and we are assessing more the higher level model part p(\theta_1,\ldots,\theta_M|\psi). I would guess this is the part of phylogenetic models which should be in focus. If there is no specific prediction task, it’s probably useful to do both LOO and leave-one-group-out cross-validation.

Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set.

This would be sensible if we assume that future locations are spatially disconnected from the current locations, or if the focus is specifically on the non-spatial model part p(y_{mn}|x_{mn},\theta_m) and we don’t want spatial information to help that prediction.

Information criteria

Zacco wrote

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability.

It’s unfortunate that we didn’t make it more clear in that paper. I naïvely trusted too much that people would also read the cited 87-page theoretical paper (Vehtari & Ojanen, 2012), but I now understand that some repetition is needed. Akaike’s (1974) original paper made the clear connection to the prediction task of predicting y_{N+1} given x_{N+1} and \hat{\theta}. Stone (1977) showed that TIC (Takeuchi, 1976), which is a generalization of AIC, can be obtained from a Taylor series approximation of LOO. Papers introducing RIC (Shibata, 1989), NIC (Yoshizawa and Amari, 1994), and DIC (Spiegelhalter et al., 2002) also make the connection to predictive performance. Watanabe (2009, 2010a,b,c) is very explicit on the equivalence of WAIC and Bayesian LOO. The basic WAIC has exactly the same exchangeability assumptions as LOO and estimates the same quantity, but uses a different computational approximation. It’s also possible to think of DIC and WAIC at different levels of hierarchical models (see, e.g., Spiegelhalter et al., 2002; Li et al., 2016; Merkle et al. 2018).
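To show what “same quantity, different computational approximation” can look like, here is a sketch that computes both from the same input: a matrix of pointwise log-likelihood values over posterior draws. The input layout and function names are assumptions. The LOO part uses raw importance sampling without the Pareto smoothing that is used in practice, so treat it as illustrative only.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def elpd_waic_and_isloo(log_lik):
    """log_lik[s][n] = log p(y_n | x_n, theta_s) for posterior draw s.

    Returns (lppd, elpd_waic, elpd_isloo), each summed over observations.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = elpd_waic = elpd_isloo = 0.0
    for n in range(n_obs):
        col = [log_lik[s][n] for s in range(n_draws)]
        lppd_n = log_mean_exp(col)
        mean = sum(col) / n_draws
        # WAIC penalty: posterior variance of the pointwise log-likelihood.
        p_waic_n = sum((v - mean) ** 2 for v in col) / (n_draws - 1)
        # Raw importance-sampling LOO: log of the harmonic mean of p(y_n | theta_s).
        loo_n = -log_mean_exp([-v for v in col])
        lppd += lppd_n
        elpd_waic += lppd_n - p_waic_n
        elpd_isloo += loo_n
    return lppd, elpd_waic, elpd_isloo
```

Both estimates take exactly the same pointwise terms as input, so neither can be valid in a situation where the other is not; they differ only in how they correct the in-sample lppd.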

Unfortunately, most of the time information criteria are presented as fit + complexity penalty without any reference to the prediction task, assumptions about exchangeability, the data generating mechanism, or which part of the model is in focus. This, combined with the difficult-to-interpret unit of the resulting quantity, has led to information criteria being used as a black-box measure. As the assumptions are hidden, people think they are always valid (if you already forgot: WAIC and LOO have the same assumptions).

I prefer LOO over WAIC because of better diagnostics and better accuracy in difficult cases (see, e.g., Vehtari, Gelman, Gabry, 2017; Using Stacking to Average Bayesian Predictive Distributions). In the stacking paper we used LOO, but we can do Bayesian stacking also with other forms of cross-validation like leave-one-group-out or m-step-ahead cross-validation.


PS. I found this great review of the topic: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Although it completely ignores all Bayesian cross-validation literature and gives some recommendations not applicable for Bayesian modeling, it mostly gives the same recommendations as what I discuss above.

China air pollution regression discontinuity update

Avery writes:

There is a follow up paper for the paper “Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy” [by Yuyu Chen, Avraham Ebenstein, Michael Greenstone, and Hongbin Li] which you have posted on a couple times and used in lectures. It seems that there aren’t much changes other than newer and better data and some alternative methods. Just curious what you think about it.

The paper is called, “New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy” [by Avraham Ebenstein, Maoyong Fan, Michael Greenstone, Guojun He, and Maigeng Zhou].

The cleanest summary of my problems with that earlier paper is this article, “Evidence on the deleterious impact of sustained use of polynomial regression on causal inference,” written with Adam Zelizer.

Here’s the key graph, which we copied from the earlier Chen et al. paper:

The most obvious problem revealed by this graph is that the estimated effect at the discontinuity is entirely the result of the weird curving polynomial regression, which in turn is being driven by points on the edge of the dataset. Looking carefully at the numbers, we see another problem which is that life expectancy is supposed to be 91 in one of these places (check out that green circle on the upper right of the plot)—and, according to the fitted model, the life expectancy there would be 5 years higher, that is, 96 years!, if only they hadn’t been exposed to all that pollution.

As Zelizer and I discuss in our paper, and I’ve discussed elsewhere, this is a real problem, not at all resolved by (a) regression discontinuity being an identification strategy, (b) high-degree polynomials being recommended in some of the econometrics literature, and (c) the result being statistically significant at the 5% level.

Indeed, items (a), (b), (c) above represent a problem, in that they gave the authors of that original paper, and the journal reviewers and editors, a false sense of security which allowed them to ignore the evident problems in their data and fitted model.

We’ve talked a bit recently about “scientism,” defined as “excessive belief in the power of scientific knowledge and techniques.” In this case, certain conventional statistical techniques for causal inference and estimation of uncertainty have led people to turn off their critical thinking.

That said, I’m not saying, nor have I ever said, that the substantive claims of Chen et al. are wrong. It could be that this policy really did reduce life expectancy by 5 years. All I’m saying is that their data don’t really support that claim. (Just look at the above scatterplot and ignore the curvy line that goes through it.)

OK, what about this new paper? Here’s the new graph:

You can make of this what you will. One thing that’s changed is that the two places with life expectancy greater than 85 have disappeared. So that seems like progress. I wonder what happened? I did not read through every bit of the paper—maybe it’s explained there somewhere?

Anyway, I still don’t buy their claims. Or, I should say, I don’t buy their statistical claim that their data strongly support their scientific claim. To flip it around, though, if the public-health experts find the scientific claim plausible, then I’d say, sure, the data are consistent with this claimed effect on life expectancy. I just don’t see distance north or south of the river as a key predictor, hence I have no particular reason to believe that the data pattern shown in the above figure would’ve appeared, without the discontinuity, had the treatment not been applied.

I feel like kind of a grinch saying this. After all, air pollution is a big problem, and these researchers have clearly done a lot of work with robustness studies etc. to back up their claims. All I can say is: (1) Yes, air pollution is a big problem so we want to get these things right, and (2) Even without the near-certainty implied by these 95% intervals excluding zero, decisions can and will be made. Scientists and policymakers can use their best judgment, and I think they should do this without overrating the strength of any particular piece of evidence. And I do think this new paper is an improvement on the earlier one.

P.S. If you want to see some old-school ridiculous regression discontinuity analysis, check this out:

Continuous tempering through path sampling

Yuling prepared this poster summarizing our recent work on path sampling using a continuous joint distribution. The method is really cool and represents a real advance over what Xiao-Li and I were doing in our 1998 paper. It’s still gonna have problems in high or even moderate dimensions, and ultimately I think we’re gonna need something like adiabatic Monte Carlo, but I think that what Yuling and I are doing is a step forward in that we’re working with HMC and path sampling together, and our algorithm, while not completely automatic, is closer to automatic than other tempering schemes that I’ve seen.

“Seeding trials”: medical marketing disguised as science

Paul Alper points to this horrifying news article by Mary Chris Jaklevic, “how a medical device ‘seeding trial’ disguised marketing as science.”

I’d never heard of “seeding trials” before. Here’s Jaklevic:

As a new line of hip implants was about to be launched in 2000, a stunning email went out from the manufacturer’s marketing department. It described a “clinical research strategy” to pay orthopedic surgeons $400 for each patient they enrolled in a company-sponsored trial. . . . Ostensibly the trial was intended to measure how often liners of the Pinnacle Hip System, made by Johnson & Johnson’s DePuy subsidiary, stayed in place after five years. But according to a newly published review article [by Joan Steffen, Ella Fassler, Kevin Reardon, and David Egilman], the trial was really a scheme to gin up sales momentum under the guise of scientific research.

How did the scam work? Jaklevic explains:

The internal company email outlined a “strategy for collecting survivorship data on PINNACLE while maximizing our impact on the market.” It said the trial would include a large group of 40 surgeons in order to achieve “very fast patient enrollment” that would generate sales of 1,000 implants in a year.

While $345,000 would be paid to doctors, it said, “The sales revenue estimate for this study is $4.2 million.” . . .

“Seeding trials are one method by which drug or device companies can just pay physicians for using their products without calling it an actual bribe,” said Adriane Fugh-Berman MD, a professor of pharmacology and physiology at Georgetown University . . .

According to the paper, the trial generated millions in sales but yielded no valid research findings, although the company heavily manipulated the data it did collect to show a false 99.9% success rate that was used in promotional materials. . . .

Dayum. 99.9% success rate, huh? You don’t usually hear about that level of success; indeed the only examples I can think of offhand are old-time elections in the Soviet Union, and the replication rate as reported by the Harvard psychology department.

Here are some details:

J&J violated its own clinical research guidelines in manipulating data, delaying reports of adverse events and failing to follow parameters established for the study, such as not reporting results until all patients had been enrolled for five years and not retrospectively enrolling patients, it says. In some cases there were no patient consents. One surgeon continued to enroll patients in the trial and submit data even after his hospital’s review board refused to approve his participation in the trial.

Further, the company went to elaborate lengths to show a 99.9% success rate for five years, using sleights-of-hand such as not including certain types of device failures and hiding the fact that just 21 patients had been followed for a full five years.

Wait a minute. If there are only 21 patients, then the success rate could be 21/21 = 100%, or 20/21 = 95.2% . . . How do you get 99.9%?
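The arithmetic is easy to check. The sketch below lists the success rates that a simple proportion k/n can produce and finds the smallest cohort for which a proportion rounds to 99.9%. This is only the arithmetic of simple proportions: the article does not say how the 99.9% figure was computed, so nothing here reconstructs the company’s calculation. The function names are made up for this example.

```python
def rounded_rates(n_patients):
    """All success rates k / n_patients, as percentages rounded to one decimal."""
    return [round(100 * k / n_patients, 1) for k in range(n_patients + 1)]


def smallest_cohort_for(target_tenths):
    """Smallest n (with its k) such that k/n rounds to target_tenths / 10 percent.

    Uses integer arithmetic with halves rounded up: k/n rounds to the target when
    (2 * target - 1) * n <= 2000 * k < (2 * target + 1) * n.
    """
    if not 0 <= target_tenths <= 1000:
        raise ValueError("target_tenths must be between 0 and 1000")
    n = 1
    while True:
        for k in range(n + 1):
            if (2 * target_tenths - 1) * n <= 2000 * k < (2 * target_tenths + 1) * n:
                return n, k
        n += 1
```

With 21 patients the two highest achievable rates are 95.2% and 100.0%, and smallest_cohort_for(999) returns (667, 666): a plain proportion needs at least 667 patients, with exactly one failure, before it can round to 99.9%.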

Jaklevic continues:

Eventually, the bogus trial data was used as the “fundamental selling point” in Pinnacle marketing, providing physicians and patients with “a false sense of security,” the article says. The incredible near-perfect track record appeared in ads in medical journals, patient brochures, and ads in consumer publications such as the Ladies’ Home Journal and Golf Digest. . . .

Some of those ads featured an endorsement from Duke University basketball coach and hip implant recipient Mike Krzyzewski, even though Krzyzewski didn’t have Pinnacle implants. Krzyzewski’s osteoarthritis awareness promotions were covered uncritically by USA Today, the Florida Times-Union, and CBS News, only the latter of which mentioned he was paid.

No . . . not Coach K! All my illusions are shattered. Next you’re gonna tell me that Michael Jordan doesn’t really eat at McDonalds?

P.S. Concern with “seeding trials” is not new. Alper wrote about the topic in 2011, citing a Wikipedia article that pointed to a journal article from 1996. But it keeps happening, I guess.

P.P.S. Full disclosure: I’ve been a paid consultant for pharmaceutical companies.

Thanks, NVIDIA

Andrew and I both received a note like this from NVIDIA:

We have reviewed your NVIDIA GPU Grant Request and are happy support your work with the donation of (1) Titan Xp to support your research.


In case other people are interested, NVIDIA’s GPU grant program provides ways for faculty or research scientists to request GPUs; they also have graduate fellowships and larger programs.

Stan on the GPU

The pull requests are stacked up and being reviewed and integrated into the testing framework as I write this. Stan 2.19 (or 2.20 if we get out a quick 2.19 in the next month) will have OpenCL-based GPU support for double-precision matrix operations like multiplication and Cholesky decomposition. And the GPU speedups are stackable with the multi-core MPI speedups that just came out in CmdStan 2.18 (RStan and PyStan 2.18 are in process and will be out soon).

Plot of GPU timing

Figure 1. The plot shows the latest performance figures for Cholesky factorization; the X-axis is the matrix dimensionality and the Y-axis the speedup vs. the regular Cholesky factorization. I’m afraid I don’t know which CPU/GPU combo this was tested on.

Academic hardware grants

I’ve spent my academic career coasting on donated hardware back when hardware was a bigger deal. It started at Edinburgh in the mid-80s with a Sun Workstation donated to our department. LaTeX on the big screen was just game changing over working on a terminal then printing the PostScript. Then we got Dandelions from Xerox (crazy Lisp machines with a do-what-I-mean command line), and continued with really great HP Unix workstations at Carnegie Mellon that had crazy high-res CRT monitors for the late ’80s. Then I went into industry, where we had to pay for hardware. Now that I’m back in academia, I’m glad to see there are still hardware grant programs.

Stan development is global

We’re also psyched that so much core Stan development is coming from outside of Columbia. For the core GPU developers, Steve Bronder is at Capital One and Erik Štrumbelj and Rok Češnovar are at the University of Ljubljana. Erik’s the one who tipped me off about the NVIDIA GPU Grant program.

Daniel Lee is also helping out with the builds and testing and coding standards, and he’s at Generable. Sean Talts is also working on the integration here at Columbia; he played a key design role in the recent MPI launch, which was largely coded by Sebastian Weber at Novartis in Basel.

Amelia, it was just a false alarm

Nah, jet fuel can’t melt steel beams
I’ve watched enough conspiracy documentaries – Camp Cope

Some ideas persist long after the mounting evidence against them becomes overwhelming. Some of these things are kooky but probably harmless (try as I might, I do not care about ESP etc), whereas some are deeply damaging (I’m looking at you “vaccines cause autism”).

When these ideas have a scientific (be it social or physical) basis, there’s a pretty solid pattern to be seen: there is a study that usually over-interprets a small sample of data and there is an explanation for the behaviour that people want to believe is true.

This is on my mind because today I ran into a nifty little study looking at whether or not student evaluation of teaching (SET) has any correlation with student learning outcomes.

As a person who’s taught at a number of universities for quite a while, I have some opinions about this.

I know that when I teach my SET scores better be excellent or else I will have some problems in my life. And so I put some effort into making my students like me (Trust me, it’s a challenge) and perform a burlesque of hyper-competence lest I get that dreaded “doesn’t appear to know the material” comment. I give them detailed information about the structure of the exam. I don’t give them tasks that they will hate even when I think it would benefit certain learning styles. I don’t expect them to have done the reading*.

Before anyone starts up on a “kids today are too coddled” rant, it is not the students who make me do this. I teach the way I do because ensuring my SET scores are excellent is a large part** of both my job and my job security. I adapt my teaching practice to the instrument used to measure it***.

I actually enjoy this challenge.  I don’t think any of the things that I do to stabilize my SET scores are bad practice (otherwise I would do it differently), but let’s not mistake the motives.

(For the record, I also adapt my teaching practice to minimize the effects of plagiarism and academic dishonesty. This means that take-home assignments cannot be a large part of my assessment scheme. If I had to choose between being a dick to students who didn’t do the readings and being able to assess my courses with assignments, I’d choose the latter in a heartbeat.)

But SETs have some problems. Firstly, there’s increasingly strong evidence that women, people of colour, and people who speak English with the “wrong” accent**** receive systematically lower SET scores. So as an assessment instrument, SETs are horribly biased.

The one shining advantage to SETs, however, is that they are cheap and easy to measure. They are also centred on the student experience, and there have been a number of studies that suggest that SET scores are correlated with student results.

However, a recent paper from Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez suggests that the correlation observed in these studies is most likely due to their small sample sizes.

The paper is a very detailed meta-analysis (and meta-reanalysis of the previous positive results) of a number of studies on the correlation between SET scores and final grades. The experiments are based on large, multi-section courses where the sections are taught by multiple instructors. The SET score of the instructor is compared to the student outcomes (after adjusting for various differences between cohorts). Some of these courses are relatively small and hence the observed correlation will be highly variable.
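To get a feel for how variable a correlation from a small course can be, here is a minimal simulation sketch. It is not the analysis from the paper: the function names, the section counts (10 and 100), and the assumption of bivariate normal data with a true correlation of zero are all made up for illustration.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spread_of_r(n_sections, true_rho=0.0, reps=2000, seed=1):
    """Standard deviation of the observed correlation across simulated studies.

    Each simulated study has n_sections (SET score, outcome) pairs.
    All settings here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    rs = []
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n_sections)]
        # y has correlation true_rho with x; the rest is independent noise
        ys = [true_rho * x + (1 - true_rho ** 2) ** 0.5 * rng.gauss(0, 1)
              for x in xs]
        rs.append(pearson_r(xs, ys))
    return statistics.stdev(rs)


small = spread_of_r(10)
large = spread_of_r(100)
print(f"sd of r with 10 sections:  {small:.2f}")
print(f"sd of r with 100 sections: {large:.2f}")
```

With no true correlation at all, the standard deviation of the observed correlation is roughly 1/sqrt(n - 1), so a ten-section course will quite often produce correlations of 0.3 or more in either direction purely by chance.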

The meta-analytic methods used in this paper are heavily based on p-values, but are also very careful to correctly account for the differing sample sizes across studies. The paper also points out that if you look at the original data from the studies, some of the single-study correlations are absurdly large. It’s always a good idea to look at your data!

So does this mean SETs are going to go away? I doubt it. Although they don’t measure the effectiveness of teaching, universities increasingly market themselves based on the student experience, which is measured directly. And let us not forget that in the exact same way that I adapt my teaching to the metrics that are used to evaluate it, universities will adapt to metrics used to evaluate them. Things like the National Student Survey in the UK and the forthcoming Teaching Excellence Framework (also in the UK) will strongly influence how universities expect their faculty to teach.


*I actually experimented assuming the students would do the reading once when I taught a small grad course. Let’s just say the students vociferously disagreed with the requirement. I’ve taught very similar material since then without this requirement (also with some other changes) and it went much better. Obviously very small cohorts and various other changes mean that I can’t definitively say it was the reading requirement that sunk that course, but I can say that it’s significantly easier to teach students who don’t hate you.

** One of the things I really liked about teaching in Bath was that one of the other requirements was to make sure that the scatterplot of a student’s result in my class against an average of their marks on their other subjects that semester was clustered around the line y=x.

***I have unstable employment and a visa that is tied to my job. What do you expect?

**** There is no such thing as a wrong English accent. People are awful and students are people.

Is it really true that babies should sleep on their backs?

Asher Meir writes:

Arnold Kling is a well-regarded economics blogger. Here he expresses skepticism about the strength of the evidence behind recommending that babies sleep on their backs.

I recall seeing another blogger expressing the same doubt at some length, or maybe it is another post by Arnold, I can’t find it right now.

Of course there are many cases of leading experts spouting authoritative advice with little evidence, but this is in many ways more worrisome. First of all they are claiming that ignoring their advice is life-threatening which is quite coercive. In fact I saw at least one court case where a mother was held liable in part for not putting her child to sleep on its back. In addition, there is really little harm to avoiding eggs even if they are not really bad for you, but the SIDS advice is affecting the well-being of millions of infants around the world as well as their parents. If the advice is not necessary it is extremely harmful just because of the lost sleep of parents and babies, and it is at least plausible that the advice is actually harming babies in other ways as Arnold mentions.

Obviously if the research is carefully done then the physicians have an obligation to warn parents not to endanger their babies but I think it is worth looking into the strength of the evidence on this.

Meir summarizes:

I am worried that this may be a redux of the ridiculous unfounded nutritional advice we have been getting for years, except that the nutritional advice was mostly merely silly and harmless and this sounds potentially quite harmful. But who knows. Let’s let your readership weigh in.

I have no idea, myself. We did follow this advice with our kids, but I’ve never looked into the statistics on it; I just assumed it was a good idea to follow the general recommendations.

The file drawer’s on fire!

Kevin Lewis sends along this article, commenting, “That’s one smokin’ file drawer!”

Here’s the story, courtesy of Clayton Velicer, Gideon St. Helen, and Stanton Glantz:

We examined the relationship between the tobacco industry and the journal Regulatory Toxicology and Pharmacology (RTP) using the Truth Tobacco Industry Documents Library and internet sources. We determined the funding relationships, and categorised the conclusions of all 52 RTP papers on tobacco or nicotine between January 2013 and June 2015, as “positive”, “negative” or “neutral” for the tobacco industry. RTP’s editor, 57% (4/7) of associate editors and 37% (14/38) of editorial board members had worked or consulted for tobacco companies. Almost all (96%, 50/52) of the papers had authors with tobacco industry ties. Seventy-six percent (38/50) of these papers drew conclusions positive for industry; none drew negative conclusions. The two papers by authors not related to the tobacco industry reached conclusions negative to the industry (p < .001). These results call into question the confidence that members of the scientific community and tobacco product regulators worldwide can have in the conclusions of papers published in RTP.
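The abstract doesn’t say which test produced that p-value. As a rough check, and assuming (my guess, not the authors’ stated method) a two-sided Fisher exact test on the 2x2 table of author ties by negative vs. non-negative conclusions, the counts quoted above can be plugged in directly:

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # probability that the top-left cell equals x, given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Rows: papers with industry-tied authors (50), papers without (2).
# Columns: conclusion negative for industry, conclusion not negative.
p_value = fisher_exact_two_sided(0, 50, 2, 0)
print(f"p = {p_value:.5f}")
```

Under this assumption the observed table is the most extreme one possible, so the p-value is 1/C(52, 2) = 1/1326, which is consistent with the reported p < .001.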

I wonder what statisticians Herbert Solomon, Richard Tweedie, Arnold Zellner, Paul Switzer, Joseph Fleiss, Nathan Mantel, Joseph Berkson, Ingram Olkin, Donald Rubin, and Ronald Fisher would have said about this sort of selection bias.

Revisiting “Is the scientific paper a fraud?”

Javier Benitez points us to this article from 2014 by Susan Howitt and Anna Wilson, which has subtitle, “The way textbooks and scientific research articles are being used to teach undergraduate students could convey a misleading image of scientific research,” and begins:

In 1963, Peter Medawar gave a talk, Is the scientific paper a fraud?, in which he argued that scientific journal articles give a false impression of the real process of scientific discovery. In answering his question, he argued that, “The scientific paper in its orthodox form does embody a totally mistaken conception, even a travesty, of the nature of scientific thought.” His main concern was that the highly formalized structure gives only a sanitized version of how scientists come to a conclusion and that it leaves no room for authors to discuss the thought processes that led to the experiments.

Medawar explained that papers were presented to appear as if the scientists had no pre-conceived expectations about the outcome and that they followed an inductive process in a logical fashion. In fact, scientists do have expectations and their observations and analysis are made in light of those expectations. Although today’s scientific papers are increasingly presented as being hypothesis-driven, the underlying thought processes remain hidden; scientists appear to follow a logical and deductive process to test their idea and the results of these tests lead them to support or reject the hypothesis. However, even the trend toward more explicit framing of a hypothesis is often misleading, as hypotheses may be framed to explain a set of observations post hoc, suggesting a linear process that does not describe the actual discovery.

Howitt and Wilson continue:

There is, of course, a good reason why the scientific paper is highly formalized and structured. Its purpose is to communicate a finding and it is important to do this as clearly as possible. Even if actual process of discovery had been messy, a good paper presents a logical argument, provides supporting evidence, and comes to a conclusion. The reader usually does not need or want to know about false starts, failed experiments, and changes of direction.

Fair enough. There’s a tension between full and accurate description of the scientific process, on one hand, and concise description of scientific findings, on the other. Howitt and Wilson talk about the relevance of this to science teaching: by giving students journal articles to read, they get a misleading impression of what science is actually about.

Here I want to go in a slightly different direction and talk about the ways in which the form of the conventional science paper has been directly damaging science itself.

The trouble comes in when the article contains misrepresentations or flat-out lies, when the authors falsely or incompletely describe the processes of design, data collection, data processing, and data analysis. We’ve seen lots of examples of this in recent years.

Three related problems arise:

1. A scientific paper can mislead. People can read a paper, or see later popularizations of the work, and think that “science shows” something that science didn’t show. Within science, overconfidence in published claims can distort future research: lots of people can waste their time trying to track down an effect that was never there in the first place, or is too variable to measure using existing techniques.

2. Without a culture of transparency, there is an incentive to cheat. OK, a short-term incentive. Long-term, if your goal is scientific progress, cheating can just send you down the wrong track, or at best a random track. Cheating can get you publication and fame. There’s also an incentive for a sort of soft cheating, in which researchers pursue a strategy of active incompetence. Recall Clarke’s Law.

3. Scientific papers are typically written as triumphant success stories, and this can fool real-life adult scientists—not just students—leading them to expect the unrealistic, and making it hard for them to learn from the data they do end up collecting.

P.S. One problem is that some people who present themselves as promoters or defenders of science will go around defending incompetent science, I assume because they think that science in general is a good thing, and so even the bad stuff deserves defending. What I think is that the people who do the bad science are exploiting this attitude.

To all the science promoters and defenders out there—yes, I’m looking at you, NPR, PNAS, Freakonomics, Gladwell, and the Cornell University Media Relations Office: Junk scientists are not your friends. By defending bad work, or dodging criticism of bad work, you’re doing science no favors. Criticism is an essential part of science. The purveyors of junk science are taking advantage of your boosterish attitude.

To put it another way: If you want to defend every scientific paper ever published—or every paper ever published in Science, Nature, Psychological Science, and PNAS, or whatever—fine. But then, on the same principle, you should be defending every Arxiv paper ever written, and for that matter every scrawling on every lab notebook ever prepared by anyone with a Ph.D. or with some Ivy League connection or whatever other rule you’re using to decide what’s worth giving the presumption of correctness. If you really think the replication rate in your field is “statistically indistinguishable from 100%”—or if you don’t really believe this, but you think you have to say it to defend your field from outsiders—then you’re the one being conned.

Of Tennys players and moral Hazards

Zach Shahn writes:

After watching Tennys Sandgren play in the Australian Open quarterfinals last night, I think it might be time to accept that the dentists named Dennis people were onto something. Looking him up revealed that he was named after his great grandfather and not by a Richard Williams type parent who planned on grooming him into a pro. And on top of his name, he grew up in Tennessee. So he’s a tennis player named Tennys from Tennessee. Incidentally, it turns out he also seems to be a white supremacist who was a proponent of the “pizza-gate” theory… [no, not that Pizzagate — ed.]

I responded by pointing to this guy: a man named Hazard who was an expert on legal ethics.

Shahn replied:

Presumably he was a very moral Hazard.

And that’s all for the day.

How dumb do you have to be…

I (Phil) just read an article about Apple. Here’s the last sentence: “Apple has beaten earnings expectations in every quarter but one since March 2013.”

[Note added a week later: on July 31 Apple reported earnings for the fiscal third quarter.  Earnings per share was $2.34 vs. the ‘consensus estimate’ of $2.18, according to Thomson Reuters.]


What makes Robin Pemantle’s bag of tricks for teaching math so great?

It’s here, and he even calls it a “bag of tricks”!

Robin’s suggestions are similar to what Deb and I recommend, but Robin’s article is a crisp 25 pages and is purely focused on general advice for getting things to go well in the classroom, whereas we spend most of our book on specific activities related to statistics. Robin’s article would fit in well as a chapter in our book. Considered in that context, I’d say it’s better than the corresponding material in our book (in the second edition, this is Chapter 12, “How to do it”).

I prefer Robin’s article to our chapter because Robin’s article is more focused on what the teacher should do to maintain 100% student involvement during the entire class period.

Here are the sections of the article:

Typical classroom mechanics
Highly recommended procedures
Class composition and small group dynamics
Doing the rounds
Help we’re stuck
Getting groups to work together
Free riders
Staying on task
Students who are behind
Students who are ahead
Managing Socratic discussion
Dead ends
How to listen
Asking the right questions
Order versus chaos
A vision
The bell

He uses examples from mathematics rather than statistics examples, but the general principles should be clear to all.

The actual teaching strategy that Pemantle suggests is very close to what I do in class, which is no surprise: I suppose we’ve both seen the same literature on the importance of active learning, and I suppose we’ve both had frustrating experiences with students not learning in passive, lecture-style classes. What I particularly like about Robin’s document is that he gives advice to handle various important situations; see list above.

I’m not sure when Robin’s document was written. I recommend that Robin take a couple days and do the following steps:

1. Go through it and see if any of the advice has changed. Rewrite where necessary. Or, maybe better: Add a new section at the end, “What have I learned since 1997?” [or whatever year it was that this document was prepared].

2. Remove section 4.3 which is specific to the institution where this material was first used. You’re sitting on dynamite, a wonderful article that can benefit thousands of teachers and students if it’s disseminated widely. In the intro you can say that this article derived from notes for a certain course, that’s enough.

3. Reformat it single-spaced and not using the ugly Latex defaults. Here’s an example using my currently-preferred template. You’ll have your own preferences; my point here is just that you might as well have a document that is more compact and readable.

4. Put the table of contents in the pdf document and also add some references at the end. You want a self-contained article that people can point to and pass around. It’s also fine to have an html page; two formats can reach more audiences.

5. Post it on arxiv, and if you happen to have a friend with a statistics blog, get him to post something on all this.

Awesome MCMC animation site by Chi Feng! On Github!

Sean Talts and Bob Carpenter pointed us to this awesome MCMC animation site by Chi Feng. For instance, here’s NUTS on a banana-shaped density.

This is indeed super-cool, and maybe there’s a way to connect these with Stan/ShinyStan/Bayesplot so as to automatically make movies of Stan model fits. This would be great, both to help understand what the sampler is doing, and to demonstrate to outsiders what Stan does.

How to think about an accelerating string of research successes?

While reading this post by Seth Frey on famous scientists who couldn’t let go of bad ideas, I followed a link to this post by David Gorski from 2010 entitled, “Luc Montagnier: The Nobel disease strikes again.” The quick story is that Montagnier endorsed some dubious theories. Here’s Gorski:

He only won the Nobel Prize in 2008, and it only took him two years to endorse homeopathy-like concepts. He’s also made a name for himself, such as it is, by appearing in the HIV/AIDS denialist film House of Numbers stating that HIV can be cleared naturally through nutrition and supplements. This he did after publishing a paper in a journal for which he himself is the editor . . .

But that’s just the beginning:

From there it only took Montagnier a few months more to turn his eye to applying that “knowledge” to autism . . . Unfortunately, the pseudoscience that Montagnier appears to have embraced with respect to autism is combined with a highly unethical study . . . The trial is sponsored by the Autism Treatment Trust (ATT) and the Autism Research Institute (ARI), both institutions that are–shall we say?–not exactly known for their scientific rigor. Apparently Montagnier has teamed up with a Dr. Corinne Skorupka, who is a DAN! practitioner from France . . . Whenever you see an “investigator” charge patients to undergo an experimental protocol, be very very wary. . . . here we have Montagnier and colleagues charging the parents of autistic children . . . Perhaps even worse than that, check out how badly designed this experimental protocol is . . . there are no convincing preclinical data . . . Based on an unsupported hypothesis that bacterial infections cause autism, Montagnier will be subjecting autistic children to blood draws and treatment with antibiotics. The former will cause unnecessary pain and suffering, and the latter has the potential to cause the complications that can occur due to long term antibiotic use over several months. . . . The study proposed is poorly designed even for a pilot study. There is no control group . . . Moreover, because the selection criteria for the study are not specified, there is no way of knowing how much selection bias might be operative there.

Gorski asks:

I’ve wondered how some Nobel Laureates, after having achieved so much at science, proving themselves at the highest levels by making fundamental contributions to our understanding of science that rate the highest honors, somehow end up embracing dubious science . . . or even outright pseudoscience . . . Does the fame go to their head? . . .

I’m guessing the story is a bit different, maybe not for the particular case of Montagnier but for this general “Nobel prize disease” thing—the pattern of celebrated scientists embracing wacky ideas. It’s not so much that these scientists get drunk by fame; rather, it’s that the prize attracted more attention to the wacky ideas they were susceptible to in the first place.

And then the feedback loop comes in. Scientist expresses wacky idea; then because of the Nobel prize, his wacky pronouncements get attention; scientist enjoys being in the limelight (maybe it’s been a bit disappointing after the Nobel publicity fades and his life is pretty much the same as always) so he makes more pronouncements; these pronouncements get more attention; scientist realizes that to continue to get in the news, he needs to make grander and grander claims; etc.

The apparent research progress comes in, faster and faster with stronger and stronger results

But I actually want to talk about something else, not the Nobel disease or anything like that, but the following pattern which I’ve seen from time to time.

The pattern goes like this: A researcher studies some topic, and after lots of effort and many false starts, he makes some progress. After that, progress comes faster and faster, and more and more research results come in.

This escalating pattern can arise legitimately: you develop a new tool and then find applications for it everywhere. For example, it took us a few years to write the Red State Blue State paper, but from there it only took a year to write the book, which had tons of empirical results.

Other times, though, it seems that what’s happening is escalating overconfidence, exacerbated by whatever echo chambers happen to be nearby. Luc Montagnier, for example, will have no problem finding yes-men, with that Nobel prize hanging in the corner. Another echo chamber is the science publication and grants system: if you have a track record of success, you’re likely to have figured out ways of presenting your results so they’re publishable and grant-worthy.

But the example I have in mind is my friend Seth Roberts, who spent about 10 years on his self-treatment for depression and then a few years more on his weight-loss method. At this point he spent a few years working on that, writing it up, and becoming a bit of a culture hero. And then he started to let his ambition get ahead of him, using self-experimentation to conclude that eating a stick of butter a day improved his brain functioning, among other things. I’m not saying that Seth was wrong—who knows? Maybe eating a stick of butter a day does improve brain functioning—but I’m skeptical of the idea that he came up with some trick for scientific discovery, so that what took him 5 or 10 years in the past could now be done, routinely, every couple of months.

Beware the escalating pattern of research results.

P.S. Gorski’s post also has a Herbalife connection.

P.P.S. Frey’s post is interesting too, but does he really think that all those people on his list are “way way smarter than everyone I know”? Does Frey really not know anyone as smart as Trofim Lysenko?

P.P.P.S. I came across this other post where Frey remarks that he used to work for Marc “Evilicious” Hauser!

Frey’s Hauser-related post is interesting but he makes one common mistake when he defines exploratory data analysis as “what you do when you suspect there is something interesting in there but you don’t have a good idea of what it might be, so you don’t use a hypothesis.” No! Exploratory data analysis is all about finding the unexpected, which is defined (explicitly or implicitly) relative to the expected, that is, a hypothesis or model. See this paper from 2004 for further discussion of this point.

Parsimonious principle vs integration over all uncertainties

tl;dr If you have bad models, bad priors, or bad inference, choose the simplest possible model. If you have good models, good priors, and good inference, use the most elaborate model for predictions. To make interpretation easier you may use a smaller model with similar predictive performance as the most elaborate model.

Merijn Mestdagh emailed me (Aki) and asked

In your blogpost “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” you write that

“My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner”…“In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated”. Could you provide some references or additional information on this claim?

I have to clarify the additional conditions for my recommendation to use the encompassing model in the case of nested models:

  • models are used to make predictions
  • the encompassing model has passed model checking
  • the inference has passed diagnostic checks

The Bayesian theory says that we should integrate over the uncertainties. The encompassing model includes the submodels, and if the encompassing model has passed model checking, then the correct thing is to include all the models and integrate over the uncertainties (and I assume that inference is correct). To pass the model checking, it may require good priors on the model parameters, which may be something many ignore and then they may get bad performance with more complex models. If the models have similar LOO performance, the encompassing model is likely to have thicker tails of the predictive distribution, meaning it is more cautious about rare events. I think this is good.
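As a toy illustration of the thicker-tails point, here is a minimal sketch comparing a plug-in predictive distribution with one that integrates over uncertainty in a scale parameter. Everything in it is an assumption made for illustration: the normal observation model, the fixed value sigma_hat = 1, and the log-normal “posterior draws” for sigma, which are simulated here rather than taken from any fitted model.

```python
import math
import random


def normal_tail(threshold, sigma):
    """P(|y| > threshold) when y ~ Normal(0, sigma)."""
    return math.erfc(threshold / (sigma * math.sqrt(2)))


def plug_in_tail(threshold, sigma_hat=1.0):
    """Tail probability when sigma is fixed at a single value."""
    return normal_tail(threshold, sigma_hat)


def integrated_tail(threshold, sigma_draws):
    """Tail probability averaged over draws of sigma (a scale mixture)."""
    return sum(normal_tail(threshold, s) for s in sigma_draws) / len(sigma_draws)


# Hypothetical posterior draws for sigma, centred near 1 on the log scale.
rng = random.Random(0)
sigma_draws = [math.exp(rng.gauss(0.0, 0.3)) for _ in range(5000)]

fixed = plug_in_tail(3.0)
mixed = integrated_tail(3.0, sigma_draws)
print(f"P(|y| > 3), sigma fixed at 1:      {fixed:.4f}")
print(f"P(|y| > 3), sigma integrated over: {mixed:.4f}")
```

Both predictive distributions are centred in the same place, but averaging over the plausible values of sigma puts noticeably more mass on rare events than conditioning on a single value does.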

The main reasons why it is so common to favor more parsimonious models:

  • The maximum likelihood inference is common and it doesn’t work well with more complex models. Favoring the simpler models is a kind of regularization.
  • Bad model misspecification. Even Bayesian inference doesn’t work well if the model is bad, and with complex models there are more possibilities for misspecifying the model and the misspecification even in one part can have strange effects in other parts. Favoring the simpler models is a kind of regularization.
  • Bad priors. Actually priors and model are inseparable, so this is kind of the same as the previous one. It is more difficult to choose good priors for more complex models, because it’s difficult for humans to think about high dimensional parameters and how they affect the predictions. Favoring the simpler models can avoid the need to think harder about priors. See also Dan’s blog post and Mike’s case study.

But when I wrote my comments, I assumed that we are considering sensible models, priors, and inference, so there is no need for parsimony. See also the paper Occam’s razor, which illustrates the effect of good priors and increasing model complexity.

In “VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.” we compare AR models with VAR models (AR models are nested in VAR models). When both models perform equally we prefer, in contrast to your blogpost, the less complex AR models because

– The one standard error rule (“ Hastie et al. (2009) propose to select the most parsimonious model within the range of one standard error above the prediction error of the best model (this is the one standard error rule)”).

Hastie is not using priors or Bayesian inference, and thus he needs the parsimony rule. He may also have implicit cost function…
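For readers who haven’t seen the one standard error rule, here is a minimal sketch of the selection step as described in the quote above. The function name, the tuple layout, and all the numbers are hypothetical; they are not taken from the VAR(1) paper or from Hastie et al.

```python
def one_standard_error_choice(models):
    """Pick the most parsimonious model within one standard error of the best.

    models: list of (name, complexity, cv_error, cv_standard_error) tuples,
    where lower cv_error is better and lower complexity is more parsimonious.
    """
    best = min(models, key=lambda m: m[2])
    threshold = best[2] + best[3]  # best error plus its standard error
    eligible = [m for m in models if m[2] <= threshold]
    return min(eligible, key=lambda m: m[1])[0]


# Made-up cross-validation results, for illustration only.
close_call = [("AR(1)", 1, 1.02, 0.05), ("VAR(1)", 4, 1.00, 0.04)]
clear_gap = [("AR(1)", 1, 1.10, 0.05), ("VAR(1)", 4, 1.00, 0.04)]

print(one_standard_error_choice(close_call))
print(one_standard_error_choice(clear_gap))
```

In the first made-up case the simpler model is within one standard error of the best and gets selected; in the second it is not, so the rule keeps the more complex model. Note that the rule uses only point estimates of prediction error and their standard errors, with no priors involved.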

– A big problem (in my opinion) in psychology is the over-interpretation of small effects, which is exacerbated by using complex models. I fear that some researchers will feel the need to interpret the estimated value of every parameter in the complex model.

Yes, it is dangerous to over-interpret especially if the estimates don’t include good uncertainty estimates, and even then it’s difficult to make interpretations in case of many collinear covariates.

My assumption above was that the model is used for predictions and I care about best possible predictive distribution.

The situation is different if we add the cost of interpretation or the cost of measurements in the future. I don’t know of literature analysing the cost of interpretation for AR vs VAR models, or the cost of interpretation of an additional covariate in a model, but when I’m considering interpretability I favor smaller models with similar predictive performance as the encompassing model. But if we have the encompassing model, then I also favor the projection predictive approach, where the full model is used as the reference so that the selection process is not overfitting to the data and the inference given the smaller model is conditional on the information from the full model (Piironen & Vehtari, 2017). In case of a small number of models, LOO comparison or Bayesian stacking weights can also be sufficient (some examples here and here).

Advice on soft skills for academics

Julia Hirschberg sent this along to the natural language processing mailing list at Columbia:

here are some slides from last spring’s CRA-W Grad Cohort and previous years that might be of interest. all sorts of topics such as interviewing, building confidence, finding a thesis topic, preparing your thesis proposal, publishing, entrepreneurialism, and a very interesting panel on human-human interaction skills.

I took a look at a couple of these and they look like useful advice for grad students. I wrote back to Julia that I used to have a hard time as a professor convincing students their evening would be better spent having dinner with the invited seminar speaker than revising their homework.

Advice to grad students is a longstanding, niche writing genre. There are even widely-cited classics, like the transcription of Richard Hamming’s talk You and Your Research, which has been recommended to me more times than I can count. There’s also a later YouTube presentation, which I haven’t seen (yet!).

Hamming’s advice on lunchtime behavior reminds me of one of the reasons I like places like Google—they still have active round table tech lunches. We try to do that once per week after the Stan meetings. If you’re in town and want to join us, drop me or Andrew a line.

P.S. I got the term “soft skills” from a friend of mine who’s going through soft skills training at Amazon before they’ll let him speak to customers—things like giving presentations and staying on topic.

Journals and refereeing: toward a new equilibrium

As we’ve discussed before (see also here), one of the difficulties of moving from our current system of review of scientific journal articles, to a new model of post-publication review, is that any major change seems likely to break the current “gift economy” system in which thousands of scientists put in millions of hours providing free reviews. And these reviews can be pretty good. Doing pre-publication reviews at the request of a journal editor: that seems like an obligation, and it’s helped along by pressure from all those associate editors. Remove that system of social obligation, just tell people they can do post-publication reviews when they want, and you’ll see a lot less reviewing. And the reviewing that does get done will be disproportionately by people with an axe to grind. So that could be a problem.

So what’s the new equilibrium, if we move away from I-give-free-labor-to-Elsevier-by-reviewing-random-papers-for-their-journals-at-the-behest-of-equally-uncompensated-editors to open post-publication review? Is it just that zillions of things get published and a few of them get reviewed in an unsystematic manner?

I don’t know.

Recently in the sister blog

In the article, “Testing the role of convergence in language acquisition, with implications for creole genesis,” Marlyse Baptista et al. write:

The main objective of this paper is to test experimentally the role of convergence in language acquisition (second language acquisition specifically), with implications for creole genesis. . . . Our experiment is unique on two fronts as it is the first to use an artificial language to test the convergence hypothesis by making it observable, and it is also the first experimental study to clarify the notion of similarity by varying the levels and types of similarity that are expressed. We report an experiment with 94 English-speaking adults . . . A miniature artificial language was created that included morphological elements to express negation and pluralization. Participants were randomly assigned to one of three conditions: congruent (form and function of novel grammatical morphemes were highly similar to those in English), reversed (negative grammatical morpheme was highly similar to that of English plural, and plural grammatical morpheme was highly similar to that of English negation), and novel (form and function were highly dissimilar to those of English).

A miniature artificial language! Cool.

And here’s what the researchers found:

Participants in the congruent condition performed best, indicating that features that converge across form and function are learned most fully. More surprisingly, results showed that participants in the reversed condition acquired the language more readily than those in the novel condition, contrary to expectation.

It’s funny they find this result surprising. Speaking as an outsider to this field of research, based only on introspection and experience, I’d’ve thought that the novel condition would be more difficult than the reversed condition. To me, reversal’s pretty trivial; but an entirely new system, that sounds hard. For example, in our apartment we have a backward clock (the gear is flipped so the hands go counter-clockwise), a clock with the minute and hour hands switched, and a 24-hour clock. I’ve never had any problem reading the backward clock—once you realize it’s reversed, reading it is automatic—but the other two clocks still give me difficulty and I have to consciously work it out each time I read them.

Sure, language is verbal and clock-reading is visual. Still, as Phil would say, I’m surprised that Baptista et al. are surprised that learning something reversed is easier than learning something new.

If the result really is a surprise, I’d like to see a replication study. Also some graphs of the data.