Skip to content

Barry Gibb came fourth in a Barry Gibb look alike contest

Every day a little death, in the parlour, in the bed. In the lips and in the eyes. In the curtains in the silver, in the buttons, in the bread, in the murmurs, in the gestures, in the pauses, in the sighs. – Sondheim

The most horrible sound in the world is that of a reviewer asking you to compare your computational method to another, existing method. Like bombing countries in the name of peace, the purity of intent drowns out the voices of our better angels as they whisper: at what cost.

Before the unnecessary drama of that last sentence sends you running back to the still-open browser tab documenting the world’s slow slide into a deeper, danker, more complete darkness that we’ve seen before, I should say that I understand that for most people this isn’t a problem. Most people don’t do research in computational statistics. Most people are happy.

So why does someone asking for a comparison of two methods for allegedly computing the same thing fill me with the sort of dread usually reserved for climbing down the ladder into my basement to discover, by the the light of a single, swinging, naked lightbulb, that the evil clown I keep chained in the corner has escaped? Because it’s almost impossible to do well.

I go through all this before you wake up so I can feel happier to be safe again with you

Many many years ago, when I still had all my hair and thought it was impressive when people proved things, I did a PhD in numerical analysis. These all tend to have the same structure:

  • survey your chosen area with a simulation study comparing all the existing methods,
  • propose a new method that should be marginally better than the existing ones,
  • analyse the new method, show that it’s at least not worse than the existing ones (or worse in an interesting way),
  • construct a simulation study that shows the superiority of your method on a problem that hopefully doesn’t look too artificial,
  • write a long discussion blaming the inconsistencies between the maths and the simulations on “pre-asymptotic artefacts”.

Which is to say, I’ve done my share of simulation studies comparing algorithms.

So what changed? When did I start to get the fear every time someone mentioned comparing algorithms?

Well, I left numerical analysis and moved to statistics and I learnt the one true thing that all people who come to statistics must learn: statistics is hard.

When I used to compare deterministic algorithms it was easy: I would know the correct answer and so I could compare algorithms by comparing the error in their approximate solutions (perhaps taking into account things like how long it took to compute the answer).

But in statistics, the truth is random. Or the truth is a high-dimensional joint distribution that you cannot possibly know. So how can you really compare your algorithms, except possibly by comparing your answer to some sort of “gold standard” method that may or may not work.

Inte ner för ett stup. Inte ner från en bro. Utan från vattentornets topp.

[No I don’t speak Swedish, but one of my favourite songwriters/lyricists does. And sometimes I’m just that unbearable. Also the next part of this story takes place in Norway, which is near Sweden but produces worse music (Susanne Sunfør and M2M being notable exceptions)]

The first two statistical things I ever really worked on (in an office overlooking a fjord) were computationally tractable ways of approximating posterior distributions for specific types of models.  The first of these was INLA. For those of you who haven’t heard of it, INLA (and it’s popular R implementation R-INLA) is a method for doing approximate posterior computation for a lot of the sorts of models you can fit in rstanarm and brms. So random effect models, multilevel models, models with splines, and spatial effects.

At the time, Stan didn’t exist (later, it barely existed), so I would describe INLA as being Bayesian inference for people who lacked the ideological purity to wait 14 hours for a poorly mixing BUGS chain to run, instead choosing to spend 14 seconds to get a better “approximate” answer.  These days, Stan exists in earnest and that 14 hours is 20 minutes for small-ish models with only a couple of thousand observations, and the answer that comes out of Stan is probably as good as INLA. And there are plans afoot to make Stan actually solve these models with at least some sense of urgency.

Working on INLA I learnt a new fear: the fear that someone else was going to publish a simulation study comparing INLA with something else without checking with us first.

Now obviously, we wanted people to run their comparisons past us so we could ruthlessly quash any dissent and hopefully exile the poor soul who thought to critique our perfect method to the academic equivalent of a Siberian work camp.

Or, more likely, because comparing statistical models is really hard, and we could usually make the comparison much better by asking some questions about how it was being done.

Sometimes, learning from well-constructed simulation studies how INLA was failing lead to improvements in the method.

But nothing could be learned if, for instance, the simulation study was reporting runs from code that wasn’t doing what the authors thought it was. And I don’t want to suggest that bad or unfair comparisons comes from malice (for the most part, we’re all quite conscientious and fairly nice), but rather that they happen because comparing statistical algorithms is hard.

And comparing algorithms fairly where you don’t understand them equally well is almost impossible.

Well did you hear the one about Mr Ed? He said I’m this way because of the things I’ve seen

Why am I bringing this up? It’s because of the second statistical thing that I worked on while I was living in sunny Trondheim (in between looking at the fjord and holding onto the sides of buildings for dear life because for 8 months of the year Trondheim is a very pretty mess of icy hills).

During that time, I worked with Finn Lindgren and Håvard “INLA” Rue on computationally efficient approximations to Gaussian random fields (which is what we’re supposed to call Gaussian Processes when the parameter space is more complex than just “time” [*shakes fist at passing cloud*]). Finn (with Håvard and Johan Lindström) had proposed a new method, cannily named the Stochastic Partial Differential Equation (SPDE) method, for exploiting the continuous-space Markov property in higher dimensions. Which all sounds very maths-y, but it isn’t.

The guts of the method says “all of our problems with working computationally with Gaussian random fields comes from the fact that the set of all possible functions is too big for a computer to deal with, so we should do something about that”.  The “something” is replace the continuous function with a piecewise linear one defined over a fairly fine triangulation on the domain of interest.

But why am I talking about this? Sorry. One day I’ll write a short post.

A very exciting paper popped up on arXiv on Monday comparing a fairly exhaustive collection of recent methods for making spatial Gaussian random fields more computationally efficient.

Why am I not cringing in fear? Because if you look at the author list, they have included an author from each of the projects they have compared! This means that the comparison will probably be as good as it can be. In particular, it won’t suffer from the usual problem of the authors understanding some methods they’re comparing better than others.

The world is held together by the wind that blows through Gena Rowland’s hair

So how did they go?  Well, actually, they did quite well.  I like that

  • They describe each problem quite well
  • The simulation study and the real data analysis uses a collection of different evaluations metrics
  • Some of these are proper scoring rules, which is the correct framework for evaluating probabilistic predictions
  • They acknowledge that the wall clock timings are likely to be more a function of how hard a team worked to optimise performance on this one particular model than a true representation of how these methods would work in practice.

Not the lovin’ kind

But I’m an academic statistician. And our key feature, as a people, is that we loudly and publicly dislike each other’s work. Even the stuff we agree with.  Why? Because people with our skills who also have impulse control tend to work for more money in the private sector.

So with that in mind, let’s have some fun.

(Although seriously, this is the best comparison of this type I’ve ever seen. So, really, I’m just wanting it to be even bester.)

So what’s wrong with it?

It’s gotta be big. I said it better be big

The most obvious problem with the comparison is that the problem that these methods are being compared on is not particularly large.  You can see that from the timings.  Almost none of these implementations are sweating, which is a sign that we are not anywhere near the sort of problem that would really allow us to differentiate between methods.

So how small is small? The problem had 105,569 observations and required prediction at at most 44,431 other locations. To be challenging, this data needed to be another order of magnitude bigger.

God knows I know I’ve thrown away those graces

(Can you tell what I’m listening to?)

The second problem with the comparison is that the problem is tooooooo easy. As the data is modelled with a Gaussian observation noise and a multivariate Gaussian latent random effect, it is a straightforward piece of algebra to eliminate all of the latent Gaussian variables from the model.  This leads to a model with only a small number of parameters, which should make inference much easier.

How do you do that?  Well, if the data is y, the Gaussian random field is x and and all the hyperparmeters \theta.  In this case, we can use conditional probability to write that

p(\theta \mid y) \propto \frac{p(y,x,\theta)}{p(x \mid y, \theta)},

which holds for every value of x and particularly x=0.  Hence if you have a closed form full conditional (which is the case when you have Gaussian observations), you can write the marginal posterior out exactly without having to do any integration.

A much more challenging problem would have had Poisson or binomial data, where the full conditional doesn’t have a known form. In this case you cannot do this marginalisation analytically, so you put much more stress on your inference algorithm.

I guess there’s an argument to be made that some methods are really difficult to extend to non-Gaussian observations. But there’s also an argument to be made that I don’t care.

Don’t take me back to the range

The prediction quality is measured in terms of mean squared error and mean absolute error (which are fine), the continuous rank probability score (CRPS) and and the Interval Score (INT), both of which are proper scoring rules. Proper scoring rules (and follow the link or google for more if you’ve never heard of them) are the correct way to compare probabilitic predictions, regardless of the statistical framework that’s used to make the predictions. So this is an excellent start!

But one of these measures does stand out: the prediction interval coverage (CVG) which is defined in the paper as “the percent of intervals containing the true predicted value”.  I’m going to parse that as “the percent of prediction intervals containing the true value”. The paper suggests (through use of bold in the tables) that the correct value for CVG is 0.95. That is, the paper suggests the true value should lie within the 95% interval 95% of the time.

This is not true. 

Or, at least, this is considerably more complex than the result suggests.

Or, at least, this is only true if you compute intervals that are specifically built to do this, which is mostly very hard to do. And you definitely don’t do it by providing a standard error (which is an option in this competition).

Boys on my left side. Boys on my right side. Boys in the middle. And you’re not here.

So what’s wrong with CVG?

Why? Well first of all it’s a multiple testing problem. You are not testing the same interval multiple times, you are checking multiple intervals one time each. So it can only be meaningful if the prediction intervals were constructed jointly to solve this specific multiple testing problem.

Secondly, it’s extremely difficult to know what is considered random here. Coverage statements are statements about repeated tests, so how you repeat them will affect whether or not a particular statement is true. It will also affect how you account for the multiple testing when building your prediction intervals. (Really, if anyone did opt to just return standard errors, nothing good is going to happen for them in this criterion!)

Thirdly, it’s already covered by the interval score.  If your interval is [l,u] with nominal level \alpha, the interval score is for an observation y is

\text{INT}_\alpha(l, u, y) = u - l + \frac{2}{\alpha}(l-y) \mathbf{1}\{y < l\} + \frac{2}{\alpha}(y-u)\mathbf{1}\{y>u\}.

This score (where smaller is better) rewards you for having a narrow prediction interval,  but penalises you every time the data does not lie in the interval.  The score is minimised when \Pr(y \in [l,u]) = \alpha. So this really is a good measure of how well the interval estimate is calibrated that also checks more aspects of the interval than CVG (which lacks the first term) does.

There’s the part you’ve braced yourself against, and then there’s the other part

Any conversation about how to evaluate the quality of an interval estimate really only makes sense in the situation where everyone has constructed their intervals the same way. Now the authors have chosen not to provide their code, so it’s difficult to tell what people actually did.  But there are essentially four options:

  • Compute pointwise prediction means \hat{\mu}_i and standard errors \hat{\sigma}_i and build the pointwise intervals \hat{\mu}_i \pm 1.96\hat{\sigma}.
  • Compute the pointwise Bayesian prediction intervals, which are formed from the appropriate quantiles (or the HPD region if you are Tony O’Hagan) of \int \int p(\hat{y} \mid x,\theta) p(x,\theta \mid y)\,dxd\theta.
  • An interval of the form \hat{\mu}_i \pm c\hat{\sigma}, where is chosen to ensure coverage.
  • Some sort of clever thing based on functional data analysis.

But how well these different options work will depend on how they’re being assessed (or what they’re being used for).


Option 1: We want to fill in our sparse observation by predicting at more and more points

(This is known as “in-fill asymptotics”). This type of question occurs when, for instance, we want to fill in the holes in satellite data (which are usually due to clouds).

This is the case that most closely resembles the design of the simulation study in this paper. In this case you refine your estimated coverage by computing more prediction intervals and checking if the true value lies within the interval.

Most of the easy to find results about coverage in these is from the 1D literature (specifically around smoothing splines and non-parametric regression). In these cases, it’s known that the first option is bad, the second option will lead to conservative regions (the coverage will be too high), the third option involves some sophisticated understanding of how Gaussian random fields work, and the fourth is not something I know anything about.

Option 2: We want to predict at one point, where the field will be monitored multiple times

This second option comes up when we’re looking at a long-term monitoring network. This type data is common in environmental science, where a long term network of sensors is set up to monitor, for example, air pollution.  The new observations are not independent of the previous ones (there’s usually some sort of temporal structure), but independence can often be assumed if the observations are distant enough in time.

In this case, Option 1 will be the right way to construct your interval, option 2 will probably still be a bit broad but might be ok, and options 3 and 4 will probably be too narrow if the underlying process is smooth.

Option 3: Mixed asymptotics! You do both at once

Simulation studies are the last refuge of the damned.

I see the sun go down. I see the sun come up. I see a light beyond the frame.

So what are my suggestions for making this comparison better (other than making it bigger, harder, and dumping the weird CVG criterion)?

  • randomise
  • randomise
  • randomise

What do I mean by that?  Well in the simulation study, the paper only considered one possible set of data simulated from the correct model. All of the results in their Table 2, which contains the scores, and timings on the simulated data, depends on this particular realisation. And hence Table 2 is a realisation of a random variable that will have a mean and standard deviation.

This should not be taken as an endorsement of the frequentist view that the observed data is random and estimators should be evaluated by their average performance over different realisation of the data. This is an acknowledgement of the fact that in this case the data is actually a realisation of a random variable. Reporting the variation in Table 2 would give an idea of the variation in the performance of the method. And would lead to a more nuanced and realistic comparison of the methods. It is not difficult to imagine that for some of these criteria there is no clear winner when averaged over data sets.

Where did you get that painter in your pocket?

I have very mixed feelings about the timings column in the results table.  On one hand, an “order of magnitude” estimate of how long this will actually take to fit is probably a useful thing for a person considering using a method.  On the other hand, there is just no way for these results not to be misleading. And the paper acknowledges this.

Similarly, the competition does not specify things like priors for the Bayesian solutions. This makes it difficult to really compare things like interval estimates, which can strongly depend on the specified priors.  You could certainly improve your chances of winning on the CVG computation for the simulation study by choosing your priors carefully!

What is this six-stringed instrument but an adolescent loom?

I haven’t really talked about the real data performance yet. Part of this is because I don’t think real data is particularly useful for evaluating algorithms. More likely, you’re evaluating your chosen data set as much as, or even more than, you are evaluating your algorithm.

Why? Because real data doesn’t follow the model, so even if a particular method gives a terrible approximation to the inference you’d get from the “correct” model, it might do very very well on the particular data set.  I’m not sure how you can draw any sort of meaningful conclusion from this type of situation.

I mean, I should be happy I guess because the method I work on “won” three of the scores, and did fairly well in the other two. But there’s no way to say that wasn’t just luck.

What does luck look like in this context? It could be that the SPDE approximation is a better model for the data than the “correct” Gaussian random field model. It could just be Finn appealing to the old Norse gods. It’s really hard to  tell.

If any real data is to be used to make general claims about how well algorithms work, I think it’s necessary to use a lot  of different data sets rather than just one.

Similarly, a range of different simulation study scenarios would give a broader picture of when different approximations behave better.

Don’t dream it’s over

One more kiss before we part: This field is still alive and kicking. One of the really exciting new ideas in the field (that’s probably too new to be in the comparison) is that you can speed up the computation of the unnormalised log-posterior through hierarchical decompositions of the covariance matrix (there is also code). This is a really neat method for solving the problem and a really exciting new idea in the field.

There are a bunch of other things that are probably worth looking at in this article, but I’ve run out of energy for the moment. Probably the most interesting thing for me is that a lot of the methods that did well (SPDEs, Predictive Processes, Fixed Rank Kriging, Multi-resolution Approximation, Lattice Krig, Nearest-Neighbour Predictive Processes)   are cut from very similar cloth. It would be interesting to look deeper at the similarities and differences in an attempt to explain these results.


  • Finn Lindgren has pointed out that the code will be made available online. Currently there is a gap between intention and deed, but I’ll update the post with a link in the event that it makes it online.

“La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.

Someone pointed me to a blog post, Negative Psychology, from 2014 by Jim Coan about the replication crisis in psychology.

My reaction: I find it hard to make sense of what he is saying because he doesn’t offer any examples of the “negative psychology” phenomenon that he discussing. I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.

I read more by Coan and he seems to me making a common mistake which is to conflate scientific error with character flaw. He thinks that critics of bad research are personally criticizing scientists. And, conversely, since he knows that scientists are mostly good people, he resists criticism of their work. Well, hey, probably 500 years ago most astrologers were good people too, but this doesn’t mean their work was any good to anyone! It’s not just about character, it’s also about data and models and methods. One reason I prefer to use the neutral term “forking paths” rather than the value-laden term “p-hacking” is that I want to emphasize that scientists can do bad work, even if they’re trying their best to do good work. I have no reason to think that John Bargh, Roy Baumeister, Ellen Langer, etc. want to be pushing around noise and making unreplicable claims. I’m sure they’d love to do good empirical science and they think they are. But, y’know, GIGO.

Good character is not enough. All the personal integrity in the world won’t help you if your measurements are super-noisy and if you’re using statistical methods that don’t work well in the presence of noise.

And, of course, once NPR, Gladwell, and Ted talks get involved, all the incentives go in the wrong direction. Researchers such as Coan have every motivation to exaggerate and very little motivation to admit error or even uncertainty.

My correspondent responds:

This is also a problem in medicine as I am sure you already know. This effect should be named: so much noise it makes you deaf to constructive criticism :) Unfortunately, this affects many people’s lives and I think it should be brought to light. Besides, constructive criticism is one of the pillars of science.

As Karl Pearson wrote in 1900:

In an age like our own, which is essentially an age of scientific injury, the prevalence of doubt and criticism ought not to be regarded with despair or as a sign of decadence. It is one of the safeguards of progress; la critique est la vie de la science, I must again repeat. One of the most fatal (and not so impossible) futures for science would be the institution of a scientific hierarchy which would brand as heretical all doubt as to its conclusions, all criticism of its results.

P.S. This post happens to be appearing shortly after a discussion on replicability and scientific criticism. Just a coincidence. I wrote the post several months ago (see here for the full list).

Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference

In a post from 22 May 2017 entitled, “Who is Going to Win the Batting Crown?”, Jim Albert writes:

At this point in the season, folks are interested in extreme stats and want to predict final season measures. On the morning of Saturday May 20, here are the leading batting averages:

Justin Turner .379
Ryan Zimmerman .375
Buster Posey .369

At the end of this season, who among these three will have the highest average? . . . these batting averages are based on a small number of at-bats (between 120 and 144) and one expects all of these extreme averages to move towards the mean as the season progresses. One might think that Turner will win the batting crown, but certainly not with a batting average of .379. . . .

I’m scheduling this post to appear in October, at which point we’ll know the answer!

Albert makes his season predictions not just using current batting average, but also using strikeout rates, home run rates, and batting average for balls in play. I think he’s only using data from the current year, which doesn’t seem quite right, but I guess it’s fine given that this is a demonstration of statistical methods and is not intended to represent a full-information prediction. In any case, here’s what he concludes in May:

I [Albert] predict Posey to finish with a .311 average followed by Zimmerman at .305 and Turner at .297.

These are reasonable predictions. But . . . I’m guessing that the league-leading batting average will be higher than .311!

Why do I say this? Check out recent history. The top batting averages in the past ten seasons (listed most recent year first) have been .348, .333, .319, .331, .336, .337, .336, .342, .364, .340. Actually, it looks like the top batting average in MLB has never been as low as .311. So I doubt that will happen this year. In 2016 there appear to have been 12 players who batted over .311 during the season.

What happened? Nothing wrong with Albert’s predictions. He’s just giving the posterior mean for each player, which cannot be directly examined to given an inference for the maximum over all players. Assuming he’s fitting his models in Stan—there’s no good reason to do otherwise—he’s also getting posterior simulation draws. He could then simulate, say, 1000 possibilities for the end-of-season records—and there he’d find that in just about any particular simulation the top batting average will exceed .311. Lots of players have a chance to make it, not just those three listed above.

This is not to diss Albert’s post; I’m just extending it by demonstrating out the perils of estimating extreme values from point predictions. It’s an issue that Phil and I discussed in our article, All maps of parameter estimates are misleading.

P.S. This post is appearing, as scheduled, on 19 Oct, during the playoffs. The season’s over so we can check what happened:

Buster Posey hit .320
Ryan Zimmerman hit .303
Justin Turner hit .322.

The league-leading batting averages were Charlie Blackmon at .331 and Jose Altuve at .346. So Albert’s predictions were not far off (these three batters did a bit better than the point predictions but I assume they’re well within the margin of error) and, indeed, it was two other hitters that won the batting titles.

From a math point of view, it’s an interesting example of how the mean of the maximum of a set of random variables is higher than the max of the individual means.

Beyond “power pose”: Using replication failures and a better understanding of data collection and analysis to do better science

So. A bunch of people pointed me to a New York Times article by Susan Dominus about Amy Cuddy, the psychology researcher and Ted-talk star famous for the following claim (made in a paper written with Dana Carney and Andy Yap and published in 2010):

That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

Awkwardly enough, no support for that particular high-stakes claim was ever presented in the journal article where it appeared. And, even more awkwardly, key specific claims for which the paper did offer some empirical evidence for, failed to show up in a series of external replication studies, first by Ranehill et al. in 2015 and then more recently various other research teams (see, for example, here). Following up on the Ranehill et al. paper was an analysis by Joe Simmons and Uri Simonsohn explaining how Carney, Cuddy, and Yap could’ve gotten it wrong in the first place. Also awkward was a full retraction by first author Dana Carney, who detailed many ways in which the data were handled in order to pull out apparently statistically significant findings.

Anyway, that’s all background. I think Dominus’s article is fair, given the inevitable space limitations. I wouldn’t’ve chosen to have written an article about Amy Cuddy—I think Eva Ranehill or Uri Simonsohn would be much more interesting subjects. But, conditional on the article being written largely from Cuddy’s perspective, I think it portrays the rest of us in a reasonable way. As I said to Dominus when she interviewed me, I don’t have any personal animosity toward Cuddy. I just think it’s too bad that the Carney/Cuddy/Yap paper got all that publicity and that Cuddy got herself tangled up in defending it. It’s admirable that Carney just walked away from it all. And it’s probably a good call of Yap to pretty much have avoided any further involvement in the matter.

The only thing that really bugged me about the NYT article is when Cuddy is quoted as saying, “Why not help social psychologists instead of attacking them on your blog?” and there is no quoted response from me. I remember this came up when Dominus interviewed me for the story, and I responded right away that I have helped social psychologists! A lot. I’ve given many talks during the past few years to psychology departments and at professional meetings, and I’ve published several papers in psychology and related fields on how to do better applied research, for example here, here, here, here, here, here, here, and here. I even wrote an article, with Hilda Geurts, for The Clinical Neuropsychologist! So, yeah, I do spend some time helping social psychologists.

Dominus also writes, “Gelman considers himself someone who is doing others the favor of pointing out their errors, a service for which he would be grateful, he says.” This too is accurate, and let me also emphasize that this is a service for which I not only would be grateful. I actually am grateful when people point out my errors. It’s happened several times; see for example here. When we do science, we can make mistakes. That’s fine. What’s important is to learn from our mistakes.

In summary, I think Dominus’s article was fair, but I do wish she hadn’t let that particular false implication by Cuddy, the claim that I didn’t help social psychologists, go unchallenged. Then again, I also don’t like it that Cuddy baselessly attacked the work of Simmons and Simonsohn and to my knowledge never has apologized for that. (I’m thinking of Cuddy’s statement, quoted here, that Simmons and Simonsohn “are flat-out wrong. Their analyses are riddled with mistakes . . .” I never saw Cuddy present any evidence for these claims.)

Good people can do bad science. Indeed, if you have bad data you’ll do bad science (or, at best, report null findings), no matter how good a person you are.

Let me continue by saying something I’ve said before, which is that being a scientist, and being a good person, does not necessarily mean that you’re doing good science. I don’t know Cuddy personally, but given everything I’ve read, I imagine that she’s a kind, thoughtful, and charming person. I’ve heard that Daryl Bem is a nice guy too. And I expect Satoshi Kanazawa has many fine features too. In any case, it’s not my job to judge these people nor is it their job to judge me. A few hundred years ago, I expect there were some wonderful, thoughtful, intelligent, good people doing astrology. That doesn’t mean that they were doing good science!

If your measurements are too noisy (again, see here for details), it doesn’t matter how good a person you are, you won’t be able to use your data to make replicable predictions of the world or evaluate your theories: You won’t be able to do empirical science.

Conversely, if Eva Ranehill, or Uri Simonsohn, or me, or anyone else, performs a replication (and don’t forget the time-reversal heuristic) or analyzes your experimental protocol or looks carefully at your data and finds that your data are too noisy for you to learn anything useful, then they may be saying you’re doing bad science, but they’re not saying you’re a bad person.

As the subtitle of Dominus’s excellent article says, “suddenly, the rules changed.” It happened over several years, but it really did feel like something sudden. And, yes, Carney, Cuddy, and Yap ideally should’ve known back in 2010 that they were chasing for patterns in noise. But they, like many others, didn’t. They, and we, were fortunate to have Ranehill et al. reveal some problems in their study with the failed replication. And they, and we, were fortunate to have Simmons, Simonsohn, and others explain in more detail how they could’ve got things wrong. Through this and other examples of failed studies (most notably Bem’s ESP paper, but also hopelessly flawed Kanazawa and many others), and through lots of work by psychologists such as Nosek and others, we are developing a better understanding of how to do research on unstable, context-dependent human phenomena. There’s no reason to think of the authors of those fatally flawed papers as being bad people. We learn, individually and collectively, from our mistakes. We’re all part of the process, and Dominus is doing the readers of the New York Times a favor by revealing one part of that process from the inside. Instead of the usual journalistic trope of scientist as hero, it’s science as community, including confusion, miscommunication, error, and an understanding that a certain research method that used to be popular and associated with worldly success—the method of trying out some vaguely motivated idea, gathering a bunch of noisy data, and looking for patterns—doesn’t work so well at producing sensible or replicable results. That’s a good thing to know, and it could well be interesting for outsiders to see the missteps it took for us all to get there.

Selection bias in what gets reported

When people make statistical errors, I don’t say “gotcha,” I feel sad. Even when I joke about it, I’m not happy to see the mistakes; indeed, I often blame the statistics profession—including me, as a textbook writer!—for portraying statistical methods as tools for routine discovery: Do the randomization, gather the data, pass statistical significance and collect $200.

Regarding what gets mentioned in the newspapers and in the blogs, there’s some selection bias. A lot of selection bias, actually. Suppose, for example, that Daryl Bem had not made the serious, fatal mistakes he’d made in his ESP research.. Suppose he’d fit a hierarchical model or done a preregistered replication or used some other procedure to avoid jumping at patterns in noise. That would’ve been great. And then he most likely would’ve found nothing distinguishable from a null effect, no publication in JPSP (no, I don’t think they’d publish the results of a large multi-year study finding no effect for a phenomenon that most psychologists don’t believe in the first place), no article on Bem in the NYT . . . indeed, I never would’ve heard of Bem!

Think of the thousands of careful scientists who, for whatever combination of curiosity or personal interests or heterodoxy, decide to study offbeat topics such as ESP or the effect of posture on life success—but who conduct their studies carefully, gathering high-quality data, and using designs and analyses that minimize the chances of being fooled by noise. These researchers will, by and large, quietly find null results, which for very reasonable dog-bite-man reasons will typically be unpublishable, or only publishable in minor journals and will not be likely to inspire lots of news coverage. So we won’t hear about them.

Conversely, I’ll accept the statement that Cuddy in her Ted talks could be inspiring millions of people in a good way, even if power pose does nothing, or even does more harm than good. (I assume it depends on context, that power pose will do more good than harm in some settings, and more harm than good in others). The challenge for Cuddy—and in all seriousness I hope she follows up on this—is to be this inspirational figure, to communicate to those millions, in a way that respects the science. I hope Cuddy can stop insulting Simmons and Simonsohn, forget about the claims of the absolute effects of power pose, and move forward, sending the message that people can help themselves by taking charge of their environment, by embodying who they want to be. The funny thing is, I think that pretty much is the message of that famous Ted talk, and that the message would be stronger without silly, unsupported claims such as “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”

A way forward

People criticize Cuddy for hyping her science and making it into a Ted talk. But, paradoxically, I’m now thinking we should be saying the opposite. The Ted talk has a lot going for it: it’s much stronger than the journal articles that justify it and purportedly back it up. I have the impression that Cuddy and others think the science of power pose needs to be defended in part because of its role in this larger edifice, but I recommend that Cuddy and her colleagues go the other way: follow the lead of Dana Carney, Eva Ranehill, et al., and abandon the scientific claims, which ultimately were based on an overinterpretation of noise (again, recall the time-reversal heuristic)—and then let the inspirational Ted talk advice fly free of that scientific dead end. There are lots of interesting ways to study how people can help themselves through tools such as posture and visualization, but I think these have to be studied for real, not through crude button-pushing ideas such as power pose but through careful studies on individuals, recognizing that different postures, breathing exercises, yoga moves, etc., will work for different people. Lots of interesting ideas here, and it does these ideas no favor to tie them to some silly paper published in 2010 that happened to get a bunch of publicity. The idea is to take the positive aspects of the work of Cuddy and others—the inspirational message that rings true for millions of people—and to investigate it using a more modern, data-rich, within-person model of scientific investigation. That’s the sort of thing that should one day be appearing in the pages of Psychological Science.

I think Cuddy has the opportunity to take her fame and her energy and her charm and her good will and her communication skills and her desire to change the world and take her field in a useful direction. Or not. It’s her call, and she has no obligation to do what I think would be a good idea. I just wanted to emphasize that there’s no reason her career, or even her famous Ted talk, has to rely on a particular intriguing idea (on there being a large and predictable effect of a certain pose) that happened not to work out. And I thank Dominus for getting us all to think about these issues.

P.S. There are a bunch of comments, including from some people who strongly disagree with me—which I appreciate! That is, I appreciate that people who disagree are taking the trouble to share their perspective and engage in discussion here.

There are a lot of details above so maybe it would be a good idea to emphasize some key points:
Continue reading ‘Beyond “power pose”: Using replication failures and a better understanding of data collection and analysis to do better science’ »

No tradeoff between regularization and discovery

We had a couple recent discussions regarding questionable claims based on p-values extracted from forking paths, and in both cases (a study “trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses,” and a salami-slicing exercise looking for public opinion changes in subgroups of the population), I recommended fitting a multilevel model to estimate the effects in question. The idea is that such a model will estimate a distribution of treatment effects that is concentrated near zero, and the resulting inferences for the individual effects will be partially pooled toward zero, with the anticipated result in these cases that none of the claims will be so strong any more.

Here’s a simple example:

Suppose the prior distribution, as estimated by the hierarchical model, is that the population of effects has mean 0 and standard deviation of 0.1. And now suppose that the data-based estimate for one of the treatment effects is 0.5 with a standard error of 0.2 (thus, statistically significant at conventional levels). Also assume normal distributions all around. Then the posterior distribution for this particular treatment effect is normal with mean (0/0.1^2 + 0.5/0.2^2)/(1/0.1^2 + 1/0.2^2) = 0.10, with standard deviation 1/sqrt(1/0.1^2 + 1/0.2^2) = 0.09. Based on this inference, there’s an 87% posterior probability that the treatment effect is positive.

We could expand this hypothetical example by considering possible alternative prior distributions for the unknown treatment effect. Uniform(-inf,inf) is just too weak. Perhaps normal(0,0.1) is also weakly informative, and maybe the actual population distribution of the true effects is something like normal(0,0.05). In that case, using the normal(0,0.1) prior as above will under-pool, that is, the inference will be anti-conservative and be too susceptible to noise.

With a normal(0,0.05) prior and normal(0.5,0.2) data, you’ll get a posterior that’s normal with mean (0/0.05^2 + 0.5/0.2^2)/(1/0.05^2 + 1/0.2^2) = 0.03, with standard deviation 1/sqrt(1/0.05^2 + 1/0.2^2) = 0.05. Thus, the treatment effect is likely to be small, and there’s a 72% chance that it is positive.

Also, all this assumes zero bias in measurement and estimation, which is just about never correct but can be an ok approximation when standard errors are large. Once the standard error becomes small, then we should think about including an error term to allow for bias, to avoid ending up with too-strong claims.

Regularization vs. discovery?

The above procedure is an example of regularization or smoothing, and from the Bayesian perspective it’s the right thing to do, combining prior information and data to get probabilistic inference.

A concern is sometimes raised, however, that regularization gets in the way of discovery. By partially pooling estimates toward zero, are we reducing our ability to discover new and surprising effects?

My answer is no, there’s not a tradeoff between regularization and discovery.

How is that? Consider the example above, with the 0 ± 0.05 prior with 0.5 ± 0.2 data. Our prior pulls the estimate to 0.03 ± 0.05, thus moving the estimate from clearly statistically significant (2.5 standard errors away from 0) to not even close to statistical significance (less than 1 standard error from zero).

So we’ve lost the opportunity for discovery, right?


There’s nothing stopping you from gathering more data to pursue this possible effect you’ve discovered. Or, if you can’t gather such data, you just have to accept this uncertainty.

If you want to be more open to discovery, you can pursue more leads and gather more and higher quality data. That’s how discovery happens.

B-b-b-but, you might say, what about discovery by luck? By regularizing, are we losing the ability to get lucky? Even if our hypotheses are mere lottery tickets, why throw away tickets that might contain a winner?

Here, my answer is: If you want to label something that might likely be wrong as a “discovery,” that’s fine by me! No need for a discovery to represent certainty or even to represent near-certainty. In the above example, we have a 73% posterior probability of seeing a positive effect in an exact replication study. Call that a discovery if you’d like. Integrate this discovery into your theoretical and practical understanding of the world and use it to decide where to go next.

P.S. The above could be performed using longer-tailed distributions if that’s more appropriate for the problem under consideration. The numbers will change but the general principles are the same.

From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.

This exchange came from a comment thread last year.

Diana Senechal points to this bizarre thing:

Brian Little says in Me, Myself, and Us (regarding the “lemon introvert test”):

One of the more interesting ways of informally assessing extraversion at the biogenic level is to do the lemon-drop test. [Description of experiment omitted from present quote—DS.] For some people the swab will remain horizontal. For others it will dip on the lemon juice end. Can you guess which? For the extraverts, the swab stays relatively horizontal, but for introverts it dips. . . . I have done this exercise on myself a number of times, and each time my swab dips deeply. I am, at least by this measure, a biogenic introvert.

I mean, really . . .

This claim has (at least) two serious problems: first, the weirdly overconfident button-pushing model of science, which reminds me of phrenological ideas from the schoolyard that how you lift your hands or cross your legs reveals some deep truth about you; and, second, the complete lack of understanding of variation, the idea that this thing would work every time. (Recall this similar attitude of researchers who felt the need for their theory to explain every case.)

Here, though, I want to point out a more positive take on this story. From my response to Senechal on that thread:

Maybe we’re making progress. 50 years ago we’d be hearing rumors of perpetual motion machines, car engines that ran on water, spoon bending, and even bigfoot. Now the purveyors of pseudoscience have moved to embodied cognition, lemon juice extraversion, power pose, and himmicanes. The boundaries of pseudoscience are being pushed back into the trivial. From perpetual motion to the lemon juice test: in the grand scheme of things this is a retreat.

Just to elaborate on this point: Bigfoot of course is trivial, but the point about perpetual motion machines etc. is that, if they were real, they’d imply huge changes in our understanding of physics. Similarly with the old story about the car engine that ran on water that some guy built in his backyard but was then suppressed by the powers-that-be in Detroit: if true, this would have implied huge changes in our understanding of physics and of economics. By comparison, the new cargo cult science of embodied cognition, shark attacks, smiley faces, beauty and sex ratio, etc., is so moderate and trivial: Any of these claims could be true; they’re just not well supported by the evidence, and as a whole they don’t fit together. To put it another way, the himmicanes story is not as obviously silly as, say, those photographs of fairies that fooled Arthur Conan Doyle, or the bending spoons and Noah’s ark stories that fooled people in the 1970s. So I think this is a positive development, that even though pseudoscience still is prominent in NPR, Ted, PPNAS, etc., it’s a fuzzier sort of pseudoscience, not as demonstrably wrong as perpetual motion machines, Nessie, and all that old-school hocus-pocus.

Analyzing New Zealand fatal traffic crashes in Stan with added open-access science

Open-access science

I’ll get to the meat of this post in a second, but I just wanted to highlight how the study I’m about to talk about was done in the open and how that helped everyone. Tim Makarios read the study and responded in the blog comments,

Hold on. As I first skimmed this post, I happened to have, on the coffee table next to me, half a sheet of The Evening Post dated July 18, 1984. At the bottom of page 2, there’s a note saying “The Ministry of Transport said today that 391 road deaths had been reported so far this year. This compared with 315 at the same time last year.”

How is there such a big discrepancy with your chart of Sam Warburton’s data?

to which Peter Ellis, the author of the study responded in typical open-source fashion, encouraging the original poster to dig deeper and report back,

Good question. I’m not a traffic crash expert, the spirit of open source is – you let me know when you think you’ve worked it out. Obviously these are measuring two different things, I’m interested to know what! Thanks.

The question apparently prompted Peter to look himself; he followed up with

Looks like I misread the data for that particular bit of the analysis and my graphic was only showing *driver* deaths. I’ve updated it and the source code so it shows total casualties, which are consistent with that number in your old paper. Thanks for alerting me to that.

I love to see open science in action! Anyway, onto the real topic here.

New Zealand fatal traffic crashes

The above was an exchange about Peter Ellis’s analysis,

In his at-a-glance summary, Peter says,

I explore half a million rows of disaggregated crash data for New Zealand, and along the way illustrate geo-spatial projections, maps, forecasting with ensembles of methods, a state space model for change over time, and a generalized linear model for understanding interactions in a three-way cross tab.

I’d highly recommend it if you are interested in spatio-temporal modeling in particular or even just in plotting. It has great plots, very nice Stan code, and lots of great exploratory and Bayesian data analysis.

Cold wind from the north

We’ve had a rash of work lately on spatial models; must be the wind blowing down from the north (Finland and Toronto, specifically).

Contribute to Stan case studies?

Peter, if you’re listening and would be willing to release this open source, it’d make a great Stan case study. If you already submitted it for StanCon 2018, thanks! We’ll all be getting the New Year’s gift of a couple dozen new Stan case studies!

Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment

Under the heading, “Incompetent leaders as a protection against elite betrayal,” Tyler Cowen linked to this paper, “Populism and the Return of the ‘Paranoid Style’: Some Evidence and a Simple Model of Demand for Incompetence as Insurance against Elite Betrayal,” by Rafael Di Tella and Julio Rotemberg.

From a statistical perspective, the article by Tella and Rotemberg is a disaster of forking paths, as can be seen even from the abstract:

We present a simple model of populism as the rejection of “disloyal” leaders. We show that adding the assumption that people are worse off when they experience low income as a result of leader betrayal (than when it is the result of bad luck) to a simple voter choice model yields a preference for incompetent leaders. These deliver worse material outcomes in general, but they reduce the feelings of betrayal during bad times. Some evidence consistent with our model is gathered from the Trump-Clinton 2016 election: on average, subjects primed with the importance of competence in policymaking decrease their support for Trump, the candidate who scores lower on competence in our survey. But two groups respond to the treatment with a large (between 5 and 7 percentage points) increase in their support for Donald Trump: those living in rural areas and those that are low educated, white and living in urban and suburban areas.

There are just so many reasonable interactions that one could look at here, also no reason at all that we’d expect to be in a “needle in a haystack” situation in which there are one or two very large effects and a bunch of zeroes. So it doesn’t make sense to pull out various differences that happen to be large in these particular data and then spin out stories. The trouble is that this approach has poor statistical properties under repeated sampling: with another dataset sampled from the same population, you could find other patterns and tell other stories.

It’s not that Tella and Rotemberg are necessarily wrong in their conclusions (or Cowen wrong in taking these conclusions seriously), but I don’t think these data are helping here: they all might be better off just speculating based on other things they’ve heard.

What to do, then? Preregistered replication (as in 50 shades of gray), sure. But, before then, I’d suggest multilevel modeling and partial pooling to get a better handle on what an be learned from their existing data.

This could be an interesting project: to get the raw data from the above study and reanalyze using multilevel modeling.

Baseball, apple pie, and Stan

Ben sends along these two baseball job ads that mention experience with Stan as a preferred qualification:

St. Louis Cardinals Baseball Development Analyst

Tampa Bay Rays Baseball Research and Development Analyst

Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”

From someone who would prefer to remain anonymous:

As you may know, the relatively recent “orphan drug” laws allow (basically) companies that can prove an off-patent drug treats an otherwise untreatable illness, to obtain intellectual property protection for otherwise generic or dead drugs. This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.

Charcot-Marie-Tooth is a moderately rare genetic degenerative peripheral nerve disease with no known treatment. CMT causes the Schwann cells, which surround the peripheral nerves, to weaken and eventually die, leading to demyelination of the nerves, a loss of nerve conduction velocity, and an eventual loss of nerve efficacy.

PXT3003 is a drug currently in Phase 2 clinical testing to treat CMT. PXT3003 consists of a mixture of low doses of baclofen (an off-patent muscle relaxant), naltrexone (an off-patent medication used to treat alcoholism and opiate dependency), and sorbitol (a sugar substitute.)

Pre-phase 2 results from PXT3003 are shown here.

I call your attention to Figure 2, and note that in Phase 2, efficacy will be measured exclusively by the ONLS score.

My reply: 33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??

I sent this exchange to a colleague, who wrote:

In a past life I did mutual fund research. One of the fun things that fund managers do is “incubate” dozens of funds with their own money. Some do very well, others do miserably. They liquidate the poorly performing funds and “open” the high-performing funds to public investment (of course, reporting the fantastic historical earnings to the fund databases). Then sit back and watch the inflows (and management fees) pour in.

Stan case studies

Following up on recent posts here and here, I thought I’d post a list of all the Stan case studies we have so far.


Modeling Loss Curves in Insurance with RStan, by Mick Cooney
Splines in Stan, by Milad Kharratzadeh
Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data, by Mitzi Morris
The QR Decomposition for Regression Models, by Michael Betancourt
Robust RStan Workflow, by Michael Betancourt
Robust PyStan Workflow, by Michael Betancourt
Typical Sets and the Curse of Dimensionality, by Bob Carpenter
Diagnosing Biased Inference with Divergences, by Michael Betancourt
Identifying Bayesian Mixture Models, by Michael Betancourt
How the Shape of a Weakly Informative Prior Affects Inferences, by Michael Betancourt


Exact Sparse CAR Models in Stan, by Max Joseph
A Primer on Bayesian Multilevel Modeling using PyStan, by Chris Fonnesbeck
The Impact of Reparameterization on Point Estimates, by Bob Carpenter
Hierarchical Two-Parameter Logistic Item Response Model, by Daniel Furr
Rating Scale and Generalized Rating Scale Models with Latent Regression, by Daniel Furr
Partial Credit and Generalized Partial Credit Models with Latent Regression, by Daniel Furr
Rasch and Two-Parameter Logistic Item Response Models with Latent Regression, by Daniel Furr
Two-Parameter Logistic Item Response Model, by Daniel Furr, Seung Yeon Lee, Joon-Ho Lee, and Sophia Rabe-Hesketh
Cognitive Diagnosis Model: DINA model with independent attributes, by Seung Yeon Lee
Pooling with Hierarchical Models for Repeated Binary Trials, by Bob Carpenter


Multiple Species-Site Occupancy Model, by Bob Carpenter


Soil Carbon Modeling with RStan, by Bob Carpenter

“Bayesian evidence synthesis”

Donny Williams writes:

My colleagues and I have a paper recently accepted in the journal Psychological Science in which we “bang” on Bayes factors. We explicitly show how the Bayes factor varies according to tau (I thought you might find this interesting for yourself and your blog’s readers). There is also a very nice figure.

Here is a brief excerpt:

Whereas BES [a new method introduced by EJ Wagenmakers] assumes zero between-study variability, a multilevel model does not make this assumption and allows for examining the influence of heterogeneity on Bayes factors. Indeed, allowing for some variability substantially reduced the evidence in favor of an effect….In conclusion, we strongly caution against BES and suggest that researchers wanting to use Bayesian methods adopt a multilevel approach.

My reply: I just have one suggestion if it’s not too late for you to make the change. In your title you refer to “Bayesian Evidence Synthesis” which is a kind of brand name for a particular method. One could also speak of “Bayesian evidence synthesis” as referring to methods of synthesizing evidence using Bayesian models, for example this would include the multilevel approach that you prefer. The trouble is that many (most) readers of your paper will not have heard of the brand name “Bayesian Evidence Synthesis”—I had not heard of it myself!—and so they will erroneously take your paper as a slam on Bayesian evidence synthesis.

This is similar to how you can be a democrat without being a Democrat, or a republican without being a Republican.

P.S. Williams replied in comments. He and his collaborators revised the paper, including changing the title; the new version is here.

Mick Cooney: case study on modeling loss curves in insurance with RStan

This is great. Thanks, Mick!

All the Stan case studies are here.

“I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”

Angus Reynolds sent me a long email. I’ll share it in a moment but first here’s my reply:

I don’t have much to say here, except that:

1. It’s nearly a year later but Christmas is coming again so here’s my post.

2. Yes, the effects of local weather on climate change attitudes do seem worth studying in a more systematic way. I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.

There’s some general principle here, I think, which is worth exploring further.

3. Yes, they really do say “Happy holidays” here. Or “Have a good winter break.” Also “Merry Christmas” etc. I’ve never heard anyone say “Season’s greetings”—that just sounds like something you’d see on a Hallmark card.

And here’s what Reynolds sent me:

I figure it’s likely you will not reply given it’s nearly Christmas and you may have already come across it. But I thought it might be worth covering on your blog. I think it fits the theme of probably-correct-social-science-that-could-be-done-better, which you do seem to cover a bit.

Yesterday a study was published in the Proceedings of the National Academy of Sciences linking the proportion of people in an area who believe that climate change is happening with the most recent local weather record.

They conclude that people who have experienced more recent record highs are more likely to accept climate change than people who have experienced record lows (actually it’s slightly more complicated—“the number of days per year for which the year of the record high temperature is more recent than the year of the record low temperature”).

While I find the claim plausible if not likely I don’t really understand the analysis they have done, and find it all a bit weird.
I think they may have fallen into the camp of – got access to lots of noisy data, built a metric, done some statistical tests with significant p-values and then come up with a narrative to fit the results. Something weird is going on with the significance levels- “Levels of significance (*5%; **1%; ***10%).”
Wouldn’t that normally be *** .1% ? Or just (*.05; **.01; ***.001), so you don’t end up making typos with percentages.

Interesting that usually you have to explain to people that weather is not climate (and that weather data is noisy), but in this case it’s how people’s experience of one (which is still going to be noisy) effects their belief about the other.

So I don’t think they have really collected data that truly tests their theory, and then built and tested a model to explain the data. I notice from looking at the maps of America that pretty much everywhere has had “red” levels of TMax and the regions that have the most blue levels are mostly in the mid-west. Wouldn’t it be better to build in a model that also factors in other predictors of belief like region, climate in that region, wealth, levels of education etc. Shouldn’t it also include record highs for each season, rather than just the year? A particularly hot winter or chilly summer might convince someone more/less that climate change is happening than yet another stinking hot Californian summer.
In Australia it feels like we have been living through a handful of “Once in a lifetime weather events” each year.
I guess there’s just so many different maximums that can be broken – hottest day, month, season or year.

The more I think about it the more things I’d want to check, but I’ve already got more than enough ideas for my own research so I should stop.

What I think would be cool if someone could build a model that tracks change in social attitudes about climate change over time (and if local weather events have any effect).

Cheers and Happy Holidays (which apparently is the preferred season greeting in New York)

“congratulations, your article is published!” Ummm . . .

The following came in the email under the heading, “congratulations, your article is published!”:

I don’t know that I should be congratulated on correcting an error, but sure, whatever.

P.S. The above cat is adorably looking out and will notice all of your errors.

When do we want evidence-based change? Not “after peer review”

Jonathan Falk sent me the above image in an email with subject line, “If this isn’t the picture for some future blog entry I’ll never forgive you.” This was a credible threat so here’s the post.

But I don’t agree with that placard at all!

Waiting for peer review is a bad idea for two reasons: First because you might be waiting for a really long time (especially if an econ journal is involved), second because all sorts of bad stuff gets through peer review—just search this blog for PPNAS.

Falk replied:

Completely agree… That’s what makes it funny, no?

Making a living outside academia is all about this for me. When asked on the stand whether some analysis of mine has had peer review, my answer is yes, because my colleagues have reviewed it. But if we had to defend on traditional academic peer review, the time lags would put us out of business—buy of course the REAL peer review is performed by the experts in the other side. Consulting experts are the replication ideal, no?

I don’t know enough about legal consulting to have an opinion on whether consulting experts are the replication ideal, but I do know that I don’t want to wait around for peer review.

Another way to see this is from a decision-analytic perspective. There are potential costs and benefits to change, potential costs and benefits to remaining with the status quo, and potential costs and benefits to waiting for more information. A decision among these options has to depend to some extent on these costs and benefits, which are considered only obliquely in any peer review process.

Postdoc in Finland and NY to work on probabilistic inference and Stan!

I (Aki) got 2 year funding to hire a postdoc to work on validation of probabilistic inference approaches and model selection in Stan. Work would be done with Stan team in Aalto, Helsinki and Columbia, New York. We probably have PhD positions, too.

The funding is part of the joint project with Antti Honkela and Arto Klami at University of Helsinki and we are together hiring 3 postdocs. Two other postdocs would work in Helsinki and part of the time in DTU, Denmark (Ole Winther) or Cambridge, UK (Zoubin Ghahramani).

The project is about theory and methods for assessing the quality of distributional approximations, improving the inference accuracy by targeting the approximation towards the eventual application goal and by better utilising the available data, e.g., when having data with privacy constraints.

More information about the position and how to apply is here.

You can manage very well in Finland with English and you don’t need to learn any Finnish for the job. Helsinki has been selected many times among world’s top 10 liveable cities

Does racquetball save lives?

Asher Meir points to this news report and writes:

8e5 people in study, about half reported exercising, about half not. About 10% died overall. So overall death rate difference of 28% is pretty remarkable. It means about 3500 deaths instead of 4500 for a similar sample size.

But when you compare the rate of heart disease risk specifically (about 2% died of heart disease, or around 1000 in each sample), for runners vs. racket sports specifically (less than 10% each) you are really shooting in the dark. Say around 5000 people engaged in each kind of activity and around 100 died of heart disease in each group, sounds like normal variation.

Also they eliminated people who had heart disease at the beginning of the study, not sure why they would do this.

I guess the biggest issue is not controlling for endogeneity of activity, as the people who are frail and sickly are probably not engaging in much sports activity.

Not sure how much is author hype vs. journalist hype.

My reply: This reminds me of the “what does not kill me makes me stronger” principle. The elimination of people with risk at beginning of study, that’s interesting. I can see why it makes sense to do this and I can also see how it can cause bias. I guess the right way to do this is to express results conditional on initial risk.


Afterwards I realized the biggest weakness: They do not control for income. If you examine the results you see that the people who engage in the most expensive forms of exercise (e.g. racquetball) have the lowest morbidity. Could be just a proxy for income.

Anyway, a flawed study is better than no study, the next time they can try to control for income.

Yup. Gotta start somewhere. Make all your data and code available, don’t hype your claims, and we can go from there.

P.S. Meir also pointed me to this book, “The Lion in the Living Room: How House Cats Tamed Us and Took Over the World,” by Abigail Tucker. I haven’t read it—really I should spend less time online and more time reading books!—but it has a pretty good title, I’ll grant it that.

Halifax, NS, Stan talk and course Thu 19 Oct

Halfiax, here we come.

I (Bob, not Andrew) am going to be giving a talk on Stan and then Mitzi and I will be teaching a course on Stan after that. The public is invited, though space is limited for the course. Here are details if you happen to be in the Maritime provinces.

TALK: Stan: A Probabilistic Programming Language for Bayesian Inference

Date: Thursday October 19, 2017

Time: 10am

Location: Slonim Conference room (#430), Goldberg Computer Science Building, Dalhousie University, 6050 University Avenue, Halifax


I’ll describe Stan’s probabilistic programming language, and how it’s used, including

  • blocks for data, parameter, and predictive quantities
  • transforms of constrained parameters to unconstrained spaces, with automatic Jacobian corrections
  • automatic computation of first- and higher-order derivatives
  • operator, function, and linear algebra library
  • vectorized density functions, cumulative distributions, and random number generators
  • user-defined functions
  • (stiff) ordinary differential equation solvers

I’ll also provide an overview of the underlying algorithms for full Bayesian inference and for maximum likelihood estimation:

  • adaptive Hamiltonian Monte Carlo for MCMC
  • L-BFGS optimization and transforms for MLE

I’ll also briefly describe the user-facing interfaces: RStan (R), PyStan (Python), CmdStan (command line), Stan.jl (Julia), MatlabStan (MATLAB)

I’ll finish with an overview of the what’s on the immediate horizon:

  • GPU matrix operations
  • MPI multi-core, multi-machine parallelism
  • data parallel expectation propagation for approximate Bayes
  • marginal Laplace approximations

TUTORIAL: Introductio to Bayesian Modeling and Inference with RStan


  • Bob Carpenter, Columbia University
  • Mitzi Morris, Columbia University

Date: Thursday October 19, 2017

Time: 11:30am-5:30pm (following the seminar on Stan at 10am)

Location: Slonim Conference room (#430)
Goldberg Computer Science Building
Dalhousie University
6050 University Avenue, Halifax

Registration: EventBrite Registration Page


This short course will provide

  • an introduction to Bayesian modeling
  • an introduction to Monte Carlo methods for Bayesian inference
  • an overview of the probabilistic programming language Stan

Stan provides a language for coding Bayesian models along
with state-of-the-art inference algorithms based on gradients. There will be an overview of how Stan works, but the main focus will be on the RStan interface and building applied models.

The afternoon will be devoted to a case study of hierarchical modeling, the workhorse of applied Bayesian statistics. We will show how hierarchical models pool estimates toward the population means based on population variance and how this automatically estimates regularization and adjusts for multiple comparisons. The focus will be on probabilistic inference, and in particular on testing posterior predictive calibration and the sharpness of predictions.

Installing RStan: Please show up with RStan installed. Instructions are linked from here:

Warning: follow the instructions step-by-step; even though installation involves a CRAN package, it’s more complex than just installing from RStudio because a C++ toolchain is required at runtime.

If you run into trouble, please ask for help on our forums—they’re very friendly:

Full Day Schedule

10:00-11:00am Open Seminar – Introduction to the “Stan” System

11:00-11:30am Break

11:30am-1:00pm Tutorial part 1

1:00pm -2:00pm Lunch Break

2:00pm -3:30pm Tutorial part 2

3:30pm -3:45pm Break

3:45pm -5:30pm Tutorial part 3

Workshop on Interpretable Machine Learning

Andrew Gordon Wilson sends along this conference announcement:

NIPS 2017 Symposium
Interpretable Machine Learning
Long Beach, California, USA
December 7, 2017

Call for Papers:

We invite researchers to submit their recent work on interpretable machine learning from a wide range of approaches, including (1) methods that are designed to be more interpretable from the start, such as rule-based methods, (2) methods that produce insight into existing ML models, and (3) perspectives either for or against interpretability in general. Topics of interest include:

– Deep learning
– Kernel, tensor, graph, or probabilistic methods
– Automatic scientific discovery
– Safe AI and AI Ethics
– Causality
– Social Science
– Human-computer interaction
– Quantifying or visualizing interpretability
– Symbolic regression

Authors are welcome to submit 2-4 page extended abstracts, in the NIPS style. References and supplementary material are not included in the page limit. Author names do not need to be anonymized. Accepted papers will have the option of inclusion in the proceedings. Certain papers will also be selected to present spotlight talks. Email submissions to

Key Dates:

Submission Deadline: 20 Oct 2017
Acceptance Notification: 23 Oct 2017
Symposium: 7 Dec 2017

Workshop Overview:

Complex machine learning models, such as deep neural networks, have recently achieved outstanding predictive performance in a wide range of applications, including visual object recognition, speech perception, language modeling, and information retrieval. There has since been an explosion of interest in interpreting the representations learned and decisions made by these models, with profound implications for research into explainable ML, causality, safe AI, social science, automatic scientific discovery, human computer interaction (HCI), crowdsourcing, machine teaching, and AI ethics. This symposium is designed to broadly engage the machine learning community on the intersection of these topics—tying together many threads which are deeply related but often considered in isolation.

For example, we may build a complex model to predict levels of crime. Predictions on their own produce insights, but by interpreting the learned structure of the model, we can gain more important new insights into the processes driving crime, enabling us to develop more effective public policy. Moreover, if we learn that the model is making good predictions by discovering how the geometry of clusters of crime events affect future activity, we can use this knowledge to design even more successful predictive models. Similarly, if we wish to make AI systems deployed on self-driving cars safe, straightforward black-box models will not suffice, as we will need methods of understanding their rare but costly mistakes.

The symposium will feature invited talks and two panel discussions. One of the panels will have a moderated debate format where arguments are presented on each side of key topics chosen prior to the symposium, with the opportunity to follow-up each argument with questions. This format will encourage an interactive, lively, and rigorous discussion, working towards the shared goal of making intellectual progress on foundational questions. During the symposium, we will also feature the launch of a new Explainability in Machine Learning Challenge, involving the creation of new benchmarks for motivating the development of interpretable learning algorithms.

I’m interested in this topic! It relates to some of the ideas we’ve been talking about regarding Stan and Bayesian workflow.