
Workshop on science communication for graduate students

“Surely our first response to the disproof of a shocking-but-surprising claim should be to be un-shocked and un-surprised, not to try to explain away the refutation”

I came across the above quote the other day in an old post of mine, when searching for a Schrodinger’s cat image.

The quote came up in the context of a statistical claim made by a political activist which was widely promoted and discussed but which turned out to be false. As I wrote at the time, I was disappointed that the activist’s response to the disproof of his claim was not to recalibrate his understanding but rather to try to explain away the refutation and to attack the people who went to the trouble of figuring out where he’d gone wrong. Later on in the comments I continued along the same lines:

If you think being extremely numerate is protection against making a statistical mistake, you are naive about the process of scientific discovery. Extremely numerate people make mistakes all the time. Everybody makes mistakes all the time. Being open to learning from your mistakes, that’s how to move forward. Denying your mistakes and fighting, that’s not a way to move forward in your understanding.

Also this:

As they say in AA (or someplace like that), it’s only after you admit you’re a sinner that you can be redeemed. I know that I’m a sinner. I make statistical mistakes all the time. It’s unavoidable.

As you can see, it’s my general position that if something’s worth saying, it’s worth saying over and over and over.

The issue of accepting error in a shocking-but-surprising claim has connections to two statistical issues I’ve been thinking about recently, as I’ll discuss.

The paradoxical nature of anecdotal evidence (and of evidence more generally)

Thomas Basbøll and I recently published a couple of articles on the role of stories in social science (see here and here). Our key point is that stories should be anomalous and immutable: anomalous because the role of a story is to change our view of the world, to represent a solid piece of information that contradicts, in some way, our current understanding; and immutable because the value of this contradiction comes from the story having sharp edges that do not fit into conventional structures.

To the extent that a story becomes pliable, so that its details can be altered to fit a point of view, it loses its ability to inform us, as social scientists (or as humans, acting in the role of amateur scientists in our goal of learning about the social world).

That’s (one reason) why it’s important, when your surprising story is shot down, to accept that you might be wrong. Your story is surprising—that is, it contains information—but this surprise is conditional on the information being true. When it turns out the information is false, it’s a horrible mistake to hold on to the surprise and discard the truth. Then you’re in the position of this guy:


Your belief has no foundation, and you’re supporting yourself on nothing but a cloud of ignorance.

Time to turn around before you end up here:


“Psychological Science”-style papers

The other thing the above quote reminds me of is all the controversy about noise-mining research articles that have appeared in journals such as Psychological Science. My fullest discussion of such issues appears in this recent paper, but, for here, let me reiterate Jeremy Freese’s point that research about the unknown is, well, it’s full of unknowns, and there should be no shame in accepting that a once-promising idea didn’t work out.

Surprising, newsworthy, statistically significant, and wrong: it happens all the time.

On deck this week

Mon: “Surely our first response to the disproof of a shocking-but-surprising claim should be to be un-shocked and un-surprised, not to try to explain away the refutation”

Tues: Another benefit of bloglag

Wed: High risk, low return

Thurs: Patience and research

Fri: This is why I’m a political scientist and not a psychologist

Sat: “What then should we teach about hypothesis testing?”

Sun: Tell me what you don’t know

Lee Sechrest


Yesterday we posted on Lewis Richardson, a scientist who did pioneering work in weather prediction and, separately, in fractals, in the early twentieth century. I was pointed to Richardson by Lee Sechrest, whom I then googled.

Here’s Sechrest’s story:

His first major book [was] “Psychotherapy and the Psychology of Behavior Change” . . . Sechrest may be best known, however, for another book he co-authored in 1966: “Unobtrusive Measures: A Survey of Nonreactive Research in Social Science” . . .

“‘Unobtrusive Measures’ invoked the notion that we do not have the correct, right, accurate, valid measure of anything,” says Sechrest. “We have measures that are more or less useful under different circumstances. And the best response that we can make to our measurement problem is to use measures that get at the construct of interest in very different ways.”

The book has served as an inspiration to many psychologists, encouraging them to go beyond surveys and questionnaires in their attempts to understand behavior . . .

Cool. Measurement is important, and the title and theme of “unobtrusive measurement” seem closely related to ideas we’ve been talking about for a while regarding the ways in which quantum-mechanical concepts such as Heisenberg’s uncertainty principle might be usefully applied to measurement in the human sciences. I’m still a bit stuck on how this should all be done, but I think the framework makes sense, and it’s interesting to know that a book was written on the topic back in 1966. I guess this is just another demonstration of a longstanding principle of statistics (see also here).

Lewis Richardson, father of numerical weather prediction and of fractals

Lee Sechrest writes:

If you get a chance, Wiki this guy:


I [Sechrest] did and was gratifyingly reminded that I read some bits of his work in graduate school 60 years ago. Specifically, about his math models for predicting wars and his work on fractals to arrive at better estimates of the lengths of common boundaries between nations. Pretty remarkable.

Cool indeed.

Lots and lots of great stuff in this mini-bio, for example:

One of Richardson’s most celebrated achievements is his retroactive attempt to forecast the weather during a single day—20 May 1910—by direct computation. At the time, meteorologists performed forecasts principally by looking for similar weather patterns from past records, and then extrapolating forward. Richardson attempted to use a mathematical model of the principal features of the atmosphere, and use data taken at a specific time (7 AM) to calculate the weather six hours later ab initio. As Lynch makes clear, Richardson’s forecast failed dramatically, predicting a huge 145 hectopascals (4.3 inHg) rise in pressure over six hours when the pressure actually was more or less static. However, detailed analysis by Lynch has shown that the cause was a failure to apply smoothing techniques to the data, which rule out unphysical surges in pressure. When these are applied, Richardson’s forecast is revealed to be essentially accurate—a remarkable achievement considering the calculations were done by hand, and while Richardson was serving with the Quaker ambulance unit in northern France.

It also mentions his statistical modeling of international disputes. I wonder what today’s international relations scholars think of this work. I’m sure they’ve gone much farther along in sophistication, but I wonder whether they see Richardson’s work as an interesting precursor or as a dead end.

He also appears to have come up with the idea of fractal dimension in the length of coastlines, inspiring the famous writings of Mandelbrot on the topic:

At the time, Richardson’s research was ignored by the scientific community. Today, it is considered an element of the beginning of the modern study of fractals. Richardson’s research was quoted by mathematician Benoît Mandelbrot in his 1967 paper How Long Is the Coast of Britain? Richardson identified a value (between 1 and 2) that would describe the changes (with increasing measurement detail) in observed complexity for a particular coastline; this value served as a model for the concept of fractal dimension.

I’d never heard of this guy but apparently he’s pretty well known. For one thing, he has this long Wikipedia page; for another, it says that the European Geosciences Union has an award named after him. But perhaps his closest connection to fame is that he’s the uncle of actor Ralph Richardson. Which is a little bit like me being famous because of my distant relation to Marge Simpson (apparently, she’s married to a cousin of mine in L.A. whom I’ve never met).

P.S. I gave the post this title (which I adapted from the link to the above Wikipedia image) because it reminds me of the song, “Cezanne, father of cubism,” which I only heard once, on the radio many years ago, but which Google and YouTube assure me actually exists.

When a study fails to replicate: let’s be fair and open-minded

In a recent discussion of replication in science (particularly psychology experiments), the question came up of how to interpret things when a preregistered replication reaches a conclusion different from the original study. Typically the original, published result is large and statistically significant, and the estimate from the replication is small and not statistically significant.

One person in the discussion wrote, “As Simone Schnall suggests, this may not call into question the existence of the phenomenon; but it does raise concerns about boundary conditions, robustness, etc. It also opens up doors for examining exceptions, new factors (e.g., cultural factors outside US / North America), etc.” All this indeed is possible, but let’s also keep in mind the very real possibility that what we are seeing is simple sampling variation.

That is, suppose study 1 is performed under conditions A and is published with p less than .05, and then replication study 2 is performed under conditions B (which are intended to reproduce conditions A but in practice no replication is perfect), and replication study 2 is not statistically significant.

(i) One story (perhaps the preferred story of the researcher who published study 1) is that study 1 discovered a real effect and that study 2 is flawed, either because of poor data collection or analysis or because the replication wasn’t done right.

(ii) Another story (perhaps the back-up) is that study 2 did not reach statistical significance because it was a poorly done study with low power.

(iii) Yet another story (the back-up back-up) is that study 2 differed from study 1 because the effect is variable and occurs in setting A but not in setting B.

(iv) But I’d like to advance another story (not mentioned at all as a possibility by Schnall in her post that got this recent discussion started) which is that any real effect is so small as to be essentially undetectable (as in the power=.06 example here, and, yes, power=.06 is no joke, it’s a real possibility), and so the statistically significant pattern in study 1 is actually just happening within that particular sample and doesn’t reflect any general story even under setting A.

Again, let me emphasize that I’m not speaking of Schnall’s research in particular, which I’ve barely looked at; rather, I’m speaking more generally about how to think about the results of replications.

I think we should be fair and open-minded—and part of being fair and open-minded is to consider option (iv) above as a real possibility.
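To make that fourth possibility concrete, here is a small simulation sketch (pure Python; the numbers are made up for illustration: a true effect of 0.3 on a z-score scale with standard error 1 gives roughly 6% power at the two-sided .05 level). Among the “study 1” results that reach significance, the estimates greatly exaggerate the true effect, and an exact replication usually fails:

```python
import random

# Hypothetical simulation: a tiny true effect measured with standard error 1
# (z-score scale). true_effect = 0.3 gives roughly 6% power at two-sided .05.
rng = random.Random(42)
true_effect = 0.3
n_sims, crit = 20000, 1.96

sig_estimates = []   # |estimate| for "study 1" results with p < .05
replications = 0     # exact replications that also reach p < .05
for _ in range(n_sims):
    study1 = true_effect + rng.gauss(0, 1)
    if abs(study1) > crit:                       # the "published" finding
        sig_estimates.append(abs(study1))
        study2 = true_effect + rng.gauss(0, 1)   # replication, same conditions
        if abs(study2) > crit:
            replications += 1

power = len(sig_estimates) / n_sims
exaggeration = sum(sig_estimates) / len(sig_estimates) / true_effect
rep_rate = replications / len(sig_estimates)
print(f"power ~ {power:.2f}; significant estimates average "
      f"{exaggeration:.1f}x the true effect; replication rate ~ {rep_rate:.2f}")
```

The point is not the particular numbers but the pattern: when power is this low, “statistically significant” results are mostly sampling variation, so a failed exact replication is just what we should expect.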

Cross-validation, LOO and WAIC for time series

This post is by Aki.

Jonah asked on the Stan users mailing list:

Suppose we have J groups and T time periods, so y[t,j] is the observed value of y at time t for group j. (We also have predictors x[t,j].) I’m wondering if WAIC is appropriate in this scenario assuming that our interest in predictive accuracy is for existing groups only (i.e. we might get data for new time periods but only for the same J groups). My hunch is that this scenario requires a more complicated form of cross-validation that WAIC does not approximate, but the more I think about it the more confused I seem to become. Am I right that WAIC is not appropriate here?

I’ll try to be more specific than in my previous comments on this topic.

As WAIC is an approximation of leave-one-out (LOO) cross-validation, I’ll start by considering when LOO is appropriate for time series.

LOO is appropriate if we are interested in how well our model describes structure in the observed time series. For example, in the birthday example (BDA3 p. 505 and here), we can say that we have learned about the structure if we can predict any single date with missing data, and thus LOO is appropriate. Here we are not concerned so much about the birthdays in the future. The fact that the covariate x is deterministic (fixed) doesn’t change how we estimate the expected predictive performance (for a single date with missing data), but since x is fixed there is no uncertainty about the future values of x.

If we are interested in making predictions for the next not-yet-observed date and we want a better estimate than LOO of the expected predictive performance, we can use sequential prediction. I don’t recommend using all the terms \sum_{i=1}^{T} \log p(y_i|y_{1..i-1}), because the beginning of the series is sensitive to the prior. I would use the terms \sum_{i=k}^{T} \log p(y_i|y_{1..i-1}); how many terms (k-1) to remove depends on the properties of the time series.
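As a toy sketch (assuming, purely for illustration, an AR(1) model with known parameters standing in for a fitted model), the sequential sum of one-step-ahead log predictive densities with the first k-1 terms dropped could be computed like this:

```python
import math
import random

# Toy sketch: sum_{i=k}^{T} log p(y_i | y_{1..i-1}) for a hypothetical AR(1)
# model y_i = phi*y_{i-1} + eps_i, eps_i ~ N(0, sigma^2), with known phi, sigma.

def sequential_lpd(y, phi, sigma, k):
    """Sum of log p(y_i | y_{1..i-1}) for i >= k (0-based index into y)."""
    total = 0.0
    for i in range(k, len(y)):
        mu = phi * y[i - 1]  # AR(1): the predictive mean uses only y_{i-1}
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) \
                 - (y[i] - mu) ** 2 / (2 * sigma ** 2)
    return total

rng = random.Random(1)
phi, sigma = 0.8, 1.0
y = [0.0]
for _ in range(200):
    y.append(phi * y[-1] + rng.gauss(0, sigma))

full = sequential_lpd(y, phi, sigma, k=1)      # all terms, prior-sensitive start
trimmed = sequential_lpd(y, phi, sigma, k=20)  # drop the first k-1 terms
print(full, trimmed)
```

In a real analysis y would be data and p(y_i|y_{1..i-1}) would come from the fitted model’s posterior predictive, but the bookkeeping of dropping the prior-sensitive early terms is the same.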

When the number of time points is much larger than the number of hyperparameters \theta, to make the series even more stable and to better correspond to the prediction task, I would define
p(y_k|y_{1..k-1}) = \int p(y_k|y_{1..k-1},\theta) p(\theta|y_{1..T}) d\theta

If we are interested in making predictions for several not-yet-observed dates, I recommend using the analogous terms; for example, for d-days-ahead prediction, terms of the form p(y_{k+d-1}|y_{1..k-1}).

If we are interested in making predictions for future dates, we could still use LOO to select a model which describes the structure in the time series well. It is likely that such a model would also make good predictions for future data, but LOO will give an optimistic estimate of the expected predictive performance (for the next not-yet-observed date). This bias may be such that it does not affect which model is selected. This optimistic bias is harmful, for example, if we use the predictions for resource allocation: by underestimating how difficult it is to predict the future, we might end up not allocating enough resources (doctors for handling births, electricity generation to match the load, etc.).
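That optimism is easy to see in a toy case. For a hypothetical AR(1) model with known parameters (so exact LOO is available in closed form), leaving out y_i means conditioning on both y_{i-1} and y_{i+1}, which gives a sharper predictive distribution than a one-step-ahead forecast that only sees the past:

```python
import math
import random

def log_norm(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Simulate a hypothetical AR(1) series with known phi and sigma^2.
rng = random.Random(7)
phi, sigma2 = 0.8, 1.0
y = [0.0]
for _ in range(2000):
    y.append(phi * y[-1] + rng.gauss(0, math.sqrt(sigma2)))

seq, loo = [], []
for i in range(1, len(y) - 1):
    # One-step-ahead forecast: p(y_i | y_{1..i-1}) = N(phi*y_{i-1}, sigma^2).
    seq.append(log_norm(y[i], phi * y[i - 1], sigma2))
    # Exact LOO: p(y_i | y_{-i}) conditions on y_{i-1} AND y_{i+1}; for an AR(1)
    # this is N(phi*(y_{i-1}+y_{i+1})/(1+phi^2), sigma^2/(1+phi^2)).
    loo.append(log_norm(y[i],
                        phi * (y[i - 1] + y[i + 1]) / (1 + phi ** 2),
                        sigma2 / (1 + phi ** 2)))

mean_seq = sum(seq) / len(seq)
mean_loo = sum(loo) / len(loo)
print(f"mean sequential lpd = {mean_seq:.3f}, mean LOO lpd = {mean_loo:.3f}")
```

The LOO average comes out higher: the smaller conditional variance sigma^2/(1+phi^2) makes held-out points look easier to predict than they are in genuine forecasting, which is exactly the optimistic bias described above.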

If we are interested in making predictions for future dates, I think it is OK to use LOO in a preliminary phase, but sequential methods should be used for final reporting and decision making. A reason for using LOO could be that we can get the LOO estimate with a small additional computational cost after the full posterior inference. LOO approximations, which are obtained as a by-product or with a small additional cost after the full posterior inference has been made, are discussed in the papers Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models and WAIC and cross-validation in Stan.

Note that when using Kalman-filter-type inference for time series models, these sequential estimates can be obtained as a by-product or with only a small additional cost.

So now I’ve covered when LOO or the sequential approach is appropriate for time series, and I’ll return to the actual question, which states:

(i.e. we might get data for new time periods but only for the same J groups)

That is, the group ids are fixed and the time periods are deterministic.

As I said before, LOO (WAIC) is fine for estimating whether the model has found some structure in the data, and it does not matter that x is a combination of fixed and deterministic parts. If it is important to know the actual predictive performance for future data, you need to use a version of the sequential approach.

WAIC is just an approximation of LOO. I’m now convinced that there is no need to use WAIC. The paper Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models shows that there are better methods than WAIC for Gaussian latent variable models. We are also working on a better method to approximate LOO in Stan (maybe we’ll call it the Very Good Information Criterion?). I just need to run some additional experiments and write the paper…

The bracket!

That’s right, we’re getting ready for the battle to choose the ultimate seminar speaker. Paul Davidson, who sent in the image below, writes:

Knocked together in Excel. I’m European, so I may not have respected the North American system for brackets; i.e., I split each category into seeded pools and randomly drew from them. The French Intellectuals get a bit of a rough draw in this regard, with a lot of early matchups.

I take back all the bad things I ever said about Excel, as this image looks pretty good. Sure, the font is pretty unreadable, but other than that it looks cool.


I just feel bad for Plato, having to go up against Henny Youngman in the very first round, followed by a probable Mark Twain if he can get past Henny. The philosopher-king has a tough road to the Final Four.

The pairings will start on 3 Feb, so get your witticisms ready now!

I need your help in setting up the ultimate bracket: Picking the ideal seminar speaker


This came in the departmental email awhile ago:

The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of Arts and Sciences are proud to present
You are invited to apply for a seminar led by Professor Bruno Latour on Tuesday, September 23, 12-3pm. Twenty-five graduate students from throughout the university will be selected to participate in this single seminar given by Prof. Latour. Students will organize themselves into a reading group to meet once or twice in early September for discussion of Prof. Latour’s work. They will then meet to continue this discussion with a small group of faculty on September 15, 12-2pm. Students and a few faculty will meet with Prof. Latour on September 23. A reading list will be distributed in advance.
If you are interested in this 3-4 session seminar (attendance at all 3-4 sessions is mandatory), please send
Your School:
Your Department:
Year you began your terminal degree at Columbia:
Thesis or Dissertation title or topic:
Name of main advisor:
In one short, concise paragraph tell us what major themes/keywords from Latour’s work are most relevant to your own work, and why you would benefit from this seminar. Please submit this information via the site

The due date for applications is August 11 and successful applicants will be notified in mid-August.

This is the first time I’ve heard of a speaker who’s so important that you have to apply to attend his seminar! And, don’t forget, “attendance at all 3-4 sessions is mandatory.”

At this point you’re probably wondering what exactly is it that Bruno Latour does. Don’t worry—I googled him for you. Here’s the description of his most recent book, “An Inquiry Into Modes of Existence”:

The result of a twenty five years inquiry, it offers a positive version to the question raised, only negatively, with the publication, in 1991, of ”We have never been modern”: if ”we” have never been modern, then what have ”we” been? From what sort of values should ”we” inherit? In order to answer this question, a research protocol has been developed that is very different from the actor-network theory. The question is no longer only to define ”associations” and to follow networks in order to redefine the notion of ”society” and ”social” (as in ”Reassembling the Social”) but to follow the different types of connectors that provide those networks with their specific tonalities. Those modes of extension, or modes of existence, account for the many differences between law, science, politics, and so on. This systematic effort for building a new philosophical anthropology offers a completely different view of what the ”Moderns” have been and thus a very different basis for opening a comparative anthropology with the other collectives – at the time when they all have to cope with ecological crisis. Thanks to a European research council grant (2011-2014) the printed book will be associated with a very original purpose built digital platform allowing for the inquiry summed up in the book to be pursued and modified by interested readers who will act as co-inquirers and co-authors of the final results. With this major book, readers will finally understand what has led to so many apparently disconnected topics and see how the symmetric anthropology begun forty years ago can come to fruition.

Huh? I wonder if this is what they mean by “one short, concise paragraph” . . .

Update: We just got an announcement in the mail. The due date has been extended a second time, this time to Aug 18. This seems like a good sign, suggesting that fewer Columbia grad students than expected wanted to jump through the hoops to participate in this seminar.

The ultimate bracket

But I’m getting a bit off topic. What really got me interested in this was the idea of a speaker who is so important, so much in demand, that you have to fill out an application just to be in the same small room with him. Not to mention the labor required of whoever is screening the applications (assuming, that is, that more than 25 people actually apply).

So here’s the question: who would be the ultimate seminar speaker—the one person who you could only get to speak in a limited-access venue? I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I thought the best way for us to work this out would be via a single-elimination bracket, March Madness style. Which is why I’ve exercised the ultimate in patience and scheduled this post for January, 2015—nearly half a year after I wrote it!

So who’s the ultimate seminar speaker? Of course there’s an endless list of possibilities, ranging from celebrity academics (Paul Krugman, etc.) to cult figures of the past (Philip K. Dick, Ayn Rand, etc.) to actual rock stars (from Elvis on down). But to narrow things down I’ve chosen a list of 64 for us to work through.

My list includes eight current or historical figures from each of the following eight categories:
– Philosophers
– Religious Leaders
– Authors
– Artists
– Founders of Religions
– Cult Figures
– Comedians
– Modern French Intellectuals.

All these categories seem to be possible choices to reach the sort of general-interest intellectual community that is implied by the Latour announcement.

I’ve purposely not included any statisticians or indeed any academics (with the exception of Bruno Latour himself) because I don’t want to turn this competition into a mudfest.

I’ll give the list in a moment, along with the seedings, but first let me explain where I need help. I’m sure one of you has access to a computer program that makes one of those pretty brackets—you know what I’m talking about, four little trees of 16 teams each, all meeting in the middle. I want my potential seminar speakers set up in such a bracket which I can then post on this website and which we can go through, one pairing at a time.

With 64 speakers, we’ll need 63 matches to come to a winner. We can do one a day starting on February 3, so that the final bout will come on April 6, the final day of the NCAA men’s basketball tournament.

So here’s what I need from one of you: a full bracket with all 64 seminar speakers, displayed in that pretty “bracket” form, and with the speakers from the different categories all mixed up. It would be pretty boring to have all the artists against all the artists, all the religious leaders against all the religious leaders, etc. Instead, each group of 8 in the bracket should include one from each of the 8 occupational categories, and it should also include one #1 seed, one #2 seed, one #3 seed, one #4 seed, and 4 unseeded people, with the seedings set up as is standard: each seeded speaker is matched against an unseeded person, then the pairings are set up so that, if the seeds advance, #1 faces #4, and #2 faces #3.
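For what it’s worth, the seeding rule in that last sentence can be sketched in a few lines of code (the names below are placeholders; in the real bracket each group mixes all eight categories):

```python
import random

def group_pairings(seeded, unseeded, rng):
    """First-round matches for one group of 8: each seed faces a random
    unseeded speaker, ordered so that if all seeds advance, #1 meets #4
    and #2 meets #3 in the next round."""
    unseeded = list(unseeded)
    rng.shuffle(unseeded)
    order = [1, 4, 2, 3]  # adjacent winners meet: (1 vs 4), then (2 vs 3)
    return [(seeded[s], unseeded[i]) for i, s in enumerate(order)]

rng = random.Random(0)
matches = group_pairings(
    {1: "Seed A", 2: "Seed B", 3: "Seed C", 4: "Seed D"},
    ["Challenger 1", "Challenger 2", "Challenger 3", "Challenger 4"],
    rng,
)
for seed_name, challenger in matches:
    print(seed_name, "vs", challenger)
```

Running this eight times, once per group of 8, and stacking the groups into four regional trees gives the full 64-speaker bracket.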

Send me the bracket, I’ll post it on the blog, and we’ll go from there, once a day starting on 3 Feb. It will be fun, and the results won’t be obvious. These sorts of matchups can be highly nontransitive because we are implicitly comparing people on many different dimensions.

The 64

- Philosophers:
Plato (seeded 1 in group)
Alan Turing (seeded 2)
Aristotle (3)
Friedrich Nietzsche (4)
Thomas Hobbes
Jean-Jacques Rousseau
Bertrand Russell
Karl Popper

- Religious Leaders:
Mohandas Gandhi (1)
Martin Luther King (2)
Henry David Thoreau (3)
Mother Teresa (4)
Al Sharpton
Phyllis Schlafly
Yoko Ono

- Authors:
William Shakespeare (1)
Miguel de Cervantes (2)
James Joyce (3)
Mark Twain (4)
Jane Austen
John Updike
Raymond Carver
Leo Tolstoy

- Artists:
Leonardo da Vinci (1)
Rembrandt van Rijn (2)
Vincent van Gogh (3)
Marcel Duchamp (4)
Thomas Kinkade
Grandma Moses
Barbara Kruger
The guy who did Piss Christ

- Founders of Religions:
Jesus (1)
Mohammad (2)
Buddha (3)
Abraham (4)
L. Ron Hubbard
Mary Baker Eddy
Sigmund Freud
Karl Marx

- Cult Figures:
John Waters (1)
Philip K. Dick (2)
Ed Wood (3)
Judy Garland (4)
Sun Myung Moon
Charles Manson
Joan Crawford
Stanley Kubrick

- Comedians:
Richard Pryor (1)
George Carlin (2)
Chris Rock (3)
Larry David (4)
Alan Bennett
Stewart Lee
Ed McMahon
Henny Youngman

- Modern French Intellectuals:
Albert Camus (1)
Simone de Beauvoir (2)
Bernard-Henri Lévy (3)
Claude Lévi-Strauss (4)
Raymond Aron
Jacques Derrida
Jean Baudrillard
Bruno Latour

I don’t know how far Bruno Latour will go in this competition, but at least he’s in the running. May the best man (or woman) win!

And here it is (courtesy of Paul Davidson):


Stan comes through . . . again!


Erikson Kaszubowski writes in:

I missed your call for Stan research stories, but the recent post about stranded dolphins mentioned it again.

When I read about the Crowdstorming project in your blog, I thought it would be a good project to apply my recent studies in Bayesian modeling.

The project coordinators shared a big dataset (with 124,621 cases) and each research team had to independently analyze the data and answer two research questions:
1) Are soccer referees more likely to give red cards to dark skin toned players?
2) Are referees from countries with high skin tone bias more likely to be biased towards dark skin toned players?

Given the data structure (each case is a player-referee dyad, with variables about how many games occurred between them and more) and inspired by my recent reading of ARM, I thought a multilevel binomial-normal regression could be a good model to analyze the data.

I initially created a model using a different Bayesian software, but it only worked in small samples of the dataset. When I tried to analyze the whole thing, this other program couldn’t get off the ground. So, I decided to give Stan a try… And it worked like a charm!

The project article is still being written, but all analyses are already published in the Open Science Framework. Here’s the link for my analysis.

A short report, source codes and Stan chains are all there, in case anyone is interested.

I know the model isn’t such a great novelty and there is plenty of criticism to be done about what I did. But we can say, at least, that when people first crowdstormed a dataset, Stan was there!

Thank you and all the Stan team for such a great tool!

I haven’t looked at this in detail so don’t take this post as an endorsement of this particular model, coding, or data analysis—but it does demonstrate the success of our goal of allowing people to fit models directly, with a minimum of fuss, so that users can focus on the statistical modeling, not on the computation.