
The difference between me and you is that I’m not on fire

“Eat what you are while you’re falling apart and it opened a can of worms. The gun’s in my hand and I know it looks bad, but believe me I’m innocent.” – Mclusky

While the next episode of Madam Secretary buffers on terrible hotel internet, I (the other other white meat) thought I’d pop in to say a long, convoluted hello. I’m in New York this week visiting Andrew and the Stan crew (because it’s cold in Toronto and I somehow managed to put all my teaching on Mondays. I’m Garfield without the spray tan.).

So I’m in a hotel on the Upper West Side (or, like, maybe the upper upper west side. I’m in the 100s. Am I in Harlem yet? All I know is that I’m a block from my favourite bar [which, as a side note, Aki does not particularly care for] where I am currently not sitting and writing this because last night I was there reading a book about the rise of the surprisingly multicultural anti-immigration movement in Australia and, after asking what my book was about, some bloke started asking me for my genealogy and “how Australian I am” and really I thought that it was both a bit much and a serious misunderstanding of what someone who is reading a book with headphones on was looking for in a social interaction.) going through the folder of emails I haven’t managed to answer in the last couple of weeks looking for something fun to pass the time.

And I found one. Ravi Shroff from the Department of Applied Statistics, Social Science and Humanities at NYU (side note: applied statistics gets short shrift in a lot of academic stats departments around the world, which is criminal. So I will always love a department that leads with it in the title. I’ll also say that my impression when I wandered in there for a couple of hours at some point last year was that, on top of everything else, this was an uncommonly friendly group of people. Really, it’s my second favourite statistics department in North America, obviously after Toronto who agreed to throw a man into a volcano every year as part of my startup package after I got really into both that Tori Amos album from 1996 and cultural appropriation. Obviously I’m still processing the trauma of being 11 in 1996 and singularly unable to sacrifice any young men to the volcano goddess.) sent me an email a couple of weeks ago about constructing interpretable decision rules.

(Meta-structural diversion: I started writing this with the new year, new me idea that every blog post wasn’t going to devolve into, say, 500 words on how Medúlla is Björk’s Joanne, but that resolution clearly lasted for less time than my tenure as an Olympic torch relay runner. But if you’ve not learnt to skip the first section of my posts by now, clearly reinforcement learning isn’t for you.)

To hell with good intentions

Ravi sent me his paper Simple rules for complex decisions by Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel and Daniel Goldstein and it’s one of those deals where the title really does cover the content.

This is my absolute favourite type of statistics paper: it eschews the bright shiny lights of ultra-modern methodology in favour of the much harder road of taking a collection of standard tools and shaping them into something completely new.

Why do I prefer the latter? Well it’s related to the age old tension between “state-of-the-art” methods and “stuff-people-understand” methods. The latter are obviously preferred as they’re much easier to push into practice. This is in spite of the former being potentially hugely more effective. Practically, you have to balance “black box performance” with “interpretability”. Where you personally land on that Pareto frontier is between you and your volcano goddess.

This paper proposes a simple decision rule for binary classification problems and shows fairly convincingly that it can be almost as effective as much more complicated classifiers.

There ain’t no fool in Ferguson

The paper proposes a Select-Regress-and-Round method for constructing decision rules that works as follows:

  1. Select a small number k of features \mathbf{x} that will be used to build the classifier
  2. Regress: Use a logistic-lasso to estimate the classifier h(\mathbf{x}) = (\mathbf{x}^T\mathbf{\beta} \geq 0 \text{ ? } 1 \text{ : } 0).
  3. Round: Choose M possible levels of effect and build weights

w_j = \text{Round} \left( \frac{M \beta_j}{\max_i|\beta_i|}\right).

The new classifier (which chooses between options 1 and 0) selects 1 if

\sum_{j=1}^k w_j x_j > 0.

In the paper they use k=10 features and M = 3 levels.  To interpret this classifier, we can consider each level as a discrete measure of importance.  For example, when we have M=3 we have seven levels of importance from “very high negative effect”, through “no effect”, to “very high positive effect”. In particular

  • w_j=0: The jth feature has no effect
  • w_j= \pm 1: The jth feature has a low effect (positive or negative)
  • w_j = \pm 2: The jth feature has a medium effect (positive or negative)
  • w_j = \pm 3: The jth feature has a high effect (positive or negative).
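As a minimal sketch of the round-and-classify steps (my own illustration, not the authors’ code; the coefficient values are made up for the example):

```python
import numpy as np

def round_weights(beta, M=3):
    # Round step: map fitted lasso coefficients to integer weights
    # in {-M, ..., M}, scaled by the largest coefficient magnitude.
    beta = np.asarray(beta, dtype=float)
    return np.rint(M * beta / np.max(np.abs(beta))).astype(int)

def classify(x, w):
    # Decision rule: choose 1 if the integer-weighted sum is positive.
    return int(np.dot(w, x) > 0)

# Hypothetical coefficients from the logistic lasso (k = 5 features).
beta = [1.8, -0.6, 0.0, 0.31, -1.2]
w = round_weights(beta)  # -> [3, -1, 0, 1, -2]
```

Here the third feature drops out entirely (weight 0), illustrating how the number of active features can end up smaller than k.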

A couple of key things here make this idea work. First, the initial selection phase allows people to “sense check” the initial group of features while also forcing the decision rule to depend on only a small number of features, which greatly improves the ability for people to interpret the rule. The regression phase then works out which of those features are actually used (the number of active features can be less than k). Finally, the rounding phase gives a qualitative weight to each feature.

This is a transparent way of building a decision rule, as the effect of each feature used to make the decision is clearly specified.  But does it work?

She will only bring you happiness

The most surprising thing in this paper is that this very simple strategy for building a decision rule works fairly well. Probably unsurprisingly, complicated, uninterpretable decision rules constructed through random forests typically do work better than this simple decision rule. But the select-regress-round strategy doesn’t do too badly. It might be possible to improve the performance by tweaking the first two steps to allow for some low-order interactions. For binary features, this would allow for classifiers where neither X nor Y is a strong indicator of success on its own, but their co-occurrence (XY) is.

Even without this tweak, the select-regress-round classifier performs about as well as logistic regression and logistic lasso models that use all possible features (see the figure in the paper), although it performs worse than the random forest. It also doesn’t appear that the rounding process has too much of an effect on the quality of the classifier.

This man will not hang

The substantive example in this paper has to do with whether or not a judge decides to grant bail, where the event you’re trying to predict is a failure to appear at trial. The results in this paper suggest that the select-regress-round rule leads to a consistently lower rate of failure compared to the “expert judgment” of the judges.  It also works, on this example, almost as well as a random forest classifier.

There’s some cool methodology stuff in here about how to actually build, train, and evaluate classification rules when, for any particular experimental unit (person getting or not getting bail in this case), you can only observe one of the potential outcomes. This paper uses some ideas from the causal analysis literature to work around that problem.

I guess the real question I have about this type of decision rule for this sort of example is around how these sorts of decision rules would be applied in practice. In particular, would judges be willing to use this type of system? The obvious advantage of implementing it in practice is that it is data driven and, therefore, the decisions are potentially less likely to fall prey to implicit and unconscious biases. The obvious downside is that I am personally more than the sum of my demographic features (or other measurable quantities) and this type of system would treat me like the average person who shares the k features with me.

We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling

Aki points out that in cases of antithetical sampling, our effective sample size calculations were unduly truncated above at the number of iterations. It turns out the effective sample size can be greater than the number of iterations if the draws are anticorrelated. And all we really care about for speed is effective sample size per unit time.

NUTS can be antithetical

The desideratum for a sampler Andrew laid out to Matt was to maximize expected squared transition distance. Why? Because that’s going to maximize effective sample size. (I still hadn’t wrapped my head around this when Andrew was laying it out.) Matt figured out how to achieve this goal by building an algorithm that simulated the Hamiltonian forward and backward in time at random, doubling the time at each iteration, and then sampling from the path with a preference for the points visited in the final doubling. This tends to push iterations away from their previous values. In some cases, it can lead to anticorrelated chains.

Removing this preference for the second half of the chains drastically reduces NUTS’s effectiveness. Figuring out how to include it and satisfy detailed balance was one of the really nice contributions in the original NUTS paper (and implementation).

Have you ever seen 4000 as the estimated n_eff in a default Stan run? That’s probably because the true value is greater than 4000 and we truncated it.

The fix is in

What’s even cooler is that the fix is already in the pipeline and it just happens to be Aki’s first C++ contribution. Here it is on GitHub:

Aki’s also done simulations, so the new version is actually better calibrated as far as MCMC standard error goes (posterior standard deviation divided by the square root of the effective sample size).

A simple example

Consider three Markov processes for drawing a binary sequence y[1], y[2], y[3], …, where each y[n] is in { 0, 1 }. Our target is a uniform stationary distribution, for which each sequence element is marginally uniformly distributed,

Pr[y[n] = 0] = 0.5     Pr[y[n] = 1] = 0.5  

Process 1: Independent. This Markov process draws each y[n] independently. Whether the previous state is 0 or 1, the next state has a 50-50 chance of being either 0 or 1.

Here are the transition probabilities:

Pr[0 | 1] = 0.5   Pr[1 | 1] = 0.5
Pr[0 | 0] = 0.5   Pr[1 | 0] = 0.5

More formally, these should be written in the form

Pr[y[n + 1] = 0 | y[n] = 1] = 0.5

For this Markov chain, the stationary distribution is uniform. That is, some number of steps after initialization, there’s a probability of 0.5 of being in state 0 and a probability of 0.5 of being in state 1. More formally, there exists an m such that for all n > m,

Pr[y[n] = 1] = 0.5

The process will have an effective sample size exactly equal to the number of iterations because each state in a chain is independent.

Process 2: Correlated. This one makes correlated draws and is more likely to emit sequences of the same symbol.

Pr[0 | 1] = 0.01   Pr[1 | 1] = 0.99
Pr[0 | 0] = 0.99   Pr[1 | 0] = 0.01

Nevertheless, the stationary distribution remains uniform. Chains drawn according to this process will be slow to mix in the sense that they will have long sequences of zeroes and long sequences of ones.

The effective sample size will be much smaller than the number of iterations when drawing chains from this process.

Process 3: Anticorrelated. The final process makes anticorrelated draws. It’s more likely to switch back and forth after every output, so that there will be very few repeating sequences of digits.

Pr[0 | 1] = 0.99   Pr[1 | 1] = 0.01
Pr[0 | 0] = 0.01   Pr[1 | 0] = 0.99

The stationary distribution is still uniform. Chains drawn according to this process will mix very quickly.

With an anticorrelated process, the effective sample size will be greater than the number of iterations.


If I had more time, I’d simulate, draw some traceplots, and also show correlation plots at various lags and the rate at which the estimated mean converges. This example’s totally going in the Coursera course I’m doing on MCMC, so I’ll have to work out the visualizations soon.
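In the meantime, here’s a quick simulation sketch of all three processes (my own, not the Coursera example; it uses a crude AR(1)-style estimate n_eff ≈ N(1 − ρ)/(1 + ρ) based only on lag-1 autocorrelation, rather than Stan’s actual n_eff computation):

```python
import random

def simulate(p_switch, n, seed=1):
    # Two-state Markov chain on {0, 1}: flip the state with probability
    # p_switch at each step. The stationary distribution is uniform.
    rng = random.Random(seed)
    y, state = [], 0
    for _ in range(n):
        if rng.random() < p_switch:
            state = 1 - state
        y.append(state)
    return y

def lag1_autocorr(y):
    n = len(y)
    m = sum(y) / n
    var = sum((v - m) ** 2 for v in y) / n
    cov = sum((y[i] - m) * (y[i + 1] - m) for i in range(n - 1)) / n
    return cov / var

def ess_ar1(y):
    # Crude effective sample size from lag-1 autocorrelation only:
    # n_eff ~ N (1 - rho) / (1 + rho); rho < 0 gives n_eff > N.
    rho = lag1_autocorr(y)
    return len(y) * (1 - rho) / (1 + rho)

N = 20_000
independent = simulate(0.5, N)      # Process 1: n_eff about N
correlated = simulate(0.01, N)      # Process 2: n_eff much less than N
anticorrelated = simulate(0.99, N)  # Process 3: n_eff much greater than N
```

For Process 3 the lag-1 autocorrelation is close to −0.98, so the crude estimate gives an effective sample size nearly a hundred times the number of iterations.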

(What’s So Funny ‘Bout) Evidence, Policy, and Understanding


Kevin Lewis asked me what I thought of this article by Oren Cass, “Policy-Based Evidence Making.” That title sounds wrong at first—shouldn’t it be “evidence-based policy making”?—but when you read the article you get the point, which is that Cass argues that so-called evidence-based policy isn’t so evidence-based at all, that what is considered “evidence” in social science and economic policy is often so flexible that it can be taken to support whatever position you want. Hence, policy-based evidence making.

I agree with Cass that the whole “evidence-based policy” thing has been oversold. For an extreme example, see this story of some “evidence-based design” in architecture that could well be little more than a billion-dollar pseudoscientific scam.

More generally, I agree that there are problems with a lot of these studies, both in their design and in their interpretation. Here’s a story from a few years ago that I’ve discussed a bit; for a slightly more formal treatment of that example, see section 2.1 of this article.

So I’m sympathetic with the points that Cass is making, and I’m glad this article came out; I think it will generally push the discussion in the right direction.

But there are two places where I disagree with Cass.

1. First, despite all the problems with controlled experiments, they can still tell us something, as long as they’re not overinterpreted. If we forget about statistical significance and all that crap, a controlled experiment is, well, it’s a controlled experiment, you’re learning about something under controlled conditions, which can be useful. This is a point that has been made many times by my colleague Don Green: yes, controlled experiments have problems with realism, but the same difficulties can arise when trying to generalize observational comparisons to new settings.

To put it another way, recall Bill James’s adage that the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Consider Cass’s article. He goes through lots of legitimate criticism of overinterpretations of results from that Oregon experiment, but then, what does he do? He gives lots of weight to an observational study from Yale that compares across states.

One point that Cass makes very well is that you can’t rely too much on any single study. Any single study is limited in scope, it occurs at a particular time and place and with a particular set of treatments, outcomes, and time horizon. To make decisions, we have to do our best with the studies we have, which sometimes means discarding them completely if they are too noisy. And I think Cass is right that we should take studies more seriously when they are large and occur under realistic conditions.

2. The political slant is just overwhelming. Cass throws in the kitchen sink. For example, “When Denmark began offering generous maternity leave, so many nurses made use of it that mortality rates in nursing homes skyrocketed.” Whaaaa? Maybe they could hire some more help in those nursing homes? Any policy will have issues when rolling out on a larger scale, but it seems silly to say that therefore it’s not a good idea to evaluate based on what evidence is available. Then he refers to “This evidence ratchet, in which findings can promote but not undermine a policy, is common.” This makes no sense to me, given that anything can be a policy. The $15 minimum wage is a policy. So is the $5 minimum wage, or for that matter the $0 minimum wage. High taxes on the rich is a policy, low taxes on the rich is a policy, etc.

Also this: “Grappling with such questions is frustrating and unsettling, as the policymaking process should be. It encourages humility and demands that the case for government action clear a high bar.” This may be Cass’s personal view, but it has nothing to do with evidence. He’s basically saying that if the evidence isn’t clear, we should make decisions based on his personal preference for less government spending, which I think means lower taxes on the rich. One could just as well say the opposite: “It encourages humility and demands that the riches of our country be shared more equally.” Or, to give it a different spin: “It encourages humility and demands that we live by the Islamic principles that have stood the test of time.” Or whatever. When evidence is weak, you have to respect uncertainty; it should not be treated as a rationale for sneaking in your own policy preferences as a default.

But I hate to end it there. Overall I liked Cass’s article, and we should be able to get value from it, subtracting the political slant which muddles his legitimate points. The key point, which Cass makes well, is that there is no magic to evidence-based decision making: You can do a controlled experiment and still learn nothing useful. The challenge is where to go next. I do think evidence is important, and I think that, looking forward, our empirical studies of policies should be as realistic as possible, close to the ground, as it were. Easier said than done, perhaps, but we need to do our best, and I think that critiques such as Cass’s are helpful.

“The following needs to be an immutable law of journalism: when someone with no track record comes into a field claiming to be able to do a job many times better for a fraction of the cost, the burden of proof needs to shift quickly and decisively onto the one making the claim. The reporter simply has to assume the claim is false until substantial evidence is presented to the contrary.”

Mark Palko writes:

The following needs to be an immutable law of journalism: when someone with no track record comes into a field claiming to be able to do a job many times better for a fraction of the cost, the burden of proof needs to shift quickly and decisively onto the one making the claim. The reporter simply has to assume the claim is false until substantial evidence is presented to the contrary.

Yup. This is related to advice I give to young researchers giving presentations or writing research papers:

1. Describe the problem you have that existing methods can’t solve.

2. Show how your new method solves the problem.

3. Explain how your method works.

4. Explain why, if your idea is so great, how come all the people who came before you were not already doing it.

There are lots of possibilities for step 4. Maybe your new idea is only now possible because of new technology, or network effects, or your idea flowed from earlier ideas that people only recently realized were effective, or maybe it was an idea taken from another area and introduced into your field. Whatever. The point is, if you don’t answer question 4, your responses to questions 1, 2, and 3 are suspect.

Palko’s example concerned a newspaper story about “of entrepreneurs and Ivy League grads from the USA rescuing the poor children of Africa from poverty and ignorance.” Here’s Palko:

If you take down a few pages, there is some good, substantial reporting by Peg Tyre on this story. Unfortunately, as is all too often the case with New York Times Magazine articles, the first act is almost entirely credulous and laudatory. For example:

By 2015, Bridge was educating 100,000 students, and the founders claimed that they were providing a “world-class education” at “less than 30 percent” of what “the average developing country spends per child on primary education.” This would represent a remarkable achievement. None of the founders had traditional teaching experience. May had been an unpaid teacher at a school in China; Kimmelman worked with teachers and administrators developing an ed-tech company. How had they pulled it off? In interviews and speeches, they credited cutting-edge education technology and business strategies — the company monitors and stores a wide range of data on subjects including teacher absenteeism, student payment history and academic achievement — along with their concern for the well-being of the world’s poorest children. That potent mixture, they said, had allowed them to begin solving a complex and intractable problem: how to provide cheap, scalable, high-quality schooling for the most vulnerable, disadvantaged children on earth.

So, yes, question #4 is addressed in the news article—but the problem is the intro is entirely presented from the perspective of the boosters, the people who have a stake in selling their idea. It’s ok when giving a talk to sell your idea—there will be others out there to present their own, different views on the topic. But a newspaper should do better.

Palko continues:

Later on in the piece, Tyre actually does start digging into the business model, where the key drivers seem to be using cheap, substandard buildings, hiring undertrained and unqualified instructors, employing strong-arm collection tactics, and doing lots of marketing.

So Palko’s complaint is not with all of the article, just how it starts. Still, I think he’s making a good point. When we hear bold claims, our starting point should be disbelief. Assume the claim is false until substantial evidence is presented to the contrary.

P.S. Palko points here to another example of ridiculously credulous journalism.

StanCon 2018 Helsinki, 29-31 August 2018

Photo of Helsinki by (c) Visit Helsinki / Jussi Hellsten.

StanCon 2018 Asilomar was so much fun that we are organizing StanCon 2018 Helsinki August 29-31, 2018 at Aalto University, Helsinki, Finland (location chosen using antithetic sampling).

Full information is available at the StanCon 2018 Helsinki website.

Summary of the information

What: One day of tutorials and two days of talks, open discussions, and statistical modeling in beautiful Helsinki.

When: August 29-31, 2018

Where: Aalto University, Helsinki, Finland

Invited speakers

  • Richard McElreath, Max Planck Institute for Evolutionary Anthropology
  • Maggie Lieu, European Space Astronomy Centre
  • Sarah Heaps, Newcastle University
  • Daniel Simpson, University of Toronto

Call for contributed talks

StanCon’s version of conference proceedings is a collection of contributed talks based on interactive, self-contained notebooks (e.g., knitr, R Markdown, Jupyter, etc.). For example, you might demonstrate a novel modeling technique, or a (possibly simplified) version of a novel application, etc. There is no minimum or maximum length and anyone using Stan is welcome to submit a contributed talk.

More details are available on the StanCon submissions web page and examples of accepted submissions from StanCon 2017 are available in our stancon_talks repository on GitHub.

Contributed posters

We will accept poster submissions on a rolling basis until July 31st. One page exclusive of references is the desired format but anything that gives us enough information to make a decision is fine. See the conference web page for submission instructions.


If you’re interested in sponsoring StanCon 2018 Helsinki, please reach out to the organizers. Your generous contributions will ensure that our registration costs are kept as low as possible and allow for us to subsidize attendance for students who would otherwise be unable to come.

Static sensitivity analysis: Computing robustness of Bayesian inferences to the choice of hyperparameters

Ryan Giordano wrote:

Last year at StanCon we talked about how you can differentiate under the integral to automatically calculate quantitative hyperparameter robustness for Bayesian posteriors. Since then, I’ve packaged the idea up into an R library that plays nice with Stan. You can install it from this github repo. I’m sure you’ll be pretty busy at StanCon, but I’ll be there presenting a poster about exactly this work, and if you have a moment to chat I’d be very interested to hear what you think!

I’ve started applying this package to some of the Stan examples, and it’s already uncovered some (in my opinion) serious problems, like this one from chapter 13.5 of the ARM book. It’s easy to accidentally make a non-robust model, and I think a tool like this could be very useful to Stan users! As of right now I’m the only one who’s experimented with it (AFAIK), but I think it’s ready for some live guinea pigs.

It’s something I came upon while working on robustness for variational Bayes, but it doesn’t have anything to do with VB. It’s based on a very simple and old idea (nothing more than differentiating under the integral of an expectation). But I do think it’s timely to revisit the idea now that we have an MCMC toolkit built with easy-to-use automatic differentiation!

I sent this to some Stan people.

Dan Simpson wrote:

One comment about the linearisation: it’s basically parameterisation dependent, so it’s not “userproof.” A few friends of mine made a local geometry argument for these types of things in here.

Ryan replied:

This is certainly true! That it’s not userproof is a point that can’t be said enough.

Linearization does, however, have the benefit of being relatively easy for ordinary users to understand. In particular, it’s easy to understand its limitations. The limitations of worst-case priors in a unit ball of a functional metric space are harder to wrap your head around, especially if you’re a Stan user trying to fit a model to your sociology data set.

I might add that, if this package does anything distinctively useful, it’s that it manages to pummel the Stan API into giving it the derivatives it needs. A richer set of features for interacting with the C++ autodiff library through the Stan modeling language would make this package kind of trivial.

And Aki asked:

How do you interpret the robustness measure? How small a value is negligible and how large is too big?

Ryan replied:

A full answer is in appendix C of our paper. In short, I’d expect users to have in mind a reasonable range of their hyperparameters.

But here’s an answer in short. Also see this picture:

A central assumption—currently only verifiable by actually re-fitting*—is that the expectations depend linearly on the hyperparameters over this range. Given the sensitivity, you can estimate how many posterior standard deviations the expectation would vary by across the plausible range of hyperparameters. If the expectation varies too much (e.g., by more than 1 posterior sd), then there’s a problem.

So there are a number of things that the user still has to decide:
1) How much hyperparameter variability is reasonable?
2) Do I have reason to believe that linearity is reasonable over this range?
3) How much variability in the expectation is acceptable?

I think the answers to these questions will vary considerably from problem to problem, and we’ve made no attempt to answer them automatically (except to normalize automatically by the posterior standard deviation).

* You can calculate higher-order derivatives wrt the hyperparameter, but in general you’ll need higher order derivatives of the log probability which are not now exposed in rstan. I also expect MCMC error will start to become a real problem. I haven’t actually experimented with this.

All this is a followup from our conversation regarding static sensitivity analysis, an idea that I had way back when but have never really integrated into my workflow. I’m hoping that Ryan’s methods will make it easy to do this in Stan, and this will give us one more tool to understand how our models are working.
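To make the differentiate-under-the-integral idea concrete, here’s a toy sketch (mine, not Ryan’s package): for a hyperparameter, the sensitivity of a posterior expectation equals the posterior covariance of the quantity with the derivative of the log density with respect to that hyperparameter. In a conjugate normal model we can check the Monte Carlo estimate against the analytic answer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Conjugate model: y_i ~ N(theta, sigma^2), prior theta ~ N(mu0, tau^2).
n, sigma, tau, mu0, ybar = 10, 1.0, 1.0, 0.0, 1.0
post_prec = 1 / tau**2 + n / sigma**2
post_mean = (mu0 / tau**2 + n * ybar / sigma**2) / post_prec
draws = rng.normal(post_mean, np.sqrt(1 / post_prec), size=200_000)

# Sensitivity of E[theta] to the prior mean mu0 via the covariance
# identity, using d log p(theta; mu0) / d mu0 = (theta - mu0) / tau^2.
sens_mc = np.cov(draws, (draws - mu0) / tau**2)[0, 1]

# Analytic sensitivity for the conjugate model, for comparison.
sens_exact = (1 / tau**2) / post_prec
```

The point of the identity is that the Monte Carlo estimate needs only posterior draws and the derivative of the log density, which is exactly what an autodiff-equipped MCMC toolkit can provide.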

Statistical behavior at the end of the world: the effect of the publication crisis on U.S. research productivity

Under the heading, “I’m suspicious,” Kevin Lewis points us to this article with abstract:

We exploit the timing of the Cuban Missile Crisis and the geographical variation in mortality risks individuals faced across states to analyse reproduction decisions during the crisis. The results of a difference-in-differences approach show evidence that fertility decreased in states that are farther from Cuba and increased in states with more military installations. Our findings suggest that individuals are more likely to engage in reproductive activities when facing high mortality risks, but reduce fertility when facing a high probability of enduring the aftermath of a catastrophe.

It’s the usual story: forking paths (nothing in the main effect, followed by a selection among the many many possible two-way and three-way interactions that could be studied), followed by convoluted storytelling (“individuals indulge in reproductive activities when facing high mortality risks, but reduce fertility when facing the possibility to endure the aftermath of the impending catastrophe.”). This sort of paper reflects the problems with null hypothesis significance testing amidst abundant researcher degrees of freedom (find a statistically significant p-value, then tell a story) and also the problems with the “identification strategy” approach that is now standard in the social sciences: if you can do a regression discontinuity, or a difference in differences, or an instrumental variable, then all skepticism is turned off.

That said, I have no objection to this paper being published: it’s a pattern in data, no reason not to share it with the world. No reason to take it seriously, either.

Hey, here’s a new reason for a journal to reject a paper: it’s “annoying” that it’s already on a preprint server

Alex Gamma writes:

I’m interested in publishing in journal X. So I inquire about X’s preprint policy.

X’s editor informs me that

[Journal X] does not prohibit placing submitted manuscripts on preprint servers. Some reviewers may notice the server version of the article, however, and they may find the lack of anonymity so annoying that it affects their recommendations about the paper.

This is interesting in part because it highlights the different roles of scientific journals. Traditionally, a journal is a way to “publish” a paper, that is, to print the article so that other people can read it. In this case, it’s already on the preprint server, so the main purpose of the journal is to give a stamp of approval. This all seems like a mess.

I think it’s inappropriate for a reviewer to downgrade a submission because it’s been posted on a preprint server. I don’t know about the rest of you, but sometimes I do something that seems important enough that I don’t feel like waiting 2 years for it to appear in print!

At the very least, I feel that the editor, in his or her instructions to reviewers, could explicitly instruct them not to be influenced by the existing publication history. This stricture can’t be enforced but at least it could be established as a norm. As it is, it almost sounds like the editor thinks this attitude on the part of reviewers is OK.

Dear Thomas Frank

It’s a funny thing: academics are all easily reachable by email, but non-academics can be harder to track down.

Someone pointed me today to a newspaper article by political analyst Thomas Frank that briefly mentioned my work. I had a question for Frank, but the only correspondence I had with him was from ten years ago, and my email bounced.

So I’ll send it here:

Dear Thomas:

Someone pointed out a newspaper article in which you linked to something I’d written.

Here’s what you wrote:

Krugman said that the shift of working-class people to the Republican party was a myth and that it was not happening outside the south. . . . Here are some examples: a blog post from 2007; a column in the Times in 2008 (“Nor have working-class voters trended Republican over time,” he wrote. “On the contrary, Democrats do better with these voters now than they did in the 1960s”); his book, Conscience of a Liberal, published in 2007 and reprinted in 2009 and 2015; and a Times column in 2015, in which Krugman was still insisting that: “The working-class turn against Democrats wasn’t a national phenomenon — it was entirely restricted to the south.” . . . I know: Krugman wasn’t the only one saying things like this. Here’s a political scientist making the same point in tones of utmost contempt, implying that no serious professional in academia or prestige journalism could possibly disagree with him.

In that last place you link to this post of mine from 2012.

Your remark about me expressing “utmost contempt” is fair enough; I guess that is one thing you and I have in common, that we are not always patient with people who we feel have made a mistake.

Beyond issues of tone, though, is there anything I wrote in that post that you think is actually incorrect? From your newspaper article you seem to be expressing a negative take on that post of mine, but as far as I am aware, there are no errors in that post; everything I wrote there is accurate. So I’d like to know what specifically I got wrong there. Thanks in advance for explaining.


I should perhaps clarify that I’m serious about asking what I got wrong. I can well believe that I made a mistake; I’d just like to know what it is. I’m guessing it’s just a matter of emphasis, that Frank doesn’t actually think I got anything wrong in that post; he just would’ve written it differently. But maybe not; maybe there’s something I’m completely missing here. It wouldn’t be the first time.

Anyway, if any of you is in contact with Thomas Frank, please forward this along to him. He can respond in the comments or by email.

P.S. I have more to say on the topic of who is voting for Democrats and who is voting for Republicans—as has come up before, terms such as “blue collar” and “working class” are somewhat loaded in that they often seem to summon up images of white men—but here I just want to figure out exactly what it is of my writing that Frank is disagreeing with.

P.P.S. Someone sent me Thomas Frank’s email address so I sent him my question (that is, the part of the above post beginning with “Dear Thomas” and ending with “Yours Andrew”). And he responded!

Here is his response, in its entirety:

Dear sir:

Thanks for writing.

I think my article speaks for itself.


Thomas Frank

Wow! “Dear sir”—I don’t hear that one very often. I wonder how he responds to female correspondents. “Dear Madam”? “Dear Miss or Madam”? I can’t imagine.

Anyway, I can understand the response: after all, Frank, unlike me, makes his living from writing so he can’t very well just give out political commentary for free. Remember that Samuel Johnson quote.

The puzzle: Why do scientists typically respond to legitimate scientific criticism in an angry, defensive, closed, non-scientific way? The answer: We’re trained to do this during the process of responding to peer review.

[image of Cantor’s corner]

Here’s the “puzzle,” as we say in social science. Scientific research is all about discovery of the unexpected: to do research, you need to be open to new possibilities, to design experiments to force anomalies, and to learn from them. The sweet spot for any researcher is at Cantor’s corner. (See here for further explanation of the Cantor connection.)

Buuuut . . . researchers are also notorious for being stubborn. In particular, here’s a pattern we see a lot:
– Research team publishes surprising result A based on some “p less than .05” empirical results.
– This publication gets positive attention and the researchers and others in their subfield follow up with open-ended “conceptual replications”: related studies that also attain the “p less than .05” threshold.
– Given the surprising nature of result A, it’s unsurprising that other researchers are skeptical of A. The more theoretically-minded skeptics, or agnostics, demonstrate statistical reasons why these seemingly statistically-significant results can’t be trusted. The more empirically-minded skeptics, or agnostics, run preregistered replication studies, which fail to replicate the original claim.
– At this point, the original researchers do not apply the time-reversal heuristic and conclude that their original study was flawed (forking paths and all that). Instead they double down, insist their original findings are correct, and they come up with lots of little explanations for why the replications aren’t relevant to evaluating their original claims. And they typically just ignore or brush aside the statistical reasons why their original study was too noisy to ever show what they thought they were finding.

So, the puzzle is: researchers are taught to be open to new ideas, research is all about finding new things and being aware of flaws in existing paradigms—but researchers can be sooooo reluctant to abandon their own pet ideas.

OK, some of this we can explain by general “human nature” arguments. But I have another explanation for you, that’s specific to the scientific communication process.

My story goes like this. As scientists, we put a lot of effort into writing articles, typically with collaborators: we work hard on each article, try to get everything right, then we submit to a journal.

What happens next? Sometimes the article is rejected outright, but, if not, we’ll get back some review reports which can have some sharp criticisms: What about X? Have you considered Y? Could Z be biasing your results? Did you consider papers U, V, and W?

The next step is to respond to the review reports, and typically this takes the form of, We considered X, and the result remained significant. Or, We added Y to the model, and the result was in the same direction, marginally significant, so the claim still holds. Or, We adjusted for Z and everything changed . . . hmmmm . . . we then also thought about factors P, Q, and R. After including these, as well as Z, our finding still holds. And so on.

The point is: each of the remarks from the reviewers is potentially a sign that our paper is completely wrong, that everything we thought we found is just an artifact of the analysis, that maybe the effect even goes in the opposite direction! But that’s typically not how we take these remarks. Instead, almost invariably, we think of the reviewers’ comments as a set of hoops to jump through: We need to address all the criticisms in order to get the paper published. We think of the reviewers as our opponents, not our allies (except in the case of those reports that only make mild suggestions that don’t threaten our hypotheses).

When I think of the hundreds of papers I’ve published and the, I dunno, thousand or so review reports I’ve had to address in writing revisions, how often have I read a report and said, Hey, I was all wrong? Not very often. Never, maybe?

So, here’s the deal. As scientists, we see serious criticism on a regular basis, and we’re trained to deal with it in a certain way: to respond while making minimal, ideally zero, changes to our scientific claims.

That’s what we do for a living; that’s what we’re trained to do. We think of every critical review report as a pain in the ass that we have to deal with, not as a potential sign that we screwed up.

So, given that training, it’s perhaps little surprise that when our work is scrutinized in post-publication review, we have the same attitude: the expectation that the critic is nitpicking, that we don’t have to change our fundamental claims at all, that if necessary we can do a few supplemental analyses and demonstrate the robustness of our findings to those carping critics.

And that’s the answer to the puzzle: Why do scientists typically respond to legitimate scientific criticism in an angry, defensive, closed, non-scientific way? Because in their careers, starting from the very first paper they submit to a journal in grad school, scientists get regular doses of legitimate scientific criticism, and they’re trained to respond to it in the shallowest way possible, almost never even considering the possibility that their work is fundamentally in error.

P.S. I’m pretty sure I posted on this before but I can’t remember when, so I thought it was simplest to just rewrite from scratch.

P.P.S. Just to clarify—I’m not trying to slam peer review. I think peer review is great; even at its worst it can be a way to convey that a paper has not been clear. My problem is not with peer review but rather with our default way of responding to peer review, which is to figure out how to handle the review comments in whatever way is necessary to get the paper published. I fear that this trains us to respond to post-publication criticism in that same way.

The retraction paradox: Once you retract, you implicitly have to defend all the many things you haven’t yet retracted

Mark Palko points to this news article by Beth Skwarecki on Goop, “the Gwyneth Paltrow pseudoscience empire.” Here’s Skwarecki:

When Goop publishes something weird or, worse, harmful, I often find myself wondering what are they thinking? Recently, on Jimmy Kimmel, Gwyneth laughed at some of the newsletter’s weirder recommendations and said “I don’t know what the fuck we talk about.” . . .

I [Skwarecki] . . . end up speaking with editorial director Nandita Khanna. “You publish a lot of things that are outside of the mainstream. What are your criteria for determining that something is safe and ethical to recommend?”

Khanna starts by pointing out that they include a disclaimer at the bottom of health articles. This is true. It reads:

The views expressed in this article intend to highlight alternative studies and induce conversation. They are the views of the author and do not necessarily represent the views of goop, and are for informational purposes only, even if and to the extent that this article features the advice of physicians and medical practitioners. This article is not, nor is it intended to be, a substitute for professional medical advice, diagnosis, or treatment, and should never be relied upon for specific medical advice.

. . . I ask: “What responsibility do you believe you have to your readers?” Here at Lifehacker, I recently killed a post I was excited about—a trick for stopping kids from unbuckling and escaping from their car seat—after a car seat expert nixed it. I feel like if I’m providing information people might act on, I have a responsibility to make sure that information is reasonably accurate and that people won’t hurt themselves (or their children) if they take me at my word.

Goop’s editors don’t see it that way. “Our responsibility is to ask questions, to start the conversation,” Khanna says.

OK, so far, not so bad. Goop’s basically an entertainment site. Its goal is to be thought-provoking. They’re not heavy on the quality control, but they make this clear, and readers can take Goop’s articles with that in mind.

From this perspective, to criticize Goop for peddling pseudoscience would be like criticizing David Sedaris for embellishing his Santaland story. It’s beside the point. We can’t get mad at Goop like we can get mad at, say, Dr. Oz, who’s using his medical degree and Columbia University affiliation to push questionable products.

But then Skwarecki continues:

I turn the conversation to Goop’s infamous jade eggs. They are for sale that day in the pharmacy shop, and I got to hold one in my hand. It was smaller than I expected, not the size of a chicken egg but more like a grape tomato. Both the jade and rose quartz eggs have a hole drilled through the smaller end, and at first I imagined a Goop acolyte taking the egg out of her vagina, rinsing it off, and hanging it around her neck. I learned later that the hole is the answer to the question in the jar: you can attach dental floss to give it a removal string, like a tampon.

The idea of the jade egg, or its prettier rose quartz companion, is to “cultivate sexual energy, increase orgasm, balance the cycle, stimulate key reflexology around vaginal walls.” The grain of truth here is that using a small weight for vaginal exercises can help strengthen the muscles in that area. You can do this without a weight, too.

But Jen Gunter, a practicing gynecologist who is one of Gwyneth’s most vocal critics, has explained that jade eggs are a terrible idea. Stones can be porous enough to grow bacteria, and she says the instructions for using the egg are incorrect and could harm people. For example, a Goop article suggests walking around with the egg inside of you. Gunter counters that overworking your vaginal muscles this way can result in pelvic pain.

The Goop editors remember the jade egg backlash, and they are unfazed. “Did you read the letter from Layla?” Khanna asks. Layla Martin, who sells jade eggs and a seven-week course on how to use them, wrote a 2,000-word “letter to the editors” defending the eggs. Goop published it in their newsletter, and underneath it, their disclaimer, and underneath that, a link to their shop.

Whoa! That doesn’t sound like healthy living.

The punch line:

Khanna says they “never considered backing down.” She points out, as if it were a defense, that the eggs were very popular and sold out right away. I ask her: Has there ever been a health article in Goop that you thought afterward, maybe we shouldn’t have run that?

No, she says, never.

Interesting story. It reminds me of Freakonomics. I always wondered why they never retracted some of their more embarrassing mistakes, such as their endorsement of Satoshi Kanazawa’s silly claims about beauty and sex ratio, or their breathless coverage of Daryl Bem’s ESP paper, or their book chapter on climate change. Why not retract the errors that experts point out to you? My best guess was that they didn’t want to start retracting even their biggest goof-ups, because once you start retracting, you’re implicitly endorsing all the things you didn’t retract. Paradoxically, if you don’t really believe the things you’re writing, you might be better off not retracting anything.

The Freakonomics team and the Goop team are in the same place, which is that they believe they are fundamentally doing good by spreading the principles (healthy living for Goop, economics for Freakonomics) and that the details don’t really matter. They know they’re the good guys and they don’t want to hear otherwise.

P.S. Skwarecki’s article is on Lifehacker, a site formerly connected with Gawker. Now that they’re being mean to a business venture, I wonder if Peter Thiel will try to sue them to death. I hope not.

P.P.S. The comment thread on Palko’s post continues in some interesting directions, including a discussion of some subset of eminent journalists and scientists who seem to care more about the well-being of their professional colleagues than anything else.

Why are these explanations so popular?

David Weakliem writes:

According to exit polls, Donald Trump got 67% of the vote among whites without a college degree in 2016, which may be the best-ever performance by a Republican (Reagan got 66% of that group in 1984).

Weakliem first rejects one possibility that’s been going around:

One popular idea is that he cared about them, or at least gave them the impression that he cared. The popularity of this account has puzzled me, because it’s not even superficially plausible. Every other presidential candidate I can remember tried to show empathy by talking about people they had met on the campaign trail, or tough times they had encountered in their past, or how their parents taught them to treat everyone equally. Trump didn’t do any of that—he boasted about how smart and how rich he was.

And, indeed, Weakliem shows data to shoot down the “he cares” explanation.

Here’s another possibility:

A variant is that Democrats drove “working class” voters away by showing contempt for them. This is more plausible, but raises the question of whether Democrats showed that much more contempt in 2016 than in 2012, 2008, 2004, etc. That seems like a hard case to make—at any rate, I haven’t heard anyone try to make it.

Weakliem then turns to the meta-question: not why did Trump do so well among less-educated white voters, but why are so many pundits pushing the “treating everyone with dignity” story? Here’s Weakliem:

So why are these explanations so popular? My hypothesis is that it’s because American society has become a lot more socially egalitarian over the last 60 years or so. Educated people don’t want to be thought of as snobs or elitists, and less educated people are less likely to think they should “improve themselves” by emulating the middle class. At one time, you could say that Democrats thought of themselves as the party of the common people, and Republicans thought of themselves as the party of successful people. Now both parties think of themselves as the party of the common people, plus the fraction of the elites who care about or understand the common people. The result is that people are attracted to an explanation that is more flattering to the “working class.”

This reminds me of something we wrote in Red State Blue State:

The Republican Party’s long-standing pro-business philosophy has a natural enduring appeal to higher-income voters. In contrast, it is a surprise when rich people vote for Democrats, suggesting that the party may have departed from its traditional populism. Conservative pundits hit the Democratic Party for losing relevance and authenticity, while liberals slam the Democrats for selling out. For example, Thomas Edsall quoted labor leader Andy Stern, president of the Service Employees International Union, saying that the perception of Democrats as “Volvo-driving, latte-drinking, Chardonnay-sipping, Northeast, Harvard- and Yale-educated liberals is the reality. That is who people see as leading the Democratic Party. There’s no authenticity; they don’t look like them. People are not voting against their interests; they’re looking for someone to represent their interests.” If Republicans are led by Benz-driving, golf-playing, Texas, Harvard- and Yale-educated conservatives, this is not such a problem because, in some sense, the Republicans never really claim to be in favor of complete equality.

Just as politicians would like the endorsement of Oprah Winfrey or Bruce Springsteen, it also seems desirable in our democracy to have the support of the so-called waitress moms and NASCAR dads—not just for the direct benefits of their votes but also because they signal a party’s broad appeal. In recent years, prestige votes for Democrats have included teachers and nurses; Republicans have won the prestige votes of farmers and many in the armed services.

Rich people are an anti-prestige vote: just as politicians generally don’t seek out the endorsement of, for example, Barbra Streisand or Ted Nugent (except in venues where these names are respected), they also don’t want to be associated with obnoxious yuppies or smug suburbanites in gated communities. The parties want the money of these people—in fact, in their leadership, both parties to some extent are these people—and they’ll take their votes, but they don’t necessarily want to make a big deal about it.

P.S. Weakliem also writes of “the recollections of people like Charles Murray (Coming Apart) and Robert Putnam (Our Kids) about how there used to be less social distance between classes. I think that may be because they both grew up in small towns in the Midwest. If you read something like E. Digby Baltzell’s The Protestant Establishment, you get a very different picture of status differences in America.”

I agree. Murray and Putnam have some useful things to say, and they’ve said some more debatable things too, but in any case they have a particular perspective which does not tell the whole story of America, or even of white male America.

A Python program for multivariate missing-data imputation that works on large datasets!?

Alex Stenlake and Ranjit Lall write about a program they wrote for imputing missing data:

Strategies for analyzing missing data have become increasingly sophisticated in recent years, most notably with the growing popularity of the best-practice technique of multiple imputation. However, existing algorithms for implementing multiple imputation suffer from limited computational efficiency, scalability, and capacity to exploit complex interactions among large numbers of variables. These shortcomings render them poorly suited to the emerging era of “Big Data” in the social and natural sciences.

Drawing on new advances in machine learning, we have developed an easy-to-use Python program – MIDAS (Multiple Imputation with Denoising Autoencoders) – that leverages principles of Bayesian nonparametrics to deliver a fast, scalable, and high-performance implementation of multiple imputation. MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are capable of producing complex, fine-grained reconstructions of partially corrupted inputs. To enhance their accuracy and speed while preserving their complexity, these networks leverage the recently developed technique of Monte Carlo dropout, changing their output from a frequentist point estimate into the approximate posterior of a Gaussian process. Preliminary tests indicate that, in addition to successfully handling large datasets that cause existing multiple imputation algorithms to fail, MIDAS generates substantially more accurate and precise imputed values than such algorithms in ordinary statistical settings.
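The Monte Carlo dropout trick mentioned above is easy to illustrate. Here is a toy numpy sketch of my own (not the authors’ MIDAS code; the tiny network and its random weights are purely illustrative): keep dropout switched on at prediction time, run many stochastic forward passes, and treat the spread of the outputs as predictive uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2)

# A one-hidden-layer "network" with fixed random weights, purely to
# illustrate the mechanism; in MIDAS the weights would be trained.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, rng, p_drop=0.5):
    """One stochastic forward pass with dropout left ON at test time."""
    h = np.maximum(x @ W1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop    # random dropout mask
    h = h * mask / (1.0 - p_drop)          # inverted-dropout rescaling
    return (h @ W2).ravel()

x = rng.normal(size=(1, 4))
draws = np.array([forward(x, rng) for _ in range(200)]).ravel()
# The mean of the draws is the prediction; their spread plays the role
# of the approximate posterior uncertainty.
print(draws.mean(), draws.std())
```

Gal’s thesis (mentioned below) gives the argument for why this ensemble of dropout passes approximates a posterior.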

They continue:

Please keep in mind we’re writing for a political science/IPE audience, where listwise deletion is still common practice. The “best-practice” part should be fairly evident among your readership…in fact, it’s probably just considered “how to build a model”, rather than a separate step.

And here are some details:

Our method is “black box”-y, using neural networks and denoising autoencoders to approximate a Gaussian process. We call it MIDAS – Multiple Imputation with Denoising AutoencoderS – because you have to have a snappy name. Using a neural network has drawbacks – it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and yield accurate imputations in relatively short (by MCMC standards) time. For example, we ran the entire CCES (66k x ~2k categories) as one of our test cases in about an hour. We’ve also built in an overimputation method for checking model complexity and the ability of the model to reconstruct known values. It’s not a perfect “return of the full distribution of missing values”, but point estimates of error for deliberately removed values, giving a good estimate and allowing the avoidance of any potential overtraining through early stopping. A rough sanity check, if you will. This is an attempt to map onto the same terrain covered in your blog post.

Due to aggressive regularisation and application of deep learning techniques, it’s also resistant to overfitting in overspecified models. A second test loosely follows the method outlined in Honaker and King 2012, taking the World Development Indicators, subsetting out a 31-year period for 6 African nations, and then lagging all complete indicators a year either side. We then remove a single value of GDP, run a complete imputation model and compare S=200 draws of the approximate posterior to the true value. In other words, the data feed is a 186 x ~3.6k matrix, suffering hugely from both collinearity and irrelevant input, and it still yields quite accurate results. It should be overfitting hell, but it’s not. We’d make a joke about the Bayesian LASSO, but I’m actually not sure. Right now we think it’s a combination of data augmentation and sparsity driven by input noise. Gal’s PhD thesis is the basis for this algorithm, and this conception of a sparsity-inducing prior could just be a logical extension of his idea. Bayesian deep learning is all pretty experimental right now.

We’ve got a github alpha release of MIDAS up, but there’s still a long way to go before it gets close to Stan’s level of flexibility. Right now, it’s a fire-and-forget algorithm designed to be simple and fast. Let’s be frank – it’s not a replacement for conventional model-based imputation strategies. We’d still trust a fully specified generative model which can incorporate explicit information about the missingness-generating mechanism over our own for bespoke models. Our aim is to build a better solution than listwise deletion for the average scholar/data scientist, one that can reliably handle the sorts of nonlinear patterns found in real datasets. Looking at the internet, most users – particularly data scientists – aren’t statisticians who have the time or energy to go the full-Bayes route.

Missing data imputation is tough. You want to include lots of predictors and a flexible model, so regularization is necessary, but it’s only recently that we—statisticians and users of statistics—have become comfortable with regularization. Our mi package, for example, uses only very weak regularization, and I think we’ll need to start from scratch when thinking about imputation. As background, we have written three papers on generic missing-data imputation.
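For readers new to the idea that all of these methods elaborate, here is a minimal sketch of regression-based multiple imputation (my own toy example, not mi or MIDAS): impute each missing value from a model fit to the observed data, adding noise so that imputation uncertainty propagates, then repeat and pool across imputed datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bivariate data with ~30% of y missing at random
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
y_obs = np.where(rng.random(n) < 0.3, np.nan, y)

def impute_once(x, y_obs, rng):
    """One stochastic imputation: regress y on x using complete cases,
    then draw missing y's from the fitted predictive distribution."""
    obs = ~np.isnan(y_obs)
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, y_obs[obs], rcond=None)
    sigma = (y_obs[obs] - X @ beta).std(ddof=2)
    y_imp = y_obs.copy()
    m = np.isnan(y_obs)
    # Crucial step: add noise rather than plugging in the point prediction
    y_imp[m] = beta[0] + beta[1] * x[m] + rng.normal(scale=sigma, size=m.sum())
    return y_imp

# M imputed datasets; analyze each, then pool (Rubin's rules for the mean)
M = 20
means = [impute_once(x, y_obs, rng).mean() for _ in range(M)]
pooled = np.mean(means)
print(f"pooled mean of y: {pooled:.3f}")
```

The between-imputation spread of the M analyses is what lets the final standard errors reflect missing-data uncertainty; that is the part listwise deletion and single imputation both throw away.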

I encourage interested readers to check out Stenlake and Lall’s program and see how it works. Or, if people have any thoughts on the method, feel free to share these thoughts in comments. I’ve not looked at the program or the documentation myself; I’m just sharing it here because the description above sounds reasonable.

Incentive to cheat

Joseph Delaney quotes Matthew Yglesias writing this:

But it is entirely emblematic of America’s post-Reagan treatment of business regulation. What a wealthy and powerful person faced with a legal impediment to moneymaking is supposed to do is work with a lawyer to devise clever means of subverting the purpose of the law. If you end up getting caught, the attempted subversion will be construed as a mitigating (it’s a gray area!) rather than aggravating factor. Your punishment will probably be light and will certainly not involve anything more than money. You already have plenty of money, and your plan is to get even more. So why not?

Yglesias’s quote is about Donald Trump but the issue is more general than that; for example here’s Delaney quoting a news report by Matt Egan regarding a banking scandal:

These payouts are on top of the $3.2 million Wells Fargo has paid to customers over 130,000 potentially unauthorized accounts. That works out to a refund of roughly $25 per account.

From an economics perspective, this is all standard stuff, falling under the category “moral hazard”: When the expected benefits from cheating greatly exceed the expected costs, there’s an incentive to cheat. If the difference between the expected financial benefits and costs is small, then the incentive is small or even negative, as there are reputational costs to being caught cheating, and most of us feel bad about cheating, also it can be difficult to persuade others to collude in an illegal scheme. But when the gap between expected benefits and costs widens, eventually people will grab the opportunity, and then others, seeing the rewards, will join in. Etc.
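The expected-value logic here fits in a few lines. The numbers below are illustrative assumptions of mine, not the actual Wells Fargo figures: the point is only that when detection is unlikely and fines are modest, cheating has positive expected value.

```python
def expected_net_gain(gain, p_caught, fine):
    """Expected profit from cheating in a stylized setup:
    keep the gain either way, pay the fine only if caught."""
    return gain - p_caught * fine

# Weak enforcement: small chance of being caught, fine a few times the gain
weak_enforcement = expected_net_gain(gain=1_000_000, p_caught=0.05, fine=2_000_000)

# Strong enforcement: likely detection, fine an order of magnitude larger
strong_enforcement = expected_net_gain(gain=1_000_000, p_caught=0.50, fine=10_000_000)

print(weak_enforcement, strong_enforcement)  # 900000.0 -4000000.0
```

Under the first regime the rational response is exactly what Yglesias describes: hire the lawyer and subvert the rule.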

As the freakonomists say, “incentives matter.”

One question, then, is why is this sort of thing not discussed more in pop-econ? We heard a lot about why inner-city drug dealers live with their mothers—something to do with economics, I recall—but no corresponding chapter about how suburban pharmaceutical executives (that’s another kind of drug dealer, right? And I say this as a person who works in pharmaceutical research myself) are motivated to misreport drug trials. We heard about the incentives that encourage real-estate agents, auto mechanics, and doctors to rip you off, but not the ways in which the legal system gives an incentive for wealthy and powerful people to break the law.

This is not about Freakonomics, which is the product of two creative, hardworking people who can feel free to write about whatever interests them the most. I’m just using them as a convenient example.

Really my question is about how this particular incentive-to-cheat identified by Delaney and Yglesias is not discussed more. Why is it not the standard example among economists when talking about the effects of incentives? An incentive to systematically break the law—that’s a pretty good one, right?

Why don’t economists talk more about the incentives for white-collar crime? I suspect it’s because many economists think of business regulation as fundamentally illegitimate. Not that they think all regulation is bad, but just about any regulation can make the typical economist a bit uncomfortable. Hence weak enforcement is perhaps viewed as a feature as much as a bug, and perhaps the mainstream view in economics is that it’s just as well that people in business will push against the rules.

Even if you oppose a law, it’s still a relevant point of economics and political science that weak enforcement gives an incentive to break the law—but maybe there’s something uncomfortable about making this point in a textbook or general presentation of economics. I don’t see it as a left-wing or right-wing thing, exactly, more that there’s something so upsetting about thinking that the system is set up with incentives to cheat, that we avoid talking about it unless we really have to. Cheating among sumo wrestlers, real estate agents, even doctors—sure, that’s unfortunate, but at least we can see our way to economics-based solutions. But a legal system that’s set up to reward cheating—that’s just scary, so better not to think about it. Or, at least, not to consider it as part of economics or political science, at least not most of the time.

StanCon 2018 Live Stream — bad news…. not enough bandwidth

Breaking news: no live stream. We’re recording, so we’ll put the videos online after the fact.

We don’t have enough bandwidth to live stream today.

StanCon 2018 starts today! We’re going to try our best to live stream the event on YouTube.

We have the same video setup as last year, but may be limited by internet bandwidth here at Asilomar.

If we’re up, the streams will appear on the Stan YouTube Channel (all times Pacific).


Three new domain-specific (embedded) languages with a Stan backend

One is an accident. Two is a coincidence. Three is a pattern.

Perhaps it’s no coincidence that there are three new interfaces that use Stan’s C++ implementation of adaptive Hamiltonian Monte Carlo (currently an updated version of the no-U-turn sampler).

  • ScalaStan embeds a Stan-like language in Scala. It’s a Scala package largely (if not entirely) written by Joe Wingbermuehle.
    [GitHub link]

  • tmbstan lets you fit TMB models with Stan. It’s an R package listing Kasper Kristensen as author.
    [CRAN link]

  • SlicStan is a “blockless” and self-optimizing version of Stan. It’s a standalone language coded in F# written by Maria Gorinova.
    [pdf language spec]

These are in contrast with systems that entirely reimplement a version of the no-U-turn sampler, such as PyMC3, ADMB, and NONMEM.

Alzheimer’s Mouse research on the Orient Express

Paul Alper sends along an article from Joy Victory at Health News Review, shooting down a bunch of newspaper headlines (“Extra virgin olive oil staves off Alzheimer’s, preserves memory, new study shows” from USA Today, the only marginally better “Can extra-virgin olive oil preserve memory and prevent Alzheimer’s?” from the Atlanta Journal-Constitution, and the better but still misleading “Temple finds olive oil is good for the brain — in mice” from the Philadelphia Inquirer) which were based on a university’s misleading press release. That’s a story we’ve heard before.

The clickbait also made its way into traditionally respected outlets Newsweek and Voice of America. And NPR, kinda.

Here’s Joy Victory:

It’s pretty great clickbait—a common, devastating disease cured by something many of us already have in our pantries! . . . To deconstruct how this went off the rails, let’s start with the university news release sent to journalists: “Temple study: Extra-virgin olive oil preserves memory & protects brain against Alzheimer’s.”

That’s a headline that surely got journalists’ attention. It’s not until after two very long opening paragraphs extolling the virtues of the nearly magical powers of extra virgin olive oil that we find out who, exactly, this was tested on.

Mice. . . .

It’s never mentioned in the news release, but this hypothesis was tested on only 22 mice, just 10 of which got the olive oil rich diet, and 12 of which got a standard diet.

As in: The sample size here is so small that we can’t be very sure what the results are telling us.

The release does at least make it pretty clear that these were also genetically modified mice, who had their DNA fracked with so that they developed “three key characteristics of the disease: memory impairment, amyloid plaques, and neurofibrillary tangles.”

Victory continues:

Stating the obvious, here, but genetically modified mice are a far, far cry from people. . . .

In fact, even drugs that apparently do a great job of getting rid of amyloid in thousands of actual humans don’t seem to have much effect on the symptoms of Alzheimer’s disease. . . .

There are other limitations of the study that should have been in the news release, but we’ll stop here.

I looked briefly at the published research article and am concerned about forking paths, type M errors, and type S errors. Put briefly, I doubt such strong results would show up in a replication of this study.
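To see why a study this small invites type M and type S errors, here’s a minimal simulation sketch (the 0.2-standard-deviation true effect is an assumed number, chosen only for illustration; the group sizes 10 and 12 match the mouse study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_effect = 0.2   # assumed small true effect, in standard-deviation units
n1, n2 = 10, 12     # group sizes as in the mouse study
n_sims = 10_000

# Collect the estimated effects from simulated studies that reach p < 0.05
sig_estimates = []
for _ in range(n_sims):
    a = rng.normal(true_effect, 1, n1)  # "olive oil" group
    b = rng.normal(0, 1, n2)            # comparison group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:
        sig_estimates.append(np.mean(a) - np.mean(b))

sig = np.array(sig_estimates)
power = len(sig) / n_sims
type_m = np.mean(np.abs(sig)) / true_effect             # exaggeration ratio
type_s = np.mean(np.sign(sig) != np.sign(true_effect))  # sign-error rate
print(f"power ~ {power:.2f}, type M ~ {type_m:.1f}, type S ~ {type_s:.2f}")
```

Under these assumptions, power is tiny, and the comparisons that do cross the p < 0.05 threshold overestimate the true effect severalfold and occasionally get its sign wrong. That’s why a lone significant result from 22 mice is weak evidence.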

“No actual interviewing took place.”

Victory continues her article with a discussion of the news coverage:

With a news release like this, journalists were primed to do a poor job writing about the study.

To its credit, the Inquirer was upfront about this being a mouse study. . . . Yet, the story included no independent experts to offer some perspective—something we saw across the board in the stories we read.

USA Today and the Atlanta Journal Constitution didn’t even bother to disclose that all the quotes in their stories come directly from the news release. As in: No actual interviewing took place.


Victory concludes:

This is all great news for Temple’s PR team—their messaging made it out to the public with very little editing. But this isn’t great news for the public.

Countless people across the country who have loved ones with dementia likely read these stories with a whole lot of hope.

What they get instead is a whole lot of hype.

It’s ok to publish raw data and speculation—just present them as such.

I have no objection to this research being published—indeed, a key reason for having a scientific publication process in the first place is so that these sorts of preliminary, exploratory results can be shared, and others can replicate.

The key is to present the results as data rather than as strong conclusions, and to avoid this sort of strong conclusion:

Taken together, our findings support a beneficial effect of EVOO consumption on all major features of the AD phenotype (behavioral deficits, synaptic pathology, Aβ and tau neuropathology), and demonstrate that autophagy activation is the mechanism underlying these biological actions.

That’s from the published paper. Can they really conclude all that from 22 genetically modified mice and a pile of bar graphs and p-values? I don’t think so.

The ethics question

Is it unethical for a team of researchers to overstate their results? Is it unethical for a university public relations office to hype a statistical study about 22 genetically modified mice with the following inaccurate headline: “Extra-virgin olive oil preserves memory & protects brain against Alzheimer’s”?

I feel like everyone’s passing the buck here:
– Newspapers don’t have the resources to spend on science reporting;
– Science writers are busy and feel under pressure to present good news, maybe even attach themselves to the kind of “breakthrough” they can write about in more detail;
– Lots of pressure to get “clicks”;
– Public relations offices are judged on media exposure, not accuracy;
– Academic researchers know that it’s hard to publish in top journals without making big claims.

Somewhere along the line, though, something unethical seems to be happening. And it seems worth discussing in a context such as this: legitimate (I assume) research on an important topic that is just being majorly hyped. This is not the Brian Wansink show, it’s not power pose or himmicanes or ovulation and clothing or beauty and sex ratio or all the other ridiculous noise-mining exercises that we’ve used to sharpen our intuition about understanding and communicating variation. It’s potentially real research, being unethically hyped. But the hyping is done in stages. It’s a Murder on the Orient Express situation in which . . . . SPOILER ALERT! . . . all the players are guilty.

And all this is happening in a media environment that has less journalism and more public relations than in the past. Fewer eyes on the street, as Jane Jacobs would say.

Benefits and limitations of randomized controlled trials: I agree with Deaton and Cartwright

My discussion of “Understanding and misunderstanding randomized controlled trials,” by Angus Deaton and Nancy Cartwright, for Social Science & Medicine:

I agree with Deaton and Cartwright that randomized trials are often overrated. There is a strange form of reasoning we often see in science, which is the idea that a chain of reasoning is as strong as its strongest link. The social science and medical research literature is full of papers in which a randomized experiment is performed, a statistically significant comparison is found, and then story time begins, and continues, and continues—as if the rigor from the randomized experiment somehow suffuses through the entire analysis.

Here are some reasons why the results of a randomized trial cannot be taken as representing a general discovery:

1. Measurement. A causal effect on a surrogate endpoint does not necessarily map to an effect on the outcome of interest. . . .

2. Missing data. . . .

3. Extrapolation. The participants in a controlled trial are typically not representative of the larger population of interest. This causes no problem if the treatment effect is constant but can lead to bias to the extent that treatment effects are nonlinear and have interactions. . . .

4. Researcher degrees of freedom. . . .

5. Type M (magnitude) errors. . . .

Each of these threats to validity is well known, but they often seem to be forgotten, or to be treated as minor irritants to be handled with some reassuring words or a robustness study, rather than as fundamental limitations on what can be learned from a particular dataset.

One way to get a sense of the limitations of controlled trials is to consider the conditions under which they can yield meaningful, repeatable inferences. . . .

Where does this all leave us? Randomized controlled trials have problems, but the problem is not with the randomization and the control—which do give us causal identification, albeit subject to sampling variation and relative to a particular local treatment effect. So really we’re saying that all empirical trials have problems, a point which has arisen many times in discussions of experiments and causal reasoning in political science; see Teele (2014). I agree with Deaton and Cartwright that the best way forward is to integrate subject-matter information into design, data collection, and data analysis . . .

Once we recognize the importance of diverse sources of data, statistics can be helpful in making decisions and quantifying uncertainty. . . .

Again, here’s my discussion, and here’s the original article. I assume other discussions are coming soon.

Nudge nudge, say no more

Alan Finlayson puts it well when he writes of “the tiresome business of informing and persuading people replaced by psychological techniques designed to ‘nudge’ us in the right direction.”

I think that’s about right. Nudging makes sense as part of a package that already includes information and persuasion. For example, tell us that smoking causes cancer, persuade us to reduce or quit, then also provide nudges that make smoking more difficult and make non-smoking more comfortable. But to try to nudge people without informing and persuading them, that seems like a mistake.

“However noble the goal, research findings should be reported accurately. Distortion of results often occurs not in the data presented but . . . in the abstract, discussion, secondary literature and press releases. Such distortion can lead to unsupported beliefs about what works for obesity treatment and prevention. Such unsupported beliefs may in turn adversely affect future research efforts and the decisions of lawmakers, clinicians and public health leaders.”

David Allison points us to this article by Bryan McComb, Alexis Frazier-Wood, John Dawson, and himself, “Drawing conclusions from within-group comparisons and selected subsets of data leads to unsubstantiated conclusions.” It’s a letter to the editor for the Australian and New Zealand Journal of Public Health, and it begins:

[In the paper, “School-based systems change for obesity prevention in adolescents: Outcomes of the Australian Capital Territory ‘It’s Your Move!’”] Malakellis et al. conducted an ambitious quasi-experimental evaluation of “multiple initiatives at [the] individual, community, and school policy level to support healthier nutrition and physical activity” among children.1 In the Abstract they concluded, “There was some evidence of effectiveness of the systems approach to preventing obesity among adolescents” and cited implications for public health as follows: “These findings demonstrate that the use of systems methods can be effective on a small scale.” Given the importance of reducing childhood obesity, news of an effective program is welcome. Unfortunately, the data and analyses do not support the conclusions.

And it continues with the following sections:

Why within-group testing is misleading

Malakellis et al. reported a “significant decrease in the prevalence of overweight/obesity within the pooled intervention group (p<0.05) but not the pooled comparison group (NS) (Figure 2)”. This kind of analysis, known as differences in nominal significance (DINS) analysis, is “invalid, producing conclusions which are, potentially, highly misleading”. . . .

Why drawing conclusions from subsets of data selected on the basis of observed results is misleading

Ideally, all analyses would be clearly described as having been specified a priori or not, so that readers can best interpret the data. Despite reporting no significance for the overall association, Malakellis et al. highlighted the results of the subgroup analyses as a general effect overall. Further complicating matters, the total number of subgroup analyses were unclear. It is also uncertain whether the analyses were planned a priori or after the data were collected and viewed. . . . Other problems arise when subgroup analyses are unrestricted, which is a multiple comparisons issue. . . .

Spin can distort the scientific record and mislead the public

Although Malakellis et al. may have presented their data accurately, by including statements of effectiveness based on a within-group test instead of relying on the proper between-group test, the article did not represent the findings accurately. The goal of reducing childhood obesity is a noble one. . . . However noble the goal, research findings should be reported accurately. Distortion of results often occurs not in the data presented but, as in the current article, in the abstract, discussion, secondary literature and press releases. Such distortion can lead to unsupported beliefs about what works for obesity treatment and prevention. Such unsupported beliefs may in turn adversely affect future research efforts and the decisions of lawmakers, clinicians and public health leaders.

They conclude:

Considering the importance of providing both the scientific community and the public with accurate information to support policy decisions and future research, erroneous conclusions reported in the literature should be corrected. The stated conclusions of the article in question were not substantiated by the data and should be corrected.

Well put. The problems identified by McComb et al. should be familiar to regular readers of this blog, as they include the difference between significant and non-significant is not itself statistically significant, the garden of forking paths, and story time.
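That first point, that the difference between significant and non-significant is not itself statistically significant, is easy to see in a quick simulation. Here’s a sketch (all numbers are made up for illustration: both groups are given the same true pre-post change, so the intervention does nothing relative to the comparison):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n, n_sims = 50, 5_000
true_change = 0.25  # identical true pre-post change in both groups (SD units)

dins, between_sig = 0, 0
for _ in range(n_sims):
    change_a = rng.normal(true_change, 1, n)  # "intervention" group changes
    change_b = rng.normal(true_change, 1, n)  # "comparison" group changes
    p_a = stats.ttest_1samp(change_a, 0).pvalue   # within-group test, arm A
    p_b = stats.ttest_1samp(change_b, 0).pvalue   # within-group test, arm B
    p_ab = stats.ttest_ind(change_a, change_b).pvalue  # proper between-group test
    if (p_a < 0.05) != (p_b < 0.05):  # one arm "significant," the other not
        dins += 1
    if p_ab < 0.05:
        between_sig += 1

print(f"one arm significant only: {dins/n_sims:.2f}, "
      f"between-group significant: {between_sig/n_sims:.2f}")
```

Even though the two groups are identical by construction, a large fraction of the simulated studies end up with one arm “significant” and the other not—the DINS pattern—while the correct between-group test flags a difference only at the nominal 5% rate.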

I particularly like this bit: “However noble the goal, research findings should be reported accurately.” That was one of the things that got tangled in discussions we’ve had of various low-quality psychology research. The research has noble goals. But I don’t think those goals are served by over-claiming and then minimizing the problems with those claims. You really have to go back to first principles. If the published research is wrong, it’s good to know that. And if the published research is weak, it’s good to know that too: it’s the nature of claims supported by weak evidence that they often don’t replicate.

Allison also pointed me to the authors’ response to their letter. The authors of the original paper are Mary Malakellis, Erin Hoare, Andrew Sanigorski, Nicholas Crooks, Steven Allender, Melanie Nichols, Boyd Swinburn, Cal Chikwendu, Paul Kelly, Solveig Petersen, and Lynne Millar, and they write:

The paper describes one of the first attempts to evaluate an obesity prevention intervention that was informed by systems thinking and deliberately addressed the complexity within each school setting. A quasi-experimental design was adopted, and the intervention design included the facility for each school to choose and adopt interventions that were specific to their school context and priorities. This, in turn, meant the expectation of differential behavioural effects was part of the initial design and therefore a comparison of outcomes by intervention school was warranted. . . . Because of the unique and adaptive nature of intervention within each school, and the different intervention priority in each school, there was an a priori expectation of differential results and we therefore investigated reports within schools’ changes.

This is fine. Interactions are important. You just have to recognize that estimates of interactions will be more variable than estimates of main effects, thus you can pretty much forget about establishing “statistical significance” or near-certainty about particular interactions.

Malakellis et al. continue:

Our conclusion used qualifying statements that there was “some evidence” of within-school changes but no interaction effect, and that the findings were “limited”.

Fair enough—if that’s what they really did.

Let’s check, going back to the original article. Here’s the abstract, in its entirety:

OBJECTIVE: The Australian Capital Territory ‘It’s Your Move!’ (ACT-IYM) was a three-year (2012-2014) systems intervention to prevent obesity among adolescents.

METHODS: The ACT-IYM project involved three intervention schools and three comparison schools and targeted secondary students aged 12-16 years. The intervention consisted of multiple initiatives at individual, community, and school policy level to support healthier nutrition and physical activity. Intervention school-specific objectives related to increasing active transport, increasing time spent physically active at school, and supporting mental wellbeing. Data were collected in 2012 and 2014 from 656 students. Anthropometric data were objectively measured and behavioural data self-reported.

RESULTS: Proportions of overweight or obesity were similar over time within the intervention (24.5% baseline and 22.8% follow-up) and comparison groups (31.8% baseline and 30.6% follow-up). Within schools, two of the three intervention schools showed a significant decrease in the prevalence of overweight and obesity (p<0.05).

CONCLUSIONS: There was some evidence of effectiveness of the systems approach to preventing obesity among adolescents. Implications for public health: The incorporation of systems thinking has been touted as the next stage in obesity prevention and public health more broadly. These findings demonstrate that the use of systems methods can be effective on a small scale.

After reading this, I’ll have to say, No, they did not sufficiently qualify their claims. Yes, their Results section clearly indicates that the treatment and comparison groups were not comparable and that there were no apparent main effects. But it’s inappropriate to pick out some subset of comparisons and label them as “p<0.05.” Multiple comparisons is real. My concern here is not “Type 1 errors” or “Type 2 errors” or “false rejections” or “retaining the null hypothesis.” My concern here is that from noisy data you’ll be able to see patterns, and there’s no reason to believe that these noisy patterns tell us anything beyond the people and measurements in this particular dataset.
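To put a number on how easily noisy data yield presentable patterns, here’s a simulation sketch of unadjusted subgroup comparisons on pure noise (the choice of six comparisons is hypothetical, meant only to mimic a handful of school-by-outcome subgroups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

k = 6        # hypothetical number of subgroup comparisons examined
n = 100      # observations per arm in each comparison
n_sims = 5_000

# Count simulated studies where at least one comparison "works"
any_sig = 0
for _ in range(n_sims):
    ps = [stats.ttest_ind(rng.normal(0, 1, n),
                          rng.normal(0, 1, n)).pvalue
          for _ in range(k)]
    if min(ps) < 0.05:
        any_sig += 1

print(f"chance of at least one p<0.05 from pure noise: {any_sig/n_sims:.2f}")
```

With no true effects anywhere, roughly a quarter of such studies (1 − 0.95⁶ ≈ 0.26) would still turn up at least one p<0.05 result to headline.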

And then in the conclusions, yes, they say “some evidence.” But then consider the final sentence of the abstract, which for convenience I’ll repeat here:

These findings demonstrate that the use of systems methods can be effective on a small scale.


I mean, sure, they got excited when they were writing their article and this sentence slipped in. Too bad, but such things happen. But then they were lucky enough to receive thoughtful comments from McComb et al., and this was their chance to re-evaluate, to take stock of the situation and correct their errors, if for no other reason than to help future researchers not be led astray. And did they do so? No, they didn’t. Instead they muddied the waters and concluded their response with, “While we grapple with intervention and evaluation of systems approaches to prevention, we are forced to use the methods available to us which are mainly based on very linear models.” Which completely misses the point that they overstated their results and made a claim not supported by their data. As McComb et al. put it, “The stated conclusions of the article in question were not substantiated by the data and should be corrected.” And the authors of the original paper, given the opportunity to make this correction, did not do so. This behavior does not surprise me, but it still makes me unhappy.

Who cares?

What’s the point here? A suboptimal statistical analysis and misleading summary appeared in an obscure journal published halfway around the world? (OK, not so obscure; I published there once.) That seems to fall into the “Someone is wrong on the internet” category.

No, my point is not to pick on some hapless authors of a paper in the Australian and New Zealand Journal of Public Health. I needed to check the original paper to make sure McComb et al. got it right, that’s all.

My point in sharing this story is to foreground this quote from McComb et al.:

However noble the goal, research findings should be reported accurately. Distortion of results often occurs not in the data presented but, as in the current article, in the abstract, discussion, secondary literature and press releases. Such distortion can lead to unsupported beliefs about what works for obesity treatment and prevention. Such unsupported beliefs may in turn adversely affect future research efforts and the decisions of lawmakers, clinicians and public health leaders.

This is a message of general importance. It seems to be pretty hopeless to get researchers to correct the errors they’ve made in published papers, but maybe this message will get out there to students and new researchers who can do better in the future.


Really, what’s up with people? Everyone was a student, once. And as a student you make mistakes: mistakes in class, mistakes on your homework, etc. What makes people think that, suddenly, once they graduate and have a job, they can’t make serious mistakes in their work? What makes people think that, just because a paper has their name on it and happens to be published somewhere, it can’t have a serious mistake? The whole thing frankly baffles me. I make mistakes, I put my work out there, and people point out errors that I’ve made. Why do so many researchers have problems doing the same? It’s baffling. I mean, sure, I guess I understand from a psychological perspective: people have their self-image, they can feel they have a lot to lose by admitting error, etc. But from a logical perspective, it makes no sense at all.