Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)?

I happened to run across this article from 2004, “The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies,” by Scott Maxwell and published in the journal Psychological Methods.

In this article, Maxwell covers a lot of the material later discussed in the paper Power Failure by Button et al. (2013), and the 2014 paper on Type M and Type S errors by John Carlin and myself. Maxwell also points out that these alarms were raised repeatedly by earlier writers such as Cohen, Meehl, and Rozeboom, from the 1960s onwards.

In this post, I’ll first pull out some quotes from that 2004 paper that presage many of the issues of non-replication that we are still wrestling with today. Then I’ll discuss what’s been happening since 2004: what’s new in our thinking in the past fifteen years.

I’ll argue that, yes, everyone should’ve been listening to Cohen, Meehl, Rozeboom, Maxwell, etc., all along; and also that we have been making some progress, that we have some new ideas that might help us move forward.

Part 1: They said it all before

Here’s a key quote from Maxwell (2004):

When power is low for any specific hypothesis but high for the collection of tests, researchers will usually be able to obtain statistically significant results, but which specific effects are statistically significant will tend to vary greatly from one sample to another, producing a pattern of apparent contradictions in the published literature.

I like this quote, as it goes beyond the usual framing in terms of “false positives” etc., to address the larger goals of a scientific research program.

Maxwell continues:

A researcher adopting such a strategy [focusing on statistically significant patterns in observed data] may have a reasonable probability of discovering apparent justification for recentering his or her article around a new finding. Unfortunately, however, this recentering may simply reflect sampling error . . . this strategy will inevitably produce positively biased estimates of effect sizes, accompanied by apparent 95% confidence intervals whose lower limit may fail to contain the value of the true population parameter 10% to 20% of the time.

He also slams deterministic reasoning:

The presence or absence of asterisks [indicating p-value thresholds] tends to convey an air of finality that an effect exists or does not exist . . .

And he mentions the “decline effect”:

Even a literal replication in a situation such as this would be expected to reveal smaller effect sizes than those originally reported. . . . the magnitude of effect sizes found in attempts to replicate can be much smaller than those originally reported, especially when the original research is based on small samples. . . . these smaller effect sizes might not even appear in the literature because attempts to replicate may result in nonsignificant results.

Classical multiple comparisons corrections won’t save you:

Some traditionalists might suggest that part of the problem . . . reflects capitalization on chance that could be reduced or even eliminated by requiring a statistically significant multivariate test. Figure 3 shows the result of adding this requirement. Although fewer studies will meet this additional criterion, the smaller subset of studies that would now presumably appear in the literature are even more biased . . .

This was a point raised a few years later by Vul et al. in their classic voodoo correlations paper.

Maxwell points out that meta-analysis of published summaries won’t solve the problem:

Including underpowered studies in meta-analyses leads to biased estimates of effect size whenever accessibility of studies depends at least in part on the presence of statistically significant results.

And this:

Unless psychologists begin to incorporate methods for increasing the power of their studies, the published literature is likely to contain a mixture of apparent results buzzing with confusion.

And the incentives:

Not only do underpowered studies lead to a confusing literature but they also create a literature that contains biased estimates of effect sizes. Furthermore . . . researchers may have felt little pressure to increase the power of their studies, because by testing multiple hypotheses, they often assured themselves of a reasonable probability of achieving a goal of obtaining at least one statistically significant result.

And he makes a point that I echoed many years later, regarding the importance of measurement and the naivety of researchers who think that the answer to all problems is to crank up the sample size:

Fortunately, an assumption that the only way to increase power is to increase sample size is almost always wrong. Psychologists are encouraged to familiarize themselves with additional methods for increasing power.

Part 2: (Some of) what’s new

So, Maxwell covered most of the ground in 2004. Here are a few things that I would add, from my standpoint nearly fifteen years later:

1. I think the concept of “statistical power” itself is a problem in that it implicitly treats the attainment of statistical significance as a goal. As Button et al. and others have discussed, low-power studies have a winner’s curse aspect, in that if you do a “power = 0.06” study and get lucky and find a statistically significant result, your estimate will be horribly exaggerated and quite possibly in the wrong direction.

To put it another way, I fear that a typical well-intentioned researcher will want to avoid low-power studies (and, indeed, it’s trivial to talk yourself into thinking your study has high power, by just performing the power analysis using an overestimated effect size from the published literature) but will also think that a low-power study is essentially a roll of the dice. The implicit attitude is that in a study with, say, 10% power, you have a 10% chance of winning. But in such cases, a win is really a loss.
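To make the “a win is really a loss” point concrete, here is a minimal sketch of a design analysis for a single normally distributed estimate. The function name `design_analysis` and the inputs (a true effect of 0.3 standard errors, which gives power of roughly 0.06) are illustrative assumptions, not numbers taken from any of the papers discussed above.

```python
import random
from statistics import NormalDist


def design_analysis(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Return (power, type_s_rate, exaggeration_ratio).

    Assumes the estimate is normal with mean true_effect (> 0) and
    standard deviation se, and that "significant" means |estimate| > z * se.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)

    # Probability of a significant result in the right and the wrong direction.
    p_right = 1 - std_normal.cdf(z - true_effect / se)
    p_wrong = std_normal.cdf(-z - true_effect / se)
    power = p_right + p_wrong
    type_s_rate = p_wrong / power  # share of significant results with wrong sign

    # Exaggeration ratio (Type M error) by simulation:
    # average |estimate| among significant results, relative to the truth.
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z * se:
            significant.append(abs(estimate))
    exaggeration_ratio = (sum(significant) / len(significant)) / true_effect

    return power, type_s_rate, exaggeration_ratio


if __name__ == "__main__":
    power, type_s, type_m = design_analysis(true_effect=0.3, se=1.0)
    print(f"power = {power:.3f}, Type S rate = {type_s:.2f}, "
          f"exaggeration ratio = {type_m:.1f}")
```

Under these assumptions, every statistically significant estimate overstates the true effect by a factor of more than six, and a non-trivial share of them have the wrong sign.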

2. Variation in effects and context dependence. It’s not about identifying whether an effect is “true” or a “false positive.” Rather, let’s accept that in the human sciences there are no true zeroes, and relevant questions include the magnitude of effects, and how and where they vary. What I’m saying is: less “discovery,” more exploration and measurement.

3. Forking paths. If I were to rewrite Maxwell’s article today, I’d emphasize that the concern is not just multiple comparisons that have been performed, but also multiple potential comparisons. A researcher can walk through his or her data and only perform one or two analyses, but these analyses will be contingent on data, so that had the data been different, they would’ve been summarized differently. This allows the probability of finding statistical significance to approach 1, given just about any data (see, most notoriously, this story). In addition, I would emphasize that “researcher degrees of freedom” (in the words of Simmons, Nelson, and Simonsohn, 2011) arise not just in the choice of which of multiple coefficients to test in a regression, but also in which variables and interactions to include in the model, how to code data, and which data to exclude (see my above-linked paper with Loken for several examples).

4. Related to point 2 above is that some effects are really really small. We all know about ESP, but there are also other tiny effects being studied. An extreme example is the literature on sex ratios. At one point in his article, referring to a proposal that psychologists gather data on a sample of a million people, Maxwell writes, “Thankfully, samples this large are unnecessary even to detect minuscule effect sizes.” Actually, if you’re studying variation in the human sex ratio, that’s about the size of sample you’d actually need! For the calculation, see pages 645-646 of this paper.
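As a rough illustration of why sample sizes near a million are plausible here, the sketch below uses the standard normal-approximation formula for comparing two proportions. The assumed difference of 0.3 percentage points is a hypothetical value chosen for illustration, not a figure taken from the linked paper, and `n_per_group` is a made-up helper name.

```python
from statistics import NormalDist


def n_per_group(delta, p=0.5, alpha=0.05, power=0.8):
    """Approximate sample size per group to detect a difference `delta`
    between two proportions near `p` (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2 * 2 * p * (1 - p) / delta ** 2


if __name__ == "__main__":
    # Hypothetical effect: a 0.3 percentage point difference in Pr(girl).
    n = n_per_group(delta=0.003)
    print(f"about {n:,.0f} births per group, {2 * n:,.0f} in total")
```

With these inputs the total comes out in the high hundreds of thousands, and a smaller assumed effect pushes it well past a million, since the required sample size scales as 1/delta^2.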

5. Flexible theories: The “goal of obtaining at least one statistically significant result” is only relevant because theories are so flexible that just about any comparison can be taken to be consistent with theory. Remember sociologist Jeremy Freese’s characterization of some hypotheses as “more vampirical than empirical—unable to be killed by mere evidence.”

6. Maxwell writes, “it would seem advisable to require that a priori power calculations be performed and reported routinely in empirical research.” Fine, but we can also do design analysis (our preferred replacement term for “power calculations”) after the data have come in and the analysis has been published. The purpose of a design calculation is not just to decide whether to do a study or to choose a sample size. It’s also to aid in interpretation of published results.

7. Measurement.

Bob likes the big audience

In response to a colleague who was a bit scared of posting some work up on the internet for all to see, Bob Carpenter writes:

I like the big audience for two reasons related to computer science principles.

The first benefit is the same reason it’s scary. The big audience is likely to find flaws. And point them out. In public! The trick is to get over feeling bad about it and realize that it’s a super powerful debugging tool for ideas. Owning up to being wrong in public is also very liberating. Turns out people don’t hold it against you at all (well, maybe they would if you were persistently and unapologetically wrong). It also provides a great teaching opportunity—if a postdoc is confused about something in their speciality, chances are that a lot of others are confused, too.

In programming, the principle is that you want routines to fail early. You want to inspect user input and if there’s a fatal problem with it that can be detected, fail right away and let the user know what the error is. Don’t fire up the algorithm and report some deeply nested error in a Fortran matrix algorithm. Something not being shot down on the blog is like passing that validation. It gives you confidence going on.

The second benefit is the same as in any writing, only the stakes are higher with the big audience. When you write for someone else, you’re much more self critical. The very act of writing can uncover problems or holes in your reasoning. I’ve started several blog posts and papers and realized at some point as I fleshed out an argument that I was missing a fundamental point.

There’s a principle in computer science called the rubber ducky.

One of the best ways to debug is to have a colleague sit down and let you explain your bug to them. Very often halfway through the explanation you find your own bug and the colleague never even understands the problem. The blog audience is your rubber ducky.

The term itself is a misnomer in that it only really works if the rubber ducky can understand what you’re saying. They don’t need to understand it, just be capable of understanding it. Like the no free lunch principle, there are no free pair programmers.

The third huge benefit is that other people have complementary skills and knowledge. They point out connections and provide hints that can prove invaluable. We found out about automatic differentiation through a comment on the blog to a post where I was speculating about how we could calculate gradients of log densities in C++.

I guess there’s a computer science principle there, too—modularity. You can bring in whole modules of knowledge, like we did with autodiff.

I agree. It’s all about costs and benefits. The cost of an error is low if discovered early. You want to stress test, not to hide your errors and wait for them to be discovered later.
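The fail-early principle Bob describes can be sketched in a few lines. The function below is a hypothetical stand-in for an expensive routine: it checks its inputs up front and raises an informative error, rather than letting a bad value surface later as an obscure numerical failure.

```python
def weighted_mean(estimates, std_errors):
    """Precision-weighted mean of estimates (stand-in for a costly algorithm)."""
    # Validate everything before doing any real work.
    if len(estimates) == 0:
        raise ValueError("estimates must not be empty")
    if len(estimates) != len(std_errors):
        raise ValueError(
            f"got {len(estimates)} estimates but {len(std_errors)} standard errors"
        )
    for i, s in enumerate(std_errors):
        if not (s > 0):  # also catches NaN
            raise ValueError(f"std_errors[{i}] must be positive, got {s}")

    # Only now run the actual computation.
    weights = [1.0 / s ** 2 for s in std_errors]
    return sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
```

Calling `weighted_mean([1.0, 3.0], [1.0, 0.0])` fails immediately with a message naming the offending input, instead of a division-by-zero error from deep inside the computation.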

Of rabbits and cannons

When does it make sense to shoot a rabbit with a cannon?

I was reminded of this question recently when I happened to come across this exchange in the comments section from a couple years ago, in the context of finding patterns in the frequencies of births on different days:

Rahul: Yes, inverting a million element matrix for this sort of problem does have the feel of killing mice with cannon.

Andrew: In many areas of research, you start with the cannon. Once the mouse is dead and you can look at it carefully from all angles, you can design an effective mousetrap. Red State Blue State went the same way: we found the big pattern only after fitting a multilevel model, but once we knew what we were looking for, it was possible to see it in the raw data.

The curse of dimensionality and finite vs. asymptotic convergence results

Related to our (Aki, Andrew, Jonah) Pareto smoothed importance sampling paper, I (Aki) have several times received the comment: why bother with Pareto smoothing when you can always choose the proposal distribution so that the importance ratios are bounded, and then the central limit theorem holds? The curse of dimensionality here is that the papers they refer to used low-dimensional experiments, and the results do not work so well in high dimensions. Readers of this blog should not be surprised that things look different in high dimensions. In high dimensions the probability mass is far from the mode. It’s spread thinly on the surface of a high-dimensional sphere. See, e.g., Mike’s paper, Bob’s case study, and blog post.

In importance sampling, one working solution in low dimensions is to use a mixture of two proposals. One component tries to match the mode, and the other makes sure that the tails go down more slowly than the tails of the target, ensuring bounded ratios. In the following I only look at the behavior with one component which has a thicker tail, so that the importance ratios are bounded (but I have run a similar experiment with the mixture, too).

The target distribution is a multidimensional normal distribution with zero mean and unit covariance matrix. In the first case the proposal distribution is also normal, but with scale 1.1 in each dimension. The scale is just slightly larger than for the target, and we are often lucky if we can guess the scale for the proposal with 10% accuracy. I draw 100000 draws from the proposal distribution.

The following figure shows what happens when the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=512 the 100000 importance-weighted draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=1024 the convergence rate has dramatically dropped and getting any improvement in the accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic (see the paper for details). The dashed line is k=0.5, which is the limit for the variance being finite, and the dotted line is our suggestion for practically useful performance when using PSIS. But how can khat be larger than 0.5 when we have bounded weights? The central limit theorem has not failed here; we have just not yet reached the asymptotic regime where the CLT kicks in!

The next plot shows in more detail what happens with D=1024.

Since humans are lousy at looking at 1024-dimensional plots, the top subplot shows the one-dimensional marginal densities, for the target (blue) and the proposal (red), of the distance from the origin r=sqrt(sum_{d=1}^D x_d^2). The proposal density has only a 1.1 times larger scale than the target, but most of the draws from the proposal are away from the typical set of the target! The vertical dashed line shows the 1e-6 quantile of the proposal, i.e., when we draw 100000 draws, 90% of the time we don’t get any draws below it. The middle subplot shows the importance ratio function, and we can see that the highest value is at 0, but that value is larger than 2*10^42! That’s a big number. The bottom subplot rescales the y-axis so that we can see the importance ratios near that 1e-6 quantile. Check the y-axis: it’s still from 0 to 1e6. So if we are lucky we may get a draw below the dashed line, but then it’s likely to get all the weight. The importance ratio function is practically as steep everywhere we could get draws within the age of the universe. The 1e-80 quantile is at 21.5 (1e80 is the estimated number of atoms in the visible universe), and it’s still far away from the region where the boundedness of the importance ratio function starts to matter.

I have more similar plots with thick-tailed Student’s t, mixtures of proposals, etc., but I’ll save you from more plots. As long as there is some difference between the target and the proposal, taking the number of dimensions high enough breaks IS and PSIS (PSIS gives a slight improvement in performance and, more importantly, can diagnose the problem while improving the Monte Carlo estimate).

Besides taking into account that many methods which work in low dimensions can break in high dimensions, we need to focus more on finite-case performance. As seen here, it doesn’t help us that the CLT holds if we can never reach the asymptotic regime (the same reason the Metropolis algorithm in high dimensions may require close to infinite time to produce useful results). The Pareto diagnostic has been empirically shown to provide very good finite-case convergence rate estimates, which also match some theoretical bounds.
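The effective sample size part of this experiment can be reproduced with a short sketch. It is not the code behind the figure. It relies on a shortcut that holds for this particular setup: with a standard normal target and a normal proposal of scale 1.1, the log importance ratio depends on a draw only through its squared norm, which is chi-square distributed, so there is no need to simulate each coordinate. The function name `ess_gaussian_is` is made up for this example.

```python
import math
import random


def ess_gaussian_is(dim, n_draws=100_000, scale=1.1, seed=0):
    """Estimated effective sample size, (sum w)^2 / sum(w^2), for importance
    sampling of a N(0, I) target from a N(0, scale^2 I) proposal in `dim` dims."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(n_draws):
        # Write a proposal draw as x = scale * z with z ~ N(0, I);
        # |z|^2 is chi-square(dim), i.e. Gamma(shape=dim/2, scale=2).
        z_sq = rng.gammavariate(dim / 2.0, 2.0)
        # log target(x) - log proposal(x)
        log_w.append(dim * math.log(scale) - 0.5 * (scale ** 2 - 1.0) * z_sq)

    # Subtract the maximum before exponentiating to avoid overflow.
    max_log_w = max(log_w)
    w = [math.exp(lw - max_log_w) for lw in log_w]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


if __name__ == "__main__":
    for dim in (1, 64, 512, 1024):
        print(dim, round(ess_gaussian_is(dim)))
```

The estimated effective sample size stays close to the number of draws in one dimension and collapses as the dimension grows, even though the importance ratios are bounded in every case.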
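Two of the numbers in this paragraph can be checked with a few lines of arithmetic. The ratio of the target density to the proposal density at the origin is 1.1^D, and the chance of seeing no draw below the 1e-6 quantile in 100000 independent draws is (1 - 1e-6)^100000. The variable names are chosen for this sketch only.

```python
import math

dim = 1024
scale = 1.1        # proposal scale relative to the unit-scale target
n_draws = 100_000

# Target density / proposal density at the origin equals scale**dim;
# work on the log10 scale to keep the number readable.
log10_ratio_at_origin = dim * math.log10(scale)

# Probability that none of n_draws independent draws falls below
# the 1e-6 quantile of the proposal.
p_no_draw_below = (1.0 - 1e-6) ** n_draws

print(f"log10 of importance ratio at origin: {log10_ratio_at_origin:.2f}")
print(f"P(no draw below the 1e-6 quantile): {p_no_draw_below:.3f}")
```

This gives a ratio of about 2.4*10^42 at the origin and a probability of about 0.905, in line with the “larger than 2*10^42” and “90% of the time” statements.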

“Write No Matter What” . . . about what?

Scott Jaschik interviews Joli Jensen (link from Tyler Cowen), a professor of communication who wrote a new book called “Write No Matter What: Advice for Academics.”

Her advice might well be reasonable—it’s hard for me to judge; as someone who blogs a few hundred times a year, I’m not really part of Jensen’s target audience. She offers “a variety of techniques to help . . . reduce writing anxiety; secure writing time, space and energy; recognize and overcome writing myths; and maintain writing momentum.” She recommends “spending at least 15 minutes a day in contact with your writing project . . . writing groups, focusing on accountability (not content critiques), are great ways to maintain weekly writing time commitments.”

Writing is non-algorithmic, and I’ve pushed hard against advice-givers who don’t seem to get that. So, based on this quick interview, my impression is that Jensen’s on the right track.

I’d just like to add one thing: If you want to write, it helps to have something to write about. Even when I have something I really want to say, writing can be hard. I can only imagine how hard it would be if I was just trying to write, to produce, without something I felt it was important to share with the world.

So, when writing, imagine your audience, and ask yourself why they should care. Tell ’em what they don’t know.

Also, when you’re writing, be aware of your audience’s expectations. You can satisfy their expectations or confound their expectations, but it’s good to have a sense of what you’re doing.

And here’s some specific advice about academic writing, from a few years ago.

P.S. In that same post, Cowen also links to a bizarre book review by Edward Luttwak who, among other things, refers to “George Pataki of New York, whose own executive experience as the State governor ranged from the supervision of the New York City subways to the discretionary command of considerable army, air force and naval national guard forces.” The New York Air National Guard, huh? I hate to see the Times Literary Supplement fall for this sort of pontificating. I guess that there will always be a market for authoritative-sounding pundits. But Tyler Cowen should know better. Maybe it was the New York thing that faked him out. If Luttwak had been singing the strategic praises of the New Jersey Air National Guard, that might’ve set off Cowen’s B.S. meter.

How to think about the risks from low doses of radon

Nick Stockton, a reporter for Wired magazine, sent me some questions about radiation risk and radon, and Phil and I replied. I thought our responses might be of general interest so I’m posting them here.

First I wrote:

Low dose risk is inherently difficult to estimate using epidemiological studies. I’ve seen no evidence that risk is not linear at low dose, and there is evidence that areas with high radon levels have elevated levels of lung cancer. When it comes to resource allocation, we recommend that measurement and remediation be done in areas of high average radon levels but not at a national level; see here and here and, for a more technical treatment, here.

Regarding the question of “If the concerns about the linear no-threshold model for radiation risk are based on valid science, why don’t public health agencies like the EPA take them seriously?” I have no idea what goes on within the EPA, but when it comes to radon remediation, the effects of low dose exposure aren’t so relevant to the decision: if your radon level is low (as it is in most homes in the U.S.) you don’t need to do anything anyway; if your radon level is high, you’ll want to remediate; if you don’t know your radon level but it has a good chance of being high, you should get an accurate measurement and then make your decision.

For homes with high radon levels, radon is a “dangerous, proven harm,” and we recommend remediation. For homes with low radon levels, it might or might not be worth your money to remediate; that’s an individual decision based on your view of the risk and how much you can afford the $2000 or whatever to remediate.

Then Phil followed up:

The idea of hormesis [the theory that low doses of radiation can be beneficial to your health] is not quackery. Nor is LNT [the linear no-threshold model of radiation risk].

I will elaborate.

The theory behind LNT isn’t just ‘we have to assume something’, nor ‘everything is linear to first order’. The idea is that twice as much radiation means twice as many cells with damaged DNA, and if each cell with damaged DNA has a certain chance of initiating a cancer, then ceteris paribus you have LNT. That’s not crazy.

The theory behind hormesis is that your body has mechanisms for dealing with cancerous cells, and that perhaps these mechanisms become more active or more effective when there is more damage. That’s not crazy either.

Perhaps exposure to a little bit of radiation isn’t bad for you at all. Perhaps it’s even good for you. Perhaps it’s just barely bad for you, but then when you’re exposed to more, you overwhelm the repair/rejection mechanisms and at some point just a little bit more adds a great deal of risk. This goes for smoking, too: maybe smoking 1/4 cigarette per day would be good for you. For radiation there are various physiological models and there are enough adjustable parameters to get just about any behavior out of the models I have seen.

Of course what is needed is actual data. Data can be in vitro or in vivo; population-wide or case-control; etc.

There’s fairly persuasive evidence that the dose-response relationship is significantly nonlinear at low doses for “low linear-energy-transfer radiation”, aka low-LET radiation, such as x-rays. I don’t know whether the EPA still uses a LNT model for low-LET radiation.

But for high-LET radiation, including the alpha radiation emitted by radon and most of its decay products of concern, I don’t know much about the dose-response relationship at low doses and I’m very skeptical of anyone who says they do know. There are some pretty basic reasons to expect low-LET and high-LET radiation to have very different effects. Perhaps I need to explain just a bit. For a given amount of energy that is deposited in tissue, low-LET radiation causes a small disruption to a lot of cells, whereas high-LET radiation delivers a huge wallop to relatively few cells.

An obvious thing to do is to look at people who have been exposed to high levels of radon and its decay products. As you probably know, it is really radon’s decay products that are dangerous, not radon itself. When we talk about radon risk, we really mean the risk from radon and its decay products.

At high concentrations, such as those found in uranium mines, it is very clear that radiation is dangerous, and that the more you are exposed to the higher your risk of cancer. I don’t think anyone would argue against the assertion that an indoor radon concentration of, say, 20 pCi/L leads to a substantially increased risk of lung cancer. And there are houses with living area concentrations that high, although not many.

A complication is that the radon risk for smokers seems to be much higher than for non-smokers. That is, a smoker exposed to 20 pCi/L for ten hours per day for several years is at much higher risk than a non-smoker with the same level of exposure.

But what about 10, 4, 2, or 1 pCi/L? No one really knows.

One thing people have done (notably Bernard Cohen, who you’ve probably come across) is to look at the average lung cancer rate by county, as a function of the average indoor radon concentration by county. If you do that, you find that high-radon counties actually have lower lung cancer rates than low-radon counties. But: a disproportionate fraction of low-radon counties are in the South, and that’s also where smoking rates are highest. It’s hard to completely control for the effect of smoking in that kind of study, but you can do things like look within individual states or regions (for instance, look at the relationship between average county radon concentrations and average lung cancer rates in just the northeast) and you still find a slight effect of higher radon being associated with lower lung cancer rates. If taken at face value, this would suggest that a living-area concentration of 1 pCi/L or maybe even 2 pCi/L would be better than 0. But few counties have annual-average living-area radon concentration over about 2 pCi/L, and of course any individual county has a range of radon levels. Plus people move around, both within and between counties, so you don’t know the lifetime exposure of anyone. Putting it all together, even if there aren’t important confounding variables or other issues, these studies would suggest a protective effect at low radon levels but they don’t tell you anything about the risk at 10 pCi/L or 4 pCi/L.

There’s another class of studies, case-control studies, in which people with lung cancer are compared statistically to those without. In this country the best of these looked at women in Iowa. (You may have come across this work, led by Bill Field.) Iowa has a lot of farm women who don’t smoke and who have lived in just a few houses for their whole lives. Some of these women contracted lung cancer. The study made radon measurements in these houses, and in the houses of women of similar demographics who didn’t get lung cancer. They found increased risk at 4 pCi/L (even for nonsmokers, as I recall) and the results are certainly inconsistent with a protective effect at 4 pCi/L. As I recall (you should check), they also found a positive estimated risk at 2 pCi/L that is consistent with LNT but also statistically consistent with 0 effect.

So, putting it all together, what do we have? I, at least, am convinced that increased exposure leads to increased risk for concentrations above 4 pCi/L. There’s some shaky empirical evidence for a weak protective effect at 2 pCi/L compared to 0 pCi/L. In between it’s hard to say. All of the evidence below about 8 or 10 pCi/L is pretty shaky due to low expected risk, methodological problems with the studies, etc.

My informed belief is this: just as I wouldn’t suggest smoking a little bit of tobacco every day in the hope of a hormetic effect, I wouldn’t recommend a bit of exposure to high-LET radiation every day. It’s not that it couldn’t possibly be protective, but I wouldn’t bet on it. And I’m pretty sure the EPA’s recommended ‘action level’ of 4 pCi/L is indeed risky compared to lower concentrations, especially for smokers. As a nonsmoker I wouldn’t necessarily remediate if my home were at 4 pCi/L, but I would at least consider it.

For low-LET radiation, I think the scientific evidence weighs against LNT. If public health agencies don’t take LNT seriously for this type of radiation it’s possible that they acknowledge this.

For high-LET radiation, such as alpha particles from radon decay products, there’s more a priori reason to believe LNT would be a good model, and less empirical evidence suggesting that it is a bad model. It might be hard for the agencies to explicitly disavow LNT in these circumstances. At the same time, there’s not compelling evidence in favor of LNT even for this type of radiation, and life is a lot simpler if you don’t take LNT ‘seriously’.

“Service” is one of my duties as a professor—the three parts of this job are teaching, research, and service—and, I guess, in general, those of us who’ve had the benefit of higher education have some sort of duty to share our knowledge when possible. So I have no problem answering reporters’ questions. But reporters have space constraints: you can send a reporter a long email or talk on the phone for an hour, and you’ll be lucky if one sentence of your hard-earned wisdom makes its way into the news article. So much effort all gone! It’s good to be able to post here and reach some more people.

Research project in London and Chicago to develop and fit hierarchical models for development economics in Stan!

Rachael Meager at the London School of Economics and Dean Karlan at Northwestern University write:

We are seeking a Research Assistant skilled in R programming and the production of R packages.

The successful applicant will have experience creating R packages accessible on GitHub or CRAN, and ideally will have experience working with RStan. The main goal of this position is to create R packages to allow users to run RStan models for Bayesian hierarchical evidence aggregation, including models on CDFs, sets of quantiles, and tailored parametric aggregation models as in Meager (2016). Ideally the resulting package will keep RStan “under the hood” with minimal user input, as in rstanarm. Further work on this project is likely to involve programming functions to allow users to run models to predict individual heterogeneity in treatment effects conditional on covariates in a hierarchical setting, potentially using ensemble methods such as Bayesian additive regression trees. Part of this work may involve developing and testing new statistical methodology. We aim to create comprehensively-tested packages that alert users when the underlying routines may have performed poorly. The application of our project is situated in development economics with a focus on understanding the potentially heterogeneous effects of the BRAC Graduation program to alleviate poverty.

The ideal candidate will have the right to work in the UK, and be able to make weekly trips to London to meet with the research team. However a remote arrangement may be possible for the right candidate, and for those without the right to work in the UK, hiring can be done through Northwestern University. The start date is flexible but we aim to hire by the middle of March 2018.

The ideal candidate would commit 20-30 hours a week on a contracting basis, although the exact commitment is flexible. Pay rate is negotiable and commensurate with academic research positions and the candidate’s experience. Formally, the position is on a casual basis, but our working arrangement is also flexible with the option to work a number of hours corresponding to full-time, part time or casual. A 6-12 month commitment is ideal, with the option to extend pending satisfactory performance and funding availability, but a shorter commitment can be negotiated. Applications will be evaluated beginning immediately until the position is filled.

Please send your applications via email, attaching your CV and the links to any relevant packages or repositories, with the subject line “R programmer job” to and cc and

Use multilevel modeling to correct for the “winner’s curse” arising from selection of successful experimental results

John Snow writes:

I came across this blog by Milan Shen recently and thought you might find it interesting.

A couple of things jumped out at me. It seemed like the so-called ‘Winner’s Curse’ is just another way of describing the statistical significance filter. It also doesn’t look like their correction method is very effective. I’d be very curious to hear your take on this work, especially this idea of a ‘Winner’s Curse’. I suspect the airbnb team could benefit from reading some of your work when it comes to dealing with these problems!

My reply: Yes, indeed I’ve used the term “winner’s curse” in this context. Others have used the term too.

Also here.

Here’s a paper discussing the bias.

I think the right thing to do is fit a multilevel model.
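Here's a minimal sketch of what that buys you, using simulated data rather than anything from the Airbnb post. The number of experiments, the spread of true effects (tau), and the standard error (se) are all made-up assumptions, and the "multilevel model" here is the simplest possible one: a normal-normal model with the between-experiment variance estimated by the method of moments, not a full Bayesian fit.

```python
import random
import statistics

random.seed(1)

# Hypothetical setup: 2000 experiments, true effects drawn from N(0, tau^2),
# each estimated with the same known standard error. All numbers are made up.
tau, se, n = 1.0, 2.0, 2000
true_effects = [random.gauss(0.0, tau) for _ in range(n)]
estimates = [t + random.gauss(0.0, se) for t in true_effects]

# Method-of-moments estimate of the between-experiment variance:
# var(estimates) is approximately tau^2 + se^2.
tau2_hat = max(statistics.pvariance(estimates) - se**2, 0.0)
mu_hat = statistics.fmean(estimates)

# Partial pooling: shrink every raw estimate toward the overall mean.
shrink = tau2_hat / (tau2_hat + se**2)
pooled = [mu_hat + shrink * (y - mu_hat) for y in estimates]

# "Winners": experiments whose raw estimate is significantly positive.
winners = [i for i, y in enumerate(estimates) if y > 1.96 * se]

raw_bias = statistics.fmean([estimates[i] - true_effects[i] for i in winners])
pooled_bias = statistics.fmean([pooled[i] - true_effects[i] for i in winners])
print(f"number of winners: {len(winners)}")
print(f"mean error of raw estimates among winners:    {raw_bias:.2f}")
print(f"mean error of pooled estimates among winners: {pooled_bias:.2f}")
```

The shrinkage is applied to all experiments before any selection happens, which is why the selected subset comes out roughly unbiased in this simulation. Correcting only the winners after the fact is a different and harder problem.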

The Lab Where It Happens

“Study 1 was planned in 2007, but it was conducted in the Spring of 2008 shortly after the first author was asked to take a 15-month leave-of-absence to be the Executive Director for USDA’s Center for Nutrition Policy and Promotion in Washington DC. . . . The manuscript describing this pair of studies did not end up being drafted until about three years after the data for Study 1 had been collected. At this point, the lab manager, post-doctoral student, and research assistants involved in the data collection for this study had moved away. The portion of the data file that we used for the study had the name of each student and the location where their data was collected but not their age. Four factors led us to wrongly assume that the students in Study 1 must have been elementary school students . . .

The conclusions of both studies and the conclusions of the paper remain strong after correcting for these errors.”

— Brian Wansink, David R. Just, Collin R. Payne, Matthew Z. Klinger, Preventive Medicine (2018).

This reminds me of a song . . .

Ah, Mister Editor
Mister Prof, sir
Did’ya hear the news about good old Professor Stapel
You know Lacour Street
They renamed it after him, the Stapel legacy is secure
And all he had to do was lie
That’s a lot less work
We oughta give it a try
Now how’re you gonna get your experiment through
I guess I’m gonna fin’ly have to listen to you
Measure less, claim more
Do whatever it takes to get my manuscript on the floor
Now, Reviewers 1 and 2 are merciless
Well, hate the data, love the finding
Food and Brand
I’m sorry Prof, I’ve gotta go
Decisions are happening over dinner
Two professors and an immigrant walk into a restaurant
Diametric’ly opposed, foes
They emerge with a compromise, having opened doors that were
Previously closed
The immigrant emerges with unprecedented citation power
A system he can shape however he wants
The professors emerge in the university
And here’s the pièce de résistance
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
No one really knows how the game is played
The art of the trade
How the sausage gets made
We just assume that it happens
But no one else is in
The lab where it happens
Prof claims
He was in Washington offices one day
In distress ‘n disarray
The Uni claims
His students said
I’ve nowhere else to turn
And basic’ly begged me to join the fray
Student claims
I approached the P.I. and said
I know you have the data, but I’ll tell you what to say
Professor claims
Well, I arranged the meeting
I arranged the menu, the venue, the seating
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
No one really knows how the
Journals get to yes
The pieces that are sacrificed in
Ev’ry game of chess
We just assume that it happens
But no one else is in
The room where it happens
Scientists are grappling with the fact that not ev’ry issue can be settled by committee
Journal is fighting over where to put the retraction
It isn’t pretty
Then pizza-man approaches with a dinner and invite
And postdoc responds with well-trained insight
Maybe we can solve one problem with another and win a victory for the researchers, in other words
Oh ho
A quid pro quo
I suppose
Wouldn’t you like to work a little closer to home
Actually, I would
Well, I propose the lunchroom
And you’ll provide him his grants
Well, we’ll see how it goes
Let’s go
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
No one else was in
The lab where it happened
The lab where it happened
The lab where it happened
My data
In data we trust
But we’ll never really know what got discussed
Click-boom then it happened
And no one else was in the room where it happened
Professor of nutrition
What did they say to you to get you to sell your theory down the river
Professor of nutrition
Did the editor know about the dinner
Was there citation index pressure to deliver
All the coauthors
Or did you know, even then, it doesn’t matter
Who ate the carrots
‘Cause we’ll have the journals
We’re in the same spot
You got more than you gave
And I wanted what I got
When you got skin in the game, you stay in the game
But you don’t get a win unless you play in the game
Oh, you get love for it, you get hate for it
You get nothing if you
Wait for it, wait for it, wait
God help and forgive me
I wanna build
Something that’s gonna
Outlive me
What do you want, Prof
What do you want, Prof
If you stand for nothing
Prof, then what do you fall for
Wanna be in
The lab where it happens
The lab where it happens
Wanna be in
The lab where it happens
The lab where it happens
Wanna be
In the lab where it happens
I wanna be in the lab
I wanna be in
The lab where it happens
The lab where it happens
The lab where it happens
I wanna be in the lab
Where it happens
The lab where it happens
The lab where it happens
The art of the compromise
Hold your nose and close your eyes
We want our leaders to save the day
But we don’t get a say in what they trade away
We dream of a brand new start
But we dream in the dark for the most part
Dark as a scab where it happens
I’ve got to be in
The lab (where it happens)
I’ve got to be (the lab where it happens)
I’ve got to be (the lab where it happens)
Oh, I’ve got to be in
The lab where it happens
I’ve got to be, I’ve gotta be, I’ve gotta be
In the lab
Click bab

(Apologies to Lin-Manuel Miranda. Any resemblance to persons living or dead is entirely coincidental.)

P.S. Yes, these stories are funny—the missing carrots and all the rest—but they’re also just so sad, to think that this is what our scientific establishment has come to. I take no joy from these events. We laugh because, after a while, we get tired of screaming.

I just wish Veronica Geng were still around to write about these hilarious/horrible stories. I just can’t do them justice.

One data pattern, many interpretations

David Pittelli points us to this paper: “When Is Higher Neuroticism Protective Against Death? Findings From UK Biobank,” and writes:

They come to a rather absurd conclusion, in my opinion, which is that neuroticism is protective if, and only if, you say you are in bad health, overlooking the probability that neuroticism instead makes you pessimistic when describing your health.

Here’s the abstract of the article, by Catharine Gale, Iva Cukic, G. David Batty, Andrew McIntosh, Alexander Weiss, and Ian Deary:

We examined the association between neuroticism and mortality in a sample of 321,456 people from UK Biobank and explored the influence of self-rated health on this relationship. After adjustment for age and sex, a 1-SD increment in neuroticism was associated with a 6% increase in all-cause mortality (hazard ratio = 1.06, 95% confidence interval = [1.03, 1.09]). After adjustment for other covariates, and, in particular, self-rated health, higher neuroticism was associated with an 8% reduction in all-cause mortality (hazard ratio = 0.92, 95% confidence interval = [0.89, 0.95]), as well as with reductions in mortality from cancer, cardiovascular disease, and respiratory disease, but not external causes. Further analyses revealed that higher neuroticism was associated with lower mortality only in those people with fair or poor self-rated health, and that higher scores on a facet of neuroticism related to worry and vulnerability were associated with lower mortality. Research into associations between personality facets and mortality may elucidate mechanisms underlying neuroticism’s covert protection against death.

The abstract is admirably modest in its claims; still, Pittelli’s criticism seems reasonable to me. I’m generally suspicious of reading too much into this sort of interaction in observational data. The trouble is that there are so many possible theories floating around, so many ways of explaining a pattern in data. I think it’s a good thing that the Gale et al. paper was published: they found a pattern in data and others can work to understand it.
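To make Pittelli's alternative story concrete, here's a toy simulation (not the Gale et al. data or their Cox model). Every coefficient is a made-up assumption: neuroticism is slightly harmful, true health is unobserved, and neurotic people rate their own health lower than it really is. A plain linear risk score stands in for mortality.

```python
import random

random.seed(2)
n = 50_000

# Hypothetical data-generating process (all coefficients are made up):
#   health: true underlying health, unobserved by the analyst
#   neuro:  neuroticism, independent of true health
#   risk:   mortality risk score; neuroticism is slightly HARMFUL (+0.05)
#   srh:    self-rated health; neurotic people rate themselves lower
health = [random.gauss(0, 1) for _ in range(n)]
neuro = [random.gauss(0, 1) for _ in range(n)]
risk = [-h + 0.05 * x + random.gauss(0, 1) for h, x in zip(health, neuro)]
srh = [h - 0.5 * x + random.gauss(0, 1) for h, x in zip(health, neuro)]

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Unadjusted: simple regression of risk on neuroticism.
b_unadjusted = cov(neuro, risk) / cov(neuro, neuro)

# Adjusted: regression of risk on neuroticism AND self-rated health,
# solved directly from the 2x2 normal equations.
s_nn, s_ss, s_ns = cov(neuro, neuro), cov(srh, srh), cov(neuro, srh)
s_ny, s_sy = cov(neuro, risk), cov(srh, risk)
det = s_nn * s_ss - s_ns**2
b_adjusted = (s_ss * s_ny - s_ns * s_sy) / det

print(f"neuroticism coefficient, unadjusted:                {b_unadjusted:+.3f}")
print(f"neuroticism coefficient, adjusted for self-rating:  {b_adjusted:+.3f}")
```

In this setup the sign flips after adjustment even though neuroticism protects no one: among people reporting the same health, the more neurotic ones are, on average, actually healthier. That doesn't show the paper's interpretation is wrong, only that the same data pattern is consistent with more than one story.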

Testing Seth Roberts’ appetite theory

Jonathan Tupper writes:

My organization is running a group test of Seth Roberts’ old theory about appetite.

We are running something like a “web trial” as discussed in your Chance article with Seth. And in fact our design was very inspired by your conversation… For one, we are using a control group which takes light olive oil *with* meals as you mentioned. We are also testing the mechanism of hunger rather than the outcome of weight loss. This is partly for pragmatic reasons about the variability of the measures, but it’s also an attempt to address the concern you raised that the mechanism is the 2 hour flavorless window itself. Not eating for two hours probably predicts weight loss but it wouldn’t seem to predict less hunger!

Here’s how to sign up for their experiment. I told Tupper that I found the documentation at that webpage to be confusing, so they also prepared this short document summarizing their plan.

I know nothing about these people but I like the idea of testing Seth’s diet, so I’m sharing this with you. (And I’m posting it now rather than setting it at the end of the queue so they can get their experimental data sooner rather than later.) Feel free to post your questions/criticisms/objections/thoughts in comments.

3 quick tricks to get into the data science/analytics field

John McCool writes:

Do you have advice getting into the data science/analytics field? I just graduated with a B.S. in environmental science and a statistics minor and am currently interning at a university. I enjoy working with datasets from sports to transportation and doing historical analysis and predictive modeling.

My quick advice is to avoid interning at a university as I think you’ll learn more by working in the so-called real world. If you work at a company and are doing analytics, try to do your best work, don’t just be satisfied with getting the job done, and if you’re lucky you’ll interact with enough people that you’ll find the ones who you like, who you can work with. You can also go to local tech meetups to stay exposed to new ideas.

But maybe my advice is terrible, I have no idea, so others should feel free to share your advice and experience in the comments.

Hysteresis corner: “These mistakes and omissions do not change the general conclusion of the paper . . .”

All right, then. The paper’s called “Attractive Data Sustain Increased B.S. Intake in Journals” . . . no, sorry, “Attractive Names Sustain Increased Vegetable Intake in Schools.”

Seriously, though, this is just an extreme example of a general phenomenon, which we might call scientific hysteresis or the research incumbency advantage:

When you’re submitting a paper to a journal, it can be really hard to get it accepted, and any possible flaw in your reasoning detected by a reviewer is enough to stop publication. But when a result has already been published, it can be really hard to get it overturned. All of a sudden, the burden is on the reviewer, not just to point out a gaping hole in the study but to demonstrate precisely where that hole led to an erroneous conclusion. Even when it turns out that a paper has several different mistakes (including, in the above example, mislabeling preschoolers as elementary school students, a change that entirely changes the intervention being studied), the author is allowed to claim, “These mistakes and omissions do not change the general conclusion of the paper.” It’s the research incumbency effect.

As I wrote in the context of a different paper, where t-statistics of 1.8 and 3.3 were reported as 5.03 and 11.14 and the authors wrote that this “does not change the conclusion of the paper”:

This is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.
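For a rough sense of what that correction does to the p-values, here's a quick calculation. The degrees of freedom aren't given above, so this assumes a large-sample normal approximation; with a small sample the p-values for the corrected statistics would be somewhat larger.

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a test statistic, using the normal approximation
    (an assumption here: the actual degrees of freedom are not given)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Corrected values (1.8, 3.3) and the originally reported ones (5.03, 11.14).
for t in (1.8, 3.3, 5.03, 11.14):
    print(f"t = {t:5.2f}  ->  two-sided p = {two_sided_p(t):.2g}")
```

Under this approximation the corrected value of 1.8 no longer clears the conventional .05 threshold, while the reported 5.03 would have cleared it by several orders of magnitude.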

In some sense, maybe that’s fine. If these are the rules that the medical and psychology literatures want to play by, that’s their choice. It could be that the theories that these researchers come up with are so valuable, that it doesn’t really matter if they get the details wrong: the data are in some sense just an illustration of their larger points. Perhaps an idea such as “Attractive names sustain increased vegetable intake in schools” is so valuable—such a game-changer—that it should not be held up just because the data in some particular study don’t quite support the claims that were made. Or perhaps the claims in that paper are so robust that they hold up even despite many different errors.

OK, fine, let’s accept that. Let’s accept that, ultimately what matters is that a paper has a grabby idea that could change people’s lives, a cool theory that could very well be true. Along with a grab bag of data and some p-values. I don’t really see why the data are even necessary, but whatever. Maybe some readers have so little imagination that they can’t process an idea such as “Attractive names sustain increased vegetable intake in schools” without a bit of data, of some sort, to make the point.

Again, OK, fine, let’s go with that. But in that case, I think these journals should accept just about every paper sent to them. That is, they should become Arxiv.

Cos if multiple fatal errors in a paper aren’t enough to sink it in post-publication review, why should they be enough to sink it in pre-publication review?

Consider the following hypothetical Scenario 1:

Author A sends paper to journal B, whose editor C sends it to referee D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: OK, we’ll reject the paper. Sorry for sending this pile o’ poop to you in the first place!

And now the alternative, Scenario 2:

Author A sends paper to journal B, whose editor C accepts it. Later, the paper is read by outsider D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: We sent your comments to the author who said that the main conclusions of the paper are unaffected.

D: #^&*$#@

[many months later, if ever]

C: The author published a correction, saying that the main conclusions of the paper are unaffected.

Does that really make sense? If the journal editors are going to behave that way in Scenario 2, why bother with Scenario 1 at all?

“Heating Up in NBA Free Throw Shooting”

Paul Pudaite writes:

I demonstrate that repetition heats players up, while interruption cools players down in NBA free throw shooting. My analysis also suggests that fatigue and stress come into play. If, as seems likely, all four of these effects have comparable impact on field goal shooting, they would justify strategic choices throughout a basketball game that take into account the hot hand. More generally my analysis motivates approaching causal investigation of the variation in the quality of all types of human performance by seeking to operationalize and measure these effects. Viewing the hot hand as a dynamic, causal process motivates an alternative application of the concept of the hot hand: instead of trying to detect which player happens to be hot at the moment, promote that which heats up you and your allies.

Pudaite says his paper is related to this post (and also, of course, this).

Return of the Klam

Matthew Klam is back. This time for reals. I’m halfway through reading his new book, Who is Rich?, and it’s just wonderful. The main character is a cartoonist and illustrator, and just about every scene is filled with stunning and hilarious physical descriptions. If I were writing a blurb, I’d call Who is Rich? the most sheerly enjoyable book I’ve read in a long time. Cos it is.

Here’s the story behind Klam’s strangely interrupted career, as reported by Taffy Brodesser-Akner: “Matthew Klam’s New Book Is Only 17 Years Overdue“:

In 1993, Matthew Klam was sitting in his room at the Fine Arts Work Center in Provincetown, where he was a fellow, when he received a call from Daniel Menaker at The New Yorker saying that they were interested in buying his short story “Sam the Cat.” . . . an outrageous success — it sparkled with human observation that is so true it makes you cringe — so he wrote a short-story collection, also called Sam the Cat, for Random House. [That volume, which came out in 2000, really is great. The title story is excellent but it’s not even my favorite one in the book.] Klam won an O. Henry Award, a Whiting Award, an NEA grant, a PEN Robert W. Bingham Prize, and a Guggenheim fellowship. . . .

But Klam was not so happy with what he was producing:

He felt like a fraud. All those awards were for things he was going to do in the future, and he didn’t know if he had anything left to say. . . . In 2007, he realized he didn’t have another short story in him . . . [In the following years,] Klam sat in his unfinished basement, temperature 55 degrees, even with a space heater, and asked himself how to make something again. He threw away what his wife and friends and editors suspect was an entire book’s worth of material. . . .

What happened to Matthew Klam, Matthew Klam explains, wasn’t as simple as creative paralysis. He’d gotten a tenure-track teaching position at Johns Hopkins in 2010, in the creative-writing department. It was a welcome respite from the spotlight . . . Teaching allowed him to write and to feel like no one was really watching. . . . each day he returned home from Baltimore and sat in his basement and waited for his novel to become apparent to him.

And then . . .

Finally, his hand was forced. As part of his tenure-track review, he had to show what he was working on to his department-assigned mentor and the chair of the department. Klam showed her the first 100 pages of Who Is Rich?. He was worried his voice was no longer special. He was worried it was the same old thing.

Get this:

The department supervisor found the pages “sloppily written” and “glib and cynical” and said that if he didn’t abandon the effort, she thought he would lose his job.

Hey! Not to make this blog all about me or anything, but that’s just about exactly what happened to me back in 1994 or so when the statistics department chair stopped me in the hallway and told me that he’d heard I’d been writing this book on Bayesian statistics and that if I was serious about tenure, I should work on something else. I responded that I’d rather write the book and not have tenure at Berkeley than have tenure at Berkeley and not write the book. And I got my wish!

So did Klam:

Suddenly, forcefully, he was sure that this wasn’t a piece of shit. This was his book. He told his boss he was going to keep working on the book . . . He sold Who Is Rich? before it was complete, based on 120 pages, in 2013. In 2016, he was denied tenure.

Ha! It looks like the Johns Hopkins University English Department dodged a bullet on that one.

Still, it seems to have taken Klam another four years to write the next 170 or so pages of the book. That’s slow work. That’s fine—given the finished product, it was all worth it—it just surprises me: Klam’s writing seems so fluid, I would’ve thought he could spin out 10 pages a day, no problem.

P.S. One thing I couldn’t quite figure out from this article is what Klam does for a living, how he paid the bills all those years. I hope he writes more books and that they come out more frequently than once every 17 years. But it’s tough to make a career writing books; you pretty much have to do it just cos you want to. I don’t think they sell many copies anymore.

P.P.S. There was one other thing that bothered me about that article, which was when we’re told that the academic department supervisor didn’t like Klam’s new book: “‘They like Updike,’ Klam explains of his department’s reaction and its conventional tastes. ‘They like Alice Munro. They love Virginia Woolf.'” OK, Klam’s not so similar to Munro and Woolf, but he’s a pretty good match for Updike: middle-aged, middle-class white guy writing about adultery among middle-aged white guys in the Northeast. Can’t get much more Updikey than that. I’m not saying Johns Hopkins was right to let Klam go, not at all, I just don’t think it seems quite on target to say they did it because of their “conventional tastes.”

P.P.P.S. Also this. Brodesser-Akner writes:

He wanted a regular writing life and everything that went with it . . . Just a regular literary career — you know, tenure, a full professorship, a novel every few years.

That just made me want to cry. I mean, a tenured professorship is great, I like my job. But that’s “a regular literary career” now? That’s really sad, compared to the days when a regular literary career meant that you could support yourself on your writing alone.

I’m skeptical of the claims made in this paper

Two different people pointed me to a recent research article, suggesting that the claims therein were implausible and the result of some combination of forking paths and spurious correlations—that is, there was doubt that the results would show up in a preregistered replication, and that, if they did show up, that they would mean what was claimed in the article.

One of my correspondents amusingly asked if we should see if the purported effect in this paper interacts with a certain notorious effect that has been claimed, and disputed, in this same subfield. I responded by listing three additional possible explanations from the discredited literature, thus suggesting a possible five-way interaction.

I followed up with, “I assume I can quote you if I blog this? (I’m not sure I will . . . do I really need to make N more enemies???)”

My correspondent replied: “*I* definitely don’t need N more enemies. I’m an untenured postdoc on a fixed contract. If you blog about it, could you just say the paper was sent your way?”

So that’s what I’m doing!

P.S. Although I find the paper in question a bit silly, I have no objection whatsoever to it being put out there. Speculative theory is fine, speculative data analysis is fine too. Indeed, one of the problems with the current system of scientific publication is that speculation generally isn’t enough: you have to gussy up your results with p-values and strong causal claims, or else you can have difficulty getting them published anywhere.

“No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality”

Robert Bloomfield, Kristina Rennekamp, and Blake Steenhoven sent along this paper that compares “a registration-based Editorial Process (REP). Authors submitted proposals to gather and analyze data; successful proposals were guaranteed publication as long as the authors lived up to their commitments, regardless of whether results supported their predictions” to “the Traditional Editorial Process (TEP).”

Here’s what they found:

[N]o system is perfect. Registration is a useful complement to the traditional editorial process, but is unlikely to be an adequate replacement. By encouraging authors to shift from follow-up investment to up-front investment, REP encourages careful planning and ambitious data gathering, and reduces the questionable practices and selection biases that undermine the reproducibility of results. But the reduction in follow-up investment leaves papers in a less-refined state than the traditional process, leaving useful work undone. Because accounting is a small field, with journals that typically publish a small number of long articles, subsequent authors may have no clear opportunity to make a publishable contribution by filling in the gaps.

With experience, we expect that authors and editors will learn which editorial process is better suited to which types of studies, and learn how to draw inferences differently from papers produced by these very different systems. We also see many ways that both editorial processes could be improved by moving closer toward each other. REP would be improved by encouraging more outside input before proposals are accepted, and more extensive revisions after planned analyses are conducted, especially those relying on forms of discretion that our community sees as most helpful and least harmful. TEP would be improved by demanding more complete and accurate descriptions of procedures (as Journal of Accounting Research has been implementing for several years (updated JAR [2018]), not only those that help subsequent authors follow those procedures, but also those that help readers interpret p-values in light of the alternatives that authors considered and rejected in calculating them. REP and TEP would complement one another more successfully if journals would be more open to publishing short articles under TEP that fill in the gaps left by articles published under REP.

They also share some anecdotes:

“I was serving as a reviewer for a paper at a top journal, and the original manuscript submitted by the authors had found conflicting results relating to the theory they had proposed–in other words, some of the results were consistent with expectations derived from the theory while others were contrary. The other reviewer suggested that the authors consider a different theory that was, frankly, a better fit for the situation and that explained the pattern of results very well–far better than the theory proposed by the authors. The question immediately arose as to whether it would be ethical and proper for the authors to rewrite the manuscript with the new theory in place of the old. This was a difficult situation because it was clear the authors had chosen a theory that didn’t fit the situation very well, and had they been aware (or had thought of) the alternate theory suggested by the other reviewer, they would have been well advised on an a priori basis to select it instead of the one they went with, but I had concerns about a wholesale replacement of a theory after data had been collected to test a different theory. On the other hand, the instrument used in collecting the data actually constituted a reasonably adequate way to test the alternate theory, except, of course that it wasn’t specifically designed to differentiate between the two. I don’t recall exactly how the situation was resolved as it was a number of years ago, but my recollection is that the paper was published after some additional data was collected that pointed to the alternate theory.”

-Respondent 84, Full Professor, Laboratory Experiments 

“As an author, I have received feedback from an editor at a Top 3 journal that the economic significance of the results in the paper seemed a little too large to be fully explained by the hypotheses. My co-authors and I were informed by the editor of an additional theoretical reason why the effects sizes could be that large and we were encouraged by the editor to incorporate that additional discussion into the underlying theory in the paper.  My co-authors and I agreed that the theory and arguments provided by the editor seemed reasonable. As a result of incorporating this suggestion, we believe the paper is more informative to readers.”

-Respondent 280, Assistant Professor, Archival

“As a doctoral student, I ran a 2x2x2 design on one of my studies. The 2×2 of primary interest worked well in one level of the last variable but not at all in the other level. I was advised by my dissertation committee not to report the results for participants in the one level of that factor that “didn’t work” because that level of the factor was not theoretically very important and the results would be easier to explain and essentially more informative. As a result, I ended up reporting only the 2×2 of primary interest with participants from the level of the third variable where the 2×2 held up. To this day, I still feel a little uncomfortable about that decision, although I understood the rationale and thought it made sense.”

-Respondent 85, Full Professor, Laboratory Experiments

This all seems relevant to discussions of preregistration, post-publication review, etc.

What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better? (My talk at the University of Michigan Friday 2pm)

Tomorrow (Fri 9 Feb) 2pm at the NCRC Research Auditorium (Building 10) at the University of Michigan:

What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Evidence-based medicine” sounds like a good idea, but it can run into problems when the evidence is shaky. We discuss several different reasons that even clean randomized clinical trials can fail to replicate, and we discuss directions for improvement in design and data collection, statistical analysis, and the practices of the scientific community. See this paper: and this one:

354 possible control groups; what to do?

Jonas Cederlöf writes:

I’m a PhD student in economics at Stockholm University and a frequent reader of your blog. I have for a long time followed your quest in trying to bring attention to p-hacking and multiple comparison problems in research. I’m now myself faced with the aforementioned problem and want to at the very least try to avoid picking (or being subject to the critique of having picked) control group which merely gives me fancy results. The setting is the following,

I run a difference-in-difference (DD) model between occupations where people working in occupation X are treated at year T. There are 354 other types of occupations where for at least 20-30 of them I could make up a “credible” story about why they would be a natural control group. One could of course run the DD-estimation on the treated group vs. the entire labor market, but claiming causality between the reform and the outcome hinges not only on the parallel trend assumption but also on the absence of group-specific shocks. Hence one might want to find a control group that would be subjected to the same type of shocks as the treated occupation X, so one might be better off picking specific occupations from the rest of the 354 categories. Some of these might have parallel trends and others not, but wouldn’t it be p-hacking choosing groups like this, based on parallel trends? The reader has no guarantee that I as a researcher haven’t picked control groups that give me the results that will get me published?

So in summary: When one has 1 treated group and 354 potential control groups, how does one go about choosing among these?

My response: rather than picking one analysis (either ahead of time or after seeing the data), I suggest you do all 354 analyses and put them together using a hierarchical model as discussed in this paper. Really, this is not doing 354 analyses, it’s doing one analysis that includes all these comparisons.
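To give a rough feel for what "one analysis that includes all these comparisons" can look like, here is a minimal sketch of partial pooling. It is not the model from the paper: it's a crude moment-matching stand-in that assumes each of the 354 control groups yields a DD estimate with a known standard error, and that the underlying effects share a common normal distribution. The function name `partial_pool` and all the simulated numbers are hypothetical.

```python
import random
import statistics


def partial_pool(estimates, std_errors):
    """Shrink noisy per-comparison estimates toward their common mean.

    Assumes estimate_j ~ Normal(effect_j, se_j^2) and
    effect_j ~ Normal(mu, tau^2), with mu and tau^2 set by moment matching.
    """
    mu = statistics.fmean(estimates)
    # Observed spread = true spread (tau^2) + average sampling variance.
    tau2 = max(0.0, statistics.variance(estimates)
               - statistics.fmean(se ** 2 for se in std_errors))
    pooled = []
    for est, se in zip(estimates, std_errors):
        weight = tau2 / (tau2 + se ** 2)  # 0 = full pooling, 1 = no pooling
        pooled.append(mu + weight * (est - mu))
    return pooled


def mse(xs, truth):
    """Mean squared error of xs against the known simulated truth."""
    return statistics.fmean((x - t) ** 2 for x, t in zip(xs, truth))


# Simulated example: 354 control groups, modest true variation, noisy estimates.
rng = random.Random(1)
true_effects = [rng.gauss(0.2, 0.1) for _ in range(354)]
std_errors = [0.3] * 354
raw = [rng.gauss(t, se) for t, se in zip(true_effects, std_errors)]
pooled = partial_pool(raw, std_errors)

print("MSE of raw estimates:   ", mse(raw, true_effects))
print("MSE of pooled estimates:", mse(pooled, true_effects))
```

In this simulation the pooled estimates should sit closer to the truth than the raw ones, because no single noisy comparison gets to stand on its own. A full hierarchical model would also carry the uncertainty in the group-level mean and variance, which this sketch ignores.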

Eid ma clack shaw zupoven del ba.

When I say “I love you”, you look accordingly skeptical – Frida Hyvönen

A few years back, Bill Callahan wrote a song about the night he dreamt the perfect song. In a fever, he woke and wrote it down before going back to sleep. The next morning, as he struggled to read his handwriting, he saw that he’d written the nonsense that forms the title of this post.

Variational inference is a lot like that song; dreams of the perfect are ruined in the harsh glow of the morning after.

(For more unnaturally tortured metaphors see my twitter. I think we can all agree setting one up was a fairly bad choice for me. Edit: Yup. I almost lasted three weeks, but it is not the medium for me. )

But how can we tell if variational inference has written the perfect song or, indeed, if it has laid an egg? Unfortunately, there doesn’t seem to be a lot of literature to guide us. We (Yuling, Aki, Me, and Andrew) have a new paper to give you a bit more of an idea.


The guiding principle of variational inference is that if it’s impossible to work with the true posterior p(\theta \mid y), then near enough is surely good enough. (It seldom is.)

In particular, we try to find the  member q^*(\theta) of some tractable set of distributions \mathcal{Q} (commonly the family of multivariate Gaussian distributions with diagonal covariance matrices) that minimizes the Kullback-Leibler divergence

q^*(\theta) = \arg \min_{q \in \mathcal{Q}} KL\left[\,q(\cdot) \mid p(\cdot \mid y)\,\right].

The Kullback-Leibler divergence in this direction (it’s asymmetric, so the order of arguments is important) can be interpreted as the amount of information lost if we replace the approximate posterior q(\theta) with the true posterior p(\theta \mid y). Now, if this seems like the wrong way around to you [that we should instead worry about what happens if we replace the target posterior p(\theta \mid y) with the approximation q(\theta)], you would be very very correct. That Kullback-Leibler divergence is backwards.

What does this mean? Well it means that we won’t penalize approximate distributions that are much less complex than the true one as heavily as we should. How does this translate into real life? It means that usually we will end up with approximations q^*(\theta) that are narrower than the true posterior. Usually this manifests as distributions with lighter tails.

(Quiet side note: Why are those last few sentences so wishy-washy? Well it turns out that minimizing a Kullback-Leibler divergence in the wrong direction can do all kinds of things to the resulting minimizer and it’s hard to really pin down what will happen. But it’s almost everyone’s experience that the variational posterior q(\theta) is almost always narrower than the true posterior. So the previous paragraph is usually true.)
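One case where the narrowing can be pinned down exactly is a Gaussian target approximated by a Gaussian with diagonal covariance. The sketch below is a toy under stated assumptions: the "true posterior" is a zero-mean bivariate normal with unit variances and correlation rho, and the approximating family is N(0, v I). It uses the closed-form KL divergence between Gaussians and a simple grid search over v; the function names are made up for this illustration.

```python
import math


def kl_gauss_2d(var_q, rho, direction):
    """KL divergence between q = N(0, var_q * I) and p = N(0, [[1, rho], [rho, 1]]).

    direction="reverse" gives KL(q || p), the one variational inference minimizes;
    direction="forward" gives KL(p || q).
    """
    det_p = 1.0 - rho ** 2
    det_q = var_q ** 2
    if direction == "reverse":
        # trace(inv(Sigma_p) @ Sigma_q) = 2 * var_q / (1 - rho^2)
        trace = 2.0 * var_q / det_p
        return 0.5 * (trace - 2.0 + math.log(det_p / det_q))
    # trace(inv(Sigma_q) @ Sigma_p) = 2 / var_q
    trace = 2.0 / var_q
    return 0.5 * (trace - 2.0 + math.log(det_q / det_p))


def best_variance(rho, direction):
    """Grid search for the variance v that minimizes the chosen divergence."""
    grid = [i / 100 for i in range(1, 201)]  # candidate variances 0.01 .. 2.00
    return min(grid, key=lambda v: kl_gauss_2d(v, rho, direction))


rho = 0.9
print("true marginal variance:", 1.0)
print("reverse-KL optimum:", best_variance(rho, "reverse"))
print("forward-KL optimum:", best_variance(rho, "forward"))
```

In this toy, the backwards divergence settles on a variance near 1 - rho^2, well below the true marginal variance of 1, while the forward divergence recovers the marginal variance. The stronger the posterior correlation, the worse the underestimate.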

So variational inference is mostly set up to fail. Really, we should be surprised it works at all.

Cold discovery

There are really two things we need to check when we’re doing variational inference. The first is that the optimization procedure that we have used to compute q^*(\theta) has actually converged to a (local) minimum.  Naively, this seems fairly straightforward. After all, we don’t think of maximum likelihood estimation as being hard computationally, so we should be able to solve this optimization problem easily. But it turns out that if we want our variational inference to be scalable in the sense that we can apply it to big problems, we need to be more clever. For example Automatic Differentiation Variational Inference (ADVI) uses a fairly sophisticated stochastic optimization method to find q^*(\theta).

So first we have to make sure the method actually converges. I don’t really want to talk about this, but it’s probably worth saying that it’s not trivial and stochastic methods like ADVI will occasionally terminate too soon. This leads to terrible approximations to the true posterior. It’s also well worth saying that if the true posterior is multimodal, there’s no guarantee that the minimum that is found will be a (nearly) global one. (And if the approximating family \mathcal{Q} only contains unimodal distributions, we will have some problems!) There are perhaps some ways out of this (Yuling has many good ideas), but the key thing is that if you want to actually know if there is a potential problem, it’s important to run multiple optimizations beginning at a diverse set of initial values.

Anyway, let’s pretend that this isn’t a problem so that we can get onto the main point.

The second thing that we need to check is that the approximate posterior q^*(\theta) is an ok approximation to the true posterior p(\theta \mid y ). This is a much less standard task and we haven’t found a good method for addressing it in the literature. So we came up with two ideas.

Left only with love

Our first idea was based on Aki, Andrew, and Jonah’s Pareto-Smoothed Importance Sampling (PSIS). The crux of our idea is that if q(\theta) is a good approximation to the true posterior, it can be used as an importance sampling proposal to compute expectations with respect to p(\theta \mid y). So before we can talk about that method, we need to remember what PSIS does.

The idea is that we can approximate any posterior expectation \int h(\theta)p(\theta \mid y)\,d \theta using a self-normalized importance sampling estimator. We do this by drawing S samples \{\theta_s\}_{s=1}^S from the proposal distribution q(\theta) and computing the estimate

\int h(\theta)p(\theta \mid y)\,d \theta \approx \frac{\sum_{s=1}^S h(\theta_s) r_s}{\sum_{s=1}^S r_s}.

Here we define the importance weights as

r_s = \frac{p(\theta_s,y)}{q(\theta_s)}.

We can get away with using the joint distribution instead of the posterior in the numerator because p(\theta \mid y) \propto p(\theta,y) and we re-normalise the estimator. This self-normalized importance sampling estimator is consistent with bias that goes asymptotically like \mathcal{O}(S^{-1}). (The bias comes from the self-normalization step. Ordinary importance sampling is unbiased.)
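The estimator above can be sketched in a few lines. This is a toy under stated assumptions, not anything from the PSIS paper: the "joint" is a standard normal density with its normalizing constant deliberately dropped, the proposal is a wider normal (so the weights are bounded), and h(\theta) = \theta^2, whose true posterior expectation is 1. All function names are hypothetical.

```python
import math
import random


def snis_estimate(h, log_joint, sample_q, log_q, num_draws, rng):
    """Self-normalized importance sampling estimate of E[h(theta) | y]."""
    draws = [sample_q(rng) for _ in range(num_draws)]
    log_r = [log_joint(t) - log_q(t) for t in draws]
    # Subtracting the max is harmless (constants cancel in the ratio)
    # and avoids overflow when exponentiating.
    shift = max(log_r)
    r = [math.exp(lr - shift) for lr in log_r]
    return sum(h(t) * w for t, w in zip(draws, r)) / sum(r)


Q_SD = 1.5  # proposal standard deviation (assumed for the toy)


def log_joint(theta):
    # Unnormalized log density of the target: N(0, 1) without its constant.
    return -0.5 * theta ** 2


def log_q(theta):
    # Log density of the proposal N(0, Q_SD^2).
    return -0.5 * (theta / Q_SD) ** 2 - math.log(Q_SD * math.sqrt(2.0 * math.pi))


def sample_q(rng):
    return rng.gauss(0.0, Q_SD)


estimate = snis_estimate(lambda t: t ** 2, log_joint, sample_q, log_q,
                         num_draws=20000, rng=random.Random(0))
print("estimate of E[theta^2 | y]:", estimate)
```

Because the weights only enter through a ratio, the missing normalizing constant never matters, which is exactly why the joint can stand in for the posterior.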

The only problem is that if the distribution of r_s has too heavy a tail, the self-normalized importance sampling estimator will have infinite variance. This is not a good thing. Basically, it means that the error in the posterior expectation could be any size.

The problem is that if the distribution of r_s has a heavy tail, the importance sampling estimator will be almost entirely driven by a small number of samples \theta_s with very large r_s values. But there is a trick to get around this: somehow tamp down the extreme values of r_s.

With PSIS, Aki, Andrew, and Jonah propose a nifty solution. They argue that you can model the tails of the distribution of the importance ratio with a generalized Pareto distribution 

p(r \mid \mu, \sigma, k)=\begin{cases} \frac{1}{\sigma} \left( 1+k\left( \frac{r-\mu}{\sigma} \right) \right) ^{-\frac{1}{k}-1}, & k \neq 0, \\ \frac{1}{\sigma} \exp\left( -\frac{r-\mu}{\sigma} \right), & k = 0. \end{cases}

This is a very sensible thing to do: the generalized Pareto is the go-to distribution that you use when you want to model the distribution of  all samples from an iid population that are above a certain (high) value. The PSIS approximation argues that you should take the M largest r_s (where M is chosen carefully) and fit a generalized Pareto distribution to them. You then replace those M largest observed importance weights with the corresponding expected order statistics from the fitted generalized Pareto.

There are some more (critical) details in the PSIS paper but the intuition is that we are replacing the “noisy” sample importance weights with their model-based estimates.  This reduces the variance of the resulting self-normalized importance sampling estimator and reduces the bias compared to other options.

It turns out that the key parameter in the generalized Pareto distribution is the shape parameter k. The interpretation of this parameter is that if the generalized Pareto distribution has shape parameter k, then the distribution of the sampling weights has \lfloor k^{-1} \rfloor moments.

This is particularly relevant in this context as the condition for the importance sampling estimator to have finite variance (and be asymptotically normal) is that the sampling weights have (slightly more than) two moments. This translates to k<1/2.

VERY technical side note: What I want to say is that the self-normalized importance sampling estimator is asymptotically normal. This was nominally proved in Theorem 2 of Geweke’s 1983 paper. The proof there looks wrong. Basically, he applies a standard central limit theorem to get the result, which seems to assume the terms in the sum are iid. The only problem is that the summands

\frac{h(\theta_s) r_s}{\sum_{j=1}^S r_j}

are not independent. So it looks a lot like Geweke should’ve used a central limit theorem for weakly-mixing triangular arrays instead. He did not. What he actually did was quite clever. He noticed that the bivariate random variables (h(\theta_s)r_s, r_s)^T are independent and satisfy a central limit theorem with mean (A,w)^T. From there you use a second-order Taylor expansion of the function f(A,w) = A/w to show that the sequence

f \left( S^{-1} \sum_{s=1}^S h(\theta_s) r_s, S^{-1} \sum_{s=1}^S r_s\right)

is also asymptotically normal as long as zero or infinity are never in a neighbourhood of S^{-1}\sum_{s=1}^S r_s .

End VERY technical side note!

The news actually gets even better! The smaller k is, the faster the importance sampling estimate will converge. Even better than that, the PSIS estimator seems to be useful even if k is slightly bigger than 0.5. The recommendation in the PSIS paper is that if \hat k <0.7, the PSIS estimator is reliable.

But what is \hat k? It’s the sample estimate of the shape parameter k. Once again, some really nice things happen when you use this estimator. For example, even if we know from the structure of the problem that k<0.5, if \hat k >0.7 (which can happen), then importance sampling will perform poorly. The value of \hat k is strongly linked to the finite sample behaviour of the PSIS (and other importance sampling) estimators.

The intuition for why the estimated shape parameter is more useful than the population shape parameter is that it tells you when the sample of r_s that you’ve drawn could have come from a heavy tailed distribution. If this is the case, there isn’t enough information in your sample yet to push you into the asymptotic regime and pre-asymptotic behaviour will dominate (usually leading to worse than expected behaviour).
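To make the idea of a sample-based tail-shape estimate concrete, here is a minimal sketch. Loud caveat: PSIS fits a generalized Pareto distribution to the largest weights, whereas this sketch swaps in the much cruder Hill estimator, which only looks at log-spacings of the largest weights and can never return a negative value. The toy target is N(0, 1), the proposals are N(0, 0.5^2) and N(0, 1.5^2), and the tail-size rule min(0.2 S, 3 sqrt(S)) is an assumption of the sketch. All names are hypothetical.

```python
import math
import random


def hill_tail_shape(log_weights, num_tail):
    """Hill estimate of the tail shape k from the num_tail largest weights."""
    ordered = sorted(log_weights, reverse=True)
    threshold = ordered[num_tail]  # log of the (num_tail + 1)-th largest weight
    return sum(lw - threshold for lw in ordered[:num_tail]) / num_tail


def log_weights_for(q_sd, num_draws, rng):
    """Log importance ratios for target N(0, 1) and proposal N(0, q_sd^2)."""
    out = []
    for _ in range(num_draws):
        theta = rng.gauss(0.0, q_sd)
        log_p = -0.5 * theta ** 2  # shared constants dropped; they cancel
        log_q = -0.5 * (theta / q_sd) ** 2 - math.log(q_sd)
        out.append(log_p - log_q)
    return out


S = 20000
M = int(min(0.2 * S, 3 * math.sqrt(S)))  # assumed tail-size rule
rng = random.Random(0)
k_narrow = hill_tail_shape(log_weights_for(0.5, S, rng), M)  # proposal too narrow
k_wide = hill_tail_shape(log_weights_for(1.5, S, rng), M)    # proposal too wide
print("k-hat, narrow proposal:", k_narrow)
print("k-hat, wide proposal:", k_wide)
```

A proposal that is narrower than the target produces heavy-tailed weights and a large estimate, while a proposal that is wider produces bounded weights and an estimate near zero. That asymmetry is the same one that makes a too-narrow variational posterior show up as a bad \hat k.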


Ok, so what does all this have to do with variational inference? Well it turns out that if we draw samples from our variational posterior and use them to compute the importance weights, then we have another interpretation for the shape parameter k:

k = \inf \left\{ k' > 0 : D_{1/k'}\left( p(\theta \mid y) \, || \, q(\theta) \right) < \infty \right\},

where D_\alpha (p, q) is the Rényi divergence of order \alpha.  In particular, if k >1, then the Kullback-Leibler divergence in the more natural direction KL(p(\theta \mid y) \mid q(\theta)) = \infty even if q(\theta) minimizes the KL-divergence in the other direction! Once again, we have found that the estimate \hat k gives an excellent indication of the performance of the variational posterior.

So why is checking if \hat k <0.7 a good heuristic to evaluate the quality of the variational posterior? There are a few reasons. Firstly, because the variational posterior minimizes the KL-divergence in the direction that penalizes approximations with heavier tails than the posterior much harder than approximations with lighter tails, it is very difficult to get a good \hat k value by simply “fattening out” the approximation.  Secondly, empirical evidence suggests that the smaller the value of \hat k, the closer the variational posterior is to the true posterior. Finally, if \hat k <0.7 we can automatically improve any expectation computed against the variational posterior using PSIS.  This makes this tool both a diagnostic and a correction for the variational posterior that does not rely too heavily on asymptotic arguments. The value of \hat k has also proven useful for selecting the best parameterization of the model for the variational approximation (or equivalently, between different approximation families).

There are some downsides to this heuristic. Firstly, it really does check that the whole variational posterior is like the true posterior. This is a quite stringent requirement that variational inference methods often do not pass. In particular, as the number of dimensions increases, we’ve found that unless the approximating family is particularly well-chosen for the problem, the variational approximation will eventually become bad enough that \hat k will exceed the threshold. Secondly, this diagnostic only considers the full posterior and cannot be modified to work on lower-dimensional subsets of the parameter space. This means that if the model has some “less important” parameters, we still require their posterior to be very well captured by the variational approximation.

Let me see the colts

The thing about variational inference is that it’s actually often quite bad at estimating a posterior. On the other hand, the centre of the variational posterior is much more frequently a good approximation to the centre of the true posterior. This means that we can get good point estimates from variational inference even if the full posterior isn’t very good. So we need a diagnostic to reflect this.

Into the fray steps an old paper of Andrew’s (with Samantha Cook and Don Rubin) on verifying statistical software. We (mainly Stan developer Sean) have been playing with various ways of extending and refining this method for the last little while and we’re almost done on a big paper about it. (Let me tell you: god may be present in the sweeping gesture, but the devil is definitely in the details.) Thankfully for this work, we don’t need any of the new detailed work we’ve been doing. We can just use the original results as they are (with just a little bit of a twist).

The resulting heuristic, which we call Variational Simulation-Based Calibration (VSBC), complements the PSIS diagnostic by assessing the average performance of the implied variational approximation to univariate posterior marginals. One of the things that this method can do particularly well is indicate if the centre of the variational posterior will be, on average, biased.  If it’s not biased, we can apply clever second-order corrections (like the one proposed by Ryan Giordano, Tamara Broderick, and Michael Jordan).

I keep saying “on average”, so what do I mean by that? Basically, VSBC looks at how well the variational posterior is calibrated by computing the distribution of p_j = P( \theta < \tilde{\theta}_j \mid y_j), where the probability is taken under the variational posterior fitted to y_j, and y_j is simulated from the model with parameter \tilde{\theta}_j that is itself drawn from the prior distribution. If the variational inference method is calibrated, then Cook et al. showed that the histogram of p_j should be uniform.

This observation can be generalized using insight from the forecast validation community: if the histogram of p_j is asymmetric, then the variational posterior will be (on average over data drawn from the model) biased. In the paper, we have a specific result, which shows that this insight is exactly correct if the true posterior is symmetric, and approximately true if it’s fairly symmetric.
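The calibration check can be sketched on a toy problem where the true posterior is known. The assumptions are all mine: the model is \theta ~ N(0, 1) with y | \theta ~ N(\theta, 1), so the exact posterior is N(y/2, 1/2), and the "variational" approximation is just that posterior with its mean deliberately shifted by 0.5. The function names are hypothetical, and the mean of the p_j is used as a crude stand-in for a proper test of symmetry.

```python
import random
from statistics import NormalDist, fmean


def vsbc_probabilities(approx_posterior, num_sims, rng):
    """Simulate (theta, y) pairs and return p_j = Q_j(theta < theta_j).

    Model (assumed for illustration): theta ~ N(0, 1), y | theta ~ N(theta, 1).
    approx_posterior(y) returns the (mean, sd) of the approximation given y.
    """
    probs = []
    for _ in range(num_sims):
        theta_j = rng.gauss(0.0, 1.0)       # draw from the prior
        y_j = rng.gauss(theta_j, 1.0)       # simulate data given theta_j
        mean, sd = approx_posterior(y_j)    # "fit" the approximation to y_j
        probs.append(NormalDist(mean, sd).cdf(theta_j))
    return probs


def exact(y):
    # True posterior for this conjugate model: N(y / 2, 1 / 2).
    return y / 2.0, 0.5 ** 0.5


def shifted(y):
    # A deliberately biased stand-in for a variational approximation.
    return y / 2.0 + 0.5, 0.5 ** 0.5


rng = random.Random(0)
p_exact = vsbc_probabilities(exact, 2000, rng)
p_shifted = vsbc_probabilities(shifted, 2000, rng)
# A uniform histogram has mean 0.5; a mean far from 0.5 signals asymmetry.
print("mean p_j, exact posterior:", fmean(p_exact))
print("mean p_j, shifted approximation:", fmean(p_shifted))
```

With the exact posterior the p_j should look uniform, so their mean lands near 0.5. With the shifted approximation the histogram piles up on one side, which is the asymmetry that flags a biased centre.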

There’s also the small problem that if the model is badly mis-specified, then it may fit the observed data much worse or better than the average of data drawn from the model. Again, this contrasts with the PSIS diagnostic that only assesses the fit for the particular data set you’ve observed.

In light of this, we recommend interpreting both of our heuristics the same way: conservatively. If either heuristic fails, then we can say the variational posterior is poorly behaved in one of two specific ways. If either or both heuristics pass, then we can have some confidence that the variational posterior will be a good approximation to the true distribution (especially after a PSIS or second-order correction), but this is still not guaranteed.


To close this post out symmetrically (because symmetry indicates a lack of bias), let’s go back to a different Bill Callahan song to remind us that even if it’s not the perfect song, you can construct something beautiful by leveraging formal structure:

If you
If you could
If you could only
If you could only stop
If you could only stop your
If you could only stop your heart
If you could only stop your heart beat
If you could only stop your heart beat for
If you could only stop your heart beat for one heart
If you could only stop your heart beat for one heart beat