## Return of the Klam

Matthew Klam is back. This time for reals. I’m halfway through reading his new book, Who is Rich?, and it’s just wonderful. The main character is a cartoonist and illustrator, and just about every scene is filled with stunning and hilarious physical descriptions. If I were writing a blurb, I’d call Who is Rich? the most sheerly enjoyable book I’ve read in a long time. Cos it is.

Here’s the story behind Klam’s strangely interrupted career, as reported by Taffy Brodesser-Akner: “Matthew Klam’s New Book Is Only 17 Years Overdue“:

In 1993, Matthew Klam was sitting in his room at the Fine Arts Work Center in Provincetown, where he was a fellow, when he received a call from Daniel Menaker at The New Yorker saying that they were interested in buying his short story “Sam the Cat.” . . . an outrageous success — it sparkled with human observation that is so true it makes you cringe — so he wrote a short-story collection, also called Sam the Cat, for Random House. [That volume, which came out in 2000, really is great. The title story is excellent but it’s not even my favorite one in the book.] Klam won an O. Henry Award, a Whiting Award, an NEA grant, a PEN Robert W. Bingham Prize, and a Guggenheim fellowship. . . .

But Klam was not so happy with what he was producing:

He felt like a fraud. All those awards were for things he was going to do in the future, and he didn’t know if he had anything left to say. . . . In 2007, he realized he didn’t have another short story in him . . . [In the following years,] Klam sat in his unfinished basement, temperature 55 degrees, even with a space heater, and asked himself how to make something again. He threw away what his wife and friends and editors suspect was an entire book’s worth of material. . . .

What happened to Matthew Klam, Matthew Klam explains, wasn’t as simple as creative paralysis. He’d gotten a tenure-track teaching position at Johns Hopkins in 2010, in the creative-writing department. It was a welcome respite from the spotlight . . . Teaching allowed him to write and to feel like no one was really watching. . . . each day he returned home from Baltimore and sat in his basement and waited for his novel to become apparent to him.

And then . . .

Finally, his hand was forced. As part of his tenure-track review, he had to show what he was working on to his department-assigned mentor and the chair of the department. Klam showed her the first 100 pages of Who Is Rich?. He was worried his voice was no longer special. He was worried it was the same old thing.

Get this:

The department supervisor found the pages “sloppily written” and “glib and cynical” and said that if he didn’t abandon the effort, she thought he would lose his job.

Hey! Not to make this blog all about me or anything, but that’s just about exactly what happened to me back in 1994 or so when the statistics department chair stopped me in the hallway and told me that he’d heard I’d been writing this book on Bayesian statistics and that if I was serious about tenure, I should work on something else. I responded that I’d rather write the book and not have tenure at Berkeley than have tenure at Berkeley and not write the book. And I got my wish!

So did Klam:

Suddenly, forcefully, he was sure that this wasn’t a piece of shit. This was his book. He told his boss he was going to keep working on the book . . . He sold Who Is Rich? before it was complete, based on 120 pages, in 2013. In 2016, he was denied tenure.

Ha! It looks like the Johns Hopkins University English Department dodged a bullet on that one.

Still, it seems to have taken Klam another four years to write the next 170 or so pages of the book. That’s slow work. That’s fine—given the finished product, it was all worth it—it just surprises me: Klam’s writing seems so fluid, I would’ve thought he could spin out 10 pages a day, no problem.

P.S. One thing I couldn’t quite figure out from this article is what Klam does for a living, how he paid the bills all those years. I hope he writes more books and that they come out more frequently than once every 17 years. But it’s tough to make a career writing books; you pretty much have to do it just cos you want to. I don’t think they sell many copies anymore.

P.P.S. There was one other thing that bothered me about that article, which was when we’re told that the academic department supervisor didn’t like Klam’s new book: “‘They like Updike,’ Klam explains of his department’s reaction and its conventional tastes. ‘They like Alice Munro. They love Virginia Woolf.'” OK, Klam’s not so similar to Munro and Woolf, but he’s a pretty good match for Updike: middle-aged, middle-class white guy writing about adultery among middle-aged white guys in the Northeast. Can’t get much more Updikey than that. I’m not saying Johns Hopkins was right to let Klam go, not at all, I just don’t think it seems quite on target to say they did it because of their “conventional tastes.”

P.P.P.S. Also this. Brodesser writes:

He wanted a regular writing life and everything that went with it . . . Just a regular literary career — you know, tenure, a full professorship, a novel every few years.

That just made me want to cry. I mean, a tenured professorship is great, I like my job. But that’s “a regular literary career” now? That’s really sad, compared to the days when a regular literary career meant that you could support yourself on your writing alone.

## I’m skeptical of the claims made in this paper

Two different people pointed me to a recent research article, suggesting that the claims therein were implausible and the result of some combination of forking paths and spurious correlations—that is, there was doubt that the results would show up in a preregistered replication, and that, if they did show up, that they would mean what was claimed in the article.

One of my correspondents amusingly asked if we should see if the purported effect in this paper interacts with a certain notorious effect that has been claimed, and disputed, in this same subfield. I responded by listing three additional possible explanations from the discredited literature, thus suggesting a possible five-way interaction.

I followed up with, “I assume I can quote you if I blog this? (I’m not sure I will . . . do I really need to make N more enemies???)”

My correspondent replied: “*I* definitely don’t need N more enemies. I’m an untenured postdoc on a fixed contract. If you blog about it, could you just say the paper was sent your way?”

So that’s what I’m doing!

P.S. Although I find the paper in question a bit silly, I have no objection whatsoever to it being put out there. Speculative theory is fine, speculative data analysis is fine too. Indeed, one of the problems with the current system of scientific publication is that speculation generally isn’t enough: you have to gussy up your results with p-values and strong causal claims, or else you can have difficulty getting them published anywhere.

## “No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality”

Robert Bloomfield, Kristina Rennekamp, Blake Steenhoven sent along this paper that compares “a registration-based Editorial Process (REP). Authors submitted proposals to gather and analyze data; successful proposals were guaranteed publication as long as the authors lived up to their commitments, regardless of whether results supported their predictions” to “the Traditional Editorial Process (TEP).”

Here’s what they found:

[N]o system is perfect. Registration is a useful complement to the traditional editorial process, but is unlikely to be an adequate replacement. By encouraging authors to shift from follow-up investment to up-front investment, REP encourages careful planning and ambitious data gathering, and reduces the questionable practices and selection biases that undermine the reproducibility of results. But the reduction in follow-up investment leaves papers in a less-refined state than the traditional process, leaving useful work undone. Because accounting is a small field, with journals that typically publish a small number of long articles, subsequent authors may have no clear opportunity to make a publishable contribution by filling in the gaps.

With experience, we expect that authors and editors will learn which editorial process is better suited to which types of studies, and learn how to draw inferences differently from papers produced by these very different systems. We also see many ways that both editorial processes could be improved by moving closer toward each other. REP would be improved by encouraging more outside input before proposals are accepted, and more extensive revisions after planned analyses are conducted, especially those relying on forms of discretion that our community sees as most helpful and least harmful. TEP would be improved by demanding more complete and accurate descriptions of procedures (as Journal of Accounting Research has been implementing for several years (updated JAR [2018]), not only those that help subsequent authors follow those procedures, but also those that help readers interpret p-values in light of the alternatives that authors considered and rejected in calculating them. REP and TEP would complement one another more successfully if journals would be more open to publishing short articles under TEP that fill in the gaps left by articles published under REP.

They also share some anecdotes:

-Respondent 84, Full Professor, Laboratory Experiments

“As an author, I have received feedback from an editor at a Top 3 journal that the economic significance of the results in the paper seemed a little too large to be fully explained by the hypotheses. My co-authors and I were informed by the editor of an additional theoretical reason why the effects sizes could be that large and we were encouraged by the editor to incorporate that additional discussion into the underlying theory in the paper.  My co-authors and I agreed that the theory and arguments provided by the editor seemed reasonable. As a result of incorporating this suggestion, we believe the paper is more informative to readers.”

-Respondent 280, Assistant Professor, Archival

“As a doctoral student, I ran a 2x2x2 design on one of my studies. The 2×2 of primary interest worked well in one level of the last variable but not at all in the other level. I was advised by my dissertation committee not to report the results for participants in the one level of that factor that “didn’t work” because that level of the factor was not theoretically very important and the results would be easier to explain and essentially more informative. As a result, I ended up reporting only the 2×2 of primary interest with participants from the level of the third variable where the 2×2 held up. To this day, I still feel a little uncomfortable about that decision, although I understood the rationale and thought it made sense.”

-Respondent 85, Full Professor, Laboratory Experiments

This all seems relevant to discussions of preregistration, post-publication review, etc.

## What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better? (My talk at the University of Michigan Friday 2pm)

Tomorrow (Fri 9 Feb) 2pm at the NCRC Research Auditorium (Building 10) at the University of Michigan:

What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Evidence-based medicine” sounds like a good idea, but it can run into problems when the evidence is shaky. We discuss several different reasons that even clean randomized clinical trials can fail to replicate, and we discuss directions for improvement in design and data collection, statistical analysis, and the practices of the scientific community. See this paper: http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf and this one: http://www.stat.columbia.edu/~gelman/research/unpublished/Stents_submitted.pdf

## 354 possible control groups; what to do?

Jonas Cederlöf writes:

I’m a PhD student in economics at Stockholm University and a frequent reader of your blog. I have for a long time followed your quest in trying to bring attention to p-hacking and multiple comparison problems in research. I’m now myself faced with the aforementioned problem and want to at the very least try to avoid picking (or being subject to the critique of having picked) control group which merely gives me fancy results. The setting is the following,

I run a difference-in-difference (DD) model between occupations where people working in occupation X is treated at year T. There are 354 other types of occupations where for at least 20-30 of them I could make up a “credible” story about why they would be a natural control group. One could of course run the DD-estimation on the treated group vs. the entire labor market, but claiming causality between the reform and the outcome hinges on not only the parallel trend assumption but also on that group specific shocks are absent. Hence one might wan’t to find a control group that would be subjected to the same type of shocks as the treated occupation X so one might be better of picking specific occupation from the rest of the 354 categories. Some of these might have parallel trends some others, but wouldn’t it be p-hacking choosing groups like this, based on parallel trends? The reader has no guarantee that I as a researcher haven’t picked control groups that gives me the results that will get me published?

So in summary: When one has 1 treated group and 354 potential control groups, how does one go about choosing among these?

My response: rather than picking one analysis (either ahead of time or after seeing the data), I suggest you do all 354 analyses and put them together using a hierarchical model as discussed in this paper. Really, this is not doing 354 analyses, it’s doing one analysis that includes all these comparisons.

## Eid ma clack shaw zupoven del ba.

When I say “I love you”, you look accordingly skeptical Frida Hyvönen

A few years back, Bill Callahan wrote a song about the night he dreamt the perfect song. In a fever, he woke and wrote it down before going back to sleep. The next morning, as he struggled to read his handwriting, he saw that he’d written the nonsense that forms the title of this post.

Variational inference is a lot like that song; dreams of the perfect are ruined in the harsh glow of the morning after.

(For more unnaturally tortured metaphors see my twitter. I think we can all agree setting one up was a fairly bad choice for me.)

But how can we tell if variational inference has written the perfect song or, indeed, if it has laid an egg? Unfortunately, there doesn’t seem to be a lot of literature to guide us. We (Yuling, Aki, Me, and Andrew) have a new paper to give you a bit more of an idea.

#### Palimpsest

The guiding principle of variational inference is that if it’s impossible to work with the true posterior $p(\theta \mid y)$, then near enough is surely good enough. (It seldom is.)

In particular, we try to find the  member $q^*(\theta)$ of some tractable set of distributions $\mathcal{Q}$ (commonly the family of multivariate Gaussian distributions with diagonal covariance matrices) that minimizes the Kullback-Leibler divergence

$q^*(\theta) = \arg \min_{q \in \mathcal{Q}} KL\left[\,q(\cdot) \mid p(\cdot \mid y)\,\right].$

The Kullback-Leibler divergence in this direction (it’s asymmetric, so the order of arguments is important) can be interpreted as the the amount of information lost if we replace the approximate posterior $q(\theta)$ with the true posterior $p(\theta \mid y)$. Now, if this seems like the wrong way around to you [that we should instead worry about what happens if we replace the target posterior $p(\theta \mid y)$ with the approximation $q(\theta)$], you would be very very correct.  That Kullback-Leibler divergence is backwards.

What does this mean? Well it means that we won’t penalize approximate distributions that are much less complex than the true one as heavily as we should. How does this translate into real life? It means that usually we will end up with approximations $q^*(\theta)$ that are narrower than the true posterior. Usually this manifests as distributions with lighter tails.

(Quiet side note: Why are those last few sentences so wishy-washy? Well it turns out that minimizing a Kullback-Leibler divergence in the wrong direction can do all kinds of things to the resulting minimizer and it’s hard to really pin down what will happen.  But it’s almost everyone’s experience that the variational posterior $q(\theta)$ is almost always narrower than than the true posterior. So the previous paragraph is usually true.)

So variational inference is mostly set up to fail. Really, we should be surprised it works at all.

#### Cold discovery

There are really two things we need to check when we’re doing variational inference. The first is that the optimization procedure that we have used to compute $q^*(\theta)$ has actually converged to a (local) minimum.  Naively, this seems fairly straightforward. After all, we don’t think of maximum likelihood estimation as being hard computationally, so we should be able to solve this optimization problem easily. But it turns out that if we want our variational inference to be scalable in the sense that we can apply it to big problems, we need to be more clever. For example Automatic Differentiation Variational Inference (ADVI) uses a fairly sophisticated stochastic optimization method to find $q^*(\theta)$.

So first we have to make sure the method actually converges.  I don’t really want to talk about this, but it’s probably worth saying that it’s not trivial and stochastic methods like ADVI will occasionally terminate too soon. This leads to terrible approximations to the true posterior. It’s also well worth saying the if the true posterior is multimodal, there’s no guarantee that the minimum that is found will be a (nearly) global one.  (And if the approximating family $\mathcal{Q}$ only contains unimodal distributions, we will have some problems!) There are perhaps some ways out of this (Yuling has many good ideas), but the key thing is that if you want to actually know if there is a potential problem, it’s important to run multiple optimizations beginning at a diverse set of initial values.

Anyway, let’s pretend that this isn’t a problem so that we can get onto the main point.

The second thing that we need to check is that the approximate posterior $q^*(\theta)$ is an ok approximation to the true posterior $p(\theta \mid y )$. This is a much less standard task and we haven’t found a good method for addressing it in the literature. So we came up with two ideas.

#### Left only with love

Our first idea was based Aki, Andrew, and Jonah’s Pareto-Smoothed Importance Sampling (PSIS). The crux of our idea is that if $q(\theta)$ is a good approximation to the true posterior, it can be used as an importance sampling proposal to compute expectations with respect to $p(\theta \mid y)$. So before we can talk about that method, we need to remember what PSIS does.

The idea is that we can approximate any posterior expectation $\int h(\theta)p(\theta \mid y)\,d \theta$ using a self-normalized importance sampling estimator. We do this by drawing $S$ samples $\{\theta_s\}_{s=1}^S$ from the proposal distribution $q(\theta)$ and computing the estimate

$\int h(\theta)p(\theta \mid y)\,d \theta \approx \frac{\sum_{s=1}^S h(\theta_s) r_s}{\sum_{s=1}^S r_s}.$

Here we define the importance weights as

$r_s = \frac{p(\theta_s,y)}{q(\theta_s)}.$

We can get away with using the joint distribution instead of the posterior in the numerator because $p(\theta \mid y) \propto p(\theta,y)$ and we re-normalise the the estimator. This self-normalized importance sampling estimator is consistent with bias that goes asymptotically like $\mathcal{O}(S^{-1})$. (The bias  comes from the self-normalization step.  Ordinary importance sampling is unbiased.)

The only problem is that if the distribution of $r_s$ has too heavy a tail, the self-normalized importance sampling estimator will have infinite variance. This is not a good thing. Basically, it means that the error in the posterior expectation could be any size.

The problem is that if the distribution of $r_s$ has a heavy tail, the importance sampling estimator will be almost entirely driven by a small number of samples $\theta_s$ with very large $r_s$ values. But there is a trick to get around this: somehow tamp down the extreme values of $r_s$.

With PSIS, Aki, Andrew, and Jonah propose a nifty solution. They argue that you can model the tails of the distribution of the importance ratio with a generalized Pareto distribution

$p(r|\mu, \sigma, k)=\begin{cases} \frac{1}{\sigma} \left( 1+k\left( \frac{r-\mu}{\sigma} \right) \right) ^{-\frac{1}{k}-1}, & k \neq 0. \\ \frac{1}{\sigma} \exp\left( \frac{r-\mu}{\sigma} \right), & k = 0.\\\end{cases} .$

This is a very sensible thing to do: the generalized Pareto is the go-to distribution that you use when you want to model the distribution of  all samples from an iid population that are above a certain (high) value. The PSIS approximation argues that you should take the $M$ largest $r_s$ (where $M$ is chosen carefully) and fit a generalized Pareto distribution to them. You then replace those $M$ largest observed importance weights with the corresponding expected order statistics from the fitted generalized Pareto.

There are some more (critical) details in the PSIS paper but the intuition is that we are replacing the “noisy” sample importance weights with their model-based estimates.  This reduces the variance of the resulting self-normalized importance sampling estimator and reduces the bias compared to other options.

It turns out that the key parameter in the generalized Pareto distribution is the shape parameter $k$. The interpretation of this parameter is that if the generalized Pareto distribution has shape parameter $k$, then the distribution of the sampling weights have $\lfloor k^{-1} \rfloor$ moments.

This is particularly relevant in this context as the condition for the importance sampling estimator to have finite variance (and be asymptotically normal) is that the sampling weights have (slightly more than) two moments. This translates to $k<1/2$.

VERY technical side note: What I want to say is that the self-normalized importance sampling estimator is asymptotically normal. This was nominally proved in Theorem 2 of Geweke’s 1983 paper. The proof there looks wrong. Basically, he applies a standard central limit theorem to get the result, which seems to assume the terms in the sum are iid. The only problem is that the summands

$\frac{h(\theta_s)r_s}{\sum_{s=1}^Sr_s}$

are not independent. So it looks a lot like Geweke should’ve used a central limit theorem for weakly-mixing triangular arrays instead. He did not. What he actually did was quite clever. He noticed that the bivariate random variables $(h(\theta_s)r_s, r_s)^T$ are independent and satisfy a central limit theorem with mean $(A,w)^T$. From there you’re a second-order Taylor expansion of the function $f(A,w) = A/w$ to show that the sequence

$f \left( S^{-1} \sum_{s=1}^S h(\theta_s) r_s, S^{-1} \sum_{s=1}^S r_s\right)$

is also asymptotically normal as long as zero or infinity are never in a neighbourhood of $S^{-1}\sum_{s=1}^S r_s$.

End VERY technical side note!

The news actually gets even better! The smaller $k$ is, the faster the importance sampling estimate will converge. Even better than that, the PSIS estimator seems to be useful even if $k$ is slightly bigger than 0.5. The recommendations in the PSIS paper is that if $\hat k <0.7$, the PSIS estimator is reliable.

But what is $\hat k$? It’s the sample estimate of the shape parameter $k$. Once again, some really nice things happen when you use this estimator. For example, even if we know from the structure of the problem that $k<0.5$, if $\hat k >0.7$ (which can happen), then importance sampling will perform poorly. The value of $\hat k$ is strongly linked to the finite sample behaviour of the PSIS (and other importance sampling) estimators.

The intuition for why the estimated shape parameter is more useful than the population shape parameter is that it tells you when the sample of $r_s$ that you’ve drawn could have come from a heavy tailed distribution. If this is the case, there isn’t enough information in your sample yet to push you into the asymptotic regime and pre-asymptotic behaviour will dominate (usually leading to worse than expected behaviour).

#### Footprints

Ok, so what does all this have to do with variational inference? Well it turns out that if we draw samples from out variational posterior and use them to compute the importance weights, then we have another interpretation for the shape parameter $k$:

$k = \arg \inf_{k'>0} D_{1/k}(p(\theta \mid y) \, ||\, q(\theta),$

where $D_\alpha (p, q)$ is the Rényi divergence of order $\alpha$.  In particular, if $k >1$, then the Kullback-Leibler divergence in the more natural direction $KL(p(\theta \mid y) \mid q(\theta)) = \infty$ even if $q(\theta)$ minimizes the KL-divergence in the other direction! Once again, we have found that the estimate $\hat k$ gives an excellent indication of the performance of the variational posterior.

So why is checking if $\hat k <0.7$ a good heuristic to evaluate the quality of the variational posterior? There are a few reasons. Firstly, because the variational posterior minimizes the KL-divergence in the direction that penalizes approximations with heavier tails than the posterior much harder than approximations with lighter tails, it is very difficult to get a good $\hat k$ value by simply “fattening out” the approximation.  Secondly, empirical evidence suggests that the smaller the value of $\hat k$, the closer the variational posterior is to the true posterior. Finally, if $\hat k <0.7$ we can automatically improve any expectation computed against the variational posterior using PSIS.  This makes this tool both a diagnostic and a correction for the variational posterior that does not rely too heavily on asymptotic arguments. The value of $\hat k$ has also proven useful for selecting the best parameterization of the model for the variational approximation (or equivalently, between different approximation families).

There are some downsides to this heuristic. Firstly, it really does check that the whole variational posterior is like the true posterior. This is a quite stringent requirement that variational inference methods often do not pass. In particular, as the number of dimensions increases, we’ve found that unless the approximating family is particularly well-chosen for the problem, the variational approximation will eventually become bad enough that $\hat k$ will  exceeded the threshold. Secondly, this diagnostic only considers the full posterior and cannot be modified to work on lower-dimensional subsets of the parameter space. This means that if the model has some “less important” parameters, we  still require their posterior be very well captured by the variational approximation.

#### Let me see the colts

The thing about variational inference is that it’s actually often quite bad at estimating a posterior. On the other hand, the centre of the variational posterior is much more frequently a good approximation to the centre of the true posterior. This means that we can get good point estimates from variational inference even if the full posterior isn’t very good. So we need a diagnostic to reflect this.

Into the fray steps an old paper of Andrew’s (with Samantha Cook and Don Rubin) on verifying statistical software. We (mainly Stan developer Sean) have been playing with various ways of extending and refining this method for the last little while and we’re almost done on a big paper about it. (Let me tell you: god may be present in the sweeping gesture, but the devil is definitely in the details.) Thankfully for this work, we don’t need any of the new detailed work we’ve been doing. We can just use the original results as they are (with just a little bit of a twist).

The resulting heuristic, which we call Variational Simulation-Based Calibration (VSBC), complements the PSIS diagnostic by assessing the average performance of the implied variational approximation to univariate posterior marginals. One of the things that this method can do particularly well is indicate if the centre of the variational posterior will be, on average, biased.  If it’s not biased, we can apply clever second-order corrections (like the one proposed by Ryan Giordano, Tamara Broderick, and Michael Jordan).

I keep saying “on average”, so what do I mean by that? Basically, VSBC looks at how well the variational posterior is calibrated by computing the distribution of $p_j = P( \theta < \tilde{\theta}_j \mid y_j)$ where $y_j$ is simulated from the model with parameter $\theta_j$ that is itself drawn from the prior distribution. If the variational inference method is calibrated, then Cook et al. showed that the histogram of $p_j$ should be uniform.

This observation can be generalized using insight from the forecast validation community: if the histogram of $p_j$ is asymmetric, then the variational posterior will be (on average over data drawn from the model) biased. In the paper, we have a specific result, which shows that this insight is exactly correct if the true posterior is symmetric, and approximately true if it’s fairly symmetric.

There’s also the small problem that if the model is badly mis-specified, then it may fit the observed data much worse or better than the average of data drawn from the model. Again, this contrasts with the PSIS diagnostic that only assesses the fit for the particular data set you’ve observed.

In light of this, we recommend interpreting both of our heuristics the same way: conservatively. If either heuristic fails, then we can say the variational posterior is poorly behaved in one of two specific ways. If either or both heuristics pass, then we can have some confidence that the variational posterior will be a good approximation to the true distribution (especially after a PSIS or second-order correction), but this is still not guaranteed.

#### Faith/Void

To close this post out symmetrically (because symmetry indicates a lack of bias), let’s go back to a different Bill Callahan song to remind us that even if it’s not the perfect song, you can construct something beautiful by leveraging formal structure:

If
If you
If you could
If you could only
If you could only stop
If you could only stop your
If you could only stop your heart
If you could only stop your heart beat
If you could only stop your heart beat for
If you could only stop your heart beat for one heart
If you could only stop your heart beat for one heart beat

## Methodological terrorism. For reals. (How to deal with “what we don’t know” in missing-data imputation.)

Kevin Lewis points us to this paper, by Aaron Safer-Lichtenstein, Gary LaFree, Thomas Loughran, on the methodology of terrorism studies. This is about as close to actual “methodological terrorism” as we’re ever gonna see here.

Although the empirical and analytical study of terrorism has grown dramatically in the past decade and a half to incorporate more sophisticated statistical and econometric methods, data validity is still an open, first-order question. Specifically, methods for treating missing data often rely on strong, untestable, and often implicit assumptions about the nature of the missing values.

Later, they write:

If researchers choose to impute data, then they must be clear about the benefits and drawbacks of using an imputation technique.

Yes, definitely. One funny thing about missing-data imputation is that the methods are so mysterious and are so obviously subject to uncheckable assumptions that there’s a tendency for researchers to just throw up their hands and give up, and either go for crude data-simplification strategies such as throwing away all cases where anything is missing, or just imputing without any attempt to check the resulting inferences.

My preference is to impute and then check assumptions, as here. That said, in practice this can be a bit of work so in a lot of my own applied work I kinda close my eyes to the problem too. I should do better.

## N=1 experiments and multilevel models

N=1 experiments are the hot new thing. Here are some things to read:

Design and Implementation of N-of-1 Trials: A User’s Guide, edited by Richard Kravitz and Naihua Duan for the Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services (2014).

Single-patient (n-of-1) trials: a pragmatic clinical decision methodology for patient-centered comparative effectiveness research, by Naihua Duan, Richard Kravitz, and Chris Schmid, for the Journal of Clinical Epidemiology (2013).

And a particular example:

The PREEMPT study – evaluating smartphone-assisted n-of-1 trials in patients with chronic pain: study protocol for a randomized controlled trial by Colin Barr et al., for Trials (2015), which begins:

Chronic pain is prevalent, costly, and clinically vexatious. Clinicians typically use a trial-and-error approach to treatment selection. Repeated crossover trials in a single patient (n-of-1 trials) may provide greater therapeutic precision. N-of-1 trials are the most direct way to estimate individual treatment effects and are useful in comparing the effectiveness and toxicity of different analgesic regimens.

This can also be framed as the problem of hierarchical modeling when the number of groups is 1 or 2, and this issue comes up, that once you go beyond N=1, you’re suddenly allowing more variation. One way to handle this is to include this between-person variance component even for an N=1 study. It’s just necessary to specify the between-person variance a priori—but that’s better than just setting it to 0. Similarly, once we have N=2 we can fit a hierarchical model but we’ll need strong prior info on the between-person variance parameter.

This relates to some recent work of ours in pharmacology—in this case, the problem is not N=1 patient, but N=1 study, and it also connects to a couple discussions we’ve had on this blog regarding the use of multilevel models to extrapolate to new scenarios; see here and here and here from 2012. We used to think of multilevel models as requiring 3 or more groups, but that’s not so at all; it’s just that when you have fewer groups, there’s more to be gained by including prior information on group-level variance parameters.

## p=0.24: “Modest improvements” if you want to believe it, a null finding if you don’t.

David Allison sends along this juxtaposition:

Press Release: “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time…”

Study: “Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls…”

Allison continues:

In the paper, the body of the text states:

Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls (Community 1: −0.77% change per year, 95% confidence interval [CI] = −2.06 to 0.52, P = 0.240; Community 2: −0.17% change per year, 95% CI = −1.45 to 1.11, P = 0.795).

Yet, the abstract concludes “This multisector intervention was associated with a modest reduction in obesity prevalence among seventh-graders in one community compared to controls . . .”

The publicity also seems to exaggerate the findings, stating, “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time, suggesting that such a comprehensive approach holds promise for the future, according to a new study from Harvard T.H. Chan School of Public Health.”

On one hand, we shouldn’t be using “p = 0.05” as a cutoff. Just cos a 95% conf interval excludes 0, it doesn’t mean a pattern in data reproduces in the general population; and just cos an interval includes 0, it doesn’t mean that nothing’s going on. So, with that in mind, sure, there’s nothing wrong with saying that the intervention “was associated with a modest reduction,” as long as you make clear that there’s uncertainty here, and the data are also consistent with a zero effect or even a modest increase in obesity.

On the other hand, there is a problem here with degrees of freedom available for researchers and publicists. The 95\% interval was [-2.1, 0.5], and this was reported as “a modest reduction” that was “holding promise for the future.” Suppose the 95% confidence interval had gone the other way and had been [-0.5, 2.1]. Would they have reported it as “a modest increase in obesity . . . holding danger for the future”? I doubt it. Rather, I expect they would’ve reported this outcome as a null (the p-value is 0.24, after all!) and gone searching for the positive results.

So there is a problem here, not so much with the reporting of this claim in isolation but with the larger way in which a study produces a big-ass pile of numbers which can then be mined to tell whatever story you want.

P.S. Just as a side note: above, I used the awkward but careful phrase “a pattern in data reproduces in the general population” rather than the convenient but vague formulation “the effect is real.”

P.P.S. I sent this to John Carlin, who replied:

That’s interesting and an area that I’ve had a bit to do with – basically there are zillions of attempts at interventions like this and none of them seem to work, so my prior distribution would be deflating this effect even more. The other point that occurred to me is that the discussion seems to have focussed entirely on the “time-confounded” before-after effect in the intervention group rather than the randomised(??) comparison with the control group – which looks even weaker.

John wanted to emphasize, though, that he’s not looked at the paper. So his comment represents a general impression, not a specific comment on what was done in this particular research project.

## Andrew vs. the Multi-Armed Bandit

Andrew and I were talking about coding up some sequential designs for A/B testing in Stan the other week. I volunteered to do the legwork and implement some examples. The literature is very accessible these days—it can be found under the subject heading “multi-armed bandits.” There’s even a Wikipedia page on multi-armed bandits that lays out the basic problem formulation.

It didn’t take long to implement some policies and a simulation framework in R (case study coming after I do some more plots and simulate some contextual bandit probability matching policies). When I responded with positive results on the Bernoulli bandit case with probability matching (aka “Thompson sampling”), Andrew replied with a concern about the nomenclature, which he said I could summarize publicly.

First, Each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Indeed. The Wikipedia says it’s sometimes called the K-bandit problem instead, but I haven’t seen any evidence of that.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Yes. Nevertheless, people play slot machines. I suppose an economist would say the financial payout has low expected value, but high variance, so play may be rational for a risk-happy player who enjoys to play.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

Seriously? The first comments I understand, but the third is perplexing when considering the source is the same statistician and blogger who brought us “Red State, Blue State, Rich State, Poor State”, “Mister P”, “Cantor’s Corner”, and a seemingly endless series of cat pictures, click-bait titles and posts named after novels.

I might not be so sensitive about this had I not pleaded in vain not to name our software “Stan” because it was too cute, too obscure, and the worst choice for SEO since “BUGS”.

P.S. Here’s the conjugate model for Bernoulli bandits in Stan followed by the probability matching policy implemented in RStan.

data {
int K;                // num arms
int N;                // num trials
int z[N];  // arm on trial n
int y[N];  // reward on trial n
}
transformed data {
int successes[K] = rep_array(0, K);
int trials[K] = rep_array(0, K);
for (n in 1:N) {
trials[z[n]] += 1;
successes[z[n]] += y[n];
}
}
generated quantities {
simplex[K] is_best;
vector[K] theta;
for (k in 1:K)
theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);
{
real best_prob = max(theta);
for (k in 1:K)
is_best[k] = (theta[k] >= best_prob);
is_best /= sum(is_best);  // uniform for ties
}
}


The generated quantities compute the indicator function needed to compute the event probability that a given bandit is the best. That’s then used as the distribution to sample which bandit’s arm to pull next in the policy.

model_conjugate <- stan_model("bernoulli-bandits-conjugate.stan")
fit_bernoulli_bandit <- function(y, z, K) {
data <- list(K = K, N = length(y), y = y, z = z)
sampling(model_conjugate, algorithm = "Fixed_param", data = data,
warmup = 0, chains = 1, iter = 1000, refresh = 0)
}

expectation <- function(fit, param) {
posterior_summary <- summary(fit, pars = param, probs = c())
posterior_summary\$summary[ , "mean"]
}

thompson_sampling_policy <- function(y, z, K) {
posterior <- fit_bernoulli_bandit(y, z, K)
p_best <- expectation(posterior, "is_best")
sample(K, 1, replace = TRUE, p_best)
}


1000 itertions is overkill here, but given everything's conjugate it's very fast. I can get away with a single chain because it's pure Monte Carlo (not MCMC)---that's the beauty of conjugate models. It's often possible to exploit conjugacy or collect sufficient statistics to produce more efficient implementations of a given posterior.

## Snappy Titles: Deterministic claims increase the probability of getting a paper published in a psychology journal

A junior psychology researcher who would like to remain anonymous writes:

I wanted to pass along something I found to be of interest today as a proponent of pre-registration.

Here is a recent article from Social Psychological and Personality Science. I was interested by the pre-registered study. Here is the pre-registration for Study 1.

The pre-registration lists eight (directional) hypotheses. Yet, in Study 1 of the paper, only 2-3 of these hypotheses are reported. Footnote 1 states that the study included some “exploratory” measures that were inconsistent and so not reported in the main text.

In one sense, this is pre-registration working well in that I can now identify these inconsistencies. In another sense, I think it’s a misuse of pre-registration when literally the first hypothesis listed (and three of the first four hypotheses) is now labeled as exploratory in the published version. It’s the first time I’ve seen this and I wonder how common it will become in the future, knowing that many people don’t look closely at pre-registrations.

My reply: I have no idea, but I was struck by how the paper in question follows the social-psychology-article template, with features such as a double-barreled title with catchy phrase followed by a deterministic claim (“Poisoned Praise: Discounted Praise Backfires and Undermines Subordinate Impressions in the Minds of the Powerful”) and a lead-off celebrity quote (“Flattery and knavery are blood relations. — Abraham Lincoln”). This does not mean the results are wrong—I guess if you’re interested in that question, you’ll have to wait for the preregistered replication—it’s just interesting to see the research pattern being essentially unchanged. Just speaking generally, without any knowledge of this particular topic, I’m skeptical about the possibility of learning about complex interactions (for example, this tongue-twister from the abstract: “high-power people’s tendency to discount feedback only produced negative partner perceptions when positive feedback, but not neutral feedback, was discounted”) from this sort of noisy between-person experiment.

I agree with my correspondent that preregistration is a good first step, but ultimately I think it will be necessary to use more efficient experimental designs and a stronger integration of data collection with theory, if the goal is to make progress in understanding of cognition and behavior.

P.S. My correspondent asked for anonymity, and that is a sign of something wrong in academia, that it’s can be difficult for people to criticize published papers. I don’t think this is a problem unique to psychology. In a better world, the person who sent me the above criticism would not be afraid to publish it openly and directly.

## Here’s a post with a Super Bowl theme.

Kevin Lewis pointed me to an article in the Journal of the American Medical Association, using the email subject line, “Not statistically significant, but close.” The article in question, by Atheendar Venkataramani, Maheer Gandhavadi, and Anupam Jena, is called, “Association Between Playing American Football in the National Football League and Long-term Mortality,” and it reports:

In this retrospective cohort study of 3812 NFL players who debuted between 1982 and 1992, there was no statistically significant difference in the risk of long-term all-cause mortality among career NFL players compared with NFL replacement players who participated in the NFL during a 3-game league-wide player strike in 1987 (adjusted hazard ratio, 1.38; 95% CI, 0.95-1.99). Career participation in NFL football, compared with participation as an NFL replacement player, was not associated with a statistically significant difference in the risk of all-cause mortality.

They report:

The study sample consisted of 2933 career NFL players and 879 NFL replacement players . . . A total of 144 deaths occurred among career NFL players . . . and 37 deaths occurred among NFL replacement players. . . . The mortality risk by study end line was 4.9% for career NFL players and 4.2% for NFL replacement players, and the unadjusted absolute risk difference was 0.7% . . . Adjusting for birth year, BMI, height, and position group, the absolute risk difference was 1.0%.

They also report that “Four career NFL players were excluded from the sample because they died while actively employed on an NFL roster.” Including them would bump up the mortality rate from 4.9% to 5.0%.

The real story, though, is that the sample is not large enough to reliably detect the difference between a 4% mortality rate and a 5% mortality rate.

From the standpoint of statistical communication, the challenge is how to report such a finding.

I think the authors did a pretty good job of not overselling their results. To start with, their title is noncommittal. They do report, “not associated with a statistically significant difference,” which I don’t think is so helpful. But, flip it around: what if the numbers had been very slightly different and the 95% interval had excluded a null effect? Then we wouldn’t want them blaring that they’d found a clear effect.

And the paper includes a strong section on Limitations:

This study has several limitations. First, the residual differences between career NFL players and NFL replacement players that may bias the results could not be addressed. Additional research that collects data on medical comorbidities, lifestyle factors, and earnings across the 2 groups could further address this bias. A complementary approach might focus on groups of NFL career and replacement players who are more similar in their duration of NFL exposure. For example, comparing career NFL players with shorter tenures vs NFL replacement players may avoid healthy worker bias from players who are able to persist in the league because of better unobserved health. . . .

Second, the estimates were based on a small number of events: 181 deaths, of which only 37 occurred among NFL replacement players. Consequently, the present analysis could be underpowered to detect meaningful associations. Additional analyses with longer-term follow-up may be informative.

Third, the data were drawn from online databases. While these have been used in other analyses, it is possible that some information was misreported . . .

That’s how you do it: report what you found and be clear about what you did.

The corresponding author works at the University of Pennsylvania. Eagles fan?

## What to teach in a statistics course for journalists?

Pascal Biber writes:

I am a science journalist for Swiss public television and have previously regularly covered the “crisis in science” on Swiss public radio, including things like p-hacking, relative risks, confidence intervals, reproducibility etc.

I have been giving courses in basic statistics and how to read scientific studies for Swiss journalists without science backgrounds. As such, I am increasingly wondering what to teach (knowing they will not dive into Bayesian statistics). How should a non-science journalist handle p values? Should he at all? What about 95 percent confidence intervals? Should relative risks be reported when absolute numbers aren’t available? What about small studies? Should they even report about single studies? And how about meta-analysis?

I wondered if you have some basic advise for journalists that you could share with me? Or are you aware of good existing checklists?

To start with, here’s my article from 2013, “Science Journalism and the Art of Expressing Uncertainty.”

Also relevant is this post on Taking Data Journalism Seriously. And this on what to avoid.

There are some teaching materials out there, which you can find by googling *statistics for journalists* or similar terms, but there is a problem that once you try to get in deeper, there’s little agreement in how to proceed.

I suppose that the starting point is understanding the statistical terms that often arise in science: sample and population, treatment and control, randomized experiment, observational study, regression, probability.

Should you teach p-values and 95% intervals? I’m not sure. OK, yeah, you have to say something about these, as they appear in so many published papers. And then you have to explain how these terms are defined. And then you have to explain the problems with these ideas. I think there’s no way around it.

Bayesian methods? Sure, you have to say something about these, because, again, they’re used in published papers. You don’t have to “dive” into Bayesian methods but you can and should explain the idea of predictive probabilities such as arise in election forecasts.

For the big picture, I recommend this bit from the first page of Regression and Other Stories:

Gary Smith sends along this news article from Jason Samenow, weather editor of the Washington Post, who writes:

Three years ago, a scientific study claimed that storms named Debby are predisposed to kill more people than storms named Don. The study alleged that people don’t take female-named storms as seriously. Numerous analyses have since found that this conclusion has little merit.

Good to see the record being corrected. But there’s more. Samenow continues:

Samenow concludes:

The publication of this study and the hype it generated (and continues to generate) show the potential pitfalls of coming to unqualified conclusions from a single study. It also reveals the importance of the peer review process, which encourages critical responses to new findings so that work that cannot withstand scrutiny does not endure.

Let me just emphasize that what really worked here was post-publication peer-review, an informal process using social media as well as traditional journals.

Good on Samenow for making this correction. It’s excellent to see a news editor taking responsibility for past stories. In contrast, Susan Fiske, the PPNAS editor who accepted that “himmicanes” paper and several others of comparable quality, has never acknowledged that there might be any problems with this work. Fiske lives in a traditional academic world in which all peer review occurs before publication, and where published results are protected by a sort of “thin blue line” of credentialed scientists,” a world in which it’s considered ok for a published paper, no matter how ridiculous, to get extravagant publicity, but where any negative remarks, no matter how well sourced, are supposed to be whispered.

We all make mistakes, and I don’t hold it against Fiske that she published that flawed paper. Any of us can be fooled by crap research that is wrapped up in a fancy package. I know I’ve been fooled on occasion. What’s important is to learn from our mistakes.

## When to add a feature to Stan? The recurring issue of the compound declare-distribute statement

At today’s Stan meeting (this is Bob, so I really do mean today), we revisited the topic of whether to add a feature to Stan that would let you put distributions on parameters with their declarations.

Compound declare-define statements

Mitzi added declare-define statements a while back, so you can now write:

transformed parameter {
real sigma = tau^-0.5;
}


It’s generally considered good style to put the declaration of a variable as close to where it is used as possible. It saves a lot of scanning back and forth to line things up. Stan’s forcing users to declare all the variables at the top violates this principle and compound declare-define clears a lot of that up (though not all—there’s a related proposal to let statements be mingled into the declarations at the top of a block like the transformed data and transformed parameters blocks).

Compound parameter declare-distribute statements?

Allowing compound declare-distribute statements in the parameters block would support writing linear regression as follows.

data {
int<lower = 0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha ~ normal(0, 10);
real beta ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 2);
}
model {
y ~ normal(alpha + beta * x, sigma);
}


The current way of writing this would have the same data block, but the parameters block wold only have declarations—uses get moved to the model block.

parameters {
real alpha;
real beta;
real<lower = 0> sigma;
}
model {
alpha ~ normal(0, 10);
beta ~ normal(0, 5);
sigma ~ normal(0, 2);
y ~ normal(alpha + beta * x, sigma);
}


Compared to the current approach, the proposed compound declare-distribute statements pulls the first three uses of the variables, which is to assign them priors, up into the parameters block where they are declared.

Who’s on which side on this?

I’m trying to remain neutral as to who has which opinion as to whether compound declare-distribute statements would be a good thing. (The implementation’s no big deal by the way, so that’s not even an issue in the discussion.)

Even more ambitious proposal

An even more ambitious proposal (still easy to implement) would be to open up the parameters block to general statements so that anything at all in the model block could be moved up to the parameters block, though the intent would have it just be priors. The reason this matters is that not every prior in Stan can be easily coded in a single line with a distribution.

Allowing statements in the parameters block makes the transformed parameters and model blocks redundant. So we’ll just have one block that I’ll call “foo” to avoid bike-shedding on the name.

data {
int<lower = 0> N;
vector[N] x;
vector[N] y;
}
foo {
real alpha ~ normal(0, 10);
real beta ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 2);
y ~ normal(alpha + beta * x, sigma);
}


It’d even be possible to just drop the block name and braces; but we’d still need to declare the generated quantities block.

Compound data declare-distribute statements

So far, we’ve only been talking about declare-distribute for parameters. But what about data? If what’s good for the goose is good for the gander, we can allow compound declare-define on data variables. But there’s a hitch—we can’t bring the sampling statement for the likelihood up into the data block, because we haven’t declared parameters yet.

The easy workaround here is to put the data declarations right on the variables instead of using blocks. Now we can move the data declarations down and put a compound declare-distribute statement on the modeled data y.

real alpha ~ normal(0, 10);
real beta ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 2);
data int<lower = 0> N;
data vector[N] x;
data vector[N] y ~ normal(alpha + beta * x, sigma);


This is pretty close to Maria Gorinova’s proposal for SlicStan and something that’s very much like Cibo Technologies’ StataStan. It looks more like PyMC3 or Edward or even BUGS than it looks like existing Stan, but will still support loops, (recursive) functions, conditionals, etc.

Now what?

Our options seem to be the following:

2. allow compound declare-distribute for parameters
3. further allow other statements in the parameters block

Stan’s basic design

To frame the discussion, I think it’ll help to let you know my thinking when I originally designed the Stan language. I was very much working backwards from the C++ object I knew I wanted. Matt Hoffman and I already had HMC working with autodiff on models defined in C++ at that point (just before or after that point is when Daniel Lee joined the coding effort).

I thought of the C++ class definition as the definition of the model—typically in terms of a joint log density (I really regret calling the log probability density function log_prob—it should’ve been lpdf).

The constructor of the model class took the data as an argument. So when you constructed an instance of the class, you instantiated the model with data. That conditions the model on the data, so an instantiated model defines the log density of the posterior.

The data block declaration was essentially a way to code the arguments to the constructor of that class. It’s like the function argument declarations for a function in a language like C or Java. (It works much more indirectly in practice in Stan through generic I/O reader interface callbacks; the declarations specify how to generate the code to read the relevant variables from a reader and define them as member variables in the model class.)

The parameter block declaration was similarly just a way to code the arguments to the log density function. (This also works indirectly in practice through readers by taking unconstrained parameters, peeling of the next one in the declaration by size, adding any Jacobian adjustments for change of variables to the log density, and defining a local variable on the constrained scale to be used by the model—that’s also how we can generalize Maria Gorinova’s proposal for functions in SlicStan that required compile-time unfolding).

The model block was just the implementation of the body of the log density function.

That’s it for the basics and it’s still how I like to explain the Stan language to people—it’s just a way of writing down a log density function.

Later, I added the transformed data block as a way to save intermediate values at construction time. That is, member variables in the class that would be computed based on the data. The transformed parameter block is the same deal, only for parameters. So they define transforms as local variables for the model block to use (and get printed).

Next up…

Breaking the target log density into the sum of a log likelihood and log prior (or penalty)…

## The Anti-Bayesian Moment and Its Passing

This bit of reconstructed intellectual history is from a few years ago but I thought it’s worth repeating. It comes from the rejoinder that X and I wrote to our article, “‘Not only defended but also applied’: The perceived absurdity of Bayesian inference.” The rejoinder is called “The anti-Bayesian moment and its passing,” and it begins:

Over the years we have often felt frustration, both at smug Bayesians—in particular, those who object to checking of the fit of model to data, because all Bayesian models are held to be subjective and thus unquestioned (an odd combination indeed, but that is the subject of another article)—and angry anti-Bayesians who, as we wrote in our article, strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

The present article arose from our memory of a particularly intemperate anti-Bayesian statement that appeared in Feller’s beautiful and classic book on probability theory. We felt that it was worth exploring the very extremeness of Feller’s words, along with similar anti-Bayesian remarks by others, to better understand the background underlying controversies that still exist regarding the foundations of statistics. . . .

Here’s the key bit:

The second reason we suspect for Feller’s rabidly anti-Bayesian stance is the postwar success of classical Neyman-Pearson ideas. Many leading mathematicians and statisticians had worked on military problems during the World War II, using available statistical tools to solve real problems in real time. Serious applied work motivates the development of new methods and also builds a sense of confidence in the existing methods that have led to such success. After some formalization and mathematical development of the immediate postwar period, it was natural to feel that, with a bit more research, the hypothesis testing framework could be adapted to solve any statistical problem. In contrast, Thornton Fry could express his skepticism about Bayesian methods but could not so easily dismiss the entire enterprise, given that there was no comprehensive existing alternative.

If 1950 was the anti-Bayesian moment, it was due to the successes of the Neyman–Pearson–Wald approach, which was still young and growing, with its limitations not yet understood. In the context of this comparison, there was no need for researchers in the mainstream of statistics to become aware of the scattered successes of Bayesian inference. . . .

As Deborah Mayo notes, the anti-Bayesian moment, if it ever existed, has long passed. Influential non-Bayesian statisticians such as Cox and Efron are hardly anti-Bayesian, instead demonstrating both by their words and their practice a willingness to use full probability models as well as frequency evaluations in their methods, and purely Bayesian approaches have achieved footholds in fields as diverse as linguistics, marketing, political science, and toxicology.

If there ever was a “pure Bayesian moment,” that too has passed with the advent of “big data” that for computational reasons can only be analyzed using approximate methods. We have gone from an era in which prior distributions cannot be trusted to an era in which full probability models serve in many cases as motivations for the development of data-analysis algorithms.

## Geoff Norman: Is science a special kind of storytelling?

Javier Benítez points to this article by epidemiologist Geoff Norman, who writes:

The nature of science was summarized beautifully by a Stanford professor of science education, Mary Budd Rowe, who said that:

Science is a special kind of story-telling with no right or wrong answers. Just better and better stories.

Benítez writes that he doesn’t buy this.

Neither do I, but I read the rest of Norman’s article and I really liked it.

Here’s a great bit:

I [Norman] turn to a wonderful paper by Cook et al. (2008) that described three fundamentally different kinds of research question:

1. Description

“I have a new (curriculum, questionnaire, simulation, OSCE method, course) and here’s how I developed it”

That’s not even poor research. It’s not research at all.

2. Justification

“I have a new (curriculum, module, course, software) and it works. Students really like it” OR “students self-reported knowledge was higher” OR “students did better on the final exam than a control group” OR even “students had lower mortality rates after the course”

OK, it’s research. But is it science? After all, what do we know about how the instruction actually works? Do we have to take the whole thing on board lock, stock and barrel, to get the effects? What’s the active ingredient? In short, WHY is it better? And that brings us to

3. Clarification

“I have a new (curriculum, module, course, software). It contains a number of potentially active ingredients including careful sequencing of concepts, imbedding of concepts in a problem, interleaved practice, and distributed practice. I have conducted a program of research where these factors have been systematically investigated and the effectiveness of each was demonstrated”.

That’s more like it. We’re not asking if it works, we’re asking why it works. And the results truly add to our knowledge about effective strategies in education. So one essential characteristic is that the findings are not limited to the particular gizmo under scrutiny in the study. The study adds to our general understanding of the nature of teaching and learning.

I don’t want to get caught up in a debate on what’s “science” or what’s “research” or whatever; the key point is that science and statistics are not, and should not, be, just about “what works” but rather “How does it work?” This relates to our earlier post about the problem with the usual formulation of clinical trials as purely evaluative, which leaves you in the lurch if someone doesn’t happen to provide you with a new and super-effective treatment to try out.

To put it another way, the “take a pill” or “black box” approach to statistical evaluation would work ok if we were regularly testing wonder-pills. But in the real world, effects are typically highly variable, and we won’t get far without looking into the damn box.

Norman also writes:

The most critical aspect of a theory is that, instead of addressing a simple “Does it work?,” to which the answer is “Yup” or “Nope”, it permits a critical examination of the effects and interactions of any number of variables that may potentially influence a phenomenon. So, one question leads to another question, and before you know it, we have a research program. Programmatic strategies inevitably lead to much more sophisticated understanding. And along the way, if it’s really going well, each study leads to further insights that in turn lead to the next study question. And each answer leads to further insight and explanation.

One interesting thing here is that this Lakatosian description might seem at first glance to be a good description, not just of healthy scientific research, but also of degenerate research programmes associated with elderly-words-and-slow-walking, or ovulation and clothing, or beauty and sex ratio, or power pose, or various other problematic research agendas that we’ve criticized in this space during the past decade. All these subfields, which have turned to be noise-spinning dead ends, feature a series of studies and publications, with each new result leading to new questions. But these studies are uncontrolled. Part of the problem there is a lack of substantive theories—no, vague pointing to just-so evolutionary stories is not enough. But another problem is statistical, the p-values and all that.

Now let’s put all these ideas together:
– If you want to engage in a scientifically successful research programme, your theories should be “thick” and full of interactions, not mere estimates of average treatment effects. That’s fine: all the areas we’ve been discussing, including the theories we don’t respect, are complex. Daryl Bem’s ESP hypotheses, for example: they were full of interactions.
– But now the next step is to model those interactions, to consider them together. If, for example, you decide that outdoor temperature is an important variable (as in that ovulation-and-clothing paper), you go back and include it in analysis of earlier as well as later studies. And if you’re part of a literature that includes other factors such as age, marital status, political orientation, etc., then, again, you include all these too.
– Including all these factors and then analyzing in a reasonable way (I’d prefer a multilevel model but, if you’re careful, even some classical multiple comparisons approach could do the trick) would reveal not much other than noise in those problematic research areas. In contrast, in a field where real progress is being made, a full analysis should reveal persistent patterns. For example, political scientists keep studying polarization in different ways. I think it’s real.

The point of my above discussion is to elaborate on Norman’s article by emphasizing the interlocking roles of substantive theory, data collection, and statistical analysis. I think that in Norman’s discussion, he’s kinda taking for granted that the statistical analysis will respect the substantive theory, but we see real problems when this doesn’t happen, in papers that consider isolated hypotheses without considering the interactions pointed to even within their own literatures.

P.S. Regarding the title of this post, here’s what Basbøll and I wrote a few years ago about science and stories. Although here we were talking not about storytelling but about scientists’ use of and understanding of stories.

## Education, maternity leave, and breastfeeding

In today’s column, How We Are Ruining America, David Brooks writes that “Upper-middle-class moms have the means and the maternity leaves to breast-feed their babies at much higher rates than high school-educated moms, and for much longer periods.”

He’s correct about college grads being more likely to have access to maternity leave, but since this (and the cost of child care) translates to college grads being more likely to work and less likely to drop out of the labor force, I think he’s probably incorrect about this translating into ability to breastfeed. I used Census (ACS 2011-2015) data to look at typical hours worked per week in the previous year as well as current employment status among the mothers of infants. They both show the same thing: Education among the mothers of infants is positively correlated with number of hours typically worked per week in the previous year, as well as with currently being employed.

There is a leave gap, but it’s small in absolute terms. If you look at the sum of ‘not working’ + ’employed but on leave’ in the employment status graph (the ’employed but on leave’ group would be part of ’employed but not at work’) and assume that’s the group that’s not being prevented from breastfeeding due to their jobs, it’s the biggest among the less-educated group and the smallest among the most-educated group.

Below are my graphs and here is my workbook, and they’re also here along with the data. It’s possible that ‘hours worked’ is reflecting hours worked pre-baby – it depends on how old the baby is/how the respondent chose to answer. But it’s telling the same story as the employment status graph, and it’s tough to reconcile with Brooks’ story.

I’ve not looked at the details here but wanted to share with you.

## Postdoc opening on subgroup analysis and risk-benefit analysis at Merck pharmaceuticals research lab

Richard Baumgartner writes:

We are looking for a strong postdoctoral fellow for a very interesting cutting edge project.

The project requires expertise in statistical modeling and machine learning.

Here is the official job ad. We are looking for candidates that are strong both analytically and computationally (excellent coding skills).

In the project, we are interested in subgroup analysis for benefit-risk, also optimal treatment allocation methods would be of interest.

I don’t post every job ad that gets sent to me, but this one sounds particularly interesting, based on the description above.

P.S. Full disclosure: I consult for pharmaceutical companies and have been paid by Merck.

## My suggested project for the MIT Better Science Ideathon: assessing the reasonableness of missing-data imputations.

Leo Celi writes:

We are 3 months away from the MIT Better Science Ideathon on April 23.

We would like to request your help with mentoring a team or 2 during the ideathon. During the ideathon, teams discuss a specific issue (lack of focus on reproducibility across majority of journals) or problem that arose from a well-intentioned initiative (e.g., proliferation of open-access journals), brainstorm on how to address the issue/problem and design a strategy. In this regard, we are soliciting ideas that the teams can choose from for the ideathon.

I’ll be participating in this event, and here’s my suggested project: assessing the reasonableness of missing-data imputations. Here are 3 things to read:
http://www.stat.columbia.edu/~gelman/research/published/paper77notblind.pdf
http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf

A Python program for multivariate missing-data imputation that works on large datasets!?

This last link is particularly relevant as it points to some Python code that the students can run to get started.

Of course, anyone can work on this project; no need to go to the Ideathon to do it. I guess the plan is for the Ideathon to motivate groups of people to focus.