
Expediting organised experience: What statistics should be?

The above diagram is by John F. Sowa and it depicts a high level view of C.S. Peirce’s classification of the sciences of discovery (you have been warned). The dotted lines indicate what on the right should be informed by what is on the left.

I think there is a missing spot (or better, an opportunity) for statistics in this diagram. Where and what should statistics be?

First, to quickly give a sense of the diagram: starting on the far left, Peirce construed mathematics as experimenting with (manipulating) representations as representations, to discern what they purport (could be taken) to represent. That is, to discover their implications if taken as exactly true. A concrete example would be manipulating diagrams to learn about their inherent properties (e.g. moving around the lines depicting a triangle to prove the angles always add up to 180 degrees). More abstractly, it could be the contemplation of any aspect of any sort of representation to learn only about the representation itself.

Moving to the right, under the philosophical branch, phenomenology is a discernment of what is present in consciousness. Peirce argued these are possibilities, necessities and construals (reflective inferences of varying degrees). Consciousness was argued as always being some mix of these three. Next to the right are the Normative Sciences of Esthetics, Ethics and Logic. They were discussed here. But to keep the focus here,  Esthetics was discerning what to value (grasping empirical “reality” being suggested as the best), Ethics how to deliberately conduct oneself to achieve what is valued and Logic how to deliberately think by representing empirical “reality” least wrongly and re-representing it without making it more wrong (e.g. truth preserving manipulations). Metaphysics was the discernment of “reality” as it most  generally could be so as to not misguide, block or hamper organised experience of the world as it is.

Finally the right most branch Empirical – which is the bottom line for most of us – being Organised Experience split into Natural versus Social Science.

What should statistics be and where should statistics be placed in this diagram?


What’s the point of a robustness check?

Diomides Mavroyiannis writes:

I am currently a doctoral student in economics in France, I’ve been reading your blog for a while and I have this question that’s bugging me.

I often go to seminars where speakers present their statistical evidence for various theses. I was wondering if you could shed light on robustness checks, what is their link with replicability? I ask this because robustness checks are always just mentioned as a side note to presentations (yes we did a robustness check and it still works!). Is there any theory on what percent of results should pass the robustness check? Is it not suspicious that I’ve never heard anybody say that their results do NOT pass a check? Is this selection bias? is there something shady going on? or is there no reason to think that a proportion of the checks will fail?

Good question. Robustness checks can serve different goals:

1. The official reason, as it were, for a robustness check, is to see how your conclusions change when your assumptions change. From a Bayesian perspective there’s not a huge need for this—to the extent that you have important uncertainty in your assumptions you should incorporate this into your model—but, sure, at the end of the day there are always some data-analysis choices so it can make sense to consider other branches of the multiverse.

2. But the usual reason for a robustness check, I think, is to demonstrate that your main analysis is OK. This sort of robustness check—and I’ve done it too—has some real problems. It’s typically performed under the assumption that whatever you’re doing is just fine, and the audience for the robustness check includes the journal editor, referees, and anyone else out there who might be skeptical of your claims.

Sometimes this makes sense. For example, maybe you have discrete data with many categories, you fit using a continuous regression model which makes your analysis easier to perform, more flexible, and also easier to understand and explain—and then it makes sense to do a robustness check, re-fitting using ordered logit, just to check that nothing changes much.

Other times, though, I suspect that robustness checks lull people into a false sense of you-know-what. It’s a bit of the Armstrong principle, actually: You do the robustness check to shut up the damn reviewers, you have every motivation for the robustness check to show that your result persists . . . and so, guess what? You do the robustness check and you find that your result persists. Not much is really learned from such an exercise.

As Uri Simonsohn wrote:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

True story: A colleague and I used to joke that our findings were “robust to coding errors” because often we’d find bugs in the little programs we’d written—hey, it happens!—but when we fixed things it just about never changed our main conclusions.

“Five ways to fix statistics”

Nature magazine just published a short feature on statistics and the replication crisis, featuring the following five op-ed-sized bits:

Jeff Leek: Adjust for human cognition

Blake McShane, Andrew Gelman, David Gal, Christian Robert, and Jennifer Tackett: Abandon statistical significance

David Colquhoun: State false-positive risk, too

Michele Nuijten: Share analysis plans and results

Steven Goodman: Change norms from within

Our segment was listed as by Blake and me, but that’s because Nature would not allow us to include more than two authors. Our full article is here; see also this response to comments, which also includes links to relevant papers by Amrhein, Korner-Nievergelt, and Roth, and Amrhein and Greenland.

Regarding the five short articles above: You can read them yourself, but my quick take is that these discussions all seem reasonable, except for the one by Colquhoun, which to my taste is unhelpful in that it sticks with the false-positive, false-negative framework which Blake and I find problematic for reasons discussed in our paper and elsewhere.

Also, I agree with just about everything in Leek’s article except for this statement: “It’s also impractical to say that statistical metrics such as P values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful.” Yes, decisions need to be made, but to suggest that p-values be used to make editorial or funding decisions: that’s just horrible. That’s what’s landed us in the current mess. As my colleagues and I have discussed, we strongly feel that editorial and funding decisions should be based on theory, statistical evidence, and cost-benefit analyses, not on a noisy measure such as a p-value. Remember that if you’re in a setting where the true effect is two standard errors away from zero, the p-value could easily be anywhere between 0.00006 and 1. That is, in such a setting, the 95% predictive interval for the z-score is (0, 4), which corresponds to a 95% predictive interval for the p-value of (1.0, 0.00006). That’s how noisy the p-value is. So, no, don’t use it to make editorial and funding decisions. Please.
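To make that arithmetic concrete, here is a minimal Python sketch, assuming the observed z-score is normal with mean 2 and unit standard deviation and that the p-value is two-sided. The function name two_sided_p is mine, just for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)


def two_sided_p(z):
    """Two-sided p-value for an observed z-score."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


true_effect = 2.0  # true effect, measured in standard errors

# 95% predictive interval for the observed z-score, z ~ normal(true_effect, 1)
z_lo = true_effect - 1.96
z_hi = true_effect + 1.96

print(f"z-score interval: ({z_lo:.2f}, {z_hi:.2f})")
print(f"p-value at the low end of z:  {two_sided_p(z_lo):.3f}")
print(f"p-value at the high end of z: {two_sided_p(z_hi):.6f}")

# The rounded endpoints used in the text, z = 0 and z = 4
print(f"p-value at z = 0: {two_sided_p(0.0):.1f}")
print(f"p-value at z = 4: {two_sided_p(4.0):.6f}")
```

The same true effect can produce a p-value near 1 or a p-value several orders of magnitude below 0.05, purely from sampling variation.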

And I disagree with this, from the sub-headline: “The problem is not our maths, but ourselves.” Some of the problem is in “our maths” in the sense that people are using procedures with poor statistical properties.

Overall, though, I like the reasonable message being sent by the various authors here. As Goodman writes, “No single approach will address problems in all fields.”

Computational and statistical issues with uniform interval priors

There are two anti-patterns* for prior specification in Stan programs that can be sourced directly to idioms developed for BUGS. One is the diffuse gamma priors that Andrew’s already written about at length. The second is interval-based priors. Which brings us to today’s post.

Interval priors

An interval prior is something like this in Stan (and in standard mathematical notation):

sigma ~ uniform(0.1, 2);

In Stan, such a prior presupposes that the parameter sigma is declared with the same bounds.

real<lower=0.1, upper=2> sigma;

We see a lot of examples where users either don’t know or don’t remember to constrain sigma. It’s impossible to infer bounds in general in Stan because of its underlying Turing-complete imperative programming language component—the bounds might be computed in a function call. BUGS’s restriction to directed graphical models lets you infer the bounds (at runtime).

Computational problems

Suppose the true value for sigma lies somewhere near one of the boundaries. The boundaries are mapped to the unconstrained scale using a log-odds (aka logit) transform [thanks to Vadim Kantorov for the correction to the first version]:

sigma = 0.1 + 1.9 * inv_logit(sigma_free)

sigma_free = logit((sigma - 0.1) / 1.9)

where logit(u) = log(u / (1 - u)). Stan actually works on the unconstrained space, sampling sigma_free and producing sigma by inverse transform based on the declared constraints.

When sigma approaches one of the boundaries, sigma_free moves toward positive or negative infinity.
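To see how quickly sigma_free runs off as sigma nears a bound, here is a minimal Python sketch of the two transforms above for the interval (0.1, 2). The function names to_unconstrained and to_constrained are mine, not Stan's, and this is only the arithmetic, not how Stan implements it internally.

```python
import math

LOWER, UPPER = 0.1, 2.0  # the declared bounds on sigma


def to_unconstrained(sigma):
    """sigma_free = logit((sigma - LOWER) / (UPPER - LOWER))."""
    u = (sigma - LOWER) / (UPPER - LOWER)
    return math.log(u / (1.0 - u))


def to_constrained(sigma_free):
    """sigma = LOWER + (UPPER - LOWER) * inv_logit(sigma_free)."""
    return LOWER + (UPPER - LOWER) / (1.0 + math.exp(-sigma_free))


# Equal-looking steps toward the upper bound on the constrained scale
# turn into ever larger moves on the unconstrained scale.
for sigma in (1.05, 1.9, 1.99, 1.999, 1.9999):
    print(f"sigma = {sigma:<7} sigma_free = {to_unconstrained(sigma):7.3f}")
```

The midpoint of the interval maps to zero, and each extra digit of closeness to the bound pushes sigma_free out by roughly another log(10).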

This leads to computational difficulty in setting step sizes if the possible values include both values near the boundary and values even a little bit away from the boundary. We need very large step sizes to move near the boundary and relatively small step sizes to move elsewhere. Euclidean Hamiltonian Monte Carlo, as used in Stan, fixes a single step size (it’s very challenging to try to change this assumption—jittering step size a bit rarely helps).

Statistical problems

The more worrying problem is statistical. Often these interval constraints are imposed under the assumption that the model author knows the value lies in the interval. Let’s suppose the value lies near the upper end of the interval (0.1, 2). Then what happens is that any posterior mass that would be outside of (0.1, 2) if the prior were uniform on (0.1, 5) is pushed below 2. This then reduces the posterior uncertainty and biases the mean estimate lower compared to the wider prior.


A simple Stan program exemplifies the problem:

data {
  real L;
  real<lower=L> U;
}
parameters {
  real<lower=L, upper=U> y;
}
model {
  y ~ normal(0, 1);
}

The parameter y is given a lower bound and upper bound constraint where it is declared in the parameters block, then given a standard normal distribution in the model block. Without the bounds (which act as an implicit uniform prior on (L, U)), y should clearly have a standard normal distribution.

Now let’s fit it with a very wide interval for bounds, say (-100, 100).

> fit1 <- stan("interval.stan", data = list(L = -100, U = 100), iter=10000)

> probs <- c(pnorm(-2:0))

> probs
[1] 0.02275013 0.15865525 0.50000000

> print(fit1, probs=probs)

     mean se_mean   sd 2.275013% 15.86553%  50% n_eff Rhat
y    0.00    0.01 0.99     -2.00     -1.00 0.00  7071    1

The pnorm() function is the cumulative distribution function for the standard normal (location zero, unit scale). So the values of probs are the cumulative probabilities at -2, -1, and 0 (roughly 0.023, 0.16, and 0.50), which are then used as the quantile levels in the printed summary.

The posterior appears to be standard normal, with Stan recovering the quantiles corresponding to -2, -1, and 0 values in the true distribution to within two decimal places (about the accuracy we expect here given the standard error report of 0.01).

What happens if we instead provide an interval prior with tighter bounds that is asymmetric around the mean of zero, say uniform(-1, 2)? Let’s see.

> fit2 <- stan("interval.stan", data = list(L = -1, U = 2), iter=10000)

> print(fit2, probs=probs)

      mean se_mean   sd 2.275013% 15.86553%   50% n_eff Rhat
y     0.23    0.01 0.73     -0.93     -0.56  0.17  7224    1

Now the posterior mean is estimated as 0.23 and the posterior median as 0.17. That’s not standard normal. What happened? The posterior is not only no longer symmetric (mean and median differ), it’s no longer centered around 0.

Even though we know the “true” posterior mean would be zero without the constraint, adding an interval constraint (-1, 2) modifies the posterior so that it is not symmetric, has a higher mean, and a lower standard deviation.

If we had chosen (-1, 1), the posterior would be symmetric, the posterior mean would still be zero, but the posterior standard deviations would be lower than with the (-100, 100) uniform prior.
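These numbers can be checked without running Stan, because the posterior here is just a standard normal truncated to (L, U). Here is a minimal Python sketch using the textbook closed forms for the truncated normal mean and variance; the function name truncated_std_normal_summary is mine.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)


def truncated_std_normal_summary(lower, upper):
    """Mean, sd, and median of a standard normal truncated to (lower, upper)."""
    mass = std_normal.cdf(upper) - std_normal.cdf(lower)
    pdf_lo, pdf_hi = std_normal.pdf(lower), std_normal.pdf(upper)
    mean = (pdf_lo - pdf_hi) / mass
    variance = 1.0 + (lower * pdf_lo - upper * pdf_hi) / mass - mean ** 2
    median = std_normal.inv_cdf(std_normal.cdf(lower) + 0.5 * mass)
    return mean, math.sqrt(variance), median


for bounds in ((-100.0, 100.0), (-1.0, 2.0), (-1.0, 1.0)):
    mean, sd, median = truncated_std_normal_summary(*bounds)
    print(f"bounds {bounds}: mean {mean:.2f}, sd {sd:.2f}, median {median:.2f}")
```

For the (-1, 2) bounds the exact mean and median should land close to the 0.23 and 0.17 that Stan reported, with a standard deviation a bit above 0.7, and for (-1, 1) the mean stays at zero while the standard deviation shrinks.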

The take home message

The whole posterior matters for calculating the posterior mean, posterior variance, and posterior intervals. Imposing narrow, uniform priors on an interval can bias estimates with respect to wider interval priors.

The lesson is that uniform priors are dangerous if any posterior mass would extend past the boundaries if a wider uniform interval were used.

If you want a wide uniform prior, you can just use an improper uniform prior in Stan (as long as the posterior is proper).

If you think diffuse inverse gamma priors are the answer, that was the other anti-pattern I alluded to earlier. It’s described in Andrew’s paper Prior distributions for variance parameters in hierarchical models (published as a comment on another paper!) and in BDA3.

But wait, there’s more

If you want more advice from the Stan dev team on priors, check out our wiki page:

Or you can wait a few years for Andrew and Aki to consolidate it all into BDA4.

* The Wikipedia page on anti-patterns requires “two key elements” of an anti-pattern:

  1. A commonly used process, structure, or pattern of action that despite initially appearing to be an appropriate and effective response to a problem, has more bad consequences than good ones.
  2. Another solution exists that is documented, repeatable, and proven to be effective.

Check and check.


Driving a stake through that ages-ending-in-9 paper

David Richter writes:

Here’s a letter to the editor [in PPNAS] in response to the ‘people with ages ending in 9’ paper?

We point out some problems with their analyses and their data and tried to replicate their theory in a large German panel study using a within-subjects design and variables close to those used in their paper.

We found no evidence for their theory and this is perfectly in line with your blog post from today [18 Sep 2016]: if hypotheses were ‘true’, they should be ‘true’ independent of the data sources used.

My reply: What’s particularly ridiculous about that ages-ending-in-9 paper was that even their own data did not particularly support their hypothesis (as I discussed in that post from a few years ago).

It’s a sad reflection on the state of the American science establishment that this sort of obviously bad work (the original ages-ending-in-9 paper, that is) was published by the National Academy of Sciences.

P.S. The title of this post is a nod to Jeremy Freese’s description of certain claims as “more vampirical than empirical: unable to be killed by mere evidence . . . the hypothesis seems so logically compelling that it becomes easy to presume that it must be true, and to presume that the natural science literature on the hypothesis is an unproblematic avalanche of supporting findings.” Also the idea that any effect could go in either direction and support the story. Suicide rates go up? That’s a sign of the despair of impending mortality. Suicide rates go down? No problem, it’s a sign that people are valuing what they have. And so on.

Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert)

They say I did something bad, then why’s it feel so good–Taylor Swift

It’s a Sunday afternoon and I’m trying to work myself up to the sort of emotional fortitude where I can survive the Diamanda Galás concert that I was super excited about a few months ago, but now, as I stare down the barrel of a Greek woman vocalizing at me for 2 hours somewhere in East Toronto, I am starting to feel the fear.

Rather than anticipating the horror of being an emotional wreck on public transportation at about 10pm tonight, I’m thinking about Bayesian asymptotics. (Things that will make me cry on public transport: Baby Dee, Le Gateau Chocolate, Sinead O’Connor, The Mountain Goats. Things that will not make me cry on any type of transportation: Bayesian Asymptotics.)

So why am I thinking about Bayesian asymptotics? Well because somebody pointed out a thread on the Twitter (which is now apparently a place where people have long technical discussions about statistics, rather than a place where we can learn Bette Midler’s views on Jean Genet or reminisce about that time Cher was really into horses) that says a very bad thing:

The Bernstein-von Mises theorem kills any criticism against non-informative priors (for the models commonly used). Priors only matter if one wishes to combine one’s confirmation bias with small studies. Time to move to more interesting stuff(predictive inference)

I’ve written in other places about how Bayesian models do well at prediction (and Andrew and Aki have written even more on it), so I’m leaving the last sentence alone. Similarly the criticisms in the second last sentence are mainly rendered irrelevant if we focus on weakly informative priors. So let’s talk about the first sentence.

Look what you made me do

The Bernstein-von Mises theorem, like Right Said Fred, is both a wonder and a horror that refuses to stay confined to a bygone era. So what is it?

The Bernstein-von Mises theorem (or BvM when I’m feeling lazy) says the following:

Under some conditions, a posterior distribution converges as you get more and more data to a multivariate normal distribution centred at the maximum likelihood estimator with covariance matrix given by n^{-1} I(\theta_0)^{-1}, where \theta_0 is the true population parameter (Edit: here I(\theta_0) is the Fisher information matrix at the true population parameter value).

A shorter version of this is that (under some conditions) a posterior distribution looks asymptotically like the sampling distribution of a maximum likelihood estimator.
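For a feel of what this looks like in the friendliest possible case, here is a minimal Python sketch with a one-parameter binomial model and conjugate Beta priors, so the posterior is available in closed form. Everything in it is an illustrative assumption of mine: the two priors (a flat Beta(1, 1) and a badly centred Beta(20, 5)), the observed success rate of 0.3, and the use of the MLE as a plug-in for \theta_0 in the Fisher information, which for a Bernoulli observation is 1 / (\theta (1 - \theta)).

```python
import math


def beta_posterior_summary(prior_a, prior_b, successes, trials):
    """Mean and sd of the Beta posterior for a binomial success probability."""
    a = prior_a + successes
    b = prior_b + trials - successes
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


def bvm_normal_summary(successes, trials):
    """Mean and sd of the normal approximation centred at the MLE.

    The variance is n^{-1} I(theta)^{-1} with the MLE plugged in for theta.
    """
    mle = successes / trials
    return mle, math.sqrt(mle * (1.0 - mle) / trials)


def prior_gap(trials, rate=0.3):
    """Distance between posterior means under the flat and the strong prior."""
    successes = round(rate * trials)
    flat_mean, _ = beta_posterior_summary(1, 1, successes, trials)
    strong_mean, _ = beta_posterior_summary(20, 5, successes, trials)
    return abs(strong_mean - flat_mean)


for n in (10, 100, 10_000):
    k = round(0.3 * n)
    post_mean, post_sd = beta_posterior_summary(1, 1, k, n)
    approx_mean, approx_sd = bvm_normal_summary(k, n)
    print(f"n = {n:6d}: posterior ({post_mean:.4f}, {post_sd:.4f}), "
          f"normal approx ({approx_mean:.4f}, {approx_sd:.4f}), "
          f"gap between priors {prior_gap(n):.4f}")
```

With ten observations the two priors give very different answers; with ten thousand they nearly coincide and both posteriors sit on top of the normal approximation. This is exactly the sort of well-behaved model where the theorem is easy to believe.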

Or we can do the wikipedia version (which lacks in both clarity and precision. A rare feat.):

[T]he posterior distribution for unknown quantities in any problem is effectively independent of the prior distribution (assuming it obeys Cromwell’s rule) once the amount of information supplied by a sample of data is large enough.

Like a lot of theorems that are imprecisely stated, BvM is both almost always true and absolutely never true.  So in order to do anything useful, we need to actually think about the assumptions. They are written, in great and loving detail, in Section 2.25 of  these lecture notes from Richard Nickl. I have neither the time nor energy to write these all out but here are some important assumptions:

  1. The maximum likelihood estimator is consistent.
  2. The model has a fixed, finite number of parameters.
  3. The true parameter value lies on the interior of the parameter space (ie if you’re estimating a standard deviation, the true value can’t be zero).
  4. The prior density must be non-zero in a neighbourhood of \theta_0.
  5. The log-likelihood needs to be smooth (two derivatives at the true value and some other stuff)

The first condition rules out any problem where you’d want to use a penalized maximum likelihood estimator. (Edit: Well this was awkwardly stated. You need the MLE to be unbiased [Edit: Consistent! Not unbiased. Thanks A Reader] and there to be a uniformly consistent estimator, so I’m skeptical these things hold in the situation where you would use penalized MLE.) The third one makes estimating variance components difficult. The fifth condition may not be satisfied after you integrate out nuisance parameters as this can lead to spikes in the likelihood.

I guess the key thing to remember is that this “thing that’s always true” is, like everything else in statistics, a highly technical statement that can be wrong when the conditions under which it is true are not satisfied.

Call it what you want

You do it ’cause you can. Because when I was younger, I couldn’t sustain those phrases as long as I could. Now I can just go on and on. If you keep your stamina and you learn how to sing right, you should get better rather than worse. – Diamanda Galás in Rolling Stone 

Andrew has pointed out many times that the problem with scientists misapplying statistics is not that they haven’t been listening, it’s that they have listened too well. It is not hard to find statisticians (Bayesian or not) who will espouse a similar sentiment to the first sentence of that ill-begotten tweet. And that’s a problem.

When someone says to me “Bernstein-von Mises implies that the prior only has a higher-order effect on the posterior”, I know what they mean (or, I guess, what they should mean). I know that they’re talking about a regular model, a lot of information, and a true parameter that isn’t on the boundary of the parameter space. I know that declaring something a higher-order effect is a loaded statement because the “pre-asymptotic” regime can be large for complex models.

Or, to put it differently, when someone says that, I know they are not saying that priors aren’t important in Bayesian inference. But it can be hard to know if they know this. And to be fair, if you make a super-simple model that can be used in the type of situation where you could read the entrails of a recently gutted chicken and still get an efficient, asymptotically normal estimator, then the prior is not a big deal unless you get it startlingly wrong.

No matter how much applied scientists may want to just keep on gutting those chickens, there aren’t that many chicken gutting problems around. (Let the record state that a “chicken gutting” problem is one where you only have a couple of parameters to control your system, and you have accurate, random, iid samples from your population of interest. NHSTs are pretty good at gutting chickens.) And the moment that data gets “big” (or gets called that to get the attention of a granting agency), all the chickens have left the building in some sort of “chicken rapture” leaving behind only tiny pairs of chicken shoes.

Big reputation, big reputation. Ooh, you and me would be a big conversation.

I guess what I’m describing is a communication problem. We spend a lot of time communicating the chicken gutting case to our students, applied researchers, applied scientists, and the public rather than properly preparing them for the poultry armageddon that is data analysis in 2017. They have nothing but a tiny knife as they attempt to wrestle truth from the menagerie of rhinoceroses, pumas, and semi-mythical megafauna that are all that remain in this chicken-free wasteland we call a home.

(I acknowledge that this metaphor has gotten away from me.)

The mathematical end of statistics is a highly technical discipline. That’s not so unusual–lots of disciplines are highly technical. What is unusual about statistics as a discipline is that the highly technical parts of the field mix with the deeply applied parts. Without either of these ingredients, statistics wouldn’t be an effective discipline.  The problem is, as it always is, that people at different ends of the discipline often aren’t very good at talking to each other.

Many people who work as excellent statisticians do not have a mathematics background and do not follow the nuances of the technical language. And many people who work as excellent statisticians do not do any applied work and do not understand that the nuances of their work are lost on the broader audience.

My background is a bit weird. I wasn’t trained as a statistician, so a lot of the probabilistic arguments in the theoretical stats literature feel unnatural to me. So I know that when people like Judith Rousseau or Natalia Bochkina or Ismael Castillo or Aad van der Vaart or any of the slew of people who understand Bayesian asymptotics more deeply than I can ever hope to, speak or write, I need to pay a lot of attention to the specifics. I will never understand their work on the first pass, and may never understand it deeply no matter how much effort I put in.

The only reason that I now know more than nothing about Bayesian asymptotics is that I hit a point where I no longer had the luxury to not know. So now I know enough to at least know what I don’t know.

Replication is not just Taylor Swift’s new album

The main thing that I want to make clear about the Bernstein-von Mises theorem is that it is hard to apply it in practice. This is for the exact same reason that the asymptotic arguments behind NHSTs rarely apply in practice.

Just because you have a lot of data, doesn’t mean you have independent replicates of the same experiment.

In particular, issues around selection, experimental design, forking paths, etc all are relevant to applying statistical asymptotics. Asymptotic statements are about what would happen if you gather more data and analyze it, and therefore they are statements about the entire procedure of doing inference on a new data set. So you can’t just declare that BvM holds. The existence of a Bernstein-von Mises theorem for your analysis is a statement about how you have conducted your entire analysis.

Let’s start with the obvious candidate for breaking BvM: big data. Large, modern data sets are typically observational (that is, they are not designed and the mechanism for including the data may be correlated with the inferential aim of the study). For observational data, it is unlikely that the posterior mean (for example) will be a consistent estimator of the population parameter, which precludes a BvM theorem from applying.

Lesson: Consistency is a necessary condition for a BvM to hold, and it is unlikely to hold for undesigned data.
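A toy illustration of the consistency problem, as a minimal Python sketch. The setup is entirely made up: the population is standard normal with mean zero, and a unit ends up in the data set with probability inv_logit(x), so larger values are more likely to be recorded. The function name selected_sample_mean and the seed are mine.

```python
import math
import random


def selected_sample_mean(n_population, seed=0):
    """Mean of the units that make it into an undesigned data set.

    Each unit x ~ normal(0, 1) is included with probability inv_logit(x),
    so inclusion is correlated with the quantity being estimated.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_population):
        x = rng.gauss(0.0, 1.0)
        p_include = 1.0 / (1.0 + math.exp(-x))
        if rng.random() < p_include:
            total += x
            count += 1
    return total / count, count


for n in (1_000, 20_000, 200_000):
    mean, count = selected_sample_mean(n)
    print(f"population {n:7d}, recorded {count:7d}, sample mean {mean:.3f}")
```

The true population mean is zero, but the sample mean settles down at a clearly positive value as the data set grows. More data makes the estimate more precise, not less wrong.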

Now onto the next victim: concept drift.  Let’s imagine that we can somehow guarantee that the sampling mechanism we are using to collect our big data set will give us a sample that is representative of the population as a whole.  Now we have to deal with the fact that it takes a lot of time to collect a lot of data. Over this time, the process you’re measuring can change.  Unless your model is able to specifically model the mechanism for this change, you are unlikely to be in a situation where BvM holds.

Lesson: Big data sets are not always instantaneous snapshots of a static process, and this can kill off BvM.

For all the final girls: Big data sets are often built by merging many smaller data sets. One particular example springs to mind here: global health data. This is gathered from individual countries, each of which has its own data gathering protocols. To some extent, you can get around this by carefully including design information in your inference, but if you’ve come to the data after it has been collated, you may not know enough to do this. Once again, this can lead to biased inferences for which the Bernstein-von Mises theorem will not hold.

Lesson: The assumptions of the Bernstein-von Mises theorem are fragile and it’s very easy for a dataset or analysis to violate them. It is very hard to tell, without outside information, that this has not happened.

’Cause I know that it’s delicate

(Written after seeing Diamanda Galás, who was incredible, or possibly unbelievable, and definitely unreal. Sitting with a thousand or so people in total silence listening to the world end is a hell of a way to wrap up a weekend.)

Much like in life, in statistics things typically only ever get worse when they get more complicated.  In the land of the Bernstein-von Mises theorem, this manifests in the guise of a dependence on the complexity of  the model.  Typically, if there are p parameters in a model and we observe n independent data points (and all the assumptions of the BvM are satisfied), then the distance from the posterior to a Normal distribution is \mathcal{O}\left(\sqrt{pn^{-1}}\right).  That is, it takes longer to converge to a normal distribution when you have more parameters to estimate.
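As a back-of-the-envelope reading of that rate, here is a minimal Python sketch. It takes the \sqrt{p/n} bound at face value with the hidden constant set to one, which is an assumption of mine (a big-O statement says nothing about the constant), and the helper names rate and n_needed are hypothetical.

```python
import math


def rate(p, n):
    """The sqrt(p / n) rate, with the unknown constant taken to be one."""
    return math.sqrt(p / n)


def n_needed(p, tol):
    """Smallest n with sqrt(p / n) <= tol, under the same assumption."""
    return math.ceil(p / tol ** 2)


for p in (2, 10, 100, 1_000):
    print(f"p = {p:5d}: rate at n = 10,000 is {rate(p, 10_000):.3f}, "
          f"n needed for a rate of 0.1 is {n_needed(p, 0.1):,}")
```

The point is only the scaling: holding the target fixed, the amount of data you need grows in proportion to the number of parameters.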

Do you have the fear yet? As readers of this blog, you might be seeing the problem already. With multilevel models, you will frequently have at least as many parameters as you have observations.  Of course, the number of effective parameters is usually much much smaller due to partial pooling. Exactly how much smaller depends on how much pooling takes place, which depends on the data set that is observed.  So you can see the problem.

Once again, it’s those pesky assumptions (like the Scooby gang, they’re not going away no matter how much latex you wear). In particular, the fundamental assumption is that you have replicates which, in a multilevel model with as many parameters as data, essentially means that you can pool more and more as you observe more and more categories. Or that you keep the number of categories fixed and you see more and more data in each category (and eventually see an infinite amount of data in each category).

All this means that the speed at which you hit the asymptotic regime (ie how much data you need before you can just pretend your posterior is Gaussian) will be a complicated function of your data. If you are using a multilevel model and the data does not support very much pooling, then you will reach infinity very very slowly.

This is why we can’t have nice things

Rolling Stone: Taylor Swift?

Diamanda Galás: [Gagging noises]

The ultimate death knell for simple statements about the Bernstein-von Mises theorem is the case where the model has an infinite dimensional parameter (aka a non-parametric effect).  For example, if one of your parameters is an unknown function.

A common example relevant in a health context would be if you’re fitting survival data using a Cox Proportional Hazards model, where the baseline hazard function is typically modelled by a non-parametric effect.  In this case, you don’t actually care about the baseline hazard (it’s a nuisance parameter), but you still have to model it because you’re being Bayesian. In the literature, this type of model is called a “semi-parametric model” as you care about the parametric part, but you still have to account for a non-parametric term.

To summarize a very long story that’s not been completely mapped out yet, BvM does not hold in general for models with an infinite dimensional parameter. But it does hold in some specific cases. And maybe these cases are common, although honestly it’s hard for me to tell. This is because working out exactly when BvM holds for these sorts of models involves poring over some really tough theory papers, which typically only give explicit results for toy problems where some difficult calculations are possible.

And then there are Bayesian models of sparsity, trans-dimensional models (eg models where the number of parameters isn’t fixed) etc etc etc.

But I’ll be cleaning up bottles with you on New Year’s Day

So to summarise, a thing that a person said on Twitter wasn’t very well thought out. Bernstein-von Mises fails in a whole variety of ways for a whole variety of interesting, difficult models that are extremely relevant for applied data analysis.

But I feel like I’ve been a bit down on asymptotic theory in this post. And just like Diamanda Galás ended her show with a spirited rendition of Johnny Paycheck’s (Pardon Me) I’ve Got Someone to Kill, I want to end this on a lighter note.

Almost all of my papers (or at least the ones that have any theory in them at all) have quite a lot of asymptotic results, asymptotic reasoning, and applications of other people’s asymptotic theory. So I strongly believe that asymptotics are a vital part of our toolbox as statisticians. Why? Because non-asymptotic theory is just too bloody hard.

In the end, we only have two tools at hand to understand and criticize modern statistical models: computation (a lot of which also relies on asymptotic theory) and asymptotic theory. We are trying to build a rocket ship with an earthmover and a rusty spoon, so we need to use them very well. We need to make sure we communicate our two tools better to the end-users of statistical methods. Because most people do not have the time, the training, or the interest to go back to the source material and understand everything themselves.

I know less about this topic than I do about Freud.

Someone who I don’t know writes:

Hi Andrew,

I hope this email finds you well.

Hey, that’s interesting: I’m on a first-name basis with this person who cares about my health, but I have no idea who he is. Or if he’s a bot. I guess a bot could care about my health too, inasmuch as a bot can care about anything.

My mysterious correspondent continues:

I just wanted to let you know that I will be writing a new post soon for the ** blog that will be centered on the Quantitative Easing debate. This topic seems to be a popular one of late and I thought I’d see if I could get your views on the topic to include in the piece.

Quantitative easing is no panacea. The Federal Reserve hopes it will never again have to resort to the unprecedented monetary stimulus efforts it took following the great financial crisis under a QE program that ended in late 2014. It seems that the ECB is also looking for an out.

Again, it would be great to get your views on the topic. More specifically:

1. How do you think the Fed will unwind its multi-trillion dollar balance sheet resulting from its stimulus program without severely upsetting the bond and equity markets?

2. And how will the ECB, which is still stuck in a quantitative easing cycle, be able to bring it to an end without plunging Eurozone countries into yet another financial crisis?

Any comments on the subject, even those not answering the questions above, would be highly appreciated.

Thank you very much in advance and I hope to hear back from you soon.

Best regards,

Hey, I think they want . . . free content! On a topic I know nothing about. I mean, absolutely nothing. Sure, I could google *quantitative easing* to find out what the hell this guy is talking about—but, then again, he could google it directly himself.

Why didn’t they ask me about Freud? I’m a Freud expert!

“Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”

Mike Sances writes:

I thought you would be interested in these awful graphs I found in the paper today.

Sample attached [see above], but the article is full of them.

My reply: This is indeed horrible in so many ways. I hope nobody was looking at that graph on their phone while driving!

At the very least, they could go for the click-through solution.

Poisoning the well with a within-person design? What’s the risk?

I was thinking more about our recommendation that psychology researchers routinely use within-person rather than between-person designs.

The quick story is that a within-person design is more statistically efficient because, when you compare measurements within a person, you should get less variation than when you compare different groups. But researchers often use between-person designs out of a concern with “poisoning the well”: the worry that, if you apply treatments A and B to someone, the effects of A might persist until the second measurement period, or the two treatments can interact.
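To make the efficiency argument concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: the effect size, the person-level and measurement-level standard deviations, and especially the `carryover` parameter, which is a crude stand-in for "poisoning the well" (it shrinks the effect seen in the within-person design by a fixed fraction). The function name `compare_designs` is made up for this example.

```python
import math
import random
import statistics


def compare_designs(n_people=50, effect=0.2, carryover=0.5,
                    sd_person=1.0, sd_noise=0.3, n_sims=2000, seed=1):
    """Return (rmse_between, rmse_within) for estimating `effect`.

    Between-person: two separate groups of n_people, one measurement each.
    Within-person: n_people measured under both A and B; the observed effect
    is shrunk by `carryover` (a hypothetical well-poisoning mechanism).
    """
    rng = random.Random(seed)
    sq_err_between, sq_err_within = [], []
    for _ in range(n_sims):
        # Between-person design: person-level variation does not cancel.
        group_a = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
                   for _ in range(n_people)]
        group_b = [effect + rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
                   for _ in range(n_people)]
        est_between = statistics.fmean(group_b) - statistics.fmean(group_a)
        sq_err_between.append((est_between - effect) ** 2)

        # Within-person design: person-level variation cancels in the
        # difference, but the estimate is biased toward zero by carryover.
        diffs = []
        for _ in range(n_people):
            person = rng.gauss(0, sd_person)
            y_a = person + rng.gauss(0, sd_noise)
            y_b = person + effect * (1 - carryover) + rng.gauss(0, sd_noise)
            diffs.append(y_b - y_a)
        est_within = statistics.fmean(diffs)
        sq_err_within.append((est_within - effect) ** 2)

    rmse_between = math.sqrt(statistics.fmean(sq_err_between))
    rmse_within = math.sqrt(statistics.fmean(sq_err_within))
    return rmse_between, rmse_within


rmse_between, rmse_within = compare_designs()
print(f"between-person RMSE: {rmse_between:.3f}")
print(f"within-person RMSE:  {rmse_within:.3f}")
```

Under these made-up settings the within-person estimate is biased but still has the smaller root mean squared error, even though the between-person design gets twice as many people. Change the variance ratio or the carryover and the comparison can shift; the point is only that the trade-off can be computed rather than assumed.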

I think there’s a common view among researchers that, even if the within-person design might be more efficient, the between-person design is safer in that it gives an unbiased estimate. And it’s considered a better scientific decision to choose the safer option.

I have a few things to say about this attitude, in which people want to use the safe, conservative statistical analysis.

1. As John Carlin and I explain, if you restrict yourself to summarizing with statistically significant comparisons (as is standard practice), your estimates are not at all unbiased. Type M error can be huge.

2. When uncontrolled variation is high, type S errors can also be huge: in short, if you have a noisy study, you’re likely to make substantively wrong conclusions.

3. Finally, just on its own terms—even if you accept the (false) belief that the noisy, between-person design is “safer”—even then, so what? Scientific research is not supposed to be safe. Power pose, ovulation and voting, embodied cognition, etc.: These are not “safe” ideas. They are controversial, risky ideas—they’re surprising, and that’s one reason they hit the headlines. We’re talking about researchers who in general don’t consider the safe path as a virtue: they want to make new, surprising discoveries.
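Points 1 and 2 can be checked with a few lines of simulation. This is a sketch under assumed numbers (a true effect of 0.1 and a standard error of 0.5, i.e. a very noisy study), not a reproduction of any calculation from the paper with John Carlin; the function name `type_s_and_m` is hypothetical.

```python
import random
import statistics


def type_s_and_m(true_effect=0.1, se=0.5, n_sims=20000, seed=2):
    """Simulate estimates ~ Normal(true_effect, se), keep only the ones that
    are statistically significant at the 5% level, and summarize them.

    Returns (type_s_rate, exaggeration_ratio):
      type_s_rate        = share of significant estimates with the wrong sign
      exaggeration_ratio = mean |significant estimate| / |true_effect|
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > 1.96 * se]

    wrong_sign = [e for e in significant if e * true_effect < 0]
    type_s_rate = len(wrong_sign) / len(significant)

    mean_abs = statistics.fmean([abs(e) for e in significant])
    exaggeration_ratio = mean_abs / abs(true_effect)
    return type_s_rate, exaggeration_ratio


type_s_rate, exaggeration_ratio = type_s_and_m()
print(f"Type S rate among significant results: {type_s_rate:.2f}")
print(f"Type M exaggeration ratio:             {exaggeration_ratio:.1f}")
```

With these assumed inputs, any estimate that clears the significance threshold has to be roughly ten times the true effect or more, and a substantial share of the significant estimates point in the wrong direction.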

Putting this all together, I thought it could be useful to frame questions of experimental design and analysis in terms of risks and benefits.

In a typical psychology experiment, the risks and benefits are indirect. No patients’ lives are in jeopardy, nor will any be saved. There could be benefits in the form of improved educational methods, or better psychotherapies, or simply a better understanding of science. On the other side, the risk is that people’s time could be wasted with spurious theories or ineffective treatments. Useless interventions could be costly in themselves and could do further harm by crowding out more effective treatments that might otherwise have been tried.

The point is that “bias” per se is not the risk. The risks and benefits come later on when someone tries to do something with the published results, such as to change national policy on child nutrition based on claims that are quite possibly spurious.

Now let’s apply these ideas to the between/within question. I’ll take one example, the notorious ovulation-and-voting study, which had a between-person design: a bunch of women were asked about their vote preference, the dates of their cycle, and some other questions, and then women in a certain phase of their cycle were compared to women in other phases. Instead, I think this should’ve been studied (if at all) using a within-person design: survey these women multiple times at different times of the month, each time asking a bunch of questions including vote intention. Under the within-person design, there’d be some concern that some respondents would be motivated to keep their answers consistent, but in what sense does that constitute a risk? What would happen is that changes would be underestimated, but when this propagates down to inferences about day-of-cycle effects, I’m pretty sure this is a small problem compared to all the variation that tangles up the between-person design. One could do a more formal version of this analysis; the point is that such comparisons can be done.

Using output from a fitted machine learning algorithm as a predictor in a statistical model

Fred Gruber writes:

I attended your talk at Harvard where, regarding the question on how to deal with complex models (trees, neural networks, etc) you mentioned the idea of taking the output of these models and fitting a multilevel regression model. Is there a paper you could refer me to where I can read about this idea in more detail? At work I deal with ensembles of Bayesian networks in a high dimensional setting and I’m always looking for ways to improve the understanding of the final models.

I replied that I know of no papers on this; it would be a good thing for someone to write up. In the two examples I was thinking of (from two different fields), machine learning models were used to predict a binary outcome; they gave predictions on 0-1 scale. We took the logits of these predictions to get continuous scores; call these “z”, then we ran logistic regressions on the data, using, as predictors, z and some other things. For example,
Pr(y_i = 1) = invlogit(a_j[i] + b*z_i) [that’s a varying-intercept model]
Pr(y_i = 1) = invlogit(a_j[i] + b_j[i]*z_i) [varying intercepts and slopes]
Pr(y_i = 1) = invlogit(a_j[i] + b_j[i]*z_i + X*gamma) [adding some new predictors]
You’d expect the coefficients b to be close to 1 in this model, but adding the varying intercepts/slopes and other structures can help pick up patterns that were missed in the machine learning model, and can be helpful in expanding the predictions, generalizing to new settings.
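Here is a minimal sketch of the first step of that idea, on simulated data. It is deliberately simpler than the models written above: it fits a single intercept a and slope b (no varying intercepts or slopes, no extra predictors X), using a hand-written Newton-Raphson loop so that it runs with the standard library alone. A real multilevel fit would need a proper tool such as Stan. The "machine learning predictions" here are just simulated, calibrated probabilities, and the names `fit_logistic` and `p_ml` are made up for the example.

```python
import math
import random


def logit(p):
    return math.log(p / (1 - p))


def invlogit(x):
    return 1 / (1 + math.exp(-x))


def fit_logistic(z, y, n_iter=25):
    """Fit Pr(y_i = 1) = invlogit(a + b * z_i) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for z_i, y_i in zip(z, y):
            p = invlogit(a + b * z_i)
            w = p * (1 - p)
            # Gradient of the log-likelihood
            g_a += y_i - p
            g_b += (y_i - p) * z_i
            # Negative Hessian (Fisher information)
            h_aa += w
            h_ab += w * z_i
            h_bb += w * z_i * z_i
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: (a, b) += H^{-1} g, with the 2x2 inverse written out
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Simulated stand-in for a fitted ML model: predictions on the 0-1 scale
# that happen to be well calibrated (an assumption of this example).
rng = random.Random(4)
p_ml = [rng.uniform(0.05, 0.95) for _ in range(5000)]
y = [1 if rng.random() < p else 0 for p in p_ml]

z = [logit(p) for p in p_ml]  # continuous score "z"
a_hat, b_hat = fit_logistic(z, y)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

Because the simulated predictions are calibrated, the fitted slope should land near 1 and the intercept near 0. With real predictions, departures from that, or structure in group-level intercepts and slopes, are where the extra modeling would earn its keep.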

Gruber followed up:

It is an interesting approach. My initial thought was different. I have seen some approaches to bring some interpretability to complex models by learning the prediction of the complex model as in

Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. “Model Compression.” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 535–41. ACM, 2006.

Ba, Lei Jimmy, and Rich Caruana. “Do Deep Nets Really Need to Be Deep?” CoRR abs/1312.6184 (2013).

And more recently
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” 1135–44. ACM Press, 2016. doi:10.1145/2939672.2939778.

That’s all fine, it’s good to understand a model. I was thinking of a different question, which was taking predictions from a model and trying to do more with them by taking advantage of other information that had not been used in the original fit.

Stan is a probabilistic programming language

See here: Stan: A Probabilistic Programming Language. Journal of Statistical Software. (Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell)

And here: Stan is Turing Complete. So what? (Bob Carpenter)

And, the pre-stan version: Fully Bayesian computing. (Jouni Kerman and Andrew Gelman)

Apparently there was some question about whether Stan is a “probabilistic programming language,” so I want to make it clear that it is. In the comment thread, we should be able to resolve any questions on this.

Teeth are the only bones that show

“I lived in the country where the dead wood aches, in a house made of stone and a thousand mistakes” - The Drones

Sometimes it’s cold and grey and Canadian outside and the procrastination hits hard. Sometimes, in those dark moments, one is tempted to fire up the social media and see what’s happening in other places where it’s probably also cold and grey, but possibly not Canadian. As if change were possible.

And sometimes you see something on the social media that annoys you.

In this case, it was a Pink News article titled “Men with muscles and money are more attractive to gay men, new study finds”.

Now, I curate my social media fairly heavily: I did not get into this liberal bubble by accident and I have no interest in an uncurated life. So I didn’t see this from some sort of Pink News feed, but rather from some stranger saying (and I paraphrase) “this is crap”.

A quick sidebar

Really? A study shows men with muscles and money are more attractive to gay men? Really?

For a community of people who are nominally dedicated to performing masculinity in all its varied, rainbow forms, the gay media is pretty much exclusively devoted to good looking, often straight, rich, often white men who are so dehydrated they have fabulous abs.

To paraphrase Ms Justin Elizabeth Sayer, “I know you find it hard to believe, with all the things you could be as a gay person, boring would still be an option. But it is.”.

We don’t need new research for this, we could just look at the Pink News website. (Please don’t. It’s not worth the pain.)

A journey in three parts: Pink News

The best thing that I can say about the Pink News article is that it didn’t regurgitate the press release.  It instead used it as a base to improvise a gay spin around. Pink News: the jazz of news.

It’s immediately clear from this article that the findings are based around the content of only one (creepy) website called TubeCrush, where you can upload photos taken of unsuspecting men on public transport. So a really classy establishment. (I mean, there are not enough words to get into just how ugly this is as a concept, so let’s just take that as read.)

The idea was that because the photos on this site focus on ripped white men who often have visual signifiers of wealth, gay men are attracted to ripped, rich, white men.

Of course, this excited me immediately. The researchers are making broad-ranging conclusions from a single, niche website that fetishizes a particular lack of consent.

Why is this exciting? Well I’m interested in the way that the quantitative training that scientists (both of the social and antisocial bent) have received doesn’t appear to have made the point that an inference can only ever be as sharp as the data used to make it.  Problems come when you, to paraphrase the Mother Superior in Sister Act 2: Back in the Habit, try to make a silk purse out of a sow’s ear.

That terrible gayface study from a few months ago was a great example of how quantitative social science can fall apart due to biases in the data collection, and this looked like a great example of how qualitative social science can do the same thing.

A journey in three parts: The press release

Coventry University, for reasons best known only to themselves, decided that the revelation that pervy website TubeCrush had a tendency to feature ripped, rich white men was worth a press release.  Here it is.

The key sentence, which Pink News ran with, was

The researchers say their study of the entries on the site quickly revealed that both straight women’s and gay men’s desired particular types of men.

It goes on to talk about how, despite London being multicultural, most of the photos were of white men. Most of the men were good looking and muscled. Most of the commentary on TubeCrush mentioned the muscles, as well as the expensive suits, watches, and phones the men were wearing/using.

Why is all this important? Well the press release tells us:

The academics said that public transport – in this case the Tube – has now become the space where gender politics is decided.

How fascinating. Not in schools, or universities. Not in families or friendship circles. Not at work or in social spaces. On. The. Tube.

The press release ends with a few quotes from the lead researcher Adrienne Evans, which is a useful thing for journalists who need to pretend they’ve talked with the authors. I’m just going to cherry pick the last one, but you should get the point (or follow the link above):

“It’s a problem as because although it appears as though we have moved forward, our desires are still mostly about money and strength.”

That is a heavy extrapolation of the available data.

A journey in three parts: The paper

Finally, I decided to read the paper. If you’re going to do a deep dive methodological criticism instead of doing the pile of things that you are actually supposed to be spending the afternoon on, then you should probably read the actual paper.

So here is the paper, entitled “He’s a total TubeCrush”: post-feminist sensibility as intimate publics by Adrienne Evans and Sarah Riley, published in Feminist Media Studies.

Now I have no expert knowledge of feminist studies (media or otherwise).  I’ve done the basic readings, but I am essentially ignorant. But I’m going to attempt to parse the title anyway.

An intimate public appears to be the idea that you can leverage a private, shared community identity into a market. The example in the paper is “chick lit”, which takes shared, performed aspects of femininity and basically markets products towards them. Arguably another example would be Pink News and the gay media industry. The key feature is that “Intimate publics do not just orient desire toward traditional gender roles, but evoke a nostalgic stance toward them, so that one element of the intimate public is feelings of nostalgia.”. So an intimate public in this context will always throw back to some supposed halcyon days of peak masculinity and celebrate this by marketing it, as TubeCrush does. So far so good.

The post-feminist sensibility is basically the idea that post-feminist media simultaneously sell the idea of equality while also espousing the idea that you can achieve equality through consumption.  (I mean, they use a lot more words than that, so I’m probably missing nuance, but that seems to be the thrust).

So the main argument (but less than half the paper) is devoted to the idea that TubeCrush is an example of post-feminist intimate publics. (a post-feminist intimate publics? The grammar of academic feminism is confusing)

All well and good. (And probably quite interesting if you’re into that sort of thing).

Also, as far as I’m concerned, completely kosher.  They are studying an object (TubeCrush), they are categorizing it, and they are unpicking the consequences of that categorization. That is essentially what academics do. We write paragraphs like:

TubeCrush makes masculinity a bodily property in much the same way as femininity is within post-feminist sensibility (Gill 2007). The aesthetic idealization of strength in the posts can be tied both to the heightened visibility of masculinity more generally, and to its location within an “attraction to-” culture. The intersection of heterosexual women’s and gay men’s desire arguably heightens the emphasis on strength as a key component of post-feminist masculinity, where gay male culture holds up heterosexual male strength as part of its own visual landscape (Duane Duncan 2007, 2010). In a culture that only very recently was not visible at all, effeminacy is disparaged and what is celebrated are “visible public identities that [have] more in common with traditional images of masculinity” (Duncan 2007, 334). In this way, representations of masculinity on TubeCrush demonstrate the maintenance of hegemonic masculinity, tied into notions of strength and phallic power.

Ah but there it is. They extrapolated from limited available data (a deep read of TubeCrush) to population dynamics.

On the maintenance of hegemonic data practices in social sciences


And this is the crux. Moving from this to the press release to that awful Pink News article is a straight ride into hell on a sleigh made of good intentions.

Statements like “TubeCrush is a reestablishment of traditional gender roles within the context of post-feminism” are perfectly reasonable outcomes from this study. They take the data and summarize it in a way that sheds light on the underlying process.  But to move the conclusions beyond the seedy TubeCrush universe is problematic.

And the researchers are aware of this (a smarter blogger could probably make links here with the post-feminist sensibility in that they also generalize with their data while admitting that you need to buy the generalization through more data collection. But the analogy doesn’t completely hold up.)

While there was no rigid research design prior to funding, we believe this extended engagement with a website (whose materials only go back as far as 2011) has provided an in-depth understanding of the patterns and content of TubeCrush.

The first challenge when generalizing these results off the TubeCrush platform is that the study is not designed. I would term this a “pilot study”. I guess you can write press releases about pilot studies, but you probably shouldn’t write press releases that don’t mention the limitations of the data.

My major criticism of this article is that it treats TubeCrush, the photographs and the comments as an entity that just exists outside of any context. This is a website that is put together by a person (or a team of people) who select photographs and caption photographs. So these photographs and comments will always be a weak instrument for understanding society.  They are not irrelevant: the website is apparently quite popular. But the data has been filtered through a very limited sensibility. It reflects both the ideals of attractiveness and the sense of humour (or “humour”, I’ve not actually visited the site because I don’t need that in my life) of the curator(s).

This curation makes TubeCrush a weak tool for understanding society, and all inferences built using only this data will necessarily be weak.

And I guess this is my broad point. Quantitative thinking  around the data gathering, the design of the experiment, and the limitations of the measurements can inform the reliability of a qualitative study.

Is this paper bad? No. Was the article? Yes. Was the press release? Definitely.

But what sort of cold-hearted person doesn’t love a paper with the sentence

The wordplay on the financial language of the double-dip recession to again signify performing oral sex on financially secure (if not wealthy) masculinities demonstrates the juxtapolitics at the heart of TubeCrush.

Wine + Stan + Climate change = ?

Pablo Almaraz writes:

Recently, I published a paper in the journal Climate Research in which I used RStan to conduct the statistical analyses: Almaraz P (2015) Bordeaux wine quality and climate fluctuations during the last century: changing temperatures and changing industry. Clim Res 64:187-199.

We start by talking reproducible research, then we drift to a discussion of voter turnout

Emil Kirkegaard writes:

Regarding data sharing, you recently commented that “In future perhaps journals will require all data to be posted as a condition of publication and then this sort of thing won’t happen anymore.”

We went a step further. We require public data sharing at submission. This means that from the moment one submits, the data must be public. Two reasons for this setup. First, reviewers may need the data (+code) to review the paper. Some reviewers replicate all analyses in papers they review (i.e. me, hoping to start a trend) which frequently results in mistakes being found in the review. Second, if the data are first shared upon publication, this means that while the submission is in review, they are locked away. This results in a substantial slow-down of science because review times can be so long. A great example of this problem is GWASs which can take >1 year in review, while the manuscripts (usually without data) can be acquired thru networks if one knows the right person.

In your post, you note that you had to dig up the data from the hard drive for the student who requested it (good idea with that study, he should use lower-level administrative divisions too; alas these have lower voter turnout, the opposite of rational voter theory!). Given the fact that hard drives crash, computers get replaced, and humans are bad at backing up, this is a very error-prone method for storing scientific materials in perpetuity. Would it not be better for you to go thru your old publications and make a project for each of them on OSF, and put all the materials there?

My reply: Yes, I agree with you on the replication thing. But I think you’re wrong regarding rational choice theory and turnout as a function of jurisdiction size; see section 3.3 of this article.

Kirkegaard responds:

I can’t say I’m an expert on RCT or turnout or that I’m interested enough in RCT for turnout to spend a lot of time understanding the math in that paper. A sort of meta-comeback.

However, I did read Section 3 and onwards. EU elections, by the way, have lower turnout than the national ones in EU, and the sub-national ones have lower turnout than the national ones as well (At least, that’s my impression, I did look up the Danish numbers, but did not do a systematic review of turnout by EU country by level). Not sure how the RCTheorist will change up the equations to back-predict this non-linear result, but I’m sure it can be done with appropriate tricks.

Above is a figure of Danish turnout results 1970-2013 [oh no! Excel graphics! — ed.]. Source: The reason the kommunal (communal, second-level divisions, n≈100) and regional (first-level divisions; n=5/14, it changed in 2007, notice no change in the turnout) are so closely tied is that they put them on the same day, so people almost always vote for both. EU, by the way, has grown tremendously in power since the 1970s but voter turnout is steady. As I recall, the reason for the spike in communal/regional turnout in early 2000s was because they put the national election on the same day, so people voted for both while they were there anyway.

Regarding voter intentions. It’s easy to find out why they vote. I have been talking to them about this for years, and they never ever ever cite these fancy decision models. Of course, normal people don’t really understand this stuff. Instead, they say stuff like “if everybody thought like you, democracy wouldn’t work” (a failure to apply game theory) or “it’s a democratic duty” (not in the legal sense, and dubiously in the moral either). In my unscientific estimate of commoners, non-science regular people, I’d say about 90% of reasons given for why one should vote are one or both of these two.

This discussion reminds me of this one about RCT for voter ignorance. My agreement lies with Friedman.

Just in response to those last two paragraphs: I think these fancy decision models can give us insight into behavior, even if this is not the way people understand their voting decisions. Different explanations for voting are complementary, not competing. See section 5 of our paper for more on this point.

Custom Distribution Solutions

I (Aki) recently made a case study that demonstrates how to implement user defined probability functions in the Stan language (case study, git repo). As an example I use the generalized Pareto distribution (GPD) to model extreme values of geomagnetic storm data from the World Data Center for Geomagnetism. Stan has had support for user defined functions for a long time, but there wasn’t a full practical example of how to implement all the functions that built-in distributions have (_lpdf (or _lpmf), _cdf, _lcdf, _lccdf, and _rng). Having the full set of functions makes it easy to implement models, censoring, posterior predictive checking and loo. The most interesting things I learned while making the case study were:

  • How to replicate the behavior of Stan’s internal distribution functions as closely as possible (due to lack of overloading of user defined functions, we have to make some compromises).
  • How to make tests for the user defined distribution functions.

By using this case study as a template, it should be easier and faster to implement and test new custom distributions for your Stan models.
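To give a flavour of the second bullet (testing user defined distribution functions), here is a small sketch in Python rather than Stan. It is not the code from the case study. It assumes the common parameterization of the generalized Pareto distribution with lower bound `ymin`, shape `k` and scale `sigma`, skips support checks, and uses made-up function names. The test idea is simply that the density, the CDF and the random number generator must agree with each other.

```python
import math
import random


def gpd_lpdf(y, ymin, k, sigma):
    """Log density of the generalized Pareto distribution (y >= ymin)."""
    z = (y - ymin) / sigma
    if abs(k) > 1e-15:
        return -(1 + 1 / k) * math.log1p(k * z) - math.log(sigma)
    return -z - math.log(sigma)  # k -> 0 limit: shifted exponential


def gpd_cdf(y, ymin, k, sigma):
    """Cumulative distribution function."""
    z = (y - ymin) / sigma
    if abs(k) > 1e-15:
        return 1 - (1 + k * z) ** (-1 / k)
    return 1 - math.exp(-z)


def gpd_rng(ymin, k, sigma, rng):
    """Draw one value by inverting the CDF at a uniform random number."""
    u = rng.random()
    if abs(k) > 1e-15:
        return ymin + sigma * ((1 - u) ** (-k) - 1) / k
    return ymin - sigma * math.log(1 - u)


def integrate_density(upper, ymin, k, sigma, n=20000):
    """Midpoint-rule integral of exp(lpdf) from ymin to upper."""
    h = (upper - ymin) / n
    total = sum(math.exp(gpd_lpdf(ymin + (i + 0.5) * h, ymin, k, sigma))
                for i in range(n))
    return total * h


# Check 1: integrating the density should reproduce the CDF.
ymin, k, sigma, upper = 1.0, 0.3, 2.0, 6.0
area = integrate_density(upper, ymin, k, sigma)
cdf_value = gpd_cdf(upper, ymin, k, sigma)
print(f"integral of pdf: {area:.6f}, cdf: {cdf_value:.6f}")

# Check 2: the share of random draws below `upper` should be close to the CDF.
rng = random.Random(5)
draws = [gpd_rng(ymin, k, sigma, rng) for _ in range(20000)]
share_below = sum(d <= upper for d in draws) / len(draws)
print(f"empirical share below {upper}: {share_below:.3f}")
```

The same pattern (integrate the density, compare with the CDF, compare both with simulated draws) carries over to other custom distributions; only the three function bodies change.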

“A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies”

Jessica Franklin writes:

Given your interest in post-publication peer review, I thought you might be interested in our recent experience criticizing a paper published in BMJ last year by Hemkens et al. I realized that the method used for the primary analysis was biased, so we published a criticism with mathematical proof of the bias (we tried to publish in BMJ, but it was a no go). Now there has been some back and forth between the Hemkens group and us on the BMJ rapid response page, and BMJ is considering a retraction, but no action yet. I don’t really want to comment too much on the specifics, as I don’t want to escalate the tension here, but this has all been pretty interesting, at least to me.

Interesting, in part because both sides in the dispute include well-known figures in epidemiology: John Ioannidis is a coauthor on the Hemkens et al. paper, and Kenneth Rothman is a coauthor on the Franklin et al. criticism.


The story starts with the paper by Hemkens et al., who performed a meta-analysis on “16 eligible RCD studies [observational studies using ‘routinely collected data’], and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths),” and they found that the observational studies overestimated efficacy of treatments compared to the later randomized experiments.

Their message: be careful when interpreting observational studies.

One thing I wonder about, though, is how much of this is due to the time ordering of the studies. Forget for a moment about which studies are observational and which are experimental. In any case, I’d expect the first published study on a topic to show statistically significant results—otherwise it’s less likely to be published in the first place—whereas anything could happen in a follow-up. Thus, I’d expect to see earlier studies overestimate effect sizes relative to later studies, irrespective of which studies are observational and which are experimental. This is related to the time-reversal heuristic.

To put it another way: The Hemkens et al. project is itself an observational study, and in their study there is complete confounding between two predictors: (a) whether a result came from an observational study or an experiment, and (b) whether the result was published first or second. So I think it’s impossible to disentangle the predictive value of (a) and (b).

The criticism and the controversy

Here are the data from Hemkens et al.:

Franklin et al. expressed the following concern:

In a recent meta-analysis by Hemkens et al. (Hemkens et al. 2016), the authors compared published RCD [routinely collected data] studies and subsequent RCTs [randomized controlled trials] using the ROR, but inverted the clinical question and corresponding treatment effect estimates for all study questions where the RCD estimate was > 1, thereby ensuring that all RCD estimates indicated protective effects.

Here’s the relevant bit from Hemkens et al.:

For consistency, we inverted the RCD effect estimates where necessary so that each RCD study indicated an odds ratio less than 1 (that is, swapping the study groups so that the first study group has lower mortality risk than the second).

So, yeah, that’s what they did.

On one hand, I can see where Hemkens et al. were coming from. To the extent that the original studies purported to be definitive, it makes sense to code them in the same direction, so that you’re asking how the replications compared to what was expected.

On the other hand, Franklin et al. have a point, that in the absence of any differences, the procedure of flipping all initial estimates to have odds ratios less than 1 will bias the estimate of the difference.
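A quick simulation shows the mechanism. This is a simplified sketch, not the analysis from either paper: it works on the log odds ratio scale, assumes there is no true effect and no true difference between observational and randomized estimates, gives every study the same made-up standard error, and takes an unweighted average of the log ratio of odds ratios (Hemkens et al. used a weighting procedure). The function name `mean_log_ror` is hypothetical.

```python
import random
import statistics


def mean_log_ror(flip, n_questions=16, se_rcd=0.3, se_rct=0.3,
                 n_sims=5000, seed=3):
    """Average log(ROR) = log(OR_rcd) - log(OR_rct) across simulated
    meta-analyses in which every true log odds ratio is exactly 0.

    If `flip` is True, each clinical question is inverted whenever the
    RCD estimate has OR > 1, so that every RCD estimate ends up below 1.
    """
    rng = random.Random(seed)
    per_meta_analysis = []
    for _ in range(n_sims):
        log_rors = []
        for _ in range(n_questions):
            log_or_rcd = rng.gauss(0, se_rcd)
            log_or_rct = rng.gauss(0, se_rct)
            if flip and log_or_rcd > 0:
                # Swapping the study groups flips the sign of both estimates.
                log_or_rcd, log_or_rct = -log_or_rcd, -log_or_rct
            log_rors.append(log_or_rcd - log_or_rct)
        per_meta_analysis.append(statistics.fmean(log_rors))
    return statistics.fmean(per_meta_analysis)


print(f"mean log ROR without flipping: {mean_log_ror(flip=False):+.3f}")
print(f"mean log ROR with flipping:    {mean_log_ror(flip=True):+.3f}")
```

Without flipping, the average log ratio sits near zero, as it should when nothing is going on. With flipping, the observational estimates are forced to look protective while the trial estimates stay centered at zero, so the average log ratio is pushed below zero and the observational studies appear to overstate benefit.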

Beyond this, the above graph shows a high level of noise in the comparisons, as some of the follow-up randomized trials have standard errors that are essentially infinite. (What do you say about an estimated odds ratio that can be anywhere from 0.2 to 5?) Hemkens et al. appear to be using some sort of weighting procedure, but the relevant point here is that only a few of these studies have enough data to tell us anything at all.

My take on these papers

The above figure tells the story: The 16 observational studies appear to show a strong correlation between standard error and estimated effect size. This makes sense. Go, for example, to the bottom of the graph: I don’t know anything about Hahn 2010, Fonarow 2008, Moss 2003, Kim 2008, and Cabell 2005, but all these studies are estimated to cut mortality by 50% or more, which seems like a lot, especially considering the big standard errors. It’s no surprise that these big estimates fail to reappear under independent replication. Indeed, as noted above, I’d expect that big estimates from randomized experiments would also generally fail to reappear under independent replication.

Franklin et al. raise a valid criticism: Even if there is no effect at all, the method used by Hemkens et al. will create the appearance of an effect: in short, the Hemkens et al. estimate is indeed biased.

Put it all together, and I think that the sort of meta-analysis performed by Hemkens et al. is potentially valuable, but maybe it would’ve been enough for them to stop with the graph on the left in the above image. It’s not clear that anything is gained from their averaging; also there’s complete confounding in their data between timing (which of the two studies came first) and mode (observational or experimental).

The discussion

Here are some juicy bits from the online discussion at the BMJ site:

02 August 2017, José G Merino, US Research Editor, The BMJ:

Last August, a group led by Jessica Franklin submitted to us a criticism of the methods used by the authors of this paper, calling into question some of the assumptions and conclusions reached by Lars Hemkens and his team. We invited Franklin and colleagues to submit their comments as a rapid response rather than as a separate paper but they declined and instead published the paper in Epidemiological Methods (Epidem Meth 2017;20160018, DOI 10.1515/em-2016-0018.) We would like to alert the BMJ’s readers about the paper, which can be found here:

We asked Hemkens and his colleagues to submit a response to the criticism. That report is undergoing statistical review at The BMJ. We will post the response shortly.

14 September 2017, Lars G Hemkens, senior researcher, Despina G Contopoulos-Ioannidis, John P A Ioannidis:

The arguments and analyses of Franklin et al. [1] are flawed and misleading. . . . It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is also essential that there must be a rule for consistent coining of the direction of comparisons. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds ratio method doesn’t invalidate the approach in any way. . . . We took in our study the perspective of clinicians facing new evidence, having no randomized trials, and having to decide whether they use a new promising treatment. In this situation, a treatment would be seen as promising when there are indications for beneficial effects in the RCD-study, which we defined as having better survival than the comparator (that is a OR < 1 for mortality in the RCD-study) . . . it is the only reasonable and useful selection rule in real life . . . The theoretical simulation of Franklin et al. to make all relative risk estimates <1 in RCTs makes no sense in real life and is without any relevance for patient care or health-care decision making. . . . Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously making it impossible to clearly determine which one is the first (Gnerlich 2007). Franklin et al. selected the data which better fit to their claim. . . .

21 September 2017, Susan Gruber, Biostatistician:

The rapid response of Hemkens, Contopoulos-Ioannidis, and Ioannidis overlooks the fact that a metric of comparison can be systematic, transparent, replicable, and also wrong. Franklin et al. clearly explain and demonstrate that inverting the OR based on RCD study result (or on the RCT result) yields a misleading statistic. . . .

02 October 2017, Jessica M. Franklin, Assistant Professor of Medicine, Sara Dejene, Krista F. Huybrechts, Shirley V. Wang, Martin Kulldorff, and Kenneth J. Rothman:

In a recent paper [1], we provided mathematical proof that the inversion rule used in the analysis of Hemkens et al. [2] results in positive bias of the pooled relative odds ratio . . . In their response, Hemkens et al [3] do not address this core statistical problem with their analysis. . . .

We applaud the transparency with which Hemkens et al reported their analyses, which allowed us to replicate their findings independently as well as to illustrate the inherent bias in their statistical method. Our paper was originally submitted to BMJ, as recently revealed by a journal editor [4], and it was reviewed there by two prominent biostatisticians and an epidemiologist. All three reviewers recognized that we had described a fundamental flaw in the statistical approach invented and used by Hemkens et al. We believe that everyone makes mistakes, and acknowledging an honest mistake is a badge of honor. Thus, based on our paper and those three reviews, we expected Hemkens et al. and the journal editors simply to acknowledge the problem and to retract the paper. Their reaction to date is disappointing.

13 November 2017, José G Merino, US Research Editor, Elizabeth Loder, Head of Research, The BMJ:

We acknowledge receipt of this letter that includes a request for retraction of the paper. We take this request very seriously. Before we make a decision on this request, we (The BMJ’s editors and statisticians) are reviewing all the available information. We hope to reach a decision that will maintain the integrity of the scientific literature, acknowledge legitimate differences of opinion about the methods used in the analysis of data, and is fair to all the participants in the debate. We will post a rapid response once we make a decision on this issue.

The discussion also includes contributions from others on unrelated aspects of the problem; here I’m focusing on the Franklin et al. critique and the Hemkens et al. paper.

Good on ya, BMJ

I love how the BMJ is handling this. The discussion is completely open, and the journal editor is completely non-judgmental. All so much better than my recent experience with the Association for Psychological Science, where the journal editor brushed me off in a polite but content-free way, and then the chair of the journal’s publication board followed up with some gratuitous rudeness. The BMJ is doing it right, and the psychology society has a few things to learn from them.

Also, just to make my position on this clear: I don’t see why anyone would think the Hemkens et al. paper should be retracted; a link to the criticisms would seem to be enough.

P.S. Franklin adds:

Just last week I got an email from someone who thought that our conclusion in our Epi Methods paper that use of the pooled ROR without inversion is “just as flawed” was too strong. I think they are right, so we will now be preparing a correction to our paper to modify this statement. So the circle of post-publication peer review continues…

Yes, exactly!

A pivotal episode in the unfolding of the replication crisis

Axel Cleeremans writes:

I appreciated your piece titled “What has happened down here is the winds have changed”. Your mini-history of what happened was truly enlightening — but you didn’t explicitly mention our failure to replicate Bargh’s slow walking effect. This was absolutely instrumental in triggering the replication crisis. As you know, the article was covered by the science journalist Ed Yong and came shortly after the Stapel affair. It was the first failure to replicate a classic priming effect that attracted so much attention. Yong’s blog post about it attracted a response from John Bargh and further replies from Yong, as you indirectly point to. But our article and the entire exchange between Yong and Bargh is also what triggered an extended email discussion involving many of the actors involved in this entire debate (including E. J. Wagenmakers, Hal Pashler, Fritz Strack and about 30 other people). That discussion was initiated by Daniel Kahneman after he and I discussed what to make of our failure to replicate Bargh’s findings. This email discussion continued for about two years and eventually resulted in further attempts to replicate, as they are unfolding now.

I was aware of the Bargh issue but I’d only read Wagenmakers (and Bargh’s own unfortunate writings) on the issue; I’d never followed up to read the original, so this is good to know. One thing I like about having these exchanges on a blog, rather than circulating emails, is that all the discussion is in one place and is open to all to read and participate.

No to inferential thresholds

Harry Crane points us to this new paper, “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse,” and writes:

Quick summary: Benjamin et al. claim that FPR would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of “P-hacking” on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline depending on some other factors.

I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.

From the concluding section of Crane’s paper:

The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.

Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.

I like Crane’s style. I can’t say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false positive thing is inappropriate in most science and engineering contexts that I’ve seen, as I’ve written many times (see, for example, here and here).

I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news—and, as Hal Stern and I wrote awhile ago, it’s not just because of the 0.049 or 0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant—“p less than 0.005”! Wow!—and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both these results from the exact same underlying result, based on nothing but sampling variation, or measurement error, or whatever.
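The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch, assuming the two studies are independent and have the same standard error, so that their z-scores are on a common scale; the helper names are mine, made up for this example.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_for_difference(z1, z2):
    """z-score for the difference between two independent estimates.

    Assumes both estimates share the same standard error, so the
    difference has standard error sqrt(2) times that of either one.
    """
    return (z1 - z2) / math.sqrt(2.0)


if __name__ == "__main__":
    print(f"z = 3: p = {two_sided_p(3):.4f}")   # 0.0027
    print(f"z = 1: p = {two_sided_p(1):.2f}")   # 0.32
    z_diff = z_for_difference(3, 1)
    print(f"difference: z = {z_diff:.2f}, p = {two_sided_p(z_diff):.2f}")  # z = 1.41, p = 0.16
```

Under these assumptions the difference between the "super significant" study and the "nothingburger" has a z-score of about 1.4, which is nowhere near any conventional threshold.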

So, scientists and statisticians: All that thresholding you’re doing? It’s not doing what you think it’s doing. It’s just a magnification of noise.

So I’m not really inclined to follow the details of Crane’s argument regarding false positive rates etc., but I’m supportive of his general attitude that thresholds are a joke.

Post-publication review, not “ever expanding regulation”

Crane’s article also includes this bit:

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.

As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are not “calling for bigger and better scientific regulations,” nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to “ban” anything, nor are we offering any “generic calls to action.” Of all the things on that list, the only thing we’re suggesting is “better study design,” and our suggestions for better study design are in no way a call for “ever expanding scientific regulation.”

Spatial models for demographic trends?

Jon Minton writes:

You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton’s (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US.

Colour versions of the figures are available in the ‘supplementary data’ link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs – and arxiv – allow for a much faster pace of research and analysis.)

An example of the more general approach I try to promote to looking at outcomes which vary by age and year is provided below, where I used data from the Human Mortality Database to produce a 3D printed ‘data cube’ of log mortality by age and year, whose features I then discuss. [See here and here.]

Seeing the data arranged in this way also makes it possible to see when the data quality improves, for example, as you can see the texture of the surface change from smooth (imputed within 5/10 year intervals) to rough.

I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification.

The arrangement of data by year and age conceptually involves thinking about a continuous ‘data surface’ much like a spatial surface.

Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R’s CARBayes package, to model demographic data as well?

My reply:

I agree that visualization is important.

Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we’d want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn’t seem like a property you’d want here.

If you do want to fit Bayesian CAR models, I suggest you do it in Stan.

Minton responded:

I agree that additional structure and different assumptions to those made by CAR would be needed. I’m thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data.

In the case of fertility, I also agree with using cohort and age as the surface’s axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows at which age different cohorts ‘achieve’ replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, and migrant cohorts bringing with them a different fertility schedule than non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.

Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.

I had an email exchange with someone the other day. He had a paper with some graphs that I found hard to read, and he replied by telling me about the software he used to make the graphs. It was fine software, but the graphs were, nonetheless, unreadable.

Which made me realize that people are thinking about graphics software the wrong way. People are thinking that the software makes the graph for you. But that’s not quite right. The software allows you to make a graph for yourself.

Think of graphics software like a hammer. A hammer won’t drive in a nail for you. But if you have a nail and you know where to put it, you can use the hammer to drive in the nail yourself.

This is what I told my correspondent:

Writing takes thought. You can’t just plug your results into a computer program and hope to have readable, useful paragraphs.
Similarly, graphics takes thought. You can’t just plug your results into a graphics program and hope to have readable, useful graphs.