
“The Null Hypothesis Screening Fallacy”?

[non-cat picture]

Rick Gerkin writes:

A few months ago you posted your list of blog posts in draft stage and I noticed that “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.” was still on that list. It was about some concerns I had about a paper in Science. After talking it through with them, the authors of that paper eventually added a correction to the article. I think the issues with that paper are a bit deeper (as I published elsewhere), but still it takes courage to acknowledge the merit of the concerns and write a correction.

Meanwhile, two of the principal investigators from that paper produced a new, exciting data set which was used for a Kaggle-like competition. I won that competition and became a co-first author on a *new* paper in Science.

And this is great! I totally respect them as scientists and think their research is really cool. They made an important mistake in their paper and since the research question was something I care a lot about I had to call attention to it. But I always looked forward to moving on from that and working on the other paper with them, and it all worked out.

That is such a great attitude.

Gerkin continues:

Yet another lesson that most scientific disputes are pretty minor, and working together with the people you disagreed with can produce huge returns. The second paper would have been less interesting and important if we hadn’t been working on it together.

What a wonderful story!

Here’s the background. I received the following email from Gerkin a bit over a year ago:

About 3 months ago there was a paper in Science entitled “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli.” You may have heard about it through normal science channels, or NPR, or the news. The press release was everywhere. It was a big deal because the conclusion that humans can discriminate a trillion odors was unexpected, previous estimates having been in the ~10,000 range. Our central concern is the analysis of the data.

The short version:
They use a hypothesis testing framework — not to reject a null hypothesis with type 1 error rate alpha — but to essentially convert raw data (fraction of subjects discriminating correctly) into a more favorable form (fraction of subjects discriminating significantly above chance), which is subsequently used to estimate an intermediate hypothetical variable, which, when plugged into another equation produces the final point estimate of “number of odors humans can discriminate”. However, small changes in the choice of alpha during this data conversion step (or equivalently small changes in the number of subjects, the number of trials, etc), by virtue of their highly non-linear impact on that point estimate, undermine any confidence in that estimate. I’m pretty sure this is a misuse of hypothesis testing. Does this have a name? Gelman’s fallacy?

I replied:

People do use hyp testing as a screen. When this is done, it should be evaluated as such. The p-values themselves are not so important, you just have to consider the screening as a data-based rule and evaluate its statistical properties. Personally, I do not like hyp-test-based screening rules: I think it makes more sense to consider screening as a goal and go from there. As you note, the p-value is a highly nonlinear transformation of the data, with the sharp nonlinearity occurring at a somewhat arbitrary place in the scale. So, in general, I think it can lead to inferences that throw away information. I did not go to the trouble of following your link and reading the original paper, but my usual view is that it would be better to just analyze the raw data (taking the proportions for each person as continuous data and going from there, or maybe fitting a logistic regression or some similar model to the individual responses).

Gerkin continued:

The long version:
1) Olfactory stimuli (basically vials of molecular mixtures) differed from each other according to the number of molecules they each had in common (e.g. 7 in common out of 10 total, i.e. 3 differences). All pairs of mixtures for which the stimuli in the pair had D differences were assigned to stimulus group D.
2) For each stimulus pair in a group D, the authors computed the fraction of subjects who could successfully discriminate that pair using smell.
3) For each group D, they then computed the fraction of pairs in D for which that fraction of subjects was “significantly above chance”. By design, chance success had p=1/3, so a pair was “significantly above chance” if the fraction of subjects discriminating it correctly exceeded that given by the binomial inverse CDF with x=(1-alpha/2), p=1/3, N=# of subjects. The choice of alpha (an analysis choice) and N (an experimental design choice) clearly drive the results so far. Let’s denote by F that fraction of pairs exceeding the threshold determined by the inverse CDF.
4) They did a linear regression of F vs D. They defined something called a “limen” (basically a fancy term for a discrimination threshold) and set it equal to the solution to 0.5 = beta_0 + beta_1*X, where the betas are the regression coefficients.
5) They then plugged X into yet another equation with more parameters, and the result was their estimate of the number of discriminable olfactory stimuli.
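Step 3 is where alpha and N enter, and a small sketch makes the sensitivity easy to see. The code below is only an illustration of the thresholding rule as Gerkin describes it, not the paper's code: the counts in toy_counts and the choice of N = 20 subjects are made up, and the helper names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binom_ppf(q, n, p):
    """Inverse CDF: smallest k with P(X <= k) >= q."""
    for k in range(n + 1):
        if binom_cdf(k, n, p) >= q:
            return k
    return n

def fraction_significant(correct_counts, n_subjects, alpha, p_chance=1 / 3):
    """Fraction of stimulus pairs whose count of correct subjects exceeds
    the binomial threshold at level 1 - alpha/2 (the quantity F in step 3)."""
    threshold = binom_ppf(1 - alpha / 2, n_subjects, p_chance)
    hits = sum(1 for c in correct_counts if c > threshold)
    return hits / len(correct_counts)

# Made-up data: for each of 10 pairs in one group D, the number of
# subjects (out of a hypothetical N = 20) who discriminated the pair.
toy_counts = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14]

for alpha in (0.10, 0.05, 0.01, 0.001):
    threshold = binom_ppf(1 - alpha / 2, 20, 1 / 3)
    F = fraction_significant(toy_counts, 20, alpha)
    print(f"alpha={alpha}: threshold={threshold}, F={F}")
```

The raw counts never change in this loop; only the analysis choice does, and F moves with it. Everything downstream (the regression in step 4 and the final estimate in step 5) inherits that dependence.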

My reply: I’ve seen a lot of this sort of thing, over the years. My impression is that people are often doing these convoluted steps, not so much out of a desire to cheat but rather because they have not ever stepped back and tried to consider their larger goals. Or perhaps they don’t have the training to set up a model from scratch.

Here’s Gerkin again:

I think it was one of those cases where an experimentalist talked to a mathematician, and the mathematician had some experience with a vaguely similar problem and suggested a corresponding framework that unfortunately didn’t really apply to the current problem. The kinds of stress tests one would apply to the resulting model to make sure it makes sense of the data never got applied.

And then he continued with his main thread:

If you followed this, you’ve already concluded that their method is unsound even before we get to steps 4 and 5 (which I believe are unsound for unrelated reasons). I also generated figures showing that reasonable alternative choices of all of these variables yield estimates of the number of olfactory stimuli ranging from 10^3 to 10^80. I have Python code implementing this reanalysis, along with the figures, available. But what I am wondering most is, is there a name for what is wrong with that screening procedure? Is there some adage that can be rolled out, or work cited, to illustrate this to the author?

To which I replied:

I don’t have any name for this one, but perhaps one way to frame your point is that the term “discriminate” in this context is not precisely determined. Ultimately the question of whether two odors can be “discriminated” should have some testable definition: that is, not just a data-based procedure that produces an estimate, but some definition of what “discrimination” really means. My guess is that your response is strong enough, but it does seem that if someone estimates “X” as 10^9 or whatever, it would be good to have a definition of what X is.

Gerkin concludes with a plea:

The one thing I would really, really like is for the fallacy I described to have a name—even better if it could be listed on your lexicon page. Maybe “The Null Hypothesis Screening Fallacy” or something. Then I could just refer to that link instead of to some 10,000-word explanation of it, every time this comes up in biology (which is all the time).

P.S. Here’s my earlier post on smell statistics.

“Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”

[cat picture]

Shravan Vasishth writes:

The German NSF (DFG) has recently published a position paper on replicability, which contains the following explosive statement (emphasis mine in the quote below).

The first part of their defence against replicability is reasonable: some experiments can never be repeated under the same conditions (e.g., volcanic eruptions etc). But if that is so, why do researchers use frequentist logic for their analyses? This is the one situation where one cannot even imagine repeating the experiment hypothetically (cause the volcano to erupt 10,000 times and calculate the mean emission or whatever and its standard error).

The second part of their defence (in boldface) gives a free pass to the social psychologists. Now one can always claim that the experiment is “difficult” to redo. That is exactly the Fiske defence.

DFG quote:

Scientific results can be replicable, but they need not be. Replicability is not a universal criterion for scientific knowledge. The expectation that all scientific findings must be replicable cannot be satisfied, if only because numerous research areas investigate unique events such as climate change, supernovas, volcanic eruptions or past events. Other research areas focus on the observation and analysis of contingent phenomena (e.g. in the earth system sciences or in astrophysics) or investigate phenomena that cannot be observed repeatedly for other reasons (e.g., ethical, financial or technical reasons). Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.

Wow. I guess they’ll have to specify exactly which of these forms of research are too complex to replicate. And why, if it is too complex to replicate, we should care about such claims. As is often the case in such discussions, I feel that their meaning would be much clearer if they’d give some examples.

No, I’m not blocking you or deleting your comments!

Someone wrote in:

I am worried you may have blocked me from commenting on your blog (because a couple of comments I made aren’t there). . . . Or maybe I failed to post correctly or maybe you just didn’t think my comments were interesting enough. . . .

This comes up from time to time and I always explain that, no, I don’t delete comments.

I don’t block commenters. I flag spam comments as spam—this includes comments with actual content but that contain spam links, and it also includes comments with no links but with such meaningless content that they seem to be some sort of spam—and I delete duplicate comments, which happens I think when people don’t realize their comment was entered the first time. In nearly 15 years of blogging I think I’ve deleted fewer than 5 comments based on content when people are extremely rude.

Legitimate comments also can get caught in the spam. When people email me as above, I search the blog’s spam comments file, and the comment in question is typically there, having been trapped by the spam filter. Other times the comment isn’t there, and I’m guessing it got eaten by the person’s browser before it ever got posted.

I appreciate all the effort that people put into their comments and definitely don’t want to be deleting them! Just as I blog for free so as to improve scientific discourse, so do you and others supply comments for free for that same reason, and I’m glad we have such free and interesting exchanges.

Stan Weekly Roundup, 30 June 2017

Here are some things that have been going on with Stan since last week’s roundup:

  • Stan® and the logo were granted a U.S. Trademark Registration No. 5,222,891 and a U.S. Serial Number: 87,237,369, respectively. Hard to feel special when there were millions of products ahead of you. Trademarked names are case insensitive and they required a black-and-white image, shown here.

  • Peter Ellis, a data analyst working for the New Zealand government, posted a nice case study, State-space modelling of the Australian 2007 federal election. His post is intended to “replicate Simon Jackman’s state space modelling [from his book and pscl package in R] with house effects of the 2007 Australian federal election.”

  • Masaaki Horikoshi provides Stan programs on GitHub for the models in Jacques J.F. Commandeur and Siem Jan Koopman’s book Introduction to State Space Time Series Analysis.

  • Sebastian Weber put out a first draft of the MPI specification for a map function for Stan. Mapping was introduced in Lisp with maplist(); Python uses map() and R uses sapply(). The map operation is also the first half of the parallel map-reduce pattern, which is how we’re implementing it. The reduction involves fiddling the operands, result, and gradients into the shared autodiff graph.

  • Sophia Rabe-Hesketh, Daniel Furr, and Seung Yeon Lee, of UC Berkeley, put together a page of Resources for Stan in educational modeling; we only have another partial year left on our IES grant with Sophia.
  • Bill Gillespie put together some introductory Stan lectures. Bill’s recently back from teaching Stan at the PAGE conference in Budapest.
  • Mitzi Morris got her pull request merged to add compound arithmetic and assignment to the language (she did the compound declare/define before that). That means we’ll be able to write foo[i, j] += 1 instead of foo[i, j] = foo[i, j] + 1 going forward. It works for all types where the binary operation and assignment are well typed.
  • Sean Talts has the first prototype of Andrew Gelman’s algorithm for max marginal modes—either posterior or likelihood. This’ll give us the same kind of maximum likelihood estimates as Doug Bates’s packages for generalized linear mixed effects models, lme4 in R and MixedModels.jl in Julia. It not only allows penalties or priors like Vince Dorie’s and Andrew’s R package blme, but it can be used for arbitrary parameter subsets in arbitrary Stan models. It shares some computational tricks for stochastic derivatives with Alp Kucukelbir’s autodiff variational inference (ADVI) algorithm.
  • I got the pull request merged for the forward-mode test framework. It’s cutting down drastically on code size and improving test coverage. Thanks to Rob Trangucci for writing the finite diff functionals and to Sean Talts and Daniel Lee for feedback on the first round of testing. This should mean that we’ll have higher-order autodiff exposed soon, which means RHMC and faster autodiffed Hessians.

Plan 9 from PPNAS

[cat picture]

Asher Meir points to this breathless news article and sends me a message, subject line “Fruit juice leads to 0.003 unit (!) increase in BMI”:

“the study results showed that one daily 6- to 8-ounce serving increment of 100% fruit juice was associated with a small .003 unit increase in body mass index over one year in children of all ages.”

No confidence intervals but obviously this finding is very worrisome. Children shouldn’t be gaining weight.

Meir continues:

Of course it’s not a coincidence that it’s weird. I send you a very unrepresentative sample of the stuff I read. I mostly don’t send you ordinary schlock but rather things that are really weird – like a “0.003 unit increase in BMI” which is not only statistically insignificant but even if it was able to be substantiated would be of 0 health consequences.

I really enjoy seeing things like this, they are so ridiculous they are like those cult movies that are so bad they’re good.
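To put 0.003 BMI units in perspective, here is a rough conversion to body mass. The height of 1.3 m is an assumption picked for illustration (it is not from the study), and the function name is hypothetical.

```python
def bmi_change_to_mass_kg(delta_bmi, height_m):
    """BMI = mass_kg / height_m**2, so at a fixed height a change in BMI
    corresponds to a change in mass of delta_bmi * height_m**2."""
    return delta_bmi * height_m ** 2

# Hypothetical child, 1.3 m tall, with the reported 0.003-unit increase.
grams = 1000 * bmi_change_to_mass_kg(0.003, 1.3)
print(f"{grams:.1f} g")  # about 5 grams
```

Under that assumed height, the reported association works out to roughly five grams over a year.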

P.S. Yeah, yeah, I know that this particular piece of junk science didn’t appear in PPNAS. But until PPNAS apologizes for wasting the world’s time with air rage, himmicanes, ages ending in 9, etc., I think we have the moral right to continue to use them as shorthand for this sort of thing.

Again: Let’s stop talking about published research findings being true or false

Coincidentally, on the same day this post appeared, a couple people pointed me to a news article by Paul Basken entitled, “A New Theory on How Researchers Can Solve the Reproducibility Crisis: Do the Math.”

This is not good.

Let’s stop talking about published research findings being true or false

I bear some of the blame for this.

When I heard about John Ioannidis’s paper, “Why Most Published Research Findings Are False,” I thought it was cool. Ioannidis was on the same side as me, and Uri Simonsohn, and Greg Francis, and Paul Meehl, in the replication debate: he felt that there was a lot of bad work out there, supported by meaningless p-values, and his paper was a demonstration of how this could come to pass, how it was that the seemingly-strong evidence of “p less than .05” wasn’t so strong at all.

I didn’t (and don’t) quite buy Ioannidis’s mathematical framing of the problem, in which published findings map to hypotheses that are “true” or “false.” I don’t buy it for two reasons: First, statistical claims are only loosely linked to scientific hypotheses. What, for example, is the hypothesis of Satoshi Kanazawa? Is it that sex ratios of babies are not identical among all groups? Or that we should believe in “evolutionary psychology”? Or that strong powerful men are more likely to have boys, in all circumstances? Some circumstances? Etc. Similarly with that ovulation-and-clothing paper: is the hypothesis that women are more likely to wear red clothing during their most fertile days? Or during days 6-14 (which are not the most fertile days of the cycle)? Or only on warm days? Etc. The second problem is that the null hypotheses being tested and rejected are typically point nulls—the model of zero difference, which is just about always false. So the alternative hypothesis is just about always true. But the alternative to the null is not what is being specified in the paper. And, as Bargh etc. have demonstrated, the hypothesis can keep shifting. So we go round and round.

Here’s my point. Whether you think the experiments and observational studies of Kanazawa, Bargh, etc., are worth doing, or whether you think they’re a waste of time: either way, I don’t think they’re making claims that can be said to be either “true” or “false.” And I feel the same way about medical studies of the “hormone therapy causes cancer” variety. It could be possible to coerce these claims into specific predictions about measurable quantities, but that’s not what these papers are doing.

I agree that there are true and false statements. For example, “the Stroop effect is real and it’s spectacular” is true. But when you move away from these super-clear examples, it’s tougher. Does power pose have real effects? Sure, everything you do will have some effect. But that’s not quite what Ioannidis was talking about, I guess.

Anyway, I’m still glad that Ioannidis wrote that paper, and I agree with his main point, even if I feel it was awkwardly expressed by being crammed into the true-positive, false-positive framework.

But it’s been 12 years now, and it’s time to move on. Back in 2013, I was not so pleased with Jager and Leek’s paper, “Empirical estimates suggest most published medical research is true.” Studying the statistical properties of published scientific claims, that’s great. Doing it in the true-or-false framework, not so much.

I can understand Jager and Leek’s frustration: Ioannidis used this framework to write a much celebrated paper; Jager and Leek do something similar—but with real data!—and get all this skepticism. But I do think we have to move on.

And I feel the same way about this new paper, “Too True to be Bad: When Sets of Studies With Significant and Nonsignificant Findings Are Probably True,” by Daniel Lakens and Alexander Etz, sent to me by Kevin Lewis. I suppose such analyses are helpful for people to build their understanding, but I think the whole true/false thing with social science hypotheses is just pointless. These people are working within an old-fashioned paradigm, and I wish they’d take the lead from my 2014 paper with Carlin on Type M and S errors. I suspect that I would agree with the recommendations of this paper (as, indeed, I agree with Ioannidis), but at this point I’ve just lost the patience for decoding this sort of argument and reframing it in terms of continuous and varying effects. That said, I expect this paper by Lakens and Etz, like the earlier papers by Ioannidis and Jager/Leek, could be useful, as I recognize that many people are still comfortable working within the outmoded framework of true and false hypotheses.

P.S. More here and here.

Bayesian, but not Bayesian enough

Will Moir writes:

This short New York Times article on a study published in BMJ might be of interest to you and your blog community, both in terms of how the media reports science and also the use of bayesian vs frequentist statistics in the study itself.

Here is the short summary from the news ticker thing on the NYTimes homepage:

Wow, that sounds really bad! Here is the full article:

It is extremely short, and basically just summarizes the abstract, adds that the absolute increase in risk is actually very small, and recommends talking to your doctor before taking NSAIDs. I guess my problem is that they have the scary headline (53%!), but then say the risk is actually small and you might or might not want to avoid NSAIDs. So is this important or not? The average reader probably has not thought much about relative versus absolute risk, so I wish they would have expanded on that.

In terms of bayesian vs frequentist, this study is bayesian (bayesian meta-analysis of individual patient data). Here is the link:

Despite being bayesian, the way the results are presented gives me very frequentist/NHST vibes. For example, the NYTimes article gives the percent increase in risk of heart attack for the various NSAIDs, which are taken directly from the odds ratios in the abstract:

With use for one to seven days the probability of increased myocardial infarction risk (posterior probability of odds ratio >1.0) was 92% for celecoxib, 97% for ibuprofen, and 99% for diclofenac, naproxen, and rofecoxib. The corresponding odds ratios (95% credible intervals) were 1.24 (0.91 to 1.82) for celecoxib, 1.48 (1.00 to 2.26) for ibuprofen, 1.50 (1.06 to 2.04) for diclofenac, 1.53 (1.07 to 2.33) for naproxen, and 1.58 (1.07 to 2.17) for rofecoxib.

This reads to me like the bayesian equivalent of “statistically significant, p<0.05, lower 95% CI is greater than 1”! To be fair that is just the abstract, and the article itself provides much, much more information.

The following passage also caught my eye:

The bayesian approach is useful for decision making. Take, for example, the summary odds ratio of acute myocardial infarction of 2.65 (1.46 to 4.67) with rofecoxib >25 mg/day for 8-30 days versus non-use. With a frequentist confidence interval, which represents uncertainty through repetition of the experience, all odds ratios from 1.46 to 4.67 might seem equally likely. In contrast, the bayesian approach, although resulting in a numerically similar 95% credible interval, also allows us to calculate that there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.

It seems like they’re using bayesian methods to generate alternative versions of the typical frequentist statistics that can actually be interpreted the way most people incorrectly interpret frequentist/NHST stats (p=0.01 meaning 99% probability that there is an effect, etc). If so that is great because it makes sense to use statistics that match how people will interpret them anyway, but I imagine it would also be subject to the same limitations and abuse that is common to NHST (I am not saying that about this particular study, just in general).

I agree.  If you’re doing decision analysis, you can’t do much with statements such as, “there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.”  It’s better to just work with the risk parameter directly. A parameter being greater than 2.00 isn’t what kills you.
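As a rough check on how much the “83% probability” statement adds beyond the reported interval, here is a back-of-the-envelope sketch. It assumes the posterior for the log odds ratio is approximately normal, centered at log(2.65), with the 95% interval (1.46 to 4.67) spanning plus or minus 1.96 standard deviations. That normal approximation is my assumption, not the model in the paper, and the function name is hypothetical.

```python
from math import log
from statistics import NormalDist

def prob_or_exceeds(cutoff, median_or, lower_or, upper_or):
    """P(odds ratio > cutoff) under a normal approximation on the log scale,
    with the standard deviation backed out of a 95% interval."""
    z975 = NormalDist().inv_cdf(0.975)
    sd = (log(upper_or) - log(lower_or)) / (2 * z975)
    approx_posterior = NormalDist(mu=log(median_or), sigma=sd)
    return 1 - approx_posterior.cdf(log(cutoff))

# Numbers quoted in the passage above: 2.65 (1.46 to 4.67), cutoff 2.00.
p = prob_or_exceeds(2.00, 2.65, 1.46, 4.67)
print(round(p, 2))  # close to the 83% quoted from the paper
```

Under this approximation the probability statement is essentially a restatement of the point estimate and interval, which is one way to see why it reads like a relabeled version of the usual summary.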

Estimating Public Market Exposure of Private Capital Funds Using Bayesian Inference

I don’t know anything about this work by Luis O’Shea and Vishv Jeet—that is, I know nothing of public market exposure or private capital firms, and I don’t know anything about the model they fit, the data they used, or what information they had available for constructing and checking their model.

But what I do know is that they fit their model in Stan.

Fitting models in Stan is just great, for the usual reasons of flexible modeling and fast computing, and also because Stan code can be shared, so we—the Stan user community and the larger research community—can learn from each other and move all our data analyses forward.

Capitalist science: The solution to the replication crisis?

Bruce Knuteson pointed me to this article, which begins:

The solution to science’s replication crisis is a new ecosystem in which scientists sell what they learn from their research. In each pairwise transaction, the information seller makes (loses) money if he turns out to be correct (incorrect). Responsibility for the determination of correctness is delegated, with appropriate incentives, to the information purchaser. Each transaction is brokered by a central exchange, which holds money from the anonymous information buyer and anonymous information seller in escrow, and which enforces a set of incentives facilitating the transfer of useful, bluntly honest information from the seller to the buyer. This new ecosystem, capitalist science, directly addresses socialist science’s replication crisis by explicitly rewarding accuracy and penalizing inaccuracy.

The idea seems interesting to me, even though I don’t think it would quite work for my own research as my work tends to be interpretive and descriptive without many true/false claims. But it could perhaps work for others. Some effort is being made right now to set up prediction markets for scientific papers.

Knuteson replied:

Prediction markets have a few features that led me to make different design decisions. Two of note:
– Prices on prediction markets are public. The people I have spoken with in industry seem more willing to pay for information if the information they receive is not automatically made public.
– Prediction markets generally deal with true/false claims. People like being able to ask a broader set of questions.

A bit later, Knuteson wrote:

I read your post “Authority figures in psychology spread more happy talk, still don’t get the point . . .”

You may find this Physics World article interesting: Figuring out a handshake.

I fully agree with you that not all broken eggs can be made into omelets.

Also relevant is this paper where Eric Loken and I consider the idea of peer review as an attempted quality control system, and we discuss proposals such as prediction markets for improving scientific communication.

Bad Numbers: Media-savvy Ivy League prof publishes textbook with a corrupted dataset

[cat picture]

I might not have noticed this one, except that it happened to involve Congressional elections, and this is an area I know something about.

The story goes like this. I’m working to finish up Regression and Other Stories, going through the examples. There’s one where we fit a model to predict the 1988 elections for the U.S. House of Representatives, district by district, given the results from the previous election and incumbency status. We fit a linear regression, then used the fitted model to predict 1990, then compared to the actual election results from 1990. A clean example with just a bit of realism—the model doesn’t fit perfectly, there’s some missing data, there are some choices in how to set up the model.

This example was in Data Analysis Using Regression and Multilevel/Hierarchical Models—that’s the book that Regression and Other Stories is the updated version of the first half of—and for this new book I just want to redo the predictions using stan_glm() and posterior_predict(), which is simpler and more direct than the hacky way we were doing predictions before.

So, no problem. In the new book chapter I adapt the code, cleaning it in various places, then I open an R window and an emacs window for my R script and check that everything works ok. Ummm, first I gotta find the directory with the old code and data, I do that, everything seems to work all right. . . .

I look over what I wrote one more time. It’s kinda complicated: I’d imputed winners of uncontested elections at 75% of the two-party vote—that’s a reasonable choice, it’s based on some analysis we did many years ago of the votes in districts the election before or after they became uncontested—but then there was a tricky thing where I excluded some of these when fitting the regression and put them back in the imputation. In rewriting the example, it seemed simpler to just impute all those uncontested elections once and for all and then do the modeling and fitting on all the districts. Not perfect—and I can explain that in the text—but less of a distraction from the main point in this section, which is the use of simulation for nonlinear predictors, in this case the number of seats predicted to be won by each party in the next election.

Here’s what I had in the text: “Many of the elections were uncontested in 1988, so that y_i = 0 or 1 exactly; for simplicity, we exclude these from our analysis. . . . We also exclude any elections that were won by third parties. This leaves us with n = 343 congressional elections for the analysis.” So I went back to the R script and put the (suitably imputed) uncontested elections back in. This left me with 411 elections in the dataset, out of 435. The rest were NA’s. And I rewrote the paragraph to simply say: “We exclude any elections that were won by third parties in 1986 or 1988. This leaves us with $n=411$ congressional elections for the analysis.”

But . . . wait a minute! Were there really 24 districts won by third parties in those years? That doesn’t sound right. I go to one of the relevant data files, “1986.asc,” and scan down until I find some of the districts in question:

The first column’s the state (we were using “ICPSR codes,” and states 44, 45, and 46 are Georgia, Louisiana, and Mississippi, respectively), the second is the congressional district, third is incumbency (+1 for Democrat running for reelection, -1 for Republican, 0 for an open seat), and the last two columns are the votes received by the Democratic and Republican candidates. If one of those last two columns is 0, that’s an uncontested election. If both are 0, I was calling it a third-party victory.

But can this be right?

Here’s the relevant section from the codebook:

Nothing about what to do if both columns are 0.

Also this:

For those districts with both columns -9, it says the election didn’t take place, or there was a third party victory, or there was an at-large election.
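The coding rules described above can be written down explicitly, which makes the gap in the codebook visible. This is a minimal sketch, not the script used for the book: the function names and category labels are hypothetical, and treating any negative code as missing is my assumption (the codebook passage only mentions -9 in both columns).

```python
def classify_district(dem_votes, rep_votes):
    """Classify one district-year from its two vote columns."""
    if dem_votes < 0 or rep_votes < 0:
        # Codebook: -9 in both columns means no election, a third-party
        # victory, or an at-large election.
        return "missing_or_special"
    if dem_votes == 0 and rep_votes == 0:
        # The codebook is silent on this case, so flag it rather than
        # assume a third-party victory.
        return "undocumented"
    if rep_votes == 0:
        return "uncontested_dem"
    if dem_votes == 0:
        return "uncontested_rep"
    return "contested"

def dem_share(dem_votes, rep_votes, imputed=0.75):
    """Democratic share of the two-party vote, imputing uncontested
    winners at 75% as described in the text; None if unusable."""
    kind = classify_district(dem_votes, rep_votes)
    if kind == "contested":
        return dem_votes / (dem_votes + rep_votes)
    if kind == "uncontested_dem":
        return imputed
    if kind == "uncontested_rep":
        return 1 - imputed
    return None

# Made-up rows, only to exercise each branch.
for row in [(60000, 40000), (70000, 0), (0, 50000), (0, 0), (-9, -9)]:
    print(row, classify_district(*row), dem_share(*row))
```

Counting how many rows land in each category before fitting anything is the kind of check that would have flagged these districts earlier.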

Whassup? Let’s check Louisiana (state 45 in the above display). Google *Louisiana 1986 House of Representatives Elections* and it’s right there on Wikipedia. I have no idea who went to the trouble of entering all this information (or who went to the trouble of writing a computer program to enter all this information), but here it is:

So it looks like the data table I had was just incomplete. I have no idea how this happened, but it’s kinda embarrassing that I never noticed. What with all those uncontested elections, I didn’t really look carefully at the data with -9’s in both columns.

Also, the incumbency information isn’t all correct. Our file had LA-6 with a Republican incumbent running for reelection, but according to Wikipedia, the actual election was an open seat (but with the Republican running unopposed).

I’m not sure what’s the best way forward. Putting together a new dataset for all those decades of elections, that would be a lot of work. But maybe such a file now exists somewhere? The easiest solution would be to clean up the existing dataset just for the three elections I need for the example: 1986, 1988, 1990. On the other hand, if I’m going to do that anyway, maybe better to use some more recent data, such as 2006, 2008, 2010.

No big deal—it’s just one example in the book—but, still, it’s a mistake I should never have made.

This is all a good example of the benefits of a reproducible workflow. It was through my efforts to put together clean, reproducible code that I discovered the problem.

Also, errors in this dataset could have propagated into errors in these published articles:

[2008] Estimating incumbency advantage and its variation, as an example of a before/after study (with discussion). Journal of the American Statistical Association 103, 437–451. (Andrew Gelman and Zaiying Huang)

[1991] Systemic consequences of incumbency advantage in U.S. House elections. American Journal of Political Science 35, 110–138. (Gary King and Andrew Gelman)

[1990] Estimating incumbency advantage without bias. American Journal of Political Science 34, 1142–1164. (Andrew Gelman and Gary King)

I’m guessing that the main conclusions won’t change, as the total number of these excluded cases is small. Of course those papers were all written before the era of reproducible analyses, so it’s not like the data and code are all there for you to re-run.

Problems with the jargon “statistically significant” and “clinically significant”

Someone writes:

After listening to your EconTalk episode a few weeks ago, I have a question about interpreting treatment effect magnitudes, effect sizes, SDs, etc. I studied Econ/Math undergrad and worked at a social science research institution in health policy as a research assistant, so I have a good amount of background.

At the institution where I worked we started adopting the jargon “statistically significant” AND “clinically significant.” The latter describes the importance of the magnitude in the real world. However, my understanding of standard T testing and p-values is that since the null hypothesis is treatment == 0, then if we can reject the null at p<.05, then this is only evidence that the treatment is > 0. Because the test was against 0, we cannot make any additional claims about the magnitude. If we wanted to make claims about the magnitude, then we would need to test against the null hypothesis of treatment effect == [whatever threshold we assess as clinically significant]. So, what do you think? Were we always over-interpreting the magnitude results or am I missing something here?

My reply:

Section 2.4 of this recent paper with John Carlin explains the problem with talking about “practical” (or “clinical”) significance.

More generally, that’s right, the hypothesis test is, at best, nothing more than the rejection of a null hypothesis that nobody should care about. In real life, treatment effects are not exactly zero. A treatment will help some people and hurt others; it will have some average benefit which will in turn depend on the population being studied and the settings where the treatment is being applied.

But, no, I disagree with your statement that, if we wanted to make claims about the magnitude, then we would need to test other hypotheses. The whole “hypothesis” thing just misses the point. There are no “hypotheses” here in the traditional statistical sense. The hypothesis is that some intervention helps more than it hurts, for some people in some settings. The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.

Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.

[cat picture]

The following email came in:

I’m in a PhD program (poli sci) with a heavy emphasis on methods. One thing that my statistics courses emphasize, but that doesn’t get much attention in my poli sci courses, is the problem of simultaneous inferences. This strikes me as a problem.

I am a bit unclear on exactly how this works, and it’s something that my stats professors have been sort of vague about. But I gather from your blog that this is a subject near and dear to your heart.

For purposes of clarification, I’ll work under the frequentist framework, since for better or for worse, that’s what almost all poli sci literature operates under.

But am I right that any time you want to claim that two things are significant *at the same time* you need to halve your alpha? Or use Scheffé or whatever multiplier you think is appropriate if you think Bonferroni is too conservative?

I’m thinking in particular of this paper [“When Does Negativity Demobilize? Tracing the Conditional Effect of Negative Campaigning on Voter Turnout,” by Yanna Krupnikov].

In particular the findings on page 803.

Setting aside the 25+ predictors, which smacks of p-hacking to me, to support her conclusions she needs it to simultaneously be true that (1) negative ads themselves don’t affect turnout, (2) negative ads for a disliked candidate don’t affect turnout; (3) negative ads against a preferred candidate don’t affect turnout; (4) late ads for a disliked candidate don’t affect turnout AND (5) negative ads for a liked candidate DO affect turnout. In other words, her conclusion is valid iff she finds a significant effect at #5.

This is what she finds, but it looks like it just *barely* crosses the .05 threshold (again, p-hacking concerns). But am I right that since she needs to make inferences about five tests here, her alpha should be .01 (or whatever if you use a different multiplier)? Also, that we don’t care about the number of predictors she uses (outside of p-hacking concerns) since we’re not really making inferences about them?

My reply:

First, just speaking generally: it’s fine to work in the frequentist framework, which to me implies that you’re trying to understand the properties of your statistical methods in the settings where they will be applied. I work in the frequentist framework too! The framework where I don’t want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.

In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and causal population quantities of interest. Again, I am interested in the frequentist properties of my estimates (I’d like to understand their bias and variance) but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.

When you do have multiple comparisons, I think the right way to go is to analyze all of them using a hierarchical model—not to pick one or two or three out of context and then try to adjust the p-values using a multiple comparisons correction. Jennifer Hill, Masanao Yajima, and I discuss this in our 2011 paper, Why we (usually) don’t have to worry about multiple comparisons.

To put it another way, the original sin is selection. The problem with p-hacked work is not that p-values are uncorrected for multiple comparisons, it’s that some subset of comparisons is selected for further analysis, which is wasteful of information. It’s better to analyze all the comparisons of interest at once. This paper with Steegen et al. demonstrates how many different potential analyses can be present, even in a simple study.

OK, so that’s my general advice: look at all the data and fit a multilevel model allowing for varying baselines and varying effects.
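To make the "analyze everything at once" advice a bit more concrete, here is a minimal sketch of partial pooling under a simple normal hierarchical model. The estimates, standard errors, and the function name `partial_pool` are all made up for illustration, and the between-comparison variance comes from a crude method-of-moments step rather than a full multilevel fit (which is what one would actually do, for instance in Stan).

```python
import numpy as np

def partial_pool(estimates, std_errors):
    """Shrink raw estimates toward their common mean (normal-normal model).

    Crude sketch: tau^2 is a method-of-moments estimate, not a full Bayesian fit.
    """
    y = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    mu = np.average(y, weights=1.0 / se**2)      # precision-weighted mean
    # Between-comparison variance: observed spread minus average sampling noise.
    tau2 = max(np.var(y, ddof=1) - np.mean(se**2), 0.0)
    if tau2 == 0.0:
        return np.full_like(y, mu)               # complete pooling
    weight = (1.0 / se**2) / (1.0 / se**2 + 1.0 / tau2)
    return weight * y + (1.0 - weight) * mu

# Hypothetical estimates from five comparisons, each with standard error 1.
raw = [2.8, 0.4, -0.3, 1.1, 0.5]
pooled = partial_pool(raw, [1.0] * 5)
print(np.round(pooled, 2))
```

In this toy setup the largest raw estimate is pulled substantially toward the overall mean, which is the sense in which the model does the job that a multiple comparisons correction is usually asked to do, without discarding the other comparisons.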

What about the specifics?

I took a look at the linked paper. I like the title. “When Does Negativity Demobilize?” is much better than “Does Negativity Demobilize?” The title recognizes that (a) effects are never zero, and (b) effects vary. I can’t quite buy this last sentence of the abstract, though: “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate.” No way! There must be other cases when negativity can demobilize. That said, at this point the paper could still be fine: even if a paper is working within a flawed inferential framework, it could still be solid empirical work. After all, it’s completely standard to estimate constant treatment effects; we did this in our first paper on incumbency advantage and I still think most of our reported findings were basically correct.

Reading on . . . Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity leading to the focal hypothesis of this research. The second section offers empirical tests of this hypothesis.” For the psychological model, she writes that first a person decides which candidate to support, then he or she decides whether to vote. That seems a bit of a simplification, as sometimes I know I’ll vote even before I decide whom to vote for. Haven’t you ever heard of people making their decision inside the voting booth? I’ve done that! Even beyond that, it doesn’t seem quite right to identify the choice as being made at a single precise time. Again, though, that’s ok: Krupnikov is presenting a model, and models are inherently simplifications. Models can still help us learn from the data.

OK, now on to the empirical part of the paper. I see what you mean: there are a lot of potential explanatory variables running around: overall negativity, late negativity, state competitiveness, etc etc. Anything could be interacted with anything. This is a common concern in social science, as there is an essentially unlimited number of factors that could influence the outcome of interest (turnout, in this case). On one hand, it’s a poopstorm when you throw all these variables into your model at once; on the other hand, if you exclude anything that might be important, it can be hard to interpret any comparisons in observational data. So this is something we’ll have to deal with: it won’t be enough to just say there are too many variables and then give up. And it certainly won’t be a good idea to trawl through hundreds of comparisons, looking for something that’s significant at the .001 level or whatever. That would make no sense at all. Think of what happens: you grab the comparison with a z-score of 4, setting aside all those silly comparisons with z-scores of 3, or 2, or 1, but this doesn’t make much sense, given that these z-scores are so bouncy: differences of less than 3 in z-scores are not themselves statistically significant.
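The "less than 3" figure can be checked with a line of arithmetic. The sketch below assumes the two z-scores are independent and each has unit standard deviation, which is an idealization; the 1.96 cutoff is the conventional two-sided 5% critical value.

```python
import math

# Two independent z-scores each have standard deviation 1,
# so their difference has standard deviation sqrt(1 + 1) = sqrt(2).
sd_of_difference = math.sqrt(2.0)

# Smallest gap between two z-scores that would itself reach the
# conventional two-sided 5% level (critical value 1.96).
critical_gap = 1.96 * sd_of_difference
print(round(critical_gap, 2))  # 2.77
```

So under these assumptions a comparison with z = 4 is not distinguishable, in the conventional sense, from one with z = 2.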

To put it another way, “multiple comparisons” can be a valuable criticism, but multiple comparisons corrections are not so useful as a method of data analysis.

Getting back to the empirics . . . here I agree that there are problems. I don’t like this:

Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conventional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.

Not statistically significant != zero.

Here’s more:

Going back to the conclusion from the abstract, “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate,” I think Krupnikov is just wrong here in her application of her empirical results. She’s taking non-statistically-significant comparisons as zero, and she’s taking the difference between significant and non-significant as being significant. Don’t do that.
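A small worked example may help here. The numbers below are hypothetical (they are not taken from Krupnikov's tables), and the two estimates are assumed to be independent; the helper `compare` is just an illustrative name.

```python
import math

def compare(est_a, se_a, est_b, se_b):
    """Difference between two independent estimates, its standard error, and its z-score."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return diff, se_diff, diff / se_diff

# Hypothetical numbers: effect A is 25 with s.e. 10 (z = 2.5, "significant"),
# effect B is 10 with s.e. 10 (z = 1.0, "not significant").
diff, se_diff, z_diff = compare(25.0, 10.0, 10.0, 10.0)
print(diff, round(se_diff, 1), round(z_diff, 2))  # 15.0 14.1 1.06
```

One effect clears the threshold and the other does not, yet the difference between them is only about one standard error from zero. Likewise, the 95% interval for effect B runs from roughly -9.6 to 29.6, so "not significant" is entirely compatible with an effect as large as A's.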

Given that the goal here is causal inference, I think it would’ve been better to set this up more formally as an observational study comparing treatment and control groups.

I did not read the rest of the paper, nor am I attempting to offer any evaluation of the work. I was just focusing on the part addressed by your question. The bigger picture, I think, is that it can be valuable for a researcher to (a) summarize the patterns she sees in data, and (b) consider the implications of these patterns for understanding recent and future campaigns, while (c) recognizing residual uncertainty.

Remember Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.


Update: Usage guidelines

We basically just followed Apache’s lead.

It’s official

“Stan” is now a registered trademark. For those keeping score, it’s

The Stan logo (see image below) is also official

No idea why there are serial numbers for the image and registration numbers for the text. Ask the USPTO.

How to refer to Stan

Please just keep writing “Stan”. We’ll be using the little ® symbol in prominent branding, but you don’t have to.

Thanks to NumFOCUS

Thanks to Leah Silen and NumFOCUS for shepherding the application through the registration process. NumFOCUS is the official trademark holder.

“Stan”, not “STAN”

We use “Stan” rather than “STAN”, because “Stan” isn’t an acronym. Stan is named after Stanislaw Ulam.

The mark is rendered as “STAN” on the USPTO site. Do not be fooled! The patent office capitalizes everything because the registrations are case insensitive.

The image submitted for the trademark (shown above) is black and white. So far, we’ve always used color—on the web site, manual, t-shirts, stickers, etc.

Incentives Matter (Congress and Wall Street edition)

[cat picture]

Thomas Ferguson sends along this paper. From the summary:

Social scientists have traditionally struggled to identify clear links between political spending and congressional voting, and many journalists have embraced their skepticism. A giant stumbling block has been the challenge of measuring the labyrinthine ways money flows from investors, firms, and industries to particular candidates. Ferguson, Jorgensen, and Chen directly tackle that classic problem in this paper. Constructing new data sets that capture much larger swaths of political spending, they show direct links between political contributions to individual members of Congress and key floor votes . . .

They show that prior studies have missed important streams of political money, and, more importantly, they show in detail how past studies have underestimated the flow of political money into Congress. The authors employ a data set that attempts to bring together all forms of campaign contributions from any source— contributions to candidate campaign committees, party committees, 527s or “independent expenditures,” SuperPACs, etc., and aggregate them by final sources in a unified, systematic way. To test the influence of money on financial regulation votes, they analyze the U.S. House of Representatives voting on measures to weaken the Dodd-Frank financial reform bill. Taking care to control as many factors as possible that could influence floor votes, they focus most of their attention on representatives who originally voted in favor of the bill and subsequently to dismantle key provisions of it. Because these are the same representatives, belonging to the same political party, in substantially the same districts, many factors normally advanced to explain vote shifts are ruled out from the start. . . .

The authors test five votes from 2013 to 2015, finding the link between campaign contributions from the financial sector and switching to a pro-bank vote to be direct and substantial. The results indicate that for every $100,000 that Democratic representatives received from finance, the odds they would break with their party’s majority support for the Dodd-Frank legislation increased by 13.9 percent. Democratic representatives who voted in favor of finance often received $200,000–$300,000 from that sector, which raised the odds of switching by 25–40 percent. The authors also test whether representatives who left the House at the end of 2014 behaved differently. They find that these individuals were much more likely to break with their party and side with the banks. . . .

I had a quick question: how do you deal with the correlation/causation issue? The idea that Wall St is giving money to politicians who would already support them? That too is a big deal, of course, but it’s not quite the story Ferguson et al. are telling in the paper.

Ferguson responded:

We actually considered that at some length. That’s why we organized the main discussion on Wall Street and Dodd-Frank around looking at Democratic switchers — people who originally voted for passage (against Wall Street, that is), but then switched in one or more later votes to weaken. Nobody is in that particular regression who didn’t already vote against Wall Street once already, when it really counted.

I replied: Sure, but there’s still the correlation problem, in that one could argue that switchers are people whose latent preferences were closer to the middle, so they were just the ones who were more likely to shift following a change in the political weather.


Ferguson: Conservatism is controlled for in the analysis, using a measure derived from that Congress. This isn’t going to the middle; it’s a tropism for money. The other obvious comment is that if they are really latent Wall Street lovers, they should be moving mostly in lockstep on the subsequent votes. If you look at our summary nos., you can see they weren’t. We could probably mine that point some more.
Short of administering the MMPI for banks in advance, are you prepared to accept any empirical evidence? Voting against banks in the big one is pretty good, I think.

Me: I’m not sure, I’ll have to think about it. One answer, I think, is that if it’s just $ given to pre-existing supporters of Wall St., it’s still an issue, as the congressmembers are then getting asymmetrically rewarded (votes for Wall St get the reward, votes against don’t get the reward), and, as economists are always telling us, Incentives Matter.


Ferguson: Remember those folks who turned on Swaps Push Out didn’t necessarily turn out for the banks on other votes. If it’s “weather” it’s a pretty strange weather.

Stan Weekly Roundup, 23 June 2017

Lots of activity this week, as usual.

* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as n += 1.

* Lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U.Toronto visited to tell us about his, Siddharth Ancha, and Daniel Roy’s 2016 NIPS paper on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in on Michael Betancourt and Maggie Lieu of the European Space Institute spend a couple days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for beta might look like

parameters {
  real mu ~ normal(0, 2);
  real sigma ~ student_t(4, 0, 2);
  vector[N] beta ~ normal(mu, sigma);
}

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about

real ys[N];
for (y in ys)
  target += log_mix(lambda, normal_lpdf(y | mu[1], sigma[1]),
                            normal_lpdf(y | mu[2], sigma[2]));

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!
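For readers who have not met `log_mix`, the proposed foreach loop in the roundup above accumulates the log density of a two-component normal mixture over the elements of `ys`. Here is a rough Python equivalent of that computation. The data and parameter values are invented for illustration, the helper functions are hand-written stand-ins rather than Stan's implementation, and Python's 0-based indexing replaces Stan's 1-based `mu[1]`, `mu[2]`.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_mix(lam, lp1, lp2):
    """log(lam * exp(lp1) + (1 - lam) * exp(lp2)), computed stably."""
    a = math.log(lam) + lp1
    b = math.log1p(-lam) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Illustrative values only; in Stan these would be data and parameters.
ys = [-1.2, 0.3, 2.5]
lam, mu, sigma = 0.4, [0.0, 2.0], [1.0, 0.5]

target = 0.0
for y in ys:  # mirrors the proposed `for (y in ys)` loop
    target += log_mix(lam,
                      normal_lpdf(y, mu[0], sigma[0]),
                      normal_lpdf(y, mu[1], sigma[1]))
print(target)
```

The log-sum-exp step in `log_mix` is there so that very negative log densities do not underflow to zero before being added.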

Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.”

Commenter Erik Arnesen points to this:

Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

Question about the secret weapon

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.

Regarding the larger issue of, what is the secret weapon, as always I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much much better than the usual alternative of fitting a single model to all the data without letting all the coefficients vary.
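As a sketch of the mechanics, the secret weapon amounts to fitting the same model separately to each subset and lining up the estimates. The example below uses simulated stand-in data (not the tree-ring data from the email), with a slope that is assumed to drift across integer temperature bins; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: the true slope of y on x drifts with `temp`.
temps = np.arange(10, 16)          # six integer temperature bins
n_per_bin = 200
results = []
for t in temps:
    x = rng.normal(size=n_per_bin)                 # standardized predictor
    true_slope = -0.1 * (t - 10)                   # assumed for illustration
    y = 1.0 + true_slope * x + rng.normal(scale=0.5, size=n_per_bin)

    # Same linear model fit separately to each subset.
    X = np.column_stack([np.ones(n_per_bin), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n_per_bin - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
    results.append((int(t), coef[1], se[1]))

# One row per subset: these are the points and error bars one would plot.
for t, slope, slope_se in results:
    print(f"temp={t:2d}  slope={slope:+.3f}  se={slope_se:.3f}")
```

Plotting slope against temperature with plus-or-minus one standard error bars gives one panel of a secret weapon display; a multilevel model that lets the slope vary smoothly with temperature would be the fuller version of the same idea.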

“Developers Who Use Spaces Make More Money Than Those Who Use Tabs”

Rudy Malka writes:

I think you’ll enjoy this nice piece of pop regression by David Robinson: developers who use spaces make more money than those who use tabs. I’d like to know your opinion about it.

At the above link, Robinson discusses a survey that allows him to compare salaries of software developers who use tabs to those who use spaces. The key graph is above. Robinson found similar results after breaking down the data by country, job title, or computer language used, and it also showed up in a linear regression controlling in a simple way for a bunch of factors.

As Robinson put it in terms reminiscent of our Why Ask Why? paper:

This is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. . . . I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Speaking with the benefit of hindsight—that is, seeing Robinson’s results and assuming they are a correct representation of real survey data—it all makes sense to me. Tabs seem so amateurish, I much prefer spaces—2 spaces, not 4, please!!!—so from that perspective it makes sense to me that the kind of programmers who use tabs tend to be programmers with poor taste and thus, on average, of lower quality.

I just want to say one thing. Robinson writes, “Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset.” But this isn’t quite the point. Or, to put it another way, I think he has the right instinct here but isn’t quite presenting the issue precisely. To see why, suppose the survey had only 2 questions: How much money do you make? and Do you use spaces or tabs? And suppose we had no other information on the respondents. And, for that matter, suppose there was no nonresponse and that we had a simple random sample of all programmers from some specified set of countries. In that case, we’d know for sure that there are no other confounding factors in the dataset, as the dataset is nothing but those two columns of numbers. But we’d still be able to come up with a zillion potential explanations.

To put it another way, the descriptive comparison is interesting in its own right, and we just should be careful about misusing causal language. Instead of saying, “using spaces instead of tabs leads to an 8.6% higher salary,” we could say, “comparing two otherwise similar programmers, the one who uses spaces has, on average, an 8.6% higher salary than the one who uses tabs.” That’s a bit of a mouthful—but such a mouthful is necessary to accurately describe the comparison that’s being made.

Time-sharing Experiments for the Social Sciences

Jamie Druckman writes:

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak® Panel (see http:/ for more information). In an effort to enable younger scholars to field larger-scale studies than what TESS normally conducts, we are pleased to announce a Special Competition for Young Investigators. While anyone can submit at any time through TESS’s regular proposal mechanism, this Special Competition is limited to graduate students and individuals who are no more than 3 years post-PhD. Winning projects will be allowed to be fielded at a size up to twice the usual budget as a regular TESS study. For more specifics on the special competition, see:  We will begin accepting proposals for the Special Competition on August 1, 2017, and the deadline is October 1, 2017.  Full details about the competition are available at   This page includes information about what is required of proposals and how to submit, and should be reviewed by anyone entering the competition.