
If I had a long enough blog delay, I could just schedule this one for 1 Jan 2026


Gaurav Sood points us to this post, “Why did so many Japanese families avoid having children in 1966?”, by Randy Olson, which includes the excellent graph above and the following explanation:

The Japanese use [an] . . . astrological system . . . based on the Chinese zodiac. Along with assigning an astrological beast based on your birth year, each year also has one of the Five Elements associated with it—all that dramatically affect what your astrological sign entails. . . .

In 1966, however, many Japanese families were still quite superstitious—and . . . 1966 was the year of 丙午 (Hinoe-Uma), or the “Fire Horse.” As one source describes:

Girls born in [1966] became known as ‘Fire Horse Women’ and are reputed to be dangerous, headstrong and generally bad luck for any husband. In 1966, a baby’s sex couldn’t be reliably detected before birth; hence there was a large increase of induced abortions and a sharp decrease in the birth rate in 1966.

Time will tell if superstition will strike again 10 years from now in 2026, the next year of the “Fire Horse” in its 60-year cycle. Given that Japan is already below the replacement fertility rate (i.e., roughly an average of 2 children per woman), the result could be disastrous.

If you look carefully at the above graph, you’ll see increases during the years before and after 1966. So, while it does seem that the net effect on births is negative, it’s not quite so negative as the spike might make it appear, given that some births that would’ve occurred in 1966 have been displaced to the adjacent years.

Also, I’d like to see the raw data—just total number of births in each year, or I guess even better would be total number of births divided by total number of women aged 15-45 in each country. I’m not saying there’s anything wrong with plotting total fertility rate. It’s just that it’s a derived quantity and it would be helpful to me, in trying to understand it, to see the pattern in the raw data as well.

George Orwell on “alternative facts”

Paul Alper points me to this quote from George Orwell’s 1943 essay, Looking Back on the Spanish War:

I know it is the fashion to say that most of recorded history is lies anyway. I am willing to believe that history is for the most part inaccurate and biased, but what is peculiar to our own age is the abandonment of the idea that history could be truthfully written. In the past, people deliberately lied, or they unconsciously colored what they wrote, or they struggled after the truth, well knowing that they must make many mistakes; but in each case they believed that “the facts” existed and were more or less discoverable. And in practice there was always a considerable body of fact which would have been agreed to by almost anyone. If you look up the history of the last war in, for instance, the Encyclopedia Britannica, you will find that a respectable amount of the material is drawn from German sources. A British and a German historian would disagree deeply on many things, even on fundamentals, but there would still be a body of, as it were, neutral fact on which neither would seriously challenge the other. It is just this common basis of agreement with its implication that human beings are all one species of animal, that totalitarianism destroys. Nazi theory indeed specifically denies that such a thing as “the truth” exists. There is, for instance, no such thing as “Science”. There is only “German Science,” “Jewish Science,” etc. The implied objective of this line of thought is a nightmare world in which the Leader, or some ruling clique, controls not only the future but the past. If the Leader says of such and such an event, “It never happened” — well, it never happened. If he says that two and two are five — well two and two are five. This prospect frightens me much more than bombs — and after our experiences of the last few years that is not such a frivolous statement.

It’s not about left and right. In the above passage Orwell points to the Nazis but in other places (notably 1984) he talks about the Soviets having the same attitude.

The Orwell quote is relevant, I think, to the recent story, following the inaugural festivities, of the White House press secretary unleashing a series of false statements—I think we can’t quite call these “lies” because it’s possible that the secretary went to some effort to avoid looking up the relevant facts—followed up by a presidential advisor characterizing these falsehoods as “alternative facts.”

The tricky thing about all this is that there are few absolutes. I won’t say that everybody does it, but I will say that Donald Trump is not the only leading political figure to lie about easily-checked facts. There was Hillary Clinton’s “landing under sniper fire,” Joe Biden’s plagiarized speech, and who could forget the time Paul Ryan broke 3 hours in the marathon? All these are pretty inconsequential, and I can only assume that the politicians in question were just in such a habit of saying things they wanted their audience to hear, that they didn’t care so much whether they were telling the truth. My take on it (just my take, I have no idea) is that for these politicians, speech is instrumental rather than expressive: it doesn’t really matter if what you’re saying is true or false; all that matters is that it has the desired effect.

So I guess we have to accept some ambient level of lies on issues big and small. I agree with Orwell, though, that there’s something particularly disturbing about lying being endorsed on a theoretical level, as it were.

This is related to various statistical issues we discuss on this blog. It can be hard to move forward when people won’t recognize their mistakes even when the evidence is right in front of them. At some point the practice of refusing to admit error edges toward the labeling of false statements as “alternative facts.”

Quantifying uncertainty in identification assumptions—this is important!

Luis Guirola writes:

I’m a poli sci student currently working on methods. I’ve seen you sometimes address questions in your blog, so here is one in case you wanted.

I recently read some of Chuck Manski’s book “Identification for Prediction and Decision”. I take his main message to be: “The only way to get identification is to use assumptions which are untestable.” This makes a lot of sense to me. In fact, most of the applied causal literature working in the Rubin identification tradition that is now popular in poli sci proceeds that way: first consider a research design (IV, quasi-experiment, RDD, whatever), then a) justify that the conditions for identification are met and b) run the design conditional on the assumptions being true. My problem here is that the decision in a) is totally binary, and the uncertainty that I feel is associated with it is taken out of the final result.

Chuck Manski’s idea here is something like “let’s see how far we can get without making any assumptions” (or as few as possible), which takes him to set identification. But as someone educated in the Bayesian tradition, I tend to feel that there must be a way of quantifying, if only subjectively or a priori, how sure we are about how sensible the identification assumptions are, by putting a probability distribution on them. Intuitively, that’s how I assess the state of knowledge in a certain area: if it relies on strong/implausible identification assumptions, I give less credit to its results; if I feel the assumptions are generalizable and hard to dispute, I give them more credit. But obviously, this is a very sloppy way of assessing it… I feel I must be missing something here, for otherwise I should have found more stuff on this.
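To make the set-identification idea concrete, here is a toy worst-case-bounds calculation in the spirit of Manski (the numbers are made up for illustration, not taken from the book): with a bounded outcome and no assumptions about the counterfactuals of untreated units, the mean potential outcome is only identified up to an interval.

```python
# Worst-case ("no assumptions") bounds on E[Y(1)] for a binary outcome,
# in the spirit of Manski's set identification. Numbers are hypothetical.
# Decomposition: E[Y(1)] = P(T=1)*E[Y|T=1] + P(T=0)*E[Y(1)|T=0],
# where the counterfactual mean E[Y(1)|T=0] is unobserved but lies in [0, 1].

p_treated = 0.6        # observed share of treated units, P(T=1)
mean_y_treated = 0.7   # observed mean outcome among the treated, E[Y|T=1]

lower = p_treated * mean_y_treated + (1 - p_treated) * 0.0
upper = p_treated * mean_y_treated + (1 - p_treated) * 1.0

print(round(lower, 2), round(upper, 2))  # E[Y(1)] lies in [0.42, 0.82]
```

The interval shrinks only when you add assumptions, which is exactly the trade-off Guirola is describing.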

My response:

Yes, I think it would be a good idea to quantify uncertainty in identification assumptions. The basic idea would be to express your model with an additional parameter, call it phi, which equals 0 if the identification assumption holds, and is positive or negative if the assumption fails, with the magnitude of phi indexing how far off the assumption is from reality. For example, if you have a model of ignorable treatment assignment, phi could be the coefficient on an unobserved latent characteristic U in a logistic regression predicting treatment assignment; for example, Pr(T=1) = invlogit(X*beta + U*phi), where X represents observed pre-treatment predictors. The coefficient phi could never actually be estimated from data, as you don’t know U, but one could put priors on X and U based on some model of how selection could occur. One could then look at the sensitivity of inferences to assumed values of phi.
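Here is a simulation sketch of that sensitivity analysis (a toy setup of my own; the invlogit selection model follows the sketch above, but the coefficients and distributions are illustrative). The true treatment effect is zero, yet the naive treated-vs-control comparison drifts away from zero as phi grows:

```python
import math
import random

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def naive_estimate(phi, n=20000, seed=1):
    """Naive treated-minus-control mean difference when treatment
    assignment depends on an unobserved confounder U with weight phi."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)            # observed pre-treatment predictor
        u = rng.gauss(0, 1)            # unobserved latent characteristic
        t = rng.random() < invlogit(0.5 * x + phi * u)
        y = u + rng.gauss(0, 1)        # outcome; true treatment effect is 0
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

# phi = 0: the ignorability assumption holds and the naive estimate is
# near the true effect of zero. As phi grows, hidden selection on U
# biases the comparison, even though nothing in the observed data flags it.
for phi in (0.0, 0.5, 1.0):
    print(phi, round(naive_estimate(phi), 2))
```

In a full Bayesian version one would put a prior on phi and propagate that uncertainty into the inference, rather than just reading off a sensitivity curve.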

I’m sure a lot of work has been done on such models—I assume they’re related to the selection models of James Heckman from the 1970s—and I think they’re worthy of more attention. My impression is that people don’t work with such models because they make life more complicated and require additional assumptions.

It’s funny: Putting a model on U and a prior on phi is a lot less restrictive—a lot less of an “assumption”—than simply setting phi to 0, which is what we always do. But the model on U and phi is explicit, whereas the phi=0 assumption is hidden so somehow it doesn’t seem so bad.

Regression models with latent variables and measurement error can be difficult to fit using usual statistical software but they’re easy to fit in Stan: you just add each new equation and distribution to the model, no problem at all. So I’m hoping that, now that Stan is widely available, people will start fitting these sorts of models. And maybe at some point this will be routine for causal inference.

At the time of this writing, I haven’t worked through any such example myself, but I think it’s potentially a very useful idea in many application areas.

Is the dorsal anterior cingulate cortex “selective for pain”?


Peter Clayson writes:

I have spent much of the last 6 months or so of my life trying to learn Bayesian statistics on my own. It’s been a difficult, yet rewarding experience.

I have a question about a research debate that is going on my field.

Briefly, the debate among some very prominent scholars in my area surrounds the question of whether the dorsal anterior cingulate cortex (dACC) is selective for pain, i.e., whether pain is the best explanation compared with the numerous other terms commonly studied with regard to dACC activation. The paper reached these conclusions by using reverse inference on studies included in the NeuroSynth database.

What is getting many researchers riled up is the statistical approach the paper used (as well as the potential anatomical errors). The original paper in PNAS, by Matt Lieberman and Naomi Eisenberger here at UCLA, used z-scores to summarize dACC activation instead of posterior probabilities. Explanation below:

Our next goal was to quantify the strength of evidence for different processes being the psychological interpretation for dACC activity and how the evidence for different psychological processes compared with one another. We wanted to explore this issue in an unbiased way across the dACC that would allow each psychological domain to show where there is more or less support for it as an appropriate psychological interpretation. To perform this analysis, we extracted reverse inference statistics (Z-scores and posterior probabilities) across eight foci in the dACC for the terms “pain” (410 studies), “executive” (531 studies), “conflict” (246 studies), and “salience” (222 studies).

The foci were equally spaced out across the midline portion of the dACC (see Fig. 5 for coordinates). We plotted the posterior probabilities at each location for each of the four terms, as well as an average for each psychological term across the eight foci in the dACC (Fig. 5). Because Z-scores are less likely to be inflated from smaller sample sizes than the posterior probabilities, our statistical analyses were all carried out on the Z-scores associated with each posterior probability (21).

The paper goes on to compare z scores in various dACC voxels for different psychological terms. The paper was slammed by Tal Yarkoni, the creator of the NeuroSynth database, for not using Bayesian statistics (here) as well as for other reasons. Lieberman posted a snarky, passive-aggressive reply defending his statistical analysis (here), and Yarkoni brazenly responded to that post (here). Then things “got real”, and some heavy hitters in my field published a commentary in PNAS (here), to which Lieberman responded (here). (I show you all this to demonstrate how contentious things have gotten about this PNAS paper.)

Lieberman et al. defended their statistical approach and emphasized hit rates:

Imagine a database consisting of 100,000 attention studies and 100 pain studies. If a voxel is activated in 1,000 attention studies and all 100 pain studies, we would draw two conclusions. First, a randomly drawn study from the 1,100 with an effect would likely be an attention study. Second, because 100% of the pain studies produced an effect and only 1% of attention studies did, we would also conclude that this voxel is more selective for pain than attention. Hit rates (e.g., the number of pain studies that activate a region divided by the total number of pain studies in Neurosynth) are more important for assessing structure-to-function mapping than the historical tendency to conduct more studies on some topics than others.

They go on to analyze a subset of the data, matching the number of studies included in the analyses for each of the terms.

It seems like the appropriate thing to do is to analyze posterior probabilities for each psychological term and dACC activation. I think an appropriate analogy would be a doctor diagnosing smallpox. Say patients with smallpox have a 99% probability of having spots, whereas patients with chickenpox have a 70% probability of having spots. Given how rare smallpox currently is (the prior), without taking the prior, and hence the posterior probabilities, into account, the doctor would incorrectly diagnose patients as having smallpox, based on the reasoning that patients who have smallpox are more likely to show spots.

I think the same thing is happening in this PNAS paper. Ignoring the posterior probabilities is like just focusing on whether a patient has spots and ignoring the prior probabilities.

Am I reaching a correct conclusion? I think the PNAS paper is bad for numerous other reasons, but I want to understand Bayesian statistics better.

My response: It’s interesting how different fields have different terminologies. From the abstract of the paper under discussion:

No neural region has been associated with more conflicting accounts of its function than the dorsal anterior cingulate cortex (dACC), with claims that it contributes to executive processing, conflict monitoring, pain, and salience. However, these claims are based on forward inference analysis, which is the wrong tool for making such claims. Using Neurosynth, an automated brainmapping database, we performed reverse inference analyses to explore the best psychological account of dACC function. Although forward inference analyses reproduced the findings that many processes activate the dACC, reverse inference analyses demonstrated that the dACC is selective for pain and that pain-related terms were the single best reverse inference for this region.

I’d never before heard of “forward” or “reverse” inference. Here’s how they define it:

Forward inference, in this context, refers to the probability that a study or task that invokes a particular process will reliably produce dACC activity [e.g., the probability of dACC activity, given a particular psychological process: P(dACC activity|Ψ process)]. . . . Reverse inference, in the current context, refers to the probability that dACC activity can be attributed to a particular psychological process [i.e., the probability of a given psychological process, given activity in the dACC: P(Ψ process|dACC activity)].

I agree with Clayson that these reverse probabilities will depend on base rates. “The probability of a given psychological process” depends crucially on how this process is defined and how often it is happening. For example, some people are in pain all the time. If pain is the “Ψ process” under consideration here, then P(Ψ process|dACC activity) will be 1 for those people, automatically. Other people are just about never in pain, so P(Ψ process|dACC activity) will be essentially 0 for them.

I’m not saying that this reverse inference is necessarily a bad idea, just that much will depend on what scenarios are in this database that they are using. In his discussion, Yarkoni writes, “Pain has been extensively studied in the fMRI literature, so it’s not terribly surprising if z-scores for pain are larger than z-scores for many other terms in Neurosynth.” I think this is pretty much the same thing that I was saying (but backed up by actual data), that this reverse-inference comparison will depend strongly on what’s in the database.
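To see the dependence on the database numerically, here is the arithmetic for the hypothetical database in Lieberman et al.’s reply (a sketch using only the figures they quote: 100,000 attention studies, 100 pain studies):

```python
# Hypothetical Neurosynth-style database from Lieberman et al.'s reply:
# 100,000 attention studies and 100 pain studies; the voxel is active
# in 1,000 of the attention studies and in all 100 pain studies.
n_attention, n_pain = 100_000, 100
active_attention, active_pain = 1_000, 100

# Forward inference ("hit rates"): P(activation | topic)
hit_attention = active_attention / n_attention   # 0.01
hit_pain = active_pain / n_pain                  # 1.0

# Reverse inference: P(topic | activation), which is driven by base rates
p_pain_given_active = active_pain / (active_pain + active_attention)

print(hit_pain, hit_attention, round(p_pain_given_active, 3))  # 1.0 0.01 0.091
```

The same voxel looks perfectly selective by hit rate yet yields only about a 9% reverse-inference probability of pain, purely because of how many attention studies happen to be in the database.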

I also have some problems with how both of these inferences are defined, in that “dACC activity” is defined discretely, as if the dorsal anterior cingulate cortex is either on or off. But it’s my impression that things are not so simple.

Finally, there are the major forking-paths problems with the study, which are addressed in detail by Yarkoni. I agree with Yarkoni that the right way to go should be to perform a meta-analysis or hierarchical model with all possible comparisons, rather than just selecting a few things of interest and using them to tell dramatic stories.

On the other hand, Matthew Lieberman, one of the authors of the paper being discussed, has a Ted talk (“The Social Brain and its Superpowers”) and has been featured on NPR, and Tal Yarkoni hasn’t. So there’s that.

Stan Conference Live Stream

StanCon 2017 is tomorrow! Late registration ends in an hour. After that, all tickets are $400.

We’re going to be live streaming the conference. You’ll find the stream as a YouTube Live event from 8:45 am to 6 pm ET (and whatever gets up will be recorded by default). We’re streaming it ourselves, so if there are technical difficulties, we may have to stop early.

We’re on Twitter and you can track the conference with the #stancon2017 hashtag.


Looking for rigor in all the wrong places

My talk in the upcoming conference on Inference from Non Probability Samples, 16-17 Mar in Paris:

Looking for rigor in all the wrong places

What do the following ideas and practices have in common: unbiased estimation, statistical significance, insistence on random sampling, and avoidance of prior information? All have been embraced as ways of enforcing rigor but all have backfired and led to sloppy analyses and erroneous inferences. We discuss these problems and some potential solutions in the context of problems in applied survey research, and we consider ways in which future statistical theory can be better aligned with practice.

The talk should reflect how my thinking has changed from this talk a couple years ago.

Alternatives to jail for scientific fraud


Mark Tuttle pointed me to this article by Amy Ellis Nutt, who writes:

Since 2000, the number of U.S. academic fraud cases in science has risen dramatically. Five years ago, the journal Nature tallied the number of retractions in the previous decade and revealed they had shot up 10-fold. About half of the retractions were based on researcher misconduct, not just errors, it noted.

The U.S. Office of Research Integrity, which investigates alleged misconduct involving National Institutes of Health funding, has been far busier of late. Between 2009 and 2011, the office identified three cases with cause for action. Between 2012 and 2015, that number jumped to 36.

While criminal cases against scientists are rare, they are increasing. Jail time is even rarer, but not unheard of. Last July, Dong-Pyou Han, a former biomedical scientist at Iowa State University, pleaded guilty to two felony charges of making false statements to obtain NIH research grants and was sentenced to more than four years in prison.

Han admitted to falsifying the results of several vaccine experiments, in some cases spiking blood samples from rabbits with human HIV antibodies so that the animals appeared to develop an immunity to the virus.

“The court cannot get beyond the breach of the sacred trust in this kind of research,” District Judge James Gritzner said at the trial’s conclusion. “The seriousness of this offense is just stunning.”

In 2014, the Office of Research Integrity had imposed its own punishment. Although it could have issued a lifetime funding ban, it only barred Han from receiving federal dollars for three years.

Sen. Charles Grassley (R-Iowa) was outraged. “This seems like a very light penalty for a doctor who purposely tampered with a research trial and directly caused millions of taxpayer dollars to be wasted on fraudulent studies,” he wrote the agency. The result was a federal probe and Han’s eventual sentence.

I responded that I think community service would make more sense. Flogging seems like a possibility too. Jail seems so destructive.

I do agree with Sen. Grassley that a 3-year ban on federal dollars is not enough of a sanction in that case. Spiking blood samples is pretty much the worst thing you can do, when it comes to interfering with the scientific process. If spiking blood samples only gives you a 3-year ban, what does it take to get a 10-year ban? Do you have to be caught actually torturing the poor bunnies? And what would it take to get a lifetime ban? Spiking blood samples plus torture plus intentionally miscalculating p-values?

The point is, there should be some punishments more severe than the 3-year ban but more appropriate than prison, involving some sort of restitution. Maybe if you’re caught spiking blood samples you’d have to clean pipettes at the lab every Saturday and Sunday for the next 10 years? Or you’d have to check the p-value computations in every paper published in Psychological Science between the years of 2010 and 2015?

“Estimating trends in mortality for the bottom quartile, we found little evidence that survival probabilities declined dramatically.”


Last year there was much discussion here and elsewhere about a paper by Anne Case and Angus Deaton, who noticed that death rates for non-Hispanic white Americans aged 45-54 had been roughly flat since 1999, even while the death rates for this age category had been declining steadily in other countries and among nonwhite Americans.

Here’s the quick summary of what was happening in the U.S. for non-Hispanic white Americans aged 45-54:


Different things are happening in different regions—in particular, things have been getting worse for women in the south and midwest, whereas the death rate of men in this age group has been declining during the past few years—but overall there has been little change since 1999. In contrast, other countries and U.S. nonwhites have seen large declines in death rates, something like 20%.

The above graph (from this paper with Jonathan Auerbach) is not quite what Case and Deaton showed. They didn’t break things down by sex or region, and they didn’t age adjust, which was a mistake because during the 1999-2013 period, the baby boom moved through the 45-54 age group, so that this group increased in average age, leading to an increase in raw death rates simply because the people in this age category are older. (The instantaneous probability of dying increases at a rate of about 8% per year; that is, each year in this age range your chance of dying during the next year is multiplied by approximately a factor of 1.08; thus, when looking at relatively small changes of death rate you really have to be careful about age composition.)
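The size of this composition effect is easy to sketch (a toy calculation using the roughly-8%-per-year figure above; the baseline rate and the two age distributions are made up): hold every age-specific death rate fixed and shift the age mix of the 45-54 group upward, and the raw aggregate rate rises anyway.

```python
# Toy check of the age-composition effect. Age-specific death rates rise
# about 8% per year of age; hold them fixed and change only the age mix.
base = 0.004  # hypothetical death rate at age 45

def raw_rate(weights):
    """Aggregate death rate for ages 45-54 under a given age distribution."""
    return sum(w * base * 1.08 ** a for a, w in enumerate(weights))

uniform = [0.1] * 10               # flat age distribution across 45-54
older = [0.05] * 5 + [0.15] * 5    # baby boom shifts mass toward ages 50-54

increase = raw_rate(older) / raw_rate(uniform) - 1
print(round(100 * increase, 1))  # percent rise in the raw death rate with
                                 # no change in any age-specific rate
```

So a shift in the age distribution alone can move raw death rates by several percent, which is large relative to the trends being debated, hence the need for age adjustment.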

Anyway, that’s all been hashed out a million times and now we understand it.

Today I want to talk about something different: trends in death rate by education. Much of the discussion in the news media has centered on the idea that the trend is particularly bad for lower-educated whites. But, as I wrote in my first post on the topic:

I’m not quite sure how to interpret Case and Deaton’s comparisons across education categories (no college; some college; college degree), partly because I’m not clear on why they used this particular binning but also because the composition of the categories has changed during the period under study. The group of 45-54-year-olds in 1999 with no college degree is different from the corresponding group in 2013, so it’s not exactly clear to me what is learned by comparing these groups. I’m not saying the comparison is meaningless, just that the interpretation is not so clear.

I was just raising a question, but it turns out that some people have studied it, and there’s a paper from 2015 in the journal Health Affairs.

Here it is: Measuring Recent Apparent Declines In Longevity: The Role Of Increasing Educational Attainment, by John Bound, Arline Geronimus, Javier Rodriguez, and Timothy Waidmann, who write:

Independent researchers have reported an alarming decline in life expectancy after 1990 among US non-Hispanic whites with less than a high school education. However, US educational attainment rose dramatically during the twentieth century; thus, focusing on changes in mortality rates of those not completing high school means looking at a different, shrinking, and increasingly vulnerable segment of the population in each year.

Yes, this was the question I raised earlier, and Bound et al. back it up with a graph, which I reproduced at the top of this post. (John Bullock argues in a comment that the above graph is wrong because high school completion rates aren’t so high, but this does not affect the general point made by Bound et al.)

Then they take the next step:

We analyzed US data to examine the robustness of earlier findings categorizing education in terms of relative rank in the overall distribution of each birth cohort, instead of by credentials such as high school graduation.

That makes sense. By using relative rank, they’re making an apples-to-apples comparison. And here’s what they find:

Estimating trends in mortality for the bottom quartile, we found little evidence that survival probabilities declined dramatically.

Interesting! They conclude:

Widely publicized estimates of worsening mortality rates among non-Hispanic whites with low socioeconomic position are highly sensitive to how educational attainment is classified. However, non-Hispanic whites with low socioeconomic position, especially women, are not sharing in improving life expectancy, and disparities between US blacks and whites are entrenched.

Come and work with us!

Stan is an open-source, state-of-the-art probabilistic programming language with a high-performance Bayesian inference engine written in C++. Stan has been successfully applied to modeling problems with hundreds of thousands of parameters in fields as diverse as econometrics, sports analytics, physics, pharmacometrics, recommender systems, political science, and many more. Research using Stan has been featured in the New York Times, Slate, and other media outlets as well as in leading scientific journals in a range of disciplines. The Stan user community is in the tens of thousands.

The Stan community is growing faster than we expected and we have a large backlog of features that we would like to add to the language. If you would like to join a small team of statisticians, computer scientists, and other researchers who are working on some of the most interesting problems in computational statistics today, we encourage you to apply.

Stan programmers

We are looking for software engineers to work in at least one of the following areas:
• C++ programming
• Scripting languages and Stan interfaces
• Web development
• Parallel, distributed, and high-performance computing
We have a fun, intense, non-hierarchical collaborative working environment. And everything is open-source, which will maximize the impact of your contributions. Come work on a project which makes a difference for thousands of academic and industrial researchers around the world.

Stan business developer/grants manager

Our larger vision is for breadth (there is a wide range of application areas where Stan can make a difference), depth (improving the algorithms and the language to be able to fit more complicated models), and scalability (“big data”).

We are currently involved in research and development in all these areas, and we need a manager to:
• Organize the work, matching projects to people, integrating new hires.
• Raise funds, including grant applications, contacts with foundations and businesses and consulting opportunities, etc.
Both of these involve people skills, technical understanding, and an interest, ultimately, in solving real world problems. Hence the project manager should have a technical background and be interested in applications of statistics to the wider world.


In addition we have new and ongoing projects involving Bayesian modeling and Stan research and development in applications, including:

• Causal inference using Gaussian processes and BART
• Survey weighting and regression modeling
• Mixture models for gene splicing

Key skills for any postdoc are statistical modeling, computing, and communication, along with writing papers and developing methods that work on real problems.


If you are interested in any of these, just email me telling me which position(s) interest you and why. Include a C.V. and anything else that might be relevant (such as papers you’ve written or links to code you’ve written), and have three letters of recommendation sent to me (or give the names of three people who could provide recommendations if asked).

30 tickets left to StanCon 2017! New sponsor!

Stan Conference 2017 is on Saturday. We just sold our 150th ticket! Capacity is 180. It’s going to be an amazing event. Register here while tickets are still available.

Our Q&A Panel will have some members of the Stan Development Team:

  • Andrew Gelman. Stan super user.
  • Bob Carpenter. Stan language, math library.
  • Michael Betancourt. Stan algorithms, math library.
  • Daniel Lee. Math library, CmdStan, how everything fits together.
  • Ben Goodrich. RStan, RStanArm, math library.
  • Jonah Gabry. ShinyStan, and all packages downstream of RStan.
  • Allen Riddell. PyStan.

This only represents about half of the Stan developers at StanCon! Logistically, we just can’t have everyone on stage.



A big thanks to our sponsors! Without their contributions, we really couldn’t have secured the space and the services for the conference. Here are our sponsors (in order of sponsoring StanCon):

Live Stream

We’re going to live stream StanCon 2017. I’ll post more details later this week. I’m still working out technical details. We are running the live stream ourselves, so I’m sure we’ll have technical difficulties. I’d really suggest attending in person if you are able.


“If the horse race polls were all wrong about Trump, why should his approval rating polls be any better?”

A journalist forwarded the above question to me and asked what I thought.

My reply is that the horse race polls were not all wrong about Trump. The polls had him at approximately 48% of the two-party vote and he received 49%. The polls were wrong by a few percentage points in some key swing states (as we know, this happens; it’s called non-sampling error), but for the task of measuring national opinion the polls did just fine; they were not “all wrong” as the questioner seemed to think.

P.S. Some polls during the campaign did have Clinton way in the lead but we never thought those big bounces were real, and based on the fundamentals we were also anticipating a close election.

Laurie Davies: time series decomposition of birthday data

On the cover of BDA3 is a Bayesian decomposition of the time series of birthdays in the U.S. over a 20-year period. We modeled the data as a sum of Gaussian processes and fit it using GPstuff.

Occasionally we fit this model to new data; see for example this discussion of Friday the 13th and this regarding April 15th.

We still can’t just pop these models into Stan—the matrix calculations are too slow—but we’re working on it, and I am confident that we’ll be able to fit the models in Stan some day. In the meantime I have some thoughts about how to improve the model and we plan to continue working on this.
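The bottleneck is visible in even a toy Gaussian-process regression. The sketch below is illustrative only: it is not the BDA3 birthday model, and the kernel and settings are invented. It computes a posterior mean the naive way, via a Cholesky factorization of the n-by-n covariance matrix, which costs O(n³) and is what makes roughly 20 years of daily data (n in the thousands) painful:

```python
import math

def se_kernel(x1, x2, ell=1.0, sf=1.0):
    # squared-exponential covariance with length-scale ell and scale sf
    return sf ** 2 * math.exp(-0.5 * ((x1 - x2) / ell) ** 2)

def cholesky(A):
    # lower-triangular L with L L^T = A; the O(n^3) bottleneck
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    # solve (L L^T) x = b by forward then backward substitution
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior_mean(xs, ys, noise=0.1):
    # posterior mean of a GP regression, evaluated at the training points
    n = len(xs)
    K = [[se_kernel(xs[i], xs[j]) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_chol(cholesky(K), ys)
    return [sum(se_kernel(xs[i], xs[j]) * alpha[j] for j in range(n))
            for i in range(n)]
```

With a sum of several such kernels (trend, weekly, yearly, special days) and n ≈ 7300 days, that cubic factorization is what general-purpose samplers currently choke on.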

Anyway, all the above analyses are Bayesian. Laurie Davies sent along this non-Bayesian analysis he did that uses residuals and hypothesis testing. Here’s his summary report, and here’s all his code.

I prefer our Bayesian analysis for various reasons, but Davies does demonstrate the point that hypothesis testing, if used carefully, can be used to attack this sort of estimation problem.

The data

The birthday data used in BDA3 come from National Vital Statistics System natality data, as provided by Google BigQuery and exported to csv by Robert Kern.

More recent data exported by Fivethirtyeight are available here:

The file US_births_1994-2003_CDC_NCHS.csv contains U.S. births data for the years 1994 to 2003, as provided by the Centers for Disease Control and Prevention’s National Center for Health Statistics.

US_births_2000-2014_SSA.csv contains U.S. births data for the years 2000 to 2014, as provided by the Social Security Administration.

The NCHS and SSA counts differ somewhat in the overlapping years, as we discussed here.
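Comparing the two sources in their overlapping years just requires aggregating daily counts to yearly totals and differencing. A minimal sketch, assuming the rows have already been parsed down to (year, births) pairs (the actual CSVs carry a few more columns, such as month and day of week):

```python
from collections import defaultdict

def yearly_totals(rows):
    # rows: iterable of (year, births) pairs parsed from either CSV
    totals = defaultdict(int)
    for year, births in rows:
        totals[year] += births
    return dict(totals)

def yearly_differences(nchs_rows, ssa_rows):
    # SSA minus NCHS totals, restricted to years present in both files
    nchs = yearly_totals(nchs_rows)
    ssa = yearly_totals(ssa_rows)
    return {y: ssa[y] - nchs[y] for y in sorted(nchs.keys() & ssa.keys())}
```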

Stan is hiring! hiring! hiring! hiring!

[insert picture of adorable cat entwined with Stan logo]

We’re hiring postdocs to do Bayesian inference.

We’re hiring programmers for Stan.

We’re hiring a project manager.

How many people we hire depends on what gets funded. But we’re hiring a few people for sure.

We want the best people who love to collaborate, who love to program, who love statistical modeling, who love to learn, who care about getting things right and are happy to admit their mistakes.

See here for more information.

Powerpose update


I contacted Anna Dreber, one of the authors of the paper that failed to replicate power pose, and asked her about a particular question that came up regarding their replication study. One of the authors of the original power pose study wrote that the replication “varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable.” As commenter Phil put it, “It does seem kind of ridiculous to have people hold any pose other than ‘lounging on the couch’ for six minutes.”

In response, Dreber wrote:

We discuss this in the paper and this is what we say in the supplementary material:

A referee also pointed out that the prolonged posing time could cause participants to be uncomfortable, and this may counteract the effect of power posing. We therefore reanalyzed our data using responses to a post-experiment questionnaire completed by 159 participants. The questionnaire asked participants to rate the degree of comfort they experienced while holding the positions on a four-point scale from “not at all” (1) to “very” (4) comfortable. The responses tended toward the middle of the scale and did not differ by High- or Low-power condition (average responses were 2.38 for the participants in the Low-power condition and 2.35 for the participants in the High-power condition; mean difference = -0.025, CI(-0.272, 0.221); t(159) = -0.204, p = 0.839; Cohen’s d = -0.032). We reran our main analysis, excluding those participants who were “not at all” comfortable (1) and also excluding those who were “not at all” (1) or “somewhat” comfortable (2). Neither sample restriction changes the results in a substantive way (Excluding participants who reported a score of 1 gives Risk (Gain): Mean difference = -0.033, CI(-0.100, 0.034); t(136) = -0.973, p = 0.333; Cohen’s d = -0.166; Testosterone Change: Mean difference = -4.728, CI(-11.229, 1.773); t(134) = -1.438, p = 0.153; Cohen’s d = -0.247; Cortisol: Mean difference = -0.024, CI(-0.088, 0.040); t(134) = -0.737, p = 0.463; Cohen’s d = -0.126. Excluding participants who reported a score of 1 or 2 gives Risk (Gain): Mean difference = -0.105, CI(-0.332, 0.122); t(68) = -0.922, p = 0.360; Cohen’s d = -0.222; Testosterone Change: Mean difference = -5.503, CI(-16.536, 5.530); t(66) = -0.996, p = 0.323; Cohen’s d = -0.243; Cortisol: Mean difference = -0.045, CI(-0.144, 0.053); t(66) = -0.921, p = 0.360; Cohen’s d = -0.225). Thus, including only those participants who report having been “quite comfortable” (3) or “very comfortable” (4) does not change our results.

Also, each of the two positions was held for 3 min each (so not one for 6 min).

So, yes, the two studies differed, but there’s no particular reason to believe that the 1-minute intervention would have a larger effect than the 3-minute intervention. Indeed, we’d typically think a longer treatment would have a larger effect.
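As a sanity check on the summary statistics quoted above: for a two-sample comparison, Cohen's d can be recovered from the t statistic as d = t·sqrt(1/n1 + 1/n2). Assuming the 161 comfort-rating participants split roughly 80/81 across conditions (my assumption; the quoted material reports only the t statistic and its degrees of freedom):

```python
import math

def cohens_d_from_t(t, n1, n2):
    # two-sample Cohen's d recovered from the t statistic
    return t * math.sqrt(1 / n1 + 1 / n2)

d = cohens_d_from_t(-0.204, 80, 81)  # matches the reported d = -0.032
```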

Again, remember the time-reversal heuristic: Ranehill et al. did a large controlled study and found no effect of pose on hormones. Carney et al. did a small uncontrolled study and found a statistically significant comparison. This is not evidence in favor of the hypothesis that Carney et al. found something real; rather, it’s evidence consistent with zero effects.

Dreber added:

In our study, we actually wanted to see whether power posing “worked” – we thought that if we find effects, we can figure out some other fun studies related to this, so in that sense we were not out “to get” Carney et al. That is, we did not do any modifications in the setup that we thought would kill the original result.

Indeed, lots of people seem to miss this point, that if you really care about a topic, you’d want to replicate it and remove all doubt. When a researcher expresses the idea that replication, data sharing, etc., is some sort of attack, I think that betrays an attitude or a fear that the underlying effect really isn’t there. If it were there, you’d want to see it replicated over and over. A strong anvil need not fear the hammer. And it’s the insecure researchers who feel the need for bravado such as “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

P.S. I wrote the above post close to a year ago, well before the recent fuss over replication trolls or whatever it was that we were called. In the meantime, Tom Bartlett wrote a long news article about the whole power pose story, so you can go there for background if all this stuff is new to you.

To know the past, one must first know the future: The relevance of decision-based thinking to statistical analysis

We can break up any statistical problem into three steps:

1. Design and data collection.

2. Data analysis.

3. Decision making.

It’s well known that step 1 typically requires some thought of steps 2 and 3: It is only when you have a sense of what you will do with your data, that you can make decisions about where, when, and how accurately to take your measurements. In a survey, the plans for future data analysis influence which background variables to measure in the sample, whether to stratify or cluster; in an experiment, what pre-treatment measurements to take, whether to use blocking or multilevel treatment assignment; and so on.

The relevance of step 3 to step 2 is perhaps not so well understood. It came up in a recent thread following a comment by Nick Menzies. In many statistics textbooks (including my own), the steps of data analysis and decision making are kept separate: we first discuss how to analyze the data, with the general goal being the production of some (probabilistic) inferences that can be piped into any decision analysis.

But your decision plans may very well influence your analysis. Here are two ways this can happen:

– Precision. If you know ahead of time you only need to estimate a parameter to within an uncertainty of 0.1 (on some scale), say, and you have a simple analysis method that will give you this precision, you can just go simple and stop. This sort of thing occurs all the time.

– Relevance. If you know that a particular variable is relevant to your decision making, you should not sweep it aside, even if it is not statistically significant (or, to put it Bayesianly, even if you cannot express much certainty in the sign of its coefficient). For example, the problem that motivated our meta-analysis of effects of survey incentives was a decision of whether to give incentives to respondents in a survey we were conducting, the dollar value of any such incentive, and whether to give the incentive before or after the survey interview. It was important to keep all these variables in the model, even if their coefficients were not statistically significant, because the whole purpose of our study was to estimate these parameters. This is not to say that one should use simple least squares: another impact of the anticipated decision analysis is to suggest parts of the analysis where regularization and prior information will be particularly crucial.

Conversely, a variable that is not relevant to decisions could be excluded from the analysis (possibly for reasons of cost, convenience, or stability), in which case you’d interpret inferences as implicitly averaging over some distribution of that variable.
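Both points can be made concrete with toy calculations (all numbers here are invented for illustration). For precision, the required sample size falls straight out of the target standard error; for relevance, a normal prior keeps a noisy but decision-relevant coefficient in the model by shrinking it toward zero rather than dropping it:

```python
import math

def n_for_precision(sigma, target_se):
    # smallest n with standard error sigma / sqrt(n) <= target_se
    return math.ceil((sigma / target_se) ** 2)

def shrunk_coefficient(beta_hat, se, prior_sd):
    # posterior mean of beta for beta ~ Normal(0, prior_sd^2) and
    # beta_hat ~ Normal(beta, se^2): partial pooling toward zero,
    # not all-or-nothing selection by statistical significance
    w = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return w * beta_hat

n_for_precision(1.0, 0.1)          # 100 observations suffice; stop there
shrunk_coefficient(1.0, 0.8, 0.5)  # "non-significant" (t = 1.25) but kept, shrunken
```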

Frank Harrell statistics blog!

Frank Harrell, author of an influential book on regression modeling and currently both a biostatistics professor and a statistician at the Food and Drug Administration, has started a blog. He sums up “some of his personal philosophy of statistics” here:

Statistics needs to be fully integrated into research; experimental design is all important

Don’t be afraid of using modern methods

Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs

Don’t assume that anything operates linearly

Account for model uncertainty and avoid it when possible by using subject matter knowledge

Use the bootstrap routinely

Make the sample size a random variable when possible

Use Bayesian methods whenever possible

Use excellent graphics, liberally

To be trustworthy research must be reproducible

All data manipulation and statistical analysis must be reproducible (one ramification being that I advise against the use of point and click software in most cases)
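Some of these points ("use the bootstrap routinely") are easy to act on directly. Here is a minimal percentile-bootstrap confidence interval, as a sketch:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    # percentile bootstrap: resample with replacement, take empirical quantiles
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(data) for _ in data]) for _ in range(n_boot))
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2))]
```

Because `stat` is a parameter, the same few lines give intervals for medians, correlations, or any other statistic, which is much of the appeal of Harrell's advice.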

Harrell continues:

Statistics has multiple challenges today, which I [Harrell] break down into three major sources:

1. Statistics has been and continues to be taught in a traditional way, leading to statisticians believing that our historical approach to estimation, prediction, and inference was good enough.

2. Statisticians do not receive sufficient training in computer science and computational methods, too often leaving those areas to others who get so good at dealing with vast quantities of data that they assume they can be self-sufficient in statistical analysis and not seek involvement of statisticians. Many persons who analyze data do not have sufficient training in statistics.

3. Subject matter experts (e.g., clinical researchers and epidemiologists) try to avoid statistical complexity by “dumbing down” the problem using dichotomization, and statisticians, always trying to be helpful, fail to argue the case that dichotomization of continuous or ordinal variables is almost never an appropriate way to view or analyze data. Statisticians in general do not sufficiently involve themselves in measurement issues.

No evidence of incumbency disadvantage?


Several years ago I learned that the incumbency advantage in India was negative! There, the politicians are so unpopular that when they run for reelection they’re actually at a disadvantage, on average, compared to fresh candidates.

At least, that’s what I heard.

But Andy Hall and Anthony Fowler just wrote a paper claiming that, no, there’s no evidence for negative incumbency advantages anywhere. Hall writes,

We suspect the existing evidence is the result of journals’ preference for “surprising” results. Since positive incumbency effects have been known for a long time, you can’t publish “just another incumbency advantage” paper anymore, but finding a counterintuitive disadvantage seems more exciting.

And here’s how their paper begins:

Scholars have long studied incumbency advantages in the United States and other advanced democracies, but a recent spate of empirical studies claims to have identified incumbency disadvantages in other, sometimes less developed, democracies including Brazil, Colombia, India, Japan, Mexico, and Romania. . . . we reassess the existing evidence and conclude that there is little compelling evidence of incumbency disadvantage in any context so far studied. Some of the incumbency disadvantage results in the literature arise from unusual specifications and are not statistically robust. Others identify interesting phenomena that are conceptually distinct from what most scholars would think of as incumbency advantage/disadvantage. For example, some incumbency disadvantage results come from settings where incumbents are not allowed to run for reelection. . . .

Interesting. I’ve not looked at their paper in detail but one thing I noticed is that a lot of these cited papers seem to have been estimating the incumbent party advantage, which doesn’t seem to me to be the same as the incumbency advantage as it’s usually understood. This discontinuity thing seems like a classic example of looking for the keys under the lamppost. I discussed the problems with that approach several years ago in this 2005 post, which I never bothered to write up as a formal article. Given that these estimates are still floating around, I kinda wish I had.
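For readers new to the design being criticized: these discontinuity studies estimate incumbent-party effects by comparing outcomes just above and below the electoral cutoff. A toy local-linear version (simulated data and invented numbers; real implementations worry far more about bandwidth and specification choices):

```python
import random

def ols_line(xs, ys):
    # least-squares fit of y = a + b*x; returns (a, b)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rd_estimate(margins, outcomes, bandwidth=0.2):
    # local linear regression on each side of the cutoff at 0;
    # the jump between the two intercepts is the estimated effect
    left = [(v, y) for v, y in zip(margins, outcomes) if -bandwidth <= v < 0]
    right = [(v, y) for v, y in zip(margins, outcomes) if 0 <= v <= bandwidth]
    a_left, _ = ols_line([v for v, _ in left], [y for _, y in left])
    a_right, _ = ols_line([v for v, _ in right], [y for _, y in right])
    return a_right - a_left
```

The "keys under the lamppost" worry is that this identifies an effect only for races near the cutoff, which need not be the incumbency advantage as usually understood.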

Stan JSS paper out: “Stan: A probabilistic programming language”

As a surprise welcome to 2017, our paper on how the Stan language works along with an overview of how the MCMC and optimization algorithms work hit the stands this week.

  • Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76(1).

The authors are the developers at the time the first revision was submitted. We now have quite a few more developers. Because of that, we'd still prefer that people cite the manual, authored by the development team collectively, rather than this paper, which lists only some of our current developers.

The original motivation for writing a paper was that Wikipedia rejected our attempts at posting a Stan Wikipedia page without a proper citation.

I’d like to thank Achim Zeileis at JSS for his patience and help during the final wrap up.


Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
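The "platform for computing log densities and their gradients" idea from the abstract can be illustrated with a pure-Python toy (my sketch, not Stan's implementation: Stan computes exact gradients by reverse-mode automatic differentiation rather than finite differences):

```python
def log_posterior(theta, y, prior_sd=10.0):
    # log p(theta | y) up to a constant, for the toy model
    # y_i ~ Normal(theta, 1), theta ~ Normal(0, prior_sd)
    lp = -0.5 * (theta / prior_sd) ** 2
    return lp + sum(-0.5 * (yi - theta) ** 2 for yi in y)

def log_posterior_grad(theta, y, prior_sd=10.0, eps=1e-6):
    # central finite difference; Stan would return the exact gradient via autodiff
    return (log_posterior(theta + eps, y, prior_sd)
            - log_posterior(theta - eps, y, prior_sd)) / (2 * eps)
```

Exposing exactly this pair of functions, for arbitrary user-written models, is what lets external algorithms (variational methods, alternative samplers) plug into Stan.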


@article{carpenter2017stan,
   author = {Bob Carpenter and Andrew Gelman and Matthew Hoffman
             and Daniel Lee and Ben Goodrich and Michael Betancourt
             and Marcus Brubaker and Jiqiang Guo and Peter Li
             and Allen Riddell},
   title = {Stan: {A} Probabilistic Programming Language},
   journal = {Journal of Statistical Software},
   volume = {76},
   number = {1},
   year = {2017}
}
Further reading

Check out the Papers about Stan section of the Stan Citations web page. There’s more info on our autodiff and on how variational inference works and a link to the original NUTS paper. And of course, don’t miss Michael’s latest if you want to understand HMC and NUTS, A conceptual introduction to HMC.

Problems with “incremental validity” or more generally in interpreting more than one regression coefficient at a time

Kevin Lewis points us to this interesting paper by Jacob Westfall and Tal Yarkoni entitled, “Statistically Controlling for Confounding Constructs Is Harder than You Think.” Westfall and Yarkoni write:

A common goal of statistical analysis in the social sciences is to draw inferences about the relative contributions of different variables to some outcome variable. When regressing academic performance, political affiliation, or vocabulary growth on other variables, researchers often wish to determine which variables matter to the prediction and which do not—typically by considering whether each variable’s contribution remains statistically significant after statistically controlling for other predictors. When a predictor variable in a multiple regression has a coefficient that differs significantly from zero, researchers typically conclude that the variable makes a “unique” contribution to the outcome. . . .

Incremental validity claims pervade the social and biomedical sciences. In some fields, these claims are often explicit. To take the present authors’ own field of psychology as an example, a Google Scholar search for the terms “incremental validity” AND psychology returns (in January 2016) over 18,000 hits—nearly 500 of which contained the phrase “incremental validity” in the title alone. More commonly, however, incremental validity claims are implicit—as when researchers claim that they have statistically “controlled” or “adjusted” for putative confounds—a practice that is exceedingly common in fields ranging from epidemiology to econometrics to behavioral neuroscience (a Google Scholar search for “after controlling for” and “after adjusting for” produces over 300,000 hits in each case). The sheer ubiquity of such appeals might well give one the impression that such claims are unobjectionable, and if anything, represent a foundational tool for drawing meaningful scientific inferences.

Wow—what an excellent start! They’re right. We see this reasoning so often. Yes, it is generally not appropriate to interpret regression coefficients this way—see, for example, “Do not control for post-treatment variables,” section 9.7 of my book with Jennifer—and things get even worse when you throw statistical significance into the mix. But researchers use this fallacious reasoning because it fulfills a need, or a perceived need, which is to disentangle their causal stories.

Westfall and Yarkoni continue:

Unfortunately, incremental validity claims can be deeply problematic. As we demonstrate below, even small amounts of error in measured predictor variables can result in extremely poorly calibrated Type 1 error probabilities.

Ummmm, I don’t like that whole Type 1 error thing. It’s the usual story: I don’t think there are zero effects, so I think it’s just a mistake overall to be saying that some predictors matter and some don’t.

That said, for people who are working in that framework, I think Westfall and Yarkoni have an important message. They say in mathematics, and with several examples, what Jennifer and I alluded to, which is that even if you control for pre-treatment variables, you have to worry about latent variables you haven’t controlled for. As they put it, there can (and will) be “residual confounding.”
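Westfall and Yarkoni's residual-confounding point is easy to demonstrate by simulation (this sketch and its numbers are mine, not theirs): when the outcome depends only on a confounder, controlling for a noisy measurement of that confounder still leaves a clearly nonzero partial coefficient on the predictor:

```python
import random

def slope(xs, ys):
    # least-squares slope of y on x (with intercept)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

def residuals(xs, ys):
    b = slope(xs, ys)
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def partial_slope(y, x, control):
    # Frisch-Waugh: coefficient of x after "controlling for" control
    return slope(residuals(control, x), residuals(control, y))

random.seed(3)
n = 5000
c = [random.gauss(0, 1) for _ in range(n)]      # true confounder
x = [ci + random.gauss(0, 1) for ci in c]       # predictor driven by c
y = [ci + random.gauss(0, 1) for ci in c]       # outcome driven ONLY by c
c_obs = [ci + random.gauss(0, 1) for ci in c]   # noisy measurement of c

b_true = partial_slope(y, x, c)      # ~0: controlling for the true confounder works
b_obs = partial_slope(y, x, c_obs)   # ~1/3: residual confounding survives
```

With these settings the measurement has reliability 1/2, and the spurious partial coefficient converges to 1/3, not zero; no amount of data makes it go away.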

So I’ll quote them one more time:

The traditional approach of using multiple regression to support incremental validity claims is associated with extremely high false positive rates under realistic parameter regimes.


They also say, “the problem has a principled solution: inferences about the validity of latent constructs should be supported by latent-variable statistical approaches that can explicitly model measurement unreliability,” which seems reasonable enough. That said, I can’t go along with their recommendation that researchers “adopt statistical approaches like SEM”—that seems to often just make things worse! I say Yes to latent variable models but No to approaches which are designed to tease out things that just can’t be teased (as in the “affective priming” example discussed here).

I am sympathetic to Westfall and Yarkoni’s goal of providing solutions, not just criticism—but in this case I think the solutions are further away than they seem to believe, and that part of the solution will be to abandon some of researchers’ traditional goals.

“A Conceptual Introduction to Hamiltonian Monte Carlo”

Michael Betancourt writes:

Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important.

In this review I [Betancourt] provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

This is great stuff. He has 38 figures! Read the whole thing.
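For a taste of what the paper explains, here is a bare-bones HMC transition in pure Python (a pedagogical sketch only: no adaptation, no NUTS, fixed step size and path length, all values invented):

```python
import math
import random

def leapfrog(q, p, grad_U, eps, n_steps):
    # leapfrog integrator for Hamiltonian dynamics: volume-preserving and
    # reversible, which is why HMC needs only a Metropolis correction
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def hmc_step(q, log_p, grad_U, eps=0.1, n_steps=20, rng=random):
    # one HMC transition targeting exp(log_p), with U(q) = -log_p(q)
    p0 = rng.gauss(0, 1)                       # resample momentum
    q_new, p_new = leapfrog(q, p0, grad_U, eps, n_steps)
    h_old = -log_p(q) + 0.5 * p0 ** 2
    h_new = -log_p(q_new) + 0.5 * p_new ** 2
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new                           # accept the proposal
    return q                                   # reject: stay put
```

The paper's contribution is precisely the geometric intuition for why these few lines explore difficult posteriors so much more efficiently than random-walk methods.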

I wish Mike’s paper had existed 25 years ago, as it contains more sophisticated and useful versions of various intuitions that my colleagues and I had to work so hard to develop when working on the .234 paper (on optimal acceptance rates for random-walk Metropolis).