
Bayesian Inference with Stan for Pharmacometrics Class

Bob Carpenter, Daniel Lee, and Michael Betancourt will be teaching the 3-day class starting on 19 September in Paris. Here is the course outline:

Day 1

Introduction to Bayesian statistics

  • Likelihood / sampling distributions
  • Priors, Posteriors via Bayes’s rule
  • Posterior expectations and quantiles
  • Events as expectations of indicator functions

Introduction to Stan

  • Basic data types
  • Variable declarations
  • Constrained parameters and transforms to unconstrained
  • Program blocks and execution
  • Derived quantities
  • Built-in functions and operators
  • Statements: sampling, assignment, loops, conditionals, blocks
  • How to use Stan within R with RStan

Hands-on examples

Day 2

ODE and PK/PD Modeling

  • Parameters and data to ODEs
  • Non-stiff ODE solver
  • Stiff ODE solver
  • Control parameters and tolerances
  • Coupled ODE systems for sensitivities
  • Elimination half-lives

Inference with Markov chain Monte Carlo

  • Monte Carlo methods and plug-in inference
  • Markov chain Monte Carlo
  • Convergence diagnostics, R-hat, effective sample size
  • Effective sample size vs. number of iterations
  • Plug-in posterior expectations and quantiles
  • Event probability calculations

Hands-on examples

Day 3

Additional Topics in PK/PD Modeling

  • Bolus and infusion dosing
  • Lag time and absorption models
  • Linear versus Michaelis-Menten elimination
  • Hierarchical models for patient-level effects
  • Transit compartment models and time lags
  • Multi-compartment models and varying time scales
  • Joint PK/PD modeling: Bayes vs. “cut”
  • Meta-analysis
  • Formulating informative priors
  • Clinical trial simulations and power calculations

Stan programming techniques

  • Reproducible research practices
  • Probabilistic programming principles
  • Generated quantities for inference
  • Data simulation and model checking
  • Posterior predictive checks
  • Cross-validation and predictive calibration
  • Variable transforms for sampling efficiency
  • Multiple indexing and range slicing
  • Marginalizing discrete parameters
  • Handling missing data
  • Ragged and sparse data structures
  • Identifiability and problematic posteriors
  • Weakly informative priors

If you are in Europe in September, please come and join us. Thanks to Julie Bertrand and France Mentré from Université Paris Diderot for helping us organize the course.

You can register here.

Killer O


Taggert Brooks points to this excellent news article by George Johnson, who reports:

Epidemiologists have long been puzzled by a strange pattern in their data: People living at higher altitudes appear less likely to get lung cancer. . . . The higher you live, the thinner the air, so maybe oxygen is a cause of lung cancer. . . .

But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

Back to the epidemiology. Researchers Kamen Simeonov and Daniel Himmelstein adjusted for a bunch of demographic and medical variables, and then:

After an examination of all these numbers for the residents of 260 counties in the Western United States, situated from sea level to nearly 11,400 feet, one pattern stood out: a correlation between the concentration of oxygen in the air and the incidence of lung cancer. For each 1,000-meter rise in elevation, there were 7.23 fewer lung cancer cases per 100,000 people.

“7.23” . . . that’s a bit overprecise, there’s no way you could know it to this level of accuracy. But I get the general idea.

As Brooks notes, this idea is not new. He links to a 1987 paper by Clarice Weinberg, Kenneth Brown, and David Hoel, who discussed “recent evidence implicating reactive forms of oxygen in carcinogenesis and atherosclerosis” and wrote that “reduced oxygen pressure of inspired air may be protective against certain causes of death.”

The idea has also hit the mass media. For example, from a 2012 article by Michael Corvinus in Cracked (yes, Cracked):

One of the disadvantages of living at higher altitudes is that there’s less oxygen in the air, which can suck for those with respiratory problems. One of the advantages of those places, however, is that … there’s less oxygen in the air. A lack of oxygen makes people’s bodies more efficient, which makes them live longer. . . . Dr. Benjamin Honigman at the University of Colorado School of Medicine theorized that the lower levels of oxygen force the body to become more efficient at distributing that oxygen, activating certain genes that enhance heart function and create new blood vessels for bringing blood to and from the heart, greatly lowering the chances of heart disease.

On deck this week

Mon: Killer O

Tues: More evidence that even top researchers routinely misinterpret p-values

Wed: What makes a mathematical formula beautiful?

Thurs: Fish cannot carry p-values

Fri: Does Benadryl make you senile? Challenges in research communication

Sat: What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Sun: Powerpose update

“Children seek historical traces of owned objects”

Recently in the sister blog:

An object’s mental representation includes not just visible attributes but also its nonvisible history. The present studies tested whether preschoolers seek subtle indicators of an object’s history, such as a mark acquired during its handling. Five studies with 169 children 3–5 years of age and 97 college students found that children (like adults) searched for concealed traces of object history, invisible traces of object history, and the absence of traces of object history, to successfully identify an owned object. Controls demonstrated that children (like adults) appropriately limit their search for hidden indicators when an owned object is visibly distinct. Altogether, these results demonstrate that concealed and invisible indicators of history are an important component of preschool children’s object concepts.

“The Dark Side of Power Posing”

Shravan points us to this post from Jay Van Bavel a couple years ago. It’s an interesting example because Bavel expresses skepticism about the “power pose” hype but he makes the same general mistake of Carney, Cuddy, Yap, and other researchers in this area in that he overreacts to every bit of noise that’s been p-hacked and published.

Here’s Bavel:

Some of the new studies used different analysis strategies than the original paper . . . but they did find that the effects of power posing were replicable, if troubling. People who assume high-power poses were more likely to steal money, cheat on a test and commit traffic violations in a driving simulation. In one study, they even took to the streets of New York City and found that automobiles with more expansive driver’s seats were more likely to be illegally parked. . . .

Dr. Brinol [sic] and his colleagues found that power posing increased self-confidence, but only among participants who already had positive self-thoughts. In contrast, power posing had exactly the opposite effect on people who had negative self-thoughts. . . .

In two studies, Joe Cesario and Melissa McDonald found that power poses only increased power when they were made in a context that indicated dominance. Whereas people who held a power pose while they imagined standing at an executive desk overlooking a worksite engaged in powerful behavior, those who held a power pose while they imagined being frisked by the police actually engaged in less powerful behavior. . . .

In a way I like all this because it shows how the capitalize-on-noise strategy which worked so well for the original power pose authors can also be used to dismantle the whole idea. So that’s cool. But from a scientific point of view, I think there’s so much noise here that any of these interactions could well go in the opposite direction. Not to mention all the unstudied interactions and all the interactions that happened not to be statistically significant in these particular small samples.

I’m not trying to slam Bavel here. The above-linked post was published in 2013, before we were all fully aware of how easy it was for researchers to get statistical significance from noise, even without having to try. Now we know better: just cos some correlation or interaction appears in a sample, we don’t have to think it represents anything in the larger population.

When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version​ of Copaxone​ and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to try to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.

What appeared on first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria,” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA—both empirical Bayes tools.

Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.

My correspondent included some relevant documentation:

The FDA docket files are available at !docketBrowser;rpp=25;po=0;dct=SR;D=FDA-2015-P-1050

The text below is from the April 15, 2015 FDA Denial Letter to Teva (Citizen_Petition_Denial_Letter_From_CDER_to_Teva_Pharmaceuticals.pdf) at pp. 41-42:

​Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

The image below is from the Teva Petition, July 2, 2014, at p. 60.


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.

Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

The celebrated medical-research reformer has a new paper (sent to me by Keith O’Rourke; official published version here), where he writes:

As EBM [evidence-based medicine] became more influential, it was also hijacked to serve agendas different from what it originally aimed for. Influential randomized trials are largely done by and for the benefit of the industry. Meta-analyses and guidelines have become a factory, mostly also serving vested interests. National and federal research funds are funneled almost exclusively to research with little relevance to health outcomes. We have supported the growth of principal investigators who excel primarily as managers absorbing more money.

He continues:

Diagnosis and prognosis research and efforts to individualize treatment have fueled recurrent spurious promises. Risk factor epidemiology has excelled in salami-sliced data-dredged papers with gift authorship and has become adept to dictating policy from spurious evidence. Under market pressure, clinical medicine has been transformed to finance-based medicine. In many places, medicine and health care are wasting societal resources and becoming a threat to human well-being. Science denialism and quacks are also flourishing and leading more people astray in their life choices, including health.

And concludes:

EBM still remains an unmet goal, worthy to be attained.

Read the whole damn thing.

Going beyond confidence intervals

Anders Lamberg writes:

In an article by Tom Sigfried, Science News, July 3 2014, “Scientists’ grasp of confidence intervals doesn’t inspire confidence” you are cited: “Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not any one individual interval.”

I have some simple questions that I hope you can answer. I am not a statistician but a biologist with only a basic education in statistics. My company is working with surveillance of populations of salmon in Norwegian rivers and we have developed methods for counting all individuals in populations. We have moved from using estimates acquired from samples to actually counting all individuals in the populations. This is possible because the salmon migrate between the ocean and the rivers and often have to pass narrow parts of the rivers, where we use underwater video cameras to cover the whole cross section. In this way we “see” every individual and can categorize size, sex, etc. Another argument for counting all individuals is that our Atlantic salmon populations rarely exceed 3000 individuals (average of approx. 500), in contrast to Pacific salmon populations where numbers are more in the range of 100,000 to more than a million.

In Norway we also have a large salmon farming industry where salmon are held in net pens in the sea. The problem is that these fish, which have been artificially selected for over 10 generations, are a threat to the natural populations if they escape and breed with the wild salmon. There is a concern that the “natural gene pool” will be diluted. That was only background for my questions, although the nature of the statistical problem is general for all sampling.

Here is the statistical problem: In a breeding population of salmon in a river there may be escapees from the fish farms. It is important to know the proportion of farmed escapees. If it exceeds 5% in a given population, measures should be taken to reduce the number of farmed salmon in that river. But how can we find the real proportion of farmed salmon in a river? The method used for over 30 years now is to sample approximately 60 salmon from each river and count how many wild and how many farmed salmon are in that sample. The total population may be around 3000 individuals.

Only one sample is taken, and a point estimate is calculated along with a confidence interval for that estimate. In one realistic example we may sample 60 salmon and find that 6 of them are farmed fish. That gives a point estimate of 10% farmed fish in the population of 3000 in that specific river. The 95% confidence interval will be from approximately 2% to 18%. Most commonly, only the point estimate is reported.
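As a side note, the interval quoted here matches what the usual normal-approximation (Wald) interval for a proportion would give. A quick Python sketch of that check (my own, not part of the original exchange):

```python
import math

n, k = 60, 6                  # sample size and number of farmed fish observed
p_hat = k / n                 # point estimate: 0.10
se = math.sqrt(p_hat * (1 - p_hat) / n)     # standard error of a proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# lo ≈ 0.024 and hi ≈ 0.176, i.e., roughly the "2% to 18%" quoted above
```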

When I read your comment in the article cited in the start of this mail, I see that something must be wrong with this sampling procedure. Our confidence interval is linked to the sample and does not necessarily reflect the “real value” that we are interested in. As I see it now our point estimate acquired from only one sample does not give us much at all. We should have repeated the sampling procedure many times to get an estimate that is precise enough to say if we have passed the limit of 5% farmed fish in that population.

Can we use the one sample of 60 salmon in the example to say anything at all about the proportion of farmed salmon in that river? Can we use the point estimate 10%?

We have asked this question to the government, but they reply that it is more likely that the real value lies near the 10% point estimate, since the confidence interval has the shape of a normal distribution.

Is this correct?

As I see it the real value does not have to lie within the 95 % confidence interval at all. However, if we increase the sample size close to the population size, we will get a precise estimate. But, what happens when we use small samples and do not repeat?

My reply:

In this case, the confidence intervals seem reasonable enough (under the usual assumption that you are measuring a simple random sample). I suspect the real gains will come from combining estimates from different places and different times. A hierarchical model will allow you to do some smoothing.

Here’s an example. Suppose you sample 60 salmon in the same place each year and the numbers of farmed fish you see are 7, 9, 7, 6, 5, 8, 7, 2, 8, 7, … These data are consistent with there being a constant proportion of 10% farmed fish (indeed, I created these particular numbers using rbinom(10,60,.1) in R). On the other hand, if the numbers you see are 8, 12, 9, 5, 3, 11, 8, 0, 11, 9, … then this is evidence for real fluctuations. And of course if you see a series such as 5, 0, 3, 8, 9, 11, 9, 12, …, this is evidence for a trend. So you’d want to go beyond confidence intervals to make use of all that information. There’s actually a lot of work done using Bayesian methods in fisheries which might be helpful here.
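The rbinom call in the parenthetical can be mimicked to see how much year-to-year variation pure binomial sampling produces. A sketch in Python rather than R (fresh random numbers, so the counts will differ from the series above):

```python
import numpy as np

rng = np.random.default_rng(7)
counts = rng.binomial(n=60, p=0.10, size=10)  # analogue of rbinom(10, 60, .1)

# Under a constant 10% proportion, the yearly counts should have variance
# near n * p * (1 - p) = 5.4; a much larger sample variance across years
# would suggest real fluctuations rather than sampling noise.
expected_var = 60 * 0.10 * 0.90
observed_var = counts.var(ddof=1)
```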

Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

This article by Tanner Sorensen, Sven Hohenstein, and Shravan Vasishth might be of interest to some of you.

No, Google will not “sway the presidential election”

Grrr, this is annoying. A piece of exaggerated science reporting hit PPNAS and was promoted in Politico, then Kaiser Fung and I shot it down (“Could Google Rig the 2016 Election? Don’t Believe the Hype”) in our Daily Beast column last September.

Then it appeared again this week in a news article in the Christian Science Monitor.

I know Christian Scientists believe in a lot of goofy things but I didn’t know that they’d fall for silly psychology studies.

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim—so, in that sense, I can’t hope for much more. What I really think is that Rosen should’ve read what Kaiser and I wrote, realized our criticisms were valid, and then have not wasted time reporting on the silly claim based on a huge, unrealistic manipulation in a highly artificial setting. But that would’ve involved shelving a promising story idea, and what reporter wants to do that?

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim. So I can’t really get upset about the reporting: if the reporter is not an expert on politics, it can be hard for him to judge what to believe.

Nonetheless, even though it’s not really the reporter’s fault, the whole event saddens me, in that it illustrates how ridiculous hype pays off. The original researchers did a little study which has some value but then they hyped it well beyond any reasonable interpretation (as their results came from a huge, unrealistic manipulation in a highly artificial setting), resulting in a ridiculous claim that Google can sway the presidential election. The hypesters got rewarded for their hype with media coverage. Which of course motivates more hype in the future. It’s a moral hazard.

I talked about this general problem a couple years ago, under the heading, Selection bias in the reporting of shaky research. It goes like this. Someone does a silly study and hypes it up. Some reporters realize right away that it’s ridiculous, others ask around and learn that it makes no sense, and they don’t bother reporting on it. Other reporters don’t know any better—that’s just the way it is, nobody can be an expert on everything—and they report on it. Hence the selection bias: The skeptics don’t waste their time writing about a bogus or over-hyped study; the credulous do. The net result is that the hype continues.

P.S. I edited the above post (striking through some material and replacing with two new paragraphs) in response to comments.

Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect on power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.
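To make the type M / type S idea concrete, here is a small simulation sketch in Python. The true effect and standard error are invented numbers chosen only to illustrate the calculation, not taken from any study:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.35   # hypothetical small effect measured noisily
est = rng.normal(true_effect, se, size=100_000)  # replicated noisy estimates

sig = np.abs(est) > 1.96 * se                    # "statistically significant"
type_s = np.mean(est[sig] < 0)                   # significant, but wrong sign
type_m = np.mean(np.abs(est[sig])) / true_effect # exaggeration ratio
```

With these made-up settings the significant estimates come out with the wrong sign roughly a fifth of the time and overstate the true effect several-fold, which is the point: filtering noisy estimates on significance distorts both sign and magnitude.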

That all said, I know that many people like the true/false framework so you can feel free to follow the above link and see what Schönbrodt is doing.

On deck this week

Mon: Moving statistical theory from a “discovery” framework to a “measurement” framework

Tues: Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

Wed: Going beyond confidence intervals

Thurs: Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

Fri: What’s powdery and comes out of a metallic-green cardboard can?

Sat: “The Dark Side of Power Posing”

Sun: “Children seek historical traces of owned objects”

“Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory that provides a framework to reduce risks of decisions under uncertainty with typical frequentist test statistics being examples for controlling errors in absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best that can be done – be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” seems often to be used as implying “based on minimization”).

As a hobby, I tried to expand the armory for decision making under uncertainty with complex models, focusing on trying to expand the reach of decision-theoretic, frequentist methods. Perhaps at one point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision-theoretic framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in the limit where asymptotics are not valid, is annoyingly inefficient since it requires repeated minimizations of randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.
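To give a concrete sense of the proposal, here is a toy sketch of using pointwise mutual information as a test statistic, under my own simplifying assumptions (a normal model and a discrete prior grid); Bartels’s actual construction may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: x ~ Normal(theta, 1), with a uniform prior over a grid of thetas.
thetas = np.linspace(-3, 3, 121)
log_prior = np.full_like(thetas, -np.log(len(thetas)))

def log_lik(x, theta):
    return -0.5 * (x - theta) ** 2  # log density up to a constant

def pmi(x, theta):
    # pointwise mutual information: log p(x | theta) - log p(x)
    log_px = np.logaddexp.reduce(log_lik(x, thetas) + log_prior)
    return log_lik(x, theta) - log_px

# Calibrate a 5% cutoff under theta = 0 by simulation; a theta is retained
# for an observation whenever its PMI exceeds the cutoff, which yields a
# confidence region without refitting anything.
sims = rng.normal(0.0, 1.0, size=20_000)
cutoff = np.quantile([pmi(x, 0.0) for x in sims], 0.05)
retained = pmi(0.3, 0.0) >= cutoff  # a central observation retains theta = 0
```

The appeal claimed in the letter is that this calibration needs only density evaluations, unlike a likelihood ratio test calibrated by repeatedly re-minimizing over simulated datasets.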

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far, in general, and on the proposal of using pointwise mutual information as a test statistic, in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.

You can post social science papers on the new SocArxiv

I learned about it from this post by Elizabeth Popp Berman.

The temporary SocArxiv site is here. It is connected to the Open Science Framework, which we’ve heard a lot about in discussions of preregistration.

You can post your papers at SocArxiv right away following these easy steps:

Send an email to the following address(es) from the email account you would like used on the OSF:

For Preprints, email
The format of the email should be as follows:

  • Subject: preprint title
  • Message body: preprint abstract
  • Attachment: your preprint file (e.g., .docx, PDF, etc.)

It’s super-easy, actually much much easier than submitting to Arxiv. I assume that Arxiv has good reasons for its more elaborate submission process, but for now I found SocArxiv’s no-frills approach very pleasant.

I tried it out by sending a few papers, and it worked just fine. I’m already happy because I was able to upload my hilarious satire article with Jonathan Falk. (Here’s the relevant SocArxiv page.) When I tried to post that article on Arxiv last month, they rejected it as follows:

On Jun 16, 2016, at 12:17 PM, arXiv Moderation wrote:

Your submission has been removed. Our volunteer moderators determined that your article does not contain substantive research to merit inclusion within arXiv. Please note that our moderators are not referees and provide no reviews with such decisions. For in-depth reviews of your work you would have to seek feedback from another forum.

Please do not resubmit this paper without contacting arXiv moderation and obtaining a positive response. Resubmission of removed papers may result in the loss of your submission privileges.

For more information on our moderation policies see:

And the followup:

Dear Andrew Gelman,

Our moderators felt that a follow up should be made to point out arXiv only accepts articles that would be refereeable by a conventional publication venue. Submissions that contain inflammatory or fictitious content or that use highly dramatic and mis-representative titles/abstracts/introductions may be removed. Repeated submissions of inflammatory or highly dramatic content may result in the suspension of submission privileges.

This kind of annoyed me because the only reason my article with Falk would not be refereeable by a conventional publication venue is because of all our jokes. Had we played it straight and pretended we were doing real research, we could’ve had a good shot at Psych Science or PPNAS. So we were, in effect, penalized for our honesty in writing a satire rather than a hoax.

As my coauthor put it, the scary thing is how close our silly paper actually is to a publishable article, not how far.

Also, I can’t figure out how Arxiv’s rules were satisfied by this 2015 paper, “It’s a Trap: Emperor Palpatine’s Poison Pill,” which is more fictitious than ours, also includes silly footnotes, etc.

Anyway, I don’t begrudge Arxiv their gatekeeping. Arxiv is great great great, and I’m not at all complaining about their decision not to publish our funny article. Their site, their rules. Indeed, I wonder what will happen if someone decides to bomb SocArxiv with fake papers. At some point, a human will need to enter the loop, no?

For now, though, I think it’s great that there’s a place where everyone can post their social science papers.

Bigmilk strikes again


Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow—I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research?

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.

One-day workshop on causal inference (NYC, Sat. 16 July)

James Savage is teaching a one-day workshop on causal inference this coming Saturday (16 July) in New York using RStanArm. Here’s a link to the details:

Here’s the course outline:

How do prices affect sales? What is the uplift from a marketing decision? By how much will studying for an MBA affect my earnings? How much might an increase in minimum wages affect employment levels?

These are examples of causal questions. Sadly, they are the sorts of questions that data scientists’ run-of-the-mill predictive models can be ill-equipped to answer.

In this one-day course, we will cover methods for answering these questions, using easy-to-use Bayesian data analysis tools. The topics include:

– Why do experiments work? Understanding the Rubin causal model

– Regularized GLMs; bad controls; souping-up linear models to capture nonlinearities

– Using panel data to control for some types of unobserved confounding information

– ITT, natural experiments, and instrumental variables

– If we have time, using machine learning models for causal inference.

All work will be done in R, using the new rstanarm package.

Lunch, coffee, snacks and materials will be provided. Attendees should bring a laptop with R, RStudio and rstanarm already installed. A limited number of scholarships are available. The course is in no way affiliated with Columbia.

Replin’ ain’t easy: My very first preregistration


I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while—here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. First step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried Arxiv. That took a while too—at first they were thrown off by the paper being incomplete (by necessity, as we wanted to first publish the article with the plan but without the replication results). But they finally posted it.

The Arxiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was not in runnable condition. We and others are now in position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among the supporters of one of the two major parties; that is, we’re plotting the estimates of R/(R+D)) as a function of respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.

From Figure 1:


From Figure 2:


You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem to be tied more closely to the data points. It appears as if this is coming from the raw data, which in Figure 2 seem closer to clean monotonic patterns.

My first thought was that this was something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit to two random subsets of those Annenberg 2004 data, and the resulting graphs (not shown in the paper) look similar to the Figure 2 shown above; they were still a lot smoother than Figure 1, which shows results from Pew 2008.

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.

About that claim that police are less likely to shoot blacks than whites


Josh Miller writes:

Did you see this splashy NYT headline, “Surprising New Evidence Shows Bias in Police Use of Force but Not in Shootings”?

It actually looks like a cool study overall, with granular data, a ton of legwork, and a rich set of results that extend beyond the attention-grabbing headline that is getting bandied about (sometimes with ill intent). While I do not work on issues of race and crime, I doubt I am alone in thinking that this counterintuitive result is unlikely to be true. The result: whites are as likely as blacks to be shot at in encounters in which lethal force may have been justified? Further, in their taser data, blacks are actually less likely than whites to subsequently be shot by a firearm after being tasered! While it’s true that we are talking about odds ratios for small probabilities, dare I say that the ratios are implausible enough to cue us that something funny is going on? (Blacks are 28-35% less likely to be shot in the taser data; table 5, col. 2, PDF p. 54.) Further, are we to believe that suddenly, when an encounter escalates, the fears and other biases of officers melt away and they become race-neutral? This seems inconsistent with the findings in other disciplines when it comes to fear and other immediate emotional responses to race (think implicit association tests, fMRI imaging of the amygdala, etc.).

This is not to say we can’t cook up a plausible sounding story to support this result. For example, officers may let their guard down against white suspects, and then, whoops, too late! Now the gun is the only option.

But do we believe this? That depends on how close we are to the experimental ideal of taking equally dangerous suspects, and randomly assigning their race (and culture?), and then seeing if police end up shooting them.

Looking at the paper, it seems like we are far from that ideal. In fact, it appears likely that the white suspects in their sample were actually more dangerous than the black suspects, and therefore more likely to get shot at.

Potential For Bias:

How could this selection bias happen? Well, this headline result comes solely from the Houston data, and for that data, their definition of a “shoot or don’t shoot” situation (my words) is an arrest report that describes an encounter in which lethal force was likely justified. What are the criteria for lethal force to be likely justified? Among other things, for this data, it includes “resisting arrest, evading arrest, and interfering in arrest” (PDF pp. 16-17, actual pp. 14-15—they sample 5% of 16,000 qualifying reports). They also have a separate data set in which the criterion is that a taser was deployed (~5000 incidents). Remember, just to emphasize, these are reports involving encounters that don’t necessarily lead to an officer-involved shooting (OIS). Given the presence of exaggerated fears, cultural misunderstandings, and other more nefarious forms of bias, wouldn’t we expect an arrest report to over-apply these descriptors to blacks relative to whites? Wouldn’t we also expect the taser to be over-applied to blacks relative to whites? If so, then won’t this mechanically lower the incidence of shootings of blacks relative to whites in this sample? There are more blacks in the researcher-defined “shoot, or don’t shoot” situation who just shouldn’t be there; they are not as dangerous as the whites, and lethal force was unlikely to be justified (and wasn’t applied in most cases).


With this potential selection bias, yet no discussion of it (as far as I can tell), the headline conclusion doesn’t appear to be warranted. Maybe the authors can do a calculation and find that the degree of selection you would need to cause this result is itself implausible? Who knows. But I don’t see how it is justified to spread this result around without checking into this. (This takes nothing away, of course, from the other important results in the paper.)
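Miller’s mechanism can be illustrated with a toy simulation (my own construction; every number in it is invented for illustration and comes from nowhere in the paper): if low-danger encounters are over-included in one group’s “shoot or don’t shoot” sample, that group’s measured shooting rate falls even when officers’ behavior, conditional on actual danger, is exactly race-neutral.

```python
import random

random.seed(1)

def simulate(n, p_dangerous, p_overinclude):
    """Shooting rate among sampled "shoot / don't shoot" encounters.

    p_dangerous: share of encounters that are truly dangerous
    p_overinclude: chance a non-dangerous encounter is still written up
                   as one where "lethal force was likely justified"
    """
    shots = sampled = 0
    for _ in range(n):
        dangerous = random.random() < p_dangerous
        if dangerous or random.random() < p_overinclude:
            sampled += 1
            # Officers here are race-neutral by construction: shooting
            # happens only in truly dangerous encounters, 30% of the time.
            if dangerous and random.random() < 0.30:
                shots += 1
    return shots / sampled

# Same danger mix and same officer behavior; only the report-writing differs.
rate_white = simulate(100_000, p_dangerous=0.20, p_overinclude=0.05)
rate_black = simulate(100_000, p_dangerous=0.20, p_overinclude=0.15)
print(rate_white, rate_black)  # the over-included group shows a lower rate
```

The point of the sketch is purely mechanical: inflating the denominator with encounters where shooting was never on the table drives the measured rate down, with no difference in behavior at all.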


The analysis for this particular result is reported on PDF pp. 23-25, with the associated table 5 on PDF p. 54. Note that when adding controls, there appear to be power issues. There is a partial control for suspect danger, under “encounter characteristics,” which includes, e.g., whether the suspect attacked or drew a weapon—interestingly, blacks are 10% more likely to be shot with this control (not significant). The table indicates a control is also added for the taser data, but I don’t know how they could do that, because the taser data has no written narrative.

See here for more on the study from Rajiv Sethi.

And Justin Feldman pointed me to this criticism of his. Feldman summarizes:

Roland Fryer, an economics professor at Harvard University, recently published a working paper at NBER on the topic of racial bias in police use of force and police shootings. The paper gained substantial media attention – a write-up of it became the top viewed article on the New York Times website. The most notable part of the study was its finding that there was no evidence of racial bias in police shootings, which Fryer called “the most surprising result of [his] career”. In his analysis of shootings in Houston, Texas, black and Hispanic people were no more likely (and perhaps even less likely) to be shot relative to whites.

I’m not endorsing Feldman’s arguments but I do want to comment on “the most surprising result of my career” thing. We should all have the capacity for being surprised. Science would go nowhere if we did nothing but confirm our pre-existing beliefs. Buuuuut . . . I feel like I see this reasoning a lot in media presentations of social science: “I came into this study expecting X, and then I found not-X, and the fact that I was surprised is an additional reason to trust my result.” The argument isn’t quite stated that way, but I think it’s implicit, that the surprise factor represents some sort of additional evidence. In general I’m with Miller that when a finding is surprising, we should look at it carefully as this could be an indication that something is missing in the analysis.

P.S. Some people also pointed out this paper by Cody Ross from last year, “A Multi-Level Bayesian Analysis of Racial Bias in Police Shootings at the County-Level in the United States, 2011–2014,” which uses Stan! Ross’s paper begins:

A geographically-resolved, multi-level Bayesian model is used to analyze the data presented in the U.S. Police-Shooting Database (USPSD) in order to investigate the extent of racial bias in the shooting of American civilians by police officers in recent years. In contrast to previous work that relied on the FBI’s Supplemental Homicide Reports that were constructed from self-reported cases of police-involved homicide, this data set is less likely to be biased by police reporting practices. . . .

The results provide evidence of a significant bias in the killing of unarmed black Americans relative to unarmed white Americans, in that the probability of being {black, unarmed, and shot by police} is about 3.49 times the probability of being {white, unarmed, and shot by police} on average. Furthermore, the results of multi-level modeling show that there exists significant heterogeneity across counties in the extent of racial bias in police shootings, with some counties showing relative risk ratios of 20 to 1 or more. Finally, analysis of police shooting data as a function of county-level predictors suggests that racial bias in police shootings is most likely to emerge in police departments in larger metropolitan counties with low median incomes and a sizable portion of black residents, especially when there is high financial inequality in that county. . . .

I’m a bit concerned by maps of county-level estimates because of the problems that Phil and I discussed in our “All maps of parameter estimates are misleading” paper.

I don’t have the energy to look at this paper in detail, but in any case its existence is useful in that it suggests a natural research project of reconciling it with the findings of the other paper discussed at the top of this post. When two papers on the same topic come to such different conclusions, it should be possible to track down where in the data and model the differences are coming from.

P.P.S. Miller points me to this post by Uri Simonsohn that makes the same point (as Miller at the top of the above post).

In their reactions, Miller and Simonsohn do something very important, which is to operate simultaneously on the level of theory and data, not just saying why something could be a problem but also connecting this to specific numbers in the article under discussion.

Of polls and prediction markets: More on #BrexitFail

David “Xbox poll” Rothschild and I wrote an article for Slate on how political prediction markets can get things wrong. The short story is that in settings where direct information is not easily available (for example, in elections where polls are not viewed as trustworthy forecasts, whether because of problems in polling or anticipated volatility in attitudes), savvy observers will deduce predictive probabilities from the prices of prediction markets. This can keep prediction market prices artificially stable, as people are essentially updating their beliefs from the market prices themselves.

Long-term, or even medium-term, this should sort itself out: once market participants become aware of this bias (in part from reading our article), they should pretty much correct this problem. Realizing that prediction market prices are only provisional, noisy signals, bettors should start reacting more to the news. In essence, I think market participants are going through three steps:

1. Naive over-reaction to news, based on the belief that the latest poll, whatever it is, represents a good forecast of the election.

2. Naive under-reaction to news, based on the belief that the prediction market prices represent best information (“market fundamentalism”).

3. Moderate reaction to news, acknowledging that polls and prices both are noisy signals.

Before we decided to write that Slate article, I’d drafted a blog post which I think could be useful in that I went into more detail on why I don’t think we can simply take the market prices as correct.

One challenge here is that you can just about never prove that the markets were wrong, at least not just based on betting odds. After all, an event with 4-1 odds against should still occur 20% of the time. Recall that we were even getting people arguing that those Leicester City odds of 5,000-1 were correct, which really does seem like a bit of market fundamentalism.
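To keep the arithmetic behind these claims straight, here’s a quick sketch (my own, not anything from the Slate article) converting “N-to-1 against” betting odds into implied probabilities:

```python
def implied_prob(odds_against):
    """Implied probability of an event quoted at odds_against-to-1 against."""
    return 1.0 / (odds_against + 1.0)

print(implied_prob(4))     # 4-1 against: the event happens 20% of the time
print(implied_prob(5))     # 5:1 on Brexit: about a 1-in-6 chance
print(implied_prob(5000))  # Leicester City at 5,000-1: about 0.0002
```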

OK, so here’s what I wrote the other day:

We recently talked about how the polls got it wrong in predicting Brexit. But, really, that’s not such a surprise: we all know that polls have lots of problems. And, in fact, the Yougov poll wasn’t so far off at all (see P.P.P.S. in above-linked post, also recognizing that I am an interested party in that Yougov supports some of our work on Stan).

Just as striking, and also much discussed, is that the prediction markets were off too. Indeed, the prediction markets were more off than the polls: even when polling was showing consistent support for Leave, the markets were holding on to Remain.

This is interesting because in previous elections I’ve argued that the prediction markets were chasing the polls. But here, as with Donald Trump’s candidacy in the primary election, the problem was the reverse: prediction markets were discounting the polls in a way which, retrospectively, looks like an error.

How to think about this? One could follow psychologist Dan Goldstein who, under the heading, “Prediction markets not as bad as they appear,” argued that prediction markets are approximately calibrated in the aggregate, and thus you can’t draw much of a conclusion from the fact that, in one particular case, the markets were giving 5-1 odds to an event (Brexit) that actually ended up happening. After all, there are lots of bets out there, and 1/6 of all 5:1 shots should come in.

And, indeed, if the only pieces of information available were: (a) the market odds against Brexit winning the vote were 5:1, and (b) Brexit won the vote; then, yes, I’d agree that nothing more could be said. But we actually do have more information.

Let’s start with this graph from Emile Servan-Schreiber, from a post linked to by Goldstein. The graph shows one particular prediction market for the week leading up to the vote:


It’s my impression that the odds offered by other markets looked similar. I’d really like to see the graph over the past several months, but I wasn’t quite sure where to find it, so we’ll go with the one-week time series.

One thing that strikes me is how stable these odds are. I’m wondering if one thing that went on was a feedback mechanism where the betting odds reify themselves.

It goes like this: the polls are in different places, and we all know not to trust the polls, which have notoriously failed in various British elections. But we do watch the prediction markets, which all sorts of experts have assured us capture the wisdom of crowds.

So, serious people who care about the election watch the prediction markets. The markets say 5:1 against Leave. Then there’s other info, the latest poll, and so forth. How to think about this information? Informed people look to the markets. What do the markets say? 5:1. OK, then that’s the odds.

This is not an airtight argument or a closed loop. Of course, real information does intrude upon this picture. But my argument is that prediction markets can stay stable for too long.

In the past, traders followed the polls too closely and sent the prediction markets up and down. But now the opposite is happening. Traders are treating market odds as correct probabilities and not updating enough based on outside information. Belief in the correctness of prediction markets causes them to be too stable.

We saw this with the Trump nomination, and we saw it with Brexit. Initial odds are reasonable, based on whatever information people have. But then when new information comes in, it gets discounted. People are using the current prediction odds as an anchor.
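The anchoring story can be sketched in a few lines (a toy model of my own, not anything from the post): suppose each day traders set the price as a weighted average of the standing market price and the latest poll signal. The more weight they put on the price itself, the less the market moves when the polls shift.

```python
def run_market(poll_signal, weight_on_market, days=10, start=0.17):
    """Price after `days` of traders blending the standing price with polls."""
    price = start
    for _ in range(days):
        price = weight_on_market * price + (1 - weight_on_market) * poll_signal
    return price

# Polls shift to 50% Leave; the market starts at 17%.
anchored = run_market(0.50, weight_on_market=0.95)    # traders trust the price
responsive = run_market(0.50, weight_on_market=0.30)  # traders trust the polls
print(anchored, responsive)  # the anchored market lags well below the polls
```

Under these made-up weights the anchored market stays far from the poll signal after ten rounds, while the responsive one has essentially converged—one way of picturing how “belief in the correctness of prediction markets causes them to be too stable.”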

Related to this point is this remark from David Rothschild:

I [Rothschild] am very intrigued by this interplay of polls, prediction markets, and financial markets. We generally accept polls as exogenous, and assume the markets are reacting to the polls and other information. But, with the growth of poll-based forecasting and more robust analytics on the polling before release, there is the possibility that polls (or, at least, what is reported from polls) are influenced by the markets. Markets were assuming that there were two things at play: (1) social-desirability bias to over-report leaving (which we saw in Scotland in 2014), and (2) uncertain voters would break toward Stay (which seemed to happen in the polling in the last few days). And, while there was a lot of concern about the turnout of Stay voters (due to Stay voters being younger), the unfortunate assassination of Jo Cox seemed to have assuaged the markets (either by rousing the Stay supporters to vote or tempering the Leave supporters out of voting). Further, the financial markets were, seemingly, even more bullish than the prediction markets in the last few days and hours before the tallies were complete.

I know you guys think I have no filter, but . . .

. . . Someone sent me a juicy bit of news related to one of our frequent blog topics, and I shot back a witty response (or, at least, it seemed witty to me), but I decided not to post it here because I was concerned that people might take it as a personal attack (which it isn’t; I don’t even know the guy).

P.S. I wrote this post a few months ago and posted it for the next available slot, which is now. So you can pretty much forget about guessing what the news item was, as it’s not like it just happened or anything.

P.P.S. The post was going to be bumped again, to December! But this seemed a bit much so I’ll just post it now.