
When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version of Copaxone and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). The FDA rejected Teva’s petition and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to try to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.

What appeared at first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria,” together with the FDA’s statement that reanalysis with standard practices found different results from those found by Teva. So I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA—both empirical Bayes tools.

Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement with respect to the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.
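(A quick aside for readers unfamiliar with these tools: limma-style empirical Bayes moderates per-gene variance estimates by shrinking them toward a pooled value, which stabilizes test statistics when each gene has only a few replicates. Below is a minimal sketch of that shrinkage step, a generic illustration of the technique and not the actual Teva or FDA analysis; the prior weight d0 and the simulated data are arbitrary choices.)

```python
import numpy as np

def moderated_variances(s2, d, s2_prior, d0):
    """Shrink per-gene sample variances s2 (each with d degrees of
    freedom) toward a prior variance s2_prior carrying d0 prior
    degrees of freedom, as in limma-style moderated t-statistics."""
    return (d0 * s2_prior + d * s2) / (d0 + d)

rng = np.random.default_rng(0)
d = 3  # few replicates per gene, so raw variance estimates are noisy
s2 = rng.chisquare(d, size=1000) / d  # raw estimates of a true variance of 1
s2_mod = moderated_variances(s2, d, s2_prior=s2.mean(), d0=4.0)

# Shrinkage leaves the average variance estimate unchanged but makes
# the estimates far less dispersed, which stabilizes the resulting
# t-statistics when replicates are scarce.
print(np.var(s2), np.var(s2_mod))
```

The point of the question survives the details: a moderated test can have more power than a gene-by-gene t-test, so reverting to the older test can weaken the inference.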

My correspondent included some relevant documentation:

The FDA docket files are available at !docketBrowser;rpp=25;po=0;dct=SR;D=FDA-2015-P-1050

The text below is from the April 15, 2015 FDA Denial Letter to Teva (content/uploads/2016/07/Citizen_Petition_Denial_Letter_From_CDER_to_Teva_Pharmaceuticals.pdf) at pp. 41-42

Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

The image below is from the Teva Petition, July 2, 2014 at p. 60


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.

Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

The celebrated medical-research reformer has a new paper (sent to me by Keith O’Rourke; official published version here), where he writes:

As EBM [evidence-based medicine] became more influential, it was also hijacked to serve agendas different from what it originally aimed for. Influential randomized trials are largely done by and for the benefit of the industry. Meta-analyses and guidelines have become a factory, mostly also serving vested interests. National and federal research funds are funneled almost exclusively to research with little relevance to health outcomes. We have supported the growth of principal investigators who excel primarily as managers absorbing more money.

He continues:

Diagnosis and prognosis research and efforts to individualize treatment have fueled recurrent spurious promises. Risk factor epidemiology has excelled in salami-sliced data-dredged papers with gift authorship and has become adept at dictating policy from spurious evidence. Under market pressure, clinical medicine has been transformed to finance-based medicine. In many places, medicine and health care are wasting societal resources and becoming a threat to human well-being. Science denialism and quacks are also flourishing and leading more people astray in their life choices, including health.

And concludes:

EBM still remains an unmet goal, worthy to be attained.

Read the whole damn thing.

Going beyond confidence intervals

Anders Lamberg writes:

In an article by Tom Siegfried (Science News, July 3, 2014), “Scientists’ grasp of confidence intervals doesn’t inspire confidence,” you are cited: “Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not any one individual interval.”

I have some simple questions that I hope you can answer. I am not a statistician but a biologist with only basic education in statistics. My company works with surveillance of populations of salmon in Norwegian rivers, and we have developed methods for counting all individuals in populations. We have moved from using estimates acquired from samples to actually counting all individuals in the populations. This is possible because the salmon migrate between the ocean and the rivers and often have to pass narrow parts of the rivers, where we use underwater video cameras to cover the whole cross section. In this way we “see” every individual and can categorize size, sex, etc. Another argument for counting all individuals is that our Atlantic salmon populations rarely exceed 3000 individuals (average of approx. 500), in contrast to Pacific salmon populations, where numbers are more in the range of 100,000 to more than a million.

In Norway we also have a large salmon farming industry where salmon are held in net pens in the sea. The problem is that these fish, which have been artificially selected for over 10 generations, are a threat to the natural populations if they escape and breed with the wild salmon. There is a concern that the “natural gene pool” will be diluted. That is only background for my questions, although the nature of the statistical problem is general to all sampling.

Here is the statistical problem: In a breeding population of salmon in a river there may be escapees from the fish farms. It is important to know the proportion of farmed escapees. If it exceeds 5% in a given population, measures should be taken to reduce the number of farmed salmon in that river. But how can we find the real proportion of farmed salmon in a river? The method used for over 30 years now is to sample approximately 60 salmon from each river and count how many wild and how many farmed salmon are in that sample. The total population may be 3000 individuals.

Only one sample is taken. A point estimate is calculated, along with a confidence interval for that estimate. In one realistic example we may sample 60 salmon and find that 6 of them are farmed fish. That gives a point estimate of 10% farmed fish in the population of 3000 in that specific river. The 95% confidence interval will be from approximately 2% to 18%. Most commonly, only the point estimate is reported.

When I read your comment in the article cited at the start of this mail, I see that something must be wrong with this sampling procedure. Our confidence interval is linked to the sample and does not necessarily reflect the “real value” that we are interested in. As I see it now, our point estimate acquired from only one sample does not give us much at all. We should have repeated the sampling procedure many times to get an estimate precise enough to say whether we have passed the limit of 5% farmed fish in that population.

Can we use the one sample of 60 salmon in the example to say anything at all about the proportion of farmed salmon in that river? Can we use the point estimate 10%?

We have asked the government this question, but they reply that the real value most likely lies near the 10% point estimate, since the confidence interval is based on a normal distribution.

Is this correct?

As I see it, the real value does not have to lie within the 95% confidence interval at all. However, if we increase the sample size to something close to the population size, we will get a precise estimate. But what happens when we use small samples and do not repeat?

My reply:

In this case, the confidence intervals seem reasonable enough (under the usual assumption that you are measuring a simple random sample). I suspect the real gains will come from combining estimates from different places and different times. A hierarchical model will allow you to do some smoothing.
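For what it’s worth, the interval in the letter is easy to reproduce. Here’s a quick sketch of the standard normal-approximation calculation (my own illustration, not anything from the Norwegian monitoring program):

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a
    binomial proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# 6 farmed fish out of a sample of 60, as in the letter's example:
lo, hi = wald_interval(6, 60)
print(lo, hi)  # about 0.024 to 0.176, i.e., roughly 2% to 18%
```

Sampling 60 fish from a population of 3000 without replacement would technically call for a finite-population correction of sqrt((3000-60)/(3000-1)), about 0.99, which changes nothing here.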

Here’s an example. Suppose you sample 60 salmon in the same place each year and the numbers of farmed fish you see are 7, 9, 7, 6, 5, 8, 7, 2, 8, 7, … These data are consistent with there being a constant proportion of 10% farmed fish (indeed, I created these particular numbers using rbinom(10,60,.1) in R). On the other hand, if the numbers you see are 8, 12, 9, 5, 3, 11, 8, 0, 11, 9, … then this is evidence for real fluctuations. And of course if you see a series such as 5, 0, 3, 8, 9, 11, 9, 12, …, this is evidence for a trend. So you’d want to go beyond confidence intervals to make use of all that information. There’s actually a lot of work using Bayesian methods in fisheries that might be helpful here.
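The comparison in that example can be made quantitative with a simple dispersion check: under a constant proportion, the year-to-year counts should have variance close to the binomial value n*p*(1-p). A sketch (my illustration; the beta distribution for the fluctuating case is an arbitrary choice with mean 0.1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, years = 60, 0.1, 200  # many years, to make the variances stable

# Constant 10% proportion: counts are binomial, variance near n*p*(1-p) = 5.4
constant = rng.binomial(n, p, size=years)

# Proportion that truly fluctuates year to year (beta with mean 0.1):
# counts are overdispersed relative to the binomial.
p_fluct = rng.beta(2, 18, size=years)
fluctuating = rng.binomial(n, p_fluct)

print(constant.var(ddof=1))     # close to 5.4
print(fluctuating.var(ddof=1))  # substantially larger
```

A hierarchical model does this comparison automatically, partially pooling the yearly estimates according to how much real variation the counts show.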

Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

This article by Tanner Sorensen, Sven Hohenstein, and Shravan Vasishth might be of interest to some of you.

No, Google will not “sway the presidential election”

Grrr, this is annoying. A piece of exaggerated science reporting hit PPNAS and was promoted in Politico, then Kaiser Fung and I shot it down (“Could Google Rig the 2016 Election? Don’t Believe the Hype”) in our Daily Beast column last September.

Then it appeared again this week in a news article in the Christian Science Monitor.

I know Christian Scientists believe in a lot of goofy things but I didn’t know that they’d fall for silly psychology studies.


The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim. So I can’t really get upset about the reporting: if the reporter is not an expert on politics, it can be hard for him to judge what to believe.

Nonetheless, even though it’s not really the reporter’s fault, the whole event saddens me, in that it illustrates how ridiculous hype pays off. The original researchers did a little study which has some value but then they hyped it well beyond any reasonable interpretation (as their results came from a huge, unrealistic manipulation in a highly artificial setting), resulting in a ridiculous claim that Google can sway the presidential election. The hypesters got rewarded for their hype with media coverage. Which of course motivates more hype in the future. It’s a moral hazard.

I talked about this general problem a couple years ago, under the heading, Selection bias in the reporting of shaky research. It goes like this. Someone does a silly study and hypes it up. Some reporters realize right away that it’s ridiculous, others ask around and learn that it makes no sense, and they don’t bother reporting on it. Other reporters don’t know any better—that’s just the way it is, nobody can be an expert on everything—and they report on it. Hence the selection bias: The skeptics don’t waste their time writing about a bogus or over-hyped study; the credulous do. The net result is that the hype continues.

P.S. I edited the above post in response to comments.

Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect of power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.
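For those who haven’t seen type M and type S errors before, they’re easy to compute by simulation. A sketch of the general idea from the Gelman and Carlin paper (the particular effect size and standard error below are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def retrodesign_sim(true_effect, se, sims=100_000):
    """Among estimates that reach statistical significance, how often
    is the sign wrong (type S error rate), and by how much is the
    magnitude exaggerated on average (type M, the exaggeration ratio)?"""
    z = 1.96  # two-sided 5% significance threshold
    est = rng.normal(true_effect, se, size=sims)
    signif = np.abs(est) > z * se
    power = signif.mean()
    type_s = (np.sign(est[signif]) != np.sign(true_effect)).mean()
    type_m = np.abs(est[signif]).mean() / abs(true_effect)
    return power, type_s, type_m

# A small true effect measured noisily: significant estimates are rare,
# sometimes of the wrong sign, and badly exaggerated.
power, type_s, type_m = retrodesign_sim(true_effect=0.1, se=0.35)
print(power, type_s, type_m)  # roughly 0.06, 0.2, and 8
```

With these numbers, a “significant” result overstates the true effect by roughly a factor of eight and has the wrong sign about a fifth of the time, which is the point: the true/false question is far less informative than the sign and magnitude questions.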

That all said, I know that many people like the true/false framework so you can feel free to follow the above link and see what Schönbrodt is doing.

On deck this week

Mon: Moving statistical theory from a “discovery” framework to a “measurement” framework

Tues: Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

Wed: Going beyond confidence intervals

Thurs: Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

Fri: What’s powdery and comes out of a metallic-green cardboard can?

Sat: “The Dark Side of Power Posing”

Sun: “Children seek historical traces of owned objects”

“Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory, which provides a framework to reduce the risks of decisions under uncertainty, with typical frequentist test statistics being examples of controlling errors in the absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models, and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best that can be done—be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” often seems to be used as implying “based on minimization”).

As a hobby, I have tried to expand the armory for decision making under uncertainty with complex models, focusing on extending the reach of decision-theoretic, frequentist methods. Perhaps at some point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision-theoretic framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in settings where asymptotic approximations are not valid, is annoyingly inefficient, since it requires repeated minimizations over randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far in general, and on the proposal of using pointwise mutual information as a test statistic in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.
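One small thing for readers who haven’t run into the term: the pointwise mutual information of a pair (x, theta) is log p(x, theta) / (p(x) p(theta)), positive where the pair is more probable than independence would predict. A toy discrete example (a generic illustration, not Bartels’s construction):

```python
import numpy as np

# Small made-up joint distribution p(x, theta); rows index x, columns
# index theta. PMI is positive where an (x, theta) pair is more
# probable than the product of its marginals.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
px = joint.sum(axis=1, keepdims=True)       # marginal p(x)
ptheta = joint.sum(axis=0, keepdims=True)   # marginal p(theta)
pmi = np.log(joint / (px * ptheta))
print(np.round(pmi, 3))  # diagonal entries positive, off-diagonal negative
```

The average of the PMI over the joint distribution is the mutual information, which is one reason it is a natural building block for a test statistic.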

You can post social science papers on the new SocArxiv

I learned about it from this post by Elizabeth Popp Berman.

The temporary SocArxiv site is here. It is connected to the Open Science Framework, which we’ve heard a lot about in discussions of preregistration.

You can post your papers at SocArxiv right away following these easy steps:

Send an email to the following address(es) from the email account you would like used on the OSF:

For Preprints, email
The format of the email should be as follows:

Preprint Title
Message body
Preprint abstract
Your preprint file (e.g., .docx, PDF, etc.)

It’s super-easy, actually much much easier than submitting to Arxiv. I assume that Arxiv has good reasons for its more elaborate submission process, but for now I found SocArxiv’s no-frills approach very pleasant.

I tried it out by sending a few papers, and it worked just fine. I’m already happy because I was able to upload my hilarious satire article with Jonathan Falk. (Here’s the relevant SocArxiv page.) When I tried to post that article on Arxiv last month, they rejected it as follows:

On Jun 16, 2016, at 12:17 PM, arXiv Moderation wrote:

Your submission has been removed. Our volunteer moderators determined that your article does not contain substantive research to merit inclusion within arXiv. Please note that our moderators are not referees and provide no reviews with such decisions. For in-depth reviews of your work you would have to seek feedback from another forum.

Please do not resubmit this paper without contacting arXiv moderation and obtaining a positive response. Resubmission of removed papers may result in the loss of your submission privileges.

For more information on our moderation policies see:

And the followup:

Dear Andrew Gelman,

Our moderators felt that a follow-up should be made to point out that arXiv only accepts articles that would be refereeable by a conventional publication venue. Submissions that contain inflammatory or fictitious content or that use highly dramatic and mis-representative titles/abstracts/introductions may be removed. Repeated submissions of inflammatory or highly dramatic content may result in the suspension of submission privileges.

This kind of annoyed me because the only reason my article with Falk would not be refereeable by a conventional publication venue is because of all our jokes. Had we played it straight and pretended we were doing real research, we could’ve had a good shot at Psych Science or PPNAS. So we were, in effect, penalized for our honesty in writing a satire rather than a hoax.

As my coauthor put it, the scary thing is how close our silly paper actually is to a publishable article, not how far.

Also, I can’t figure out how Arxiv’s rules were satisfied by this 2015 paper, “It’s a Trap: Emperor Palpatine’s Poison Pill,” which is more fictitious than ours, also includes silly footnotes, etc.

Anyway, I don’t begrudge Arxiv their gatekeeping. Arxiv is great great great, and I’m not at all complaining about their decision not to publish our funny article. Their site, their rules. Indeed, I wonder what will happen if someone decides to bomb SocArxiv with fake papers. At some point, a human will need to enter the loop, no?

For now, though, I think it’s great that there’s a place where everyone can post their social science papers.

Bigmilk strikes again


Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow—I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research?

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.

One-day workshop on causal inference (NYC, Sat. 16 July)

James Savage is teaching a one-day workshop on causal inference this coming Saturday (16 July) in New York, using rstanarm. Here’s a link to the details:

Here’s the course outline:

How do prices affect sales? What is the uplift from a marketing decision? By how much will studying for an MBA affect my earnings? How much might an increase in minimum wages affect employment levels?

These are examples of causal questions. Sadly, they are the sorts of questions that data scientists’ run-of-the-mill predictive models can be ill-equipped to answer.

In this one-day course, we will cover methods for answering these questions, using easy-to-use Bayesian data analysis tools. The topics include:

– Why do experiments work? Understanding the Rubin causal model

– Regularized GLMs; bad controls; souping-up linear models to capture nonlinearities

– Using panel data to control for some types of unobserved confounding information

– ITT, natural experiments, and instrumental variables

– If we have time, using machine learning models for causal inference.

All work will be done in R, using the new rstanarm package.

Lunch, coffee, snacks and materials will be provided. Attendees should bring a laptop with R, RStudio and rstanarm already installed. A limited number of scholarships are available. The course is in no way affiliated with Columbia.

Replin’ ain’t easy: My very first preregistration


I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while—here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. First step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried Arxiv. That took a while too—at first they were thrown off by the paper being incomplete (by necessity, as we wanted to first publish the article with the plan but without the replication results). But they finally posted it.

The Arxiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was not in runnable condition. We and others are now in a position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among supporters of one of the two major parties; that is, we’re plotting estimates of R/(R+D)) as a function of the respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.

From Figure 1:


From Figure 2:


You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem to be tied closer to the data points. It appears as if this is coming from the raw data, which in Figure 2 seem closer to clean monotonic patterns.

My first thought was that this had something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit the model to two random subsets of the Annenberg 2004 data, and the resulting graphs (not shown in the paper) look similar to Figure 2 above; they were still a lot smoother than Figure 1, which shows results from Pew 2008.

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.

About that claim that police are less likely to shoot blacks than whites


Josh Miller writes:

Did you see this splashy NYT headline, “Surprising New Evidence Shows Bias in Police Use of Force but Not in Shootings”?

It actually looks like a cool study overall, with granular data, a ton of legwork, and a rich set of results that extend beyond the attention-grabbing headline that is getting bandied about (sometimes with ill intent). While I do not work on issues of race and crime, I doubt I am alone in thinking that this counterintuitive result is unlikely to be true. The result: whites are as likely as blacks to be shot at in encounters in which lethal force may have been justified? Further, in their taser data, blacks are actually less likely than whites to subsequently be shot by a firearm after being tasered! While it’s true that we are talking about odds ratios for small probabilities, dare I say that the ratios are implausible enough to cue us that something funny is going on? (Blacks are 28-35% less likely to be shot in the taser data; table 5, col. 2, PDF p. 54.) Further, are we to believe that suddenly, when an encounter escalates, the fears and other biases of officers melt away and they become race-neutral? This seems to be inconsistent with findings in other disciplines when it comes to fear and other immediate emotional responses to race (think implicit association tests, fMRI imaging of the amygdala, etc.).

This is not to say we can’t cook up a plausible sounding story to support this result. For example, officers may let their guard down against white suspects, and then, whoops, too late! Now the gun is the only option.

But do we believe this? That depends on how close we are to the experimental ideal of taking equally dangerous suspects, and randomly assigning their race (and culture?), and then seeing if police end up shooting them.

Looking at the paper, it seems like we are far from that ideal. In fact, it appears likely that the white suspects in their sample were actually more dangerous than the black suspects, and therefore more likely to get shot at.

Potential For Bias:

How could this selection bias happen? Well, this headline result comes solely from the Houston data, and for that data, their definition of a “shoot or don’t shoot” situation (my words) is defined as an arrest report that describes an encounter in which lethal force was likely justified. What are the criteria for lethal force to be likely justified? Among other things, for this data, it includes “resisting arrest, evading arrest, and interfering in arrest” (PDF pp. 16-17, actual pp. 14-15; they sample 5% of 16,000 qualifying reports). They also have a separate data set in which the criterion is that a taser was deployed (~5000 incidents). Remember, just to emphasize, these are reports involving encounters that don’t necessarily lead to officer-involved shootings (OIS). Given the presence of exaggerated fears, cultural misunderstandings, and other more nefarious forms of bias, wouldn’t we expect an arrest report to over-apply these descriptors to blacks relative to whites? Wouldn’t we also expect the taser to be over-applied to blacks relative to whites? If so, then won’t this mechanically lower the incidence of shootings of blacks relative to whites in this sample? There are more blacks in the researcher-defined “shoot, or don’t shoot” situation who just shouldn’t be there; they are not as dangerous as the whites, and lethal force was unlikely to be justified (and wasn’t applied in most cases).
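Miller’s mechanism is easy to demonstrate with a toy simulation (mine, not from the paper): give both groups identical latent danger levels and identical shoot-given-danger behavior, but use a laxer inclusion threshold for one group, mimicking over-applied report descriptors. The over-included group’s observed shooting rate drops mechanically:

```python
import random

random.seed(1)

def simulate(n, inclusion_threshold):
    """Simulate n encounters. Each suspect has a latent 'danger' in [0, 1];
    police shoot with probability equal to danger. An encounter enters the
    'lethal force likely justified' sample only if danger exceeds the
    threshold, which we let differ by group to mimic over-applied labels."""
    shots, included = 0, 0
    for _ in range(n):
        danger = random.random()
        if danger > inclusion_threshold:  # encounter makes it into the sample
            included += 1
            if random.random() < danger:  # shooting probability = danger
                shots += 1
    return shots / included

# Identical true behavior for both groups; only the inclusion rule differs.
rate_white = simulate(100_000, inclusion_threshold=0.5)  # stricter criteria
rate_black = simulate(100_000, inclusion_threshold=0.3)  # over-applied labels

print(rate_white, rate_black)  # the over-included group's rate is lower
```

With identical behavior, the within-sample shooting rate is just the mean danger above the threshold, so diluting one group’s sample with less-dangerous encounters lowers its observed rate even though no group is treated differently once in an encounter.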


With this potential selection bias, yet no discussion of it (as far as I can tell), the headline conclusion doesn’t appear to be warranted. Maybe the authors can do a calculation and find that the degree of selection you would need to cause this result is itself implausible? Who knows. But I don’t see how it is justified to spread around this result without checking into this. (This takes nothing away, of course, from the other important results in the paper.)


The analysis for this particular result is reported on PDF pp. 23-25 with the associated table 5 on PDF p. 54. Note that when adding controls, there appear to be power issues. There is a partial control for suspect danger, under “encounter characteristics,” which includes, e.g. whether the suspect attacked, or drew a weapon—interestingly, blacks are 10% more likely to be shot with this control (not significant). The table indicates a control is also added for the taser data, but I don’t know how they could do that, because the taser data has no written narrative.

See here for more on the study from Rajiv Sethi.

And Justin Feldman pointed me to this criticism of his. Feldman summarizes:

Roland Fryer, an economics professor at Harvard University, recently published a working paper at NBER on the topic of racial bias in police use of force and police shootings. The paper gained substantial media attention – a write-up of it became the top viewed article on the New York Times website. The most notable part of the study was its finding that there was no evidence of racial bias in police shootings, which Fryer called “the most surprising result of [his] career”. In his analysis of shootings in Houston, Texas, black and Hispanic people were no more likely (and perhaps even less likely) to be shot relative to whites.

I’m not endorsing Feldman’s arguments but I do want to comment on “the most surprising result of my career” thing. We should all have the capacity for being surprised. Science would go nowhere if we did nothing but confirm our pre-existing beliefs. Buuuuut . . . I feel like I see this reasoning a lot in media presentations of social science: “I came into this study expecting X, and then I found not-X, and the fact that I was surprised is an additional reason to trust my result.” The argument isn’t quite stated that way, but I think it’s implicit, that the surprise factor represents some sort of additional evidence. In general I’m with Miller that when a finding is surprising, we should look at it carefully as this could be an indication that something is missing in the analysis.

P.S. Some people also pointed out this paper by Cody Ross from last year, “A Multi-Level Bayesian Analysis of Racial Bias in Police Shootings at the County-Level in the United States, 2011–2014,” which uses Stan! Ross’s paper begins:

A geographically-resolved, multi-level Bayesian model is used to analyze the data presented in the U.S. Police-Shooting Database (USPSD) in order to investigate the extent of racial bias in the shooting of American civilians by police officers in recent years. In contrast to previous work that relied on the FBI’s Supplemental Homicide Reports that were constructed from self-reported cases of police-involved homicide, this data set is less likely to be biased by police reporting practices. . . .

The results provide evidence of a significant bias in the killing of unarmed black Americans relative to unarmed white Americans, in that the probability of being {black, unarmed, and shot by police} is about 3.49 times the probability of being {white, unarmed, and shot by police} on average. Furthermore, the results of multi-level modeling show that there exists significant heterogeneity across counties in the extent of racial bias in police shootings, with some counties showing relative risk ratios of 20 to 1 or more. Finally, analysis of police shooting data as a function of county-level predictors suggests that racial bias in police shootings is most likely to emerge in police departments in larger metropolitan counties with low median incomes and a sizable portion of black residents, especially when there is high financial inequality in that county. . . .

I’m a bit concerned by maps of county-level estimates because of the problems that Phil and I discussed in our “All maps of parameter estimates are misleading” paper.

I don’t have the energy to look at this paper in detail, but in any case its existence is useful in that it suggests a natural research project of reconciling it with the findings of the other paper discussed at the top of this post. When two papers on the same topic come to such different conclusions, it should be possible to track down where in the data and model the differences are coming from.

P.P.S. Miller points me to this post by Uri Simonsohn that makes the same point (as Miller at the top of the above post).

In their reactions, Miller and Simonsohn do something very important, which is to operate simultaneously on the level of theory and data, not just saying why something could be a problem but also connecting this to specific numbers in the article under discussion.

Of polls and prediction markets: More on #BrexitFail

David “Xbox poll” Rothschild and I wrote an article for Slate on how political prediction markets can get things wrong. The short story is that in settings where direct information is not easily available (for example, in elections where polls are not viewed as trustworthy forecasts, whether because of problems in polling or anticipated volatility in attitudes), savvy observers will deduce predictive probabilities from the prices of prediction markets. This can keep prediction market prices artificially stable, as people are essentially updating them from the market prices themselves.

Long-term, or even medium-term, this should sort itself out: once market participants become aware of this bias (in part from reading our article), they should pretty much correct this problem. Realizing that prediction market prices are only provisional, noisy signals, bettors should start reacting more to the news. In essence, I think market participants are going through three steps:

1. Naive over-reaction to news, based on the belief that the latest poll, whatever it is, represents a good forecast of the election.

2. Naive under-reaction to news, based on the belief that the prediction market prices represent best information (“market fundamentalism”).

3. Moderate reaction to news, acknowledging that polls and prices both are noisy signals.

Before we decided to write that Slate article, I’d drafted a blog post which I think could be useful in that I went into more detail on why I don’t think we can simply take the market prices as correct.

One challenge here is that you can just about never prove that the markets were wrong, at least not just based on betting odds. After all, an event with 4:1 odds against should still occur 20% of the time. Recall that we were even getting people arguing that those Leicester City odds of 5,000:1 were correct, which really does seem like a bit of market fundamentalism.

OK, so here’s what I wrote the other day:

We recently talked about how the polls got it wrong in predicting Brexit. But, really, that’s not such a surprise: we all know that polls have lots of problems. And, in fact, the Yougov poll wasn’t so far off at all (see P.P.P.S. in above-linked post, also recognizing that I am an interested party in that Yougov supports some of our work on Stan).

Just as striking, and also much discussed, is that the prediction markets were off too. Indeed, the prediction markets were more off than the polls: even when polling was showing consistent support for Leave, the markets were holding on to Remain.

This is interesting because in previous elections I’ve argued that the prediction markets were chasing the polls. But here, as with Donald Trump’s candidacy in the primary election, the problem was the reverse: prediction markets were discounting the polls in a way which, retrospectively, looks like an error.

How to think about this? One could follow psychologist Dan Goldstein who, under the heading “Prediction markets not as bad as they appear,” argued that prediction markets are approximately calibrated in the aggregate, and thus you can’t draw much of a conclusion from the fact that, in one particular case, the markets were giving 5:1 odds against an event (Brexit) that actually ended up happening. After all, there are lots of bets out there, and 1/6 of all 5:1 shots should come in.
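Goldstein’s calibration point is just binomial arithmetic. A quick sketch (illustrative numbers, not his): one 5:1 long shot landing is essentially no evidence of miscalibration, whereas a run of such long shots all landing would be.

```python
from math import comb

def implied_prob(odds_against):
    """Implied probability of an event quoted at odds_against:1 against."""
    return 1 / (odds_against + 1)

p = implied_prob(5)  # 5:1 against -> 1/6

def prob_at_least(k, n, p):
    """P(at least k of n independent probability-p events occur): binomial tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A single hit among correctly priced 5:1 bets is unremarkable...
print(prob_at_least(1, 1, p))  # 1/6, about 0.167
# ...but at least 5 of 6 such long shots all landing would be damning.
print(prob_at_least(5, 6, p))  # about 0.0007
```

This is why the Brexit outcome alone can’t convict the markets; the argument in the post instead rests on the extra information about how the odds behaved over time.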

And, indeed, if the only pieces of information available were: (a) the market odds against Brexit winning the vote were 5:1, and (b) Brexit won the vote; then, yes, I’d agree that nothing more could be said. But we actually do have more information.

Let’s start with this graph from Emile Servan-Schreiber, from a post linked to by Goldstein. The graph shows one particular prediction market for the week leading up to the vote:


It’s my impression that the odds offered by other markets looked similar. I’d really like to see the graph over the past several months, but I wasn’t quite sure where to find it, so we’ll go with the one-week time series.

One thing that strikes me is how stable these odds are. I’m wondering if one thing that went on was a feedback mechanism where the betting odds reify themselves.

It goes like this: the polls are in different places, and we all know not to trust the polls, which have notoriously failed in various British elections. But we do watch the prediction markets, which all sorts of experts have assured us capture the wisdom of crowds.

So, serious people who care about the election watch the prediction markets. The markets say 5:1 against Leave. Then there’s other info, the latest poll, and so forth. How to think about this information? Informed people look to the markets. What do the markets say? 5:1. OK, then that’s the odds.

This is not an airtight argument or a closed loop. Of course, real information does intrude upon this picture. But my argument is that prediction markets can stay stable for too long.

In the past, traders followed the polls too closely and sent the prediction markets up and down. But now the opposite is happening. Traders are treating market odds as correct probabilities and not updating enough based on outside information. Belief in the correctness of prediction markets causes them to be too stable.

We saw this with the Trump nomination, and we saw it with Brexit. Initial odds are reasonable, based on whatever information people have. But then when new information comes in, it gets discounted. People are using the current prediction odds as an anchor.

Related to this point is this remark from David Rothschild:

I [Rothschild] am very intrigued by this interplay of polls, prediction markets, and financial markets. We generally accept polls as exogenous, and assume the markets are reacting to the polls and other information. But, with the growth of poll-based forecasting and more robust analytics on the polling before release, there is the possibility that polls (or, at least, what is reported from polls) are influenced by the markets. Markets were assuming that there were two things at play: (1) social-desirability bias to over-report leaving (which we saw in Scotland in 2014), and (2) uncertain voters would break for Stay (which seemed to happen in the polling in the last few days). And, while there was a lot of concern about the turnout of Stay voters (due to Stay voters being younger), the unfortunate assassination of Jo Cox seemed to have assuaged the markets (either by rousing the Stay supporters to vote or tempering the Leave supporters out of voting). Further, the financial markets were, seemingly, even more bullish than the prediction markets in the last few days and hours before the tallies were complete.

I know you guys think I have no filter, but . . .

. . . Someone sent me a juicy bit of news related to one of our frequent blog topics, and I shot back a witty response (or, at least, it seemed witty to me), but I decided not to post it here because I was concerned that people might take it as a personal attack (which it isn’t; I don’t even know the guy).

P.S. I wrote this post a few months ago and posted it for the next available slot, which is now. So you can pretty much forget about guessing what the news item was, as it’s not like it just happened or anything.

P.P.S. The post was going to be bumped again, to December! But this seemed a bit much so I’ll just post it now.

Some insider stuff on the Stan refactor

From the stan-dev list, Bob wrote [and has since added brms based on comments; the * packages are ones that aren’t developed or maintained by the stan-dev team, so we only know what we hear from their authors]:

The bigger picture is this, and you see the stan-dev/stan repo really spans three logical layers:

  math <- language <- algorithms <- services <- pystan
                                             <- rstan   <- rstanarm
                                                        <- rethinking (*)
                                                        <- brms (*)
                                             <- cmdstan <- statastan
                                                        <- matlabstan
                                                        <- stan.jl

What we are trying to do with the services refactor is make a clean services layer between the core interfaces (pystan, rstan, cmdstan) so that these don't have to know anything below the services layer. Ideally, there wouldn't be any calls from pystan, rstan, or cmdstan other than ones to the stan::services namespace. services, on the other hand, is almost certainly going to need to know about things below the algorithms level in language and math.

And Daniel followed up with:

This clarified a lot of things. I think this is what we should do:

  1. Split algorithms and services into their own repos. (Language too, but that's a given.)
  2. Each "route" to calling an algorithm should live in the "algorithms" repo. That is, algorithms should expose a simple function for calling it directly. It'll be a C++ API, but not one that the interfaces use directly.
  3. In "services," we'll have a config object with validation and only a handful of calls that pystan, rstan, cmdstan call. The config object needs to be simple and safe, but I think the pseudocode Bob and I created (which is looking really close to Michael's config object if it were safe) will suffice.

I don't really know what they're talking about but I thought it might be interesting to those of you who don’t usually see software development from the inside.

Retro 1990s post


I have one more for you on the topic of jail time for fraud . . . Paul Alper points us to a news article entitled, “Michael Hubbard, Former Alabama Speaker, Sentenced to 4 Years in Prison.” From the headline this doesn’t seem like such a big deal, just run-of-the-mill corruption that we see all the time, but Alper’s eye was caught by this bit:

His power went almost unquestioned by members of both parties: Even after he was indicted, Mr. Hubbard received all but one vote in the Legislature for his re-election as speaker.

Mr. Hubbard’s problems are only a part of the turmoil in Montgomery these days. The governor, Robert Bentley, is being threatened with impeachment for matters surrounding an alleged affair with a chief adviser, and the State Supreme Court chief justice, Roy S. Moore, who is suspended, has been charged with violating judicial ethics in his orders to probate judges not to issue marriage licenses to same-sex couples.

Wow! The governor, the chief justice of the state supreme court, and all but one member of the legislature.

Back in the Clinton/Gingrich era, I came up with the proposal that every politician be sent to prison for a couple years before assuming office. That way the politician would already know how the other half lived; also, governing would be straightforward without the possibility of jail time hanging over the politician’s head. With the incarceration already in the past, the politician could focus on governing.

“Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.”


Justin Pickett sends along this paper he wrote with Sean Roche:

Data fraud and selective reporting both present serious threats to the credibility of science. However, there remains considerable disagreement among scientists about how best to sanction data fraud, and about the ethicality of selective reporting.

OK, let’s move away from asking scientists. Let’s ask the general public:

The public is arguably the largest stakeholder in the reproducibility of science; research is primarily paid for with public funds, and flawed science threatens the public’s welfare. Members of the public are able to make rapid but meaningful judgments about the morality of different behaviors using moral intuitions.

Pickett and Roche did a couple surveys:

We conducted two studies—a survey experiment with a nationwide convenience sample (N = 821), and a follow-up survey with a representative sample of US adults (N = 964)—to explore public judgments about the morality of data fraud and selective reporting in science.

What did they find?

The public overwhelming judges both data fraud and selective reporting as morally wrong, and supports a range of serious sanctions for these behaviors. Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.

We know from other surveys that people generally feel that, if there’s something they don’t like, it should be illegal. And they are pretty willing to throw wrongdoers into prison. So, in that general sense, this isn’t so surprising. Still, it’s interesting to see it in this particular case.

As Evelyn Beatrice Hall never said, I disapprove of your questionable research practices, but I will defend to the death your right to publish their fruits in PPNAS and have them featured on NPR.

P.S. Just to be clear on this, I’m just reporting on an article that someone sent me. I don’t think people should be sent to prison for data fraud and selective reporting. Not unless they also commit real crimes that are serious.

P.P.S. Best comment comes from Shravan and AJG:

Ask the respondent what you think the consequences should be if
– you commit data fraud
– your coauthor commits data fraud
– your biggest rival commits data fraud
Then average these responses.

On deck this week

Mon: “Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.”

Tues: Some insider stuff on the Stan refactor

Wed: I know you guys think I have no filter, but . . .

Thurs: Bigmilk strikes again

Fri: “Pointwise mutual information as test statistics”

Sat: Some U.S. demographic data at zipcode level conveniently in R

Sun: So little information to evaluate effects of dietary choices

Over at the sister blog, they’re overinterpreting forecasts

Matthew Atkinson and Darin DeWitt write, “Economic forecasts suggest the presidential race should be a toss-up. So why aren’t Republicans doing better?”

Their question arises from a juxtaposition of two apparently discordant facts:

1. “PredictWise gives the Republicans a 35 percent chance of winning the White House.”

2. A particular forecasting model (one of many many that are out there) predicts “The Democratic Party’s popular-vote margin is forecast to be only 0.1 percentage points. . . . a 51 percent probability that the Democratic Party wins the popular vote.”

Thus Atkinson and DeWitt conclude that “the Republican Party is underperforming this model’s prediction by 14 percentage points.” And they go on to explain why.

But I think they’re mistaken—not in their explanations, maybe, but in their implicit assumption that a difference between a 49% chance of winning from a forecast and a 35% chance of winning from a prediction market demands an explanation.

Why do I say this?

First, when you take one particular model as if it represents the forecast, you’re missing a lot of your uncertainty.

Second, you shouldn’t take the probability of a win as if it were an outcome in itself. The difference between a 65% chance of winning and a 51% chance of winning is not 14 percentage points in any real sense; it’s more like a difference of 1% or 2% of the vote. That is, the model predicts a 50/50 vote split, and maybe the markets are predicting 52/48; that’s a 2-percentage-point difference, not 14 percentage points.
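To see why a gap in win probabilities maps back to a small gap in vote share, suppose (my illustrative assumption, not either model’s) that the forecast vote margin is normally distributed with a standard deviation of 2 percentage points. Then the expected margins implied by a 51% and a 65% win probability differ by well under a point:

```python
from math import erf, sqrt

def win_prob(expected_margin, sd=2.0):
    """P(margin > 0) when the forecast margin (in percentage points) is
    Normal(expected_margin, sd). The sd=2.0 is an illustrative choice."""
    return 0.5 * (1 + erf(expected_margin / (sd * sqrt(2))))

def margin_for(prob, sd=2.0, lo=-20.0, hi=20.0):
    """Invert win_prob by bisection: which expected margin gives this
    win probability? (win_prob is monotone increasing in the margin.)"""
    for _ in range(60):
        mid = (lo + hi) / 2
        if win_prob(mid, sd) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(margin_for(0.51))  # about 0.05 points
print(margin_for(0.65))  # about 0.77 points
```

Under this assumed forecast uncertainty, a 14-point gap in win probability corresponds to roughly a 0.7-point shift in expected vote margin; with a larger sd the implied shift grows, but it stays in the 1-2-point range the post describes, nowhere near 14.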

It’s not that Atkinson and DeWitt are wrong to be looking at discrepancies between different forecasts; I just think they’re overinterpreting what is essentially one data point. Forecasts are valuable, but different information is never going to be completely aligned.