
Are self-driving cars 33 times more deadly than regular cars?

Paul Kedrosky writes:

I’ve been mulling the noise over Uber’s pedestrian death.

While there are fewer pedestrian deaths so far from autonomous cars than non-autonomous (one in a few thousand hours, versus 1 every 1.5 hours), there is also, of course, a big difference in rates per passenger-mile. The rate for autonomous cars is now 1 for 3 million passenger miles, while the rate for non-autonomous cars is 1 for every 100 million passenger miles. This raises the obvious question: If the rates are actually the same per passenger mile, what’s the likelihood we would have seen that first autonomous car pedestrian death in the first 3 million passenger-miles?

Initially I wanted to model this as a Poisson distribution, with outbreaks (accidents) randomly distributed through passenger-miles. Then I thought it should be a comparison of proportions. What is the best approach here?

I haven’t checked the above numbers so I’ll take Kedrosky’s word for them, for the purpose of this post.

My quick reply to the above question is that the default model would be exponential waiting time. So if the rate of the process is 1 for every 100 million passenger miles, then the probability of seeing the first death within the first 3 million passenger miles is 1 – exp(-0.03) = 0.03. So, yes, it could happen with some bad luck.
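Here’s the same calculation in R (a quick sketch of the exponential-waiting-time model, using the rates quoted above):

# Probability that the first pedestrian death occurs within the first
# 3 million passenger-miles, if deaths arrive at the non-autonomous rate
# of 1 per 100 million passenger-miles (exponential waiting time):
rate  <- 1 / 1e8             # deaths per passenger-mile
miles <- 3e6
pexp(miles, rate = rate)     # same as 1 - exp(-rate * miles), about 0.03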

Really, though, I don’t think this approach is appropriate to this problem, as the probabilities are changing over time—maybe going up, maybe going down, I’m not really sure. I guess the point is that we could use the observed frequency of 1 per 3 million to get an estimated rate. But this one data point doesn’t tell us so much. In general I’d say we could get more statistical information using precursor events that are less rare—in this case, injuries as well as deaths—but then we could have concerns about reporting bias.

Forking paths said to be a concern in evaluating stock-market trading strategies

Kevin Lewis points us to this paper by Tarun Chordia, Amit Goyal, and Alessio Saretto. I have no disagreement with the substance, but I don’t like their statistical framework with that “false discoveries” thing, as I don’t think there are any true zeros. I believe that most possible trading strategies have very little effect but I doubt the effect is exactly zero, hence I disagree with the premise that there are “type 1 errors.” For the same reason, I don’t like the Bayes factors in the paper. Their whole approach using statistical significance seems awkward to me and I think that in future they’d be able to learn more using multilevel models and forgetting thresholds entirely.

Again, the Chordia et al. paper is fine for what it is; I just think they’re making their life more difficult by using this indirect hypothesis-testing framework, testing hypotheses that can’t be true and oscillating between the two inappropriate extremes of theta=0 and theta being unconstrained. To me, life’s just too short to mess around like that.

Lessons learned in Hell

This post is by Phil. It is not by Andrew.

I’m halfway through my third year as a consultant, after 25 years at a government research lab, and I just had a miserable five weeks finishing a project. The end product was fine — actually really good — but the process was horrible and I put in much more time than I had anticipated. I definitely do not want to experience anything like that again, so I’ve been thinking about what went wrong and what I should do differently in the future. It occurred to me that other people might also learn from my mistakes, so here’s my story.

Continue reading ‘Lessons learned in Hell’ »

Some of the data from the NRA conventions and firearm injuries study

Dave Kane writes:

You wrote about the NRA conventions and firearm injuries study here.

The lead author, Anupam Jena, kindly provided some of the underlying data and a snippet of the code they used to me. You can see it all here.

The data are here.

I [Kane] wrote up a brief analysis; the R Markdown and html files are at Github.

“It’s not just that the emperor has no clothes, it’s more like the emperor has been standing in the public square for fifteen years screaming, I’m naked! I’m naked! Look at me! And the scientific establishment is like, Wow, what a beautiful outfit.”

Somebody pointed Nick Brown to another paper by notorious eating behavior researcher Brian Wansink. Here’s Brown:

I have that one in my collection of PDFs. I see I downloaded it on January 7, 2017, which was 3 days before our preprint went live. Probably I skimmed it and didn’t pay much further attention. I don’t know if my coauthors looked at it. Let’s give it five minutes worth of attention:

1. I notice right off the bat that the first numerical statement in the Method section contains a GRIM inconsistency:
“Data collection took place in 60 distinct FSR ranging from large chains (e.g., AppleBees®, Olive Garden®, Outback Steakhouse®, TGIF®) to small independent places (58.8%).”
58.8% is not possible. 35 out of 60 is 58.33%. 36 out of 60 is 60%.

2. The split of interactions by server gender (female 245, male 250) does not add up to the total of 497 interactions. The split by server BMI does. Maybe they couldn’t determine server gender in two cases. (However, one would expect far fewer servers than interactions. Maybe with the reported ethnic and gender percentage splits of the servers we can work out a plausible number of total servers that match those percentages when correctly rounded. Maybe.)

3. The denominator degrees of freedom for the F statistics in Table 1 are incorrect (N=497 implies df2=496 for the first two, 495 for the third; subtract 2 if the real N is in fact 495 rather than 497).

4. In Table 5, the total observations with low (337) and high (156) BMI servers do not match the numbers (low, 215, high, 280) in Table 2.
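For readers who haven’t seen the GRIM test mentioned in Brown’s first point: it simply asks whether a reported percentage is consistent with any whole-number count out of the stated sample size. Here’s a minimal version in R (my illustration, not Brown’s code):

# GRIM-style check: is a reported percentage consistent with any integer
# count out of n, after rounding to the reported number of digits?
grim_consistent <- function(reported_pct, n, digits = 1) {
  possible <- round(100 * (0:n) / n, digits)
  reported_pct %in% possible
}
grim_consistent(58.8, 60)  # FALSE: no count out of 60 rounds to 58.8%
grim_consistent(58.3, 60)  # TRUE: 35/60 = 58.33%, which rounds to 58.3%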

There are errors right at the surface, and errors all the way through: the underlying scientific model (in which small, seemingly irrelevant manipulations are supposed to have large and consistent effects, a framework which is logically impossible because all these effects could interact with each other), the underlying statistical approach (sifting through data to find random statistically-significant differences which won’t replicate), the research program (in which a series of papers are published, each contradicting something that came before but presented as if they are part of a coherent whole), the details (data that could never have been, incoherent descriptions of data collection protocols, fishy numbers that could never have occurred with any data), all wrapped up in an air of certainty and marketed to the news media, TV audiences, corporations, the academic and scientific establishment, and the U.S. government.

What’s amazing here is not just that someone publishes low-quality research—that happens, journals are not perfect, and even when they make terrible mistakes they’re loath to admit it, as in the notorious case of that econ journal that refused to retract that “gremlins” paper which had nearly as many errors as data points—but that Wansink was, until recently, considered a leading figure in his field. Really kind of amazing. It’s not just that the emperor has no clothes, it’s more like the emperor has been standing in the public square for fifteen years screaming, I’m naked! I’m naked! Look at me! And the scientific establishment is like, Wow, what a beautiful outfit.

A lot of this has to be that Wansink and other social psychology and business-school researchers have been sending a message (that easy little “nudges” can have large and beneficial effects) that many powerful and influential people want to hear. And, until recently, this sort of feel-good message has had very little opposition. Science is not an adversarial field—it’s not like the U.S. legal system where active opposition is built into its processes—but when you have unscrupulous researchers on one side and no opposition on the other, bad things will happen.

P.S. I wrote this post in Sep 2017 and it is scheduled to appear in Mar 2018, by which time Wansink will probably be either president of Cornell University or the chair of the publications board of the Association for Psychological Science.

P.P.S. We’ve been warning Cornell about this one for a while.

The moral hazard of quantitative social science: Causal identification, statistical inference, and policy

A couple people pointed me to this article, “The Moral Hazard of Lifesaving Innovations: Naloxone Access, Opioid Abuse, and Crime,” by Jennifer Doleac and Anita Mukherjee, which begins:

The United States is experiencing an epidemic of opioid abuse. In response, many states have increased access to Naloxone, a drug that can save lives when administered during an overdose. However, Naloxone access may unintentionally increase opioid abuse through two channels: (1) saving the lives of active drug users, who survive to continue abusing opioids, and (2) reducing the risk of death per use, thereby making riskier opioid use more appealing. . . . We exploit the staggered timing of Naloxone access laws to estimate the total effects of these laws. We find that broadening Naloxone access led to more opioid-related emergency room visits and more opioid-related theft, with no reduction in opioid-related mortality. . . . We also find suggestive evidence that broadening Naloxone access increased the use of fentanyl, a particularly potent opioid. . . .

I see three warning signs in the above abstract:

1. The bank-shot reasoning by which it’s argued that a lifesaving drug can actually make things worse. It could be, but I’m generally suspicious of arguments in which the second-order effect is more important than the first-order effect. This general issue has come up before.

2. The unintended-consequences thing, which often raises my hackles. In this case, “saving the lives of active drug users” is a plus, not a minus, right? And I assume it’s an anticipated and desired effect of the law. So it just seems wrong to call this “unintentional.”

3. Picking and choosing of results. For example, “more opioid-related emergency room visits and more opioid-related theft, with no reduction in opioid-related mortality,” but then, “We find the most detrimental effects in the Midwest, including a 14% increase in opioid-related mortality in that region.” If there’s no reduction in opioid-related mortality nationwide, but an increase in the midwest, then there should be a decrease somewhere else, no?

I find it helpful when evaluating this sort of research to go back to the data. In this case the data are at the state-year level (although some of the state-level data seems to come from cities, for reasons that I don’t fully understand.) The treatment is at the state-month level, when a state implements a law that broadens Naloxone access. This appears to have happened in 39 states between 2013 and 2015, so we have N=39 cases. So I guess what I want to see, for each outcome, are a bunch of time series plots showing the data in all 50 states.
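Something like this, say (a sketch only; d is a hypothetical data frame with columns state, year, outcome, and law_year giving the year the state broadened Naloxone access):

library(ggplot2)
# Small multiples: one panel per state, the outcome over time, with a dashed
# vertical line marking when the Naloxone access law took effect.
ggplot(d, aes(x = year, y = outcome)) +
  geom_point() +
  geom_line() +
  geom_vline(aes(xintercept = law_year), linetype = "dashed") +
  facet_wrap(~ state)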

We don’t quite get that but we do get some summaries, for example:

The weird curvy lines are clearly the result of overfitting some sort of non-regularized curves; see here for more discussion of this issue. More to the point, if you take away the lines and the gray bands, I don’t see any patterns at all! Figure 4 just looks like a general positive trend, and figure 8 doesn’t look like anything at all. The discontinuity in the midwest is the big thing—this is the 14% increase mentioned in the abstract to the paper—but, just looking at the dots, I don’t see it.

I’m not saying the conclusions in the linked paper are wrong, but I don’t find the empirical results very compelling, especially given that they’re looking at changes over time, in a dataset where there may well be serious time trends.

On the particular issue of Naloxone, one of my correspondents passes along a reaction from an addiction specialist whose “priors are exceedingly skeptical of this finding (it implies addicts think carefully about Naloxone ‘insurance’ before overdosing, or something).” My correspondent also writes:

Another colleague, who is pre-tenure, requested that I anonymize the message below, which increases my dismay over the whole situation. Somehow both sides have distracted from the paper’s quality by shifting the discussion to the tenor of the discourse, which gives the paper’s analytics a pass.

There’s an Atlantic article on the episode.

Of course there was an overreaction by the harm reduction folks, but if you spend 5 minutes talking to non-researchers in that community, you’d realize how much they are up against and why these econ papers are so troubling.

My main problem remains that their diff-in-diff has all the hallmarks of problematic pre-trends and yet this very basic point has escaped the discussion somehow.

There is a problem that researchers often think that an “identification strategy” (whether it be randomization, or instrumental variables, or regression discontinuity, or difference in difference) gives them watertight inference. An extreme example is discussed here. An amusing example of econ-centrism comes from this quote in the Atlantic article:

“Public-health people believe things that are not randomized are correlative,” says Craig Garthwaite, a health economist at Northwestern University. “But [economists] have developed tools to make causal claims from nonrandomized data.”

It’s not really about economics: causal inference from observational data comes up all the time in other social sciences and also in public health research.

Olga Khazan, the author of the Atlantic article, points out that much of the discussion of the paper has occurred on twitter. I hate twitter; it’s a medium that seems so well suited for thoughtless sloganeering. From one side, you have people emptily saying, “Submit it for peer review and I’ll read what comes from it”—as if peer review is so great. On the other side, you get replies like “This paper uses causal inference, my dude”—not seeming to recognize that ultimately this is an observational analysis and the causal inference doesn’t come for free. I’m not saying blogs are perfect, and you don’t have to tell me about problems with the peer review process. But twitter can bring out the worst in people.

P.S. One more thing: I wish the data were available. It would be easy, right? Just some ascii files with all the data, along with code for whatever models they fit and computations they performed. This comes up all the time, for almost every example we look at. It’s certainly not a problem specific to this particular paper; indeed, in my own work, too, our data are often not so easily accessible. It’s just a bad habit we all fall into, of not sharing our data. We—that is, social scientists in general, including me—should do a better job of this. If a topic is important enough that it merits media attention, if the work could perhaps affect policy, then the data should be available for all to see.

P.P.S. See also this news article by Alex Gertner that expresses skepticism regarding the above paper.

P.P.P.S. Richard Border writes:

After reading your post, I was overly curious how sensitive those discontinuous regression plots were and I extracted the data to check it out. Results are here in case you or your readers are interested.

P.P.P.P.S. One of the authors of the article under discussion has responded, but without details; see here.

Last lines of George V. Higgins

Wonderful Years, Wonderful Years ends with this beautiful quote:

“Everybody gets just about what they want. It’s just, they don’t recognize it, they get it. It doesn’t look the same as what they had in mind.”

The conclusion of Trust:

“What ever doesn’t kill us, makes us strong,” Cobb said.

“Fuck Nietzsche,” Beale said. “He’s never around when you need him.”

Brute force, but funny.

The conclusion of The Judgment of Deke Hunter:

“There are,” Hunter said, “I was talking to Gillis, the day he came in to testify, scared shitless of course, and I was trying to calm him down so he wouldn’t crap his pants and offend the jury, which I guessed he must’ve done anyway. He was worried about where we’re gonna put him, to do the five he got for pleading and talking, if we hook Teddy, and I finally got him a little more relaxed. And he looks at me and he says: ‘Ah, fuck it. What difference does it make if he does? What you lose on the swings,’ he says, Horace, ‘you make up on the merry-go-round.'”

That night, Hunter went home.

The conclusion of Style Versus Substance:

We are all very wise in Boston, and we know a lot of things; many of them are not so.

And, finally, The Friends of Eddie Coyle really is Higgins’s best book, and it has the best last lines:

“Hey, Foss,” the prosecutor said, taking Clark by the arm, “of course it changes. Don’t take it so hard. Some of us die, the rest of us get older, new guys come along, old guys disappear. It changes every day.”

“It’s hard to notice, though,” Clark said.

“It is,” the prosecutor said, “it certainly is.”

The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect

David Allison sent this along:

Press release from original paper: “The dramatic decrease in BMI, although unexpected in this short time frame, demonstrated that the [Shaping Healthy Choices Program] SHCP was effective . . .”

Comment on paper and call for correction or retraction: “. . . these facts show that the analyses . . . are unable to assess the effect of the SHCP, and so conclusions stating that the data demonstrated effects of the SHCP on BMI are unsubstantiated.”

Authors’ response to “A Comment on Scherr et al ‘A Multicomponent, School-Based Intervention, the Shaping Healthy Choices Program, Improves Nutrition-Related Outcomes’.”

From the authors’ response:

We appreciate that Dr David B. Allison, the current Dean and Provost Professor at the Indiana University School of Public Health, and his colleagues [the comment is by Wood, Brown, Li, Oakes, Pavela, Thomas, and Allison] have shown interest in our pilot study. Although we appreciate their expertise, we respectfully submit that they may not be fully familiar with the challenges of designing and implementing community nutrition education interventions in kindergarten through sixth grade. . . .

It is evident that researchers conducting community-based programs are typically faced with limitations of sample size and study design when working with schools. We fully agree that the work we conducted is at a pilot scale. . . .

Given the limitations we had in sample size, we agree it should be viewed as a pilot study. Although this work can be viewed as a pilot study, we submit that it generates hypotheses for future larger-scale multicomponent studies. . . .

Huh? So you’re clear that it’s a pilot study, but you still released a statement saying that your data “demonstrated that [the treatment] was effective”???


There were also specific problems with the analysis in the published paper (see the above-linked comment by Wood et al.) but, really, it’s easier than that. You did a pilot study so don’t go around claiming you’ve demonstrated the treatment was effective.

Also a slightly more subtle point: I think the authors are also wrong when they write that the patterns of statistical significance in their pilot study “generates hypotheses for future larger-scale multicomponent studies.” Lots of people think this sort of thing but they’re mistaken. The problem is that patterns in a pilot study are just too noisy. You might as well be just rolling dice. To take a bunch of data and root around for statistically significant differences—this’ll just lead you in circles.
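To see how little a pilot-sized study can tell you, here’s a quick simulation sketch (my own numbers, not the SHCP data): a small true effect, a small sample, and a look at what the “significant” results would say.

set.seed(123)
n_sims <- 10000
n <- 25             # hypothetical pilot size per arm
true_effect <- 0.1  # small true effect, in sd units
est <- replicate(n_sims, mean(rnorm(n, true_effect, 1)) - mean(rnorm(n, 0, 1)))
se <- sqrt(2 / n)                     # se of the difference when sd = 1
signif <- abs(est) > 1.96 * se
mean(signif)                          # only a few percent of pilots reach "significance"
mean(abs(est[signif])) / true_effect  # and those that do overestimate the effect several-fold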

As the authors write, it is challenging to design and implement community nutrition education interventions in kindergarten through sixth grade. It’s tough that things are so difficult, but that doesn’t mean that you get a break and can make strong claims from noisy data. Science doesn’t work that way.

Why bother?

Why bother to bang on a statistical error published in an obscure journal? Two reasons. First, I assume that these sorts of research claims really can affect policy. Wood et al. are doing a service by writing that letter to the journal, and I wish the authors of the original study would recognize their mistake. Second, I find these sorts of examples to be instructive, both in illustrating statistical misconceptions that people have, and helping us clarify our thinking. For example, it seems so innocuous and moderate to say, “Although this work can be viewed as a pilot study, we submit that it generates hypotheses for future larger-scale multicomponent studies.” But, no, that’s a mistake.

Statistical controversy over “trophy wives”

Aaron Gullickson writes:

I thought you might be interested in this comment (of which I am the author) and response (by Elizabeth McClintock) that just came out in ASR. The subject is about whether beauty and status (e.g. education, income) are exchanged on the marriage market. The reason I thought you might be interested is because of my second critique starting in the “Log-Linear Model Interpretation” section where the error is basically a variation on your article about whether differences between stat sig and not stat sig are themselves stat sig. Basically, the original author used interaction coding to look directly at the effect of men’s beauty and the difference in the effect between men and women and concluded that nothing was statistically significant. However, if you reverse the reference category or use two separate main effects rather than an interaction term, you get a strong effect of beauty for women, which I tend to think is what most people’s priors would be. Figure 1 in the article summarizes all of this. It’s surprising to me how hard it has been to explain this simple concept to some of my colleagues. The author seems similarly confused in the response and believes that I have (incorrectly) estimated a different model rather than re-coded the 1’s and 0’s in the same model.

Gullickson provides further details here. [link updated]

The articles in question are called, “Comments on Conceptualizing and Measuring the Exchange of Beauty and Status” and “Support for Beauty-Status Exchange Remains Illusory,” and the article that started it all is “Beauty and Status: The Illusion of Exchange in Partner Selection?” by Elizabeth McClintock.

Back when I was a student at MIT, there was this expression:

Brains * Beauty = Constant.

This formula ostensibly applied to girls, but in retrospect I think we were talking about ourselves without realizing it.

Anyway, to get back to the above discussion, I’m not really happy with how Gullickson or McClintock are looking at the data.

I have a few problems with what they’re doing. First, it seems pretty clear, from various stories we hear about in the news, that trophy wives do exist. Gullickson writes, “The subject is about whether beauty and status (e.g. education, income) are exchanged on the marriage market.” I can see how people can “exchange” beauty in the sense that if you marry someone beautiful you then get the consumption value of basking in their beauty, and of course you can exchange income. But I don’t quite get how you can exchange education.

But let’s set aside the terminology, and just accept that, by “exchange,” McClintock and Gullickson are just talking about marriages where one partner has more beauty and the other has more income and social status. Fine. But then how can it be a question of “whether” beauty and status are exchanged? Of course they are exchanged, in some numbers. The question is how much. I prefer McClintock’s formulation in terms of “the prevalence of beauty-status exchange.”

But I’m still hung up on the definitions. What is the definition of “beauty-status exchange” being used by these scholars? They disagree on their conclusions but I still can’t figure out what exactly they’re talking about.

McClintock refers to “the claim that individuals (generally women) of relatively high physical attractiveness barter their beauty to attract a partner of higher socioeconomic status,” and that seems pretty clear. But, again, I don’t see how they’re getting to this from their data.

The statistical debate has to do with coefficients being compared in different regression models, and Gullickson has a good point that the apparent interpretation of a coefficient can change, and easily be misunderstood, when flipping variables around going from one model to another. This statistical issue does seem relevant to the substantive questions being asked, but I still feel that a couple of steps needed to be added before I can understand this debate.
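Here’s a toy version of the coding issue (a linear-regression sketch with simulated data, not the log-linear models from the papers). The fitted model is identical either way; what changes is which group’s slope gets printed, and tested, as the “main effect” of beauty.

set.seed(1)
n <- 500
female <- rbinom(n, 1, 0.5)
beauty <- rnorm(n)
status <- 0.05 * beauty + 0.5 * beauty * female + rnorm(n)  # beauty matters mainly for women
fit_men   <- lm(status ~ beauty * female)         # "beauty" row is the slope for men (reference group)
fit_women <- lm(status ~ beauty * I(1 - female))  # "beauty" row is the slope for women
coef(summary(fit_men))["beauty", ]    # small, typically not statistically significant
coef(summary(fit_women))["beauty", ]  # large and significant; same model, different reference category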

P.S. Alessio D’Aquino sent along the above image of two creatures who fit together very well.

Wanna know what happened in 2016? We got a ton of graphs for you.

The paper’s called Voting patterns in 2016: Exploration using multilevel regression and poststratification (MRP) on pre-election polls, it’s by Rob Trangucci, Imad Ali, Doug Rivers, and myself, and here’s the abstract:

We analyzed 2012 and 2016 YouGov pre-election polls in order to understand how different population groups voted in the 2012 and 2016 elections. We broke the data down by demographics and state and found:
• The gender gap was an increasing function of age in 2016.
• In 2016 most states exhibited a U-shaped gender gap curve with respect to education indicating a larger gender gap at lower and higher levels of education.
• Older white voters with less education more strongly supported Donald Trump versus younger white voters with more education.
• Women more strongly supported Hillary Clinton than men, with young and more educated women most strongly supporting Hillary Clinton.
• Older men with less education more strongly supported Donald Trump.
• Black voters overwhelmingly supported Hillary Clinton.
• The gap between college-educated voters and non-college-educated voters was about 10 percentage points in favor of Hillary Clinton.
We display our findings with a series of graphs and maps. The R code associated with this project is available at

There’s a lot here. I mean, a lot. 44 displays, from A:

to Z:

And all sorts of things in between:


The New England Journal of Medicine wants you to “identify a novel clinical finding”

Mark Tuttle writes:

This is worth a mention in the blog.

At least they are trying to (implicitly) reinforce re-analysis and re-use of data.

Apparently, some of the re-use efforts will be published, soon.

My reply: I don’t know enough about medical research to make any useful comments here. But there’s one bit that raises my skepticism: the goal is to “use the data underlying a recent NEJM article to identify a novel clinical finding that advances medical science.”

I’m down on the whole idea that the role of statistics and empirical work is to identify novel findings. Maybe we have too much novelty and not enough reproducibility.

I’m not saying that I think the whole project is a bad idea, just that this aspect of it concerns me.

P.S. A lot more in comments from Dale Lehman, who writes:

This is a challenge I [Lehman] entered and am still mad about. Here are some pertinent details:

1. The NEJM editors had published an anti-sharing editorial which attracted much criticism. They felt pressured to do something that either appeared pro-sharing or actually might move data sharing (from clinical trials) forward. So, they started this Challenge.

2. There were a number of awkward impediments to participating – including the need to get IRB approval (even though the data was anonymized and had already been used in publications) and have an officer at your institution/organization who had financial authority sign off (for what?).

3. 279 teams entered, 143 completed (there was a qualifying round and then a challenge round – ostensibly to make sure that entrants into the latter knew what they were doing enough to be allowed to participate), and 3 winners were selected.

4. I entered but did not win. My own “discovery” was that the results of the more aggressive blood pressure treatment depended greatly on whether or not participants in the trial had missed any of their scheduled visits – particularly if they missed one of the first 3 monthly visits that were in the protocol.

5. Since it appeared to me that compliance with the protocol was important, I was particularly interested in data about noncompliance, so I asked about the data on “adherence to antihypertensive medications” which the protocol said was collected in the trial. I was told that the original publication did not use that data, so I could not have it (so much for “novel” findings).

6. To make matters worse, I subsequently discovered that a different article has been published in a different journal (by some of the same authors) using the very adherence scale data I had asked for.

7. To make matters even worse, I sent a note to the editors complaining about this, and saying that either the authors misled the NEJM or the journal was complicit in this. I got no response.

8. The final winners did some nice work, but 2 of the 3 winners created decision tools (one was an app) providing a rating for a prospective patient as to whether or not more aggressive blood pressure treatment was recommended. I did not (and do not) think this is such a novel finding and it disturbs me that these entries focused on discrete (binary) choices – the uncertainty about the estimated effects disappeared. On the contrary, I submitted a way to view the confidence intervals (yes, sorry I still live in that world) for the primary effects and adverse events simultaneously.

So, yes I am upset by the experience, as were a number of other participants. The conference they held afterwards was also quite interesting – the panel of trial patients were universal in supporting open data sharing and were shocked that researchers were not enthralled by the idea. Of course, I am a sore loser and perhaps that is what all the other disgruntled losers feel. But it is hard to escape the bad taste the whole thing left in my mouth.

When all the dust settles, it may still prove to be a small step forward towards more open sharing of clinical trial data and the difficulties may be due to the hard work of changing established and entrenched ways of doing things. But at this point in time, I don’t feel supportive of such a conclusion.

What are the odds of Trump’s winning in 2020?

Kevin Lewis asks:

What are the odds of Trump’s winning in 2020, given that the last three presidents were comfortably re-elected despite one being a serial adulterer, one losing the popular vote, and one bringing race to the forefront?

My reply:

Serial adulterer, poor vote in previous election, ethnicity . . . I don’t think these are so important. It does seem that parties do better when running for a second term (i.e., reelection) than when running for third term (i.e., a new candidate), but given our sparse data it’s hard to distinguish these three stories:
1. Incumbency advantage: some percentage of voters support the president.
2. Latent variable: given that a candidate wins once, that’s evidence that he’s a strong candidate, hence it’s likely he’ll win again.
3. Pendulum or exhaustion: after a while, voters want a change.

My guess is that the chances in 2020 of the Republican candidate (be it Trump or someone else) will depend a lot on how the economy is growing at the time. This is all with the approximately 50/50 national division associated with political polarization. If the Republican party abandons Trump, that could hurt him a lot. But the party stuck with Trump in 2016 so they very well might in 2020 as well.

I guess I should blog this. Not because I’m telling you anything interesting but because it can provide readers a clue as to how little I really know.

Also, by the time the post appears in March, who knows what will be happening.

What is not but could be if

And if I can remain there I will say – Baby Dee

Obviously this is a blog that loves the tabloids. But as we all know, the best stories are the ones that confirm your own prior beliefs (because those must be true). So I’m focussing on this article in Science that talks about how STEM undergraduate programmes in the US lose gay and bisexual students. This leaky pipeline narrative (that diversity is smaller the further you go in a field because minorities drop out earlier) is pretty common when you talk about diversity in STEM. But this article says that there are now numbers! So let’s have a look…

And when you’re up there in the cold, hopin’ that your knot will hold and swingin’ in the snow…

From the article:

The new study looked at a 2015 survey of 4162 college seniors at 78 U.S. institutions, roughly 8% of whom identified as LGBQ (the study focused on sexual identity and did not consider transgender status). All of the students had declared an intention to major in STEM 4 years earlier. Overall, 71% of heterosexual students and 64% of LGBQ students stayed in STEM. But looking at men and women separately uncovered more complexity. After controlling for things like high school grades and participation in undergraduate research, the study revealed that heterosexual men were 17% more likely to stay in STEM than their LGBQ male counterparts. The reverse was true for women: LGBQ women were 18% more likely than heterosexual women to stay in STEM.

Ok. There’s a lot going on here. First things first, let’s say a big hello to Simpson’s paradox! Although LGBQ people have a lower attainment rate in STEM, it’s driven by men going down and women going up. I think the thing that we can read straight off this is that there are “base rate” problems happening all over the place. (Note that the effect is similar across the two groups and in opposite directions, yet the combined total is fairly strongly aligned with the male effect.) We are also talking about a drop out of around 120 of the 333 LGBQ students in the survey. So the estimate will be noisy.
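Here’s a toy version of the base-rate point (all numbers invented for illustration, not taken from the survey): opposite within-gender effects of similar size, but the overall gap tracks the much larger male group.

retention <- c(het_men = 0.74, lgbq_men = 0.60, het_women = 0.66, lgbq_women = 0.72)
counts    <- c(het_men = 2200, lgbq_men = 200,  het_women = 1600, lgbq_women = 130)
overall <- function(groups) weighted.mean(retention[groups], counts[groups])
overall(c("het_men", "het_women"))    # heterosexual students overall: about 71%
overall(c("lgbq_men", "lgbq_women"))  # LGBQ students overall: about 65%, even though LGBQ women do better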

I’m less worried about forking paths–I don’t think it’s unreasonable to expect the experience to differ across gender. Why? Well there is a well known problem with gender diversity in STEM.  Given that gay women are potentially affected by two different leaky pipelines, it sort of makes sense that the interaction between gender and LGBQ status would be important.

The actual article does better–it’s all done with multilevel logistic regression, which seems like an appropriate tool. There are p-values everywhere, but that’s just life. I struggled from the paper to work out exactly what the model was (sometimes my eyes just glaze over…), but it seems to have been done fairly well.

As with anything however (see also Gayface), the study is only as generalizable as the data set. The survey seems fairly large, but I’d worry about non-response. And, if I’m honest with you, me at 18 would’ve filled out that survey as straight, so there are also some problems there.

My father’s affection for his crowbar collection was Freudian to say the least

So a very shallow read of the paper makes it seem like the stats are good enough. But what if they’re not? Does that really matter?

This is one of those effects that’s anecdotally expected to be true. But more importantly, a lot of the proposed fixes are the types of low-cost interventions that don’t really need to work very well to be “value for money”.

For instance, it’s suggested that STEM departments work to make LGBT+ visibility more prominent (have visible, active inclusion policies). They suggest that people teaching pay attention to diversity in their teaching material.

The common suggestion for the last point is to pay special attention to work by women and under-represented groups in your teaching. This is never a bad thing, but if you’re teaching something very old (like the central limit theorem or differentiation), there’s only so much you can do. The thing that we all have a lot more control over is our examples and exercises. It is a no-cost activity to replace, for example, “Bob and Alice” with “Barbra and Alice” or “Bob and Alex”.

This type of low-impact diversity work signals to students that they are in a welcoming environment. Sometimes this is enough.

A similar example (but further up the pipeline) is that when you’re interviewing PhD students, postdocs, researchers, or faculty, don’t ask the men if they have a wife. Swapping to a gender neutral catch-all (partner) is super-easy. Moreover, it doesn’t force a person who is not in an opposite gender relationship to throw themselves a little pride parade (or, worse, to let the assumption fly because they’re uncertain if the mini-pride parade is a good idea in this context). Partner is a gender-neutral term. They is a gender-neutral pronoun. They’re not hard to use.

These environmental changes are important. In the end, if you value science you need to value diversity. Losing women, racial and ethnic minorities, LGBT+ people, disabled people, and other minorities really means that you are making your talent pool more shallow. A deeper pool leads to better science and creating a welcoming, positive environment is a serious step towards deepening the pool.

In defence of half-arsed activism

Making a welcoming environment doesn’t fix STEM’s diversity problem. There is a lot more work to be done. Moreover, the ideas in the paragraph above may do very little to improve the problem. They are also fairly quiet solutions–no one knows you’re doing these things on purpose. That is, they are half-arsed activism.

The thing is, as much as it’s lovely to have someone loudly on my side when I need it, I mostly just want to feel welcome where I am. So this type of work is actually really important. No one will ever give you a medal, but that doesn’t make it less appreciated.

The other thing to remember is that sometimes half-arsed activism is all that’s left to you. If you’re a student, or a TA, or a colleague, you can’t singlehandedly change your work environment. More than that, if a well-intentioned-but-loud intervention isn’t carefully thought through it may well make things worse. (For example, a proposal at a previous workplace to ensure that all female students (about 400 of them) have a female faculty mentor (about 7 of them) would’ve put a completely infeasible burden on the female faculty members.)

So don’t discount low-key, low-cost, potentially high-value interventions. They may not make things perfect, but they can make things better and maybe even “good enough”.

What We Talk About When We Talk About Bias

Shira Mitchell wrote:

I gave a talk today at Mathematica about NHST in low power settings (Type M/S errors). It was fun and the discussion was great.

One thing that came up is bias from doing some kind of regularization/shrinkage/partial-pooling versus selection bias (confounding, nonrandom samples, etc). One difference (I think?) is that the first kind of bias decreases with sample size, but the latter won’t. Though I’m not sure how comforting that is in small-sample settings. I’ve read this post which emphasizes that unbiased estimates don’t actually exist, but I’m not sure how relevant this is.

I replied that the error is to think that an “unbiased” estimate is a good thing. See p.94 of BDA.

And then Shira shot back:

I think what is confusing to folks is when you use unbiasedness as a principle here, for example here:

Ahhhh, good point! I was being sloppy. One difficulty is that in classical statistics, there are two similar-sounding but different concepts, unbiased estimation and unbiased prediction. For Bayesian inference we talk about calibration, which is yet another way that an estimate can be correct on average.

The point of my above-linked BDA excerpt is that, in some settings, unbiased estimation is not just a nice idea that can’t be done in practice or can be improved in some ways; rather it’s an actively bad idea that leads to terrible estimates. The key is that classical unbiased estimation requires E(theta.hat|theta) = theta for any theta, and, given that some outlying regions of theta are highly unlikely, the unbiased estimate has to be a contortionist in order to get things right for those values.
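A toy simulation of the point (my example, not the one in BDA): when the true effects are concentrated near zero, the classical unbiased estimate has much larger error than a “biased” shrinkage estimate.

set.seed(42)
n_sims <- 1e5
theta <- rnorm(n_sims, 0, 0.5)        # true effects, mostly small
y <- rnorm(n_sims, theta, 1)          # noisy measurements; y itself is the unbiased estimate
shrunken <- y * 0.5^2 / (0.5^2 + 1)   # posterior mean under the normal-normal model
c(rmse_unbiased = sqrt(mean((y - theta)^2)),         # about 1
  rmse_shrunken = sqrt(mean((shrunken - theta)^2)))  # about 0.45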

But in certain settings the idea of unbiasedness is relevant, as in the linked post above where we discuss the problems of selection bias. And, indeed, type M and type S errors are defined with respect to the true parameter values. The key difference is that we’re estimating these errors—these biases—conditional on reasonable values of the underlying parameters. We’re not interested in these biases conditional on unreasonable values of theta.

Subtle point, worth thinking about carefully. Bias is important, but only conditional on reasonable values of theta.

P.S. Thanks to Jaime Ashander for the above picture.

Bob’s talk at Berkeley, Thursday 22 March, 3 pm

It’s at the Institute for Data Science at Berkeley.

And here’s the abstract:

I’ll provide an end-to-end example of using R and Stan to carry out full Bayesian inference for a simple set of repeated binary trial data: Efron and Morris’s classic baseball batting data, with multiple players observed for many at bats; clinical trial, educational testing, and manufacturing quality control problems have the same flavor.

We will consider three models that provide complete pooling (every player is the same), no pooling (every player is independent), and partial pooling (every player is to some degree like every other player). Hierarchical models allow the degree of similarity to be jointly modeled with individual effects, tightening estimates and sharpening predictions compared to the no pooling and complete pooling models. They also outperform empirical Bayes and max marginal likelihood predictively, both of which rely on point estimates of hierarchical parameters (aka “mixed effects”). I’ll show how to fit observed data to make predictions for future observations, estimate event probabilities, and carry out (multiple) comparisons such as ranking. I’ll explain how hierarchical modeling mitigates the multiple comparison problem by partial pooling (and I’ll tie it into rookie of the year effects and sophomore slumps). Along the way, I will show how to evaluate models predictively, preferring those that are well calibrated and make sharp predictions. I’ll also show how to evaluate model fit to data with posterior predictive checks and Bayesian p-values.
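For readers who want to follow along in R, here’s a minimal sketch of the three models using rstanarm, one convenient interface to Stan (my sketch, not Bob’s talk code; d is a hypothetical data frame with columns player, hits, and at_bats):

library(rstanarm)

# Complete pooling: every player shares a single batting ability
fit_pool <- stan_glm(cbind(hits, at_bats - hits) ~ 1,
                     family = binomial(), data = d)

# No pooling: every player gets an independent ability
fit_none <- stan_glm(cbind(hits, at_bats - hits) ~ 0 + player,
                     family = binomial(), data = d)

# Partial pooling: player abilities drawn from a common population distribution
fit_partial <- stan_glmer(cbind(hits, at_bats - hits) ~ (1 | player),
                          family = binomial(), data = d)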

Gaydar and the fallacy of objective measurement

Greggor Mattson, Dan Simpson, and I wrote this paper, which begins:

Recent media coverage of studies about “gaydar,” the supposed ability to detect another’s sexual orientation through visual cues, reveal problems in which the ideals of scientific precision strip the context from intrinsically social phenomena. This fallacy of objective measurement, as we term it, leads to nonsensical claims based on the predictive accuracy of statistical significance. We interrogate these gaydar studies’ assumption that there is some sort of pure biological measure of perception of sexual orientation. Instead, we argue that the concept of gaydar inherently exists within a social context and that this should be recognized when studying it. We use this case as an example of a more general concern about illusory precision in the measurement of social phenomena, and suggest statistical strategies to address common problems.

There’s a funny backstory to this one.

I was going through my files a few months ago and came across an unpublished paper of mine from 2012, “The fallacy of objective measurement: The case of gaydar,” which I didn’t even remember ever writing! A completed article, never submitted anywhere, just sitting in my files.

How can that happen? I must be getting old.

Anyway, I liked the paper—it addresses some issues of measurement that we’ve been talking about a lot lately. In particular, “the fallacy of objective measurement”: researchers took a rich real-world phenomenon and abstracted it so much that they removed its most interesting content. “Gaydar” existed within a social context—a world in which gays were an invisible minority, hiding in plain sight and seeking to be inconspicuous to the general population while communicating with others of their subgroup. How can it make sense to boil this down to the shapes of faces?

Stripping a phenomenon of its social context, normalizing a base rate to 50%, and seeking an on-off decision: all of these can give the feel of scientific objectivity—but the very steps taken to ensure objectivity can remove social context and relevance.

We had some gaydar discussion (also here) on the blog recently and this motivated me to freshen up the gaydar paper, with the collaboration of Mattson and Simpson. I also recently met Michal Kosinski, the coauthor of one of the articles under discussion, and that was helpful too.

You need 16 times the sample size to estimate an interaction than to estimate a main effect

Yesterday I shared the following exam question:

In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

Here’s the solution:

If you have 80% power, then the underlying effect size for the main effect is 2.8 standard errors from zero. That is, the z-score has a mean of 2.8 and standard deviation of 1, and there’s an 80% chance that the z-score exceeds 1.96 (in R, pnorm(2.8, 1.96, 1) = 0.8).

Now to the interaction. The standard error of an interaction is roughly twice the standard error of the main effect, as we can see from some simple algebra:
– The estimate of the main effect is ybar_1 – ybar_2, which has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example.
– The estimate of the interaction is (ybar_1 – ybar_2) – (ybar_3 – ybar_4), which has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). [algebra fixed]

And, from the statement of the problem, we’ve assumed the interaction is half the size of the main effect. So if the main effect is 2.8 on some scale with a se of 1, then the interaction is 1.4 with an se of 2, thus the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.

And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:

> raw <- rnorm(1e6, .7, 1)
> significant <- raw > 1.96
> mean(raw[significant])
[1] 2.4

So, the 10% of results which do appear to be statistically significant give an estimate of 2.4, on average, which is over 3 times higher than the true effect.
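To double-check the algebra, here’s a direct simulation of the 2 x 2 design (a sketch under the assumptions in the problem: equal cell sizes, sd 1 within cells, main effect sized for 80% power, interaction half the size):

set.seed(1)
N <- 1000                      # total sample size, split into four equal cells
se_main <- 2 / sqrt(N)         # standard error of the main effect (from the algebra above)
main <- 2.8 * se_main          # main effect sized to give 80% power
inter <- main / 2              # interaction half the size of the main effect
one_sim <- function() {
  y_tm <- rnorm(N / 4, main + inter / 2, 1)  # treated men
  y_cm <- rnorm(N / 4, 0, 1)                 # control men
  y_tw <- rnorm(N / 4, main - inter / 2, 1)  # treated women
  y_cw <- rnorm(N / 4, 0, 1)                 # control women
  est_main  <- (mean(y_tm) + mean(y_tw)) / 2 - (mean(y_cm) + mean(y_cw)) / 2
  est_inter <- (mean(y_tm) - mean(y_cm)) - (mean(y_tw) - mean(y_cw))
  c(z_main = est_main / se_main, z_inter = est_inter / (2 * se_main))
}
z <- replicate(10000, one_sim())
mean(abs(z["z_main", ]) > 1.96)   # about 0.80
mean(abs(z["z_inter", ]) > 1.96)  # about 0.10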

Take-home point

The most important point here, though, has nothing to do with statistical significance. It’s just this: Based on some reasonable assumptions regarding main effects and interactions, you need 16 times the sample size to estimate an interaction than to estimate a main effect. The reasoning: the interaction is half the size of the main effect and its standard error is twice as large, so its z-score is one-quarter as big; to get that factor of 4 back in precision, you have to cut the standard error by a factor of 4, which means multiplying the sample size by 16.

And this implies a major, major problem with the usual plan of designing a study with a focus on the main effect, maybe even preregistering, and then looking to see what shows up in the interactions. Or, even worse, designing a study, not finding the anticipated main effect, and then using the interactions to bail you out. The problem is not just that this sort of analysis is “exploratory”; it’s that these data are a lot noisier than you realize, so what you think of as interesting exploratory findings could be just a bunch of noise.

I don’t know if all this is in the textbooks, but it should be.

Some regression simulations in R

In response to a comment I did some simulations which I thought were worth including in the main post.
Continue reading ‘You need 16 times the sample size to estimate an interaction than to estimate a main effect’ »

Here’s the title of my talk at the New York R conference, 20 Apr 2018:

The intersection of Graphics and Bayes, a slice of the Venn diagram that’s a lot more crowded than you might realize

And here are some relevant papers:

And here’s the conference website.

Classical hypothesis testing is really really hard

This one surprised me. I included the following question in an exam:

In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

None of the students got any part of this question correct.

In retrospect, the question was too difficult; it had too many parts given that it was an in-class exam, and I can see how it would be tough to figure out all these numbers. But the students didn’t even get close: they had no idea how to start. They had no sense that you can work backward from power to effect size and go from there.
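The first step is a single line of arithmetic, the standard power calculation run in reverse:

# With a two-sided test at the 5% level and 80% power, the true effect must be
# this many standard errors away from zero:
qnorm(0.80) + qnorm(0.975)   # 0.84 + 1.96 = 2.8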

And these were statistics Ph.D. students. OK, they’re still students and they have time to learn. But this experience reminds me, once again, that classical hypothesis testing is really really hard. All these null hypotheses and type 1 and type 2 errors are distractions, and it’s hard to keep your eye on the ball.

I like the above exam question. I’ll put it in our new book, but I’ll need to break it up into many pieces to make it more doable.

P.S. See here for an awesome joke-but-not-really-a-joke solution from an anonymous commenter.

P.P.S. Solution is here.

Reasons for an optimistic take on science: there are not “growing problems with research and publication practices.” Rather, there have been, and continue to be, huge problems with research and publication practices, but we’ve made progress in recognizing these problems.

Javier Benitez points us to an article by Daniele Fanelli, “Is science really facing a reproducibility crisis, and do we need it to?”, published in the Proceedings of the National Academy of Sciences, which begins:

Efforts to improve the reproducibility and integrity of science are typically justified by a narrative of crisis, according to which most published results are unreliable due to growing problems with research and publication practices. This article provides an overview of recent evidence suggesting that this narrative is mistaken, and argues that a narrative of epochal changes and empowerment of scientists would be more accurate, inspiring, and compelling.

My reaction:

Kind of amusing that this was published in the same journal that published the papers on himmicanes, air rage (see also here), and ages ending in 9 (see also here).

But, sure, I agree that there may not be “growing problems with research and publication practices.” There were huge problems with research and publication practices; these problems remain, but there may be some improvement (I hope there is!). What’s happened in recent years is that there’s been a growing recognition of these huge problems.

So, yeah, I’m ok with an optimistic take. Recent ideas in statistical understanding have represented epochal changes in how we think about quantitative science, and blogging and post-publication review represent a new empowerment of scientists. And PNAS itself now admits fallibility in a way that it didn’t before.

To put it another way: It’s not that we’re in the midst of a new epidemic. Rather, there’s been an epidemic raging for a long time, and we’re in the midst of an exciting period where the epidemic has been recognized for what it was, and there are some potential solutions.

The solutions aren’t easy—they don’t just involve new statistics, they primarily involve more careful data collection and a closer connection between data and theory, and both these steps are hard work—but they can lead us out of this mess.

P.S. I disagree with the above-linked article on one point, in that I do think that science is undergoing a reproducibility crisis, and I do think this is a pervasive problem. But I agree that it’s probably not a growing problem. What’s growing is our awareness of the problem, and that’s a key part of the solution, to recognize that we do have a problem and to beware of complacency.

P.P.S. Since posting this I came across a recent article by Nelson, Simmons, and Simonsohn (2018), “Psychology’s Renaissance,” that makes many of the above points. Communication is difficult, though, because nobody cites anybody else. Fanelli doesn’t cite Nelson et al.; Nelson et al. don’t cite my own papers on forking paths, type M errors, and “the winds have changed” (which covers much of the ground of their paper); and I hadn’t been aware of Nelson et al.’s paper until just now, when I happened to run across it in an unrelated search. One advantage of the blog is that we can add relevant references as we hear of them, or in comments.