
No, I don’t think it’s the file drawer effect

Someone named Andrew Certain writes:

I’ve been reading your blog since your appearance on Econtalk . . . explaining the ways in which statistics are misused/misinterpreted in low-sample/high-noise studies. . . .

I recently came across a meta-analysis on stereotype threat [a reanalysis by Emil Kirkegaard] that identified a clear relationship between smaller sample sizes and higher effect estimates. However, their conclusion [that is, the conclusion of the original paper, The influence of stereotype threat on immigrants: review and meta-analysis, by Markus Appel, Silvana Weber, and Nicole Kronberger] seems to be the following: In order for the effects estimated by the small samples to be spurious, there would have to be a large number of small-sample studies that showed no effect. Since that number is so large, even accounting for the file-drawer effect, and because they can’t find those null-effect studies, the effect size must be large.

Am I misinterpreting their argument? Is it as crazy as it sounds to me?

My reply:

I’m not sure. I didn’t see where in the paper they said that the effect size must be large. But I do agree that something seemed odd in their discussion, in that first they said that there were suspiciously few small-n studies showing small effect size estimates, but then they don’t really do much with that conclusion.

Here’s the relevant bit from the Appel et al. paper:

Taken together, our sampling analysis pointed out a remarkable lack of null effects in small sample studies. If such studies were conducted, they were unavailable to us. A file-drawer analysis showed that the number of studies in support of the null hypothesis that were needed to change the average effect size to small or even to insubstantial is rather large. Thus, we conclude that the average effect size in support of a stereotype threat effect among people with an immigrant background is not severely challenged by potentially existing but unaccounted for studies.

I’m not quite ready to call this “crazy” because maybe I’m just missing something here.

I will say, though, that I expect that “file drawer” is much less of an issue here than “forking paths.” That is, I don’t think there are zillions of completed studies that yielded non-statistically-significant results and then were put away in a file. Rather, I think that researchers manage to find statistically significant comparisons each time, or most of the time, using the data they have. And in that case the whole “count how many hidden studies would have to be in the file drawer” thing is irrelevant.
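Here’s a little simulation sketch of that point (my own made-up numbers, nothing to do with the stereotype-threat data): if researchers run small studies and only the statistically significant comparisons see the light of day, whether through a file drawer or through forking paths, then the published estimates from small samples will be much too large, which is exactly the pattern of small samples going with big effect size estimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.1                    # assumed small true standardized effect
sample_sizes = [20, 50, 200, 800]    # hypothetical per-group sample sizes

for n in sample_sizes:
    published = []
    for _ in range(5000):
        treat = rng.normal(true_effect, 1, n)
        control = rng.normal(0, 1, n)
        t, p = stats.ttest_ind(treat, control)
        if p < 0.05 and t > 0:       # keep only "significant" positive results,
            published.append(treat.mean() - control.mean())  # as if filtered by publication or forking paths
    print(f"n={n:4d}: mean published estimate = {np.mean(published):.2f} "
          f"(true effect = {true_effect})")
```

With a small true effect, the n = 20 studies that clear the significance filter report estimates several times too large, while the n = 800 studies are barely biased. No warehouse of hidden null studies is required to produce the pattern.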

P.S. “Andrew Certain” . . . what a great name! I’d like to be called Andrew Uncertain; that would be a good name for a Bayesian.

Cool tennis-tracking app

Swupnil Sahai writes that he’s developed Swing, “the best app for tracking all of your tennis stats, and maybe we’ll expand to other sports in the future.”

According to Swupnil, the app runs on the Apple Watch, making predictions in real time. I hope in the future they’ll incorporate some hierarchical modeling to deal with sparse-data situations.

In any case, it’s great to see our former students having this kind of success.

It should be ok to just publish the data.

Gur Huberman asked for my reaction to a recent manuscript, Are CEOs Different? Characteristics of Top Managers, by Steven Kaplan and Morten Sorensen. The paper begins:

We use a dataset of over 2,600 executive assessments to study thirty individual characteristics of candidates for top executive positions – CEO, CFO, COO and others. We classify the thirty candidate characteristics with four primary factors: general ability, execution vs. interpersonal, charisma vs. analytic, and strategic vs. managerial. CEO candidates tend to score higher on these factors; CFO candidates score lower. Conditional on being a candidate, executives with greater interpersonal skills are more likely to be hired, suggesting that such skills are important in the selection process. Scores on the four factors also predict future career progression. Non-CEO candidates who score higher on the four factors are subsequently more likely to become CEOs. The patterns are qualitatively similar for public, private equity and venture capital owned companies. We do not find economically large differences in the four factors for men and women. Women, however, are subsequently less likely to become CEOs, holding the four factors constant.

I really don’t know what to do with this sort of thing. On one hand, the selection processes for business managers are worth studying, and these could be valuable data. On the other hand, the whole study looks like a mess: there’s no real sense of seeing all the data at once; rather, it looks like just a bunch of comparisons grabbed from the noise. So I have no real reason to take any of these empirical patterns seriously in the sense of thinking they would generalize beyond this particular dataset. But you have to start somewhere.

It was hard for me to bring myself to read a lot of the article; the whole thing just seemed kinda boring to me. If it were about sports, I’d be interested. But my decision to set the paper aside because it’s boring . . . that brings up a real bias in the dissemination of these sorts of reports. If a paper makes a strong claim, however ridiculous (Managers who wear red are more likely to perform better in years ending in 9!), or some claim with political content (Group X does, or does not, discriminate against group Y) then it’s more likely to get attention, also more likely to attract criticism. It’s just more likely to be talked about.

But, again, my take-home point is I don’t have a good way of thinking about this sort of paper, in which a somewhat interesting dataset is taken and then some regressions and comparisons are made. What I really think, I suppose, is that the academic communication system should be changed so it becomes OK to just publish interesting data, without having to clothe it in regressions and statistical significance. Not that regression is a bad method, it’s just that in this case I suspect the main contribution is putting together the dataset, and there’s no need for these data to be tied to some particular set of analyses.

It was the weeds that bothered him.

Bill Jefferys points to this news article by Denise Grady. Bill noticed the following bit, “In male rats, the studies linked tumors in the heart to high exposure to radiation from the phones. But that problem did not occur in female rats, or any mice,” and asked:

​Forking paths, much?

My reply: The summary of the news article seems reasonable: “But two government studies released on Friday, one in rats and one in mice, suggest that if there is any risk, it is small, health officials said.”

But, yes, later on they get into the weeds: “Scientists do not know why only male rats develop the heart tumors, but Dr. Bucher said one possibility is simply that the males are bigger and absorb more of the radiation.” They didn’t mention the possibility that variations just happen at random, and the fact that a comparison happened to be statistically significant in the data is not necessarily strong evidence that it represents a corresponding pattern in the population.

Bill responded:

Yes, it was the weeds that bothered me.

Overall, then, yes, a good news article.

How feminism has made me a better scientist

Feminism is not a branch of science. It is not a set of testable propositions about the observable world, nor is it any single research method. From my own perspective, feminism is a political movement associated with successes such as votes for women, setbacks such as the failed Equal Rights Amendment, and continuing struggles in areas ranging from reproductive freedom to equality in the workplace. Feminism is also a way of looking at the world through awareness of the social and political struggles of women, historically and in the present. And others will define feminism in other ways.

How is this relevant to the practice of science? As a researcher in statistical methods and applications, I have found feminism to help me do better science. I make this claim based on various experiences in my work.

To start with, statistics is all about learning from data. Statisticians from Florence Nightingale and Francis Galton through Hans Rosling and Nate Silver have discovered unexpected patterns using mathematical modeling and data visualization. What does that have to do with feminism? Feminism is about a willingness to question the dominant, tacitly accepted ideology. This is essential for science. What is labeled “non-ideological” basically means the dominant ideology is accepted without thought. As Angela Davis said in her lecture, Feminism and Abolition, “feminist methodologies impel us to explore connections that are not always apparent. And they drive us to inhabit contradictions and discover what is productive in these contradictions. Feminism insists on methods of thought and action that urge us to think about things together that appear to be separate, and to disaggregate things that appear to naturally belong together.”

Along with questioning the dominant, tacitly accepted ideology is the need to recognize this ideology. This is related to the idea that a valuable characteristic of a scientist is a willingness to be disturbed. We learn from anomalies (see discussion here), which requires recognizing how a given observation or story is anomalous (that is, anomalous with respect to what expectations, exactly?), which in turn is more effective if one can identify the substrates of theories that determine our expectations. An example from our statistical work is our research using the Xbox survey to reveal that apparent swings in public opinion can largely be explained by variations in nonresponse.
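Here’s a toy version of that Xbox point (invented response rates, not the actual data): if supporters of one candidate become temporarily more or less willing to answer surveys, the raw poll numbers swing even though no individual voter changes his or her mind.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed electorate: 50% support candidate A, 50% candidate B; nobody changes.
n_voters = 100_000
supports_A = rng.random(n_voters) < 0.5

# Hypothetical response rates: after good news for A, A's supporters are more
# eager to answer polls; after bad news, less so.
scenarios = {"good week for A": (0.12, 0.08),
             "quiet week":      (0.10, 0.10),
             "bad week for A":  (0.08, 0.12)}

for name, (rate_A, rate_B) in scenarios.items():
    responds = np.where(supports_A,
                        rng.random(n_voters) < rate_A,
                        rng.random(n_voters) < rate_B)
    poll = supports_A[responds].mean()
    print(f"{name:16s}: raw poll says {poll:.1%} support A (truth is 50.0%)")
```

That is the sense in which apparent swings in the polls can be explained by who responds rather than by who changes their mind.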

On a more specific level, feminism can make us aware of work where the male perspective is unthinkingly taken as a baseline. For example, a paper was released a few years ago presenting survey data from the United States showing that parents of girls were more likely to support the Republican party, compared to parents of boys. I’m not completely sure what to make of this finding—for one thing, an analysis a few years ago of data from Britain found the opposite pattern—but here I want to focus on the reception of this research claim. There’s something oddly asymmetrical about how these results were presented, both by the authors and in the media. Consider the following headlines: “The Effect of Daughters on Partisanship and Social Attitudes Toward Women,” “Does Having Daughters Make You More Republican?”, “Parents With Daughters Are More Likely To Be Republicans, Says New Study,” “Parents Of Daughters Lean Republican, Study Shows,” “The Daughter Theory: Does raising girls make parents conservative?” Here’s our question: Why is it all about “the effect of daughters”? Why not, Does having sons make you support the Democrats? This is not just semantics: the male-as-baseline perspective affects the explanations that are given for this pattern: Lots of discussion about how you, as a parent, might change your views of the world if you have a girl. But not so much about how you might change your views if you have a boy.

The fallacy here was that people were reasoning unidirectionally. In this case, the benefit of a feminist perspective is not political but rather just a recognition of multiple perspectives and social biases, recognizing that in this case the boy baby is considered a default. As the saying goes, the greatest trick the default ever pulled was convincing the world it didn’t exist.

To put it another way: it is not that feminism is some sort of superpower that allows one to consider alternatives to the existing defaults, it’s more that these alternatives are obvious and can only not be seen if you don’t allow yourself to look. Feminism is, for this purpose, not a new way of looking at the world; rather, it is a simple removal of blinders. But uncovering blind spots isn’t that simple, and can be quite powerful.

A broader point is that it’s hard to do good social science if you don’t understand the community you’re studying. The lesson from feminism is not just to avoid taking the male perspective for granted but more generally to recognize the perspective of outgroups. An example came up recently with the so-called gaydar study, a much-publicized paper demonstrating the ability of a machine learning algorithm to distinguish gay and straight faces based on photographs from dating sites. This study was hyped beyond any reasonable level. From a statistical perspective, it’s no surprise at all that two groups of people selected from different populations will differ from each other, and if you have samples from two different populations and a large number of cases, then you should be able to train an algorithm to distinguish them at some level of accuracy. The authors of the paper in question went way beyond this, though, claiming that their results “provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.” Also some goofy stuff about the fact that gay men in this particular sample are less likely to have beards. Beyond the purely statistical problems here is a conceptual error, which is to think of “gay faces” as some sort of innate property of gay people, rather than as cues that gays are deliberately sending to each other. The distinctive and noticeable characteristics of the subpopulation are the result of active choices by members of that group, not (as assumed in the paper under discussion) essential attributes derived from “nature” or “nurture.”

Looking from a different direction, feminism can make us suspicious of simplistic gender-essentialist ideas of the sort expressed in various papers that make use of schoolyard evolutionary biology—the idea that, because of evolution, all people are equivalent to all other people, except that all boys are different from all girls. It’s the attitude I remember from the grade school playground, in which any attribute of a person, whether it be how you walked or how you laughed or even how you held your arms when you were asked to look at your fingernails (really), was gender-typed. It’s gender and race essentialism. And when you combine it with what Tversky and Kahneman called “the law of small numbers” (the naive but common attitude that any underlying pattern should reproduce in any small sample), you get endless chasing of noise in data analyses. In short, if you believe this sort of essentialism, you can find it just about anywhere you look, hence the value of a feminist perspective which reminds us of the history and fallacies of gender essentialism. Of course there are lots of systematic differences between boys and girls, and between men and women, that are not directly sex-linked. To be a feminist is not to deny these differences; rather, placing these differences within a larger context, and relating them to past and existing power imbalances, is part of what feminism is about.

Many examples of schoolyard evolutionary biology in published science will be familiar to regular readers of this blog: there’s the beauty-and-sex-ratio study, the ovulation-and-clothing study, the fat-arms-and-political attitudes study (a rare example that objectified men instead of women), the ovulation-and-voting study, and various others. Just to be clear: I’m not saying that the claims in those studies have to be wrong. All these claims, to my eyes, look crudely gender-essentialist, but that doesn’t mean they’re false, or that it was a bad idea to study them. No, all those studies had problems in their statistics (in the sense that poor understanding of statistical methods led researchers and observers to wrongly think that those data presented strong evidence in favor of those claims) and in their measurement (in that the collected data were too sparse and noisy to really have a chance of supplying the desired evidence).

A feminist perspective was helpful to me in unraveling these studies for several reasons. To start with, feminism gave me the starting point of skepticism regarding naive gender essentialism—or, I could say, it helped keep me from being intimidated by weak theorizing coming from that direction. Second, feminism made me aware of multiple perspectives: if someone can hypothesize that prettier parents are more likely to produce girls, I can imagine an opposite just-so story that makes just as much sense. And, indeed, both stories could be true at different times and different places, which brings me to the third thing I bring from feminism: an awareness of variation. There’s no reason to think that a hypothesized effect will be consistent in magnitude or even sign for different people and under different conditions. Understanding this point is a start toward moving away from naive, one might say “scientistic,” views of what can be learned or proved from simple surveys or social experiments.

I consider myself a feminist but I understand that others have different political views, and I’m not trying to say that being a feminist is necessary for doing science. Of course not. Rather, my point is that I think the political and historical insights of feminism have made me a better statistician and scientist.

As I wrote a few years ago:

At some level, in this post I’m making the unremarkable point that each of us has a political perspective which informs our research in positive and negative ways. The reason that this particular example of the feminist statistician is interesting is that it’s my impression that feminism, like religion, is generally viewed as an anti-scientific stance. I think some of this attitude comes from some feminists themselves who are skeptical of science in that it is a generally male-dominated institution that is in part used to continue male dominance of society, and it also comes from people who might say that reality has an anti-feminist bias.

We can try to step back and account for our own political and ethical positions when doing science, but at the same time we should be honest about the ways that our positions and experiences shape the questions we ask and produce insights.

Feminism, like religion or other social identifications, can be competitive with science or it can be collaborative. See, for example, the blog of Echidne for a collaborative approach. To the extent that feminism represents a set of tenets that are opposed to reality, it could get in the way of scientific thinking, in the same way that religion would get in the way of scientific thinking if, for example, you tried to apply faith healing principles to do medical research. If you’re serious about science, though, I think of feminism (or, I imagine, Christianity, for example) as a framework rather than a theory—that is, as a way of interpreting the world, not as a set of positive statements. This is in the same way that I earlier wrote that racism is a framework, not a theory. Not all frameworks are equal; my point here is just that, if we’re used to thinking of feminism, or religion, as anti-scientific, it can be useful to consider ways in which these perspectives can help one’s scientific work.

All of this is just one perspective. I did get several useful comments and references from Shira Mitchell and Dan Simpson when preparing this post, but the particular stories and emphases are mine. One could imagine a whole series of such posts—“How feminism made me a worse scientist,” “How science has made me a better feminist,” “How Christianity has made me a better scientist,” and so forth—all written by different people. And each one could be valid.

I was impelled to write the above post after reflecting upon all those many pseudo-scientific stories of cavemen (as Rebecca Solnit put it, “the five-million-year-old suburb”) and reflecting on the difficulties we so often have in communicating with one another; see for example here (where psychologist Steven Pinker, who describes himself as a feminist, gives a list of topics that he feels are “taboo” and cannot be openly discussed among educated Americans; an example is the statement, “Do men have an innate tendency to rape?”) and here (where social critic Charles Murray, who I assume would not call himself a feminist, argues that educated Americans are too “nonjudgmental” and not willing enough to condemn others, for example by saying “that it is morally wrong for a woman to bring a baby into the world knowing that it will not have a father”).

When doing social science, we have to accept that different people have sharply different views. Awareness of multiple perspectives is to me a key step, both in understanding social behavior and also in making sense of the social science we read. I do not think that calling oneself a feminist makes someone a better person, nor do I claim that feminism represents some higher state of virtue. All I’m saying here is that feminism, beyond its political context, happens to be a perspective that can help some of us be better scientists.

“Usefully skeptical science journalism”

Dean Eckles writes:

I like this Wired piece on the challenges of learning about how technologies are affecting us and children.

The journalist introduces a nice analogy (that he had in mind before talking with me — I’m briefly quoted) between the challenges in nutrition (and observational epidemiology more generally) and in studying “addictive” technologies.

He also gets how important it is to think about the magnitude of effects.

Perhaps an example of usefully skeptical science journalism? There are some little bits that aren’t quite right (“They need randomized controlled trials, to establish stronger correlations between the architecture of our interfaces and their impacts”) but that’s to be expected.

I have nothing to add except that I think it’s best to identify the author of the article, in this case it’s Robbie Gonzalez. Calling it a “Wired piece” doesn’t seem quite right. I wouldn’t like it if someone referred to one of my papers as “an article in the Journal of the American Statistical Association” without crediting me! This general issue has come up before; see for example the final paragraph in this post from 2006.

Discussion of the value of a mathematical model for the dissemination of propaganda

A couple people pointed me to this article, “How to Beat Science and Influence People: Policy Makers and Propaganda in Epistemic Networks,” by James Weatherall, Cailin O’Connor, and Justin Bruner, also featured in this news article. Their paper begins:

In their recent book Merchants of Doubt [New York:Bloomsbury 2010], Naomi Oreskes and Erik Conway describe the “tobacco strategy”, which was used by the tobacco industry to influence policy makers regarding the health risks of tobacco products. The strategy involved two parts, consisting of (1) promoting and sharing independent research supporting the industry’s preferred position and (2) funding additional research, but selectively publishing the results. We introduce a model of the Tobacco Strategy, and use it to argue that both prongs of the strategy can be extremely effective—even when policy makers rationally update on all evidence available to them. As we elaborate, this model helps illustrate the conditions under which the Tobacco Strategy is particularly successful. In addition, we show how journalists engaged in ‘fair’ reporting can inadvertently mimic the effects of industry on public belief.

This is an important topic and I like the general principle but I wasn’t so clear what was gained from the mathematical model, beyond the qualitative description and discussion. I asked the authors, and O’Connor replied:

There are a few reasons we think a model is useful here. 1) The models can help us verify that the tobacco strategy might have indeed played the type of role that Oreskes and Conway claim it played. For example, the models show that in principle something as simple as sharing the spurious results of real, independent scientists might be able to prevent the public from figuring out that smoking was dangerous. 2) They can help us neatly identify causal dependencies in cases like this, which is especially useful in figuring out which conditions make it harder or easier for propagandists. For instance, we see from the models that larger scientific communities might be a bad thing in some cases, because the extra researchers are potential sources of spurious results. This is not initially obvious. 3) By dint of making these causal dependencies clear, they help us identify interventions that might help protect the public from industry propaganda.
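Here’s a minimal sketch of the selective-sharing prong, in the spirit of (but much simpler than) the authors’ model; the risk numbers are invented. A reader who rationally pools every study shown to them can still end up badly misled if an interested party passes along only the real but unrepresentative studies that happened to find nothing:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed truth for this toy example: smoking raises disease risk from 10% to 20%.
p_control, p_smoker = 0.10, 0.20
n_per_study, n_studies = 100, 200

# Each independent, honest study counts diseased people among 100 smokers and
# 100 controls.
studies = [(rng.binomial(n_per_study, p_smoker),
            rng.binomial(n_per_study, p_control)) for _ in range(n_studies)]

# The propagandist shares only the (real, non-fraudulent) studies that happened
# to find little or no excess risk among smokers.
shared = [(s, c) for s, c in studies if s <= c + 5]

def pooled_risk_difference(study_list):
    smoker_cases = sum(s for s, _ in study_list)
    control_cases = sum(c for _, c in study_list)
    n = n_per_study * len(study_list)
    return smoker_cases / n - control_cases / n

print("Reader who sees all studies:   risk difference =",
      round(pooled_risk_difference(studies), 3))
print("Reader who sees shared subset: risk difference =",
      round(pooled_risk_difference(shared), 3),
      f"({len(shared)} of {n_studies} studies shared)")
```

None of the shared studies is fraudulent; they are just a selected subset, which is what makes the strategy so hard to counter even for a reader who updates rationally on everything put in front of them.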

Jeremy Freese was ahead of the curve

Here’s sociologist Jeremy Freese writing, back in 2008:

Key findings in quantitative social science are often interaction effects in which the estimated “effect” of a continuous variable on an outcome for one group is found to differ from the estimated effect for another group. An example I use when teaching is that the relationship between high school test scores and earnings is stronger for men than for women. Interaction effects are notorious for being much easier to publish than to replicate, partly because it is easy for researchers to forget (?) how they tested many dozens of possible interactions before finding one that is statistically significant and can be presented as though it was hypothesized by the researchers all along.

Various things ought to heighten suspicion that a statistically significant interaction effect has a strong likelihood of not being “real.” Results that imply a plot like the one above practically scream “THIS RESULT WILL NOT REPLICATE.” There are so many ways of dividing a sample into subgroups, and there are so many variables in a typical dataset that have low correlation with an outcome, that it is inevitable that there will be all kinds of little pockets for high correlation for some subgroup just by chance.

Examples of such findings in the published literature are left as an exercise for the reader.

Interesting to see this awareness expressed so clearly way back when, at the very beginning of what we now call the replication crisis in quantitative research. I noticed Freese’s post when it appeared but at the time I didn’t fully recognize the importance of his points.
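Freese’s point is easy to check with a quick simulation (a toy example of my own, not his): generate an outcome that has nothing to do with anything, scan a pile of candidate interactions, and watch some of them come out statistically significant anyway.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Pure noise: an outcome with no relationship to the predictor of interest or
# to any of the candidate moderators.
n, n_moderators = 500, 100
x = rng.normal(size=n)
y = rng.normal(size=n)
moderators = rng.binomial(1, 0.5, size=(n, n_moderators))

significant = 0
for j in range(n_moderators):
    m = moderators[:, j]
    X = sm.add_constant(np.column_stack([x, m, x * m]))
    fit = sm.OLS(y, X).fit()
    if fit.pvalues[3] < 0.05:        # p-value on the x*m interaction coefficient
        significant += 1

print(f"{significant} of {n_moderators} pure-noise interactions came out "
      f"'statistically significant' at p < 0.05")
```

Roughly 5 percent of the pure-noise interactions clear the threshold, so a researcher who scans dozens of subgroup splits will nearly always have something publishable-looking to report, and it will not replicate.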

What’s gonna happen in the 2018 midterm elections?

Following up on yesterday’s post on party balancing, here’s a new article from Joe Bafumi, Bob Erikson, and Chris Wlezien giving their predictions for November:

We forecast party control of the US House of Representatives after the 2018 midterm election. First, we model the expected national vote relying on available generic Congressional polls and the party of the president. Second, we model the district vote based primarily on results from 2016 and the national swing. . . . Based on our analysis, the Democrats are projected to win a solid plurality of the national vote, above 53% of the two-party share, and gain control of the House with a narrow 7-seat majority. Our simulations yield considerable variation, however, with the Republicans winning the majority of seats 46% of the time but with the distinct possibility of a big Democratic wave.

No scatterplot, unfortunately. Also, whassup with the weird y-axis labels in the graph, huh? Anyway, if you stare at the graph long enough, the point is clear: The system is asymmetric, what we call a Republican bias in the seats-votes curve. Democrats need quite a bit more than half the votes to be assured of half the seats in the legislature.
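Here’s a rough sketch of the kind of calculation that lies behind a seats-votes curve, using invented district numbers rather than the authors’ actual data: hold a district-level baseline fixed, apply a uniform national swing plus some district-level noise, and count seats.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical district baselines (made-up numbers, not the authors' data):
# 435 districts whose Democratic two-party vote share averages about 50%, but
# with Democratic voters packed into fewer, more lopsided seats.
baseline = np.concatenate([
    rng.normal(0.43, 0.05, 260),   # many modestly Republican-leaning districts
    rng.normal(0.61, 0.07, 175),   # fewer, heavily Democratic districts
])
n_districts = len(baseline)

for national_dem_share in [0.50, 0.52, 0.54]:
    swing = national_dem_share - baseline.mean()   # uniform national swing
    wins = []
    for _ in range(2000):
        district_vote = baseline + swing + rng.normal(0, 0.03, n_districts)
        wins.append(int((district_vote > 0.5).sum()))
    wins = np.array(wins)
    print(f"Dem national vote {national_dem_share:.0%}: median Dem seats "
          f"{int(np.median(wins))} of 435, P(Dem majority) = {(wins >= 218).mean():.2f}")
```

With Democratic voters packed into fewer, more lopsided districts, 50 percent of the national vote translates into well under half the seats, and it takes something like 53 percent to make a Democratic majority a good bet, roughly in line with the forecast quoted above.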

Erikson adds:

Past wave elections have been surprisingly strong. One reason is that seats that had previously seemed safe for incumbents suddenly became endangered. Why? In the prior election, the out-party (Dems today) had not competed strongly for a seat that they could only come close to winning but not win. The combination of a wave of new support for the out party plus the out party’s renewed effort can tip the balance where incumbents had previously seemed safe enough.

With a super-sized wave (bigger than observers now predict), the fallout can be enormous because gerrymandering only rearranges district lines and cannot manufacture more votes for a party. Designers of gerrymanders ignore the possibility of a 100-year flood, so their dikes are shallow. A large wave can wash away many in-party seats. I hope this analogy is clear.

But as of now the generic polls do not show a super-sized wave. The central question is more modest: which party controls the House. The Democrats are favored but not certain of winning the most seats.

Meanwhile, if there is the wave that people think is coming, the Senate might be more in play than people think today. A strong blue wave could probably help almost all if not all of the vulnerable Democrat Senators survive. Meanwhile the Dems could pick up 1 to 4 seats, possibly regaining a Senate majority.

Some thoughts after reading “Bad Blood: Secrets and Lies in a Silicon Valley Startup”

I just read the above-titled John Carreyrou book, and it’s as excellent as everyone says it is. I suppose it’s the mark of any compelling story that it will bring to mind other things you’ve been thinking about, and in this case I saw many connections between the story of Theranos—a company that raised billions of dollars based on fake lab tests—and various examples of junk science that we’ve been discussing for the past ten years or so.

Before going on, let me emphasize three things:

1. Theranos was more fake than I’d realized. On the back cover of “Bad Blood” is this quotation from journalist Bethany McLean:

No matter how bad you think the Theranos story was, you’ll learn that the reality was actually far worse.

Indeed. Before reading Carreyrou’s book, I had the vague idea that Theranos had some high-tech ideas that didn’t work out, and that they’d covered up their failures in a fraudulent way. The “fraudulent” part seems about right, but it seems that they didn’t have any high-tech ideas at all!

Their claim was that they could do blood tests without a needle, just using a drop of blood from your finger.

And how did they do this? They took the blood, diluted it, then ran existing blood tests. Pretty obvious, huh? Well, there’s a reason why other companies weren’t doing this: you can’t do 100 tests on a single drop of blood. Or, to put it another way, you can’t do 1 test on 1/100th of a drop of blood: there’s just not enough there, using conventional assay technology.

I think that, with care in data collection and analysis, you could do a lot better than standard practice—I’d guess that it wouldn’t be hard at all to reduce the amount of blood needed by a factor of 2, by designing and analyzing your assays more efficiently (see here, for example, where we talk about all the information available from measurements that are purportedly “below detection limit”). But 100 tests from one drop of blood: no way.
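For what it’s worth, here’s a minimal sketch of the below-detection-limit point (generic toy numbers, not the analysis in the linked paper): a reading reported only as “below the detection limit” still tells you the concentration was low, and you can put that information directly into the likelihood instead of throwing the measurement away.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)

# Hypothetical assay: true log-concentration ~ Normal(mu, sigma); readings below
# the detection limit are reported only as "< LOD".
true_mu, true_sigma, lod = 1.0, 0.8, 2.0
conc = rng.lognormal(true_mu, true_sigma, 200)
observed = np.where(conc >= lod, conc, np.nan)   # NaN marks "below LOD"
is_censored = np.isnan(observed)

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    ll_obs = stats.norm.logpdf(np.log(observed[~is_censored]), mu, sigma).sum()
    ll_cens = is_censored.sum() * stats.norm.logcdf(np.log(lod), mu, sigma)
    return -(ll_obs + ll_cens)

fit = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat = fit.x[0]

naive_mu = np.log(observed[~is_censored]).mean()  # drop the censored readings
print(f"true mu = {true_mu}, censored-likelihood estimate = {mu_hat:.2f}, "
      f"drop-the-censored estimate = {naive_mu:.2f} "
      f"({is_censored.sum()} of 200 readings below LOD)")
```

Dropping the censored readings biases the estimate upward; treating them as left-censored observations gets you close to the true value from the same amount of blood, which is the sort of efficiency gain I have in mind.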

So I’d just assumed Theranos was using a new technology entirely, maybe something with gene sequencing or microproteins or some other idea I’d never heard of. No, not at all. What they actually had was an opaque box containing several little assay setups, with a mechanical robot arm to grab and squeeze the pipette to pass around the blood. Unsurprisingly, the machine broke down all the time. But even if it worked perfectly, it was a stupid hack. Or, I should say, stupid from the standpoint of measuring blood; not so stupid from the standpoint of conning investors.

You’ve heard about that faked moon landing, right? Well, Theranos really was the fake moon landing. They got billions of dollars for, basically, nothing.

So, yeah, the reality was indeed far worse than I’d thought!

It would be as if Stan couldn’t really fit any models, as if what we called “Stan” was just an empty program that scanned in models, ran them in Bugs, and then made up values of R-hat in order to mimic convergence. Some key differences between Stan and Theranos: (a) we didn’t do that, (b) Stan is open source so anyone could check that we didn’t do that by running models themselves, and (c) nobody gave us a billion dollars. Unfortunately, (c) may be in part a consequence of (a) and (b): Theranos built a unicorn, and we just built a better horse. You can get more money for a unicorn, even though—or especially because—unicorns don’t exist.

2. Clarke’s Law. Remember Clarke’s Third Law: Any sufficiently crappy research is indistinguishable from fraud. Theranos was an out-and-out fraud, as has been the case for some high-profile examples of junk science. In other cases, scientists started out by making inadvertent mistakes and then only later moved to unethical behavior of covering up or denying their errors. And there are other situations where there is enough confusion in the literature that scientists could push around noise and get meaningless results without possibly ever realizing they were doing anything wrong. From the standpoint of reality, it hardly matters. The Theranos story is stunning, but from the perspective of understanding science, I don’t really care so much if people are actively cheating, whether they’re deluding themselves, or whether it’s something in between. For example, when that disgraced primatologist at Harvard was refusing to let other people look at his videotapes, was this the action of a cheater who didn’t want to get caught, or just a true believer who didn’t trust unbelievers to evaluate his evidence? I don’t care: either way, it’s bad science.

3. The role of individual personalities. There are a lot of shaky business plans out there. It seems that what kept Theranos afloat for so long was the combination of Elizabeth Holmes, Ramesh Balwani, and David Boies, three leaders who together managed to apply some mixture of charisma, money, unscrupulousness, and intimidation to keep the narrative alive in contradiction to all the evidence. It would be nearly impossible to tell the story of Theranos without the story of these personalities, in the same way that it would be difficult to try to understand the disaster that was Cornell University’s Food and Brand Lab without considering the motivations of its leader. People matter, and it took a huge amount of effort for Holmes, Balwani, and their various cheerleaders and hired guns, to keep their inherently unstable story from exploding.

That said, my own interest in Theranos, as in junk science, is not so much on the charismatic and perhaps self-deluded manipulators, or on the disgusting things people will do for money or prestige (consider the actions of Theranos’s legal team to intimidate various people who were concerned about all the lying going on).

Rather, I’m interested in the social processes by which obviously ridiculous statements just sit there, unchallenged and even actively supported, for years, by people who really should know better. Part of this is the way that liars violate norms—we expect scientists to tell the truth, so a lie can stand a long time before it is fully checked—part of it is wishful thinking, and part of it seems to be an attitude by people who are already overcommitted to a bad idea to protect their investment, if necessary by attacking truth-tellers who dispute their claims.

Bad Blood

OK, on to the book. There seem to have been two ingredients that allowed Theranos to work. And neither of these ingredients involved technology or medicine. No, the two things were:

1. Control of the narrative.

2. Powerful friends.

Neither of these came for free. Theranos’s leaders had to work hard, for long hours, for years and years, to maintain control of the story and to attract and maintain powerful friends. And they needed to be willing to lie.

One thing I really appreciated about Carreyrou’s telling of the tale was the respect he gives to the whistleblowers, people who told the truth and often were attacked for their troubles. Each of them is a real person with complexities, decisions, and a life of his or her own. Sometimes in such stories there’s such a focus on the perpetrators that the dissenters and whistleblowers are presented just as obstacles in the way of someone’s stunning success. Carreyrou doesn’t do that; he treats the critics with the respect they deserve.

When reading through the book, I took a lot of little notes, which I’ll share here.

p.35: “Avid had asked more pointed questions about the pharmaceutical deals and been told they were held up in legal review. When he’d asked to see the contracts, Elizabeth had said she didn’t have the copies readily available.” This reminds me of how much of success comes from simply outlasting the other side, from having the chutzpah to lie and just carrying it out, over and over again.

p.37: various early warnings arise, suspicious and odd patterns. What strikes me is how clear this all was. It really is an Emperor’s New Clothes situation in which, from the outside, the problems are obvious. Kind of like a lot of junk science, for example that “critical positivity ratio” stuff that was clearly ridiculous from the start. Or that ESP paper that was published in JPSP, or the Bible Code paper published in Statistical Science. None of these cases were at all complicated; it was just bad judgment to take them seriously in the first place.

p.38: “By midafternoon, Ana had made up her mind. She wrote up a brief resignation letter and printed out two copies . . . Elizabeth emailed her back thirty minutes later, asking her to please call her on her cell phone. Ana ignored her request. She was done with Theranos.” Exit, voice, and loyalty. It’s no fun fighting people who don’t fight fair; easier just to withdraw. That’s what I’ve done when I’ve had colleagues who plagiarize, or fudge their data, or more generally don’t seem to really care if their answers make sense. I walk away, and sometimes these colleagues can then find new suckers to fool.

p.42: “Elizabeth was conferenced in by phone from Switzerland, where she was conducting a second demonstration for Novartis some fourteen months after the faked one that had led to Henry Mosley’s departure.” And this happened in January, 2008. What’s amazing here is how long it took for all this to happen. Theranos faked a test in 2006, causing one of its chief executives to leave—but it wasn’t until nearly ten years later that this all caught up to them. Here I’m reminded of Cornell’s Food and Brand Lab, where problems had been identified several years before the scandal finally broke.

p.60: Holmes gave an interview to NPR’s “BioTech Nation” back in 2005! I guess I shouldn’t be surprised that NPR got sucked into this one: they seem to fall for just about anything.

p.73: Dilution assays! Statistics is a small world. Funny to think that my colleagues and I have worked on a problem that came up in this story.

p.75: “Chelsea’s job was to warm up the samples, put them in the cartridges, slot the cartridges into the readers, and see if they tested positive for the virus.” This is so amusingly low-tech! Really I think the best analogy here is the original “Mechanical Turk,” that supposedly automatic chess-playing device from the 1700s that was really operated by a human hiding inside the machine.

p.86: “Hunter was beginning to grow suspicious.” And p.86: “The red flags were piling up.” This was still just in 2010! Still many years for the story to play out. It’s as if Wile E. Coyote had run off the cliff and was standing midair for the greater part of a decade before finally crashing into the Arizona desert floor.

p.88: “Walgreens suffered from a severe case of FoMO—the fear of missing out.” Also there was a selection effect. Over the years, lots of potential investors decided not to go with Theranos—but they didn’t matter. Theranos was able to go with just the positive views and filter out the negative. Here again I see an analogy to junk science: Get a friend on the editorial board or a couple lucky reviews and you can get your paper published in a top scientific journal. Get negative reviews, and just submit somewhere else. Once your paper is accepted for publication, you can publicize. Some in the press will promote your work, others will be skeptical—but the skeptics might not bother to share their skepticism with the world. In that way, the noisiness and selection of scientific publication and publicity have the effect of converting variation into a positive expectation, hence rewarding big claims even if they only have weak empirical support.

p.97: “To be sure, there were already portable blood analyzers on the market. One of them, a device that looked like a small ATM called the Piccolo Xpress, could perform thirty-one different blood tests and produce results in as little as twelve minutes.” Hey: so it already existed! It seems that the only real advantage Theranos had over Piccolo Xpress was a willingness to lie. I guess that’s worth a lot.

p.98: “Nepotism at Theranos took on a new dimension in the spring of 2011 when Elizabeth hired her younger brother, Christian, as associate director of project management. . . . Christian had none of his sister’s ambition and drive; he was a regular guy who liked to watch sports, chase girls, and party with friends. After graduating from Duke University in 2009, he’d worked as an analyst at a Washington, D.C., firm that advised corporations about best practices.” That’s just too funny. What better qualifications to advise corporations about best practices, right?

p.120: It seems that in 2011, Theranos wanted to deploy its devices for the military in Afghanistan. But the devices didn’t exist! It makes me wonder what Theranos was aiming for. My guess is that they wanted to get some sheet of paper representing military approval, so they could then claim they were using their devices in Afghanistan, a claim which they could then use to raise more money from suckers in Silicon Valley and Wall Street. If Theranos had actually received the damn contract, they’d’ve had to come up with some excuse for why they couldn’t fulfill it.

p.139: Aggressive pit-bull lawyer David Boies “had accepted stock in lieu of his regular fees”! Amusing to see that he got conned too. Funny, as he probably saw himself as a hard-headed, street-smart man of the world. Or maybe he didn’t get conned at all; maybe he dumped some of that stock while it was still worth something.

p.151: Theranos insists on “absolute secrecy . . . need to protect their valuable intellectual property.” This one’s clever, kind of like an old-school detective story! By treating the MacGuffin as if it has value, we make it valuable. The funny thing is, academic researchers typically act in the opposite way: We think our ideas are soooo wonderful but we give them away for free.

p.154: “Elizabeth had stated on several occasions that the army was using her technology [Remember, they had no special technology—ed.] on the battlefield in Afghanistan and that it was saving soldiers’ lives.” Liars are scary. I understand that people can have different perspectives, but I’m always thrown when people just make shit up, or stare disconfirming evidence in the eye and then walk away. I just can’t handle it.

p.155: “After all, there were laws against misleading advertising.” I love the idea that it was the ad agency that had the scruples here, contrary to our usual stereotypes.

p.168: “Elizabeth and Sunny decided to dust off the Edison and launch with the older device. That, in turn, led to another fateful decision—the decision to cheat.” The Armstrong principle!

p.174: An article (not by Carreyrou) in the Wall Street Journal described Theranos’s processes as requiring “only microscopic blood volumes” and as “faster, cheaper, and more accurate than the conventional methods.” Both claims were flat-out false. I’m disappointed. As noted above, I find it difficult to deal with liars. But a news reporter should be used to dealing with liars, right? It’s part of the job. So how does this sort of thing happen? This is flat-out malpractice. Sure, I know you can’t fact-check everything, but still.

p.187: “To Tyler’s dismay, data runs that didn’t achieve low enough CVs (coefficients of variation) were simply discarded and the experiments repeated until the desired number was reached.” Amusing to see some old-school QRP’s coming up. Seems hardly necessary given all the other cheating going on. But I guess that’s part of the point: people who cheat in one place are likely to cheat elsewhere too.

p.190: “Elizabeth and Sunny had decided to make Phoenix their main launch market, drawn by Arizona’s pro-business reputation and its large number of uninsured patients.” Wow—that’s pretty upsetting to see people getting a direct financial benefit from other people being in desperate straits. Sure, I know this happens, but it still makes me uncomfortable to see it.

p.192: “Tyler conceded her point and made a mental note to check the vitamin D validation data.” Always a good idea to check the data.

p.199: “George said a top surgeon in New York had told him the company was going to revolutionize the field of surgery and this was someone his good friend Henry Kissinger considered to be the smartest man alive.” This one’s funny. American readers of a certain age will recall a joke involving Kissinger himself being described as “the smartest man in the world.”

p.207: “He talked to Shultz, Perry, Kissinger, Nunn, Mattis and to two new directors: Richard Kovacevich, the former CEO of the giant bank Wells Fargo, and former Senate majority leader Bill Frist.” This is interesting. You could imagine one or two of these guys getting conned, but all of them? How could that be? The key here, I think, is that these seven endorsements seem like independent evidence, but they’re not. It’s groupthink. We’ve seen this happen with junk science, too: respected scientists with reputations for careful skepticism endorse shaky claims on topics that they don’t really know anything about. Why? Because these claims have been endorsed by other people they trust. You get these chains of credulity (here’s a particularly embarrassing example, but I’ve seen many others) with nothing underneath. The underlying statistical fallacy is to treat multiple endorsements as independent pieces of evidence when they are anything but.
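To see how much the independence assumption matters, here’s a back-of-the-envelope Bayesian calculation with invented numbers:

```python
# Toy numbers (mine, not from the book): the prior probability that a hot
# startup's claims are real, and how likely a well-connected board member is
# to endorse it in either case.
prior_real = 0.5
p_endorse_if_real, p_endorse_if_fake = 0.9, 0.3
k = 7   # endorsements observed

def posterior_given_endorsements(k):
    """Posterior if the k endorsements were truly independent checks."""
    like_real = p_endorse_if_real ** k
    like_fake = p_endorse_if_fake ** k
    return (prior_real * like_real /
            (prior_real * like_real + (1 - prior_real) * like_fake))

print(f"{k} independent endorsements:  P(claims real) = {posterior_given_endorsements(k):.4f}")
# If all seven defer to the same first endorser, they add no information
# beyond that single check.
print(f"{k} endorsements, one source:  P(claims real) = {posterior_given_endorsements(1):.4f}")
```

Seven genuinely independent checks would be close to conclusive; seven people echoing one trusted source add almost nothing beyond that single check.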

p.209: “President Obama appointed [Holmes] a U.S. ambassador for global entrepreneurship, and Harvard Medical School invited her to join its prestigious board of fellows.” Harvard Medical School, huh? What happened to First Do No Harm?

p.253: “It was frustrating but also a sign that I [Carreyrou] was on the right track. They wouldn’t be stonewalling if they had nothing to hide.” This reminds me of so many scientists who won’t share their data. What are they so afraid of, indeed?

p.271: “In a last-ditch attempt to prevent publication [of Carreyrou’s Wall Street Journal article], Boies sent the Journal a third lengthy letter . . .” The letter included this passage: “That thesis, as Mr. Carreyrou explained in discussions with us, is that all of the recognition by the academic, scientific, and health-care communities of the breakthrough contributions of Theranos’s achievements is wrong . . .” Indeed, and that’s the problem: again, what is presented as a series of cascading pieces of evidence is actually nothing more than the results of a well-funded, well-connected echo chamber. Again, this reminds me a lot of various examples of super-hyped junk science that was promoted widely and faced no serious opposition. It was hard for people to believe there was nothing there: how could all these university researchers, and top scientific journals, and scientific award committees, and government funders, and NPR get it all wrong?

p.278: “Soon after the interview ended, Theranos posted a long document on its website that purported to rebut my reporting point by point. Mike and I went over it with the standards editors and the lawyers and concluded that it contained nothing that undermined what we had published. It was another smokescreen.” This makes me think of two things. First, what a waste of time, dealing with this sort of crap. Second, it reminds me of lots and lots of examples of scientists responding to serious, legitimate criticism with deflection and denial, never even considering the possibility that maybe they got things wrong the first time. All of this has to be even worse when lawyers are involved, threatening people.

p.294: “In January 2018, Theranos published a paper about the miniLab . . . The paper described the device’s components and inner workings and included some data . . . But there was one major catch: the blood Theranos had used in its study was drawn the old-fashioned way, with a needle in the arm. Holmes’s original premise—fast and accurate test results from just a drop or two pricked from a finger—was nowhere to be found in the paper.” This is an example of moving the goalposts: When the big claims get shot down, retreat to small, even empty claims, and act as if that’s what you cared about all along. We see this a lot with junk science after criticism. Not always—I think those ESP guys haven’t backed down an inch—but in a lot of other cases. A study is done, it doesn’t replicate, and the reply from the original authors is that they didn’t ever claim a general effect, just something very very specific.

Dogfooding it

Here’s a question to which I don’t know the answer. Theranos received reputational support from various bigshots such as George Shultz, Henry Kissinger, and David Boies. Would these guys have relied on Theranos to test their own blood? Maybe so: maybe they thought that Theranos was the best, most high-tech lab out there. But maybe not: maybe they thought that Theranos was just right for the low end, for the schmoes who would get their blood tested at Safeway and who couldn’t afford real health care.

I think about this with a lot of junk science, that its promoters think it applies to other people, not to them. For example, that ridiculous (and essentially unsupported by data) claim that certain women were 20 percentage points more likely to support Barack Obama during certain times of the month: Do you think the people promoting this work thought that their own political allegiances were so weak?

Summary

“Bad Blood” offers several take-home points. In no particular order:

– Theranos’s claims were obviously flawed, just ridiculous. Yet the company thrived for nearly a decade, even after various people inside it had realized the emptiness of its efforts.

– Meanwhile, Theranos spent tons of money: I guess that even crappy prototypes are expensive to build and maintain.

– In addition to all the direct damage done by Theranos to patients, I wonder how much harm arose from crowding-out effects. If it really took so much $ to build crappy fake machines, just imagine how much money and organization it would take to build real machines using new technology. You’d either need a major corporation or some really dedicated group of people with access to some funding. Theranos sucked up resources that could’ve gone to such efforts.

– There are lots of incentives against criticizing unscrupulous people. People who will lie and cheat might also be the kind of people who will retaliate against criticism. I admire Carreyrou for sticking with this one, as it can be exhausting to deal with these situations.

– The cheaters relied for their success on a network of clueless people and team players, the sort of people who would express loyalty toward Theranos without knowing or even possibly caring what was in its black boxes, people who just loved the idea of entrepreneurship and wanted to be part of the success story. I’ve seen this many times with junk science: mediocre researchers will get prominent scientists and public figures on their side, people who want the science to be true or who have some direct or indirect personal connection or who just support the idea of science and don’t want to see it criticized. The clueless defenders and team players can do a lot of defending without ever looking carefully into that black box. They just want some reassurance that all is ok.

The stakes with Theranos were much bigger, in both dollar and public health terms, than with much of the junk science that we’ve discussed on this blog. Beauty and sex ratio, advice on eating behavior, Bible codes, ESP, ovulation and voting, air rage, etc.: These are small potatoes compared to the billions spent on blood testing. The only really high-stakes example I can think of that we’ve discussed here is the gremlins guy: To the extent he muddies the water enough to get people to think that global warming is good for the economy, I guess he could do some real damage.

Again, I applaud Carreyrou for tracking down the amazing story of Theranos. As with many cases of fraud or self-delusion, one of the most amazing aspects of the story is how simple and obvious it all was, how long the whole scheme stayed afloat given that there was nothing there at all but a black box with a robot arm squeezing pipettes of diluted blood.

“Richard Jarecki, Doctor Who Conquered Roulette, Dies at 86”

[relevant video]

Thanatos Savehn is right. This obituary, written by someone named “Daniel Slotnik” (!), is just awesome:

Many gamblers see roulette as a game of pure chance — a wheel is spun, a ball is released and winners and losers are determined by luck. Richard Jarecki refused to believe it was that simple. He became the scourge of European casinos in the 1960s and early ′70s by developing a system to win at roulette. And win he did, by many accounts accumulating more than $1.2 million, or more than $8 million in today’s money . . . He and his wife honed his technique at dozens of casinos, including in Monte Carlo; Divonne-les-Bains, France; Baden-Baden, Germany; San Remo, on the Italian Riviera; and, briefly, Las Vegas.

How did they do it?

At the time, Dr. Jarecki told reporters that he had cracked roulette with the help of a powerful computer at the University of London. But the truth was more prosaic. He accomplished his improbable lucky streak through painstaking observation, with no electronic assistance.

Ms. Jarecki said in a telephone interview on Monday that she, Dr. Jarecki and a handful of other people helping them would record the results of every turn of a given roulette wheel to discover its biases, or tendency to land on some numbers more frequently than others, usually because of a minute mechanical defect caused by shoddy manufacturing or wear and tear.

Here’s some juicy statistical detail:

Ms. Jarecki said that watching, or “clocking,” a wheel, as Mr. Barnhart described it, could mean observing more than 10,000 spins over as long as a month. Sometimes a wheel would yield no observable advantage. But when Dr. Jarecki and company did find a wheel with a discernible bias, he would have an edge over the house. “It isn’t something he invented,” Ms. Jarecki said. “It’s something he perfected.”

Wow. This obit has more statistical sophistication than most of the PNAS papers I’ve seen.
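Just for fun, here’s a rough power calculation (my numbers, not the Jareckis’) for how much clocking it takes to spot a biased pocket on a European wheel, where a straight-up bet pays 35:1 and so a pocket is profitable only if it comes up more often than 1 in 36 spins:

```python
from scipy import stats

# European wheel: 37 pockets, a straight-up bet pays 35:1, so a pocket is
# profitable only if its true frequency exceeds 1/36.
p_fair, p_biased = 1 / 37, 1 / 33          # hypothetical worn pocket

for n_spins in [1_000, 5_000, 10_000, 20_000]:
    # Flag the pocket if its observed count exceeds the 99th percentile of what
    # a fair wheel would produce (so roughly a 1% false-alarm rate per pocket).
    threshold = stats.binom.ppf(0.99, n_spins, p_fair)
    power = 1 - stats.binom.cdf(threshold, n_spins, p_biased)
    edge = 36 * p_biased - 1               # expected return per unit bet on that pocket
    print(f"{n_spins:6d} spins: chance of spotting the biased pocket = {power:.2f} "
          f"(its true edge for the bettor is {edge:+.1%})")
```

Even a bias big enough to give the bettor an edge of several percent takes on the order of ten thousand recorded spins to detect reliably, which squares with the month of observation per wheel described in the obituary.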

Jarecki was bi-cultural: He was born in Germany, then his family moved to the U.S. when he was a child, then after graduating from college he moved back to Germany, then he met his wife, an American, during a medical residency in New Jersey, then not long after that they returned to live in Germany together.

Also this:

In addition to his wife, with whom he also had a home in Las Vegas, he is survived by a brother, Henry, a billionaire psychiatrist, commodities trader and entrepreneur; two daughters, Divonne Holmes a Court and Lianna Jarecki; a son, John, a chess prodigy who became a master at 12; and six grandchildren.

Two nephews of Dr. Jarecki are the award-winning documentarians Andrew Jarecki (“Capturing the Friedmans” and the HBO series “The Jinx: The Life and Deaths of Robert Durst”) and Eugene Jarecki (“Why We Fight” and “The House I Live In”).

And, finally:

Dr. Jarecki moved to Manila about 20 years ago, his wife said, because he liked the lifestyle there and preferred the city’s casinos to those run by Americans.

His touch at the roulette wheel endured until nearly the end. Ms. Jarecki said he last played in December, at a tournament in Manila. He came in first.

Roulette tournaments? Who knew??

What is “party balancing” and how does it explain midterm elections?

As is well known, presidential election outcomes are somewhat predictable based on economic performance. Votes for the U.S. Congress are, in large part, determined by party balancing. Right now, the Republicans control the executive branch, both houses of Congress, and the judiciary, so it makes sense that voters are going to swing toward the Democrats. Political scientists Bob Erikson, Joe Bafumi, and Chris Wlezien have written a lot about this; see for example here, here, and here.

Here’s the deal with party balancing. Most voters will go with their party almost all the time. But there’s some subset of swing voters, and they’ll make the difference in marginal seats.

This simple idea can cause a lot of confusion. For example, in 2010, when the swing was going the other way, we discussed a news commenter who characterized preference for divided government as the product of irrationality and “unconscious bias.” His mistake was to think of the electorate as a single entity (“Americans distrust the GOP. So why are they voting for it?”) rather than to recognize that different voters have different preferences.

Ironically, one reason the Democrats may well not regain the Senate in 2018 is . . . party balancing in 2016! Most people thought Hillary Clinton would win the presidency, so lots of people voted Republican for congress to balance that.

“The most important aspect of a statistical analysis is not what you do with the data, it’s what data you use” (survey adjustment edition)

Dean Eckles pointed me to this recent report by Andrew Mercer, Arnold Lau, and Courtney Kennedy of the Pew Research Center, titled, “For Weighting Online Opt-In Samples, What Matters Most? The right variables make a big difference for accuracy. Complex statistical methods, not so much.”

I like most of what they write, but I think some clarification is needed to explain why it is that complex statistical methods (notably Mister P) can make a big difference for survey accuracy. Briefly: complex statistical methods can allow you to adjust for more variables. It’s not that the complex methods alone solve the problem, it’s that, with the complex methods, you can include more data in your analysis (specifically, population data to be used in poststratification). It’s the sample and population data that do the job for you; the complex model is just there to gently hold the different data sources in position so you can line them up.

In more detail:

I agree with the general message: “The right variables make a big difference for accuracy. Complex statistical methods, not so much.” This is similar to something I like to say: the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use. (I can’t remember when I first said this; it was decades ago, but here’s a version from 2013.) I add, though, that better statistical methods can do better to the extent that they allow us to incorporate more information. For example, multilevel modeling, with its partial pooling, allows us to control for more variables in survey adjustment.
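Here's a minimal sketch of what poststratification itself does, with made-up numbers: the model (or here, just raw cell means) supplies an estimate for each population cell, and the known population counts do the reweighting. MRP's contribution is to produce stable cell estimates when there are many small cells, via partial pooling; the population data then do the rest.

import numpy as np

# Toy poststratification; all numbers are invented for illustration.
cells = ["18-34", "35-54", "55+"]
pop_counts = np.array([80_000, 100_000, 120_000])   # known population sizes per cell
sample_n   = np.array([300, 120, 60])               # opt-in sample over-represents the young
cell_means = np.array([0.62, 0.55, 0.48])           # estimated outcome within each cell

raw_mean  = np.average(cell_means, weights=sample_n)    # what the unadjusted sample says
post_mean = np.average(cell_means, weights=pop_counts)  # poststratified estimate

print(f"unadjusted: {raw_mean:.3f}, poststratified: {post_mean:.3f}")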

So I was surprised they didn’t consider multilevel regression and poststratification (MRP) in their comparisons in that report. The methods they chose seem limited in that they don’t do regularization, hence they’re limited in how much poststratification information they can include without getting too noisy. Mercer et al. write, “choosing the right variables for weighting is more important than choosing the right statistical method.” Ideally, though, one would not have to “choose the right variables” but would instead include all relevant variables (or, at least, whatever can be conveniently managed), using multilevel modeling to stabilize the inference.

They talk about raking performing well, but raking involves its own choices and tradeoffs: in particular, with simple raking you face tough choices about which interactions to rake on. Again, MRP can do better here because of partial pooling. In simple raking, you’re left with the uncomfortable choice of raking only on margins and realizing you’re missing key interactions, or raking on lots of interactions and getting hopelessly noisy weights, as discussed in this 2007 paper on struggles with survey weighting.
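For reference, here is a minimal sketch of what simple raking does, with invented numbers: iterative proportional fitting adjusts the weights until the weighted sample matches each specified margin, but nothing constrains the joint (interaction) distribution unless you explicitly rake on it, which is where the noisy-weights tradeoff comes in.

import numpy as np

# Minimal raking (iterative proportional fitting) on two margins; numbers are made up.
sample = np.array([[200.,  50.],     # rows: age group, cols: education
                   [100., 150.]])
row_targets = np.array([0.5, 0.5]) * sample.sum()   # desired row margin
col_targets = np.array([0.4, 0.6]) * sample.sum()   # desired column margin

w = np.ones_like(sample)
for _ in range(50):                                  # alternate margin adjustments
    w *= (row_targets / (w * sample).sum(axis=1))[:, None]
    w *= (col_targets / (w * sample).sum(axis=0))[None, :]

weighted = w * sample
print(weighted.sum(axis=1) / weighted.sum())   # matches row targets: [0.5 0.5]
print(weighted.sum(axis=0) / weighted.sum())   # matches col targets: [0.4 0.6]

With MRP you instead model the full cell structure and let partial pooling keep the estimates for sparse cells from blowing up.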

Also, I’m glad that Mercer et al. pointed out that increasing the sample size reduces variance but does nothing for bias. That’s a point that’s so obvious from a statistical perspective but is misunderstood by so many people. I’ve seen this in discussions of psychology research, where outsiders recommend increasing N as if it is some sort of panacea. Increasing N can get you statistical significance, but who cares if all you’re measuring is a big fat bias? So thanks for pointing that out.

But my main point is that I think it makes sense to be moving to methods such as MRP that allow adjustment for more variables. This is, I believe, completely consistent with their main point as indicated in the title of their report.

One more thing: The title begins, “For Weighting Online Opt-In Samples…”. I think these issues are increasingly important in all surveys, including surveys conducted face to face, by telephone, or by mail. Nonresponse rates are huge, and differential nonresponse is a huge issue (see here). The point is not just that there’s nonresponse bias, but that this bias varies over time, and it can fool people. I fear that focusing the report on “online opt-in samples” may give people a false sense of security about other modes of data collection.

I sent the above comments to Andrew Mercer, who replied:

I’d like to do more with MRP, but the problem that we run into is the need to have a model for every outcome variable. For Pew and similar organizations, a typical survey has somewhere in the neighborhood of 90 questions that all need to be analyzed together, and that makes the usual approach to MRP impractical. That said, I have recently come across this paper by Yajuan Si, yourself, and others about creating calibration weights using MRP. This seems very promising and not especially complicated to implement, and I plan to test it out.

I don’t think this is such a problem among statisticians, but among many practicing pollsters and quite a few political scientists I have noticed two strains of thought when it comes to opt-in surveys in particular. One is that you can get an enormous sample and that makes up for whatever other problems may exist in the data and permit you to look at tiny subgroups. It’s the survey equivalent of saying the food is terrible but at least they give you large portions. As you say, this is obviously wrong but not well understood, so we wanted to address that.

The second is the idea that some statistical method will solve all your problems, particularly matching and MRP, and, to a lesser extent, various machine learning algorithms. Personally, I have spoken to quite a few researchers who read the X-box survey paper and took away the wrong lesson, which is that MRP can fix even the most unrepresentative data. But it’s not just MRP. It was MRP plus powerful covariates like party, plus a population frame that had the population distribution for those covariates. The success of surveys like the CCES that use sample matching leads to similar perceptions about sample matching. Again, it’s not the matching, it’s matching and all the other things. The issue of interactions is important, and we tried to get at that using random forests as the basis for matching and propensity weighting, and we did find that there was some extra utility there, but only marginally more than raking on the two-way interactions of the covariates, and no improvement at all when the covariates were just demographics. And the inclusion of additional covariates really didn’t add much in the way of additional variance.

You’re absolutely right that these same underlying principles apply to probability-based surveys, although I would not necessarily expect the empirical findings to match. They have very different data generating processes, especially by mode, different confounds, and different problems. In this case, our focus was on the specific issues that we’ve seen with online opt-in surveys. The fact that there’s no metaphysical difference between probability-based and nonprobability surveys is something that I’ve written about elsewhere (e.g. https://academic.oup.com/poq/article/81/S1/250/3749176), and we’ve got lots more research in the pipeline focused on both probability and nonprobability samples.

I think we’re in agreement. Fancy methods can work because they make use of more data. But then you need to include that “more data”; the methods don’t work on their own.

To put it another way: the existence of high-quality random-digit-dialing telephone surveys does not mean that a sloppy telephone survey will do a good job, even if it happens to use random digit dialing. Conversely, the existence of high-quality MRP adjustment in some surveys does not mean that a sloppy MRP adjustment will do a good job.

A good statistical method gives you the conditions to get a good inference—if you put in the work of data collection, modeling, inference, and checking.

Trapped in the spam folder? Here’s what to do.

[Somewhat-relevant image]

It seems that some people’s comments are getting trapped in the spam filter.

Here’s how things go. The blog software triages the comments:

1. Most legitimate comments are automatically approved. You write the comment and it shows up right away.

2. Some comments are flagged as potentially spam. About half of these are legitimate comments and about half are actually spam. I go through the main comment folder about once a day (or more often if I’m trying to procrastinate) to (a) read the new comments, and (b) check the comments that were flagged as possibly spam and either approve them or send them to the spam folder. Sometimes I see already-approved comments that are spam, and I send them to the spam folder too, but that’s rare. And sometimes the classification is difficult: a comment looks real, but the identifying url is spam, and then I classify the whole comment as spam.

3. Lots and lots of comments are identified by the software as spam and sent directly into the spam filter, where I never see them. We get thousands and thousands of these, and there’s no way I could go through them. (Just to get a sense of this problem, the spam folder currently has 30,000 comments, all since 22 July (the last time I emptied it, I guess), as compared to 100,000 comments in the entire history of the blog. So it seems that we’d get more spam in 2 months than we received real comments in 15 years.)

But . . . after hearing about the recent comments caught in spam, I went into that spam filter, searched on Anonymous, and found a few legitimate comments that got trapped there.

Unfortunately, I don’t know how to whitelist people, and it seems that various regular commenters still have been having this problem.

As a stopgap, try this: If you’re worried that your comment might be going straight to the spam filter, include the following bit of html in your comment:

<!-- notspam -->

Then, every once in a while I can search the spam folder for the string “notspam” and fish these out.

An actual spammer could read all this and then spam me, but there’d be no real point, as I won’t be automatically approving these new comments; I’ll check them first in any case.

P.S. The whole thing reminds me of this story, hence the above image.

Let’s be open about the evidence for the benefits of open science

A reader who wishes to remain anonymous writes:

I would be curious to hear your thoughts on motivated reasoning among open science advocates. In particular, I’ve noticed that papers arguing for open practices have seriously bad/nonexistent causal identification strategies.
Examples:

1) Kidwell et al. 2017, Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency. Published in Plos Bio, criticized in a Plos blog here. Brian Nosek responds at great length therein.

2) McKiernan et al. 2017, Point of View: How open science helps researchers succeed, claims that there is “evidence that publishing openly is associated with higher citation rates,” but also notes that “some controlled studies have failed to find” an effect. A glance through the citations suggests that all the RCTs find null effects, and the observational studies find substantial effects.
3) Rowhani-Farid and Barnett 2018, “Badges for sharing data and code at Biostatistics: an observational study” — compares citations for articles in a journal (biostatistics) that introduced badges to those in a journal that did not (statistics in medicine); the causal identification strategy is that the two journals are “in the same field of research with similar goals of publishing papers on statistical methods development in health and medicine.” The article finds that the “effect of badges at Biostatistics was a 7.3% increase in the data sharing rate.” (Further down, they write that their study “cannot accurately deduce the effectiveness of badges because of the biases of the non-randomised study design.” Well, then how should we interpret the claims in the abstract??)
So, one thing that makes me feel sad about this is that they’re all published in journals with a clear stake in open access norms — PLOS, eLife, and F1000. I worry that publishing articles like these discredits the model.
Also, I do think there are large, well-justified benefits to open science practices. Davis 2011 finds that “[a]rticles placed in the open access condition (n=712) received significantly more downloads and reached a broader audience within the first year, yet were cited no more frequently, nor earlier, than subscription-access control articles (n=2533) within 3 yr.” David Donoho (2017) writes that “[w]orking from the beginning with a plan for sharing code and data leads to higher quality work, and ensures that authors can access their own former work, and those of their co-authors, students and postdocs.” But I guess that there is still demand for research showing a strong citation benefit to open scholarship, regardless of what the evidence says.

I don’t have the energy to read these papers—I guess I don’t really care so much if open-science increases citation rates by 17% or whatever—but I agree with the general principle expressed by our correspondent that it’s not good practice to exaggerate evidence, even in a good cause.

Response to Rafa: Why I don’t think ROC [receiver operating characteristic] works as a model for science

Someone pointed me to this post from a few years ago where Rafael Irizarry argues that scientific “pessimists” such as myself are, at least in some fields, “missing a critical point: that in practice, there is an inverse relationship between increasing rates of true discoveries and decreasing rates of false discoveries and that true discoveries from fields such as the biomedical sciences provide an enormous benefit to society.” So far so good—within the framework in which the goal of p-value-style science is to make “discoveries” and in which these discoveries can be characterized as “true” or “false.”

But I don’t see this framework as being such a useful description of science, or at least the sort of science for which statistical hypothesis tests, confidence intervals, etc., are used. Why do I say this? Because I see the following sorts of statistical analysis:

– Parameter estimation and inference, for example estimation of a treatment effect. The goal of a focused randomized clinical trial is not to make a discovery—any “discovery” to be had was made before the study began, in the construction of the new treatment. Rather, the goal is to estimate the treatment effect (or perhaps to demonstrate that the treatment effect is nonzero, which is a byproduct of estimation).

– Exploration. That describes much of social science. Here one could say that discoveries are possible, and even that the goal is discovery, but we’re not discovering statements that are true or false. For example, in our red-state blue-state analysis we discovered an interesting and previously unknown pattern in voting—but I don’t see the ROC framework being applicable here. It’s not like it would make sense to say that if our coefficient estimate or z-score or whatever is higher than some threshold that we declare the pattern to be real, otherwise not. Rather, we see a pattern and use statistical analysis (multilevel modeling, partial pooling, etc.), to give our best estimate of the underlying voting patterns and of our uncertainties. I don’t see the point of dichotomizing: we found an interesting pattern, we did auxiliary analyses to understand it, it can be further studied using new data on new elections, etc.

– OK, you might say that this is fine in social science, but if you’re the FDA you have to approve or not approve a new drug, and if you’re a drug company you have to decide to proceed with a candidate drug or give it up. Decisions need to be made. Sure, but here I’d prefer to use formal decision analysis with costs and benefits. If this points us toward taking more risks—for example, approving drugs whose net benefit remains very uncertain—so be it. This fits Rafael’s ROC story, but not based on any fixed p-value or posterior probability; see my paper with McShane et al.

Also, again, the discussion of “false positive rate” and “true positive rate” seems to miss the point. If you’re talking about drugs or medical treatments: well, lots of them have effects, but the effects are variable, positive for some people and negative for others.

– Finally, consider the “shotgun” sort of study in which a large number of drugs, or genes, or interactions, or whatever, are tested, and the goal is to discover which ones matter. Again, I’d prefer a decision-theoretic framework, moving away from the idea of statistical “discovery” toward mere “inference.”

What’s the practical implications for all this? It’s good for researchers to present their raw data, along with clean summary analyses. Report what your data show, and publish everything! But when it comes to decision making, including the decision of what lines of research to pursue further, I’d go Bayesian, incorporating prior information and making the sources and reasoning underlying that prior information clear, and laying out costs and benefits. Of course that’s all a lot of work, and I don’t usually do it myself. Look at my applied papers and you’ll see tons of point estimates and uncertainty intervals, and only a few formal decision analyses. Still, I think it makes sense to think of Bayesian decision analysis as the ideal form and to interpret inferential summaries in light of these goals. Or, even more short term than that, if people are using statistical significance to make publication decisions, we can do our best to correct for the resulting biases, as in section 2.1 of this paper.

Don’t call it a bandit

Here’s why I don’t like the term “multi-armed bandit” to describe the exploration-exploitation tradeoff of inference and decision analysis.

First, and less importantly, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

So, no deep principles here, the “bandit” name just doesn’t work for me. I agree with Bob Carpenter about the iid part (although I do remember a certain scene from The Grapes of Wrath with a non-iid slot machine), but other than that, the analogy seems a bit off.
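To illustrate the second point numerically, here is a small sketch with invented payouts: a standard epsilon-greedy rule dutifully learns which machine loses money most slowly, but every arm has negative expected value, so the genuinely best action, walking away, is never on the menu.

import numpy as np

# Epsilon-greedy on three "one-armed bandits," all with negative expected payout
# (as real slot machines have). The payout values are made up.
rng = np.random.default_rng(7)
true_means = np.array([-0.05, -0.08, -0.12])   # expected loss per play
eps, n_plays = 0.1, 10_000

counts, values = np.zeros(3), np.zeros(3)
total = 0.0
for _ in range(n_plays):
    arm = rng.integers(3) if rng.random() < eps else values.argmax()
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean per arm
    total += reward

print("plays per arm:", counts.astype(int))
print(f"average payout per play: {total / n_plays:+.3f}")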

The replication crisis and the political process

Jackson Monroe writes:

I thought you might be interested in an article [by Dan McLaughlin] in NRO that discusses the replication crisis as part of a broadside against all public health research and social science. It seemed as though the author might be twisting the nature of the replication crisis toward his partisan ends, but I was curious as to your thoughts.

From the linked article:

The social-science problem is that “public health” studies — like that NRA-convention study — can be highly subjective and ungoverned by the rigors of hard sciences that seek to test a hypothesis with results that can be replicated by other researchers. Indeed, the social sciences in general today suffer from a systemic “replication crisis,” a bias toward publishing only results that support the researcher’s hypothesis, and chronic problems with errors remaining uncorrected.

NRO is the website of the National Review, a conservative magazine, and the NRA (National Rifle Association) convention study is something we discussed in this space recently.

I think McLaughlin is probably correct that studies that seek to advance a political agenda are likely to have serious methodological problems. I’ve seen this in both the left and the right, and I don’t really know what to do about it, except to hope that all important research has active opposition. By “opposition,” I mean, ideally, honest opposition, and I don’t mean gridlock. It’s just good if serious claims are evaluated seriously, and not just automatically believed because they are considered to be on the side of the angels.

When LOO and other cross-validation approaches are valid

Introduction

Zacco asked on the Stan discourse whether leave-one-out (LOO) cross-validation is valid for phylogenetic models. He also referred to Dan’s excellent blog post, which mentioned the iid assumption. Instead of iid, it would be better to talk about the exchangeability assumption, but I (Aki) got a bit lost in my discourse answer (so don’t bother to go read it). I started to write this blog post to clear my thoughts and extend what I’ve said before, and hopefully produce a better answer.

TL;DR

The question is when leave-one-out cross-validation or leave-one-group-out cross-validation is valid for model comparison. The short answer is that we need to think about what the joint data generating mechanism is, what is exchangeable, and what the prediction task is. LOO can be valid or invalid, for example for time-series and phylogenetic modelling, depending on the prediction task. Everything said about LOO applies also to AIC, DIC, WAIC, etc.

iid and exchangeability

Dan wrote

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.

This is well put, although iid is a stronger assumption than necessary. It would be better to assume exchangeability, which doesn’t imply independence. The exchangeability assumption thus extends the cases in which LOO is applicable. The details of the difference between iid and exchangeability are not that important for this post, but I recommend that interested readers see Chapter 5 of BDA3. Instead of iid vs exchangeability, I’ll focus here on prediction tasks and data generating mechanisms.

Basics of LOO

LOO is easiest to understand in the case where we have a joint distribution p(y|x,\theta)p(x|\phi), and p(y|x,\theta) factorizes as

p(y|x,\theta) = \prod_{n=1}^N p(y_n|x_n,\theta).

We are interested in predicting a new data point coming from the data generating mechanism

p(y_{N+1}|x_{N+1},\theta).

When we don’t know \theta, we’ll integrate over posterior of \theta to get posterior predictive distribution

p(y_{N+1}|x_{N+1},x,y)=\int p(y_{N+1}|x_{N+1},\theta) p(\theta|x,y) d\theta.

We would like to evaluate how good this predictive distribution is by comparing it to the observed (y_{N+1}, x_{N+1}). If we have not yet observed (y_{N+1}, x_{N+1}), we can use LOO to approximate the expectation over different possible values of (y_{N+1}, x_{N+1}). Instead of building a model for p(y,x), we re-use the observations as pseudo-Monte Carlo samples from p(y_{N+1},x_{N+1}), and, to avoid using the same observation both for fitting \theta and for evaluation, we use the LOO predictive distributions

p(y_{n}|x_{n},x_{-n},y_{-n})=\int p(y_{n}|x_{n},\theta) p(\theta|x_{-n},y_{-n}) d\theta.

The usually forgotten assumption is that x_{N+1} is unknown, with the uncertainty described by a distribution! We have a model for p(y|x), but often we don’t build a model for p(x). BDA3 Chapter 5 discusses that when we have extra information x_n for y_n, then the y_n are not exchangeable, but the (x_n, y_n) pairs are. BDA3 Chapter 14 discusses that if we have a joint model p(x,y|\phi,\theta) and a prior which factorizes as

p(\phi,\theta)=p(\phi)p(\theta),

then the posterior factorizes as

p(\phi,\theta|x,y)=p(\phi|x)p(\theta|x,y).

We can analyze the second factor by itself with no loss of information.

For the predictive performance estimate, however, we need to know how the future x_{N+1} would be generated! In LOO we avoid making that model explicit by assuming that the observed x's implicitly (non-parametrically) represent the distribution p(x). If we assume that the future distribution is different from the past, we can use importance weighting to adjust. Extrapolation (far) beyond the observed data would require an explicit model for p(x_{N+1}). We discuss different joint data generating processes in Vehtari & Ojanen (2012), but I’ll use more words and examples below.
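As a concrete (toy) illustration of the formulas above, here is brute-force LOO for a conjugate normal-mean model with known data variance: each observation is left out in turn, the posterior for \mu is recomputed from the rest, and the log predictive density is evaluated at the held-out point. In realistic models the refits are usually approximated (for example with PSIS-LOO) rather than done exactly; the toy model and numbers below are mine, not from the post.

import numpy as np

# Brute-force LOO for y_n ~ normal(mu, 1), mu ~ normal(0, 10^2).
rng = np.random.default_rng(4)
y = rng.normal(1.5, 1.0, size=30)
sigma, tau = 1.0, 10.0

def log_pred(y_train, y_new):
    prec = 1 / tau**2 + len(y_train) / sigma**2      # posterior precision for mu
    mu_post = (y_train.sum() / sigma**2) / prec      # posterior mean (prior mean is 0)
    var_pred = sigma**2 + 1 / prec                   # predictive variance for a new point
    return -0.5 * np.log(2 * np.pi * var_pred) - 0.5 * (y_new - mu_post)**2 / var_pred

elpd_loo = sum(log_pred(np.delete(y, n), y[n]) for n in range(len(y)))
print(f"elpd_loo = {elpd_loo:.2f}")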

Data generating mechanisms

Dan wrote

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

This is a quite common way to describe the problem, but I hope to make it clearer by discussing the data generating mechanisms in more detail.

Sometimes p(x) does not exist, for example when x is fixed, chosen by design, or deterministic! Then p(x,y) also does not exist, and exchangeability (or iid) does not apply to (x_n,y_n). We can still make a conditional model p(y|x,\theta) and analyze posterior p(\theta|x,y).

If x is fixed or chosen by design, we could think of a prediction task in which we would repeat one of the measurements with the same x's. In that case, we might want to predict all the new observations jointly, but approximating that with cross-validation would lead to leave-all-out cross-validation, which is far from what we want (and sensitive to prior choices). So we may instead use the more stable and easier-to-compute LOO, which is still informative about predictive accuracy for repeated experiments.

LOO can be used in the case of fixed x to evaluate the predictive performance of the conditional terms p(\tilde{y}_n|x_n,x,y), where \tilde{y}_n is a new observation conditional on a fixed x_n. Taking the sum (or mean) of these terms then weights each fixed x_n equally. If we care more about the performance for some fixed x_n than for others, we can use a different weighting scheme to adjust.

x can sometimes be a mix of something with a distribution and something which is fixed or deterministic. For example, in a clinical study we could assume that patient covariates come from some distribution in the past and in the future, but that the drug dosage is deterministic. It’s clear that if the drug helps, we don’t assume that we would continue giving a bad dosage in the future, so it’s unlikely we would ever observe the corresponding outcomes either.

Time series

The future x can also be fixed, chosen by design, or deterministic, but different from the observed x's. For example, in a time series we have observed time points 1,\ldots,N and then often we want to predict for N+1,\ldots,N+m. Here the data generating mechanism for x (time) is deterministic and all the future values of x are outside our observed x's. It is still possible that the conditional part for y factorizes as \prod_{n=1}^N p(y_n|f_n,\theta) given latent process values f (and we have a joint prior on f describing our time series assumptions), and we may assume (y_n-f_n) to be exchangeable (or iid). What matters more here is the structure of x. Approximating the prediction task leads to m-step-ahead cross-validation, where we don’t use the future data to provide information about f_n or \theta (see a draft case study). Under short-range dependency and stationarity assumptions, we can also use some of the future data in m-step-ahead leave-a-block-out cross-validation (see a draft case study).
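Here is a minimal sketch of the one-step-ahead idea for a toy AR(1) series, using plug-in least-squares estimates instead of a full Bayesian fit to keep it short: at each time point the model sees only the past, so no future observations leak into the evaluation. The series and estimates are invented for illustration.

import numpy as np

# One-step-ahead evaluation for a toy AR(1) series.
# A full Bayesian version would integrate over parameter uncertainty instead of plugging in.
rng = np.random.default_rng(5)
T = 200
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + rng.normal(0, 1)

lpd = []
for t in range(50, T - 1):                     # start once there is enough history
    past, nxt = y[:t], y[t]
    phi = (past[1:] @ past[:-1]) / (past[:-1] @ past[:-1])   # AR coefficient estimate
    resid = past[1:] - phi * past[:-1]
    s2 = resid.var()
    mu = phi * past[-1]                         # one-step-ahead predictive mean
    lpd.append(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (nxt - mu) ** 2 / s2)

print(f"mean one-step-ahead log predictive density: {np.mean(lpd):.3f}")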

We can also have time series where we don’t want to predict for the future and thus the focus is only on the observed time points 1,\ldots,N. For example, we might be interested in analysing whether more or fewer babies are born on some special days during a time period in the past (I would assume there are plenty of more interesting examples in history and the social sciences). It’s sensible to use LOO here to analyze whether we have been able to learn relevant structure in the time series in order to better predict the number of births on a left-out day. Naturally we could also analyze different aspects of the time series model by leaving out one week, one month, one year, or several days around special days, to focus on different assumptions in our time series model.

LOO-elpd can fail or doesn’t fail for time-series depending on your prediction task. LOO is not great if you want to estimate the performance of extrapolation to future, but having a time series doesn’t automatically invalidate the use of cross-validation.

Multilevel data and models

Dan wrote

Or when your data has multilevel structure, where you really should leave out whole strata.

For multilevel structure, we can start from a simple example with individual observations from M known groups (or strata as Dan wrote). The joint conditional model is commonly

\prod_{m=1}^M \left[ p(\theta_m|\psi) \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\psi),

where y are partially exchangeable, so that the y_{mn} are exchangeable within group m, and the \theta_m are exchangeable. But for the prediction we also need to consider how the x's are generated. If we also have a model for x, then we might have the following joint model

p(y,x)= \prod_{m=1}^M \left[ p(\theta_m|\psi) \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\psi) \prod_{m=1}^M \left[ p(\phi_m|\varphi) \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\varphi).

Sometimes we assume we haven’t observed all groups and want to make predictions for y_{M+1,n} for a new group M+1. Say we have observed individual students in different schools and we want to predict for a new school. Then it is natural to consider leave-one-school-out cross-validation to simulate the fact that we don’t have any observations yet from that new school. Leave-one-school-out cross-validation will then also implicitly (non-parametrically) model the distribution of x_{mn}.

On the other hand, if we poll people in all states of the USA, there will be no new states (at least in the near future) and some or all x might be fixed or deterministic, but there can be new people to poll in these states, and LOO could be sensible for predicting what the next person would answer. And even if there are more schools, we could focus just on these schools and have fixed or deterministic x. Thus the validity of LOO in hierarchical models also depends on the data generating mechanism and the prediction task.
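Here is a structural sketch of leave-one-group-out cross-validation with simulated data, where simple moment-based estimates stand in for a full hierarchical Bayesian fit: the held-out group is scored under the predictive distribution for a new group, so it is the higher-level (between-group) part of the model that gets exercised. For brevity the held-out observations are scored independently, ignoring the within-group correlation that a joint predictive would capture.

import numpy as np

# Leave-one-group-out CV sketch; the hierarchical data and estimators are toy stand-ins.
rng = np.random.default_rng(6)
M, N = 12, 20
group_means = rng.normal(0, 1, size=M)                 # between-group variation
y = group_means[:, None] + rng.normal(0, 0.5, size=(M, N))

lpd = []
for m in range(M):
    rest = np.delete(y, m, axis=0)                     # drop the whole group m
    mu_hat = rest.mean()                               # population-level mean
    between = rest.mean(axis=1).var()                  # between-group variance (rough)
    within = rest.var(axis=1).mean()                   # within-group variance
    var_new = between + within                         # predictive variance for a new group's observation
    held = y[m]
    lpd.append(np.sum(-0.5 * np.log(2 * np.pi * var_new)
                      - 0.5 * (held - mu_hat) ** 2 / var_new))

print(f"leave-one-group-out elpd (sketch): {sum(lpd):.1f}")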

Phylogenetic models and non-factorizable priors

Zacco’s discourse question was related to phylogenetic models.

Wikipedia says

In biology, phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species, or populations). These relationships are discovered through phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology under a model of evolution of these traits. The result of these analyses is a phylogeny (also known as a phylogenetic tree) – a diagrammatic hypothesis about the history of the evolutionary relationships of a group of organisms. The tips of a phylogenetic tree can be living organisms or fossils, and represent the “end”, or the present, in an evolutionary lineage.

This is similar to the above hierarchical model, but now we assume a non-factorizable prior for \theta

\prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi),

and if we have a model for x, we may also have a non-factorizable prior for \phi

p(y,x)= \prod_{m=1}^M \left[ \prod_{n=1}^N p(y_{mn}|x_{mn},\theta_m) \right] p(\theta_1,\ldots,\theta_M|\psi)p(\psi) \prod_{m=1}^M \left[ \prod_{n=1}^N p(x_{mn}|\phi_m) \right] p(\phi_1,\ldots,\phi_M|\varphi)p(\varphi).

Again LOO is valid if we focus on new observations in the current groups (e.g., observed species). Alternatively, we could consider prediction for new individuals in a new group (e.g., a new species) and use leave-one-group-out cross-validation. I would assume we might have extra information about some x's for a new species, which would require re-weighting or more modeling. Non-factorizable priors are no problem for LOO or cross-validation in general, although fast LOO computations may be possible only for special non-factorizable forms, and direct cross-validation may require the full prior structure to be included, as demonstrated in the Leave-one-out cross-validation for non-factorizable models vignette.

Prediction in scientific inference

Zacco also wrote

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

It seems quite common in many fields to want some quantity to report as an assessment of how good the model is, without considering whether that model would generalize to new observations. The observations are the only thing we have actually observed, and they are what we might observe more of in the future. Only in toy problems might we also be able to observe the usually unknown model structure and parameters. All model selection methods are inherently connected to the observations, and instead of treating model assessment methods as black boxes (*cough* information criteria *cough*), it’s useful to think about how the model can help us predict something.

In hierarchical models there are also different parts we can focus on, and depending on the focus, the benefit of some parts can be hidden behind the benefits of other parts. For example, in the simple hierarchical model described above, if we have plenty of individual observations in each group, then the hierarchical prior has only a weak effect when we leave one observation out, and LOO is then assessing mostly the lowest-level model part p(y_{mn}|x_{mn},\theta_m). On the other hand, if we use leave-one-group-out cross-validation, then the hierarchical prior has a strong effect on the prediction for that group, and we are assessing more the higher-level model part p(\theta_1,\ldots,\theta_M|\psi). I would guess this is the part of phylogenetic models that should be in focus. If there is no specific prediction task, it’s probably useful to do both LOO and leave-one-group-out cross-validation.

Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set.

This would be sensible if we assume that future locations are spatially disconnected from the current locations, or if the focus is specifically on the non-spatial model part p(y_{mn}|x_{mn},\theta_m) and we don’t want spatial information to help that prediction.

Information criteria

Zacco wrote

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability.

It’s unfortunate that we didn’t make it more clear in that paper. I naïvely trusted too much that people would read also the cited theoretical 87 pages long paper (Vehtari & Ojanen, 2012), but I now do understand that some repetition is needed. Akaike’s (1974) original paper made the clear connection to the prediction task of predicting y_{N+1} given x_{N+1} and \hat{\theta}. Stone (1997) showed that TIC (Takeuchi, 1976) which is a generalization of AIC, can be obtained from Taylor series approximation of LOO. Also papers introducing RIC (Shibata, 1989), NIC (Yoshizawa and Amari, 1994), DIC (Spiegelhalter et al., 2002) make the connection to the predictive performance. Watanabe (2009, 2010a,b,c) is very explicit on the equivalance of WAIC and Bayesian LOO. The basic WAIC has exactly the same exchangeability assumptions as LOO, estimates the same quantity, but uses a different computational approximation. It’s also possible to think of DIC and WAIC at different levels of hierarchical models (see, e.g., Spiegelhalter et al., 2002; Li et al., 2016; Merkle et al. 2018).

Unfortunately, most of the time information criteria are presented as fit plus a complexity penalty, without any reference to the prediction task, the assumptions about exchangeability, the data generating mechanism, or which part of the model is in focus. This, combined with the difficult-to-interpret units of the resulting quantity, has led to information criteria being used as black-box measures. Because the assumptions are hidden, people think they are always valid (in case you already forgot: WAIC and LOO have the same assumptions).

I prefer LOO over WAIC because of better diagnostics and better accuracy in difficult cases (see, e.g., Vehtari, Gelman, and Gabry, 2017). In Using Stacking to Average Bayesian Predictive Distributions we used LOO, but we can do Bayesian stacking also with other forms of cross-validation, such as leave-one-group-out or m-step-ahead cross-validation.

 

PS. I found this great review of the topic: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Although it completely ignores the Bayesian cross-validation literature and gives some recommendations that are not applicable to Bayesian modeling, it mostly gives the same recommendations as what I discuss above.

China air pollution regression discontinuity update

Avery writes:

There is a follow up paper for the paper “Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy” [by Yuyu Chen, Avraham Ebenstein, Michael Greenstone, and Hongbin Li] which you have posted on a couple times and used in lectures. It seems that there aren’t much changes other than newer and better data and some alternative methods. Just curious what you think about it.

The paper is called, “New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy” [by Avraham Ebenstein, Maoyong Fan, Michael Greenstone, Guojun He, and Maigeng Zhou].

The cleanest summary of my problems with that earlier paper is this article, “Evidence on the deleterious impact of sustained use of polynomial regression on causal inference,” written with Adam Zelizer.

Here’s the key graph, which we copied from the earlier Chen et al. paper:

The most obvious problem revealed by this graph is that the estimated effect at the discontinuity is entirely the result of the weird curving polynomial regression, which in turn is being driven by points on the edge of the dataset. Looking carefully at the numbers, we see another problem, which is that life expectancy is supposed to be 91 in one of these places (check out that green circle on the upper right of the plot), and, according to the fitted model, the life expectancy there would be 5 years higher (96 years!) if only they hadn’t been exposed to all that pollution.
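To see how much of such a jump can be manufactured by the functional form alone, here is a quick simulation sketch with invented data (not the Huai River dataset): when the truth is a smooth line with no discontinuity at all, the estimated jump from high-degree polynomials fit separately on each side of the cutoff is far noisier than from low-degree fits, echoing the concern that the discontinuity estimate is an artifact of the curve rather than of the data near the cutoff.

import numpy as np

# Separate polynomial fits on each side of a cutoff, with a smooth truth and zero true jump.
rng = np.random.default_rng(2)
for deg in [1, 5]:
    jumps = []
    for _ in range(2000):
        x = rng.uniform(-10, 10, size=150)
        y = 0.1 * x + rng.normal(0, 1, size=150)     # smooth truth, no treatment effect
        left, right = x < 0, x >= 0
        jump = (np.polyval(np.polyfit(x[right], y[right], deg), 0.0)
                - np.polyval(np.polyfit(x[left], y[left], deg), 0.0))
        jumps.append(jump)
    print(f"degree {deg}: sd of estimated jump at the cutoff (true jump = 0): {np.std(jumps):.2f}")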

As Zelizer and I discuss in our paper, and I’ve discussed elsewhere, this is a real problem, not at all resolved by (a) regression discontinuity being an identification strategy, (b) high-degree polynomials being recommended in some of the econometrics literature, and (c) the result being statistically significant at the 5% level.

Indeed, items (a), (b), (c) above represent a problem, in that they gave the authors of that original paper, and the journal reviewers and editors, a false sense of security which allowed them to ignore the evident problems in their data and fitted model.

We’ve talked a bit recently about “scientism,” defined as “excessive belief in the power of scientific knowledge and techniques.” In this case, certain conventional statistical techniques for causal inference and estimation of uncertainty have led people to turn off their critical thinking.

That said, I’m not saying, nor have I ever said, that the substantive claims of Chen et al. are wrong. It could be that this policy really did reduce life expectancy by 5 years. All I’m saying is that their data don’t really support that claim. (Just look at the above scatterplot and ignore the curvy line that goes through it.)

OK, what about this new paper? Here’s the new graph:

You can make of this what you will. One thing that’s changed is that the two places with life expectancy greater than 85 have disappeared. So that seems like progress. I wonder what happened? I did not read through every bit of the paper—maybe it’s explained there somewhere?

Anyway, I still don’t buy their claims. Or, I should say, I don’t buy their statistical claim that their data strongly support their scientific claim. To flip it around, though, if the public-health experts find the scientific claim plausible, then I’d say, sure, the data are consistent with the this claimed effect on life expectancy. I just don’t see distance north or south of the river as a key predictor, hence I have no particular reason to believe that the data pattern shown in the above figure would’ve appeared, without the discontinuity, had the treatment not been applied.

I feel like kind of a grinch saying this. After all, air pollution is a big problem, and these researchers have clearly done a lot of work with robustness studies etc. to back up their claims. All I can say is: (1) Yes, air pollution is a big problem so we want to get these things right, and (2) Even without the near-certainty implied by these 95% intervals excluding zero, decisions can and will be made. Scientists and policymakers can use their best judgment, and I think they should do this without overrating the strength of any particular piece of evidence. And I do think this new paper is an improvement on the earlier one.

P.S. If you want to see some old-school ridiculous regression discontinuity analysis, check this out: