
Power analysis and NIH-style statistical practice: What’s the implicit model?

So. Following up on our discussion of the “80% power” lie, I was thinking about the implicit model underlying NIH’s 80% power rule.

Several commenters pointed out that, to have your study design approved by the NIH, you’re not required to demonstrate that you have 80% power for real; what’s needed is to show 80% power conditional on an effect size of interest, and to demonstrate that this particular effect size is plausible. On the other hand, in NIH-world the null hypothesis could be true: indeed, in some sense the purpose of the study is to see, well, not whether the null hypothesis is true, but whether there’s enough evidence to reject it.

So, given all this, what’s the implicit model? Let “theta” be the parameter of interest, and suppose the power analysis is performed assuming theta = 0.5, say, on some scale.

My guess, based on how power analysis is usually done and on how studies actually end up, is that in this sort of setting the true average effect size is more like 0.1, with a lot of variation: perhaps it’s -0.1 in some settings and +0.3 in others.
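To make that gap concrete, here's a rough calculation of my own (a sketch, not from the post), assuming a two-sided z-test on a standardized effect at alpha = 0.05, with the sample size chosen to give 80% power at the assumed effect of 0.5:

```python
from statistics import NormalDist

# A rough sketch, not from the post: two-sided z-test at alpha = 0.05 on a
# standardized effect theta, with n per group picked to give 80% power at
# the assumed effect theta = 0.5.
N = NormalDist()
z_a = N.inv_cdf(0.975)  # ~1.96
z_b = N.inv_cdf(0.80)   # ~0.84

theta_assumed = 0.5
n = 2 * (z_a + z_b) ** 2 / theta_assumed ** 2  # ~63 per group

def power(theta, n):
    """Power of the two-sided z-test when the true effect is theta."""
    se = (2 / n) ** 0.5
    return N.cdf(theta / se - z_a) + N.cdf(-theta / se - z_a)

print(f"power at assumed theta = 0.5: {power(0.5, n):.2f}")  # 0.80
print(f"power at true theta = 0.1:   {power(0.1, n):.2f}")   # 0.09
```

If the true effect is 0.1 rather than 0.5, the nominal 80% power collapses to something barely above the significance level, which is the sense in which the 80% claim is a lie.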

But forget about what I think. Let’s ask: what does the NIH think, or what distribution for theta is implied by NIH’s policies and actions?

To start with, if the effect is real, we’re supposed to think that theta = 0.5 is a conservative estimate. So maybe we can imagine some distribution of effect sizes like normal with mean 0.75, sd 0.25, so that the effect is probably larger than the minimal level specified in the power analysis.

Next, I think there’s some expectation that the effect is probably real, let’s say there’s at least a 50% chance of there being a large effect as hypothesized.

Finally, the NIH accepts that the researcher’s model could’ve been wrong, in which case theta is some low value. Not exactly zero, but maybe somewhere in a normal distribution with mean 0 and standard deviation 0.1, say.

Put this together and you get a bimodal distribution:
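Here's a minimal simulation of that implied prior, using the numbers above: a 50/50 mixture of normal(0.75, 0.25) and normal(0, 0.1). The weights and parameters are just the rough guesses from this post:

```python
import random
from statistics import mean

random.seed(1)

# The implied prior sketched above: with probability 0.5 the effect is
# "real" and drawn from normal(0.75, 0.25); otherwise the model was wrong
# and the effect is near-null, drawn from normal(0, 0.1).
def draw_theta():
    if random.random() < 0.5:
        return random.gauss(0.75, 0.25)
    return random.gauss(0.0, 0.1)

thetas = [draw_theta() for _ in range(100_000)]

print(f"prior mean: {mean(thetas):.2f}")  # close to 0.375
print(f"P(|theta| < 0.2): {mean(abs(t) < 0.2 for t in thetas):.2f}")
print(f"P(theta > 0.5): {mean(t > 0.5 for t in thetas):.2f}")
```

Most of the mass sits either near zero or above 0.5, with a trough in between: that's the bimodality.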

And this doesn’t typically make sense, that something would either have a near-zero, undetectable effect, or a huge effect, with little possibility of anything in between. But that’s what the NIH is implicitly assuming, I think.

Bayesians are frequentists

Bayesians are frequentists. What I mean is, the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

I was thinking about this in the context of this question from Vlad Malik:

I noticed this comment on Twitter in reference to you. Here’s the comment and context:

“It’s only via significance tests that model assumptions are checked. Hence wise Bayesian go back to them e.g., Box, Gelman.” and

While I’m not qualified to comment on this, it doesn’t sound to me like something you’d say. With all the “let’s report Bayesian and Frequentist stats together” talk flying around, I’m curious where statistical significance does fit in for you.

Necessary evil or, following most of the comments on your Abandon Stat Sig post, something not so necessary? My layman impression was that you’d fall into the “do away with it” camp.

Is stat sig necessary to “evaluate a model”? Perhaps I misunderstand the terminology, but my thinking is that experience/reality is the only thing that evaluates a model, i.e., effect size, reproducibility, usefulness… I see uses for stat sig as one “clue” or indicator, but I don’t see how stat sig helps check any assumptions, given it’s based on fairly big assumptions.

My reply:

Actually that quote is not a bad characterization of the views of myself and my collaborators. As we discuss in chapters 6 and 7 of BDA3, there are various ways to evaluate a model, and one of these is model checking, comparing fitted model to data. This is the same as significance testing or hypothesis testing, but with two differences:
(1) I prefer graphical checks rather than numerical summaries and p-values;
(2) I do model checking to test the model that I am fitting, usually not to test a straw-man null hypothesis. I already know my model is false, so I don’t pat myself on the back for finding problems with the fit (thus “rejecting” the model); rather, when I find problems with fit, this motivates improvement to the model.

By the way, I followed the above links and they were full of ridiculous statements such as “scientists will never accept invented prior distrib’s”—which is kind of a shocker to me as I’ve been publishing scientific papers with “invented prior distrib’s” for nearly 30 years now! But I guess people will say all sorts of foolish things on twitter.

Chasing the noise in industrial A/B testing: what to do when all the low-hanging fruit have been picked?

Commenting on this post on the “80% power” lie, Roger Bohn writes:

The low power problem bugged me so much in the semiconductor industry that I wrote 2 papers about it around 1995. Variability estimates come naturally from routine manufacturing statistics, which in semicon were tracked carefully because they are economically important. The sample size is determined by how many production lots (e.g. 24 wafers each) you are willing to run in the experiment – each lot adds to the cost.

What I found was that small process improvements were almost impossible to detect, using the then-standard experimental methods. For example, if an experiment has a genuine yield impact of 0.2 percent, that can be worth a few million dollars. (A semiconductor fabrication facility produced at that time roughly $1 to $5 billion of output per year.) But a change of that size was lost in the noise. Only when the true effect rose into the 1% or higher range was there much hope of detecting it. (And a 1% yield change, from a single experiment, would be spectacular.)

Yet semicon engineers were running these experiments all the time, and often acting on the results. What was going on? One conclusion was that most good experiments were “short loop” trials, meaning that the wafers did not go all the way through the process. For example, you could run an experiment on a single mask layer, and then measure the effect on manufacturing tolerances. (Not the right terminology in semicon, but that is what they are called elsewhere.) In this way, the only noise was from the single mask layer. Such an experiment would not tell you the impact on yields, but an engineering model could estimate the relationship between tolerances ===> yields. Now, small changes were detectable with reasonable sample sizes.
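To put rough numbers on Bohn's point, the standard sample-size formula, n per arm of about 2(z_a + z_b)^2 (sigma/delta)^2, shows how fast the required number of production lots blows up as the effect shrinks. The lot-to-lot yield sd of 2 percentage points below is my invented placeholder, not a figure from Bohn:

```python
from statistics import NormalDist

N = NormalDist()
z_a = N.inv_cdf(0.975)  # two-sided alpha = 0.05
z_b = N.inv_cdf(0.80)   # 80% power

def lots_per_arm(delta, sigma=2.0):
    """Production lots per experimental arm needed to detect a yield shift
    of delta percentage points. The lot-to-lot yield sd (sigma, in points)
    is an invented placeholder, not a figure from Bohn."""
    return 2 * (z_a + z_b) ** 2 * (sigma / delta) ** 2

print(round(lots_per_arm(1.0)))  # 63: conceivable for a 1-point effect
print(round(lots_per_arm(0.2)))  # 1570: hopeless for a 0.2-point effect
```

A 25-fold increase in sample size to detect an effect one-fifth the size is exactly why the short-loop trials, which shrink sigma rather than grow n, were the experiments that worked.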

This relates to noise-chasing in A/B testing, it relates to the failure of null hypothesis significance testing when studying incremental changes, and what to do about it, and it relates to our recent discussions about how to do medical trials using precise measurements of relevant intermediate outcomes.

Stan goes to the World Cup

Leo Egidi shares his 2018 World Cup model, which he’s fitting in Stan.

But I don’t like this:

First, something’s missing. Where’s the U.S.??

More seriously, what’s with that “16.74%” thing? So bogus. You might as well say you’re 66.31 inches tall.

Anyway, as is often the case with Bayesian models, the point here is not the particular predictions but rather the transparency of the whole process. If the above win probabilities look wrong to you: Fine. You’re saying you have prior knowledge that’s not captured in Leo’s model. The thing to do next is to formally express that knowledge, alter the model, and re-fit in Stan.

One good and one bad response to statistics’ diversity problem

(This is Dan)

As conference season rolls into gear, I thought I’d write a short post contrasting some responses by statistical societies to the conversation that the community has been having about harassment of women and minorities at workshops and conferences.

ISI: Do what I say, not what I do

Let’s look at a different diversity statement by the International Statistical Institute, more commonly known as the ISI.  Feel free to read it in full–it’s not very long. But I’ll reproduce a key paragraph.

ISI is committed to providing a professional environment free from discrimination on the basis of sex, race, colour, religion, national origin, disability, age, sexual orientation, gender identity, and politics.

On the face of it, this looks fine. It’s a boilerplate paragraph that is in almost every code of conduct and diversity statement. But let’s actually take a look at how it’s currently implemented.

One of the major activities of the ISI is a biennial World Statistics Congress, which is one of the larger statistics conferences on the calendar. Last year, they held it in Morocco. Next year it will be held in Malaysia.

In Morocco, the penalty for same-sex sexual activity is up to three years in jail. In Malaysia, the penalty for same-sex sexual activity is up to twenty years in jail (as well as fines and corporal punishment).

That the ISI has put two consecutive major statistics meetings in countries where homosexual activity is illegal is not news to anyone. These aren’t secret meetings–they are very large and have been on the books for a few years.

But these meetings manifestly fail to live up to the new diversity statement. This reflects a lack of care and a lack of thought. The ISI World Statistics Conferences do not provide a professional environment free from discrimination.

By holding these meetings in these countries, the ISI is sending a strong message to the LGBT+ community that they do not value us as statisticians. They are explicitly forcing LGBT+ scholars to make a very difficult choice in the event that they get invited to speak in a session: do they basically pretend to be straight for a week, or do they give up this career opportunity?

The ISI has released a diversity statement that it does not live up to. It is a diversity statement that anyone organizing a session at the next ISI World Statistics Conference will not live up to (they are participating in organizing a hostile environment for LGBT+ scholars).

This is pathetic. It is rare to see a group release a diversity statement and actually make the situation worse.

That they did it at the beginning of Pride Month in North America is so bleak I actually find it funny.

ISBA: A serious, detailed, and actionable response

For a much better response to the problems facing minority communities in statistics, we can look at ISBA’s response to reports of sexual harassment at their conferences.

ISBA has taken these reports seriously and have released a detailed Code of Conduct that covers all future events as well as responsibilities and expectations of members of the society.

This was a serious and careful response to a real problem and it’s a credit to ISBA that they made this response in time for its forthcoming major meeting. It’s also a credit to them that they did not rush the process–this is the result of several months of hard work by a small team of people.

The level of detail that the code of conduct goes into in its complaints procedures, its investigation procedures, and the rights and responsibilities of all involved is very good. While it is impossible to undo the harm from not dealing with this problem earlier, this code of conduct is a good basis for making ISBA a safe place for statisticians from now and into the future.


PS (added later): There was some concern shown in the comments that the two countries I mentioned were Muslim-majority countries. There was no other ISI meeting I could see in the same time period that happened in places with similar anti-LGBT legislation. (Although such places exist and the laws are justified using a variety of religious and social positions.)

The last ISI WSC to be held in a Muslim-majority country before Morocco was 20 years previously, in Turkey, which, to my knowledge, did not have anti-LGBT laws on the books then or now.

It is always a mistake to attribute bad laws to faith groups. People who share a common religion are diverse and assigning them responsibility for a law would be like assigning every American blame for everything Trump (or Obama) wrote into law.

About that quasi-retracted study on the Mediterranean diet . . .

Some people asked me what I thought about this story. A reporter wrote to me about it last week, asking if it looked like fraud. Here’s my reply:

Based on the description, there does not seem to be the implication of fraud. The editor’s report mentioned “protocol deviations, including the enrollment of participants who were not randomized.” I suppose this could be fraud, or it could be simply incompetence, or it could be something in between, for example maybe the researchers realized there were problems with the randomization but they didn’t bother to mention it in the paper because they thought it was no big deal, all studies are imperfect, etc. I have no idea. In any case, it’s good that the NEJM looked into this.

Looking at the new article, it seems that only a small subset of patients were not randomized as designed: out of a total of 7447 patients, 425 shared a household with a previously enrolled participant, and 467 participants were at the site where patients were not randomized. There may be some overlap, but, even if not, this is only 892 people, that’s 12% of the people in the study. So the easiest thing is to just analyze the 88% of the people who were in the trial as designed. Sure, that’s throwing away some data, but that will be the cleanest way to go.

I’m surprised this story is getting so much play, as it appears that the conclusions weren’t really changed by the removal of the questionable 12% of the data. (According to the above-linked report, the data problems were “affecting 10 percent of respondents,” and according to this news article, it’s 14%. So now we have three numbers: 10% from one report, 14% from another, and 12% from my crude calculation. A minor mystery, I guess.)
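For what it's worth, here's the arithmetic behind my crude 12% figure (the no-overlap worst case from the two counts above):

```python
total = 7447      # patients in the trial
household = 425   # shared a household with a previously enrolled participant
bad_site = 467    # enrolled at the site that skipped randomization

worst_case = household + bad_site  # assumes the two groups don't overlap
print(worst_case)                                 # 892
print(round(100 * worst_case / total))            # 12 (percent affected)
print(round(100 * (total - worst_case) / total))  # 88 (percent clean)
```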

It says here that the researchers “spent a year working on the re-analysis.” If it was me, I think I would’ve taken the easy way out and just analyzed the 88% (or 86%, or 90%) of the data that were clean, but I can’t fault them for trying to squeeze out as much information as they could. I’m similarly confused by the quotes from skeptical statisticians. Is the skepticism on specific statistical grounds, or is it just that, if 12% of the study was botched, there’s concern about other, undiscovered biases lurking in the data? As usual when the study involves real-life decisions, I end up with more questions than answers.

P.S. Just read Hilda Bastian’s discussion. Lots of interesting details.

Stan Workshop on Pharmacometrics—Paris, 24 July 2018

What: A one-day event organized by France Mentre (IAME, INSERM, Univ SPC, Univ Paris 7, Univ Paris 13) and Julie Bertrand (INSERM) and sponsored by the International Society of Pharmacometrics (ISoP).

When: Tuesday 24 July 2018

Where: Faculté Bichat, 16 rue Henri Huchard, 75018 Paris

Free Registration: Registration is being handled by ISoP; please click here to register.


  • 14:00-14:50 Andrew Gelman (Columbia Univ.) What I learned about Bayesian methods while working on pharmacometrics
  • 14:50-15:20 Bob Carpenter (Columbia Univ.) The Stan software development roadmap
  • 15:20-15:40 Bill Gillespie (Metrum) TBD [Stan, Bayesian Pharmacology, Torsten]
  • 15:40-16:00 Solene Desmee (Tours Univ.) Joint models and individual predictions with Stan
  • Coffee / tea / macarons Break
  • 16:20-16:40 Mitzi Morris (Columbia Univ.) ICAR spatial models in Stan
  • 16:40-17:00 Charles Margossian (Columbia Univ.) Understanding automatic differentiation to improve performance
  • 17:00-17:20 Florence Loingeville (INSERM) Model Based Bioequivalence for sparse designs using Stan
  • 17:20-17:40 Felix Held (Fraunhofer-Chalmers Centre) Bayesian hierarchical model of oscillatory cortisol response during drug intervention
  • 17:40-18:00 Sebastian Weber (Novartis) Qualifying drug dosing regimens in pediatrics using probabilistic Gaussian Processes

I’m really looking forward to this one, because I really like focused workshops and this is a great lineup of speakers. Thanks for organizing, France and Julie. See you in Paris!

Do women want more children than they end up having?

Abigail Haddad writes:

In his column, Ross Douthat states that “most women want more children than they have”, linking to a medium article about the gap between actual and intended fertility.

The big argument/finding of the linked-to article is this: the scale of the gap between actual and intended fertility in the United States is between .3 and .6 kids, and this has been relatively stable for a few decades. The analysis seems plausible. But it doesn’t get you to “most women want more children than they have”, and I’d argue it’s only barely consistent with it. If you assume we’re at the top of the range (.5-.6) and that all of the women who want more kids only want one more kid, Ross’s line would be accurate. But, for instance, if you instead assume the midpoint of that range (which is also the most recent point estimate), and that half the women with below-intended fertility wanted two more kids and half wanted one, you’d instead be looking at 30% of women who wanted more children than they have. (To get a sense of the distribution, you’d want the person-level data.)

My guess would be that Douthat’s quote is derived from the article’s line that “women have fewer children than they say they would have liked to have.” But that’s not the same thing.
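Haddad's arithmetic above is easy to check: the share of women who want more children than they have is the average gap divided by the average shortfall among the women with a gap. A back-of-envelope sketch, using the numbers from her email:

```python
# Midpoint of the .3-.6 fertility-gap range cited above.
gap = 0.45

# Scenario 1: every woman with unmet fertility wants exactly 1 more kid,
# so the share of women wanting more equals the gap itself.
share_one_more = gap / 1.0
print(share_one_more)  # 0.45: short of "most" even in this scenario

# Scenario 2 (Haddad's): half of those women want 1 more and half want 2,
# so the average shortfall per such woman is 1.5 kids.
share_mixed = gap / 1.5
print(round(share_mixed, 2))  # 0.3, i.e., about 30% of women
```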

My reply:

You could be right—but it must be possible to break down the survey data for individual respondents, so Douthat’s question, even if it can’t be answered from averages alone, probably can be answered from the raw data. My guess is that if we were to label:
X = the percentage of women who have had fewer children than they wanted
Y = the percentage of women who have the exact number they wanted
Z = the percentage of women who have had more children than they wanted,
then the data would show that X > Z. But not necessarily that X > 50%. Douthat literally claimed X > 50% but I think that’s just because he’s being casual with his math, and I expect that X > Z would still make his point.

In the real world, all this is complicated by the fact that people’s intentions change. A couple can fully intend to have 2 kids, but then baby #2 is so adorable that they decide to try and have a third kid . . . and, after a few years, they succeed. So the number of children is 3, but what’s the desired number? It’s ultimately 3, but it was originally 2. Or, maybe a couple isn’t planning to have a fourth child, but they have one by accident. Mistakes happen! But then they’re very happy to have had kid #4. So, they had more children than they originally wanted, but they’re happy with the number of children they had.

I’m sure some sociologists have looked carefully at such questions, as many of them are directly answerable from the data, if you have a series of surveys asking people of different ages how many children they have, how many they plan to have, and how many they would like to have.

Can somebody please untangle this one for us? Are centrists more, or less, supportive of democracy, compared to political extremists?

OK, this is a nice juicy problem for a political science student . . .

Act 1: “Centrists Are the Most Hostile to Democracy, Not Extremists”

David Adler writes in the New York Times:

My research suggests that across Europe and North America, centrists are the least supportive of democracy, the least committed to its institutions and the most supportive of authoritarianism.

I examined the data from the most recent World Values Survey (2010 to 2014) and European Values Survey (2008), two of the most comprehensive studies of public opinion carried out in over 100 countries. The survey asks respondents to place themselves on a spectrum from far left to center to far right. I then plotted the proportion of each group’s support for key democratic institutions.

Respondents who put themselves at the center of the political spectrum are the least supportive of democracy, according to several survey measures. These include views of democracy as the “best political system,” and a more general rating of democratic politics. In both, those in the center have the most critical views of democracy.

I don’t quite know why Australia and New Zealand are in with the European countries on the above graph, or why the Netherlands is sitting there next to the United States, and I wouldn’t recommend displaying this information in the form of bar graphs, but in any case you get the picture, so the graph does its job. Adler shows similar patterns for other survey responses, and his full article is here.

Act 2: “It’s radicals, not centrists, who are really more hostile to democracy”

Matthijs Rooduijn responds in the Washington Post:

Can [the claim that centrists are the most hostile to democracy] be right? That depends on how you define “centrist.” Adler relies on what citizens say about their own ideology. . . . But do these self-characterizations represent how moderate or radical they are? I looked beyond how people describe their left-right position and assessed their actions and beliefs. That reveals something different: The people who vote for radical parties and who hold radical views on those parties’ issues are the ones more skeptical of and less satisfied with democracy. . . .

How people describe their own left-right position isn’t a very accurate way of distinguishing among those who are in the moderate ideological center and those who are more radical.

To illustrate, I’ll focus on the radical right. Let me stress, however, that you would see similar patterns among the radical left.

Here I’m defining the radical right as a family of parties that endorse “nativism” — i.e., the belief that the homogeneous nation-state is threatened by “dangerous others” such as immigrants or people of another race. Examples include the Front National in France and the League in Italy. Various studies have shown that those who vote for such parties mainly do so because they hold anti-immigrant attitudes. . . .

Why does the traditional left-right self-identification scale fail to distinguish moderates from radicals? That’s because the categories of far left, far right, and center include very broad groups of respondents.

To make this clear, let’s look at those who vote for radical right parties and those who endorse those parties’ attitudes. In the analyses below, I use the European Social Survey (ESS), which has more recent information than the WVS and EVS.

Of those who voted for a radical right party, about 43 percent — a plurality — place themselves in the center. We get a similar result when we look at who holds the most negative attitude toward immigrants: About 48 percent — again, a plurality — call themselves centrists.

In other words, many self-identified centrists aren’t moderates at all, once you look at how they vote and what they believe. . . .

So let’s discard left-right self-placement for now. Instead, we’ll assess whether someone is radical right or moderate by looking at how he or she votes and what he or she believes on the radical right’s main issue: immigration.

Again, a bar plot!? But, again, the graph does the job so I won’t harp on it. Rooduijn shows something similar using opposition to immigration as a predictor of anti-democratic attitudes. He concludes:

When we look at voting behavior and ideological beliefs, radicals feel it’s less important to live in a democratically governed country and are less satisfied with how their own democracy works than do those in the center.

Act 3: Whassup?

OK, how do we reconcile these findings? Rooduijn argues that the problem is with self-declared ideology, but I don’t know about that, for a couple of reasons. First, you can be far-right or far-left ideologically and not vote for a particular far-right or far-left party: perhaps you disagree with them on key issues, or maybe you don’t want to waste your vote on a fringe candidate, or maybe you object to the party’s corruption, for example. Second, even if “centrist” is just a label that people give themselves, it still seems surprising that people who give themselves this label are dissing democracy: I could see a centrist being dissatisfied with the current politically polarized environment, but it’s surprising for me to see them wanting to get rid of the entire system. Third, I’d expect to see a correlation between left-right position and satisfaction with the political system, but with this correlation varying depending on who’s in power. If your party’s in power, you’re more likely to trust the voters, no? But there could be some exceptions: for example, Trump is president even though his opponent got more votes, so it could be a coherent position to support Trump but be unhappy with the democratic process.

My other concern involves data. Adler has approximately 50% of Europeans viewing democracy as a “very good” political system; from Rooduijn’s polls, over 90% of Europeans think it important to live in a democratically governed country. Sure, these are different questions, but the proportions are sooooo different, it makes me wonder how to compare results from these different surveys.

Act 4: Here’s where you come in.

So, there’s a lot going on here. I’d’ve thought that moderates—whether self-described or as defined based on their voting patterns or their issue attitudes—would be more supportive of democracy, as asked in various ways, than extremists.

Adler found moderates to be less supportive of democracy—but, to me, the big thing was that he found democracy as a whole to not be so popular, in this set of democratic countries that he studied.

Rooduijn found what I’d consider a more expected result: democracy is overwhelmingly popular, perhaps slightly less so among people who hold extreme views or vote for extreme parties on the left or right.

Based on my expectations, I’d think that Rooduijn’s conclusion is more plausible. But Adler has some convincing graphs. I’ve not looked at these data myself, at all.

So here’s where you come in. Download the survey data—all publicly available, I believe—and reanalyze from scratch to figure out what’s going on.

Here are some suggestions to get started:

– Look at the averages and maybe the distributions of responses to questions asked on a 0-10 scale, rather than throwing away information by cutting off at a threshold.

– Analyze each country separately, then make scatterplots where each dot is a country.

– Break up the U.S. into separate regions, treat each one like a “country” in this scatterplot.

– Include in your analysis the party in power when the survey was conducted. (Recall this story.)
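As a starting skeleton for the country-by-country summary step, here's a sketch in Python. Everything in it, the data, the country codes, and the crude "center" cutoff, is invented for illustration; a real analysis would pull the actual WVS/EVS/ESS variables:

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(0)

# Invented stand-in data: (country, left-right self-placement 0-10,
# support for democracy 0-10). The codes and scales are placeholders for
# the actual survey variables.
rows = [(c, random.randint(0, 10), random.randint(0, 10))
        for c in ["FR", "DE", "US"] for _ in range(1000)]

# One summary per country and group, keeping the full 0-10 response scale
# rather than thresholding the outcome. The crude cutoff |ideology - 5| <= 1
# for "center" is only for display; per the suggestions above, a real
# analysis would keep ideology continuous too.
by_country = defaultdict(list)
for country, ideology, support in rows:
    group = "center" if abs(ideology - 5) <= 1 else "non-center"
    by_country[(country, group)].append(support)

for key in sorted(by_country):
    print(key, round(mean(by_country[key]), 2))
```

Each (country, group) average then becomes one dot in the scatterplot suggested above.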

OK, go off and do it! This would be a great project for a student in political science, and it’s an important topic.

Josh “hot hand” Miller speaks at Yale tomorrow (Wed) noon

Should be fun.

Global shifts in the phenological synchrony of species interactions over recent decades

Heather Kharouba et al. write:

Phenological responses to climate change (e.g., earlier leaf-out or egg hatch date) are now well documented and clearly linked to rising temperatures in recent decades. Such shifts in the phenologies of interacting species may lead to shifts in their synchrony, with cascading community and ecosystem consequences . . . We compared phenological shifts among pairwise species interactions (e.g., predator–prey) using published long-term time-series data of phenological events from aquatic and terrestrial ecosystems across four continents since 1951 to determine whether recent climate change has led to overall shifts in synchrony. We show that the relative timing of key life cycle events of interacting species has changed significantly over the past 35 years. Further, by comparing the period before major climate change (pre-1980s) and after, we show that estimated changes in phenology and synchrony are greater in recent decades.


However, there has been no consistent trend in the direction of these changes. . . . the next challenges are to improve our ability to predict the direction of change and understand the full consequences for communities and ecosystems.

And here’s a published comment from Andreas Linden.

A style of argument can be effective in an intellectual backwater but fail in the big leagues—but maybe it’s a good thing to have these different research communities

Following on a post on Tom Wolfe’s evolution-denial trolling, Thanatos Savehn pointed to this obituary, “Jerry A. Fodor, Philosopher Who Plumbed the Mind’s Depths, Dies at 82,” which had lots of interesting items, including this:

“We think that what is needed,” they wrote, “is to cut the tree at its roots: to show that Darwin’s theory of natural selection is fatally flawed.” . . .

The book loosed an uproar among scientists. (Its review in the magazine Science appeared under the headline “Two Critics Without a Clue.”)

“He and Chomsky had a modus operandi which was ‘Bury your opponents as early as possible,’ ” Dr. [Ernie] Lepore said, speaking of Dr. Fodor. “And when he went up against the scientific community, I don’t think Fodor was ready for that. He basically told these guys that natural selection was bogus. The arguments are interesting, but he didn’t win a lot of converts.”

That’s an interesting idea, that a style of argument can be effective in an intellectual backwater such as academic linguistics but fail in the big leagues of biology. It’s not so bad to have these different academic communities: we can think of academic linguistics as a “safe space” where scholars can pursue ridiculous ideas that might still become useful.

If we crudely model scientific hypotheses as being true/false, or reasonable/unreasonable, then it can at times be a good research strategy to start in “reasonable” territory and then deliberately wander into the “unreasonable” zone as a way of better traversing the space of theories. The best way to get to new reasonable hypotheses might be to entertain some silly ideas, considering these ideas seriously enough to fully work through their implications. And perhaps that is what Fodor was doing in his thought experiment of cutting the evolutionary tree “at its roots.”

At the same time, you can’t expect biologists to just sit there and take it. Hence the value of distinct research communities. As long as we’re not using linguists’ theories of evolution to fight disease, I guess we’re ok.

P.S. Peter Erwin convincingly makes the case that the above post is “massively and bizarrely unfair to linguistics, especially by taking a single, controversial theorist (that is, Chomsky; Fodor is a philosopher) as being somehow representative of the field.”

We’re putting together a list of big, high profile goals that proved far more challenging than people had anticipated circa 1970

Palko writes:

The postwar era (roughly defined here as 1945 to 1970) was a period of such rapid and ubiquitous technological and scientific advances that people naturally assumed that this rate of progress would continue or even accelerate. This led not just futurists like Arthur C Clarke but also researchers in the fields to underestimate the difficulty of certain problems, often optimistically applying the within-a-decade deadline to their predictions.

I [Palko] am trying to come up with a list of big, high profile goals that proved far more challenging than people had anticipated circa 1970.

He continues with some examples:

The war on cancer. I suspect that the celebrated victory over polio significantly contributed to an unrealistic expectation for other major diseases.

Fusion reactors. It took about a decade to go from atomic bomb to nuclear power compact and reliable enough to deploy in submarines.

Artificial intelligence. We’ve already mentioned the famously overoptimistic predictions that came out of the field at the time.

Artificial hearts.
From Wikipedia:

“In 1964, the National Institutes of Health started the Artificial Heart Program, with the goal of putting a man-made heart into a human by the end of the decade.”

Not sure whether they meant 1970 or 1974, but either way, they missed their target.

Does anyone out there have additional items I should add to the list?

Yes, I have some examples! Before getting to them, let me first point out that some obvious candidates don’t really work. Rockets to Mars and flying cars . . . people were talking about both of these, but there was also lots of skepticism. It’s my impression that the flying-car thing was always a bit of a joke, as the energy consumption and air traffic control problems were always pretty obvious.

Weather modification, maybe? I don’t know enough about the history (see book #4 in this list) to know for sure; was this an area in which wild optimism was the norm?

Most of my examples are intellectual rather than physical products:

– Game theory. If you read Luce and Raiffa’s classic 1957 text, you’ll see lots of fun stuff and also a triumphalist attitude that just about all the important problems of game theory had been solved, and we were just in a mopping-up phase.

– Classical statistics. As X and I wrote, “Many leading mathematicians and statisticians had worked on military problems during World War II, using available statistical tools to solve real problems in real time. Serious applied work motivates the development of new methods and also builds a sense of confidence in the existing methods that have led to such success. After some formalization and mathematical development of the immediate postwar period, it was natural to feel that, with a bit more research, the hypothesis testing framework could be adapted to solve any statistical problem.” As the recent replication crisis illustrates, we’re still living with the consequences of this war-inspired confidence.

– Psychotherapy. Between Freudian analysis from one direction, and Thorazine etc. from the other, it must have seemed to many that we were gradually solving the major problems of mental health.

– Keynesian economics. Just a matter of fine tuning, right?

– Various specific social engineering problems, such as traffic congestion, not enough houses or apartments where people want to live, paying for everyone’s health care, etc.: these were low-level concerns that, I imagine, many people assumed would solve themselves in due course as we gradually got richer.

As Palko notes, a common feature of all these examples is that a period of sustained success (in the case of Freudian analysis, success in the social realm even if not in the outcomes that matter) gave people the illusion that they were at the beginning or middle, rather than the end, of a long upward ramp. Also, in none of these cases was the techno-optimism universal; there were always skeptics. But these are all examples where an extreme optimism was, at least, considered an intellectually and socially respectable position.

What other examples do you have?

P.S. It’s mid-December in blogtime but this topic is so juicy, I’m posting it right away. I’ve bumped today’s scheduled post, “‘My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion. . .'”, to the end of the queue.

The necessity—and the difficulty—of admitting failure in research and clinical practice

Bill Jefferys sends along this excellent newspaper article by Siddhartha Mukherjee, “A failure to heal,” about the necessity—and the difficulty—of admitting failure in research and clinical practice. Mukherjee writes:

What happens when a clinical trial fails? This year, the Food and Drug Administration approved some 40 new medicines to treat human illnesses, including 13 for cancer, three for heart and blood diseases and one for Parkinson’s. . . . Yet the vastly more common experience in the life of a clinical scientist is failure: A pivotal trial does not meet its expected outcome. What happens then? . . .

The first thing you feel when a trial fails is a sense of shame. You’ve let your patients down. You know, of course, that experimental drugs have a poor track record — but even so, this drug had seemed so promising (you cannot erase the image of the cancer cells dying under the microscope). You feel as if you’ve shortchanged the Hippocratic oath. . . .

There’s also a more existential shame. In an era when Big Pharma might have macerated the last drips of wonder out of us, it’s worth reiterating the fact: Medicines are notoriously hard to discover. The cosmos yields human drugs rarely and begrudgingly — and when a promising candidate fails to work, it is as if yet another chemical morsel of the universe has been thrown into the Dumpster. The meniscus of disappointment rises inside you . . .

And then a second instinct takes over: Why not try to find the people for whom the drug did work? . . . This kind of search-and-rescue mission is called “post hoc” analysis. It’s exhilarating — and dangerous. . . . The reasoning is fatally circular — a just-so story. You go hunting for groups of patients that happened to respond — and then you turn around and claim that the drug “worked” on, um, those very patients that you found. (It’s quite different if the subgroups are defined before the trial. There’s still the statistical danger of overparsing the groups, but the reasoning is fundamentally less circular.) . . .

Perhaps the most stinging reminder of these pitfalls comes from a timeless paper published by the statistician Richard Peto. In 1988, Peto and colleagues had finished an enormous randomized trial on 17,000 patients that proved the benefit of aspirin after a heart attack. The Lancet agreed to publish the data, but with a catch: The editors wanted to determine which patients had benefited the most. Older or younger subjects? Men or women?

Peto, a statistical rigorist, refused — such analyses would inevitably lead to artifactual conclusions — but the editors persisted, declining to advance the paper otherwise. Peto sent the paper back, but with a prank buried inside. The clinical subgroups were there, as requested — but he had inserted an additional one: “The patients were subdivided into 12 … groups according to their medieval astrological birth signs.” When the tongue-in-cheek zodiac subgroups were analyzed, Geminis and Libras were found to have no benefit from aspirin, but the drug “produced halving of risk if you were born under Capricorn.” Peto now insisted that the “astrological subgroups” also be included in the paper — in part to serve as a moral lesson for posterity.

I actually disagree with Peto—not necessarily for that particular study, but considering the subgroup problem more generally. I mean, sure, I agree that raw comparisons can be noisy, but with a multilevel model it should be possible to study lots of comparisons and just partially pool these toward zero.
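To sketch what this partial pooling looks like, here is a minimal empirical-Bayes version of the idea applied to simulated subgroup estimates. All the numbers below are hypothetical, and a full multilevel model would do this within the regression itself; this is just the shrinkage mechanism in isolation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 12 subgroups (think: zodiac signs) whose true effects are all
# zero, each estimated with sampling noise of standard error 1.
n_groups = 12
se = 1.0
estimates = rng.normal(0.0, se, n_groups)

# Empirical-Bayes shrinkage: estimate the between-group variance by the
# method of moments, then partially pool each raw estimate toward the
# grand mean.
grand_mean = estimates.mean()
between_var = max(estimates.var(ddof=1) - se**2, 0.0)
shrinkage = between_var / (between_var + se**2)  # 0 = full pooling
pooled = grand_mean + shrinkage * (estimates - grand_mean)

# Some raw estimates will look "significant" by chance; the partially
# pooled estimates are pulled toward zero, reflecting the noise.
print("largest raw estimate:   ", round(estimates.max(), 2))
print("largest pooled estimate:", round(pooled.max(), 2))
```

The point is that a multilevel model lets you look at all twelve comparisons at once: genuine signal survives the pooling, while chance fluctuations get shrunk away.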

That said, I agree with the author’s larger point that it would be good if researchers could just admit that sometimes an experiment is just a failure, that their hypothesis didn’t work and it’s time to move on.

I recently encountered an example in political science where the researcher had a preregistered hypothesis, did the experiment, and the result was in the wrong direction and not statistically significant: a classic null finding. But the researcher didn’t give up, instead reporting that the result was statistically significant at the 10% level, explaining that the wrong-direction result was also consistent with theory, and reporting some interactions as well. That’s a case where the appropriate multilevel model would’ve partially pooled everything toward zero; alternatively, Peto’s just-give-up strategy would’ve been fine too. Or, not giving up but being clear that your claims are not strongly supported by the data: that’s ok too. But it was not ok to claim strong evidence in this case; that’s people using statistical methods to fool themselves.

To return to Mukherjee’s article:

Why do we do it then? Why do we persist in parsing a dead study — “data dredging,” as it’s pejoratively known? One answer — unpleasant but real — is that pharmaceutical companies want to put a positive spin on their drugs, even when the trials fail to show benefit. . . .

The less cynical answer is that we genuinely want to understand why a medicine doesn’t work. Perhaps, we reason, the analysis will yield an insight on how to mount a second study — this time focusing the treatment on, say, just men over 60 who carry a genetic marker. We try to make sense of the biology: Maybe the drug was uniquely metabolized in those men, or maybe some physiological feature of elderly patients made them particularly susceptible.

Occasionally, this dredging will indeed lead to a successful follow-up trial (in the case of O, there’s now a new study focused on the sickest patients). But sometimes, as Peto reminds us, we’ll end up chasing mirages . . .

I think Mukherjee’s right: it’s not all about cynicism. Researchers really do believe. The trouble is that raw estimates selected on statistical significance give biased estimates (see section 2.1 of this paper). To put it another way: if you have the longer-term goal of finding interesting avenues to pursue for future research, that’s great—and the way to do this is not to hunt for “statistically significant” differences in your data, but rather to model the entire pattern of your results. Running your data through a statistical significance filter is just a way to add noise.
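The bias from the significance filter is easy to demonstrate in a small simulation. The effect size and standard error below are hypothetical; the point is only that the true effect is small relative to the noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a modest true effect measured with a lot of noise.
true_effect, se, n_sims = 0.5, 1.0, 100_000
estimates = rng.normal(true_effect, se, n_sims)

# Keep only the "statistically significant" results (|z| > 1.96).
significant = estimates[np.abs(estimates / se) > 1.96]

print("true effect:                  ", true_effect)
print("mean of all estimates:        ", round(estimates.mean(), 2))
print("mean of significant estimates:", round(significant.mean(), 2))
# The significance filter selects the luckiest draws, so the surviving
# estimates greatly overstate the true effect (a type M error).
```

The unfiltered estimates are unbiased; the filtered ones overstate the true effect severalfold, which is exactly why chasing “statistically significant” subgroups adds noise rather than insight.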

Wolfram Markdown, also called Computational Essay

I was reading Stephen Wolfram’s blog and came across this post:

People are used to producing prose—and sometimes pictures—to express themselves. But in the modern age of computation, something new has become possible that I’d like to call the computational essay.
I [Wolfram] have been working on building the technology to support computational essays for several decades, but it’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. . . .

There are basically three kinds of things here. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these all work together to express what’s being communicated. . . .

But what really makes this work is the Wolfram Language—and the succinct representation of high-level ideas that it provides, defining a unique bridge between human computational thinking and actual computation and knowledge delivered by a computer. . . .

Computational essays are great for students to read, but they’re also great for students to write. Most of the current modalities for student work are remarkably old. Write an essay. Give a math derivation. These have been around for millennia. Not that there’s anything wrong with them. But now there’s something new: write a computational essay. And it’s wonderfully educational.

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. The computer acts like a kind of intellectual exoskeleton, letting you immediately marshall vast computational power and knowledge. But it’s also an enforcer of understanding. Because to guide the computer through the story you’re trying to tell, you have to understand it yourself. . . .

Wolfram gives some examples, and it looks pretty cool. It also looks just like R Markdown, IPython notebooks, Jupyter notebooks, and various other documents of this sort.

So I posted a comment:

You write, “what really makes this work is the Wolfram Language.” I’m confused. How is your computational essay different from what you’d get from Markdown, iPython notebooks, etc.?

Which elicited the following response:

The Wolfram Language underpins our technology stack, creating one cohesive process. Our notebooks offer a complete production environment from data import, prototyping, manipulation, and testing, which can all be done in the same notebook as deploying to the cloud or generating final reports or presentations. It doesn’t need to be bundled with separate tools in the same way that iPython notebooks need to be, making our notebooks better suited for computational essays, since you don’t need to set up your notebook to use particular libraries.

On top of that, since our language unifies our tech stack, there’s no issue with mutually incompatible interpreters like with Python, and even when you move to Jupyter, which can take on multiple language kernels, it becomes increasingly difficult to ensure you have every needed package for each particular language. Markdown is useful in a notebook, but Markdown in iPython notebooks isn’t collapsible, and that makes them difficult to structure.

One additional key feature is our built-in knowledge base. Across thousands of domains, the Knowledgebase contains carefully curated expert knowledge directly derived from primary sources. It includes not only trillions of data elements, but also immense numbers of algorithms encapsulating the methods and models of almost every field.

I don’t quite understand all this but it sounds like they’re saying that computational essays are particularly easy to do if you’re already working in the Wolfram language. I don’t really have a sense of how many people use Wolfram for statistics, or would like to do so. Should we have a Wolfram interface for Stan?

Regarding the “computational essay” thing: my inclination would be to call these “Wolfram markdown” or “Wolfram notebooks,” by analogy to R markdown and Python notebooks. On the other hand, there’s no standard terminology (is it “markdown” or a “notebook”?) so maybe it makes sense for a third term (“computational essay”) to be added to the mix.

Oxycontin, Purdue Pharma, the Sackler family, and the FDA

I just read this horrifying magazine article by Patrick Radden Keefe: The Family That Built an Empire of Pain: The Sackler dynasty’s ruthless marketing of painkillers has generated billions of dollars—and millions of addicts.

You really have to read the whole thing, because it’s just one story after another of bad behavior, people getting rich off others’ misfortunes.

But there was one thing that caught my eye, after the profiteering and the astroturf lobbying and the million-dollar lawsuits and the sleazy public relations and the deceptive advertising and the planted stories in the trade press:

Purdue, facing a shrinking market and rising opprobrium, has not given up the search for new users. In August, 2015, over objections from critics, the company received F.D.A. approval to market OxyContin to children as young as eleven.

On one hand, this seems just terrible. But I guess the point is that Purdue Pharma can act strategically, trying out different plans to increase the number of addicts and sell more pills; but regulators are required to consider each decision on its own merits. And, after all, if there were no potential for abuse, maybe it would make sense to approve, and even market, this opioid for use by children. It’s an interesting asymmetry.

P.S. Full disclosure: some of my research is supported by pharmaceutical companies.

“Choose the data visualization that best serves your audience.”

Tian Zheng prepared the above slide which very clearly displays an important point about statistical communication.

The maps are squished too narrow, and the scatterplot has too many numbers on the axes (better to show income in thousands and percentages in tens). Also, given the numbers, the data must be pretty old—but maybe that’s part of the point: the principle of choosing among different sorts of data display is so general that it works even if the component graphs have problems.

“Human life is unlimited – but short”

Holger Rootzén and Dmitrii Zholud write:

This paper studies what can be inferred from data about human mortality at extreme age. We find that in western countries and Japan and after age 110 the risk of dying is constant and is about 47% per year. Hence data does not support that there is a finite upper limit to the human lifespan. Still, given the present stage of biotechnology, it is unlikely that during the next 25 years anyone will live longer than 128 years in these countries. Data, remarkably, shows no difference in mortality after age 110 between sexes, between ages, or between different lifestyles or genetic backgrounds.

This relates to our recent discussion, “No no no no no on ‘The oldest human lived to 122. Why no person will likely break her record.'”

I’ve not looked at Rootzén and Zholud’s article in detail, nor have I tried to evaluate their claims, but their general approach seems reasonable to me. The rule of thumb of a 50% mortality per year at the highest ages is interesting.

That said, I doubt that we can take this constant hazard rate too seriously. My guess is that an empirically constant hazard rate is an overlay of two opposing phenomena: On one hand, any individual person gets weaker and weaker so I’d expect his or her conditional probability of death to increase with age. On the other hand, any group of people is a mixture of the strong and the weak, which induces an inferential (“spurious,” as Feller called it) correlation. Maybe these two patterns happen to roughly cancel out and give a constant hazard rate in these data.
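This mixture story can be sketched in a few lines of simulation. Everything below is hypothetical: each individual’s hazard rises with age, Gompertz-style, scaled by a person-specific “frailty,” yet with a suitably dispersed frailty distribution the population-level hazard comes out roughly constant:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical frailty mixture: mean 1, mixing the strong and the weak.
n = 200_000
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)

# Individual hazard: h_i(t) = frailty_i * a * exp(b * t), rising with age.
a, b = 0.2, 0.1
ages = np.arange(0.0, 21.0)  # years past some high age, say 110

# Individual survival: S_i(t) = exp(-frailty_i * (a/b) * (exp(b*t) - 1)).
cum = (a / b) * (np.exp(b * ages) - 1.0)
surv = np.exp(-np.outer(frailty, cum)).mean(axis=0)  # population survival

# Population hazard per year = -d(log S)/dt. Every individual's hazard is
# increasing, but the frail die off first, flattening the aggregate curve.
pop_hazard = -np.diff(np.log(surv))
print(pop_hazard.round(3))
```

Each simulated person’s risk of death grows every year, but selective attrition of the high-frailty individuals cancels the growth, and the aggregate hazard sits near a constant, just the kind of cancellation conjectured above.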

Average predictive comparisons and the All Else Equal fallacy

Annie Wang writes:

I’m a law student (and longtime reader of the blog), and I’m writing to flag a variant of the “All Else Equal” fallacy in ProPublica’s article on the COMPAS Risk Recidivism Algorithm. The article analyzes how statistical risk assessments, which are used in sentencing and bail hearings, are racially biased. (Although this article came out a while ago, it’s been recently linked to in this and this NYT op-ed.)

ProPublica posts a github repo with the data and replication code. I wanted to flag this part of the analysis:

  • The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.
  • The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.

The basic method is to build a logistic regression model with the score as the outcome and race and a few other demographic variables as the independent variables. (You can also reasonably argue that a logistic regression without any interaction terms is not the best way to analyze this data, but for the moment, I’ll just stick within the authors’ approach.)

Here’s the problem: to arrive at the numbers above, they compare an intercept-only model vs. intercept + African-American Indicator model. (See Cell 16-17 of the original analysis)

But since it’s a logistic regression, the marginal effect of being African-American isn’t captured by the coefficient alone. Instead, they calculate the marginal effect of being African-American with all the other factors set to 0, i.e., it’s a comparison among White and African-American males, between age 25-45, with zero priors, with zero recidivism within the last two years, and with a particular severity of crime.

Fewer than 5% of the entire dataset meets these specifications in the first analysis and it’s only 7% in the second, so the statistical result reported is really only applicable for a small portion of the population.

If you calculate marginal effects over the entire dataset, taking into account men and women, all ages, and the full distribution of prior crimes, severity, and recidivism, those numbers are more modest:

  • The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 20 percent (not 45 percent, as reported) more likely to be assigned higher risk scores than white defendants.
  • The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 33 percent (not 77 percent, as reported) more likely to be assigned higher risk scores than white defendants.

This doesn’t change the piece’s overall argument, but some of these claims seem a little misleading in light of the actual comparison being made. My full analysis here (written for an undergraduate who’s taken a first course on statistics):

Curious to get your take here. I emailed the authors of this article, who responded with “Very interesting and informative. We were advised that our way of reporting is standard practice.”

My reply: Without getting into any of the specifics (not because I disagree with the above argument but just because I don’t have the energy to try to evaluate the details), I’ll say that this reminds me a lot of my paper with Iain Pardoe on average predictive comparisons for models with nonlinearity, interactions, and variance components. The key point is that predictive comparisons depend in general on the values of the other variables in the model, and if you want some sort of average number, you have to think a bit about what to average over. I hadn’t thought of the connection to the All Else Equal fallacy but that’s an interesting point.
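To illustrate the general point (this is not ProPublica’s actual model), here is a toy logistic setup in which the comparison evaluated at baseline covariate values differs from the comparison averaged over the data. All coefficients and variables below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: a group indicator of interest plus one covariate.
n = 10_000
x = rng.poisson(2, n)                    # e.g., number of prior offenses
beta0, beta_g, beta_x = -2.0, 0.5, 0.4   # hypothetical logistic coefficients

def prob(g, x_):
    """P(high score) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta_g * g + beta_x * x_)))

# Predictive comparison at the baseline (all other covariates set to 0):
baseline = prob(1, 0) / prob(0, 0)

# Average predictive comparison: average the predictions over the observed
# covariate distribution before taking the ratio.
average = prob(1, x).mean() / prob(0, x).mean()

print("risk ratio at baseline covariates:", round(baseline, 2))
print("risk ratio averaged over the data:", round(average, 2))
# The two summaries differ because the logistic model is nonlinear: which
# covariate values you hold "all else equal" at changes the answer.
```

In this toy example the baseline-covariates comparison is larger than the averaged one, the same direction as the discrepancy Wang describes, though the magnitudes here are arbitrary.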

Forking paths come from choices in data processing and also from choices in analysis

Michael Wiebe writes:

I’m a PhD student in economics at UBC. I’m trying to get a good understanding of the garden of forking paths, and I have some questions about your paper with Eric Loken.

You describe the garden of forking paths as “researcher degrees of freedom without fishing” (#3), where the researcher only performs one test. However, in your example of partisan differences in math skills, you discuss the multiple potential comparisons that could be made: an effect for men and not women, an effect for women and not men, a significant difference, etc. I would describe this as multiple testing: the researcher is running many regressions, and reporting the significant ones. Am I misunderstanding?

The case where the researcher only performs one test is when the degrees of freedom come only from data processing. For example, the researcher only tests for a significant difference between men and women, but because they have flexibility in measuring partisanship, classifying independents, etc, they can still run multiple versions of the same test and find significance that way.

So we can classify researcher degrees of freedom as coming from (1) multiple potential comparisons and (2) flexibility in data processing. In the extreme case, the degrees of freedom come only from (2), and the researcher only performs one test. But that doesn’t seem to be how you use the term “garden of forking paths” in practice.

My reply:

– You point to an example of multiple potential comparisons and write that you “would describe this as multiple testing: the researcher is running many regressions, and reporting the significant ones.” I’d say it’s multiple potential testing: the researcher might perform one analysis, but he or she gets to choose which analysis to do, based on the data. For example, the researcher notices a striking pattern among men but not women, and so performs that comparison, computes the significance level, etc. Later on, someone else points to the other comparisons that could’ve been done, and the original researcher replies, “No, I only did one comparison, so I couldn’t’ve been p-hacking.” Loken and I would reply that, as long as the analysis is affected by the data that were seen, there’s a multiple potential comparisons problem, even if only one particular comparison was done on the particular data at hand.
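A quick simulation shows how this plays out even when only one test is run on the data at hand. Suppose there is no true effect for either men or women, but the researcher computes whichever of the two comparisons looks more striking:

```python
import numpy as np

rng = np.random.default_rng(4)

# Under the null, the z-statistics for the men-only and women-only
# comparisons are independent standard normals.
n_sims = 50_000
z = rng.normal(0, 1, size=(n_sims, 2))  # columns: men, women

# The researcher reports only the more "striking" subgroup comparison,
# then tests it at the nominal two-sided 5% level.
chosen = np.abs(z).max(axis=1)
error_rate = (chosen > 1.96).mean()

print("nominal error rate: 0.05")
print("actual error rate: ", round(error_rate, 3))
# Picking the larger of two |z| statistics gives a false-positive rate of
# 1 - 0.95**2 = 0.0975, nearly double the nominal level, even though only
# one test was ever performed on the data.
```

Only one comparison is computed and reported, yet the data-dependent choice between the two potential comparisons is enough to roughly double the false-positive rate, which is the forking-paths point.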

– You distinguish between choices in the data analysis and choices in the data processing. I don’t see these as being much different; either way, you have researcher degrees of freedom, and both sets of choices give you forking paths.

– Finally, let me emphasize that my preferred solution is not to perform just one, preregistered, comparison, nor is it to take the most extreme comparison and then perform a multiplicity correction. Rather, I recommend analyzing and presenting the grid of all relevant comparisons, ideally combining them in a multilevel model.