
I didn’t say that! Part 2

Uh oh, this is getting kinda embarrassing.

The Garden of Forking Paths paper, by Eric Loken and myself, just appeared in American Scientist. Here’s our manuscript version (“The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time”), and here’s the final, trimmed and edited version (“The Statistical Crisis in Science”) that came out in the magazine.

Russ Lyons read the published version and noticed the following sentence, actually the second sentence of the article:

Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.

How horrible! Russ correctly noted that the above statement is completely wrong, on two counts:

1. To the extent the p-value measures “confidence” at all, it would be confidence in the null hypothesis, not confidence in the data.

2. In any case, the p-value is not not not not not “the probability that a perceived result is actually the result of random variation.” The p-value is the probability of seeing something at least as extreme as the data, if the model (in statistics jargon, the “null hypothesis”) were true.
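To make the definition concrete, here is a minimal simulation sketch (with hypothetical numbers, not from any study discussed here) of what a p-value actually computes: the probability of data at least as extreme as what was observed, assuming the null hypothesis is true.

```python
import random

random.seed(0)

# Hypothetical example: we observe 60 heads in 100 coin flips and ask
# how surprising that is under the null hypothesis of a fair coin.
observed_heads = 60
n_flips = 100
n_sims = 10_000

# Simulate the null hypothesis many times, counting simulated datasets
# at least as extreme (two-sided: as far or farther from the expected
# 50 heads) as the observed data.
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_sims
# This is Pr(data at least this extreme | null is true). It is NOT
# "the probability that the result is due to random variation."
print(round(p_value, 3))
```

Note that the conditioning runs one way only: the simulation assumes the null is true and asks how often such data would arise, which is not the same as the probability that the null is true given the data.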

How did this happen?

The editors at American Scientist liked our manuscript but it was too long, also parts of it needed explaining for a nontechnical audience. So they cleaned up our article and added bits here and there. This is standard practice at magazines. It’s not just Raymond Carver and Gordon Lish.

Then they sent us the revised version and asked us to take a look. They didn’t give us much time. That too is standard with magazines. They have production schedules.

We went through the revised manuscript but not carefully enough. Really not carefully enough, given that we missed a glaring mistake—two glaring mistakes—in the very first paragraph of the article.

This is ultimately not the fault of the editors. The paper is our responsibility and it’s our fault for not checking the paper line by line. If it was worth writing and worth publishing, it was worth checking.

P.S. Russ also points out that the examples in our paper all are pretty silly and not of great practical importance, and he wouldn’t want readers of our article to get the impression that “the garden of forking paths” is only an issue in silly studies.

That’s a good point. The problems of nonreplication, etc., affect all sorts of science involving human variation. For example, there is a lot of controversy about something called “stereotype threat,” a phenomenon that is important if real. For another example, these problems have arisen in studies of early childhood intervention and the effects of air pollution. I’ve mentioned all these examples in talks I’ve given on this general subject; they just didn’t happen to make it into this particular paper. I agree that our paper would’ve been stronger had we mentioned some of these unquestionably important examples.

In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons

On deck this week

Tues: In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons

Wed: The Fault in Our Stars: It’s even worse than they say

Thurs: Buggy-whip update

Fri: The inclination to deny all variation

Sat: Hoe noem je?

Sun: “Your Paper Makes SSRN Top Ten List”

10th anniversary of “Statistical Modeling, Causal Inference, and Social Science”

Richard Morey pointed out the other day that this blog is 10 years old!

During this time, we’ve had 5,688 posts, 48,799 comments, and who knows how many readers.

On this tenth anniversary, I’d like to thank my collaborators on all the work I’ve blogged, my co-bloggers (“This post is by Phil”), our commenters, Alex Tabarrok for linking to us way back when, and also the many many people who’ve pointed us to interesting research, interesting graphs, bad research, bad graphs, and links to the latest stylings of David Brooks and Satoshi Kanazawa.

It’s been fun, and I think this blog has been (and I hope will remain) an excellent communication channel on all sorts of topics, statistical and otherwise. Through the blog I’ve met friends, colleagues, and collaborators (including some, such as Basbøll and Palko, whom I’ve still not yet met!); I’ve been motivated to think hard about ideas that I otherwise wouldn’t have encountered; and I’m pretty sure I’ve motivated many people to examine ideas that they otherwise would not have thought seriously about.

The blog has been enlivened with a large and continuing cast of characters, including lots of “bad guys” such as . . . well, no need to list these people here. It’s enough to say they’ve provided us with plenty of entertainment and food for thought.

We’ve had some epic comment threads and enough repeating topics that we had to introduce the Zombies category. We’ve had comments or reactions from culture heroes including Gerd Gigerenzer, Judea Pearl, Helen DeWitt, and maybe even Scott Adams (but we can’t be sure about that last one). We’ve had fruitful exchanges with other researchers such as Christian Robert, Deborah Mayo, and Dan Kahan who have blogs of their own, and, several years back, we launched the internet career of the late Seth Roberts.

Here are the titles of the first five posts from our blog (in order):

A weblog for research in statistical modeling and applications, especially in social sciences

The Electoral College favors voters in small states

Why it’s rational to vote

Bayes and Popper

Overrepresentation of small states/provinces, and the USA Today effect

As you can see, some of our recurrent themes showed up early on.

Here are the next five:

Sensitivity Analysis of Joanna Shepherd’s DP paper

Unequal representation: comments from David Samuels

Problems with Heterogeneous Choice Models

Morris Fiorina on C-SPAN

A fun demo for statistics class

And the ten after that:

Red State/Blue State Paradox

Statistical issues in modeling social space

2 Stage Least Squares Regression for Death Penalty Analysis

Partial pooling of interactions

Bayesian Methods for Variable Selection

Reference for variable selection

The blessing of dimensionality

Why poll numbers keep hopping around by Philip Meyer

Matching, regression, interactions, and robustness

Homer Simpson and mixture models

(Not all these posts are by me.)

See you again in 2024!

“Illinois chancellor who fired Salaita accused of serial self-plagiarism.”

I came across a couple of stories today that made me wonder how much we can learn from a scholar’s professional misconduct.

The first was a review by Kimberle Crenshaw of a book by Joan Biskupic about Supreme Court justice Sonia Sotomayor. Crenshaw makes the interesting point that Sotomayor, like many political appointees of the past, was chosen in part because of her ethnic background, but that unlike various other past choices (for example, Antonin Scalia, the first Italian American on the court), “Sotomayor’s ethnicity is still viewed [by many] with skepticism.”

I was reminded of Laurence “ten-strike” Tribe’s statement that Sotomayor is “not nearly as smart as she seems to think she is,” a delightfully paradoxical sentence that one could imagine being said by Humpty Dumpty or some other Lewis Carroll character. More to the point, Tribe got caught plagiarizing a few years ago.

So here’s the question. Based on the letter where the above quote appears, Tribe seems to consider himself to be pretty smart (smarter than Sotomayor, that’s for sure). But, from my perspective, what kind of smart person plagiarizes? Not a very smart person, right?

But maybe I’m completely missing the point. If some of the world’s best athletes are doping, maybe some of the world’s best scholars are plagiarizing? It’s hard for me to wrap my head around this one. Also, in fairness to Tribe, he’s over 70 years old. Maybe he used to be smart when he was younger.

The second story came to me via an email from John Transue who pointed me to a post by Ali Abunimah, “Illinois chancellor who fired Salaita accused of serial self-plagiarism.” I had to follow some links to see what was going on here: apparently there was a professor who got fired after pressure on the university from a donor.

I hadn’t heard of Stephen Salaita (the prof who got fired) or Phyllis Wise (the University of Illinois administrator who apparently was in charge of the process), but apparently there’s some controversy about her publication record from her earlier career as a medical researcher.

It looks like a simple case of Arrow’s theorem, that any result can be published at most five times. Wise seems to have published the particular controversial paper only three times, so she has two freebies to go.

As I discussed a couple years ago (click here and scroll down to “It’s 1995”), in some places Arrow’s theorem is such a strong expectation that you’re penalized if you don’t publish several versions of the same paper.

But, to get back to the main thread here: to what extent does Wise’s unscholarly behavior—and it is definitely unscholarly and uncool to copy your old papers without making clear the source; even if it’s not as bad as many other academic violations, it’s something you shouldn’t do, and it demonstrates an ethical lapse or a level of sloppiness extreme enough to cast doubt on one’s scholarship—to what extent should this lead us to mistrust her other decisions, in this case in her role as university administrator?

In some sense this doesn’t matter at all: Wise could’ve been the most upstanding, rule-following scientist of all time and the supporters of Salaita would still be strongly disagreeing with her decision and the process used to make it (just as we can all give a hearty laugh at Laurence Tribe’s obnoxiousness, even if he’d never in his life put his name on someone else’s writing).

Or maybe it is relevant, in that Wise’s disregard for the rules in science might be matched by her disregard for the rules in administration. And Tribe’s diminished capacities as a scholar, as revealed by his plagiarism, might lead one to doubt his judgment of the intellectual capacities of his colleagues.

P.S. A vocal segment of our readership gets annoyed when I write about plagiarism. I continue to insist that my thoughts in this area have scholarly value (see here and here, for example, and that latter article even appeared in a peer-reviewed journal!), but I am influenced by the judgments of others, and so I do feel a little bad about these posts, so I’ve done you all a favor by posting this one late at night on a weekend when nobody will be reading. So there’s that.

Science tells us that fast food lovers are more likely to marry other fast food lovers


Emma Pierson writes:

I’m a statistician working at the genetics company 23andMe before pursuing a master’s in statistics at Oxford on a Rhodes scholarship. I’ve really enjoyed reading your blog, and we’ve been doing some social science research at 23andMe which I thought might be of interest. We have about half a million customers answering thousands of survey questions on everything from homosexuality to extroversion to infidelity, which as you can imagine produces an interesting dataset.

1. We found that customers who answer our survey questions in the middle of the night are significantly less happy and significantly more likely to be manic. See here and here.


2. Using genetic data, we identified 15,000 couples along with the child they had had together. We showed that couples tended to be similar — 97% of traits showed a positive correlation between woman and man, even when we controlled for race and age — although there were often intriguing exceptions.

At this point Pierson shows an info graphic that says that “punctual people,” “skiers,” “hikers,” “non-smokers,” “fast food lovers,” and “apology prone people” are more likely to marry each other, while, in contrast, “early birds” are more likely to marry “night owls,” and “human GPSs” are more likely to marry “constant wrong turners.”

She continues:

We also showed that couples who were dissimilar in terms of BMI or age tended to be less happy (even when we controlled for individual BMI + age). See here and here.

What I’d really like to see is the full list. I’m not so interested in learning that skiers and hikers are likely to marry each other, but if I could see an (organized) list of all the traits they look at, this could be interesting.

When am I a conservative and when am I a liberal (when it comes to statistics, that is)?


Here I am one day:

Let me conclude with a statistical point. Sometimes researchers want to play it safe by using traditional methods — most notoriously, in that recent note by Michael Link, president of the American Association of Public Opinion Research, arguing against non-probability sampling on the (unsupported) grounds that such methods have “little grounding in theory.” But in the real world of statistics, there’s no such thing as a completely safe method. Adjusting for party ID might seem like a bold and risky move, but, based on the above research, it could well be riskier to not adjust.

I’ve written a lot about the benefits of overcoming the scruples of traditionalists and using (relatively) new methods, specifically Bayesian multilevel models, to solve problems (such as estimation of public opinion in small subgroups of the population) that would otherwise be either impossible or would be done on a completely sloppy, ad hoc basis.

On the other hand, sometimes I’m a conservative curmudgeon, for example in my insistence that claims about beauty and sex ratios, or menstrual cycles and voting, are bogus (or, to be precise, that those claims are pure theoretical speculation, unsupported by the data that purport to back them up).

What’s the deal? How to resolve this? One way to get a handle on this is in each case to think about the alternative. The balance depends on the information available in the problem at hand. In a sense, I’m always a curmudgeonly conservative (as in that delightful image of Grampa Simpson above) in that I’m happy to use prior information and I don’t think I should defer to whatever piece of data happens to be in front of me.

This is the point that Aleks Jakulin and I made in our article, “Bayes: radical, liberal, or conservative?”

Consider the polling scene, where I’m a liberal or radical in wanting to use non-probability sampling (gasp!). But, really, this stance of mine has two parts:

1 (conservative): I don’t particularly trust raw results from probability sampling, as the nonresponse rate is so high and so much adjustment needs to be done to such surveys anyway.

2 (liberal): I think with careful modeling we can do a lot more than just estimate toplines and a few crosstabs.

Now consider those junk psychology studies that get published in tabloid journals based on some flashy p-values. Again, I have two stances:

1 (conservative): Just cos someone sees a pattern in 100 online survey responses, I don’t see this as strong evidence for a pattern in the general population, let alone as evidence for a general claim about human nature or biology or whatever.

2 (liberal): I’m open to the possibility that there are interesting patterns to be discovered, and I recommend careful measurement and within-subject designs to estimate these things accurately.

Varieties of description in political science

“Science does not advance by guessing”

I agree with Deborah Mayo who agrees with Carlo Rovelli that “Science does not advance by guessing. It advances by new data or by a deep investigation of the content and the apparent contradictions of previous empirically successful theories.”

And, speaking as a statistician and statistical educator, I think there’s a big problem with the usual treatment of statistics and scientific discovery in statistics articles and textbooks, in that the usual pattern is for the theory and experimental design to be airlifted in from who-knows-where and then the statistical methods are just used to prove (beyond some reasonable doubt) that the theory is correct, via a p-value or a posterior probability or whatever. As Seth pointed out many times, this skips the key question of where the theory came from, and in addition it skips the almost-as-important question of how the study is designed.

I do have a bit of a theory of where theories come from, and that is from anomalies: in a statistical sense, predictions from an existing model that do not make sense or that contradict the data. We discuss this in chapter 6 of BDA, and Cosma Shalizi and I frame it from a philosophical perspective in our philosophy paper. Or, for a completely non-technical, “humanistic,” take, see my paper with Thomas Basbøll on the idea that good stories are anomalous and immutable.

The idea is that we have a tentative model of the world, and we push that model, and gather data, and find problems with the model, and it is the anomalies that motivate us to go further.

The new theories themselves, though, where do they come from? That’s another question. It seems to me that new theories often come via analogies from other fields (or from other subfields, within physics, for example). At this point I think I should supply some examples but I don’t quite have the energy.

My real point is that sometimes it does seem like science advances by guessing, no? At least, retrospectively, it seems like Bohr, Dirac, etc., just kept guessing different formulations and equations and pushing them forward and getting results. Or, to put it another way, these guys did do “deep investigation of the content and the apparent contradictions of previous empirically successful theories.” But then they guessed too. But their guesses were highly structured, highly constrained. The guesses of Dirac etc. were mathematically sophisticated, not the sort of thing that some outsider could’ve come up with.

How does this relate to, say, political science or economics? I’m not sure. I do think that outsiders can make useful contributions to these fields but there does need to be some sense of the theoretical and empirical constraints.

When there’s a lot of variation, it can be a mistake to make statements about “typical” attitudes

This story has two points:

1. There’s a tendency for scientific results to be framed in absolute terms (in psychology, this corresponds to general claims about the population) but that can be a mistake in that sometimes the most important part of the story is variation; and

2. Before getting to the comparisons, it can make sense to just look at the data.

Here’s the background. I came across a post by Leif Nelson, who wrote:

Recently Science published a paper [by Timothy Wilson, David Reinhard, Erin Westgate, Daniel Gilbert, Nicole Ellerbeck, Cheryl Hahn, Casey Brown, and Adi Shaked] concluding that people do not like sitting quietly by themselves. . . .

The reason I [Nelson] write this post is that upon analyzing the data for those studies, I arrived at an inference opposite the authors’. They write things like:

Participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think. (abstract)

It is surprisingly difficult to think in enjoyable ways even in the absence of competing external demands. (p.75, 2nd column)

The untutored mind does not like to be alone with itself (last phrase).

But the raw data point in the opposite direction: people reported to enjoy thinking. . . .

In the studies, people sit in a room for a while and then answer a few questions when they leave, including how enjoyable, how boring, and how entertaining the thinking period was, in 1-9 scales (anchored at 1 = “not at all”, 5 = “somewhat”, 9 = “extremely”). Across the nine studies, 663 people rated the experience of thinking, the overall mean for these three variables was M=4.94, SD=1.83 . . . Which is to say, people endorse the midpoint of the scale composite: “somewhat boring, somewhat entertaining, and somewhat enjoyable.”

Five studies had means below the midpoint, four had means above it.

I see no empirical support for the core claim that “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves.”

Here are the data:


Nelson writes:

Out of 663 participants, MOST (69.6%) [or, as we would say in statistics, "70%" --- ed.] said that the experience was somewhat enjoyable or better.

If I were trying out a new manipulation and wanted to ensure that participants typically DID enjoy it, I would be satisfied with the distribution above. I would infer people typically enjoy being alone in a room with nothing to do but think.

Nelson concludes:

If readers think that the electric shock finding is interesting conditional on the (I think, erroneous) belief that it is not enjoyable to be alone in thought, then the finding is surely even more interesting if we instead take the data at face value: Some people choose to self-administer an electric shock despite enjoying sitting alone with their thoughts.

He also asked the authors if they had any comments on his reaction that their paper showed a finding opposite to what they’d claimed, and the authors sent him a reply in which they wrote that they “were continually surprised by these results” which reminded me of our earlier discussion of how to interpret surprising results.

Reconciling the article and the criticism

It’s a challenge to go back and forth reading the original article, Nelson’s comments, and Wilson and Gilbert’s reply. I agree with Nelson that it seems incorrect to state that people did not enjoy being alone with their thoughts, given that more than two-thirds of the people in the study reported the experience to be “somewhat enjoyable” or better. On the other hand, Wilson and Gilbert point out that “The percentage who admitted cheating [doing other activities beyond just sitting and thinking] ranged from 32% to 54% . . . 67% of men and 25% of women opted to shock themselves rather than ‘just think’ . . .”

The resolution, I think, is that we have to avoid the tendency to think deterministically. There’s variation! As shown in the above histogram, some people reported thinking to be “not at all enjoyable,” some reported it to be “somewhat enjoyable,” and there were a lot of people in the middle. Given this, it’s not so helpful to make statements about what people “typically” enjoy (as in the abstract of the paper).

Finally, let me return to my original point about respecting the data. In their reply, Wilson and Gilbert write, “we believe the preponderance of the evidence does not favor Professor Nelson’s claim that most people in our studies enjoyed thinking.” Looking at the above graph, it all seems to depend on how you categorize the “somewhat enjoyable” response.

Perhaps it’s most accurate to say that (a) two-thirds of respondents find thinking to be at least somewhat enjoyable, and, at the same time, (b) two-thirds of respondents find thinking to be no more than somewhat enjoyable! The glass is both two-thirds empty (according to Wilson et al.) and two-thirds full (according to Nelson).
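The glass-two-thirds-empty-and-full point can be sketched numerically. Here are hypothetical counts on a 1–9 scale (invented for illustration, not the study’s actual data, though the total matches the reported n = 663), showing how both cutoffs can exceed half when the midpoint bin (“somewhat” = 5) is large:

```python
# Hypothetical counts on a 1-9 enjoyment scale (NOT the actual study data),
# with a large "somewhat" midpoint bin at 5.
counts = {1: 40, 2: 45, 3: 60, 4: 75, 5: 220, 6: 90, 7: 70, 8: 40, 9: 23}
n = sum(counts.values())  # 663, matching the paper's sample size

# Fraction "at least somewhat enjoyable" (rating >= 5)
at_least_somewhat = sum(c for k, c in counts.items() if k >= 5) / n
# Fraction "no more than somewhat enjoyable" (rating <= 5)
no_more_than_somewhat = sum(c for k, c in counts.items() if k <= 5) / n

# Both fractions can exceed one-half, because the midpoint bin is
# counted in each: the two summaries overlap rather than contradict.
print(round(at_least_somewhat, 2), round(no_more_than_somewhat, 2))
```

With numbers like these, both Nelson’s reading and Wilson et al.’s reading are arithmetically true at once; the disagreement is entirely about which side of the midpoint bin you emphasize.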

P.S. Nelson credits the paper to Science, the journal where it is published. I think it’s more appropriate to credit the authors, so I’ve done it that way (see brackets in the first paragraph of quoted material above). The authors are the ones who do the work; the journal is just a vessel where it is published.

P.P.S. Zach wins the thread with this comment:

I enjoy thinking, but I can do that any time. Put me in a room with a way to safely shock myself and I’ll take the opportunity to experiment.