
Avoiding model selection in Bayesian social research

The other day I happened to come across this paper that I wrote with Don Rubin in 1995. I really like it—it’s so judicious and mature, I can’t believe I wrote it over 20 years ago!

Let this be a lesson to all of you that it’s possible to get somewhere by reasoning from first principles based on specific applications. This is not the only way to do research—I have a lot of respect for the “machine learning” approach of evaluating success on a corpus of data, or various more methodology-driven approaches—but I think much can be learned by studying how we kept pushing back and asked what were the goals of the statistical methods in question, and how we brought in theory where relevant (for example, in dismissing the claim that BIC is an approximation to the Bayes factor, and in dismissing the claim that BIC-chosen models have lower predictive error). When trying to understand and evaluate statistical methods, it’s often useful to step back and consider why they’re being used in the first place, partly to see to what extent the underlying goals are being achieved, and partly to understand how a method such as BIC can be useful in practice even if it doesn’t live up to the theoretical claims being made for it. One can be critical of specific claims while being generous in a larger sense.

In this particular paper, I think one breakthrough we made was to recognize that model selection was being used to solve two completely different problems: (1) Giving users permission to use a simple model that didn’t fit the data but explained most of the variance, in a large-sample-size setting, and (2) Giving users a simple way to regularize and do some sort of partial pooling, in a small-sample-size setting. Recognizing that these are two completely different problems gives us a way to think about moving forward on both of them, without overloading a crude tool such as BIC.

P.S. In rereading this paper that I absolutely love, I guess I can see how my colleagues at Berkeley didn’t want me around back then. They had no way of thinking about a paper such as this that wasn’t written in the Definition, Theorem, Proof format. It’s just a language thing. Even my colleagues such as David Donoho who, I assume, would have had the ability and maybe even the inclination to understand this work, wouldn’t’ve known what to make of it, because it was written in such a different style than their papers. Also it was published in a sociology journal so I guess it didn’t count. It was easy enough for them to just conclude that my work was no good and not try to figure out exactly what I was doing.

To be fair, I have difficulty reading papers written in the Definition, Theorem, Proof style: I usually have to ask the authors to explain to me what they are doing, or else I have to struggle through on my own in order to then re-interpret things in my own language, as here. Communication is difficult, and I guess it can be awkward to have someone around who speaks a different language. It’s funny that I went to Berkeley in the first place, but at the time I had the naive view that I could compel them to like me. I was able to read their work, after all (with effort), and I think I underrated the difficulty they would have in reading my work, or perhaps I overrated the value they would place on my work: had they thought it important, they would’ve put in the work to figure out what I was doing, but because they were told not to care about it, they didn’t bother. That’s all fine too, I suppose: had I really cared myself about explaining my ideas on model selection and averaging to the Definition, Theorem, Proof crowd, I could’ve written some papers in that format. But it wasn’t really necessary; they’re doing fine on their own path.

Put your own questions on the General Social Survey!

Tom Smith of the National Opinion Research Center writes:

The General Social Survey plans to include some items or short topical modules designed by users in its 2018 survey, and invites users to submit proposals recommending such items or modules. Proposals submitted in response to this call will be included based on assessments of their scientific merit; they need not be accompanied by funding to cover costs of data collection and data processing. The proposals are due by June 30, 2016.

See here for details.

Go for it! I used to be on the GSS external advisory board, and we loved reading these proposals.

Stan Coding Corner: O(N) Change-Point Program with Clever Forward-Backward Calculation

It’s so much fun to work in open source. Luke Wiklendt sent along this improved code for a change-point model calculation in Stan. With N data points in the time series, the version in the manual is O(N²), whereas the improved version is O(N). In practice, Luke says

[the new code] results in a dramatic speed up. On my machine 10k samples of the original took 8m53s and the new one took 33s.

Now that’s the kind of improvement I can get behind, both in theory and in practice. As soon as I understood his code, I slapped my head and yelled “D’oh!”. Of course there’s a linear algorithm, because it can be encoded as a simple state-space HMM. Everyone (who spent 25 years working on natural language processing and speech recognition) knows the HMM forward-backward algorithm is linear in N.

Quadratic version from Stan manual v2.9

Here’s my original version.

transformed parameters {
  vector[T] lp;
  lp <- rep_vector(log_unif, T);
  for (s in 1:T)
    for (t in 1:T)
      lp[s] <- lp[s] + poisson_log(D[t], if_else(t < s, e, l));
}

Linear version from Luke Wiklendt

And here's Luke's improved version, which he prefaces by saying

Given a vector v and two indices i and j, the sum from i to j is performed by v[i] + v[i+1] + ... + v[j-1] + v[j]. It can also be performed by precalculating the cumulative sum of v (call it cv, with cv[0] = 0), and then the sum from i to j is cv[j] - cv[i-1].

That's what the engineers call "dynamic programming."
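Luke's cumulative-sum trick is easy to verify directly. Here's a minimal Python sketch (the vector and indices are made up for illustration):

```python
import numpy as np

v = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Precompute cumulative sums with a leading zero, so cv[k] = v[0] + ... + v[k-1].
# The leading zero plays the same role as the lp_e[1] <- 0 initialization below.
cv = np.concatenate(([0.0], np.cumsum(v)))

def range_sum(i, j):
    """Sum of v[i..j] (0-based, inclusive) in O(1) after the O(N) precomputation."""
    return cv[j + 1] - cv[i]

print(range_sum(1, 3))  # 1 + 4 + 1 = 6.0
```

The O(N) precomputation pays for itself as soon as you need many range sums, which is exactly the change-point situation: one sum per candidate change point.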

transformed parameters {
    vector[T+1] lp_e;
    vector[T+1] lp_l;
    vector[T] lp;

    lp_e[1] <- 0;
    lp_l[1] <- 0;
    for (t in 1:T) {
        lp_e[t + 1] <- lp_e[t] + poisson_log(D[t], e);
        lp_l[t + 1] <- lp_l[t] + poisson_log(D[t], l);
    }

    lp <- rep_vector(log_unif + lp_l[T + 1], T) + head(lp_e, T) - head(lp_l, T);
}

The key for me was realizing this is just the forward-backward algorithm of HMMs. The forward values are lp_e and the backward values are lp_l[T+1] - lp_l, so lp is the complete forward+backward values (plus log_unif for the prior).
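As a sanity check outside Stan, here's a small Python sketch confirming that the O(N) cumulative-sum computation gives the same log-probabilities as the original double loop (the rates, segment lengths, and seed are invented for illustration):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
T = 50
e, l = 6.0, 2.0  # early and late Poisson rates (made-up values)
D = np.concatenate([rng.poisson(e, 20), rng.poisson(l, 30)])
log_unif = -np.log(T)

# Quadratic version: for each candidate change point s, sum all T log-likelihoods.
lp_quad = np.full(T, log_unif)
for s in range(T):
    for t in range(T):
        lp_quad[s] += poisson.logpmf(D[t], e if t < s else l)

# Linear version: cumulative log-likelihoods under each rate, with a leading zero,
# mirroring lp_e and lp_l in the Stan code.
lp_e = np.concatenate(([0.0], np.cumsum(poisson.logpmf(D, e))))
lp_l = np.concatenate(([0.0], np.cumsum(poisson.logpmf(D, l))))
lp_lin = log_unif + lp_l[T] + lp_e[:T] - lp_l[:T]

print(np.allclose(lp_quad, lp_lin))  # True
```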

Thanks, Luke!

These Twin Names Match, But Aren’t “Matchy-Matchy”

I love this stuff:

Alice/Celia: This subtle anagram yields two charming classics with completely different sounds.

Beckett/Marlowe: Two playwrights representing two of the hottest contemporary name styles, double-t names and hidden-o names.

Zoe/Eve: These Greek and Hebrew “life” names look similar on paper, but not spoken aloud.

Rima/Amir: These mirror-image Arabic names make a smooth, trim pairing.

Laurel/Daphne: Daphne is the Greek form of Laurel. Both names are thoroughly familiar, but neither has ever been common.

Tristan/Gavin: Two Knights of the Round Table, but far less conspicuous than Lancelot and Galahad.

Indigo/Sienna: Straight from your local Crayola 64-pack, these color names share a creative spirit.

And many more. Also this comment:

My inner scientist has always wanted twin girls named Ethyl and Lauryl as both names are chemistry terms.

Bayesian Umpires: The coolest sports-statistics idea since the hot hand!


Hiro Minato points us to this recent article by Guy Molyneux:

Baseball fans have long known, or at least suspected, that umpires call balls and strikes differently as the count changes. At 0-2, it seems that almost any taken pitch that is not right down the middle will be called a ball, while at 3-0 it feels like pitchers invariably get the benefit of the doubt. One of the earliest discoveries made possible by PITCHf/x data was the validation of this perception: Researchers confirmed that the effective size of the strike zone at 0-2 is only about two-thirds as large as in a 3-0 count.

One common explanation offered for this pattern is that umpires don’t want to decide the outcome of a plate appearance. Preferring to let the players play, this argument goes, umpires will only call “strike three” or “ball four” if there is no ambiguity about the call.

But Molyneux has another theory that he claims better fits the data. The theory is that umpires are using Bayesian reasoning. I love it!

The argument goes as follows:

– It’s 3-0. You know and I know and the umpire knows that the pitcher’s probably going to throw a strike. Now a pitch comes in that’s a close call. The “base rate” (in Tversky and Kahneman terms) is that most of those 3-0 pitches are strikes. The Bayesian thing to do is to multiply the likelihood ratio by the ratio of base rates, and the result is that, as the umpire, you should call these close ones as strikes.

– Or, it’s 0-2. We know the pitcher is likely to throw this one away. The base rate is that the pitch is likely to be a ball. Now you take the exact same close call as before—the same “likelihood ratio,” in statistics jargon—but now you multiply it by this new ratio of base rates, and it’s rational to call this one as a ball.
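In odds form, the umpire's calculation is just posterior odds = prior odds times the likelihood ratio. A minimal Python sketch, with invented base rates rather than real PITCHf/x numbers:

```python
def posterior_strike_prob(base_rate_strike, likelihood_ratio):
    """Posterior P(strike) from the count-dependent base rate and the
    perceptual evidence (likelihood ratio for strike vs. ball)."""
    prior_odds = base_rate_strike / (1.0 - base_rate_strike)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

lr = 1.0  # a genuinely borderline pitch: the evidence favors neither call

p_30 = posterior_strike_prob(0.7, lr)  # 3-0: pitcher usually throws a strike
p_02 = posterior_strike_prob(0.3, lr)  # 0-2: pitcher usually wastes one

# Identical evidence, different rational calls: strike at 3-0, ball at 0-2.
print(p_30 > 0.5, p_02 < 0.5)  # True True
```

With a likelihood ratio of exactly 1, the posterior simply equals the base rate, which is the cleanest illustration of the aggregate bias: the same pitch gets opposite calls depending on the count.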

It’s statistical discrimination, baseball-style. Rational in each individual case, with a predictable bias in the aggregate.

This is really a beautiful argument by Molyneux. I’ve never seen it before, but it makes so much sense. And that’s why I say it’s the coolest sports-statistics idea since Miller and Sanjurjo’s work on the hot hand.

P.S. See here for more on Bayesian umpires from Chris Moore in 2009.

2016 Atlantic Causal Inference Conference

Jennifer Hill writes:

Registration for the 2016 Atlantic Causal Inference Conference is now live.

Stay tuned for short course registration (free for conference participants) and an announcement regarding a causal inference data analysis competition…both coming soon!

Also please consider signing up to give a lightning talk (link on website).

The conference will be held 26-27 May in NYC, and I strongly recommend it.

Why I don’t believe Fergus Simpson’s Big Alien Theory


It all began with this message from Christopher Bonnett:

I’m an observational cosmologist and I am writing you as I think the following paper + article might be of interest for your blog.

A fellow cosmologist, Fergus Simpson, has done a Bayesian analysis of the size of aliens; it has passed peer review and has been published in Monthly Notices, a reputable astronomy journal (just to make clear that it’s not some crackpot theory). Fergus has also written a blog-post with a less technical explanation of his research.

As the article is very heavy on Bayesian analysis, I thought it might be an interesting topic for you and your co-bloggers to read and potentially blog about.

Indeed, this reminded me of the example on page 90 of this paper.

I see two problems with Simpson’s analysis, problems that are serious enough that it all just seems wrong to me. My first problem is the prior, or distribution of aliens, and my second problem is the likelihood, or model for what is observed. I don’t actually see any prior at all in the paper; I guess there must be some implicit prior, but otherwise it would seem hard for anyone to say much from a sample of size 1 with no prior. [No, I was wrong, there is a prior; see P.P.S. below. — AG] As for the likelihood, the assumption seems to be that we, or Fergus Simpson, in writing this paper, are being randomly sampled from all creatures in the universe, or I guess all creatures that have the ability to think these thoughts, or maybe all creatures that have the ability to get a paper published in an astronomy journal, or . . . whatever. It doesn’t make sense to me.

I forwarded the discussion to David Hogg, who wrote:

Amusing idea, but there is only so far you can go in these “we have one data point” studies. That said, we want to pick targets for spectroscopy and imaging carefully, since each imaged planet is likely to cost upwards of $50 million to image (see current plans for terrestrial planet imaging).

Also, why didn’t he also use the data point that Mars (which is also habitable-zone) doesn’t have intelligent life on it? That might be more informative than the datum that Earth does have life.

That’s pretty much it for me, but Fergus Simpson did reply, so I’ll also include some of our back-and-forth discussion:
Continue reading ‘Why I don’t believe Fergus Simpson’s Big Alien Theory’ »



Lee Wilkinson writes:

In the latest issue of Harvard Magazine, a letter writer (David W. Pittelli) comments under the section “Social Progress Index”:

We are informed by Harvard Magazine (November-December 2015, page 15) that the country with the best “Health and Wellness” (“Do people live long and healthy lives?”) is Peru, while the United States ranks a dismal sixty-eighth in the world.

This seemed unlikely to me, and so I went to see how the Social Progress Imperative (SPI) arrived at its statistical claims.

The broadest statistic making up the Health and Wellness (HW) rating is Life Expectancy. From the figures, we see that the United States has a life expectancy a full four years longer than that of Peru (78.7 vs. 74.5 years). So how does SPI come to a figure that puts Peru at best in the world? They add other figures related to death, such as “Premature deaths from non-communicable diseases,” which are somewhat higher in the United States than in Peru. But why should we double-count a death from a noncommunicable disease like a heart attack or diabetes, which strikes mostly in advanced nations, but ignore a death from a communicable disease, most of which are more common in poorer nations like Peru?

The writer’s letter is longer in the printed version of the magazine, where he goes on to point out that several composite statistics for the countries aggregate over individuals that are in many cases counted twice or more. (See here, which has the juicy bit that Saudi Arabia gets credit for a secondary school enrollment rate that is above 110%!)

In the printed version of the magazine, Michael Porter (Harvard Business School’s TED-talk professor) comments on this letter, specifically disputing Pittelli’s claim:

There is also no double counting, as one reader suggested. Our principal component methodology is specifically designed to minimize or eliminate it.

The relevant SPI methodology document describing how the components were computed is here.

Aside from the apparent confusion the SPI authors have over the difference between Factor Analysis and Principal Components (perhaps because their “authoritative” source is Manly, Bryan F. J., Multivariate Statistical Methods: A Primer, CRC Press), there are two immediate questions:

1. Porter’s claim that there is no double counting has nothing to do with Principal Components methodology. As Pittelli says, the aggregation of various statistical measures was done prior to the computation of the components. Including measures that involve more than one individual makes the aggregate measures ambiguous. Interpreting the resulting principal components is subject to this artifactual bias.

2. Inferences on principal components (and other classical multivariate statistics) depend on a covariance matrix that does not include structural dependencies. If two or more measures are dependent partly due to a shared nonrandom component, the standard Principal Components model is a misrepresentation of the variation in the data.

I may be misinterpreting the survey methodology, and I may be quibbling, but this apparently influential survey appears to merit some deeper investigation.

I responded: this sort of thing is just crap, something like the notorious Places Rated Almanac or US News rankings. I wrote about something similar several years ago; see here and here.

But as long as the “garbage out” gets media attention, there will always be somebody willing to supply the “garbage in.”

P.S. I googled Ted-talk professor Michael Porter who was quoted above, and I came across this, entitled “The case for letting business solve social problems”:

I’m not a social problem guy. I’m a guy that works with business, helps business make money. God forbid.

OK, so far so good. I help businesses make money too, that’s part of what I do.

But then a few minutes later:

I’m a business school professor, but I’ve actually founded, I think, now, four nonprofits. Whenever I got interested and became aware of a societal problem, that was what I did, form a nonprofit.

And then a bit later:

I’ve also, over the years, worked more and more on social issues. I’ve worked on healthcare, the environment, economic development, reducing poverty, and as I worked more and more in the social field, I started seeing something that had a profound impact on me and my whole life, in a way.

All I can say here is . . . c’mon, make up your mind, dude. There’s nothing wrong with being “not a social problem guy”—not everyone can work on social problems. But then don’t pop up 3 minutes later to brag about how you solve social problems!

This particular speech seems to be approx 15 minutes of pure happy talk about how doing good is actually more profitable than doing evil. I wonder if someone could lock this guy up in the room with Gary Loveman or, even better, those guys who keep filling my inbox with spam.

I see no direct connection between the bogus statistics that Porter is touting and his doing-well-by-doing-good shtick, except perhaps for a general willingness to say whatever can make his point, without caring whether it makes any sense or is even internally consistent.

I’m not saying Porter has any particular political bias: politically, he seems quite centrist. When he says that all wealth is actually created by business, that sounds conservative, but he sounds liberal when he talks about gay rights or about reducing pollution being good business. I looked up his political contributions on the FEC site and I found a Michael Porter from the Harvard Business School who’s given money to Scott Brown, Steve Pagliuca, Nancy Johnson, Jeff Bingaman, William Binnie, James Ogonowski, and the Massachusetts Republican Party. This is the signature of a political moderate. So it’s not like I’m saying Porter’s statistical shenanigans are serving some extreme agenda; I just think he’s sloppy and doesn’t really care about the facts, he’s just throwing numbers around to illustrate whatever point he happens to be making. He probably gives an excellent Ted talk.

On deck this week


Tues: Why I’m skeptical of Fergus Simpson’s Big Alien Theory

Wed: One more thing you don’t have to worry about

Thurs: These Twin Names Match, But Aren’t “Matchy-Matchy”

Fri: Put your own questions on the General Social Survey!

Sat: Avoiding model selection in Bayesian social research

Sun: A short answer to a short question

Should I be upset that the NYT credulously reviewed a book promoting iffy science?


I want to say “junk science,” but that’s not quite right. The claims in question are iffy, far from proven, and could not be replicated, but they still might be true. As usual, my criticism is of the claim that the evidence is strong, when it isn’t.

From the review, by Heather Havrilesky:
Continue reading ‘Should I be upset that the NYT credulously reviewed a book promoting iffy science?’ »

Black Box Challenge


Georgy Cheremovskiy writes:

I’m one of the organizers of an unusual reinforcement learning competition named Black Box Challenge.

The concept is simple — one needs to program an agent that can play a game with unknown rules. At each time step the agent is given an environment state vector and has a few possible actions. The rewards may be delayed and are stochastic in nature — the same actions can lead to different rewards.

The competition is created with the support of (one of the largest Russian Internet companies) and Data-Centric Alliance (DCA is a Russian company specialising in big data and high-load systems), and the winners will be rewarded with pleasant prizes:
$4400 (300,000 rubles) — for the 1st place
$2550 (175,000 rubles) — for the 2nd place
$1820 (125,000 rubles) — for the 3rd place
Microsoft Xbox One — for the 4th, 5th, 6th, 7th and 8th places

I have no idea what this is but I thought I’d post it, because—hey, that’s a lot of rubles!

Perhaps Richard Tol can enter, once he’s through winning the $100,000 global warming time-series challenge.

John Yoo blogging


Jonathan Falk sends along this gem:

Judicial Torture as a Screening Device

Kong-Pin Chen / Tsung-Sheng Tsai

Judicial torture to extract information or to elicit a confession was a common practice in pre-modern societies, both in the east and the west. This paper proposes a positive theory for judicial torture. It is shown that torture reflects the magistrate’s attempt to balance type I and type II errors in the decision-making, by forcing the guilty to confess with a higher probability than the innocent, and thereby decreases the type I error at the cost of the type II error. Moreover, there is a non-monotonic relationship between the superiority of torture and the informativeness of investigation: when investigation is relatively uninformative, an improvement in technology used in the investigation actually lends an advantage to torture so that torture is even more attractive to the magistrates; however, when technological progress reaches a certain threshold, the advantage of torture is weakened, so that a judicial system based on torture becomes inferior to one based on evidence. This result can explain the historical development of the judicial system.

Sample bit:

[Screenshot of an excerpt from the paper]

And, get this:

The insight that resources need to be invested in order to overcome informational barriers is a common theme in the theoretical economics literature. . . . There is surprisingly scant formal modeling of torture.

Thinking the unthinkable, indeed. This article was published by the Berkeley Electronic Journal of Theoretical Economics. And yoo know who teaches at Berkeley:


I can only assume he was one of the referees and that he vetted the paper for realism.

Gay persuasion update

Hey, did you hear about that study last year, where some researchers claimed to find that a 20-minute doorstep conversation with skeptical voters could change views on same-sex marriage? It was published in the tabloids and featured on This American Life? And it turned out it was all a fraud, that one of the authors of the paper made up the data and the other author had to retract it from the journal? Remember that?

OK, good. Well, here’s some news. David Broockman and Joshua Kalla, two of the people who figured out about that earlier study being faked, then up and ran their own study, and look what they found:

A single approximately 10-minute conversation encouraging actively taking the perspective of others can markedly reduce prejudice for at least 3 months. We illustrate this potential with a door-to-door canvassing intervention in South Florida targeting antitransgender prejudice. . . . 56 canvassers went door to door encouraging active perspective-taking with 501 voters at voters’ doorsteps. A randomized trial found that these conversations substantially reduced transphobia. . . . These effects persisted for 3 months, and both transgender and nontransgender canvassers were effective. The intervention also increased support for a nondiscrimination law, even after exposing voters to counterarguments.


Wow. As I wrote at the sister blog, this has to cause us to rethink the idea that persuasion is so difficult—at least for fluid issues such as gay rights where public attitudes are changing fast.

Betsy Levy Paluck wrote an accompanying article summarizing the research on persuasion:

What do social scientists know about reducing prejudice in the world? In short, very little. Of the hundreds of studies on prejudice reduction conducted in recent decades, only ~11% test the causal effect of interventions conducted in the real world. Far fewer address prejudice among adults or measure the long-term effects of those interventions.

Also, I’d expect persuasion to be less effective in the real world than in Broockman and Kalla’s experiment, because in a live political setting you’ll see efforts at persuasion from both sides.

Finally, let me remind you that when the suggestion came up to replicate the LaCour and Green study, I was dismissive:

Ulp. There are lots and lots of studies people are interested in doing, and I’m sure this activist group in Los Angeles has a long to-do list. Do you really think they should spend their precious time, money, and human resources to study an idea that is contradicted by an entire 900-paper literature and whose only claim to plausibility was a made-up experiment?? . . . You gotta be kidding.

I was dismissive, and I was wrong. I’m glad nobody was listening to me on this one!

Selection bias, or, some things are better off left unsaid

I got two of these in the same day!

1. A colleague emails me that a colleague emailed him regarding a study on women in the workplace. The headline conclusion is: “Corporate America is not on a path to gender equality.” My colleague’s colleague writes:

This coincides with my prior beliefs, but for exactly that reason I thought it important to dig into the evidence and see what the latest numbers show. . . .

He then went into all sorts of problems he had with the report, and then summarized:

I am just curious why the report chooses to focus on “we aren’t there yet, and therefore are not on the path” when in fact it looks from the data like we are *exactly* on the path.

I *do* have a lot of sympathy for the report. I think the recommendations at the end are sensible, and that many of the statements about forms of cultural bias and inequality in the home, etc., are correct. But these are based on other data. This study claims to present novel conclusions that are not supported by the data presented.

My colleague sent me the above with the note that it might be of interest.

I replied: The last thing I want to do, as a man, is to say that progress in gender equality is just fine—so I think I’ll stay out of this one!

2. Someone else points me to a news article entitled, “Height May Be Linked to Increased Cancer Risk, Study Contends,” and asks, “What do we do with something so absurd?”

I replied: Too difficult for me to understand . . . I don’t want to touch this one!

P.S. The real selection bias, though, comes when I don’t even tell you I’m not writing about a topic!

Postdoc in Alabama on obesity-related research using statistics

These celebrity photos are incredible: Type S errors in use!

Kaveh sends along this, from a recent talk at Berkeley by Katherine Casey:


It’s so gratifying to see this sort of thing in common use, only 15 years after Francis and I introduced the idea (and see also this more recent paper with Carlin).

Best Disclaimer Ever

Paul Alper sends this in, from the article, “Ovarian cancer screening and mortality in the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS): a randomised controlled trial,” by Ian J Jacobs, Usha Menon, Andy Ryan, Aleksandra Gentry-Maharaj, Matthew Burnell, Jatinderpal K Kalsi, Nazar N Amso, Sophia Apostolidou, Elizabeth Benjamin, Derek Cruickshank, Danielle N Crump, Susan K Davies, Anne Dawnay, Stephen Dobbs, Gwendolen Fletcher, Jeremy Ford, Keith Godfrey, Richard Gunu, Mariam Habib, Rachel Hallett, Jonathan Herod, Howard Jenkins, Chloe Karpinskyj, Simon Leeson, Sara J Lewis, William R Liston, Alberto Lopes, Tim Mould, John Murdoch, David Oram, Dustin J Rabideau, Karina Reynolds, Ian Scott, Mourad W Seif, Aarti Sharma, Naveena Singh, Julie Taylor, Fiona Warburton, Martin Widschwendter, Karin Williamson, Robert Woolas, Lesley Fallowfield, Alistair J McGuire, Stuart Campbell, Mahesh Parmar, and Steven J Skates:

Declaration of interests IJJ reports personal fees from and stock ownership in Abcodia as the non-executive director and consultant. He reports personal fees from Women’s Health Specialists as the director. He has a patent for the Risk of Ovarian Cancer algorithm and an institutional licence to Abcodia with royalty agreement. He is a trustee (2012–14) and Emeritus Trustee (2015 to present) for The Eve Appeal. He has received grants from the Medical Research Council (MRC), Cancer Research UK, the National Institute for Health Research, and The Eve Appeal. UM has stock ownership in and research funding from Abcodia. She has received grants from the MRC, Cancer Research UK, the National Institute for Health Research, and The Eve Appeal. NNA is the founder of, owns stock in, and is a board member of MedaPhor, a spin-off company at Cardiff University. He has a patent for the ultrasound simulation training system MedaPhor. SA is funded by a research grant from Abcodia. AD reports personal fees from Abcodia. AL reports personal fees from Roche as a panel member and advisory board member and fees from Sanofi Pasteur Merck Sharp & Dohme (Gardasil) as an advisory board member. JM is involved in a private ovarian cancer screening programme after closure of this trial. MWS reports personal fees from Abcodia as a consultant. LF reports funding from the MRC for the UK Collaborative Trial of Ovarian Cancer Screening psychosocial study. She reports personal fees from GlaxoSmithKline, Amgen, AstraZeneca, Roche, Pfizer, Teva, Bristol-Myers Squibb, and Sanofi and grants from Boehringer Ingelheim and Roche. SJS reports personal fees from the LUNGevity Foundation and SISCAPA Assay Technologies as a member of their Scientific Advisory Boards. He reports personal fees from Abcodia as a consultant and AstraZeneca as a speaker honorarium. He has a patent for the Risk of Ovarian Cancer algorithm and an institutional license to Abcodia. All other authors declare no competing interests.

All right, then.

Oh, in case you were wondering, here’s the last part of the paper’s summary:

Although the mortality reduction was not significant in the primary analysis, we noted a significant mortality reduction with MMS when prevalent cases were excluded. We noted encouraging evidence of a mortality reduction in years 7–14, but further follow-up is needed before firm conclusions can be reached on the efficacy and cost-effectiveness of ovarian cancer screening.

N = 202638 and the effect wasn’t statistically significant. No problem, says the non-executive director and consultant of Abcodia, the director of Women’s Health Specialists, the trustee of the Eve Appeal, the owner of stock in Abcodia, the owner of stock in MedaPhor, the patenter of MedaPhor, the receiver of personal fees from Abcodia, the panel member of Roche and advisory board member of Sanofi Pasteur Merck Sharp & Dohme, the receiver of personal fees from GlaxoSmithKline, Amgen, AstraZeneca, Roche, Pfizer, Teva, Bristol-Myers Squibb, and Sanofi and grants from Boehringer Ingelheim and Roche, the receiver of personal fees from SISCAPA Assay Technologies, Abcodia, and AstraZeneca. No problem, says the patent holder for the Risk of Ovarian Cancer algorithm. No problem at all. Encouraging evidence, they say.

P.S. I get funding from Novartis and have also been paid by Merck, Procter & Gamble, Google, and lots of other companies that I can’t remember right now.

Somebody’s reading our research.

See footnote 10 on page 5 of this GAO report.

[Screenshot of graphs from the GAO report]

(The above graphs are just for age 45-54, which demonstrates an important thing about statistical graphics: They should be as self-contained as possible. Otherwise when the graph is separated from its caption, it requires additional words of explanation, as you are seeing here.)

“A strong anvil need not fear the hammer”

Wagenmakers et al. write:

A single experiment cannot overturn a large body of work. . . . An empirical debate is best organized around a series of preregistered replications, and perhaps the authors whose work we did not replicate will feel inspired to conduct their own preregistered studies. In our opinion, science is best served by ruthless theoretical and empirical critique, such that the surviving ideas can be relied upon as the basis for future endeavors. A strong anvil need not fear the hammer, and accordingly we hope that preregistered replications will soon become accepted as a vital component of a psychological science that is both thought-provoking and reproducible.

I don’t feel quite so strongly as E.J. regarding preregistered replications, but I agree strongly with his anvil/hammer quote, which comes at the end of a recent paper, “Turning the hands of time again: a purely confirmatory replication study and a Bayesian analysis,” by Eric-Jan Wagenmakers, Titia Beek, Mark Rotteveel, Alex Gierholz, Dora Matzke, Helen Steingroever, Alexander Ly, Josine Verhagen, Ravi Selker, Adam Sasiadek, Quentin Gronau, Jonathon Love, and Yair Pinto, which begins:

In a series of four experiments, Topolinski and Sparenberg (2012) found support for the conjecture that clockwise movements induce psychological states of temporal progression and an orientation toward the future and novelty.

OK, before we go on, let’s just see where we stand here. This is a Psychological Science or PPNAS-style result: it’s kinda cool, it’s worth a headline, and it could be true. Just as it could be that college men with fat arms have different political attitudes, or that your time of the month could affect how you vote or how you dress, or that being primed with elderly-related words could make you walk slower. Or just as any of these effects could exist but go in the opposite direction. Or, as the authors of those notorious earlier papers claimed, such effects could exist but only in the presence of interactions with socioeconomic class, relationship status, outdoor temperature, and attitudes toward the elderly. Or just as any of these could exist, interacted with any number of other possible moderators such as age, education, religiosity, number of older siblings, number of younger siblings, etc etc etc.

Topolinski and Sparenberg (2012) wandered through the garden of forking paths and picked some pretty flowers.

What happened when Wagenmakers et al. tried to replicate?

Here we report the results of a preregistered replication attempt of Experiment 2 from Topolinski and Sparenberg (2012). Participants turned kitchen rolls either clockwise or counterclockwise while answering items from a questionnaire assessing openness to experience. Data from 102 participants showed that the effect went slightly in the direction opposite to that predicted by Topolinski and Sparenberg (2012) . . .

No surprise. If the original study is basically pure noise, a replication could go in any direction.
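To see why, here is a toy simulation; the true effect size and standard error are made-up numbers chosen so that the signal is tiny relative to the noise:

```python
import random

random.seed(1)

# Assumed for illustration: a near-zero true effect and a large standard error
TRUE_EFFECT = 0.05
SE = 1.0

def observed_effect():
    # observed estimate = true effect + sampling noise
    return random.gauss(TRUE_EFFECT, SE)

# How often does an exact replication flip the sign of the original estimate?
n_sims = 20_000
flips = sum((observed_effect() > 0) != (observed_effect() > 0)
            for _ in range(n_sims))
print(flips / n_sims)  # close to 0.5: the replication's sign is nearly a coin flip
```

When the estimate is mostly noise, the sign of any single study, original or replication, is close to a coin flip.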

Wagenmakers et al. also report a Bayes factor, but I hate that sort of thing so I won’t spend any more time discussing it here. Perhaps I’ll cover it in a separate post but for now I want to focus on the psychology experiments.

And the point I want to make is how routine this now is:

1. A study is published somewhere, it has p less than .05, but we know now that this says little to nothing at all.

2. The statistically significant p-value comes with a story, but through long experience we know that these sorts of just-so stories can go in either direction.

3. Someone goes to the trouble of replicating. The result does not replicate.

Let’s just hope that we can bypass the next step:

4. The original authors start spinnin’ and splainin’.

And instead we can move to the end of this story:

5. All parties agree that any effect or interaction will be so small that it can’t be detected with this sort of crude experimental setup.

And, ultimately, to a realization that noisy studies and forking paths are not a great way to learn about the world.

Let me clarify just one thing about preregistered studies. Preregistration is fine, but it helps to have a realistic sense of what might happen. That’s one reason I did not recommend that those ovulation-and-clothing researchers do a preregistered replication. Sure, they could, but given their noise level, it’s doomed to fail (indeed, they did do a replication and it did fail in the sense of not reproducing their original result, and then they salvaged it by discovering an interaction with outdoor temperature). Instead, I usually recommend people work on reliability and validity, that is, on reducing the variance and bias of their measurements. It seems kinda mean to suggest someone do a preregistered replication, if I think they’re probably gonna fail. And, if they do succeed, it’s likely to be a type S error, which is its own sort of bummer.
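A quick simulation can show what a type S error looks like in a noisy but "successful" study: condition on statistical significance, then check the sign and magnitude of the surviving estimates. All the numbers here are invented for illustration:

```python
import random

random.seed(2)

# Assumed for illustration: a small true effect buried in a large standard error
TRUE_EFFECT = 0.1
SE = 1.0

# Simulate many noisy studies; keep only the "statistically significant" ones
estimates = (random.gauss(TRUE_EFFECT, SE) for _ in range(200_000))
significant = [e for e in estimates if abs(e) > 1.96 * SE]

# Type S: among significant results, the fraction with the wrong sign
type_s_rate = sum(e < 0 for e in significant) / len(significant)

# Type M: significant estimates also exaggerate the effect's magnitude
exaggeration = sum(abs(e) for e in significant) / len(significant) / TRUE_EFFECT

print(f"type S rate: {type_s_rate:.2f}, exaggeration factor: {exaggeration:.0f}")
```

In this setup a substantial fraction of the significant estimates point the wrong way, and the ones that clear the significance bar overstate the true effect by an order of magnitude, which is why a "successful" preregistered replication of a noisy design can itself be a type S error.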

I guess what I’m saying is:

– Short-term, a preregistered replication is a clean way to shoot down a lot of forking-paths-type studies.

– Medium-term, I’m hoping (and maybe EJ and his collaborators are, too) that the prospect of preregistered replication will cause researchers to moderate their claims and think twice about publishing and promoting the exciting statistically-significant patterns that show up.

– Long term, maybe people will do more careful experiments in the first place. Or, when people do want to trawl through data to find interesting patterns (not that there’s anything wrong with that, I do it all the time), that they will use multilevel models and do partial pooling to get more conservative, less excitable inference.
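As a sketch of what that partial pooling looks like, here is a minimal empirical-Bayes-style shrinkage of noisy group estimates toward the grand mean; the group labels, estimates, standard errors, and between-group variance are all invented for illustration (in practice the between-group variance would be estimated, e.g. by fitting a multilevel model):

```python
# Invented numbers: noisy per-group estimates with their standard errors
estimates = {"A": 2.0, "B": -1.5, "C": 0.3}
ses = {"A": 0.8, "B": 1.2, "C": 0.5}

grand_mean = sum(estimates.values()) / len(estimates)
tau2 = 0.5  # assumed between-group variance (would normally be estimated)

# Precision-weighted compromise between each group's estimate and the grand
# mean: noisier groups get pulled harder toward the overall average.
pooled = {
    g: (est / ses[g] ** 2 + grand_mean / tau2) / (1 / ses[g] ** 2 + 1 / tau2)
    for g, est in estimates.items()
}

for g in pooled:
    print(g, estimates[g], round(pooled[g], 2))
```

The extreme, noisy estimates shrink the most, which is exactly the "more conservative, less excitable" inference being recommended: the dramatic patterns that would otherwise grab a headline get pulled back toward the mean.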

On deck this week

Mon: “A strong anvil need not fear the hammer”

Tues: Best Disclaimer Ever

Wed: These celebrity photos are incredible: Type S errors in use!

Thurs: Selection bias, or, some things are better off left unsaid

Fri: John Yoo blogging

Sat: You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

Sun: Should I be upset that the NYT credulously reviewed a book promoting iffy science?