
Shameless little bullies claim that published triathlon times don’t replicate


Paul Alper sends along this inspiring story of Julie Miller, a heroic triathlete who just wants to triathle in peace, but she keeps getting hassled by the replication police. Those shameless little bullies won’t let her just do her thing, instead they harp on technicalities like missing timing clips and crap like that. Who cares about missing timing clips? Her winning times were statistically significant, that’s what matters to me. And her recorded victories were peer reviewed. But, no, those second stringers can’t stop with their sniping.

I for one think this running star should resist any calls for her to replicate her winning triathlon times. The replication rate of those things is statistically indistinguishable from 100%, after all! Track and field has become preoccupied with prevention and error detection—negative psychology—at the expense of exploration and discovery.

In fact, I’m thinking the American Statistical Association could give this lady the Founders Award, which hasn’t really had a worthy recipient since 2002.

On deck this week

Mon: Shameless little bullies claim that published triathlon times don’t replicate

Tues: Bootstrapping your posterior

Wed: You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

Thurs: Are stereotypes statistically accurate?

Fri: Will youths who swill Red Bull become adult cocaine addicts?

Sat: Science reporters are getting the picture

Sun: Modeling correlation of issue attitudes and partisanship within states

Documented forking paths in the Competitive Reaction Time Task


Baruch Eitan writes:

This is some luscious garden of forking paths.

Indeed. Here’s what Malte Elson writes at the linked website:

The Competitive Reaction Time Task, sometimes also called the Taylor Aggression Paradigm (TAP), is one of the most commonly used tests to purportedly measure aggressive behavior in a laboratory environment. . . .

While the CRTT ostensibly measures how much unpleasant, or even harmful, noise a participant is willing to administer to a nonexistent confederate, that amount of noise can be extracted as a measure in myriad different ways using various combinations of volume and duration over one or more trials. There are currently 120 publications in which results are based on the CRTT, and they reported 147 different quantification strategies in total!

Elson continues:

This archive does not contain all variations of the CRTT, as some procedural differences are so substantial that their quantification strategies would be impossible to compare. . . . Given the number of different versions of the CRTT measure that can be extracted from its use in a study, it is very easy for a researcher to analyze several (or several dozen) versions of the CRTT outcome measures in a study, running hypothesis tests with one version of the measure after another until a version is found that produces the desired pattern of results. . . .
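To get a rough sense of why this flexibility matters, here is a minimal sketch of how the chance of at least one nominally significant result grows with the number of outcome versions tried. It assumes the versions are statistically independent and that there is no true effect; real CRTT measures are built from the same volume and duration settings and so are correlated, which means these numbers overstate the inflation and should be read as a loose upper bound. The function name is made up for illustration.

```python
def chance_of_some_significant_result(n_versions: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests gives p < alpha.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** n_versions


# A few values, up to the 147 quantification strategies counted by Elson.
for k in (1, 5, 20, 147):
    rate = chance_of_some_significant_result(k)
    print(f"{k:>3} versions tried -> {rate:.3f}")
```

Even under correlation the qualitative point stands: the more versions of the measure one is free to try, the less a single p < .05 means.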

Smooth poll aggregation using state-space modeling in Stan, from Jim Savage

Jim Savage writes:

I just saw your post on poll bounces; have been thinking the same myself. Why are the poll aggregators so jumpy about new polls?

Annoyed, I put together a poll aggregator that took a state-space approach to the unobserved preferences; nothing more than the 8 schools (14 polls?) example with a time-varying mean process and very small updates to the state.

One of the things I was thinking of was to use aggregated demographic polling data (from the polling companies’ cross-tabs) as a basis for estimating individual states for each demographic cell, and then performing post-stratification on those. Two benefits: a) having a time-varying state deals nicely with the decaying importance of old polls, and b) getting hold of unit-level polling data for MRP is tough if you’re me (perhaps tough if you’re you?).

Here’s the plot:


A full writeup, automated data scraping, model etc. is below.

Here’s the zipfile with everything.

My only comment is that you should be able to do even better—much better—by also including party ID among the predictors in the model, then fitting a state-space model to the underlying party ID proportions and poststratifying on it as well. That would fix some of the differential nonresponse stuff we’ve been talking about.
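For readers who want to see the basic state-space idea in code, here is a minimal sketch. It is not Jim's Stan model: it swaps in a plain Kalman filter for a local-level model, with the poll noise and the day-to-day drift fixed by hand rather than estimated, and it ignores demographics, party ID, and poststratification. The function name, the parameter values, and the simulated polls are all assumptions made for illustration.

```python
import numpy as np


def local_level_filter(polls, poll_sd, state_sd, init_mean=0.0, init_sd=10.0):
    """Filtered means for y_t = mu_t + noise, mu_t = mu_{t-1} + small step.

    polls    : one poll margin per day, np.nan on days without a poll
    poll_sd  : assumed sampling sd of a single poll
    state_sd : assumed sd of the daily change in true preferences
    """
    mean, var = init_mean, init_sd ** 2
    means = []
    for y in polls:
        var += state_sd ** 2              # predict: the state drifts a little
        if not np.isnan(y):               # update only when a poll arrives
            gain = var / (var + poll_sd ** 2)
            mean += gain * (y - mean)
            var *= 1.0 - gain
        means.append(mean)
    return np.array(means)


# Simulated data: a constant true margin of 3 points and noisy daily polls.
rng = np.random.default_rng(0)
polls = 3.0 + rng.normal(0.0, 3.0, size=60)
polls[::7] = np.nan                       # some days have no poll
smoothed = local_level_filter(polls, poll_sd=3.0, state_sd=0.1)
print(np.round(smoothed[-5:], 2))
```

Because state_sd is small relative to poll_sd, each new poll moves the estimate only a little, and old polls lose weight gradually instead of being dropped at a cutoff.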

And here’s Jim’s writeup:

“What can recent replication failures tell us about the theoretical commitments of psychology?”


Psychology/philosophy professor Stan Klein was motivated by our power pose discussion to send along this article which seems to me to be a worthy entry in what I’ve lately been calling “the literature of exasperation,” following in the tradition of Meehl etc.

I offer one minor correction. Klein writes, “I have no doubt that the complete reinstatement of experimental conditions will ensure a successful replication of a task’s outcome.” I think this statement is too optimistic. Redo the same experiment on the same people but re-randomize, and anything can happen. If the underlying effect is near zero (as I’d guess is the case, for example, in the power pose example), then there’s no reason to expect success even in an exact replication.

More to the point is Klein’s discussion of the nature of theorizing in psychology research. Near the end of his article he discusses the materialist doctrine “that reality, in its entirety, must be composed of quantifiable, material substances.”

That reminds me of one of the most ridiculous of many ridiculous hyped studies in the past few decades, a randomized experiment purporting to demonstrate the effectiveness of intercessory prayer (p=.04 after performing 3 hypothesis tests, not that anyone’s counting; Deb and I mention it in our Teaching Statistics book). What amazed me about this study—beyond the philosophically untenable (to me) idea that God is unable to interfere with the randomization but will go to the trouble of improving the health of the prayed-for people by some small amount, just enough to assure publication—was the effort the researchers put in to diminish any possible treatment effect.

It’s reasonable to think that prayer could help people in many ways, for example it is comforting to know that your friends and family care enough about your health to pray for it. But in this study they chose people to pray who had no connection to the people prayed for—and the latter group were not even told of the intervention. The experiment was explicitly designed to remove all but supernatural effects, somewhat in the manner that a magician elaborately demonstrates that there are no hidden wires, nothing hidden in the sleeves, etc. Similarly with Bargh’s embodied cognition study: the elderly words were slipped into the study so unobtrusively as to almost remove any chance they could have an effect.

I suppose if you tell participants to think about elderly people and then they walk slower, this is boring; it only reaches the status of noteworthy research if the treatment is imperceptible. Similarly for other bank-shot ideas such as the correlation between menstrual cycle and political attitudes. There seems to be something that pushes researchers to attenuate their treatments to zero, at which point they pull out the usual bag of tricks to attain statistical significance. It’s as if they were taking ESP research as a model. See discussion here on “piss-poor omnicausal social science.”

Klein’s paper, “The Unplanned Obsolescence of Psychological Science and an Argument for Its Revival”, is relevant to this discussion.

Don’t believe the bounce


Alan Abramowitz sent us the above graph, which shows the results from a series of recent national polls, for each plotting Hillary Clinton’s margin in support (that is, Clinton minus Trump in the vote-intention question) vs. the Democratic Party’s advantage in party identification (that is, percentage Democrat minus percentage Republican).

This is about as clear a pattern as you’ll ever see in social science: Swings in the polls are driven by swings in differential nonresponse. After the Republican convention, Trump supporters were stoked, and they were more likely to respond to surveys. After the Democratic convention, the reverse: Democrats are more likely to respond, driving Clinton up in the polls.

David Rothschild and I have the full story up at Slate:

You sort of know there is a convention bounce that you should sort of ignore, but why? What’s actually in a polling bump? The recent Republican National Convention featured conflict and controversy and one very dark acceptance speech—enlivened by some D-list celebrities (welcome back Chachi!)—but it was still enough to give nominee Donald Trump a big, if temporary, boost in many polls. This swing, which occurs predictably in election after election, is typically attributed to the persuasive power of the convention, with displays of party unity persuading partisans to vote for their candidate and cross-party appeals coaxing over independents and voters of the other party.

Recent research, however, suggests that swings in the polls can often be attributed not to changes in voter intention but to changing patterns of survey nonresponse: What seems like a big change in public opinion turns out to be little more than changes in the inclinations of Democrats and Republicans to respond to polls. We learned this from a study we performed [with Sharad Goel and Wei Wang] during the 2012 election campaign using surveys conducted on the Microsoft Xbox. . . .

Our Xbox study showed that very few respondents were changing their vote preferences—less than 2 percent during the final month of the campaign—and that most, fully two-thirds, of the apparent swings in the polls (for example, a big surge for Mitt Romney after the first debate) were explainable by swings in the percentages of Democrats and Republicans responding to the poll. This nonresponse is very loosely correlated with likeliness to vote but mainly reflects passing inclinations to participate in polling. . . . large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention. . . .

[See this paper, also with Doug Rivers, with more, including supporting information from other polls.]

We can apply these insights to the 2016 convention bounces. For example, Reuters/Ipsos showed a swing from a 15-point Clinton lead on July 14 to a 2-point Trump lead on July 27. Who was responding in these polls? The pre-convention survey saw 53 percent Democrats, 38 percent Republican, and the rest independent or supporters of other parties. The post-convention respondents looked much different, at 46 percent Democrat, 43 percent Republican. The 17-point swing in the horse-race gap came with a 12-point swing in party identification. Party identification is very stable, and there is no reason to expect any real swings during that period; thus, it seems that about two-thirds of the Clinton-Trump swing in the polls comes from changes in response rates. . . .
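The arithmetic behind the "about two-thirds" figure in the Reuters/Ipsos example can be laid out in a few lines. This is only a back-of-the-envelope check using the numbers quoted above; treating a one-point shift in the party-ID gap as a one-point shift in the horse-race margin is a simplifying assumption, not a fitted model.

```python
# Clinton-minus-Trump margin, in percentage points (negative = Trump lead).
margin_pre, margin_post = 15, -2

# Democrat-minus-Republican share among respondents, in percentage points.
party_gap_pre = 53 - 38
party_gap_post = 46 - 43

margin_swing = margin_pre - margin_post          # swing in the horse race
party_swing = party_gap_pre - party_gap_post     # swing in who responded

# Assumption: party-ID composition maps one-for-one onto the margin.
share_from_nonresponse = party_swing / margin_swing

print(f"horse-race swing: {margin_swing} points")
print(f"party-ID swing:   {party_swing} points")
print(f"share attributable to response rates: {share_from_nonresponse:.2f}")
```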

Read the whole thing.

The political junkies among you have probably been seeing all sorts of graphs online showing polls and forecasts jumping up and down. These calculations typically don’t adjust for party identification (an idea we wrote about back in 2001, but without realizing the political implications that come from systematic, rather than random, variation in nonresponse) and thus can vastly overestimate swings in preferences.

The p-value is a random variable

Sam Behseta sends along this paper by Laura Lazzeroni, Ying Lu, and Ilana Belitskaya-Lévy, who write:

P values from identical experiments can differ greatly in a way that is surprising to many. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later.

I agree that the randomness of the p-value—the fact that it is a function of data and thus has a sampling distribution—is an important point that is not well understood. Indeed, I think that the z-transformation (the normal cdf, which takes a z-score and transforms it into a p-value) is in many ways a horrible thing, in that it takes small noisy differences in z-scores and elevates them into the apparently huge differences between p=.1, p=.01, p=.001. This is the point of the paper with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The p-value, like any data summary, is a random variable with a sampling distribution.

Incidentally, I have the same feeling about cross-validation-based estimates and even posterior distributions: all of these are functions of the data and thus have sampling distributions, but theoreticians and practitioners alike tend to forget this and instead treat them as truths.

My one problem with this particular article is that it takes p-values at face value, whereas in real life p-values typically are the product of selection, as discussed by Uri Simonsohn et al. a few years ago in their “p-hacking” article and as discussed by Eric Loken and myself a couple years ago in our “garden of forking paths” article. I think real-world p-values are much more optimistic than the nominal p-values discussed by Lazzeroni et al. But in any case I think they’re raising an important point that’s been under-emphasized in textbooks and in the statistics literature.
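A quick simulation shows how variable nominal p-values are across identical experiments. This is a sketch under stated assumptions: the z-score is normal with standard error 1, the true effect sits 2.8 standard errors from zero (roughly 80% power at the .05 level), and there is no selection or forking paths, so these are the clean textbook p-values, not the real-world kind.

```python
import numpy as np
from math import erfc, sqrt


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score: 2 * (1 - Phi(|z|))."""
    return erfc(abs(z) / sqrt(2.0))


# Modest differences in z turn into apparently huge differences in p.
for z in (1.645, 2.576, 3.291):
    print(f"z = {z:.3f} -> p = {two_sided_p(z):.4f}")

# Rerun the same experiment many times.
# Assumption: true effect = 2.8 standard errors, unit-variance normal noise.
rng = np.random.default_rng(1)
z_draws = rng.normal(loc=2.8, scale=1.0, size=10_000)
p_draws = np.array([two_sided_p(z) for z in z_draws])

p_10, p_50, p_90 = np.quantile(p_draws, [0.1, 0.5, 0.9])
print(f"10th, 50th, 90th percentile of p: {p_10:.5f}, {p_50:.5f}, {p_90:.5f}")
print(f"share of replications with p < .05: {np.mean(p_draws < 0.05):.2f}")
```

The spread between the 10th and 90th percentiles covers several orders of magnitude even though every simulated study has the same design and the same true effect.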

Guy Fieri wants your help! For a TV show on statistical models for real estate


I got the following email from David Mulholland:

I’m a producer at Citizen Pictures where we produce Food Network’s “Diners, Drive-Ins and Dives” and Bravo’s digital series, “Going Off The Menu,” among others. A major network is working with us to develop a show that pits “data” against a traditional real estate agent to see who can find a home buyer the best house for them. In this show, both the real estate agent and the data team each choose two properties using their very different methods. The show will ask the question: “Who will do a better job of figuring out what the client wants, ‘data’ or the traditional real estate agent?”

TV and real estate are two topics I know nothing about, so I pointed Mulholland to some Finnish dudes who do sophisticated statistical modeling of the housing market. They didn’t think it was such a good fit for them, with Janne Sinkkonen remarking that “Models are good at finding trends and averages from large, geographically or temporally sparse data. The richness of a single case, seen on the spot, is much better evaluated by a human.”

That makes sense, but it is also possible that a computer-assisted human can do better than a human alone. Say you have a model that gives quick price estimates for every house. Those estimates are sitting on the computer. A human then goes to house X and assesses its value at, say, $350,000. The human then looks up and sees that the computer gave an assessment, based on some fitted algorithm, of $420,000. What does the human conclude? Not necessarily that the computer is wrong; rather, at this point the human can introspect and consider why the computer estimate is so far off. What features of the house make it so much less valuable than the computer “thinks”? Perhaps some features not incorporated into the computer’s model, for example the state of the interior of the house, or a bad paint job and unkempt property, or something about the location that had not been in the model. This sort of juxtaposition can be valuable.

That said, I still know nothing about real estate or about what makes good TV, so I offered to post Mulholland’s question here. He said sure, and added:

I’m particularly delighted to hear your analysis of a “computer-assisted human” as that is a direction we have been investigating. Simply put, we do not have the resources to implement any sort of fully computerized solution. I think the computer-assisted human is definitely a direction we would take.

I’d love to hear the thoughts of blog readers. At the moment, the big question we are considering is, “Assuming that we have full access to a user’s data (with the user’s cooperation of course . . . data examples include Facebook, web browser history, online shopping history, geotracking, etc), how can we use human and computer to best sort through this data to find the house the user will like the most?”

Ball’s in your court now, TV-savvy blog readers!

Amazon NYC decision analysis jobs

Dean Foster writes:

Amazon is having a hiring event (Sept 8/9) here in NYC. If you are interested in working on demand forecasting either here in NYC or in Seattle send your resume to by September 1st, 2016.

Here’s the longer blurb:

Amazon Supply Chain Optimization Technologies (SCOT) builds systems that automate decisions in Amazon’s supply chain. These systems are responsible for predicting customer demand; optimization of sourcing, buying and placement of all inventory, and ensuring optimal customer experience from an order fulfillment promise perspective. In other words, our systems automatically decide how much to buy of which items, from which vendors/suppliers, which fulfilment centers to put them in, how to get it there, how to get it to the customer and what to promise customers – all while maximizing customer satisfaction and minimizing cost.

Could be interesting, and it’s always fun to work on real decision problems!

In policing (and elsewhere), regional variation in behavior can be huge, and perhaps give a clue about how to move forward.

Rajiv Sethi points to a discussion by Peter Moskos on the recent controversy over racial bias in police shootings.

Here’s Sethi:

Moskos is not arguing here that the police can do no wrong; he is arguing instead that in the aggregate, whites and blacks are about equally likely to be victims of bad shootings. . . .

Moskos offers another, quite different reason why bias in individual incidents might not be detected in aggregate data: large regional variations in the use of lethal force.

To see the argument, consider a simple example of two cities that I’ll call Eastville and Westchester. In each of the cities there are 500 police-citizen encounters annually, but the racial composition differs: 40% of Eastville encounters and 20% of Westchester encounters involve blacks. There are also large regional differences in the use of lethal force: in Eastville 1% of encounters result in a police killing while the corresponding percentage in Westchester is 5%. That’s a total of 30 killings, 5 in one city and 25 in the other.

Now suppose that there is racial bias in police use of lethal force in both cities. In Eastville, 60% of those killed are black (instead of the 40% we would see in the absence of bias). And in Westchester the corresponding proportion is 24% (instead of the no-bias benchmark of 20%). Then we would see 3 blacks killed in one city and 6 in the other. That’s a total of 9 black victims out of 30. The black share of those killed is 30%, which is precisely the black share of total encounters. Looking at the aggregate data, we see no bias. And yet, by construction, the rate of killing per encounter reflects bias in both cities.
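The two-city example is easy to verify directly. The sketch below uses only the numbers given in the example; the dictionary layout and variable names are illustrative choices.

```python
# Hypothetical cities from the example; every number comes from the text.
cities = {
    "Eastville":   {"encounters": 500, "black_share": 0.40,
                    "kill_rate": 0.01, "black_share_of_killed": 0.60},
    "Westchester": {"encounters": 500, "black_share": 0.20,
                    "kill_rate": 0.05, "black_share_of_killed": 0.24},
}

total_encounters = sum(c["encounters"] for c in cities.values())
black_encounters = sum(c["encounters"] * c["black_share"] for c in cities.values())
total_killed = sum(c["encounters"] * c["kill_rate"] for c in cities.values())
black_killed = sum(
    c["encounters"] * c["kill_rate"] * c["black_share_of_killed"]
    for c in cities.values()
)

# Within each city, the black share of those killed exceeds the black
# share of encounters, which is the bias built into the example.
for name, c in cities.items():
    print(f"{name}: {c['black_share']:.0%} of encounters, "
          f"{c['black_share_of_killed']:.0%} of those killed")

# Pooled across cities, the two shares coincide and the bias disappears.
print(f"pooled share of encounters: {black_encounters / total_encounters:.0%}")
print(f"pooled share of killed:     {black_killed / total_killed:.0%}")
```

The pooled comparison hides the bias because the city with the higher killing rate is also the city with the lower black share of encounters, so it dominates the aggregate count of killings.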

This is just a simple example to make a logical point. Does it have empirical relevance? Are regional variations in killings large enough to have such an effect? Here is Moskos again:

Last year in California, police shot and killed 188 people. That’s a rate of 4.8 per million. New York, Michigan, and Pennsylvania collectively have 3.4 million more people than California (and 3.85 million more African Americans). In these three states, police shot and killed… 53 people. That’s a rate of 1.2 per million. That’s a big difference.

Were police in California able to lower their rate of lethal force to the level of New York, Michigan, and Pennsylvania… 139 fewer people would be killed by police. And this is just in California… If we could bring the national rate of people shot and killed by police (3 per million) down to the level found in, say, New York City… we’d reduce the total number of people killed by police 77 percent, from 990 to 231!

This is a staggeringly large effect.
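Moskos's figures can be checked against each other with a little arithmetic. The populations are not stated in the quote, so the sketch below backs out California's population from the quoted count and rate; that implied population is an assumption of this check, not a number from the source.

```python
# Quoted figures: counts of people shot and killed, rates per million.
ca_killed, ca_rate = 188, 4.8
three_state_killed = 53
extra_population = 3.4            # millions more people than California

# Assumption: population implied by the quoted count and rate.
ca_population = ca_killed / ca_rate
three_state_population = ca_population + extra_population

three_state_rate = three_state_killed / three_state_population
ca_killed_at_lower_rate = three_state_rate * ca_population
fewer_killed = ca_killed - ca_killed_at_lower_rate

national_reduction = 1 - 231 / 990

print(f"implied California population: {ca_population:.1f} million")
print(f"three-state rate: {three_state_rate:.2f} per million")
print(f"fewer killed in California at that rate: {fewer_killed:.0f}")
print(f"national reduction: {national_reduction:.0%}")
```

The quoted numbers hang together: the implied three-state rate rounds to 1.2 per million, the California difference comes out at about 139, and 990 to 231 is a reduction of about 77 percent.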

Additional evidence for large regional variations comes from a recent report by the Center for Policing Equity. The analysis there is based on data provided voluntarily by a dozen (unnamed) departments. Take a close look at Table 6 in that document, which reports use of force rates per thousand arrests. The medians for lethal force are 0.29 and 0.18 for blacks and whites respectively, but the largest recorded rates are much higher: 1.35 for blacks and 3.91 for whites. There is at least one law enforcement agency that is killing whites at a rate more than 20 times greater than that of the median agency.

On the reasons for these disparities, one can only speculate:

I really don’t know what some departments and states are doing right and others wrong. But it’s hard for me to believe that the residents of California are so much more violent and threatening to cops than the good people of New York or Pennsylvania. I suspect lower rates of lethal force has a lot to do with recruitment, training, verbal skills, deescalation techniques, not policing alone, and more restrictive gun laws.

This is all important in its own right but I also wanted to highlight it as an example of a more general principle about different levels of variation when considering policy interventions.

One of my favorite examples here is smoking: it’s really hard to have an individual-level intervention to help people quit smoking. But aggregate interventions, such as banning indoor smoking, seem to work. This seems a bit paradoxical: after all, aggregate changes are nothing but aggregations of individual changes, so how could it be easier to change the smoking behavior of many thousands of people, than to change behaviors one at a time? But that’s how it is. Individual decisions are not so individual, as is most obvious, perhaps, in the variation across populations and across eras in family size: nowadays, it’s trendy in the U.S. to have 3 kids; a couple decades back, 2 was the standard; and a few decades earlier, 4-child families were common. We make our individual choices based on what other people are doing. And, again, it’s really hard to quit smoking, which can make it seem like smoking is as inevitable as death or taxes, but smoking rates vary a lot by country, and by state within this country.

To return to the policing example, we’ve had lots of discussion about whether or not particular cops or particular police departments are racially biased—lots of comparisons within cities—but Moskos argues we have not been thinking hard enough about comparisons between cities. An interesting point, and it would be good to see it on the agenda.

My next 170 blog posts (inbox zero and a change of pace)

I’ve successfully emptied my inbox:

[Screenshot: empty inbox]

And one result was to fill up the blog through mid-January.

I think I’ve been doing enough blogging recently, so my plan now is to stop for awhile and instead transfer my writing energy into articles and books. We’ll see how it goes.

Just to give you something to look forward to, below is a list of what’s in the queue. I’m sure I’ll be interpolating new posts on this and that—there is an election going on, after all, indeed I just inserted a politics-related post two days ago. And other things come up sometimes that just can’t wait. Also, my co-bloggers are free to post on Stan or whatever else they want, whenever they want.

But this is what’s on deck so far:

In policing (and elsewhere), regional variation in behavior can be huge, and perhaps can give a clue about how to move forward.

Guy Fieri wants your help! For a TV show on statistical models for real estate

The p-value is a random variable

“What can recent replication failures tell us about the theoretical commitments of psychology?”

Documented forking paths in the Competitive Reaction Time Task

Shameless little bullies claim that published triathlon times don’t replicate

Bootstrapping your posterior

You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

Are stereotypes statistically accurate?

Will youths who swill Red Bull become adult cocaine addicts?

Science reporters are getting the picture

Modeling correlation of issue attitudes and partisanship within states

Tax Day: The Birthday Dog That Didn’t Bark

The history of characterizing groups of people by their averages

Calorie labeling reduces obesity Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places in the west coast and northeast that didn’t have calorie labeling

What’s gonna happen in November?

An ethnographic study of the “open evidential culture” of research psychology

Things that sound good but aren’t quite right: Art and research edition

Michael Porter as new pincushion

Kaiser Fung on the ethics of data analysis

One more thing you don’t have to worry about

Evil collaboration between Medtronic and FDA

His varying slopes don’t seem to follow a normal distribution

A day in the life

Letters we never finished reading

Better to just not see the sausage get made

Oooh, it burns me up

Birthdays and heat waves

Publication bias occurs within as well as between projects

Graph too clever by half

Take that, Bruno Frey! Pharma company busts through Arrow’s theorem, sets new record!

A four-way conversation on weighting and regression for causal inference

How paracompact is that?

In Bayesian regression, it’s easy to account for measurement error

Garrison Keillor would be spinning etc

“Brief, decontextualized instances of colaughter”

The new quantitative journalism

It’s not about normality, it’s all about reality

Hokey mas, indeed

You may not be interested in peer review, but peer review is interested in you

Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2:30pm Thurs 15 Sept)

Genius is not enough: The sad story of Peter Hagelstein, living monument to the sunk-cost fallacy

Bayesian Statistics Then and Now

No guarantee

Let’s play Twister, let’s play Risk

“Evaluating Online Nonprobability Surveys”


Pro Publica Surgeon Scorecard Update

Hey, PPNAS . . . this one is the fish that got away

FDA approval of generic drugs: The untold story

Acupuncture paradox update

More p-value confusion: No, a low p-value does not tell you that the probability of the null hypothesis is less than 1/2

Multicollinearity causing risk and uncertainty

Andrew Gelman is not the plagiarism police because there is no such thing as the plagiarism police.

Cracks in the thin blue line

Politics and chance

I refuse to blog about this one

“Find the best algorithm (program) for your dataset.”

NPR’s gonna NPR

Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit

Don’t trust Rasmussen polls!

Astroturf “patient advocacy” group pushes to keep drug prices high

It’s not about the snobbery, it’s all about reality: At last, I finally understand hatred of “middlebrow”

The never-back-down syndrome and the fundamental attribution error

Michael Lacour vs John Bargh and Amy Cuddy

It’s ok to criticize

“The Prose Factory: Literary Life in England Since 1918” and “The Windsor Faction”

Note to journalists: If there’s no report you can read, there’s no study


No, I don’t think the Super Bowl is lowering birth weights

Gray graphs look pretty

Should you abandon that low-salt diet?

Transparency, replications, and publication

Is it fair to use Bayesian reasoning to convict someone of a crime?

“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Some people are so easy to contact and some people aren’t.

Should Jonah Lehrer be a junior Gladwell? Does he have any other options?

Advice on setting up audio for your podcast

The Psychological Science stereotype paradox

We have a ways to go in communicating the replication crisis

Authors of AJPS paper find that the signs on their coefficients were reversed. But they don’t care: in their words, “None of our papers actually give a damn about whether it’s plus or minus.” All right, then!

Another failed replication of power pose

“How One Study Produced a Bunch of Untrue Headlines About Tattoos Strengthening Your Immune System”

Ptolemaic inference

How not to analyze noisy data: A case study

The problems are everywhere, once you know to look

“Generic and consistent confidence and credible regions”

Happiness of liberals and conservatives in different countries

Conflicts of interest

“It’s not reproducible if it only runs on your laptop”: Jon Zelner’s tips for a reproducible workflow in R and Stan

Unintentional parody of Psychological Science-style research redeemed by Dan Kahan insight

Rotten all the way through

Some modeling and computational ideas to look into

How to improve science reporting? Dan Vergano sez: It’s not about reality, it’s all about a salary

Kahan: “On the Sources of Ordinary Science Knowledge and Ignorance”

Why I prefer 50% to 95% intervals

How effective (or counterproductive) is universal child care? Part 1

How effective (or counterproductive) is universal child care? Part 2

“Another terrible plot”

Can a census-tract-level regression analysis untangle correlation between lead and crime?

The role of models and empirical work in political science

More on my paper with John Carlin on Type M and Type S errors

Should scientists be allowed to continue to play in the sandbox after they’ve pooped in it?

“Men with large testicles”

Sniffing tears perhaps not as effective as claimed

Thinking more seriously about the design of exploratory studies: A manifesto

From zero to Ted talk in 18 simple steps: Rolf Zwaan explains how to do it!

Individual and aggregate patterns in the Equality of Opportunity research project

Unfinished (so far) draft blog posts

Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness

How best to partition data into test and holdout samples?

Abraham Lincoln and confidence intervals

“Breakfast skipping, extreme commutes, and the sex composition at birth”

Discussion on overfitting in cluster analysis

Happiness formulas

OK, sometimes the concept of “false positive” makes sense.

“A bug in fMRI software could invalidate 15 years of brain research”

Interesting epi paper using Stan

How can you evaluate a research paper?

Some U.S. demographic data at zipcode level conveniently in R

So little information to evaluate effects of dietary choices

Frustration with published results that can’t be reproduced, and journals that don’t seem to care

Using Stan in an agent-based model: Simulation suggests that a market could be useful for building public consensus on climate change

Data 1, NPR 0

Dear Major Textbook Publisher

“So such markets were, and perhaps are, subject to bias from deep pocketed people who may be expressing preference more than actual expectation”

Temple Grandin

fMRI clusterf******

How to think about the p-value from a randomized test?

Avoiding selection bias by analyzing all possible forking paths

The social world is (in many ways) continuous but people’s mental models of the world are Boolean

Science journalist recommends going easy on Bigfoot, says you should bash mammograms instead

Applying statistical thinking to the search for extraterrestrial intelligence

An efficiency argument for post-publication review

Bayes is better

What’s powdery and comes out of a metallic-green cardboard can?

Low correlation of predictions and outcomes is no evidence against hot hand

Jail for scientific fraud?

Is the dorsal anterior cingulate cortex “selective for pain”?

Quantifying uncertainty in identification assumptions—this is important!

If I had a long enough blog delay, I could just schedule this one for 1 Jan 2026

Historical critiques of psychology research methods

p=.03, it’s gotta be true!

Objects of the class “George Orwell”

Sorry, but no, you can’t learn causality by looking at the third moment of regression residuals

“The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (Yet False) Research Conclusions”

Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements

Transformative treatments

Comment of the year

Migration explaining observed changes in mortality rate in different geographic areas?

Fragility index is too fragile

When you add a predictor the model changes so it makes sense that the coefficients change too.

Nooooooo, just make it stop, please!

“Which curve fitting model should I use?”

We fiddle while Rome burns: p-value edition

The Lure of Luxury

Confirmation bias

Problems with randomized controlled trials (or any bounded statistical analysis) and thinking more seriously about story time

When do stories work, Process tracing, and Connections between qualitative and quantitative research

A small, underpowered treasure trove?

Problems with “incremental validity” or more generally in interpreting more than one regression coefficient at a time

No evidence of incumbency disadvantage?

To know the past, one must first know the future: The relevance of decision-based thinking to statistical analysis

Powerpose update

Absence of evidence is evidence of alcohol?

“Estimating trends in mortality for the bottom quartile, we found little evidence that survival probabilities declined dramatically.”

SETI: Modeling in the “cosmic haystack”

There should really be something here for everyone. I don’t remember half these posts myself, and I look forward to reading them when they come out!

P.S. It’s a good thing I blog for free because nobody could pay me enough for the effort that goes into it.

All maps of parameter estimates remain misleading


Roland Rau writes:

After many years of applying frequentist statistical methods in mortality research, I just began to learn about the application of Bayesian methods in demography. Since I also wanted to change a part of my research focus on spatial models, I discovered your 1999 paper with Phil Price, All maps of parameter estimates are misleading. As this article is already 17 years old, I wanted to ask whether you think that the last part of the final sentence of the article—“we know of no satisfactory solution to the problem of generating maps for general use”—is still valid. Or would you recommend some other technique to avoid the pitfalls of plotting observed rates or posterior means/medians?

My reply:

For the reasons discussed in our article, I think that there is inherently no way to avoid a map of parameter estimates being misleading in some way (unless variation is tiny or the data have some symmetry so that all sample sizes are identical). It’s just not possible to project the globe of multivariate uncertainty onto the plane of point estimates.

That said, there could well be new ideas in how best to map uncertainty and variation. So I expect there has been progress in mapping parameter estimates in the past twenty years, even if there are fundamental mathematical constraints that will always be with us.

A kangaroo, a feather, and a scale walk into Viktor Beekman’s office

The Kangaroo with a feather effect

E. J. writes:

I enjoyed your kangaroo analogy [see also here—ed.] and so I contacted a talented graphical artist—Viktor Beekman—to draw it. The drawing is on Flickr under a CC license.

Thanks, Viktor and E.J.!

Stan 2.11 Good, Stan 2.10 Bad

Stan 2.11 is available for all interfaces

We are happy to announce that all of the interfaces have been updated to Stan 2.11. There was a subtle bug introduced in 2.10 where a probabilistic acceptance condition was being checked twice. Sorry about that and thanks for your patience. We’ve added some additional tests to catch this kind of thing going forward.

As usual, instructions on downloading all of the interfaces are linked from:

The bug was introduced in 2.10, so 2.9 (pre language syntax enhancements) should still be OK.

There are also a couple of bonus bug fixes: printing now works if the current iteration is rejected, and integer division by zero is now rejected rather than crashing.

Thanks to everyone who helped make this happen

We found the bug due to a reproducible example posted to our user list by Yannick Jadoul. Thanks!

Thanks to Joshua Pritikin, who updated OpenMx (a structural equation modeling package that uses Stan’s automatic differentiation), so that the CRAN release could go through.

Thanks also to Michael Betancourt, Daniel Lee, Allen Riddell, and Ben Goodrich of the Stan dev team for stepping up to fix Stan itself and get the PyStan and RStan interfaces out ASAP.

What’s next?

We’re aiming to release minor versions at most quarterly. Here’s what should be in the next release (2.12):

  • substantial speed improvements to our matrix arithmetic
  • compound declare and define statements
  • elementwise versions of all unary functions for arrays, vectors, and matrices
  • command refactor (mostly under the hood, but will make new command functionality much easier)
  • explicit control of proportionality (dropping constants) in probability mass and density functions in the language
  • vector-based lower- and upper-bounds constraints for variables

After the next release, we’ll bring the example model code up to our current recommendations on priors and our current recommendations on Stan programming. Then, after the command refactor, the way will be clear for the Stan 3 versions of our interfaces, where we’ll be making all of our interfaces consistent and giving them more fine-grained control. Stay tuned!

Even social scientists can think like pundits, unfortunately

I regularly read the Orgtheory blog, which has interesting perspectives from sociologists. Today I saw this, from Sean Safford:

I [Safford] actually hold to the idea that the winning candidate for President is always the one who has a clearer view of the challenges and opportunities facing the country and articulates a viable roadmap for how to navigate them.

I disagree entirely. I think Safford’s view is naive and it is based on the idea that the outcome of the election is determined by the candidates and the campaign. In contrast, I believe the political science research that says that economic conditions are the most important factor determining the election outcome. You could go through election after election and be hard-pressed to make the case that the winning candidate had a clearer view of the challenges etc. Sure, you can find some examples: arguably Reagan had a clearer view than Carter, and Obama had a clearer view than McCain, and . . . ummmm, maybe that’s about it. Or maybe not. You could make a pretty good argument for either candidate in most elections, from 1948 onward.

Also Safford should really really watch out about that “always.” In the postwar period, there have been three elections that were essentially ties: 1960, 1968, and 2000. Even if you want to make the case (a case that I completely disagree with) that presidential elections are won by candidates expressing a clearer view and a more viable roadmap, still, you can’t hope to think that this will work every time, not given that some elections are basically coin flips.

The above fallacies—the idea that elections are determined by candidates and campaigns, and the idea that there is some key by which the election outcome can be known deterministically—appear a lot in political journalism, and my colleagues and I spend a bit of time at the sister blog explaining why they’re wrong. I don’t usually see academic researchers making these errors, though.

I looked up Safford and he’s done a lot of qualitative work on labor and social networks. This is important stuff, and I expect that if I started opining on the effects of labor union strategies, I’d be about as confused as Safford is when writing about electoral politics. So I’d like to emphasize that I’m not trying to slam the guy for making this mistake. We all make mistakes, and what’s a blog for, if not to put some of our more casual speculations out for general criticism? (In contrast, I was more annoyed a few years ago when political theorist David Runciman had the BBC as his platform for spreading pundit-level errors about U.S. public opinion.) So, no hard feelings, it’s just interesting to see an academic make a mistake that I usually associate with pundits.

What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Simon Gates writes:

I thought you might be interested in a recently published clinical trial, for potential blog material. It picks up some themes that have cropped up in recent months. Also, it is important for the way statistical methods influence what can be life or death decisions.

The OPPTIMUM trial evaluated use of vaginal progesterone for prevention of preterm delivery. The trial specified three primary outcomes: 1. fetal death or delivery at less than 34 weeks’ gestation; 2. neonatal death or serious morbidity; 3. cognitive score at two years. Because there were three primary outcomes, they applied a Bonferroni/Holm correction to “control type 1 error rate”.

Results were:

| Outcome | Progesterone | Placebo | Risk ratio / difference (95% CI) | P unadjusted | P adjusted |
|---|---|---|---|---|---|
| Fetal death or delivery < 34 weeks | 96/600 | 108/597 | 0.86 (0.64, 1.17) | 0.34 | 0.67 |
| Neonatal death or serious morbidity | 39/589 | 60/587 | 0.62 (0.41, 0.94) | 0.02 | 0.072 |
| Cognitive score at 2 years | 97.3 (n=430) | 97.7 (n=439) | -0.48 (-2.77, 1.81) | 0.68 | 0.68 |
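
As a rough check on the table, the two binary outcomes can be recomputed from the raw counts. The sketch below is an illustration only, not the trial's analysis: `odds_ratio_ci` and `risk_ratio` are hypothetical helpers, and the unadjusted Wald interval on the log odds ratio is an assumption about what a simple calculation would look like. The published estimates may well come from an adjusted model.

```python
import math

def odds_ratio_ci(events_t, n_t, events_c, n_c, z=1.959964):
    """Unadjusted odds ratio with a Wald interval on the log scale.

    Hypothetical helper for illustration; not taken from the trial report.
    """
    a, b = events_t, n_t - events_t   # treated: events, non-events
    c, d = events_c, n_c - events_c   # control: events, non-events
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

def risk_ratio(events_t, n_t, events_c, n_c):
    """Ratio of the raw event proportions (treated over control)."""
    return (events_t / n_t) / (events_c / n_c)

# Counts copied from the table above: (events, n) for progesterone, then placebo.
outcomes = {
    "fetal death or delivery < 34 weeks": (96, 600, 108, 597),
    "neonatal death or serious morbidity": (39, 589, 60, 587),
}
for name, counts in outcomes.items():
    or_, lo, hi = odds_ratio_ci(*counts)
    rr = risk_ratio(*counts)
    print(f"{name}: OR {or_:.2f} ({lo:.2f}, {hi:.2f}), RR {rr:.2f}")
```

With these counts the unadjusted odds ratios come out at about 0.86 and 0.62, matching the table's point estimates, while the raw risk ratios are about 0.88 and 0.65. So the column labeled "risk ratio" looks closer to an odds ratio, though that is an inference from the arithmetic rather than something stated in the letter. The Wald intervals are close to, but not exactly, the published ones.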

The conclusion was (from the abstract): “Vaginal progesterone was not associated with reduced risk of preterm birth or composite neonatal adverse outcomes, and had no long-term benefit or harm on outcomes in children at 2 years of age.”

A few comments (there is lots more that could be said):

  1. There is a general desire to collapse results of a clinical trial into a dichotomous outcome: the treatment works or it doesn’t. This seems to stem from a desire for the trial to provide a decision as to whether the treatment should be used or not.  However, those decisions often come later and are largely based on other information as well (often cost-effectiveness).
  2. The approach taken here (adjustment of p-values for multiple testing) implies that a conclusion of treatment effectiveness would be made if any of the three “primary outcomes” had p<0.05, after adjustment.
  3. “Statistical significance” is taken to mean treatment effectiveness (as it usually is by clinicians, researchers and even statisticians).
  4. There were a lot of patients missing for the assessment of cognition at 2 years, so there has to be a question mark over that result.  It is certainly possible that bias has been introduced.
  5. The Bonferroni adjustment moves the neonatal death/morbidity outcome from “significance” to “non-significance”, but the data still support a reduction in this outcome more than no effect or an increase (as the posterior distribution would no doubt show).
  6. It seems to me that the statistical methods used here have really not helped to understand what the effects of this treatment are.
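
The adjustment discussed in points 2 and 5 can be sketched in a few lines. This is a minimal illustration of the standard Holm step-down and Bonferroni procedures applied to the rounded p-values from the results table. The function names are hypothetical, and the trial's own computation is not described in the letter.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m down to 1 as we move up the sorted list.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

def bonferroni_adjust(pvalues):
    """Bonferroni adjusted p-values: multiply each by the number of tests."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

# Unadjusted p-values as reported (rounded) in the results table.
unadjusted = [0.34, 0.02, 0.68]
print("Holm:      ", holm_adjust(unadjusted))
print("Bonferroni:", bonferroni_adjust(unadjusted))
```

From the rounded inputs, Holm gives roughly 0.68, 0.06, and 0.68. The published values (0.67, 0.072, 0.68) would be consistent with the same procedure applied to unrounded p-values, for example 0.024 × 3 = 0.072, but that is a guess from the arithmetic.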

I’d be interested in your take (and those of commenters) on the analysis and conclusions.

Interest declaration: I [Gates] was tangentially involved in this trial (as a member of the Data Monitoring Committee) and I know several of the authors of the paper.

My reply: Yes, I agree that the analysis does not seem appropriate for the goals of the study. Even setting aside the multiple comparisons issues, “not statistically significant” is not the same as “no effect.” Also the multiple comparisons correction bothers me for the usual reason that it doesn’t make sense to me that including more information should weaken one’s conclusion.

I wonder what would happen if they were to use a point system where death counts as 2 points and the other negative outcomes count as 1 point, or some other sort of graded scale? At this point maybe there would be a concern of fishing through the data, but for the next study of this sort maybe they can think ahead of time about a reasonable combined outcome measure.

The other issue is decision making: what to do when a study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best). In some ways the existing paper is good in that way, in that it presents the results right out there for people to look at. I’d like to see a formal decision analysis for whether, or when, to recommend this treatment for new patients.

Also, one more thing: the measurement of cognitive score has another selection bias problem which is that it is conditional on the baby being born. So it might be that they’d want to fit some sort of selection modeling to handle this. I could imagine a scenario in which a treatment reduced deaths and ended up also reducing cognitive scores if it saved the lives of babies who later had problems. Or maybe not, I’m not sure; it just seems like one more thing to think about.

Does Benadryl make you senile? Challenges in research communication

Mark Tuttle points to a post, “Common anticholinergic drugs like Benadryl linked to increased dementia risk” by Beverly Merz, Executive Editor, Harvard Women’s Health Watch. Merz writes:

In a report published in JAMA Internal Medicine, researchers offer compelling evidence of a link between long-term use of anticholinergic medications like Benadryl and dementia. . . .

A team led by Shelley Gray, a pharmacist at the University of Washington’s School of Pharmacy, tracked nearly 3,500 men and women ages 65 and older who took part in Adult Changes in Thought (ACT), a long-term study conducted by the University of Washington and Group Health, a Seattle healthcare system. They used Group Health’s pharmacy records to determine all the drugs, both prescription and over-the-counter, that each participant took in the 10 years before starting the study. Participants’ health was tracked for an average of seven years. During that time, 800 of the volunteers developed dementia. When the researchers examined the use of anticholinergic drugs, they found that people who used these drugs were more likely to have developed dementia than those who didn’t use them. Moreover, dementia risk increased along with the cumulative dose. Taking an anticholinergic for the equivalent of three years or more was associated with a 54% higher dementia risk than taking the same dose for three months or less. . . .

Scary. But then I scroll down, and here’s the very first comment, from Joe (no last name):

Took a look at the study
The difference between the people that were in the non users and heavy users is massively different. 3x higher on EVERY risk factor stroke, obese, etc.. Odd this would get so much traction with the press.. Borderline irresponsible of Harvard to publish this on their blog. Nothing here even hints at causality

So whassup? I can click too . . . so let’s see what the study says.

They used Cox proportional hazard regression, adjusting for a bunch of background variables:

[Image: screenshot from the paper listing the adjustment variables]

They excluded people where any of these covariates were missing.

And here are their results:

[Image: screenshot of the paper’s results]

Seems pretty clear. Although I guess they’re relying pretty heavily on their regression model. Maybe it would make sense to clean the data first by doing some matching so that you have treatment and control groups that are more similar, before running the regression.

Anyway, this all indicates some of the challenges of statistical communication.

For more, see this article by Natalie Smith with the provocative title, “Clinical Misinformation: The Case of Benadryl Causing Dementia,” and this article by Cynthia Fox with the opposite spin: “Strong Link Found Between Dementia, Common Anticholinergic Drugs.”

Tuttle writes:

For all the reasons this article speculates on, this could be true – these medications cause dementia.

But, as you know only too well, there are so many confounding variables here – the simple one is that currently unknown precursors of dementia cause people to take these drugs.

I don’t really know what to say here. On one hand, yes, lots of potential confounders, also the usual issues of statistical uncertainties, garden of forking paths, etc. On the other hand, it does seem valuable for researchers to find out what is currently happening. The whole thing is a challenge, especially given people’s inclination to base their views on N=1 anecdotal evidence.

Fish cannot carry p-values

Following up on our discussion from last week on inference for fisheries, Anders Lamberg writes:

Since I first sent you the question, there has been a debate here too.

In the discussion you sent, there is a debate both about the actual sampling (the mathematics) and about the more practical/biological issues: how accurately can farmed fish be separated from wild fish, is the 5 % farmed fish limit correct, etc. New data are constantly being acquired on this first type of question. I am not worried about that, because there is an actual process going on that makes the methods better.

However, it is the second question, the use of statistics and models, that until recently has not been discussed properly. Here a lot of biologists have used the concept “confidence interval” without really understanding what it means. I gave you an example of sampling 60 salmon in a population of 3000. There are a lot of examples where the sample size has been as low as 10 individuals. The problem has been how to interpret the uncertainty. Here is a constructed (but not far from realistic) example:

Population size is 3000. From different samples you could hypothetically get these three results:

1) You sample 10, get 1 farmed fish. This gives 10 % farmed fish
2) You sample 30, get 3 farmed fish. This gives 10 % farmed fish
3) You sample 60, get 6 farmed fish. This gives 10 % farmed fish

All surveys show the same result, but they are dramatically different when you have to draw a conclusion.
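
To see how different the three surveys are, one can put an interval around each 10 % estimate. The sketch below uses the Wilson score interval, which is my choice for illustration and not necessarily what the Norwegian surveys report. It also ignores the finite population of 3000 and any sampling bias, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(k, n, z=1.959964):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# The three hypothetical surveys from the letter: (farmed fish, sample size).
for k, n in [(1, 10), (3, 30), (6, 60)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: estimate {k / n:.0%}, interval {lo:.1%} to {hi:.1%}")
```

Under these assumptions the interval runs from roughly 2 % to 40 % with 10 fish, and from roughly 5 % to 20 % with 60 fish, even though the point estimate is 10 % every time.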

When reporting the sampling (current practice) it is the point estimate 10 % that is the main reported result. Sometimes the confidence interval with upper and lower limits is also reported, but not discussed. Since there is only one sample drawn from the populations, not discussing the uncertainty with such small samples can lead to wrong conclusions. In most projects a typical biologist is reporting, the results are a part of a hypothetico-deductive research process. The new thing with the farmed salmon surveys is that the results are measured against a defined limit: 5 %. If the point estimate is above 5 %, it means millions in costs (actually billions) for the industry. On the other hand, if the observed point estimate is below 5 % the uncertainty could affect the wild salmon populations. This could result in a long term disaster for the wild salmon.

With the risk of being viciously tabloid: The biologists (and I am one of them) have suddenly come into a situation where their reports have direct consequences. The question about the farmed salmon frequencies in the wild populations has become a political question in Norway – at the highest level. Suddenly we have to really discuss uncertainty in our data. I do not say that all biologists have been ignorant, but I suspect that a lot of publications have not and do not address uncertainty with respect.

In the last months, more mathematical expertise here in Norway has been brought into the “farmed salmon question.” The conclusion so far is that you cannot use the point estimate. You have to view the confidence interval as a test of a hypothesis:

H: The level of farmed salmon is over 5 %

If the 95 % confidence interval has an upper limit of 5 % or higher, you have to start measures. If the point estimate for example is 1 % but the upper limit in the 95 % confidence interval is 6 %, we must start the job to remove farmed salmon from that population. The problem with this, given that the confidence interval from almost all the surveys will contain the critical value of 5 % (although the point estimate is much lower), is that in most populations you cannot reject the hypothesis. The reason all the intervals contain the critical value is the small sample sizes.

To use this kind of sampling procedure your sample size should exceed about 200 salmon to give a result that will give the fish farming industry fair treatment. On the other hand, small sample sizes and large confidence intervals will always be a benefit for the wild salmon. I would like that on behalf of nature, but we biologists will then not be relevant as experts who give advice to society as a whole.

Then there are a lot of practical implications linked to the minimum sample size of 200. Since the sample is done by rod catch, some salmon will die due to the sampling procedure. But the most serious problem with the sampling is that several new reports now show that the farmed fish will more frequently take the bait. It is shown that the catchability of farmed salmon is from 3 to 10 times higher than that of wild salmon. This will vary so you cannot put in a constant factor in the calculations.
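
The size of this catchability problem can be illustrated with a simple correction. Assume, purely for illustration, that each farmed fish is r times as likely to be caught as each wild fish. If the true farmed share is p, the expected share in the catch is q = rp / (rp + 1 − p), which inverts to p = q / (q + r(1 − q)). The function name and the fixed values of r are hypothetical. As the letter notes, the real ratio varies, so this shows sensitivity rather than a usable correction.

```python
def true_share_from_catch(q, r):
    """Farmed share p implied by an observed catch share q when farmed fish
    are r times as catchable as wild fish (illustrative assumption)."""
    return q / (q + r * (1 - q))

observed = 0.10  # 10 % farmed fish in the rod catch, as in the example above
for r in (1, 3, 10):
    implied = true_share_from_catch(observed, r)
    print(f"catchability ratio {r}: implied true share {implied:.1%}")
```

With a 10 % catch share, a ratio of 3 implies a true share of about 3.6 %, and a ratio of 10 implies about 1.1 %. That range straddles very different conclusions relative to the 5 % limit.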

The solution so far seems to be to use other methods to acquire the samples. Snorkeling surveys in the rivers, performed by trained persons, show that over 85 % of the farmed fish are correctly classified. Since a snorkeling survey covers from 80 to 100 % of the population, the only significant error is misclassification, which is a small error compared to the uncertainty of small sample procedures.

Thanks again for showing interest in this question. The research institutions in Norway have not been that willing to even discuss the theme. I suspect that has to do with money. Fish farmers focus on growth and money, but sadly, I guess the researchers involved in monitoring environmental impacts see that a crisis gives more money for research. Therefore it is important to have the discussion free of all questions about money. Here in Norway I miss the kind of approach you have to the topic. The discussion, development, and testing of new hypotheses is the reason why we became biologists. It is the closest you come to being a criminal investigator. We did not want to become politicians.

My general comment is to remove the whole “hypothesis” thing. It’s an estimation problem. You’re trying to estimate the level of farmed fish, which varies over time and across locations. And you have some decisions to make. I see zero benefit, and much harm, to framing this as a hypothesis testing problem.
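
One way to sketch the estimation framing is to compute, for each survey, the posterior probability that the farmed share exceeds 5 %. This is a minimal illustration under loud assumptions: a uniform prior on the share, simple random sampling, no catchability bias, and no variation over time or across rivers. A real analysis would need to model all of those, and `prob_share_exceeds` is a hypothetical name.

```python
from math import comb

def prob_share_exceeds(k, n, limit=0.05):
    """P(true share > limit | k farmed fish in a sample of n), uniform prior.

    The posterior is Beta(k + 1, n - k + 1). For integer parameters its upper
    tail equals a binomial lower tail: P(p > c) = P(Binomial(n + 1, c) <= k).
    """
    return sum(comb(n + 1, j) * limit**j * (1 - limit)**(n + 1 - j)
               for j in range(k + 1))

# The three hypothetical surveys from the letter: (farmed fish, sample size).
for k, n in [(1, 10), (3, 30), (6, 60)]:
    print(f"{k}/{n}: P(share > 5%) = {prob_share_exceeds(k, n):.2f}")
```

Under these assumptions the probabilities are about 0.90, 0.93, and 0.97 for the three surveys. Numbers like these can feed directly into a decision about costs and risks, without a detour through accepting or rejecting a hypothesis.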

Wald and those other guys from the 1940s were brilliant, doing statistics and operations research in real time during a real-life war. But the framework they were using was improvised, it was rickety, and in the many decades since, people keep trying to adapt it in inappropriate settings. Time to attack inference and decision problems directly, instead of tying yourself into knots with hypotheses and confidence intervals and upper limits and all the rest.

Call for research on California water resources

Patrick Atwater writes:

I serve as a project manager of the California Data Collaborative, a coalition of water utilities working together to share data and ensure water reliability.

We’ve put together a quick call for ideas on studies into the demand effects of water rates leveraging this unique database. California’s water world is highly fragmented across 411 retailers so this centralized repository greatly simplifies the life of prospective researchers.

Your audience is the perfect crowd to leverage this dataset and if you haven’t noticed, we’ve got a big drought out here in California so could use all the help we can get!

I have no idea what this is about but you can click on the link to find out for yourself.

What makes a mathematical formula beautiful?


Hiro Minato pointed me to this paper (hyped here) by Semir Zeki, John Romaya, Dionigi Benincasa, and Michael Atiyah on “The experience of mathematical beauty and its neural correlates,” who report:

We used functional magnetic resonance imaging (fMRI) to image the activity in the brains of 15 mathematicians when they viewed mathematical formulae which they had individually rated as beautiful, indifferent or ugly. Results showed that the experience of mathematical beauty correlates parametrically with activity in the same part of the emotional brain, namely field A1 of the medial orbito-frontal cortex (mOFC), as the experience of beauty derived from other sources.

I wrote that I looked at the paper and I don’t believe it!

Minato replied:

I think what they did wasn’t good enough to answer or even approach the question (scientifically or otherwise). . . . Meanwhile, someone can probably study sociology or culture of mathematicians to understand why mathematicians want to describe some “good” mathematics as beautiful, elegant, etc.

I agree. Mathematical beauty is a fascinating topic; I just don’t think they’re going to learn much via MRI scans. It just seems like too crude a tool, kinda like writing a bunch of formulas on paper, feeding these sheets of paper to lab rats, and then performing a chemical analysis of the poop. The connection between input and output is just too noisy and indirect.

This seems like a problem that could use the collaboration of mathematicians, psychologists, and historians or sociologists. And just think of how much sociologist time you could afford, using the money you saved from not running the MRI machine!