A Psych Science reader-participation game: Name this blog post

In a discussion of yesterday’s post on studies that don’t replicate, Nick Brown did me the time-wasting disservice of pointing out a recent press release from Psychological Science which, as you might have heard, is “the highest ranked empirical journal in psychology.”

The press release is called “Blue and Seeing Blue: Sadness May Impair Color Perception” and it describes a recently published article by Christopher Thorstenson, Adam Pazda, and Andrew Elliot, which reports that “sadness impaired color perception along the blue-yellow color axis but not along the red-green color axis.”

Unfortunately the claim of interest is extremely difficult to interpret, as the authors do not seem to be aware of the principle that the difference between “significant” and “not significant” is not itself statistically significant.
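To make the principle concrete, here is a minimal sketch with made-up numbers. The estimates and standard errors below are illustrative assumptions, not values from the Thorstenson et al. paper, and the two estimates are assumed to be independent. One comparison clears the conventional threshold, the other does not, and yet the difference between them is nowhere near significant.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for a normal z-test of an estimate against zero."""
    z = abs(estimate) / se
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical effects on two outcomes (say, blue-yellow and red-green),
# each with the same standard error. These numbers are invented.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a, se_a)  # z = 2.5
p_b = two_sided_p(est_b, se_b)  # z = 1.0

# The comparison that actually matters: the difference between the two,
# with a standard error that assumes the estimates are independent.
diff = est_a - est_b
se_diff = sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"A: p = {p_a:.3f}   B: p = {p_b:.3f}   A minus B: p = {p_diff:.3f}")
```

Saying “A was significant and B was not” is a statement about two separate tests; the claim that the effect is specific to A requires a test of the difference.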


The paper also has other characteristic features of Psychological Science-style papers, including small samples of college students, lots of juicy researcher degrees of freedom in data-exclusion rules and in the choice of outcomes to analyze, and weak or vague theoretical support for the reported effects.

The theoretical claim was “maybe a reason these metaphors emerge was because there really was a connection between mood and perceiving colors in a different way,” which could be consistent with almost any connection between color perception and mood. And then once the results came out, we get this: “‘We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,’ says Thorstenson. ‘We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.’” This is of course completely consistent with a noise-mining exercise, in that just about any pattern can fit the theory, and then the details of whatever random thing comes up are likely to be a surprise.

It’s funny: it’s my impression that, when a scientist reports that his findings were a surprise, that’s supposed to be a positive thing. It’s not just turning the crank, it’s scientific discovery! A surprise! Like penicillin! Really, though, if something was a surprise, maybe you should take more seriously the possibility that you’re just capitalizing on chance, that you’re seeing something in one experiment (and then are motivated to find it in another). It’s the scientific surprise two-step, a dance discussed by sociologist Jeremy Freese.

As usual in such settings, I’m not saying that Thorstenson et al. are wrong in their theorizing, or that their results would not show up in a more thorough study on a larger sample. I’m just saying that they haven’t really made a convincing case, as the patterns they find could well be explainable by chance alone. Their data offer essentially no evidence in support of their theory, but the theory could still be correct, just unresolvable amid the experimental noise. And, as usual, I’ll say that I’d have no problem with this sort of result being published, just without the misplaced certainty. And, if the editors of Psychological Science think this sort of theorizing is worth publishing, I think they should also be willing to publish the same thing, even if the comparisons of interest are not statistically significant.

The contest!

OK, on to the main event. After Nick alerted me to this paper, I thought I should post something on it. But my post needed a title. Here were the titles I came up with:

“Feeling Blue and Seeing Blue: Desire for a Research Breakthrough May Impair Statistics Perception”


“Stop me before I blog again”


“The difference between ‘significant’ and ‘not significant’ is enough to get published in the #1 journal in psychology”


“They keep telling me not to use ‘Psychological Science’ as a punch line but then this sort of thing comes along”

Or simply,

“This week in Psychological Science.”

But maybe you have a better suggestion?

Winner gets a free Stan sticker.

P.S. We had another one just like this a few months ago.

P.P.S. I have nothing against Christopher Thorstenson, Adam Pazda, or Andrew Elliot. I expect they’re doing their best. It’s not their fault that (a) statistical methods are what they are, (b) statistical training is what it is, and (c) the editors of Psychological Science don’t know any better. It’s all too bad, but it’s not their fault. I laugh at these studies because I’m too exhausted to cry, that’s all. And, before you feel too sorry for these guys or for the editors of Psychological Science or think I’m picking on them, remember: if they didn’t want the attention, they didn’t need to publish this work in the highest-profile journal of their field. If you put your ideas out there, you have to expect (ideally, hope) that people will point out what you did wrong.

I’m honestly surprised that Psychological Science is still publishing this sort of thing. They’re really living up to their rep, and not in a good way. PPNAS I can expect will publish just about anything, as it’s not peer-reviewed in the usual way. But Psych Science is supposed to be a real journal, and I’d expect, or at least hope, better from them.

The USA’s unlikely presidential candidate.

With current lag, this should really appear in September but I thought I better post it now in case it does not remain topical.

It’s a news article by Linda May Kallestein, which begins as follows:

The socialist Bernie Sanders: Can a 73-year-old Jew, born of Polish immigrants, raised in modest circumstances, and who wants to introduce social democracy on the Scandinavian model, have a chance of becoming the USA’s next president? [translated from the Norwegian]

And here’s my quote:

[Screenshot of the quote as it appeared in the article]

I actually said it in English, but you get the picture. Not as exciting as the time I was quoted in Private Eye, but I’ll still take it.

The full story is on the sister blog.

To understand the replication crisis, imagine a world in which everything was published.


John Snow points me to this post by psychology researcher Lisa Feldman Barrett, who reacted to the recent news on the non-replication of many psychology studies with a contrarian, upbeat take, entitled “Psychology Is Not in Crisis.”

Here’s Barrett:

An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. . . .

But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works. . . . Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.

All this is fine. Indeed, I’ve often spoken of the fractal nature of science: at any time scale, whether it be minutes or days or years, we see a mix of forward progress and sudden shocks, realizations that much of what we’ve thought was true, isn’t. Scientific discovery is indeed both wonderful and unpredictable.

But Barrett’s article disturbs me too, for two reasons. First, yes, failure to replicate is a feature, not a bug, but only if you respect that feature, if you take the failure to replicate as an occasion to reassess your beliefs. If you just complacently say it’s no big deal, then you’re not taking the opportunity to learn.

Here’s an example. The recent replication paper by Nosek et al. had many examples of published studies that did not replicate. One example was described in Benedict Carey’s recent New York Times article as follows:

Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

Carey got a quote from the author of that original study. To my disappointment, the author did not say something like, “Hey, it looks like we might’ve gone overboard on that original study, that’s fascinating to see that the replication did not come out as we would’ve thought.” Instead, here’s what we got:

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

“Theory-required adjustments,” huh? Unfortunately, just about anything can be interpreted as theory-required. Just ask Daryl Bem.

We can actually see what the theory says. Philosopher Deborah Mayo went to the trouble to look up Bressan’s original paper, which said the following:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men.

Nothing at all about Italians there! Apparently this bit of theory requirement wasn’t apparent until after the replication didn’t work.

What if the replication had resulted in statistically significant results in the same direction as expected from the earlier, published paper? Would Bressan have called up the Reproducibility Project and said, “Hey, if the results replicate under these different conditions, something must be wrong. My theory requires that the model won’t work with American college students!” I really really don’t think so. Rather, I think Bressan would call it a win.

And that’s my first problem with Barrett’s article. I feel like she’s taking a heads-I-win, tails-you-lose position. A successful replication is welcomed as a confirmation, an unsuccessful replication indicates new conditions required for the theory to hold. Nowhere does she consider the third option: that the original study was capitalizing on chance and in fact never represented any general pattern in any population. Or, to put it another way, that any true underlying effect is too small and too variable to be measured by the noisy instruments being used in some of those studies.

As the saying goes, when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

My second problem with Barrett’s article is at the technical level. She writes:

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. . . . Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true only under certain conditions [emphasis in the original].

At one level, there is nothing to disagree with here. I don’t really like the presentation of phenomena as “true” or “false”—pretty much everything we’re studying in psychology has some effect—but, in any case, all effects vary. The magnitude and even the direction of any effect will vary across people and across scenarios. So if we interpret the phrase “the phenomenon is true” in a reasonable way, then, yes, it will only be true under certain conditions—or, at the very least, vary in importance across conditions.

The problem comes when you look at specifics. Daryl Bem found some comparisons in his data which, when looked at in isolation, were statistically significant. These patterns did not show up in replication. Satoshi Kanazawa found a correlation between beauty and sex ratio in a certain dataset. When he chose a particular comparison, he found p less than .05. What do we learn from this? Do we learn that, in the general population, beautiful parents are more likely to have girls? No. The most we can learn is that the Journal of Theoretical Biology can be fooled into publishing patterns that come from noise. (His particular analysis was based on a survey of 3000 people. A quick calculation using prior information on sex ratios shows that you would need data on hundreds of thousands of people to estimate any effect of the sort that he was looking for.) And then there was the himmicanes and hurricanes study which, ridiculous as it was, falls well within the borders of much of the theorizing done in psychology research nowadays. And so on, and so on, and so on.
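The “quick calculation” on sex ratios can be sketched roughly as follows. This is not the original calculation: the assumed true difference of 0.3 percentage points, the equal split into two groups, and the baseline proportion of 0.5 are all simplifying assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def se_diff_proportions(n_total, p=0.5):
    """Standard error of a difference in proportions between two equal groups."""
    n_group = n_total / 2
    return sqrt(2 * p * (1 - p) / n_group)


def n_total_needed(effect, p=0.5, alpha=0.05, power=0.80):
    """Total sample size for a two-group comparison of proportions
    (normal approximation, two-sided test, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n_group = 2 * p * (1 - p) * (z_alpha + z_power) ** 2 / effect**2
    return 2 * n_group


# Assumption for illustration only: a true difference of 0.3 percentage points.
ASSUMED_EFFECT = 0.003

print(f"Standard error with 3000 people: {se_diff_proportions(3000):.4f}")
print(f"People needed for 80% power: {n_total_needed(ASSUMED_EFFECT):,.0f}")
```

Under these assumptions the standard error from 3000 people is several times larger than the effect being sought, and the required sample runs well into the hundreds of thousands.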

We could let Barrett off the hook on the last quote above because she does qualify her statement with, “If the studies were well designed and executed . . .” But there’s the rub. How do we know if a study was well designed and executed? Publication in Psychological Science or PPNAS is not enough: lots and lots of poorly designed and executed studies appear in these journals. It’s almost as if the standards for publication are not just about how well designed and executed a study is, but also about how flashy the claims are, and whether there is a “p less than .05” somewhere in the paper. It’s almost as if reviewers often can’t tell whether a study is well designed and executed. Hence the demand for replication, hence the concern about unreplicated studies, or studies that for mathematical reasons are essentially dead on arrival because the noise is so much greater than the signal.

Imagine a world in which everything was published

A close reading of Barrett’s article reveals the centrality of the condition that studies be “well designed and executed,” and lots of work by statisticians and psychology researchers in recent years (Simonsohn, Button, Nosek, Wagenmakers, etc etc) has made it clear that current practice, centered on publication thresholds (whether it be p-value or Bayes factor or whatever), won’t do so well at filtering out the poorly designed and executed studies.

To discourage or disparage or explain away failed replications is to give a sort of “incumbency advantage” to published claims, which puts a burden on the publication process that it cannot really handle.

To better understand what’s going on here, imagine a thought experiment where everything is published, where there’s no such thing as Science or Nature or Psychological Science or JPSP or PPNAS; instead, everything’s published on Arxiv. Every experiment everyone does. And with no statistical significance threshold. In this world, nobody has ever heard of inferential statistics. All we see are data summaries, regressions, etc., but no standard errors, no posterior probabilities, no p-values.

What would we do then? Would Barrett reassure us that we shouldn’t be discouraged by failed replications, that everything already published (except, perhaps, for “a few bad eggs”) should be taken as likely to be true? I assume (hope) not. The only way this sort of reasoning can work is if you believe the existing system screens out the bad papers. But the point of various high-profile failed replications (for example, in the field of embodied cognition) is that, no, the system does not work so well. This is one reason the replication movement is so valuable, and this is one reason I’m so frustrated by people who dismiss replications or who claim that replications show that “the system works.” It only works if you take the information from the failed replications (and the accompanying statistical theory, which is the sort of thing that I work on) and do something about it!

As I wrote in an earlier discussion on this topic:

Suppose we accept this principle [that published results are to be taken as true, even if they fail to be replicated in independent studies by outsiders]. How, then, do we treat an unpublished paper? Suppose someone with a Ph.D. in biology posts a paper on Arxiv (or whatever is the biology equivalent), and it can’t be replicated? Is it ok to question the original paper, to treat it as only provisional, to label it as unreplicated? That’s ok, right? I mean, you can’t just post something on the web and automatically get the benefit of the doubt that you didn’t make any mistakes. Ph.D.’s make errors all the time (just like everyone else). . . .

Now we can engage in some salami slicing. According to Bissell (as I interpret here), if you publish an article in Cell or some top journal like that, you get the benefit of the doubt and your claims get treated as correct until there are multiple costly, failed replications. But if you post a paper on your website, all you’ve done is make a claim. Now suppose you publish in a middling journal, say, the Journal of Theoretical Biology. Does that give you the benefit of the doubt? What about Nature Neuroscience? PNAS? Plos-One? I think you get my point. A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time. Sure, approval by 3 referees or 6 referees or whatever is something, but all they did is read some words and look at some pictures.

It’s a strange view of science in which a few referee reports is enough to put something into a default-believe-it mode, but a failed replication doesn’t count for anything.

I’m a statistician so I’ll conclude with a baseball analogy

Bill James once wrote with frustration about humanist-style sportswriters, the sort of guys who’d disparage his work and say they didn’t care about the numbers, that they cared about how the athlete actually played. James’s response was that if these sportswriters really wanted to talk baseball, that would be fine—but oftentimes their arguments ended up having the form: So-and-so hit .300 in Fenway Park one year, or so-and-so won 20 games once, or whatever. His point was that these humanists were actually making their arguments using statistics. They were just using statistics in an uninformed way. Hence his dictum that the alternative to good statistics is not “no statistics,” it’s “bad statistics.”

That’s how I feel about the people who deny the value of replications. They talk about science and they don’t always want to hear my statistical arguments, but then if you ask them why we “have no choice but to accept” claims about embodied cognition or whatever, it turns out that their evidence is nothing but some theory and a bunch of p-values. Theory can be valuable but it won’t convince anybody on its own; rather, theory is often a way to interpret data. So it comes down to the p-values.

Believing a theory is correct because someone reported p less than .05 in a Psychological Science paper is like believing that a player belongs in the Hall of Fame because he hit .300 once in Fenway Park.

This is not a perfect analogy. Hitting .300 anywhere is a great accomplishment, whereas “p less than .05” can easily represent nothing more than an impressive talent for self-delusion. But I’m just trying to get at the point that ultimately it is statistical summaries and statistical models that are being used to make strong (and statistically ridiculous) claims about reality, hence statistical criticisms, and external data such as come from replications, are relevant.

If, like Barrett, you want to dismiss replications and say there’s no crisis in science: Fine. But then publish everything and accept that all data are telling you something. Don’t privilege something that happens to have been published once and declare it true. If you do that, and you follow up by denying the uncertainty that is revealed by failed replications (and was earlier revealed, on the theoretical level, by this sort of statistical analysis), well, then you’re offering nothing more than complacent happy talk.

P.S. Fred Hasselman writes:

I helped analyze the replication data of the Bressan & Stranieri study.

There were two replication samples:

Original effect is a level comparison after a 2x2x2 ANOVA:
F(1, 194) = 7.16, p = .008, f = 0.19
t(49) = 2.45, p = .02, Cohen’s d = 0.37

Replication 1 in-lab with N=263, Power > 99%, Cohen’s d = .06
Replication 2 on-line with N=317, Power > 99%, Cohen’s d = .09

Initially I did not have the time to read the entire article. I recently did, because I wanted to use the study as an example in a lecture.

I completely agree with the comparisons to Bem-logic.
What I ended up doing is showing the original materials and elaborating on the theory behind the hypothesis during the lecture.

After seeing the stimuli, learning about the hypothesis, but before learning about the replication studies, there was a consensus among students (99% female) that claims like the first sentence of the abstract should disqualify the study as a serious work of science:

ABSTRACT—Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extrapair mating with the former.

Think about it.
Men of higher genetic quality are poorer partners and parents.
That’s a fact you know.
And this genetic quality of men (yes, they mean attractiveness) is why women want their babies, more so than babies from their current partner (the ugly variety of men, but very sweet and good with kids).

My brain hurts.

Thankfully the conclusion is very modest:
In humans’ evolutionary past, the switch in preference from less to more sexually accessible men associated with each ovulatory episode would have been highly adaptive. Our data are consistent with the idea that, although the length of a woman’s reproductive lifetime and the extent of the potential mating network have expanded considerably over the past 50,000 years, this unconscious strategy guides women’s mating choices still.

Erratum: We meant ‘this unconscious strategy guides Italian women’s mating choices still’.


Stan attribution


I worry that I get too much credit for Stan. So let me clarify. I didn’t write Stan. Stan is written in C++, and I’ve never in my life written a line of C, or C+, or C++, or C+++, or C-, or any of these things.

Here’s a quick description of what we’ve all done, listed in order of joining the development team.

• Andrew Gelman (Columbia University)
chief of staff, chief of marketing, chief of fundraising, chief of modeling, chief of training, max marginal likelihood, expectation propagation, posterior analysis, R, founder

• Bob Carpenter (Columbia University)
language design, parsing, code generation, autodiff, templating, ODEs, probability functions, constraint transforms, manual, web design / maintenance, fundraising, support, training, C++, founder

• Matt Hoffman (Adobe Creative Technologies Lab)
NUTS, adaptation, autodiff, memory management, (re)parameterization, C++, founder

• Daniel Lee (Columbia University)
chief of engineering, CmdStan (founder), builds, continuous integration, testing, templates, ODEs, autodiff, posterior analysis, probability functions, error handling, refactoring, C++, training

• Ben Goodrich (Columbia University)
RStan, multivariate probability functions, matrix algebra, (re)parameterization, constraint transforms, modeling, R, C++, training

• Michael Betancourt (University of Warwick)
chief of smooth manifolds, MCMC, Riemannian HMC, geometry, analysis and measure theory, ODEs, CmdStan, CDFs, autodiff, transformations, refactoring, modeling, variational inference, logos, web design, C++, training

• Marcus Brubaker (University of Toronto, Scarborough)
optimization routines, code efficiency, matrix algebra, multivariate distributions, C++

• Jiqiang Guo (NPD Group)
RStan (founder), C++, Rcpp, R

• Peter Li (Columbia University)
RNGs, higher-order autodiff, ensemble sampling, Metropolis, example models, C++

• Allen Riddell (Dartmouth College)
PyStan (founder), C++, Python

• Marco Inacio (University of São Paulo)
functions and distributions, C++

• Jeffrey Arnold (University of Washington)
emacs mode, pretty printing, manual, emacs

• Rob J. Goedman (D3Consulting b.v.)
parsing, Stan.jl, C++, Julia

• Brian Lau (CNRS, Paris)
MatlabStan, MATLAB

• Mitzi Morris (Lucidworks)
parsing, testing, C++

• Rob Trangucci (iSENTIUM)
max marginal likelihood, multilevel modeling and poststratification, template metaprogramming, training, C++, R

• Jonah Sol Gabry (Columbia University)
shinyStan (founder), R

• Alp Kucukelbir (Columbia University)
variational inference, C++

• Robert L. Grant (St. George’s, University of London & Kingston University)
StataStan, Stata

• Dustin Tran (Harvard University)
variational inference, C++

Development Team Alumni

These are developers who have made important contributions in the past, but are no longer contributing actively.

• Michael Malecki (YouGov plc)
original design, modeling, logos, R

• Yuanjun Guo (Columbia University)
dense mass matrix estimation, C++

Constructing an informative prior using meta-analysis

Chris Guure writes:

I am trying to construct an informative prior by synthesizing or collecting some information from literature (meta-analysis) and then to apply that to a real data set (it is longitudinal data) for over 20 years follow-up.

In constructing the prior using the meta-analysis data, the issue of publication bias came up. I have tried looking to see if there is any literature on this but it seems almost all the articles on Bayesian meta-analysis do not actually account for this issue apart from one (Givens, Smith and Tweedie 1997).

My thinking was that I could assume a data augmentation approach by fitting a joint model with the assumption that the observed data are normally distributed and the unobserved studies probably exist but not included in my studies and can be thought of to be missing data (missing not at random or non-ignorable missingness). This way a Bernoulli distribution could be used to account for the missingness.

But according to Lesaffre and Lawson 2012, pp. 196; in hierarchical models, the data augmentation approach enters in a quite natural way via the latent (unobserved) random effects. This statement to me implies that my earlier idea may not be necessary and may even bias the posterior estimates.

My reply: You could certainly do this, build a model in which there are a bunch of latent unreported studies and then go from there. I don’t know how well this would work, though, for two reasons:

1. Estimating what’s missing based on the shape of the distribution: that’s tough. Inferences will be so sensitive to all sorts of measurement and selection issues, and I’d be skeptical of whatever comes out.

2. You’re trying to adjust for unreported studies in a meta-analysis. But I’d be much more worried about choices in data processing and analysis in each of the studies you have. As I’ve written many times, I think the file-drawer problem is overrated and it’s nothing compared to the garden of forking paths.
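For reference, the basic step of turning a set of published estimates into a normal prior can be sketched as below. This is a minimal sketch using DerSimonian-Laird random-effects pooling, which is one common choice and not necessarily what the correspondent had in mind. The function name and the study numbers are hypothetical. The sketch takes the published estimates at face value, so it does nothing about publication bias or forking paths, which is exactly the concern raised in the two points above.

```python
from math import sqrt


def prior_from_studies(estimates, std_errors):
    """Return (mean, sd, tau) of a normal prior for the effect in a new study,
    using DerSimonian-Laird random-effects pooling of published estimates."""
    k = len(estimates)
    w = [1 / se**2 for se in std_errors]  # inverse-variance weights
    mu_fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)

    # Method-of-moments estimate of the between-study variance tau^2
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects pooled mean and its variance
    w_star = [1 / (se**2 + tau2) for se in std_errors]
    mu = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    var_mu = 1 / sum(w_star)

    # A new study's effect varies around mu with variance tau^2,
    # so the prior sd combines both sources of uncertainty.
    return mu, sqrt(var_mu + tau2), sqrt(tau2)


# Hypothetical published estimates and standard errors (invented numbers)
estimates = [0.30, 0.10, 0.45, 0.20]
std_errors = [0.15, 0.10, 0.20, 0.12]

prior_mean, prior_sd, tau = prior_from_studies(estimates, std_errors)
print(f"prior: normal({prior_mean:.3f}, {prior_sd:.3f}), tau = {tau:.3f}")
```

If the individual studies are suspected of overestimating, one could widen the prior or shrink its mean toward zero, but by how much is a judgment call that this sketch does not attempt to make.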

Uri Simonsohn warns us not to be falsely reassured


I agree with Uri Simonsohn that you don’t learn much by looking at the distribution of all the p-values that have appeared in some literature. Uri explains:

Most p-values reported in most papers are irrelevant for the strategic behavior of interest.

Covariates, manipulation checks, main effects in studies testing interactions, etc. Including them we underestimate p-hacking and we overestimate the evidential value of data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?”

He demonstrates with an example and summarizes:

Looking at all p-values is falsely reassuring.

I agree and will just add two comments:

1. I prefer the phrase “garden of forking paths” because I think the term “p-hacking” suggests intentionality or even cheating. Indeed, in the quoted passage above, Simonsohn refers to “strategic behavior.” I have no doubt that some strategic behavior and even outright cheating goes on, but I like to emphasize that the garden of forking paths can occur even when a researcher does only one analysis of the data at hand and does not directly “fish” for statistical significance.

The idea is that analyses are contingent on data, and researchers can and do make choices in data coding, data exclusion, and data analysis in light of the data they see, setting various degrees of freedom in reasonable-seeming ways that support their model of the world, thus being able to obtain statistical significance at a high rate, merely by capitalizing on chance patterns in data. It’s the forking paths, but it doesn’t feel like “hacking,” nor is it necessarily “strategic behavior” in the usual sense of the term.
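A caricature of this mechanism can be simulated. In the sketch below, every true effect is zero. The researcher has a handful of reasonable-seeming comparisons available, looks at the data, and then runs just one test, on whichever comparison looks clearest. The number of available comparisons and their independence are assumptions made for illustration; real data-contingent choices are subtler than picking the largest z-score.

```python
import random
from statistics import NormalDist


def forking_paths_rate(n_choices=4, n_sims=20_000, alpha=0.05, seed=0):
    """Share of pure-noise studies that reach significance when the single
    reported comparison is chosen after seeing the data."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        # z-scores of the comparisons that could have been run; all nulls are true
        zs = [rng.gauss(0, 1) for _ in range(n_choices)]
        # Only one test is actually performed, but which one depends on the data
        chosen = max(zs, key=abs)
        if abs(chosen) > z_crit:
            hits += 1
    return hits / n_sims


print(f"One pre-chosen comparison: {forking_paths_rate(n_choices=1):.3f}")
print(f"Four possible comparisons: {forking_paths_rate(n_choices=4):.3f}")
```

With a single pre-specified comparison the rate stays near the nominal 5 percent; with a data-dependent choice among several it is well above that, even though only one p-value was ever computed.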

2. If p-values are what we have, it makes sense to learn what we can from them, as in the justly influential work of Uri Simonsohn, Greg Francis, and others. But, looking at the big picture, once we move to the goal of learning about underlying effects, I think we want to be analyzing raw data (and in the context of prior information), not merely pushing these p’s around. P-values are crude data summaries, and a lot of information can be lost by moving from raw data to p-values. Doing science using published p-values is like trying to paint a picture using salad tongs.

On deck this week

Mon: Constructing an informative prior using meta-analysis

Tues: Stan attribution

Wed: Cannabis/IQ follow-up: Same old story

Thurs: Defining conditional probability

Fri: In defense of endless arguments

Sat: Emails I never finished reading

Sun: BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

“Another bad chart for you to criticize”

Stuart Buck sends in this Onion-worthy delight:


Performing design calculations (type M and type S errors) on a routine basis?

Somebody writes:

I am conducting a survival analysis (median follow up ~10 years) of subjects who enrolled on a prospective, non-randomized clinical trial for newly diagnosed multiple myeloma. The data were originally collected for research purposes and specifically to determine PFS and OS of the investigational regimen versus historic controls. The trial has been closed to new enrollment for many years; however, we are monitoring for disease progression and all cause mortality.

Here is the crux of the issue. Although data were prospectively collected for research purposes, my investigational variable was collected but not reported as a variable. The results of the prospective trial (PFS and OS) have been previously published in Blood. I am updating the original report with the long-term follow up, but am also exploring the potential impact of my new variable on PFS and OS. I have not yet analyzed the data and do not know the potential impact, or magnitude of impact, on PFS or OS. If I am interpreting your paper correctly, I believe that I should treat the power calculation on a post-hoc basis and utilize Type S and Type M analysis.

I know this is brief, if you would offer a comment or a direction I would be deeply grateful. I am sure it is obvious that I don’t study statistics, I focus on the biology of multiple myeloma.

Fair enough. I’m no expert on myeloma. As a matter of fact, I don’t even know what myeloma is! (Yes, I could google it, but that would be cheating.) Based on the above paragraphs, I assume it is a blood-related disease.

Anyway, my response is, yes, I think it would be a good idea to do some design analysis, using your best scientific understanding to hypothesize an effect size and then going from there, to see what “statistical significance” really implies in such a case, given your sample size and error variance. The key is to hypothesize a reasonable effect size—don’t just use the point estimate from a recent study, as this can be contaminated by the statistical significance filter.
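A minimal sketch of what such a design calculation could look like, assuming the estimate is approximately normal around the true effect. The function name `design_analysis` and the inputs (a log hazard ratio of 0.2 with a standard error of 0.25) are hypothetical placeholders, not numbers from the study described above; in practice the effect size should come from outside scientific knowledge.

```python
from statistics import NormalDist


def design_analysis(true_effect, se, alpha=0.05):
    """Return (power, type S error rate, type M exaggeration ratio).

    Assumes the estimate is normal with mean true_effect (> 0) and
    standard deviation se, and that "significant" means |estimate| > z * se.
    """
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    d = true_effect / se

    p_right = 1 - std.cdf(z - d)  # significant, with the correct sign
    p_wrong = std.cdf(-z - d)     # significant, with the wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power

    # Expected |estimate| restricted to significant results,
    # from the means of the two truncated normal tails.
    abs_sig = true_effect * (p_right - p_wrong) + se * (std.pdf(z - d) + std.pdf(z + d))
    exaggeration = abs_sig / power / true_effect
    return power, type_s, exaggeration


# Hypothetical inputs, for illustration only.
power, type_s, exaggeration = design_analysis(true_effect=0.2, se=0.25)
print(f"power = {power:.2f}")
print(f"type S error rate = {type_s:.3f}")
print(f"type M (exaggeration) ratio = {exaggeration:.1f}")
```

Rerunning this over a range of plausible effect sizes shows how sensitive the conclusions are to the hypothesized effect.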

New paper on psychology replication


The Open Science Collaboration, a team led by psychology researcher Brian Nosek, organized the replication of 100 published psychology experiments. They report:

A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.

“Despite” is a funny way to put it. Given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which just makes things more so. It would be meaningless to try to obtain a general value for the “Edlin factor” but it’s gotta be less than 1, so of course exact replications should produce weaker evidence than claimed from the original studies.

Things may change if and when it becomes standard to report Bayesian inferences with informative priors, but as long as researchers are reporting selected statistically-significant comparisons—and, no, I don’t think that’s about to change, even with the publication and publicity attached to this new paper—we can expect published estimates to be overestimates.
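The filter is easy to see in a toy simulation. The sketch below assumes one true effect shared by all studies, normally distributed estimates, and exact replications with the same standard error; all the numbers are made up for illustration and have nothing to do with the 100 replicated experiments.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2  # hypothetical true effect
SE = 0.2           # hypothetical standard error of each study
Z = 1.96           # two-sided 5% threshold

published = []     # original estimates that passed the significance filter
replications = []  # exact replications of those published studies

for _ in range(100_000):
    estimate = random.gauss(TRUE_EFFECT, SE)
    if abs(estimate) > Z * SE:
        published.append(abs(estimate))
        # The replication is reported whether or not it is significant.
        replications.append(random.gauss(TRUE_EFFECT, SE))

mean_published = statistics.mean(published)
mean_replication = statistics.mean(replications)

print(f"true effect:               {TRUE_EFFECT:.2f}")
print(f"mean published estimate:   {mean_published:.2f}")
print(f"mean replication estimate: {mean_replication:.2f}")
print(f"ratio:                     {mean_replication / mean_published:.2f}")
```

Under these assumptions the replications recover the true effect on average, while the published estimates overshoot it, so the ratio comes out below 1 with no misconduct required.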

That said, even though these results are no surprise, I still think they’re valuable.

As I told Monya Baker in an interview for a news article, “this new work is different from many previous papers on replication (including my own) because the team actually replicated such a large swathe of experiments. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations. But they will have a harder time shrugging off the latest study, the value of this project is that hopefully people will be less confident about their claims.”

Nosek et al. provide some details in their abstract:

The mean effect size of the replication effects was half the magnitude of the mean effect size of the original effects, representing a substantial decline. Ninety-seven percent of original studies had significant results. Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

This is all fine. Again, the general results are no surprise, but it’s good to see some hard numbers with real experiments. The only thing that bothers me in the above sentence is the phrase, “if no bias in original results is assumed . . .” Of course there is bias in the original results (see discussion above), so this just seems like a silly assumption to make. I think I know where the authors are coming from (they’re saying, even if there was no bias, there’d be problems), but really the no-bias assumption makes no sense given the statistical significance filter, so this seems unnecessary.

Anyway, great job! This was a big effort and it deserves all the publicity it’s getting.

Disclaimer: I am affiliated with the Open Science Collaboration. I’m on the email list, and at one point I was one of the zillion authors of the article. At some point I asked to be removed from the author list, as I felt I hadn’t done enough—I didn’t do any replication, nor did I do any data analysis, all I did was participate in some of the online discussions. But I do feel generally supportive of the project and am happy to be associated with it in whatever way that is.

A political sociological course on statistics for high school students

Ben Frisch writes:

I am designing a semester long non-AP Statistics course for high school juniors and seniors. I am wondering if you had some advice for the design of my class. My current thinking for the design of the class includes:

0) Brief introduction to R/ R Studio and descriptive statistics and data sheet structure.

1) Great Migration in 20th Century US. Students will read sections of “The Warmth of Other Suns”. Each student will explore the size of the Great Migration from the South in an industrial city of their choice. We will use the IPUMS micro census data to estimate white and black migration from Southern states and use the income figures to compare migrants and non migrant residents over the years 1910 – 1980. The old teaching software package Fathom used to do the sampling from IPUMS easily, but the Census sampling feature now no longer works with the newer operating systems. I will have the students sample directly from the University of Minnesota site and then decode their samples in Excel and R Studio. A final part of the project will be visits with retired people who were a part of the migration.

2) I plan to have the students divide into working groups to prepare statistical information for lobbying elected officials on a social problem of their choice. We have access to the AFSC’s Criminal Justice program near our school and immigration rights might be a fruitful topic to study after our examination of migration.

3) It will be primary season again next Spring and I would love to have the students look at geographical effects in political elections. We will, of course, study polling and survey design and explore sampling distributions.

I have just picked up copies of your texts “A Quantitative Tour…” and “Teaching Statistics…” and I plan to mine them for other activities to explore. I also will be catching up on reading your blog!

This sounds great! My only tip is to do as much of the data analysis yourself first so you can be sure your students can handle it. I did some IPUMS stuff recently and there were lots of little details with the data that were difficult to handle at first.

Perhaps readers of this blog will have other suggestions.

Vizzy vizzy vizzy viz


Nadia Hassan points me to this post by Matthew Yglesias, who writes:

Here’s a very cool data visualization from that took me a minute to figure out because it’s a little bit unorthodox. The way it works is that it visualizes the entire world’s economic output as a circle. That circle is then subdivided into a bunch of blobs representing the economy of each major country. And then each country-blob is sliced into three chunks — one for manufacturing, one for services, and one for agriculture.


What do I like about this image and what don’t I like?

Paradoxically, the best thing about this graph may also be its worst: Its tricky, puzzle-like characteristic (it even looks like some sort of hi-tech jigsaw puzzle) makes it hard to read, hard to follow, but at the same time gratifying for the reader who goes to the trouble of figuring it out.

It’s the Chris Rock effect: Some graphs give the pleasant feature of visualizing things we already knew, shown so well that we get a shock of recognition, the joy of relearning what we already know, but seeing it in a new way that makes us think more deeply about all sorts of related topics.

As a statistician, I can tell you a whole heap of things I don’t like about this graph, starting with the general disorganization: there’s no particular way to find any country you might be looking for, and there seems to be no logic to the spatial positions. I have no idea what Australia is doing in the middle of the circle, or why South Korea and Switzerland are long and thin while Mexico and India are more circular. The breakdown of economy into services/industry/agriculture is particularly confusing because of all the different shapes, and for heaven’s sake, why are the numbers given to a hyper-precise two decimal places?? (You might wonder what it means to say that Russia is “2.49%” of the world economy, given that, last time I checked, readily-available estimates of Russia’s GDP per capita varied by more than a factor of five!)

Yglesias’s post is headlined, “This striking diagram will change how you look at the world economy,” and I can believe it will change people’s understanding, not because the data are presented clearly or because the relevant comparisons are easily available, but because the display is unusual enough that it might motivate people to stare at these numbers that they otherwise might ignore.

Some of the problems with this graph can be seen by carefully considering this note from Yglesias:

You can see some cool things here.

For example, compare the US and China. Our economy is much larger than theirs, but our industrial sectors are comparable in size, and China’s agriculture sector looks to be a little bit larger. Services are what drive the entire gap.

The UK and France have similarly sized overall economies, but agriculture is a much bigger slice of the French pie.

For all that Russia gets played up as some kind of global menace, its economy produces less than Italy. Put all the different European countries together, and Russia looks pathetic.

You often hear the phrase “China and India,” but you can see here that the two Asian giants are in very different shape economically.

The only African nation on this list, South Africa, has a smaller economy than Colombia.

What struck me about all these items is how difficult it actually is to find them in the graph. Comparing the U.S. with China on their industry sector, that’s tough: you have to figure out which color is which—it’s particularly confusing here because the color codes for the two countries are different—and then compare two quite different shapes, a task that would make Jean Piaget flip out. The U.K. and France can be compared without too much difficulty but only because they happen to be next to each other, through some quirk of the algorithm. Comparing China and India is not so easy—it took me awhile to find India on this picture. And finding South Africa was even trickier.

My point is not that the graph is “bad”—I’d say it’s excellent for its #1 purpose which is to draw attention to these numbers. It’s just an instructive example for what one might want in a data display.

The click-through solution

As always, I recommend what I call the “click-through solution”: Start with a visually grabby graphic like this one, something that takes advantage of the Chris Rock effect to suck the viewer in. Then click and get a suite of statistical graphs that allow more direct visual comparisons of the different countries and different sectors of the economy. Then click again to get a spreadsheet with all the numbers and a list of sources.

Stan’s 3rd birthday!

Stan v1.0.0 was released on August 30, 2012. We’ve come a long way since.

If you’re around and want to celebrate with some Stan developers and users, feel free to join us:

Monday, August 31.
6 – 9 pm
Untamed Sandwiches
43 W 39th St
New York, NY

If you didn’t know, we also have a Stan Users NYC group that meets every few months.

Thanks and hope to see some of you there.

“Can you change your Bayesian prior?”

Deborah Mayo writes:

I’m very curious as to how you would answer this for subjective Bayesians, at least. I found this section of my book showed various positions, not in agreement.

I responded on her blog:

As we discuss in BDA and elsewhere, one can think of one’s statistical model, at any point in time, as a placeholder, an approximation or compromise given constraints of computation and of expressing one’s model. In many settings the following iterative procedure makes sense:

1. Set up a placeholder model (that is, whatever statistical model you might fit).

2. Perform inference (no problem, now that we have Stan!).

3. Look at the posterior inferences. If some of the inferences don’t “make sense,” this implies that you have additional information that has not been incorporated into the model. Improve the model and return to step 1.

If you look carefully you’ll see I said nothing about “prior,” just “model.” So my answer to your question is: Yes, you can change your statistical model. Nothing special about the “prior.” You can change your “likelihood” too.

And Mayo responded:

Thanks. But surely you think it’s problematic for a subjective Bayesian who purports to be coherent?

I wrote back: No, subjective Bayesianism is inherently incoherent. As I’ve written, if you could in general express your knowledge in a subjective prior, you wouldn’t need Bayesian Data Analysis or Stan or anything else: you could just look at your data and write your subjective posterior distribution. The prior and the data models are just models, they’re not in practice correct or complete.

More here on noninformative priors.

And here’s an example of the difficulty of throwing around ideas like “prior probability” without fully thinking them through.

“The belief was so strong that it trumped the evidence before them.”

I was reading Palko on the 5 cent cup of coffee and spotted this:

We’ve previously talked about bloggers trying to live on a food stamp budget for a week (yeah, that’s a thing). One of the many odd recurring elements of these posts is a litany of complaints about life without caffeine because

I had already understood that coffee, pistachios and granola, staples in my normal diet, would easily blow the weekly budget.

Which is really weird because coffee isn’t all that expensive.

Palko then goes into detail about how easy it is to buy a can of ground coffee at the supermarket for the cost of 5 or 10 cents a cup.

He continues:

On the other end, if you go to $0.15 or $0.20 a cup and you know how to shop, you can move up into some surprisingly high-quality whole bean coffee . . . you can do better than the typical cup of diner coffee for a dime and better than what you’d get from most coffee houses for a quarter.

To be clear, I’m not recommending that everyone rush out to Wal-Mart for a big ol’ barrel of Great Value Classic Roast. If your weekly food budget is more than fifty dollars a week, bargain coffee should be near the bottom of your concerns.

But here’s the important point—that is, important in general, not just for coffee drinkers (of which I am not one):

What we’re interested in here are perceptions. The people we discussed earlier suffered through a week of headaches and other caffeine-withdrawal pains, not because they couldn’t afford it but because the belief that they couldn’t afford it was so strong that it trumped the evidence before them.

This comes up a lot. People condition on information that isn’t true.

On deck this week

Mon: “The belief was so strong that it trumped the evidence before them.”

Tues: “Can you change your Bayesian prior?”

Wed: How to analyze hierarchical survey data with post-stratification?

Thurs: A political sociological course on statistics for high school students

Fri: Questions about data transplanted in kidney study

Sat: Performing design calculations (type M and type S errors) on a routine basis?

Sun: “Another bad chart for you to criticize”

We provide a service

A friend writes:

I got the attached solicitation [see below], and Google found me your blog post on the topic. Thank you for quickly explaining what’s going on here!

As far as I can see, they’ve removed the mention of payment from this first contact message – so they’re learning!

But also they have enough past clients to be able to include some nice clips. Ah, the pathological results of making academics feel obliged to self-promote.

This time the email didn’t come from “Nick Bagnall,” it came from “Josh Carpanini.” Still spam. But, as I wrote last time, it’s better than mugging old ladies for spare change or selling Herbalife dealerships.

P.S. Here’s the solicitation:

From: Josh Carpanini
Date: Friday, June 5, 2015
Subject: International Innovation – Highlighting Impacts of Technology Research

Dear Dr **,

I hope this message finds you well.

I was hoping to speak with you at some point in the next few days about an upcoming Technology edition of International Innovation. I have come across some of your research and I am very interested to discuss with you the possibility of highlighting your work within the forthcoming July edition.

I would like to create an article about your work within our next edition; this would be similar in format to some of the attached example articles from previous editions. As you can see, the end result would be a piece looking at the wider implications and impact of your current research. . . .

Plaig! (non-Wegman edition)

Mark Vallen writes (link from here):

What initially disturbed me about the art of Shepard Fairey is that it displays none of the line, modeling and other idiosyncrasies that reveal an artist’s unique personal style. His imagery appears as though it’s xeroxed or run through some computer graphics program; that is to say, it is machine art that any second-rate art student could produce. . . .

Fairey’s Greetings from Iraq is not a direct scan or tracing of the FAP print, but it does indicate an over reliance on borrowing the design work of others. There was no political point or ironic statement to be made by expropriating the FAP print – it was simply the act of an artist too lazy to come up with an original artwork. . . .

Some supporters of Shepard Fairey like to toss around a long misunderstood quote by Pablo Picasso, “Good artists copy, great artists steal.” Aside from the ridiculous comparison of Fairey to Picasso, there’s little doubt that Picasso was referring to the “stealing” of aesthetic flourishes and stylings practiced by master artists, and not simply carting off their works and putting his signature to them.

A last ditch defense used by Fairey groupies is to acknowledge that their champion does indeed “borrow” the works of other artists both living and deceased, but it is argued that the plundered works are all in the “public domain”, and therefore the rights of artists have not been violated. There are those who say that artists should have the right to alter and otherwise modify already existing works in order to produce new ones or to make pertinent statements. Despite some reservations I generally agree with that viewpoint – provided that such a process is completely transparent. . . .

I’m reminded of George Orwell’s classic slam on lazy and dishonest writing:

Each of these passages has faults of its own, but, quite apart from avoidable ugliness, two qualities are common to all of them. The first is staleness of imagery; the other is lack of precision. The writer either has a meaning and cannot express it, or he inadvertently says something else, or he is almost indifferent as to whether his words mean anything or not. This mixture of vagueness and sheer incompetence . . .

Laziness and dishonesty go together, and that fits the stories of Shepard Fairey and Ed Wegman as well. You copy from someone else, and you have nothing of your own to add, so you hide your sources, and this sends you into a sort of spiral of lies. In which case, why do any work at all? In Fairey’s case, the work is all about promotion, not about the art itself. In Wegman’s case, the work all goes into lawsuits and backroom maneuvering, not into the statistics.

Once you’re hiding your sources, you might as well cut corners on the product, eh?

That was easy

This came in the email from Tom Kertscher:

Are you available this afternoon or Wednesday to talk about a fact-check article I’m doing on Gov. Scott Walker’s statement that Wisconsin is a “blue” state?

I’m aware, of course, that Wisconsin has voted for the Democratic presidential nominee in each election since 1988.

But I’d like to talk about whether there are other common ways that states are labeled as red or blue (or perhaps purple).

Tues and Wed have already passed, so it’s probably too late, but here’s my response: I would call Wisconsin a 50-50 or “purple” state, in that its vote split has been very close to the national average in recent presidential elections.

Aahhhhh, young people!

Amusingly statistically illiterate headline from Slate: “Apple Notices That Basically Half the Population Menstruates.”

Ummmm, let’s do a quick calculation: 50 – 12 = 38. If you assume the average woman lives to be 80, then the proportion of the population who is menstruating is approximately .52*38/80 = .247.

25% is hardly “basically half”!
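The back-of-the-envelope calculation above can be written out as a short sketch. It uses only the rough figures from the text (ages 12 to 50, an 80-year lifespan, a 52% female share) and implicitly assumes a flat age distribution, which is a simplification.

```python
# Rough inputs taken from the calculation above; all are approximations.
FEMALE_SHARE = 0.52  # fraction of the population that is female
START_AGE = 12       # approximate age at which menstruation begins
END_AGE = 50         # approximate age at which it ends
LIFESPAN = 80        # assumed average lifespan, with a flat age distribution

years_menstruating = END_AGE - START_AGE
share = FEMALE_SHARE * years_menstruating / LIFESPAN

print(f"years menstruating: {years_menstruating}")
print(f"share of population: {share:.3f}")
```

Changing the assumed ages or lifespan by a few years moves the answer by a few percentage points, but not anywhere near one half.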

But if you’re a young adult, I guess you don’t think so much about people who are under 12 or over 50.

I was similarly amused by the mistake of Beall and Tracy, authors of that now-famous ovulation-and-clothing study, who thought that peak fertility started 6 days after menstruation. If you’re young, you’ve probably been reminded by sex-ed classes that you can get pregnant at any time. It’s only when you get older that you learn about which are the most important days if you’re trying to get pregnant.