
“I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Florian Wickelmaier writes:

I’m writing to tell you about my experiences with another instance of “the difference between significant and not significant.”

In a lab course, I came across a paper by Costa et al. (Cognition 130 (2) (2014), 236-254). In several experiments, they compare the effects in two two-by-two tables by comparing the p-values, and not by a test of the interaction. A mistake, very much like the one you describe in the Gelman and Stern (2006) paper.

I felt that this should be corrected and, mainly because I had told my students that such an analysis is wrong, I submitted a comment to Cognition. This comment got rejected. The main argument of the editor seems to be that he is convinced the effect is real, so who needs a statistical test? I compiled the correspondence with Cognition in the attached document.

In the end, with the help of additional quotes from your paper, I persuaded the editor to at least have the authors write a corrigendum, in which they report a meta-analysis pooling all the data and find a significant interaction.

I think it is a partial success that this corrigendum is now published, so readers see that something is wrong with the original paper. On the other hand, I’m unhappy that it is impossible to check this new analysis since the raw data are not accessible. Moreover, this combined effect does not justify the experiment-by-experiment conclusions presented before.

I’d like to thank you very much for your paper. Without it, I’m afraid, my complaints would not have been heard. It seems people still struggle to understand the problem even when it is pointed out to them, and even in a major journal in psychology.

Lots of interesting things here. The editor wrote:

In the end, given that I’m convinced the effect is real, I’m just not sure that the community would benefit from this interchange.

To which Wickelmaier replied:

My comment is not about whether or not the effect exists (it may or may not), my comment is about the missing statistical test. I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.

I agree. Or, to put it another way, I have no objection to a journal publishing a claim without strong statistical evidence, if the result is convincing for some other reason. Just be open about what those reasons are. For example, the article could say, “We see the following result in our data . . . The comparison is not statistically significant at the 5% level but we still are convinced the result is real for the following reasons . . .”
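To make the point concrete, here is a minimal sketch of the kind of interaction test Wickelmaier is asking for, using made-up counts rather than anything from the Costa et al. paper: instead of noting that one two-by-two table gives p < .05 and the other does not, test the difference between the two effects directly.

```python
import numpy as np
from scipy import stats

def log_odds_ratio(table):
    """Log odds ratio and its standard error for a 2x2 table [[a, b], [c, d]]."""
    a, b, c, d = np.asarray(table, dtype=float).ravel()
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Made-up counts for two experiments (rows: condition, columns: outcome yes/no).
table1 = [[30, 20], [18, 32]]   # the effect is "significant" here on its own
table2 = [[26, 24], [20, 30]]   # and "not significant" here on its own

lor1, se1 = log_odds_ratio(table1)
lor2, se2 = log_odds_ratio(table2)

# Test the difference between the two effects (the interaction),
# rather than comparing one p-value with the other.
z = (lor1 - lor2) / np.sqrt(se1**2 + se2**2)
p_interaction = 2 * stats.norm.sf(abs(z))
print(f"log OR 1 = {lor1:.2f}, log OR 2 = {lor2:.2f}, z = {z:.2f}, p = {p_interaction:.2f}")
```

With these invented numbers, the first table clears the 5% threshold on its own and the second does not, yet the test of the difference between the two effects comes nowhere near significance, which is the Gelman and Stern point in miniature.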

There was indeed progress. The editor responded to Wickelmaier’s letter as follows:

You have convinced me that there’s a serious problem with Costa et al.’s analysis. But I also remain convinced by his subsequent analyses that he has a real effect. I think your suggestion to have him write an erratum (or a corrigendum) was an excellent one. . . .

This hits the nail on the head. It’s ok to publish weak evidence if it is interesting in some way. The problem comes when there is the implicit requirement that the evidence from each experiment be incontrovertible, which leads to all sorts of contortions when researchers try to show that their data prove their theory beyond a reasonable doubt.

As always, we must accept uncertainty and embrace variation.

Have weak data. But need to make decision. What to do?

Vlad Malik writes:

I just re-read your article “Of Beauty, Sex and Power”.

In my line of work (online analytics), low power is a recurring, existential problem. Do we act on this data or not? If not, why are we even in this business? That’s our daily struggle.

Low power seems to create a sort of paradox: some evidence is better than none, but the evidence is useless. I’m not sure which it is, and your article hints at the two sides of that conflict.

If you are studying small populations, for example, it might not be possible to collect enough data for a “good sample”. You could collect some. But is such a study worth doing? And is the outcome of such a study worth even a guess? Is too little data as good as no data at all?

You do suggest in your article that “if we had to guess” then the low-powered study might still provide guidance. Yet you also say “we have essentially learned nothing from this study” and later point to a high probability that the effect from such a study may actually be pointing in the wrong direction.

In your critique, you rely heavily on lit review. What if past information is not available or is not as directly relevant, so the expected effect size is vague or unknown (at least to the experimenter’s best knowledge)? In that case, it might not be obvious that the effect is inflated.

Can data collected under such conditions ever be actionable? And if such data is published, how does one preface it? “Use with caution” or “Ignore for now. More research needed”?

What if not acting carries an opportunity cost and we have to act? Do we use the data or ignore it and rely on other criteria? If we say “this weak data supports other indirect evidence”, we might be acting with confirmation bias if the data is not in fact at all reliable. What if the weak data contradicts other evidence? How much power is enough to be worth even considering?

My reply: Indeed, even if evidence from the data at hand is not convincing, that doesn’t mean we have to do nothing. I strongly oppose using a statistical significance threshold to make decisions. In general I have three recommendations:

1. Use prior information. For example, in that silly beauty-and-sex-ratio study, we have lots of prior information that any differences would have to be tiny. We know ahead of time that our prior information is much stronger than the data.

2. When it comes to decision making, much depends on costs and benefits. We have a chapter on decision analysis in Bayesian Data Analysis that illustrates with three examples.

3. As much as possible, use quantitative methods to combine different sources of information:
– Define the parameter or parameters you want to estimate.
– Frame your different sources of information as different estimates of this parameter, ideally with statistically independent errors.
– Get some assessment of the bias and variance of each of your separate estimates.
– Then adjust for bias and get a variance-weighted estimate.
The above is the simplest scenario, where all the estimates are estimating the same parameter. In real life I suppose you’re more likely to be in an “8 schools” or meta-analysis situation, where your different estimates are estimating different things. And then you’ll want to fit a hierarchical model. Which is actually not so difficult. I suppose someone (maybe me) should write an article on how to do this, for a non-technical audience.
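For the simplest scenario described above, here is a minimal sketch with entirely hypothetical numbers: adjust each estimate for its assessed bias, then combine by inverse-variance weighting. The beauty-and-sex-ratio situation in point 1 is the two-source special case in which the “prior information” source has a far smaller standard error than the data, so the combined estimate sits essentially on top of the prior. The hierarchical, 8-schools version would replace the last few lines with a multilevel model, fit for example in Stan.

```python
import numpy as np

# Hypothetical independent estimates of the same parameter,
# each with an assessed bias and standard error.
estimates = np.array([0.30, 0.10, 0.45])   # raw estimates from three sources
biases    = np.array([0.05, 0.00, 0.20])   # assessed bias of each source
ses       = np.array([0.10, 0.25, 0.15])   # assessed standard error of each

# Step 1: adjust each estimate for its assessed bias.
adjusted = estimates - biases

# Step 2: weight by precision (inverse variance) and combine.
weights = 1 / ses**2
combined = np.sum(weights * adjusted) / np.sum(weights)
combined_se = np.sqrt(1 / np.sum(weights))

print(f"combined estimate = {combined:.3f} +/- {combined_se:.3f}")
```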

Anyway, the point is, when you do things this way, concepts of “power” aren’t so directly relevant. You just combine your information as best you can.

On deck this week

Mon: Have weak data. But need to make decision. What to do?

Tues: “I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Wed: Optimistic or pessimistic priors

Thurs: Draw your own graph!

Fri: Low-power pose

Sat: Annals of Spam

Sun: The Final Bug, or, Please please please please please work this time!

Erdos bio for kids

Chris Gittins recommends the book, “The Boy Who Loved Math: The Improbable Life of Paul Erdos,” by Deborah Heiligman. Gittins reports:

We read it with our soon-to-be-first-grader this evening. She liked it and so did we. I knew a little about Erdos but the book probably quadrupled my knowledge. Thought it might be of interest to readers of your blog who have little kids – and perhaps even to those who don’t.

I haven’t read any books about Erdos myself, as I’ve always been a bit creeped out by the Erdos thing, maybe because I don’t like how it reinforces the idea of mathematicians as weirdos. But I’ll pass along the recommendation. A book called “The Boy Who Loved Math”—I would’ve loved that myself as a kid.

“The frequentist case against the significance test”

Richard Morey writes:

I suspect that, like me, many people didn’t get a whole lot of detail about Neyman’s objections to the significance test in their statistical education beyond “Neyman thought power is important.” Given the recent debate about significance testing, I have gone back to Neyman’s papers and tried to summarize, for the modern user of statistics, his main objections to the significance test, including the nice demonstration from his 1952 book that, without consideration of an alternative, you can match the distribution of all of your statistics *under the null* and get whatever answer you like. I thought it might be of interest for your readers.

“The frequentist case against the significance test” is in two parts:
Part 1 (epistemology)
Part 2 (statistics).

I haven’t read this in detail but I’m sympathetic to the general point. I think that frequentist analysis (that is, figuring out the average properties of statistical methods, averaging over some model of the data and underlying parameters) can be valuable, but I also think that the classical framework of hypothesis testing and confidence intervals doesn’t really work—I think these ideas represent too crude a way to jump into inference and decision making.

Stan users meetup in Cambridge, MA on 9/22

There’s a new Stan users meetup group in Boston / Camberville. The first meeting will be on Tuesday, 9/22, at 6 pm in Cambridge.

If you’re a seasoned Stan user, just starting out with Stan, or hearing about Stan for the first time, feel free to join in. At least a couple of the Stan core developers will be around to answer questions.


Sign up here: Stan Users – Boston/Camberville

Thanks to Lizzie Wolkovich for organizing the meetup! And thanks to RStudio for providing food and drinks for the meetup.


P.S. Dustin has some Stan stickers. Go find Dustin at the meetup if you want one.
P.P.S. If you want to organize a meetup in your neighborhood, it’s not difficult. Let me know and I’ll provide as much support as I can.



Leonid Schneider writes:

I am a cell biologist turned science journalist after 13 years in academia. Despite my many years of experience as a scientist, I shamefully admit to being largely incompetent in statistics.

My request to you is as follows:

A soon-to-be-published psychology study set out to reproduce 100 randomly picked earlier studies and succeeded with only 39. This suggests that the psychology literature is only 39% reliable.

I tried to comment on the website Retraction Watch that one needs to take into account that the replication studies may not be reliable themselves, given the incidence of poor reproducibility in psychology. Ivan Oransky of Retraction Watch disagreed with me, citing Zeno’s paradox.

Basically, this was my comment:

39% of studies are claimed to have been reproduced by other psychologists. But: if this is indeed the ratio of reproducibility among psychological research, then only 39% of the reproduction studies are potentially reproducible themselves. My statistics skills are embarrassingly poor, but wouldn’t this mean only 15% of the originally tested psychology studies can indeed be considered as successfully reproduced?

This 15% figure is certainly wrong and a product of my incompetence, but is the 39% figure really solid unless we can fully trust the results of the reproducibility project? I wrote to Ivan next:

Only if a third study confirmed the same 39 reproduced studies could we trust the previous result. Otherwise, we know that at least 61% of what psychologists publish is wrong or fake, so if we ask these or other psychologists to perform any study, we should be aware of their reliability.

We never agreed. I still have the feeling the proper number must be lower than 39%, unless the psychologists who obtained it are 100% honest and reliable.

Thus, can I ask for your professional view about the true reliability of psychology studies and how to approximate it in this context?

Hey, I know about that replication project! Indeed, I was part of it and was originally going to be one of the many many authors of the paper. But I ended up not really doing anything but commenting in some of the email exchanges among the psychologists on this project, so it didn’t make sense for me to be included as an author.

Anyway, I think there are a few things going on. The first is that the probability that a study can be replicated will vary: Stroop will get replicated over and over, whereas ESP and power pose don’t have such a good shot. So you can’t simply multiply the probabilities. The second issue is that replicability depends on measurement error and sample size, and these will not necessarily be the same in a replication as in the original study.
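A toy simulation, with numbers invented purely for illustration, shows why the 39% can’t just be squared: if replication success depends on whether the underlying effect is real, then studies that have already replicated once are far more likely than average to replicate again.

```python
import numpy as np

rng = np.random.default_rng(1)
n_studies = 100_000

# Toy model: each original finding is either a real effect or a false positive.
prob_real = 0.40   # hypothetical share of real effects in the literature
power = 0.80       # chance a replication "succeeds" when the effect is real
alpha = 0.05       # chance it "succeeds" when there is no effect

real = rng.random(n_studies) < prob_real
rep1 = rng.random(n_studies) < np.where(real, power, alpha)  # first replication
rep2 = rng.random(n_studies) < np.where(real, power, alpha)  # second replication

print("first replication rate:", rep1.mean())                       # about 0.35
print("second-rep rate among first successes:", rep2[rep1].mean())  # about 0.73
print("naive squared rate:", rep1.mean() ** 2)                       # about 0.12
```

The particular numbers don’t matter; the point is that the overall replication rate is an average over a mix of solid and shaky findings, not a per-study probability that can be multiplied down the chain.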

But, really, the big thing is to move beyond the idea of a study as being true or false, or being replicated or non-replicated. I think the authors of this replication paper have been pretty careful to avoid any claims of the form, “the psychology literature is only 39% reliable.” Rather, it’s my impression that the purpose of this exercise is more to demonstrate that attempted replications can be done. The idea is to normalize the practice of replication, to try to move various subfields of psychology onto firmer foundations by providing guidelines for establishing findings with more confidence, following the debacles of embodied cognition and all the rest.

I’m speaking in Germany today!

[Screenshots of the conference program]

Right between the Mittagspause (lunch break) and the Tagungsabschluss (closing session), just how I like it.

It’s a methods conference for the German Psychological Society in Jena. Here’s my title and abstract:

Applied Bayesian Statistics

Bayesian methods allow the smooth combination of information from multiple sources and are associated with open acknowledgement of uncertainty. We discuss modern applied Bayesian perspectives on various topics, including hypothesis testing and model comparisons, noninformative and informative priors, hierarchical models, model checking, and inference using Stan.

I’m planning to do a few of my favorite examples (birthdays, beauty-and-sex-ratio, and World Cup) and then just yap a bit about the above topics, using the examples as a starting point. Maybe answer some questions if that’s how they do things there.

All will be in English; sorry!

And I’ll do it through a Google hangout. Saves on jet fuel.

P.S. It turned out they wanted my Statistical Crisis in Science talk so I gave a version of that. It went well, there were even a couple places in the talk where I was able to work in some jokes about German politics.

Medical decision making under uncertainty

Gur Huberman writes:

The following crossed my mind, following a recent panel discussion in which David Madigan participated on evidence-based medicine. The panel—especially John Ioannidis—sang the praises of clinical trials.

You may have nothing wise to say about it—or pose the question to your blog followers.

Suppose there’s a standard clinical procedure to address a certain problem. (Say, a particular surgery.)

A physician has an idea for an alternative which he conjectures will deliver better results. But of course he doesn’t know before he has tried it. And he cannot possibly be aware of all the possible costs to the patient (failures, complications, etc.). But then, with experience, this physician and others may improve the procedure in the future.

How should he go about suggesting the alternative procedure to a patient?

The question applies to pilot patients—the first ones—and, assuming that the procedure was successful on a handful of pilot patients (by what criteria?), the question applies to setting up a clinical trial.

My reply: One thing I’ve written on occasionally is that I’d like to see some formal decision analyses, balancing the costs and benefits to the existing patients of trying an experimental treatment, along with the larger costs and benefits to the population if the new treatment is deemed effective and used more generally. I’d think this sort of calculation would be essential in deciding things such as rules for when to approve new treatments. But I’ve never really seen it done. I’ve seen some decision analyses regarding screening for diseases (whether screening should be done, who should be screened, how often should screening be done, etc) but not on the question of when to approve a procedure, when to declare victory and say everyone should get it, when to declare defeat and not try it on people.
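Here is the flavor of calculation I have in mind, as a minimal sketch with entirely hypothetical numbers: weigh the expected cost borne by the trial patients against the expected benefit to the future patient population if the new procedure turns out to be better and is then adopted.

```python
# Hypothetical inputs for deciding whether to trial a new surgical procedure.
p_better = 0.3            # prior probability the new procedure is better
gain_if_better = 0.05     # expected gain per future patient (in QALYs) if it is
loss_if_worse = 0.03      # expected loss per trial patient if it is actually worse
n_trial = 200             # patients who would receive the experimental procedure
n_future = 50_000         # patients affected if the procedure is adopted widely
p_trial_detects = 0.8     # chance the trial correctly identifies a better procedure

# Expected cost to trial patients, incurred in the scenario where the new
# procedure is actually worse than the standard one.
expected_trial_cost = n_trial * (1 - p_better) * loss_if_worse

# Expected benefit to the population, realized only if the procedure is better
# and the trial detects it, so that practice actually changes.
expected_population_gain = n_future * p_better * p_trial_detects * gain_if_better

net = expected_population_gain - expected_trial_cost
print(f"expected cost to trial patients: {expected_trial_cost:.1f} QALYs")
print(f"expected gain to future patients: {expected_population_gain:.1f} QALYs")
print(f"expected net benefit of running the trial: {net:.1f} QALYs")
```

A real analysis would of course need honest priors and cost estimates, and would treat the trial’s sample size and approval rule as decision variables too, but even this crude version makes the tradeoff explicit.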

The aching desire for regular scientific breakthroughs


This post didn’t come out the way I planned.

Here’s what happened. I cruised over to the British Psychological Society Research Digest (formerly on our blogroll) and came across a press release entitled “Background positive music increases people’s willingness to do others harm.”

Uh oh, I thought. This sounds like one of those flaky studies, the sort of thing associated in recent years with Psychological Science and PPNAS.

But . . . the British Psychological Society, that’s a serious organization. And the paper isn’t even published in one of their own journals, so presumably they can give it a fair and unconflicted judgment.

At the same time, it would be hard to take the claims of the published paper at face value—we just know there are too many things that can go wrong in this sort of study.

So this seemed like a perfect example to use to take what might be considered a moderate stance, to say that this paper looked interesting, it’s not subject to obvious flaws, but for the reasons so eloquently explained by Nosek et al. in their “50 shades of gray” study, it really calls out for a preregistered replication.

So I went to the blog and opened a new post dated 25 Dec (yup, it’s Christmas here in blog-time) entitled, “Here’s a case for a preregistered replication.”

And I started to write the post, beginning by constructing a long quote from the British Psychological Society’s press release:

A new study published in the Psychology of Music takes this further by testing whether positive music increases people’s willingness to do bad things to others.

Naomi Ziv at The College of Management Academic Studies recruited 120 undergrad participants (24 men) to take part in what they thought was an investigation into the effects of background music on cognition. . . .

The key test came after the students had completed the underling task. With the music still playing in the background, the male researcher made the following request of the participants:

“There is another student who came especially to the college today to participate in the study, and she has to do it because she needs the credit to complete her course requirements. The thing is, I don’t feel like seeing her. Would you mind calling her for me and telling her that I’ve left and she can’t participate?”

A higher proportion of the students in the background music condition (65.6 per cent) than the no-music control condition (40 per cent) agreed to perform this task . . .

A second study was similar but this time the research assistant was female, she recruited 63 volunteers (31 men) in the student cafeteria . . . After the underling task, the female researcher made the following request:

“Could I ask you to do me a favor? There is a student from my class who missed the whole of the last semester because she was very sick. I promised her I would give her all the course material and summaries. She came here especially today to get them, but actually I don’t feel like giving them to her after all. Could you call her for me and tell her I didn’t come here?”

This time, 81.8 per cent of the students in the background music condition agreed to perform this request, compared with just 33 per cent of those in the control condition. The findings are all the more striking given that the researchers’ requests in both experiments were based on such thin justifications (e.g. “I don’t feel like giving them to her after all”).

Shoot, this is looking pretty bad. I clicked through to the published paper and it seems to have many of the characteristics of a classic “Psychological Science”-style study: small samples, a focus on interactions, multiple comparisons reported in the research paper and many other potential comparisons that could’ve been performed had the data pointed in those directions, comparisons between statistical significance and non-significance, and an overall too-strong level of assurance.

I could explain all the above points but at this point I’m getting a bit tired of explaining, so I’ll just point you to yesterday’s post.

And, to top it all off, when you look at the claims carefully, they don’t make a lot of sense. Or, as it says in the press release, “The findings are all the more striking.” “More striking” = surprising = implausible. Or, to put it another way, this sort of striking claim puts more of a burden on the data collection and analysis to be doing what the researchers claim is being done.

Also this: “no previous study has compared the effect of different musical pieces on a direct request implying harming a specific person.” OK, then.

When you think about it, even the headline claim seems backwards. Setting aside any skepticism you might feel about background music having any consistent effect at all, doesn’t it seem weird that “positive music increases people’s willingness to do others harm”? I’d think that positive music would, if anything, make people nicer!

And the reported effects are huge. Background music changing the frequency of a particular behavior from 33% to 80%? Even Michael LaCour didn’t claim to find effects that large.

As is unfortunately common in this sort of paper, the results from these tiny samples are presented as general truths; for example,

The results of Study 1 thus show that exposure to familiar, liked music leads to more compliance to a request implying harming a third person. . . .

Taken together, the results of the two studies clearly show that familiar and liked music leads to more compliance, even when the request presented implies harming a third person.

Story time!

Where are we going here?

OK, so I wrote most of the above material, except for the framing, as part of an intended future post on a solid study that I still wasn’t quite ready to believe, given that we’ve been burned so many times before by seemingly solid experimental findings.

But, as I wrote it, I realized that I don’t think this is a solid study at all. Sure, it was published in Psychology of Music, which I would assume is a serious journal, but it just as well could’ve appeared in a “tabloid” such as Psychological Science or PPNAS.

So where are we here? One more criticism of a pointless study in an obscure academic journal. What’s the point? If the combined efforts of Uri Simonsohn, E. J. Wagenmakers, Kate Button, Brian Nosek, and many others (including me!) can’t convince the editors of Psychological Science, the #1 journal in their field, to clean up its act regarding hype of noise, it’s gotta be pretty hopeless of me to expect or even care about changes in the publication policies of Psychology of Music.

So what’s the point? To me, this is all an interesting window into what we’ve called the hype cycle which encompasses not only researchers and their employers but also the British Psychological Society, which describes itself as “the representative body for psychology and psychologists in the UK” and also an entirely credulous article by Tom Jacobs in the magazine Pacific Standard.

I have particular sympathy for Jacobs here, as his news article is part of a series:

Findings is a daily column by Pacific Standard staff writer Tom Jacobs, who scours the psychological-research journals to discover new insights into human behavior, ranging from the origins of our political beliefs to the cultivation of creativity.

A daily column! 365 new psychology insights a year, huh? That’s a lot of pressure.

The problem with the hype cycle is not just with the hype

And this leads me to the real problem I see with the hype cycle. Actual hype doesn’t bother me so much. If an individual or organization hypes some dodgy claims, fine: They shouldn’t do it, but, given the incentives out there, it’s what we might expect. You or I might not think Steven Levitt is a “rogue economist,” but if he wants to call himself that, well, we have to take such claims in stride.

But what’s going on with the British Psychological Society, that in some way seems more troubling. I don’t think the author of that post was trying to promote or hype anything; rather, I expect it was a sincere, if overly trusting, presentation of what seemed on the surface to be solid science (p less than 0.05, published in a reputable journal, some plausible explanations in the accompanying prose). And similarly at Pacific Standard.

The hype cycle doesn’t even need conscious hype. All it needs is what John Tukey might call the aching desire for regular scientific breakthroughs.

You don’t have to be Karl Popper to know that scientific progress is inherently unpredictable, and you don’t need to be Benoit Mandelbrot to know that scientific breakthroughs, at whatever scale, do not occur on a regular schedule. But if you want to believe in routine breakthroughs, and you’re willing to not look too closely, you can find everything you need this week—every week—in psychological science.

And that is how the hype cycle continues, even without anyone trying to hype.

The disclaimer

OK, here we are, at that point in the blog post. Yes, some or all the claims in this paper could in fact represent true claims about the general population. And even if many or most of the claims are false, this work could still be valuable in motivating people to think harder about the psychology of music. I mean, sure, why not?

As always, the point is that the statistical evidence is not what is claimed, either in the published paper or the press release.

If someone wants, they can try a preregistered replication. But given that the authors themselves say that these results confound expectations, I don’t know that it’s worth the effort. It’s not where I’d spend my research dollars. In any case, as usual I am not intending to single out this particular publication or its author. There’s nothing especially wrong with it, compared to lots of other papers of its type. Indeed, what makes it worth writing about is its very ordinariness, that this paper represents business as usual in the world of quantitative research.

Those explanations

As always, we get stories which I can’t take seriously because they assume the truth of population statements which haven’t actually been demonstrated. For example:

Why should positive background music render us more willing to perform harmful acts? Ziv isn’t sure – she measured her participants’ mood in a questionnaire but found no differences between the music and control groups. She speculates that perhaps familiar, positive music fosters feelings of closeness among people through a shared emotional experience. “In the setting of the present studies,” she said, “measuring connectedness or liking to the experimenter would have been out of place, but it is possible that a social bond was created.”

Both the researcher and the publicist forgot the alternative explanation that maybe they are just observing variation in some small group that does not reflect any general patterns in the population. That is, maybe no explanation is necessary, just as we don’t actually need to crack open our physics books to explain why Daryl Bem managed to find some statistically significant interactions in his data.

The aching desire for regular scientific breakthroughs

Let me say it again, with reference to the paper by Ziv that got this all started. On one hand, sure, maybe it’s really true that “background positive music increases people’s willingness to do others harm,” even though the author herself writes that “a large number of studies examining the effects of music in various settings have suggested” the opposite.

But here’s the larger question. Why should we think that a little experiment on 200 college students provides convincing evidence overturning much of what we might expect about the effects of music? Sure, it’s possible—but just barely. What I’m objecting to here is the idea—encouraged, I fear, by lots and lots of statistics textbooks, including my own—that you can routinely learn eternal truths about human nature via these little tabletop experiments.

Yes, there are examples of small, clean paradigm-destroying studies, but they’re hardly routine, and I think it’s a disaster of both scientific practice and scientific communication that everyday noisy experiments are framed this way.

Discovery doesn’t generally come so easily.

This might seem to be a downbeat conclusion, but in many ways it’s an optimistic statement about the natural and social world. Imagine if the world as presented in “Psychological Science” papers were the real world. If so, we’d routinely be re-evaluating everything we thought we knew about human interactions. Decades of research on public opinion, smashed by a five-question survey on 100 or so Mechanical Turk participants. Centuries of physics overturned by a statistically significant p-value discovered by Daryl Bem. Hundreds of years of data on sex ratios of children, all needing to be reinterpreted because of a pattern some sociologist found in some old survey data. Etc.

What a horrible, capricious world that would be.

Luckily for us, as social scientists and as humans trying to understand the world, there is some regularity in how we act and how we interact, a regularity enforced by the laws of physics, the laws of biology, and by the underlying logic of human interactions as expressed in economics, political science, and so forth. There are not actually 365 world-shaking psychology findings each year, and the strategy of run-an-experiment-on-some-nearby-people-and-then-find-some-statistically-significant-comparisons-in-your-data is not a reliable way to discover eternal truths.

And I think it’s time for the Association for Psychological Science and the British Psychological Society to wake up, and to realize that their problem is not just with one bad study here and one bad study there, or even with misapplication of certain statistical methods, but with their larger paradigm, their implicit model for scientific discovery, which is terribly flawed.

And that’s why I wrote this post. I could care less about the effect of pleasant background music on people’s propensity to be mean. But I do care about how we do science.

Even though it’s published in a top psychology journal, she still doesn’t believe it


Nadia Hassan writes:

I wanted to ask you about this article.

Andrea Meltzer, James McNulty, Saul Miller, and Levi Baker, “A Psychophysiological Mechanism Underlying Women’s Weight-Management Goals: Women Desire and Strive for Greater Weight Loss Near Peak Fertility.” Personality and Social Psychology Bulletin (2015): 0146167215585726.

I [Hassan] find it kind of questionable. Fortunately, the authors use a within-subject sample, but it is 22 women. Effects in evolutionary biology are small. Women’s recall is not terribly accurate. Basically, to use the phrasing you have before, the authors are not necessarily wrong, but it seems as though the evidence is not as strong as they claim.

Here’s the abstract of the paper in question:

Three studies demonstrated that conception risk was associated with increased motivations to manage weight. Consistent with the rationale that this association is due to ovulatory processes, Studies 2 and 3 demonstrated that it was moderated by hormonal contraceptive (HC) use. Consistent with the rationale that this interactive effect should emerge when modern appearance-related concerns regarding weight are salient, Study 3 used a 14-day diary to demonstrate that the interactive effects of conception risk and HC use on daily motivations to restrict eating were further moderated by daily motivations to manage body attractiveness. Finally, providing evidence that this interactive effect has implications for real behavior, daily fluctuations in the desire to restrict eating predicted daily changes in women’s self-reported eating behavior. These findings may help reconcile prior inconsistencies regarding the implications of ovulatory processes by illustrating that such implications can depend on the salience of broader social norms.

Ummm, yeah, sure, whatever.

OK, let’s go thru the paper and see what we find:

This broader study consisted of 39 heterosexual women (the total number of participants was determined by the number of undergraduates who volunteered for this study during a time frame of one academic semester); however, 8 participants failed to respond correctly to quality-control items and 7 participants failed to complete both components of the within-person design and thus could not be used in the within-person analyses. Two additional participants were excluded from analyses: 1 who was over the age of 35 (because women over the age of 35 experience a significant decline in fecundability; Rothman et al., 2013) and 1 who reported a desire to lose an extreme amount of weight relative to the rest of the sample . . .

Fork. Fork. Fork.

We assessed self-esteem at each high- and low-fertility session using the Rosenberg Self-Esteem Scale (Rosenberg, 1965) and controlled for it in a supplemental analysis.

Fork. (The supplemental analysis could’ve been the main analysis.)

Within-person changes in ideal weight remained marginally negatively associated with conception risk . . . suggesting that changes in women’s current weight across their ovulatory cycle did not account for changes in women’s ideal weight across their ovulatory cycle.

The difference between “significant” and “not significant” is not itself statistically significant.

Notably, in this small sample of 22 women, self-esteem was not associated with within-person changes in conception risk . . .

“Not statistically significant” != “no effect.”

consistent with the idea that desired weight loss is associated with ovulation, only naturally cycling women reported wanting to weigh less near peak fertility.

The difference between “significant” and “not significant” is not itself statistically significant.

One recent study (Durante, Rae, & Griskevicius, 2013) demonstrates that ovulation had very different implications for women’s voting preferences depending on whether those women were single or in committed relationships.

Ha! Excessive credulity. If you believe that classic “power = .06” study, you’ll believe anything.

OK, I won’t go through the whole paper.

The point is: I agree with Hassan: this paper shows no strong evidence for anything.

Am I being unfair here?

At this point, you might say that I’m being unfair: Why single out these unfortunate researchers just because they happen to have the bad luck to work in a field with low research standards? And what would happen if I treated everybody’s papers with this level of skepticism?

This question comes up a lot, and I have several answers.

First, if you think this sort of evolutionary psychology is important, then you should want to get things right. It’s not enough to just say that evolution is true, therefore this is good stuff. To put it another way: it’s quite likely that, if you got enough data and measured carefully enough, the patterns in the general population could well be in the opposite direction (and, I would assume, much smaller) than what was claimed in the published paper. Does this matter? Do you want to get the direction of the effect right? Do you want to estimate the effect size within an order of magnitude? If the answer to these questions is Yes, then you should be concerned when shaky methods are being used.

Second, remember what happened when that Daryl Bem article on ESP came out? People said that the journal had to publish that paper because the statistical methods Bem used were standard in psychology research. Huh? There’s no good psychology being done anymore so we just have to fill up our top journals with unsubstantiated claims, presented as truths?? Sorry, but I think Personality and Social Psychology Bulletin can do better.

Third, should we care about forking paths and statistical significance and all that? I’d prefer not to. I’d prefer to see an analysis of all the data at once, using Bayesian methods to handle the multiple levels of variation. But if the claims are going to be based on p-values, then forking paths etc are a concern.

What, then?

Finally, the question will arise: What should these researchers do with this project, if not publish it in Personality and Social Psychology Bulletin? They worked hard, they gathered data; surely these data are of some value. They even did some within-person comparisons! It would be a shame to keep these data unpublished.

So here’s my recommendation: they should be able to publish this work in Personality and Social Psychology Bulletin. But it should be published in a way that is of maximum use to the research field (and, ultimately, to society):

– Post all the raw data. All of it.

– Tone down the dramatic claims. Remember Type S errors and Type M errors, and the garden of forking paths, and don’t take your p-values so seriously. (A small simulation of Type S and Type M errors appears after this list.)

– Present all the relevant comparisons; don’t just navigate through and report the results that are part of your story.

– Finally, theorize all you want. Just recognize that your theories are open-ended and can explain just about any pattern in data (just as Bem could explain whatever interaction happened to show up for him).
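Here is the simulation promised in the list above: a small sketch, with a made-up true effect and standard error standing in for a noisy study like this one, of how power, the Type S error rate (wrong sign among significant results), and the Type M exaggeration ratio behave.

```python
import numpy as np
from scipy import stats

def type_s_m(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=0):
    """Simulate power, the Type S error rate, and the exaggeration ratio (Type M)
    for a normally distributed estimate with the given true effect and standard error."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(true_effect, se, n_sims)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    significant = np.abs(estimates) > z_crit * se

    power = significant.mean()
    wrong_sign = np.sign(estimates[significant]) != np.sign(true_effect)
    type_s = wrong_sign.mean()                      # P(wrong sign | significant)
    exaggeration = np.abs(estimates[significant]).mean() / abs(true_effect)
    return power, type_s, exaggeration

# Made-up values: a small true effect measured very noisily.
power, type_s, type_m = type_s_m(true_effect=0.1, se=0.3)
print(f"power = {power:.2f}, Type S rate = {type_s:.2f}, exaggeration ratio = {type_m:.1f}")
```

With these invented numbers the power is around 6 percent, significant estimates overstate the true effect by roughly a factor of seven, and about one in six of them points in the wrong direction, which is the sense in which low power plus a significance filter is so dangerous.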

And finally, let me emphasize that I’m not saying I think the claims of Meltzer et al. are false, I just think they’re not providing strong empirical evidence for their theory. Remember 50 shades of gray? That can happen to you too.

Why aren’t people sharing their data and code?

Joe Mienko writes:

I made the following post online a couple of hours ago.

It is still relatively uncommon for social scientists to share data or code as a part of the peer review process. I feel that this practice runs contrary to notions of replicability and reproducibility and have a desire to voice opposition to instances in which manuscripts are submitted without data and code. Where, however, is such opposition appropriately expressed? I am specifically curious about whether or not it is appropriate to refuse to review an article in the absence of code or data.

Based on my read of your blog, this seems like something you may be interested in writing about.

The original post is here:

Here’s my reply. I see a few things getting in the way of people sharing data and code:

1. It takes effort to clean up data and code to put them in a format where you can share them. I’m engaged in a replication project right now (replicating one of my own recent papers), and it’s taken a lot of work to set up a clean replication. So that’s a lot of it right now: we’re all busy and we’re all lazy, and setting up datasets for other people is not generally a high priority (although of course it would be if it were required by top journals or by promotion review committees).

2. The IRB is always getting in the way, making you jump through a bunch of hoops if you want to share any data. Much simpler just to lock up or even throw out the data, then you don’t have to worry about the busybodies in the IRB telling you that you’re not in compliance in some way.

3. Data collection can be expensive in time, money, and effort, and so you might not want to give away your data until you’ve squeezed all you can out of it. Sometimes there’s direct commercial competition, other times it’s just the desire in science to publish a new discovery first.

4. This next one is horrible but it does happen: Somebody criticizes your published work and so now you don’t want to share your data because they might go through it and find mistakes in your analysis.

5. You’re John Lott and you lost all traces of your data, no survey forms, no computer files, no data at all to be found. So nothing to share.

6. You’re Diederik Stapel or Michael LaCour and you never did the study in the first place. You can’t share your data because the data never existed. And you wouldn’t want to share the code, as it would just be a record of your cheating.

Review of The Martian


I actually read this a couple months ago after Bob recommended it to me. I don’t know why I did this, given that I hated the last book Bob recommended to me, but in this case I made a good choice. The Martian was excellent and was indeed hard to set down.

Recently I’ve been seeing ads for the movie so I thought this was the right time to post a review. Don’t worry, no spoilers here.

I have lots of positive things to say but they’d pretty much involve spoilers of one sort or another so you’ll just have to take my word for it that I liked it.

On the negative side: I have only two criticisms of the book. The first is that the characters have no personality at all. OK, the main character has a little bit of personality—not a lot, but enough to get by. But the other characters: no, not really. That’s fine, the book is wonderful as it is, and doesn’t need any more characterization to do what it does, but I think it would’ve been better had the author not even tried. As it is, there are about 10 minor characters whom it’s hard to keep straight—they’re all different flavors of hardworking idealists—and I think it would’ve worked better to not even try to differentiate them. As it is, it’s a mess trying to keep track of who has what name and who does what job.

My more serious criticism concerns the ending. Again, no spoilers, and the ending is not terrible—at a technical level it’s somewhat satisfying (I’m not enough of a physicist to say more than that), but at the level of construction of a story arc, it didn’t really work for me.

Here’s what I think of the ending. The Martian is structured as a series of challenges: one at a time, there is a difficult or seemingly insurmountable problem that the character or characters solve, or try to solve, in some way. A lot of the fun comes when the solution of problem A leads to problem B later on. It’s an excellent metaphor for life (although not stated that way in the book; one of the strengths of The Martian is that the whole thing is played straight, so that the reader can draw the larger conclusions for him or herself).

OK, fine. So what I think is that Andy Weir, the author of The Martian, should’ve considered the ending of the book to be a challenge, not for his astronaut hero, but for himself: how to end the book in a satisfying way? It’s a challenge. A pure “win” for the hero would just feel too easy, but just leaving him on Mars or having him float off into space on his own, that’s just cheap pathos. And, given the structure of the book, an indeterminate ending would just be a cheat.

So how to do it? How to make an ending that works, on dramatic terms? I don’t know. I’m no novelist. All I do know is that, for me, the ending that Weir chose didn’t do the job. And I conjecture that had Weir framed it to himself as a problem to be solved with ingenuity, maybe he could’ve done it. It’s not easy—the great Tom Wolfe had problems with endings too—but it’s my impression that Weir would be up for the job, he seems pretty resourceful.

Something to look forward to in his next book, I suppose.

On deck this week

Mon: Review of The Martian

Tues: Even though it’s published in a top psychology journal, she still doesn’t believe it

Wed: Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Thurs: Medical decision making under uncertainty

Fri: Unreplicable

Sat: “The frequentist case against the significance test”

Sun: Erdos bio for kids

War, Numbers and Human Losses

That’s the title of Mike Spagat’s new blog.

In his most recent post, Spagat disputes the claim that “at least 240,000 Syrians have died violently since the civil war flared up four years ago.”

I am not an expert in this area so I offer no judgment on these particular numbers, but in any case I think that the sort of open discussion offered by Spagat is useful.


Reflecting on the recent psychology replication study (see also here), journalist Megan McArdle writes an excellent column on why we fall for bogus research:

The problem is not individual research papers, or even the field of psychology. It’s the way that academic culture filters papers, and the way that the larger society gets their results. . . .

Journalists . . . easily fall into the habit (and I’m sure an enterprising reader can come up with at least one example on my part), of treating studies not as a potentially interesting result from a single and usually small group of subjects, but as a True Fact About the World. Many bad articles get written using the words “studies show,” in which some speculative finding is blown up into an incontrovertible certainty.

I’d just replace “Journalists” by “Journalists and researchers” in the above paragraph. And then there are the P.R. excesses coming from scientific journals and universities. Researchers are, unfortunately, active participants in the exaggeration process.

McArdle continues:

Psychology studies also suffer from a certain limitation of the study population. Journalists who find themselves tempted to write “studies show that people …” should try replacing that phrase with “studies show that small groups of affluent psychology majors …” and see if they still want to write the article.

Indeed. Instead of saying “men’s upper-body strength,” try saying “college students with fat arms,” and see how that sounds!

More from McArdle:

We reward people not for digging into something interesting and emerging with great questions and fresh uncertainty, but for coming away from their investigation with an outlier — something really extraordinary and unusual. When we do that, we’re selecting for stories that are too frequently, well, incredible. This is true of academics, who get rewarded with plum jobs not for building well-designed studies that offer messy and hard-to-interpret results, but for generating interesting findings.

Likewise, journalists are not rewarded for writing stories that say “Gee, everything’s complicated, it’s hard to tell what’s true, and I don’t really have a clear narrative with heroes and villains.” Readers like a neat package with a clear villain and a hero, or at least clear science that can tell them what to do. How do you get that story? That’s right, by picking out the outliers. Effectively, academia selects for outliers, and then we select for the outliers among the outliers, and then everyone’s surprised that so many “facts” about diet and human psychology turn out to be overstated, or just plain wrong. . . .

Because a big part of learning is the null results, the “maybe but maybe not,” and the “Yeah, I’m not sure either, but this doesn’t look quite right.”

Yup. None of this will be new to regular readers of this blog, but it’s good to see it explained so clearly from a journalist’s perspective.

Why is this double-y-axis graph not so bad?

Usually I (and other statisticians who think a lot about graphics) can’t stand this sort of graph that overloads the y-axis:


But this example from Isabel Scott and Nicholas Pound actually isn’t so bad at all! The left axis should have a lower bound at 0—it’s not possible for conception risk to be negative—but, other than that, the graph works well.

What’s usually the problem, then? I think the usual problem with double-y-axis graphs is that attention is drawn to the point at which the lines cross.

Here’s an example. I was searching the blog for double-y-axis graphs but couldn’t easily find any, so I googled and came across this:


Forget the context and the details—I just picked it out to have a quick example. The point is, when the y-axes are different, the lines could cross anywhere—or they don’t need to cross at all. Also you can make the graph look like whatever you want by scaling the axes.
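To see how arbitrary the crossing point is, here is a minimal matplotlib sketch with made-up numbers: the same two series are plotted twice, and simply changing the range of the right-hand axis moves where the lines appear to cross.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y1 = 2 + 0.5 * x     # first series, left axis
y2 = 100 - 3 * x     # second series, right axis, on a different scale

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, right_limits in zip(axes, [(0, 120), (60, 100)]):
    twin = ax.twinx()
    ax.plot(x, y1, color="tab:blue")
    twin.plot(x, y2, color="tab:red")
    twin.set_ylim(*right_limits)  # rescaling the right axis moves the apparent crossing
    ax.set_title(f"right axis: {right_limits}")
plt.tight_layout()
plt.show()
```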

The top graph above works because the message is that conception risk varies during the monthly cycle while political conservatism doesn’t. It’s still a bit of a cheat—the scale for conception risk just covers the data while for conservatism they use the full 1-6 scale—but, overall, they still get their message across.

Being polite vs. saying what we really think


We recently discussed an article by Isabel Scott and Nicholas Pound entitled, “Menstrual Cycle Phase Does Not Predict Political Conservatism,” in which Scott and Pound definitively shot down some research that was so ridiculous it never even deserved the dignity of being shot down. The trouble is, the original article, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” had been published in Psychological Science, a journal which is psychology’s version of Science and Nature and PPNAS, a “tabloid” that goes for short sloppy papers with headline-worthy claims. So, Scott and Pound went to the trouble of trying and failing to replicate; as they reported:

We found no evidence of a relationship between estimated cyclical fertility changes and conservatism, and no evidence of an interaction between relationship status and cyclical fertility in determining political attitudes.

The thing that bugged me was when Scott and Pound wrote:

Our results are therefore difficult to reconcile with those of Durante et al, particularly since we attempted the analyses using a range of approaches and exclusion criteria, including tests similar to those used by Durante et al, and our results were similar under all of them.

As I wrote earlier: Huh? Why “difficult to reconcile”? The reconciliation seems obvious to me: There’s no evidence of anything going on here. Durante et al. had a small noisy dataset and went all garden-of-forking-paths on it. And they found a statistically significant comparison in one of their interactions. No news here.

Scott and Pound continued:

Lack of statistical power does not seem a likely explanation for the discrepancy between our results and those reported in Durante et al, since even after the most restrictive exclusion criteria were applied, we retained a sample large enough to detect a moderate effect . . .

Again, I feel like they’re missing the elephant in the room: “Lack of statistical power” is exactly what was going on with the original study by Durante et al.: They were estimating tiny effects in a very noisy way. It was a kangaroo situation.

Anyway, I suspect (but am not sure) that Scott and Pound agree with me that Durante et al. were chasing noise and that there’s no problem at all in reconciling what had been found in that earlier study (bits of statistical significance in the garden of forking paths) with what was found in the replication (no pattern).

And I suspect (but am not sure) that Scott and Pound wrote the way they did in order to make a sort of minimalist argument. Instead of saying that the earlier study is consistent with pure noise, and their replication is also consistent with that, they make a weaker, innocent-sounding statement that the two studies are “difficult to reconcile,” leaving the rest of us to read between the lines.

And so here’s my question: When is it appropriate to make a minimalist argument, and when is it appropriate to say what you really think?

My earlier post had a comment from Mb, who wrote:

By not being able to explain the discrepancy with problems of the replication study or “theoretically interesting” measurement differences, they are showing that the non-replication is likely due to low power etc of the original study. It is a rhetorical device to convince those skeptical of the replication.

I replied that this makes sense. I just wonder when such rhetorical devices are a good idea. The question is whether it makes sense to say what you really think, or whether it’s better to understate to make a more bulletproof argument. This has come up occasionally in blog comments: I’ll say XYZ and a comment will say I should’ve just said XY or even just X because that would make my case stronger. My usual reply is that I’m not trying to make a case, I’m just trying to share my understanding of the problem. But I know that’s not the only game.

P.S. Just to clarify: I think strategic communication and honesty are both valid approaches. I’m not saying my no-filter approach is best, nor do I think it’s correct to say that savviness is always the right way to go either. It’s probably good that there are people like me who speak our minds, and people like Scott and Pound who are more careful. (Or of course maybe Scott and Pound don’t agree with me on this at all; I’m just imputing to them my attitude on the scientific questions here.)

Let’s apply for some of that sweet, sweet National Sanitation Foundation funding


Paul Alper pointed me to this news article about where the bacteria and fungi hang out on airplanes. This is a topic that doesn’t interest me at all, but then I noticed this, at the very end of the article:

Note: A previous version of this article cited the National Science Foundation rather than the National Sanitation Foundation. The post has been updated.

That’s just beautiful. We should definitely try to get a National Sanitation Foundation grant for Stan. I’m sure there are many trash-related applications for which we could make a real difference.

Meet Teletherm, the hot new climate change statistic!


Peter Dodds, Lewis Mitchell, Andrew Reagan, and Christopher Danforth write:

We introduce, formalize, and explore what we believe are fundamental climatological and seasonal markers: the Summer and Winter Teletherm—the on-average hottest and coldest days of the year. We measure the Teletherms using 25 and 50 year averaging windows for 1218 stations in the contiguous United States and find strong, sometimes dramatic, shifts in Teletherm date, temperature, and extent. For climate change, Teletherm dynamics have profound implications for ecological, agricultural, and social systems, and we observe clear regional differences, as well as accelerations and reversals in the Teletherms.

Of course, the hottest and coldest days of the year are not themselves so important, but I think the idea is to have a convenient one-number summary for each season that allows us to track changes over time.
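As a concrete reading of that definition (my own sketch, using simulated temperatures in place of real station records): average each calendar day over a multi-decade window and take the hottest and coldest average days.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily mean temperatures for a 50-year window (50 years x 365 days),
# standing in for a real station record.
days = np.arange(365)
seasonal = 10 - 22 * np.cos(2 * np.pi * (days - 15) / 365)    # coldest mid-January, warmest mid-July
temps = seasonal[None, :] + rng.normal(0, 3, size=(50, 365))  # year-to-year noise

# Teletherms: the on-average hottest and coldest days over the averaging window.
day_means = temps.mean(axis=0)
summer_teletherm = int(np.argmax(day_means))   # day-of-year index of the hottest average day
winter_teletherm = int(np.argmin(day_means))   # day-of-year index of the coldest average day

print("summer teletherm (day of year):", summer_teletherm)
print("winter teletherm (day of year):", winter_teletherm)
```

Shifts in these argmax and argmin days across successive 25- or 50-year windows are what Dodds et al. are tracking.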

One reason this is important is because one effect of climate change is to mess up synchrony across interacting species (for example, flowers and bugs, if you’ll forgive my lack of specific biological knowledge).

Dodds et al. put a lot of work into this project. For some reason they don’t want to display simple time series plots of how the teletherms move forward and backward over the decades, but they do have this sort of cool plot showing geographic variation in time trends.

[Figure from Dodds et al. showing geographic variation in teletherm time trends]

You’ll have to read their article to understand how to interpret this graph. But that’s ok, not every graph can be completely self-contained.

They also have lots more graphs and discussion here.