
The problems are everywhere, once you know to look


Josh Miller writes:

My friend and colleague Joachim Vosgerau (at Bocconi) sent me some papers from PNAS and they are right in your wheelhouse. Higher social class people behave more unethically.

I can certainly vouch for the jerky behavior of people that drive BMWs and Mercedes in Italy (similar to Study 1&2 in Piff et al. but their graph doesn’t include error bars). This seems to be true all around the world, which is strange, because personally I feel more comfortable asserting myself if I am driving a clunker.

The fact that they write P<.05 in study 1&2, and then write P<.04 in study 3, as if that is stronger evidence, shows that somebody didn’t get the memo. Is this just a bad habit, or does it signal unreported studies, forking paths, and other shenanigans?

The two papers are “High economic inequality leads higher-income individuals to be less generous” (PPNAS 2015), by Stephane Cote, Julian House, and Robb Willer, and “Higher social class predicts increased unethical behavior” (PPNAS 2012), by Paul Piff, Daniel Stancato, Stephane Cote, Rodolfo Mendoza-Denton, and Dacher Keltner.

Without looking at the papers in detail, I am indeed suspicious of all evidence presented there based on p-values. You also have to watch out for comparison-between-significant-and-non-significant statements such as, “Higher-income participants were less generous than lower-income participants when inequality was portrayed as relatively high, but there was no association between income and generosity when inequality was portrayed as relatively low.” All in all, these papers are following the standard paradigm of grabbing some data and looking for statistically significant comparisons—with all the problems that entails. Again, without commenting on the specific claims in these publications, I think they’re using methods that set them up to find and promulgate spurious results, that is, patterns that occur in their particular datasets but which don’t reflect the general population.

The more recent paper concludes in a blaze of interactions. From a substantive point of view, I’m supportive of this effort: As I’ve said many times, interactions are important, and we should expect large effects to have large interactions. But from a statistical perspective, I’m wary of methods that search for interactions by sifting through data and pulling out statistically significant comparisons. What you end up with is one particular story that fits various aspects of the data, without a recognition that many other, completely different stories would also fit. I guess this is a good candidate for a preregistered replication, but I wouldn’t be so optimistic about the results.


Colorless green ideas tweet furiously


Nadia Hassan writes:

Justin Wolfers and Nate Silver got into a colorful fight on twitter. Nate has 2 forecasts. Nate is doing a polls-only forecast in addition to a “traditional” one that discounts poll leads and builds in fundamentals. Wolfers noted that the 538 polls-only model had Clinton at a higher chance of winning on August 9th than today, even though arguably a lot of the uncertainty has gone. He opined that the 538 model was broken. Nate sees Wolfers critiques as lazy, esp. with the potential for overfitting in a low-n environment.

I was wondering what you thought about this round of “the nerd fight”. One potential issue with polling error, at the moment, is the nonresponse bias. YouGov is showing a slightly narrower Clinton lead than other national polls are and Ben Lauderdale reported during a big story that Republicans were responding less. I am not sure if that is still the case. But, it is a potential issue.

My reply:

I really really don’t like twitter as a means of communication in this way. It encourages snappiness and discourages engagement with details. Wolfers and Silver are thoughtful and knowledgeable, but I don’t think twitter brings out the best in anyone.

On your final question, yes, given the bad news for Republicans in recent weeks, I’d guess that Republicans are less likely to respond to polls right now, hence I think that if the election were held today, Clinton would not do so well as it appears from the polls.

Full disclosure: Some of my work is supported by YouGov.

How not to analyze noisy data: A case study


I was reading Jenny Davidson’s blog and came upon this note on an autobiography of the eccentric (but aren’t we all?) biologist Robert Trivers. This motivated me, not to read Trivers’s book, but to do some googling which led me to this paper from Plos-One, “Revisiting a sample of U.S. billionaires: How sample selection and timing of maternal condition influence findings on the Trivers-Willard effect.”

This paper is really bad. It has a bunch of fatal statistical errors.

The paper is not on a particularly important topic, it seems to have had little or no scientific influence or media coverage, and it was published in a non-prestigious journal.

So this post is not about casting doubt on some Ted talk or whatever.

Rather, consider this as a case study in statistical errors. For this purpose, perhaps it’s a good thing that the paper in question is obscure. Statistical errors occur all over the place—indeed it is reasonable to suppose they are more common in obscure work.

It just happens that this particular paper is on a topic with which I’m already familiar, so it’s particularly easy for me to spot the errors. When you read a bad paper on a familiar topic, the errors just pop right out; it’s as if you were wearing 3-D glasses.

The paper

Here’s the abstract:

Based on evolutionary theory, Trivers & Willard (TW) predicted the existence of mechanisms that lead parents with high levels of resources to bias offspring sex composition to favor sons and parents with low levels of resources to favor daughters. This hypothesis has been tested in samples of wealthy individuals but with mixed results. Here, I argue that both sample selection due to a high number of missing cases and a lacking specification of the timing of wealth accumulation contribute to this equivocal pattern. This study improves on both issues: First, analyses are based on a data set of U.S. billionaires with near-complete information on the sex of offspring. Second, subgroups of billionaires are distinguished according to the timing when they acquired their wealth. Informed by recent insights on the timing of a potential TW effect in animal studies, I state two hypotheses. First, billionaires have a higher share of male offspring than the general population. Second, this effect is larger for heirs and heiresses who are wealthy at the time of conception of all of their children than for self-made billionaires who acquired their wealth during their adult lives, that is, after some or all of their children have already been conceived. Results do not support the first hypothesis for all subgroups of billionaires. But for males, results are weakly consistent with the second hypothesis: Heirs but not self-made billionaires have a higher share of male offspring than the U.S. population. Heiresses, on the other hand, have a much lower share of male offspring than the U.S. average. This hints to a possible interplay of at least two mechanisms affecting sex composition. Implications for future research that would allow disentangling the distinct mechanisms are discussed.

Set aside the theoretical problems with this work, as I’ll just be talking about the statistics.

Dead on arrival

The biggest error in the paper, the error that makes the whole thing worthless, is that the noise is so much larger than the signal.

Let’s do the math. N=1165 children in the study. The comparison with the least uncertainty would be simply to take the raw proportion of girls in this sample and compare to the known proportion in the general population. The standard error is simply .5/sqrt(1165) = .015, that’s 1.5 percentage points.

Now, effect sizes. The difference in proportion girl births, comparing billionaires to the general population, has to be much much smaller than this. If you compare billionaires to other white people, the difference will be even smaller. It’s really hard to imagine any “billionaire difference” being anywhere near the difference in proportion girl births comparing white and black Americans, which is around .005.

Suppose the true effect size is .002. (I actually think it’s less.) Even if it’s as large as .002, if the standard error is 0.015, that’s basically impossible to detect. We’re in kangaroo territory.

Let’s do the design analysis:

> retrodesign(.002, .015)
$power
[1] 0.052

$typeS
[1] 0.35

$exaggeration
[1] 17.6

That’s right, if the true effect size is .002, this study has a power of 5.2% (that is, a 5.2% chance of getting a statistically significant p-value), a type S error rate of 35% (that is, a 35% chance that an estimate, if statistically significant, would be in the wrong direction), and an exaggeration factor of 17 (that is, an estimate, if statistically significant, would be on average 17 times larger than the true effect).
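For readers who want to reproduce these numbers, here is a minimal sketch of a retrodesign-type function along the lines of Gelman and Carlin (2014); the exact code behind the output above may differ in its details:

retrodesign <- function(A, s, alpha = .05, df = Inf, n.sims = 10000) {
  # A = hypothesized true effect size, s = standard error of the estimate
  z <- qt(1 - alpha/2, df)
  p.hi <- 1 - pt(z - A/s, df)    # chance of a significant estimate in the positive direction
  p.lo <- pt(-z - A/s, df)       # chance of a significant estimate in the negative direction
  power <- p.hi + p.lo
  typeS <- p.lo/power            # probability the sign is wrong, given statistical significance
  estimate <- A + s*rt(n.sims, df)
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate[significant]))/A   # average overestimation, given significance
  list(power = power, typeS = typeS, exaggeration = exaggeration)
}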

Or what if you wanted to make the bold, bold claim that billionaires differ from the general population in their sex ratio by the same rate at which whites differ from blacks. Run the program, and you still get a power of only 6%, a type S error rate of 17%, and an exaggeration factor of 7.

In short, such a study is hopeless no matter what. It’s dead on arrival. It’s a wild throw of the dice to even attain statistical significance, but it’s worse than that, as any statistically significant estimate would be essentially noise.

Deader on arrival

But what about the other analyses in the paper, for example the comparisons between subgroups of billionaires? For these comparisons, the statistics are even worse!

Let’s consider a best-case scenario, comparing two groups that are (essentially) equal-sized: 582 babies in one group, 583 in the other. The difference in proportion girls in these groups will have standard error sqrt(.5^2/582 + .5^2/583) = .029, that’s twice the standard error from above. (That’s the general pattern, that comparisons or interactions have twice the standard error of averages or main effects: you get a factor of sqrt(2) from the halving of the within-group sample size and another factor of sqrt(2) from the differencing.)
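Spelling out the two standard-error calculations above in R:

se.average <- .5/sqrt(1165)                 # about .015, the standard error from the previous section
se.comparison <- sqrt(.5^2/582 + .5^2/583)  # about .029, roughly twice as large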

So this aspect of the study is even more useless. Again, let’s consider a hypothetical effect size of .002:

> retrodesign(.002, .029)
$power
[1] 0.05

$typeS
[1] 0.42

$exaggeration
[1] 33.9

Power of 5%, type S error rate of 42%, exaggeration factor of 33. You can’t get much noisier than that.

Researcher degrees of freedom

The paper has other errors, of course. It almost has to, given that statistical significance was found under such inauspicious conditions.

The most obvious problem is multiple comparisons: the researcher has many degrees of freedom in deciding what to look at, hence he can keep looking and looking until he finds something statistically significant. In the paper at hand, we see:

– Billionaires compared to the general population,
– Heirs compared to self-made billionaires,
– Comparison just of male billionaires,
– Comparison of heiresses to the general population,
– Comparison of heiresses to self-made billionaires,
– Comparison of heiresses to heirs.

The author does a multiple comparisons correction and finds no significance, which is kinda funny because then he reports the differences as if they reflect real patterns in the population.

In any case, the multiple comparisons correction understates the problem because (a) there are lots of other comparisons floating around in the data that the researcher could’ve noticed and surely would’ve reported had they been notable, and (b) there are a bunch more researcher degrees of freedom in the data-exclusion and data-classification rules (for example, the division of heiresses into those who inherited from parents and those who inherited from spouses).

Again, given the variation and sample size in the context of possible effect sizes, the study had no chance of succeeding in any case, so I don’t don’t don’t recommend anyone try a preregistered replication. The point of the above discussion of forking paths and degrees of freedom is just to explain how the researcher could’ve found statistical significance out of what is essentially pure noise.

Interpretation of results

Finally, the paper at hand also demonstrates several standard mistakes associated with p-values:
– The use of one-sided tests in a context where departures in either direction would be notable,
– The reporting of a p-value near .05 as “almost statistical significance,”
– A “robustness check” that is almost identical to the original analysis (in this case, a logistic regression instead of a comparison of proportions),
– Selected non-significant differences interpreted based on their signs as being “consistent with the stated hypothesis,”
– An observed proportion being reported as “considerably lower than that of the general population,” without noting that this difference is entirely explainable by chance,
– A non-significant difference being taken as evidence of the null hypothesis (“Given that this difference is not statistically significant, it speaks against the first hypothesis that billionaires have a higher percentage of male offspring than the general population.”),
– And, of course, comparisons between significance and non-significance.

Followed by tons of storytelling. It’s tea-leaf-reading without the tea.

I have no desire to pick on this particular researcher—that’s why I have not mentioned his name in this post. The name is no secret (you can find it by just clicking on the link above that has the research article), but I want to focus on the very very common statistical errors rather than on which faceless scientist happened to be making them that day.


These are real errors, and they’re avoidable errors. But you’ll make them too, over and over again, if you do statistics using the grab-some-data-and-look-for-statistical-significance approach.

The point of this post is not to pile on and criticize an obscure paper in an obscure journal by an author we’ve never heard of. The point is to help you and your colleagues avoid these same errors in your own work, errors you might well make in higher-stakes situations where you’re under pressure to find results and where you might not see the forest for the trees.

The paper discussed above is almost a laboratory setting of statistical misunderstanding, where a researcher was able to use standard statistical tools to wrap himself in a web of confusion. Again, it’s nothing personal—statistics is hard, and I’m sorry to say that we in the statistical profession often sell our methods as a way of distilling certainty from noise.

The author of this paper inadvertently made a whole bunch of errors all in one place. As discussed above, it is no coincidence that these errors occurred together. When you start with hopelessly noisy data and you add to this the practical necessity to obtain statistical significance, all hell will break loose. It’s kinda sad to have to admit that the dataset you spent so many months painfully constructing does not have enough information to answer any of your research questions—but that’s how it goes sometimes. Just too bad nobody told this guy about these issues before he started his study.

So remember these statistical errors here, in this clean setting, and watch out for them in your world.

Ptolemaic inference


OK, we’ve been seeing this a lot recently. A psychology study gets published, with a key idea that at first seems wacky but, upon closer reflection, could very well be true!


– That “dentist named Dennis” paper suggesting that people pick where they live and what job to take based on their names.

– Power pose: at first it sounds ridiculous that you could boost your hormones and have success just by holding your body differently. But, sure, think about it some more and it could be possible.

– Ovulation and voting: do your political preferences change this much based on the time of the month? OK, this one seems ridiculous even upon reflection, but that’s just because I’ve seen a lot of polling data. To an outsider, sure, it seems possible, everybody knows voters are irrational.

– Embodied cognition: as Daniel Kahneman memorably put it, “When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option.”

– And lots more: himmicanes, air rage, beauty and sex ratio, football games and elections, subliminal smiley faces and attitudes toward immigration, etc. Each of these seems at first to be a bit of a stretch, but, upon reflection, could be real. Maybe people really do react differently to hurricanes with boy or girl names! And so on.

These examples all have the following features:

1. The claimed phenomenon is some sort of bank shot, an indirect effect without a clear mechanism.

2. Still, the effect seems like it could be true; the indirect mechanism seems vaguely plausible.

3. The exact opposite effect is also plausible. One could easily imagine people avoiding careers that sound like their names, or voting in the opposite way during that time of the month, or responding to elderly-themed words by running faster, or reacting with more alacrity to female-named hurricanes, and so on.

Item 3 is not always mentioned but it’s a natural consequence of items 1 and 2. The very vagueness of the mechanisms which allow plausibility, also allow plausibility for virtually any interaction and effects of virtually any sign. Which is why sociologist Jeremy Freese so memorably described these theories as “vampirical rather than empirical—unable to be killed by mere evidence.”

Think about it: If A is plausible, and not-A is plausible, and if the garden of forking paths and researcher degrees of freedom allow you to get statistical significance from just about any dataset, you can’t lose.

Enter Ptolemy

But we’ve discussed all that before, many times. What I want to talk about today is how many of these stories proceed. It goes like this:

– Original paper gets published and publicized. A stunning counterintuitive finding.

– A literature develops. Conceptual replications galore. Each new study finds a new interaction. The flexible nature of scientific discovery, along with the requirement by journals of (a) originality and (b) p less than .05, in practice requires that each new study is a bit different from everything that came before. From the standpoint of scientific replication, this is a minus, but from the standpoint of producing a scientific literature, it’s a plus. A paper that is nothing but noise mining can get thousands of citations:

[Screenshot of citation counts for one such paper]

– The literature is criticized on methodological grounds, and some attempted replications fail.

Now here’s where Ptolemy comes in. There is, sometimes, an attempt to square the circle, to resolve the apparent contradiction between the original seemingly successful study, the literature of seemingly successful conceptual replications, and the new, discouraging, failed replications.

I say “apparent contradiction” because this pattern of results is typically consistent with a story in which the true effect is zero, or is so highly variable and situation-dependent as to be undetectable, and in which the original study and the literature of apparent successes are merely testimony to the effectiveness of p-hacking or the garden of forking paths to uncover statistically significant comparisons in the presence of researcher degrees of freedom.

But there is this other story that often gets told, which is that effects are contextually dependent in a particular way, a way which preserves the validity of all the published results while explaining away the failed replications as just being done wrong, as not true replications.

This was what the power pose authors said about the unsuccessful replication performed by Ranehill et al., and this is what Gilbert et al. said about the entire replication project. (And see here for Nosek et al.’s compelling (to me) criticism of Gilbert et al.’s argument.)

I call this reasoning Ptolemaic because it’s an attempt to explain an entire pattern of data with an elaborate system of invisible mechanisms. On days where you’re more fertile you’re more likely to wear red. Unless it’s a cold day, then it doesn’t happen. Or maybe it’s not the most fertile days, maybe it’s the days that precede maximum fertility. Or, when you’re ovulating you’re more likely to vote for Barack Obama. Unless you’re married, then ovulation makes you more likely to support Mitt Romney. Or, in the words of explainer-in-chief John Bargh, “Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.”

It’s all possible but this sort of interpretation of the data is a sort of slalom that weaves back and forth in order to be consistent with every published claim. Which would be fine if the published claims were deterministic truths, but in fact they’re noisy and selected data summaries. It’s classic overfitting.

Look. I’m not some sort of Occam fundamentalist. Maybe these effects are real. But in any case you should take account of all these sources of random and systematic error and recognize that, once you open that door, you have to allow for the possibility that these effects are real but go in the opposite direction from what was claimed. You have to allow for the very real possibility that power pose hurts people, that Cornell students have negative ESP, that hurricanes with boys’ names create more damage, and so forth. Own your model.

Remember: the ultimate goal is to describe reality, not to explain away a bunch of published papers in a way that will cause the least offense to their authors and their supporters in the academy and the news media.


Yesterday all the past. The language of effect size
Spreading to Psychology along the sub-fields; the diffusion
Of the counting-frame and the quincunx;
Yesterday the shadow-reckoning in the ivy climates.

Yesterday the assessment of hypotheses by tests,
The divination of water; yesterday the invention
Of cartwheels and clocks, the power-pose of
Horses. Yesterday the bustling world of the experimenters.

Yesterday the abolition of Bible codes and hot hands,
the journal like a motionless eagle eyeing the valley,
the chapel built in the psych lab;
Yesterday the carving of instruments and alarming findings;

The trial of heretics among the tenure reviews;
Yesterday the theoretical feuds in the conferences
And the miraculous confirmation of the counterintuitive;
Yesterday the Sabbath of analysts; but to-day the struggle.

Yesterday the installation of statistical packages,
The construction of findings in available data;
Yesterday the evo-psych lecture
On the origin of Mankind. But to-day the struggle.

Yesterday the belief in the absolute value of Bayes,
The fall of the curtain upon the death of a model;
Yesterday the prayer to the sunset
And the adoration of madmen. but to-day the struggle.

As the postdoc whispers, startled among the test tubes,
Or where the loose waterfall sings compact, or upright
On the crag by the leaning tower:
“O my vision. O send me the luck of the Wilcoxon.”

And the investigator peers through his instruments
At the inhuman provinces, the virile bacillus
Or enormous Jupiter finished:
“But the lives of my friends. I inquire. I inquire.”

And the students in their fireless lodgings, dropping the sheets
Of the evening preprint: “Our day is our loss. O show us
History the operator, the
Organiser. Time the refreshing river.”

And the nations combine each cry, invoking the life
That shapes the individual belly and orders
The private nocturnal terror:
“Did you not found the city state of the sponge,

“Raise the vast military empires of the shark
And the tiger, establish the robin’s plucky canton?
Intervene. O descend as a dove or
A furious papa or a mild engineer, but descend.”

And the life, if it answers at all, replied from the heart
And the eyes and the lungs, from the shops and squares of the laboratory
“O no, I am not the mover;
Not to-day; not to you. To you, I’m the

“Yes-man, the associate editor, the easily-duped;
I am whatever you do. I am your vow to be
Good, your humorous story.
I am your business voice. I am your career.

“What’s your proposal? To build the true theory? I will.
I agree. Or is it the suicide pact, the romantic
Death? Very well, I accept, for
I am your choice, your decision. Yes, I am Science.”

Many have heard it on remote peninsulas,
On sleepy plains, in the aberrant fishermen’s islands
Or the corrupt heart of the city.
Have heard and migrated like gulls or the seeds of a flower.

They clung like burrs to the long expresses that lurch
Through the unjust lands, through the night, through the alpine tunnel;
They floated over the oceans;
They walked the passes. All presented their lives.

On that arid square, that fragment nipped off from hot
Inquiry, soldered so crudely to inventive Emotion;
On that tableland scored by experiments,
Our thoughts have bodies; the menacing shapes of our fever

Are precise and alive. For the fears which made us respond
To the medicine ad, and the rumors of multiple comparisons
Have become invading battalions;
And our faces, the institute-face, the multisite trial, the ruin

Are projecting their greed as the methodological terrorists.
B-schools are the heart. Our moments of tenderness blossom
As the ambulance and the sandbag;
Our hours of blogging into a people’s army.

To-morrow, perhaps the future. The research on fatigue
And the movements of packers; the gradual exploring of all the
Octaves of embodied cognition;
To-morrow the enlarging of consciousness by diet and breathing.

To-morrow the rediscovery of romantic fame,
the photographing of brain scans; all the fun under
Publicity’s masterful shadow;
To-morrow the hour of the press release and the Ted talk,

The beautiful roar of the audiences of NPR;
To-morrow the exchanging of tips on the training of MTurkers,
The eager election of chairmen
By the sudden forest of hands. But to-day the struggle.

To-morrow for the young the p-values exploding like bombs,
The walks by the lake, the weeks of perfect communion;
To-morrow the revisions and resubmissions
Through the journals on summer evenings. But to-day the struggle.

To-day the deliberate increase in the chances of rejection,
The conscious acceptance of guilt in the necessary criticism;
To-day the expending of powers
On the flat ephemeral blog post and the boring listserv.

To-day the makeshift consolations: the shared retraction,
The cards in the candlelit barn, and the scraping concert,
The tasteless jokes; to-day the
Fumbled and unsatisfactory link before hurting.

The stars are dead. The editors will not look.
We are left alone with our day, and the time is short, and
History to the defeated
May say Alas but cannot help nor pardon.

P.S. See here and here for background.

“How One Study Produced a Bunch of Untrue Headlines About Tattoos Strengthening Your Immune System”


Jeff points to this excellently skeptical news article by Caroline Weinberg, who writes:

A recent study published in the American Journal of Human Biology suggests that people with previous tattoo experience may have a better immune response to new tattoos than those being inked for the first time. That’s the finding if you read the open access journal article, anyway. If you stick to the headlines of recent writeups of the study, your takeaway was probably that tattoos are an effective way of preventing the common cold. (sorry to break it to you, but they’re probably not). For this study, researchers collected pre- and post- tattoo cortisol and IgA salivary levels on 29 people receiving tattoos in Alabama parlors. . . . these findings indicate that your experience with prior tattoos influences your response when receiving a tattoo—consistent with existing knowledge about stress response.

OK, so far, so good. But then Weinberg lays out the problems with the media reports:

I [Weinberg] rolled my eyes at the Huffington Post headline “Sorry Mom: Getting Lots Of Tattoos Could Have A Surprising Health Benefit.” My bemusement quickly turned to exasperation when I found CBS’s “Getting Multiple Tattoos Can Help Prevent Colds, Study Says,” and Marie Claire’s “Getting Lots of Tattoos Might Actually Be Good for You,” among many many others. My cortisol levels were probably sky high—my body does not appear to have habituated to seeing science butchered in the media machine.

Huffington Post, sure, they’ll publish anything. And CBS, sure, they promoted that notorious “power pose” research (ironically with the phrase “Believe it or not”). But Marie Claire? They’re supposed to have some standards, right?

Weinberg reports how it happened:

The title of the University of Alabama’s press release on the study is: “Want to Avoid a Cold? Try a Tattoo or Twenty, says UA Researcher.”

Oooh, that’s really bad. Weinberg went to the trouble of interviewing Christopher Lynn, the lead author of the article in question and a professor at the university in question, who said, “It’s a dumb suggestion that people go out and get tattoos for the express purpose of improving one’s immune system. I don’t think anyone would do that, but that suggestion by some news pieces is a little embarrassing.”

Another failed replication of power pose


Someone sent me this recent article, “Embodying Power: A Preregistered Replication and Extension of the Power Pose Effect,” by Katie Garrison, David Tang, and Brandon Schmeichel.

Unsurprisingly (given that the experiment was preregistered), the authors found no evidence for any effect of power pose.

The Garrison et al. paper is reasonable enough, but for my taste they aren’t explicit enough about the original “power pose” paper being an exercise in noise mining. They do say, “Another possible explanation for the nonsignificant effect of power posing on risk taking is that power posing does not influence risk taking,” but this only appears 3 paragraphs into their implications section, and they never address the question: If power pose has no effect, how did Carney et al. get statistical significance, publication in a top journal, fame, fortune, etc.? The garden of forking paths is the missing link in this story. (In that original paper, Carney et al. had many, many “researcher degrees of freedom” which would allow them to find “p less than .05” even from data produced by pure noise.)

It’s also not clear what makes Garrison et al. conclude, “We believe future research should continue to explore eye gaze in combination with body posture when studying the embodiment of power.” If power pose really has no effect (or, more precisely, highly unstable and situation-dependent effects), why is it worth future research at all? At the very least, any future research should consider measurement issues much more carefully.

Perhaps Garrison et al. were just trying to be charitable to Carney et al. and say, Hey, maybe you really did luck into a real finding. Or perhaps psychology journals simply will not allow you to say explicitly that a published paper in a top journal is nothing but noise mining. Or maybe they thought it more politically savvy to state their conclusions in a subtle way and let readers draw the inference that the original study by Carney et al. was consistent with pure noise.

The downside of subtle politeness

Whatever the reason, I find the sort of subtlety shown by Garrison et al. to be frustrating. For people like me, and the person who sent the article to me, it’s clear what Garrison et al. are saying—no evidence for power pose, and the original study can be entirely discounted. But less savvy readers might not know the code; they might take the paper’s words literally and think that “social context plays a key role in power pose effects, and the current experiment lacked a meaningful social context” (a theory that Garrison et al. discuss before bringing up the “power posing does not influence risk taking” theory).

That would be too bad, if these researchers went to the trouble of doing a new study, writing it up, and getting it published, only to have drawn their conclusions so subtly that readers could miss the point.

Mulligan after mulligan

You may wonder why I continue to pick on power pose. It’s still one of the most popular Ted talks of all time, featured on NPR etc etc etc. So, yeah, people are taking it seriously. One could make the argument that power pose is innocuous, maybe beneficial in that it is a way of encouraging people to take charge of their lives. And this may be so. Even if power pose itself is meaningless, the larger “power pose” story could be a plus. Of course, if power pose is just an inspirational story to empower people, it doesn’t have to be true, or replicable, or scientifically valid, or whatever. From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotters’ Club on the grounds that Jonathan Coe was just making it all up. I guess I’d prefer, if business school professors want to tell inspirational stories without any scientific basis, that they label them more clearly as parables, rather than dragging the scientific field of psychology into it. And I’d prefer if scientific psychologists didn’t give mulligan after mulligan to theories like power pose, just because they’re inspirational and got published with p less than .05.

I don’t care about power pose. It’s just a silly fad. I do care about reality, and I care about science, which is one of the methods we have for learning about reality. The current system of scientific publication, in which a research team can get fame, fortune, and citations by p-hacking, and in which, even when later research groups fail to replicate the study, there is a continuing push to credit the original work and to hypothesize mysterious interaction effects that would manage to preserve everyone’s reputation . . . it’s a problem.

It’s Ptolemy, man, that’s what it is. [No, it’s not Ptolemy; see Ethan’s comment below.]

P.S. I wrote this post months ago, it just happens to be appearing now, at a time in which we’re talking a lot about the replication crisis.

Practical Bayesian model evaluation in Stan and rstanarm using leave-one-out cross-validation

Our (Aki, Andrew and Jonah) paper Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC was recently published in Statistics and Computing. In the paper we show

  • why it’s better to use LOO instead of WAIC for model evaluation
  • how to compute LOO quickly and reliably using the full posterior sample
  • how Pareto smoothing importance sampling (PSIS) reduces variance of LOO estimate
  • how Pareto shape diagnostics can be used to indicate when PSIS-LOO fails

PSIS-LOO makes it possible to use automated LOO in practice in rstanarm, which provides a flexible way to use pre-compiled Stan regression models. The estimation by sampling yields draws from the full posterior, and these same draws are used to compute the PSIS-LOO estimate at negligible additional computational cost. PSIS-LOO can fail, but such failure is reliably detected by the Pareto shape diagnostics. If there are high estimated Pareto shape values, a summary of them is reported to the user along with suggestions about what to do next. In the initial modeling phase the user can ignore the warnings (and still get more reliable results than WAIC or DIC would provide). If there are high estimated Pareto shape values, rstanarm offers to rerun the inference only for the problematic leave-one-out folds (in the paper we call this approach PSIS-LOO+). If there are many high values, rstanarm offers to run k-fold cross-validation. This way a fast estimate of predictive performance is always provided, and the user can decide how much additional computation time to spend to get more accurate results. In the future we will add other utility and cost functions, such as explained variance, MAE, and classification accuracy, to provide easier interpretation of predictive performance.
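As a rough sketch of that workflow in rstanarm (the model, data, and threshold below are placeholders for illustration, not anything from the paper; argument names are as in recent rstanarm versions):

library(rstanarm)
fit <- stan_glm(mpg ~ wt + cyl, data = mtcars)   # a pre-compiled Stan regression model
loo1 <- loo(fit)          # PSIS-LOO computed from the existing posterior draws
print(loo1)               # includes the Pareto shape (k) diagnostics
loo2 <- loo(fit, k_threshold = 0.7)   # rerun inference only for problematic folds (PSIS-LOO+)
kf <- kfold(fit, K = 10)              # fall back to K-fold cross-validation if many k values are high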

The same approach can also be used when running Stan via interfaces other than rstanarm, although the user then needs to add a few lines to the usual Stan code. After that, PSIS-LOO and its diagnostics are easily computed using the available packages for R, Python, and Matlab.

Authors of AJPS paper find that the signs on their coefficients were reversed. But they don’t care: in their words, “None of our papers actually give a damn about whether it’s plus or minus.” All right, then!

Avi Adler writes:

I hit you up on twitter, and you probably saw this already, but you may enjoy this.

I’m not actually on twitter but I do read email, so I followed the link and read this post by Steven Hayward:


Hoo-wee, the New York Times will really have to extend itself to top the boner and mother-of-all-corrections at the American Journal of Political Science. This is the journal that published a finding much beloved of liberals a few years back that purported to find scientific evidence that conservatives are more likely to exhibit traits associated with psychoticism, such as authoritarianism and tough-mindedness, and that the supposed “authoritarian” personality of conservatives might even have a genetic basis (and therefore be treatable someday?). Settle in with a cup or glass of your favorite beverage, and get ready to enjoy one of the most epic academic face plants ever.

The original article was called “Correlation not causation: the relationship between personality traits and political ideologies,” and was written by three academics at Virginia Commonwealth University. . . .

I had no recollection of this study but I forget lots of things so I decided to google my name and the name of the paper’s first author, and lo! this is what I found, a news article by Shannon Palus:

Researchers have fixed a number of papers after mistakenly reporting that people who hold conservative political beliefs are more likely to exhibit traits associated with psychoticism, such as authoritarianism and tough-mindedness. . . .

To help us make sense of the analysis, we turned to Andrew Gelman, a statistician at Columbia not involved with the work, to explain the AJPS paper to us. He said:

I don’t find this paper at all convincing, indeed I’m surprised it was accepted for publication by a leading political science journal. The causal analysis doesn’t make any sense to me, and some of the things they do are just bizarre, like declaring that correlations are “large enough for further consideration” if they are more than 0.2 for both sexes. Where does that come from? The whole thing is a mess.

He added:

It’s hard for me to care about reported effect sizes here…If the underlying analysis doesn’t make sense, who cares about the reported effect sizes?

Hey, now I remember! Oddly enough, Palus quotes one of the authors of the original paper as saying,

We only cared about the magnitude of the relationship and the source of it . . . None of our papers actually give a damn about whether it’s plus or minus.

How you can realistically expect to learn about the magnitude of a relationship and the source of it without knowing about its sign, that one baffles me. And the author of the paper then adds to the confusion by saying,

[T]he correlations are spurious, so the direction or even magnitude is not suitable to elaborate on at all- that’s the point of all our papers and the general findings.

Now I’m even more puzzled as to how this paper got published in AJPS, which is a serious political science journal. We’re not talking Psychological Science or PPNAS here. I suspect the AJPS got snowed by all the talk of genetics. Social scientists can be such suckers sometimes!

Looking at the correction note by Brad Verhulst, Lindon Eaves, and Peter Hatemi, I see this:

Since these personality traits and their antecedents have been previously found to both positively and negatively predict liberalism, or not at all, the descriptive analyses did not appear abnormal to the authors, editors, reviewers or the general academy.

Wha??? OK, so you’re saying the data are all noise so who cares? You’ve convinced me not to care, that’s for sure!

Getting back to the original link above: I disagree with Steven Hayward’s claim that this is an “epic correction.” Embarrassing for sure, but given that it’s hard to take the original finding seriously, it’s hard for me to get very excited about the reversal either.

Good to see the error caught, in any case. I’m not at all kidding when I say that I expect more from AJPS than from Psych Sci or PPNAS.

P.S. I did some web search and noticed that Hatemi was also a coauthor of a silly paper about the politics of smell; see here for my skeptical take on that one.

Avoiding model selection in Bayesian social research

One of my favorites, from 1995.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:

– “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”

– “How can BIC select a model that does not fit the data over one that does”

– “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web [link fixed] with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

P.S. Yes, I’ve blogged this one before, also here. But I found out that not everyone knows about this paper so I’m sharing it again here.


A journalist sent me a bunch of questions regarding problems with polls. Here was my reply:

In answer to your question, no, the polls in Brexit did not fail. They were pretty good. See here and here.

The polls also successfully estimated Donald Trump’s success in the Republican primary election.

I think that poll responses are generally sincere. Polls are not perfect because they miss many people, hence pollsters need to make adjustments; see for example here and here.

I hope this helps.

We have a ways to go in communicating the replication crisis

I happened to come across this old post today with this amazing, amazing quote from a Harvard University public relations writer:

The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.

This came up in the context of a paper by Daniel Gilbert et al. defending the reputation of social psychology, a field that has recently been shredded—and rightly so—by revelations of questionable research practices, p-hacking, gardens of forking paths, and high-profile failed replications.

When I came across the above quote, I mocked it, but in retrospect I think it hadn’t disturbed me enough. The trouble was that I was associating it with Gilbert et al.: those guys don’t know a lot of statistics so it didn’t really surprise me that they could be so innumerate. I let the publicist off the hook on the grounds that he was following the lead of some Harvard professors. Harvard professors can make mistakes or even be wrong on purpose, but it’s not typically the job of a Harvard publicist to concern himself with such possibilities.

But now, on reflection, I’m disturbed. That statement about the 100% replication rate is so wrong, it’s so inane, I’m bothered that it didn’t trigger some sort of switch in the publicist’s brain.

Consider the following statements:

“Harvard physicist builds perpetual motion machine”

“Harvard biologist discovers evidence for creationism”

That wouldn’t happen, right? The P.R. guy would sniff that something’s up. This isn’t the University of Utah, right?

I’m not saying Harvard’s always right. Harvard has publicized the power pose and all sorts of silly things. But the idea of a 100% replication rate, that’s not just silly or unproven or speculative or even mistaken: it’s obviously wrong. It’s ridiculous.

But the P.R. guy didn’t realize it. If a Harvard prof told him about a perpetual motion machine or proof of creationism, the public relations officer would make a few calls before running the story. But a 100% replication rate? Sure, why not, he must’ve thought.

We have a ways to go. We’ll always have research slip-ups and publicized claims that fall through, but let’s hope it’s not much longer that people can claim 100% replication rates with a straight face. That’s just embarrassing.

P.S. I have to keep adding these postscripts . . . I wrote this post months ago, it just happens to be appearing now, at a time in which we’re talking a lot about the replication crisis.

Mathematica, now with Stan

Vincent Picaud developed a Mathematica interface to Stan: MathematicaStan.

You can find everything you need to get started by following the link above. If you have questions, comments, or suggestions, please let us know through the Stan user’s group or the GitHub issue tracker.

MathematicaStan interfaces to Stan through a CmdStan process.

Stan programs are portable across interfaces.

The Psychological Science stereotype paradox


Lee Jussim, Jarret Crawford, and Rachel Rubinstein just published a paper in Psychological Science that begins,

Are stereotypes accurate or inaccurate? We summarize evidence that stereotype accuracy is one of the largest and most replicable findings in social psychology. We address controversies in this literature, including the long-standing and continuing but unjustified emphasis on stereotype inaccuracy . . .

I haven’t read the paper in detail but I imagine that a claim that stereotypes are accurate will depend strongly on the definition of “accuracy.”

But what I really want to talk about is this paradox:

My stereotype about a Psychological Science article is that it is an exercise in noise mining, followed by hype. But this Psychological Science paper says that stereotypes are accurate. So if the article is true, then my stereotype is accurate, and the article is just hype, in which case stereotypes are not accurate, in which case the paper might actually be correct, in which case stereotypes might actually be accurate . . . now I’m getting dizzy!

P.S. Jussim has a long and interesting discussion in the comments. I should perhaps clarify that my above claim of a “paradox” was a joke! I understand about variability.

Webinar: Introduction to Bayesian Data Analysis and Stan

This post is by Eric.

We are starting a series of free webinars about Stan, Bayesian inference, decision theory, and model building. The first webinar will be held on Tuesday, October 25 at 11:00 AM EDT. You can register here.

Stan is a free and open-source probabilistic programming language and Bayesian inference engine. In this talk, we will demonstrate the use of Stan for some small problems in sports ranking, nonlinear regression, mixture modeling, and decision analysis, to illustrate the general idea that Bayesian data analysis involves model building, model fitting, and model checking. One of our major motivations in building Stan is to efficiently fit complex models to data, and Stan has indeed been used for this purpose in social, biological, and physical sciences, engineering, and business. The purpose of the present webinar is to demonstrate using simple examples how one can directly specify and fit models in Stan and make logical decisions under uncertainty.


Update: a video recording of the webinar is now available here.

Advice on setting up audio for your podcast

Jennifer and I were getting ready to do our podcast, and in preparation we got some advice from Enrico Bertini and the Data Stories team:

1) Multitracking. The best way is to multitrack and have each person record locally (note: this is easier if you are in different rooms/locations). Multitracking gives you a lot of freedom in the postediting phase. You can fix when voice overlaps, remove various noises and utterances, adjust volume levels etc. If you are in the same room you can still multitrack but it’s more complex.

2) Microphone. Having good (even high-end) mics makes a huge difference. When you hear the difference between a good mic and your average iPhone earbuds it’s stunning! With good mics you sound like a pro; without them you sound meh. Here you have many many options:
You can use a USB mic made for podcasting and plug it into your computer (the Rode Podcaster is great; I have a Yeti also but it’s not as good).
You can buy a standalone recorder (we have a Zoom and we love it).
You can buy high-end condenser mics and plug them into a mixer.

3) Recording device. Recording on your computer is fine. We record most of our sound using our Mac and QuickTime. Very easy and straightforward. When I use the Zoom I record directly on it, since it is also a recorder. Recording with an iPhone is not good enough.

4) Remote communication. If you are located remotely and/or have a remote guest, you can (and should) keep recording locally, but you still have to communicate. We have used Skype and Hangouts with mixed results. When there are too many people or someone has a slow network it’s a real pain. We are still struggling with this ourselves. Hangouts seems to be a bit more reliable. One good thing with Skype is that you can record within it and make sure you always have a backup. Backups and redundancy are crucial. Things do go wrong sometimes, in very unexpected ways!

5) Noise. It’s important to reduce noise in your environment. In particular, turn phones off or put them in airplane mode, and avoid interruptions and ambient noise (even birds can be a problem!). Sometimes the sound coming from your headphones can also be picked up by your mic, so you need to be careful.

6) Synchronization. When you have multiple tracks you have to find a way to sync them. We have a very low-fi trick: we ask our guest to count backward 3, 2, 1 and clap, and we put our headphones close to our mics (not sure how others have solved this problem).

7) Audio postproduction. There are tons of things that can be done after the recording. We have a fantastic person working for us who is a pro. I don’t know all the details of the filter he uses. But he does cut things down when we are too verbose or make mistakes. This is priceless.

I [Bertini] think the most important thing to know is if you are planning to be in the same room or not and if you are going to have guests. The set up can change considerably according to what kind of combination you have.

Should Jonah Lehrer be a junior Gladwell? Does he have any other options?

Remember Jonah Lehrer—that science writer from a few years back whose reputation was tarnished after some plagiarism and fabrication scandals? He’s been blogging—on science!

And he’s on to some of the usual suspects: Ellen Langer’s mindfulness (see here for the skeptical take) and—hey—“an important new paper [by] Kyla Haimovitz and Carol Dweck” (see here for background).

Also a post, “Can Tetris Help Us Cope With Traumatic Memories?” and another on the recently-questioned “ego-depletion effect.”

And lots more along these lines. It’s This Week in Psychological Science in blog form. Nothing yet on himmicanes or power pose, but just give him time.

Say what you want about Malcolm Gladwell, he at least tries to weigh the evidence and he’s willing to be skeptical. Lehrer goes all-in, every time.

It’s funny: they say you can’t scam a scammer, but it looks like Lehrer’s getting scammed, over and over again. This guy seems to be the perfect patsy for every Psychological Science press release that’s ever existed.

But how could he be so clueless? Perhaps he’s auditioning for the role of press agent: he’d like to write an inspirational “business book” with someone who does one of these experiments, so he’s getting into practice and promoting himself by writing these little posts. He’d rather write them as New York Times magazine articles or whatever but that path is (currently) blocked for him so he’s putting himself out there as best he can. From this perspective, Lehrer has every incentive to believe all these Psychological Science-style claims. It’s not that he’s made the affirmative decision to believe, it’s more that he gently turns off the critical part of his brain, the same way that a sports reporter might only focus on the good things about the home team.

Lehrer’s in a tough spot, though, as he doesn’t have that much going for him. He’s a smooth writer, but there are other smooth writers out there. He can talk science, but he can’t get to any depth. He used to have excellent media contacts, but I suspect he’s burned most of those bridges. And there’s a lot of competition out there, lots of great science writers. So he’s in a tough spot. He might have to go out and get a real job at some point.

Some people are so easy to contact and some people aren’t.

I was reading Cowboys Full, James McManus’s entertaining history of poker (but way too much on the so-called World Series of Poker), and I skimmed the index to look up some of my favorite poker writers. Frank Wallace and David Spanier were both there but only got brief mentions in the text, I was disappointed to see. I guess McManus and I have different taste. Fair enough. I also looked up Patrick Marber, author of the wonderful poker-themed play, Dealer’s Choice. Marber was not in the index.

And this brings me to the subject of today’s post. Anyone who wants can reach me by email or even call me on the phone. That’s how it is with college teachers: we’re accessible, that’s part of our job. But authors, not so much. Even authors much more obscure than James McManus typically don’t make themselves easy to contact. Maybe they don’t want to be bothered, maybe it’s just tradition, I dunno. But I think they’re missing out. McManus does seem to have a twitter account, but that doesn’t work for me. I just want to send the guy an email.

People can, of course, duck emails. I tried a couple times to contact Paul Gertler about the effect of the statistical significance filter on his claimed effects of early childhood intervention, and I have it on good authority that he received my email but just chose not to respond, I assume feeling that his life would be simpler if he were not to have to worry about that particular statistical bias. And of course famous people have to guard their time, so I usually don’t get responses from the likes of Paul Krugman, Malcolm Gladwell, David Brooks, or Nate Silver. (That last one is particularly ironic given that people are always asking me for Nate’s email. I typically give them the email but warn them that Nate might not respond.)

Anyway, I have no problem at all with famous people not returning my emails—if they responded to all the emails they received from statistics professors, they’d probably have no time for anything else, and they’d be reduced to a Stallman-esque existence.

And, while I disapprove of the likes of Gertler not responding to emails of mine making critical comments on their work, hey, that’s his choice: if he doesn’t want to improve his statistics, there’s nothing much I can do about it.

But it’s too bad it’s not so easy to directly reach people like James McManus, or Thomas Mallon, or George Pelecanos. I think they’d be interested in the stories I would share with them.

P.S. In his book, McManus does go overboard in a few places, including his idealization of Barack Obama (all too consistent with the publication date of 2009) and this bit of sub-Nicholas-Wade theorizing:

[screenshots of the relevant passage from the book]

Aahhhh, so that’s what it was like back in the old days! Good that we have an old-timer like James McManus to remember it for us.

But that’s just a minor issue. Overall, I like the book. All of us are products of our times, so it’s no big deal if a book has a few false notes like this.

Opportunity for publishing preregistered analyses of the 2016 American National Election Study

Brendan Nyhan writes:

Have you heard about the Election Research Preacceptance Competition that Skip Lupia and I are organizing to promote preaccepted articles? Details here: A number of top journals have agreed to consider preaccepted articles that include data from the ANES. Authors who publish qualifying entries can win a $2,000 prize. We’re eager to let people know about the opportunity and to promote better scientific publishing practices.

The page in question is titled, “November 8, 2016: what really happened,” which reminded me of this election-night post of mine from eight years ago entitled, “Election 2008: what really happened.”

I could be wrong, but I’m guessing that a post such as mine would not have much of a chance in this competition, which is designed to reward “an article in which the hypotheses and design were registered before the data were publicly available.” The idea is that the proposed analyses would be performed on the 2016 American National Election Study, data from which will be released in April 2017. I suppose it would be possible to take a post such as mine and come up with hypotheses that could be tested using ANES data, but it wouldn’t be so natural.

So, I think this project of Nyhan and Lupia shares some of the strengths and weaknesses of the broader replication movement in science.

The strengths are that the competition’s rules are transparent and roughly equally open to all, in a way that, for example, publication in PPNAS does not seem to be. Also, of course, preregistration minimizes researcher degrees of freedom, which makes p-values more interpretable.

The minuses are the connection to the existing system of journals; the framing as a competition; the restriction to a single relatively small dataset; and a hypothesis-testing framework which (a) points toward confirmation rather than discovery, and (b) would seem to favor narrow inquiries rather than broader investigations. Again, I’m concerned that my own “Election 2008: what really happened” story wouldn’t fit into this framework.

Overall I think this project of Nyhan and Lupia is a good idea and I’m not complaining about it at all. Sure, it’s limited, but it’s only one of many opportunities out there. Researchers who want to test specific hypotheses can enter this competition. Hypothesis testing isn’t my thing, but nothing’s stopping me or others from posting whatever we do on blogs, Arxiv, SSRN, etc. There’s room for lots of approaches, and, at the very least, this effort should encourage some researchers to use ANES more intensively than they otherwise would have.

“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Kevin Lewis sends along this article by Laura Pritschet, Derek Powell, and Zachary Horne, who write:

Some effects are statistically significant. Other effects do not reach the threshold of statistical significance and are sometimes described as “marginally significant” or as “approaching significance.” Although the concept of marginal significance is widely deployed in academic psychology, there has been very little systematic examination of psychologists’ attitudes toward these effects. Here, we report an observational study in which we investigated psychologists’ attitudes concerning marginal significance by examining their language in over 1,500 articles published in top-tier cognitive, developmental, and social psychology journals. We observed a large change over the course of four decades in psychologists’ tendency to describe a p value as marginally significant, and overall rates of use appear to differ across subfields. We discuss possible explanations for these findings, as well as their implications for psychological research.

The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time (as discussed, for example, in this recent comment thread about a 2016 Psychological Science paper by Haimowitz and Dweck), so it’s worth examining the prevalence of this error, as Pritschet et al. do.

Let me first briefly explain why categorizing based on p-values is such a bad idea. Consider, for example, this division: “really significant” for p less than .01, “significant” for p less than .05, “marginally significant” for p less than .1, and “not at all significant” otherwise. And consider some typical p-values in these ranges: say, p=.005, p=.03, p=.08, and p=.2. Now translate these two-sided p-values back into z-scores, which we can do in R via qnorm(1 - c(.005, .03, .08, .2)/2), yielding the z-scores 2.8, 2.2, 1.8, 1.3. The seemingly yawning gap between the “not at all significant” p-value of .2 and the “really significant” p-value of .005 corresponds to a difference in z-scores of only 1.5. Indeed, if you had two independent experiments with these z-scores and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4, which is completely consistent with noise. This is the point that Hal Stern and I made in our paper from a few years back.
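
To make that arithmetic concrete, here is a minimal sketch in R of the calculation just described. The p-values are the same illustrative numbers as above, not data from any particular study.

# Convert the illustrative two-sided p-values to z-scores.
p <- c(0.005, 0.03, 0.08, 0.2)
z <- qnorm(1 - p/2)
round(z, 1)   # 2.8 2.2 1.8 1.3

# Treat the "really significant" and "not at all significant" results as estimates
# from two independent experiments with equal standard errors and compare them directly.
difference <- z[1] - z[4]          # about 1.5
se_difference <- sqrt(1^2 + 1^2)   # about 1.4
difference / se_difference         # roughly 1.1, nowhere near conventional significance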

From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect—and in psychology studies, nobody’s interested in the null hypothesis. Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.

From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).

P.S. Related is this paper from a few years ago, “Erroneous analyses of interactions in neuroscience: a problem of significance,” by Sander Nieuwenhuis, Birte Forstmann, and E. J. Wagenmakers, who wrote:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience. We discuss scenarios in which the erroneous procedure is particularly beguiling.

It’s a problem.
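
To see the contrast in code, here is a small R sketch with made-up numbers (the effect estimates of 0.50 and 0.20 and their standard errors are hypothetical, not values from the Nieuwenhuis et al. review): one effect is “significant” on its own, the other is not, yet the direct test of their difference gives no clear evidence that the two effects differ.

# Hypothetical estimates and standard errors from two independent experiments.
est <- c(exp1 = 0.50, exp2 = 0.20)
se  <- c(exp1 = 0.20, exp2 = 0.20)

# Incorrect procedure: classify each effect by its own p-value.
z_each <- est / se
p_each <- 2 * (1 - pnorm(z_each))   # exp1: p about .01; exp2: p about .32

# Correct procedure: test the difference between the two effects directly.
diff_est <- est["exp1"] - est["exp2"]
diff_se  <- sqrt(se["exp1"]^2 + se["exp2"]^2)
diff_est / diff_se                  # about 1.1, so no clear evidence the effects differ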

P.S. Amusingly enough, just a couple days ago we discussed an abstract that had a “marginal significant” in it.