

Yesterday all the past. The language of effect size
Spreading to Psychology along the sub-fields; the diffusion
Of the counting-frame and the quincunx;
Yesterday the shadow-reckoning in the ivy climates.

Yesterday the assessment of hypotheses by tests,
The divination of water; yesterday the invention
Of cartwheels and clocks, the power-pose of
Horses. Yesterday the bustling world of the experimenters.

Yesterday the abolition of Bible codes and hot hands,
the journal like a motionless eagle eyeing the valley,
the chapel built in the psych lab;
Yesterday the carving of instruments and alarming findings;

The trial of heretics among the tenure reviews;
Yesterday the theoretical feuds in the conferences
And the miraculous confirmation of the counterintuitive;
Yesterday the Sabbath of analysts; but to-day the struggle.

Yesterday the installation of statistical packages,
The construction of findings in available data;
Yesterday the evo-psych lecture
On the origin of Mankind. But to-day the struggle.

Yesterday the belief in the absolute value of Bayes,
The fall of the curtain upon the death of a model;
Yesterday the prayer to the sunset
And the adoration of madmen. But to-day the struggle.

As the postdoc whispers, startled among the test tubes,
Or where the loose waterfall sings compact, or upright
On the crag by the leaning tower:
“O my vision. O send me the luck of the Wilcoxon.”

And the investigator peers through his instruments
At the inhuman provinces, the virile bacillus
Or enormous Jupiter finished:
“But the lives of my friends. I inquire. I inquire.”

And the students in their fireless lodgings, dropping the sheets
Of the evening preprint: “Our day is our loss. O show us
History the operator, the
Organiser. Time the refreshing river.”

And the nations combine each cry, invoking the life
That shapes the individual belly and orders
The private nocturnal terror:
“Did you not found the city state of the sponge,

“Raise the vast military empires of the shark
And the tiger, establish the robin’s plucky canton?
Intervene. O descend as a dove or
A furious papa or a mild engineer, but descend.”

And the life, if it answers at all, replies from the heart
And the eyes and the lungs, from the shops and squares of the laboratory
“O no, I am not the mover;
Not to-day; not to you. To you, I’m the

“Yes-man, the associate editor, the easily-duped;
I am whatever you do. I am your vow to be
Good, your humorous story.
I am your business voice. I am your career.

“What’s your proposal? To build the true theory? I will.
I agree. Or is it the suicide pact, the romantic
Death? Very well, I accept, for
I am your choice, your decision. Yes, I am Science.”

Many have heard it on remote peninsulas,
On sleepy plains, in the aberrant fishermen’s islands
Or the corrupt heart of the city.
Have heard and migrated like gulls or the seeds of a flower.

They clung like burrs to the long expresses that lurch
Through the unjust lands, through the night, through the alpine tunnel;
They floated over the oceans;
They walked the passes. All presented their lives.

On that arid square, that fragment nipped off from hot
Inquiry, soldered so crudely to inventive Emotion;
On that tableland scored by experiments,
Our thoughts have bodies; the menacing shapes of our fever

Are precise and alive. For the fears which made us respond
To the medicine ad, and the rumors of multiple comparisons
Have become invading battalions;
And our faces, the institute-face, the multisite trial, the ruin

Are projecting their greed as the methodological terrorists.
B-schools are the heart. Our moments of tenderness blossom
As the ambulance and the sandbag;
Our hours of blogging into a people’s army.

To-morrow, perhaps the future. The research on fatigue
And the movements of packers; the gradual exploring of all the
Octaves of embodied cognition;
To-morrow the enlarging of consciousness by diet and breathing.

To-morrow the rediscovery of romantic fame,
the photographing of brain scans; all the fun under
Publicity’s masterful shadow;
To-morrow the hour of the press release and the Ted talk,

The beautiful roar of the audiences of NPR;
To-morrow the exchanging of tips on the training of MTurkers,
The eager election of chairmen
By the sudden forest of hands. But to-day the struggle.

To-morrow for the young the p-values exploding like bombs,
The walks by the lake, the weeks of perfect communion;
To-morrow the revisions and resubmissions
Through the journals on summer evenings. But to-day the struggle.

To-day the deliberate increase in the chances of rejection,
The conscious acceptance of guilt in the necessary criticism;
To-day the expending of powers
On the flat ephemeral blog post and the boring listserv.

To-day the makeshift consolations: the shared retraction,
The cards in the candlelit barn, and the scraping concert,
The tasteless jokes; to-day the
Fumbled and unsatisfactory link before hurting.

The stars are dead. The editors will not look.
We are left alone with our day, and the time is short, and
History to the defeated
May say Alas but cannot help nor pardon.

P.S. See here and here for background.

“How One Study Produced a Bunch of Untrue Headlines About Tattoos Strengthening Your Immune System”


Jeff points to this excellently skeptical news article by Caroline Weinberg, who writes:

A recent study published in the American Journal of Human Biology suggests that people with previous tattoo experience may have a better immune response to new tattoos than those being inked for the first time. That’s the finding if you read the open access journal article, anyway. If you stick to the headlines of recent writeups of the study, your takeaway was probably that tattoos are an effective way of preventing the common cold. (sorry to break it to you, but they’re probably not). For this study, researchers collected pre- and post- tattoo cortisol and IgA salivary levels on 29 people receiving tattoos in Alabama parlors. . . . these findings indicate that your experience with prior tattoos influences your response when receiving a tattoo—consistent with existing knowledge about stress response.

OK, so far, so good. But then Weinberg lays out the problems with the media reports:

I [Weinberg] rolled my eyes at the Huffington Post headline “Sorry Mom: Getting Lots Of Tattoos Could Have A Surprising Health Benefit.” My bemusement quickly turned to exasperation when I found CBS’s “Getting Multiple Tattoos Can Help Prevent Colds, Study Says,” and Marie Claire’s Getting Lots of Tattoos Might Actually Be Good for You, among many many others. My cortisol levels were probably sky high—my body does not appear to have habituated to seeing science butchered in the media machine.

Huffington Post, sure, they’ll publish anything. And CBS, sure, they promoted that notorious “power pose” research (ironically with the phrase “Believe it or not”). But Marie Claire? They’re supposed to have some standards, right?

Weinberg reports how it happened:

The title of the University of Alabama’s press release on the study is: “Want to Avoid a Cold? Try a Tattoo or Twenty, says UA Researcher.”

Oooh, that’s really bad. Weinberg went to the trouble of interviewing Christopher Lynn, the lead author of the article in question and a professor at the university in question, who said, “It’s a dumb suggestion that people go out and get tattoos for the express purpose of improving one’s immune system. I don’t think anyone would do that, but that suggestion by some news pieces is a little embarrassing.”

Another failed replication of power pose


Someone sent me this recent article, “Embodying Power: A Preregistered Replication and Extension of the Power Pose Effect,” by Katie Garrison, David Tang, and Brandon Schmeichel.

Unsurprisingly (given that the experiment was preregistered), the authors found no evidence for any effect of power pose.

The Garrison et al. paper is reasonable enough, but for my taste they aren’t explicit enough about the original “power pose” paper being an exercise in noise mining. They do say, “Another possible explanation for the nonsignificant effect of power posing on risk taking is that power posing does not influence risk taking,” but this only appears 3 paragraphs into their implications section, and they never address the question: If power pose has no effect, how did Carney et al. get statistical significance, publication in a top journal, fame, fortune, etc.? The garden of forking paths is the missing link in this story. (In that original paper, Carney et al. had many, many “researcher degrees of freedom” which would allow them to find “p less than .05” even from data produced by pure noise.)

It’s also not clear what makes Garrison et al. conclude, “We believe future research should continue to explore eye gaze in combination with body posture when studying the embodiment of power.” If power pose really has no effect (or, more precisely, highly unstable and situation-dependent effects), why is it worth future research at all? At the very least, any future research should consider measurement issues much more carefully.

Perhaps Garrison et al. were just trying to be charitable to Carney et al. and say, Hey, maybe you really did luck into a real finding. Or perhaps psychology journals simply will not allow you to say explicitly that a published paper in a top journal is nothing but noise mining. Or maybe they thought it more politically savvy to state their conclusions in a subtle way and let readers draw the inference that the original study by Carney et al. was consistent with pure noise.

The downside of subtle politeness

Whatever the reason, I find the sort of subtlety shown by Garrison et al. to be frustrating. For people like me, and the person who sent the article to me, it’s clear what Garrison et al. are saying—no evidence for power pose, and the original study can be entirely discounted. But less savvy readers might not know the code; they might take the paper’s words literally and think that “social context plays a key role in power pose effects, and the current experiment lacked a meaningful social context” (a theory that Garrison et al. discuss before bringing up the “power posing does not influence risk taking” theory).

That would be too bad, if these researchers went to the trouble of doing a new study, writing it up, and getting it published, only to have drawn their conclusions so subtly that readers could miss the point.

Mulligan after mulligan

You may wonder why I continue to pick on power pose. It’s still one of the most popular Ted talks of all time, featured on NPR etc etc etc. So, yeah, people are taking it seriously. One could make the argument that power pose is innocuous, maybe beneficial in that it is a way of encouraging people to take charge of their lives. And this may be so. Even if power pose itself is meaningless, the larger “power pose” story could be a plus. Of course, if power pose is just an inspirational story to empower people, it doesn’t have to be true, or replicable, or scientifically valid, or whatever. From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotters’ Club on the grounds that Jonathan Coe was just making it all up. I guess I’d prefer, if business school professors want to tell inspirational stories without any scientific basis, that they label them more clearly as parables, rather than dragging the scientific field of psychology into it. And I’d prefer if scientific psychologists didn’t give mulligan after mulligan to theories like power pose, just because they’re inspirational and got published with p less than .05.

I don’t care about power pose. It’s just a silly fad. I do care about reality, and I care about science, which is one of the methods we have for learning about reality. The current system of scientific publication, in which a research team can get fame, fortune, and citations by p-hacking, and in which, even when later research groups fail to replicate the study, there is a continuing push to credit the original work and to hypothesize mysterious interaction effects that would preserve everyone’s reputation . . . it’s a problem.

It’s Ptolemy, man, that’s what it is. [No, it’s not Ptolemy; see Ethan’s comment below.]

P.S. I wrote this post months ago, it just happens to be appearing now, at a time in which we’re talking a lot about the replication crisis.

Practical Bayesian model evaluation in Stan and rstanarm using leave-one-out cross-validation

Our (Aki, Andrew and Jonah) paper “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC” was recently published in Statistics and Computing. In the paper we show

  • why it’s better to use LOO instead of WAIC for model evaluation
  • how to compute LOO quickly and reliably using the full posterior sample
  • how Pareto smoothed importance sampling (PSIS) reduces the variance of the LOO estimate
  • how Pareto shape diagnostics can be used to indicate when PSIS-LOO fails

PSIS-LOO makes automated LOO practical in rstanarm, which provides a flexible way to use pre-compiled Stan regression models. Sampling yields draws from the full posterior, and these same draws are used to compute the PSIS-LOO estimate at negligible additional computational cost. PSIS-LOO can fail, but such failures are reliably detected by the Pareto shape diagnostics. If some estimated Pareto shape values are high, a summary of them is reported to the user along with suggestions about what to do next. In the initial modeling phase the user can ignore the warnings (and still get more reliable results than with WAIC or DIC). If there are high estimated Pareto shape values, rstanarm offers to rerun the inference only for the problematic leave-one-out folds (in the paper we named this approach PSIS-LOO+). If there are many high values, rstanarm offers to run k-fold cross-validation instead. This way a fast estimate of predictive performance is always provided, and the user can decide how much additional computation time to spend to get more accurate results. In the future we will add other utility and cost functions, such as explained variance, MAE, and classification accuracy, to make the predictive performance easier to interpret.

The above approach can also be used when running Stan through interfaces other than rstanarm, although then the user needs to add a few lines to the usual Stan code. After that, PSIS-LOO and its diagnostics are easily computed using the available packages for R, Python, and Matlab.
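For readers who want to try this from R, here is a minimal sketch of the rstanarm workflow described above; the model and dataset are placeholders, not an example from the paper:

  library(rstanarm)
  library(loo)

  # Fit a pre-compiled Stan regression model (placeholder model and data)
  fit <- stan_glm(mpg ~ wt + hp, data = mtcars)

  # PSIS-LOO computed from the existing posterior draws
  loo1 <- loo(fit)
  print(loo1)   # includes the Pareto shape (k) diagnostic summary

  # If a few Pareto k estimates are high, refit only the problematic
  # leave-one-out folds (the PSIS-LOO+ idea)
  loo1_plus <- loo(fit, k_threshold = 0.7)

  # If many k estimates are high, fall back to K-fold cross-validation
  kf <- kfold(fit, K = 10)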

Authors of AJPS paper find that the signs on their coefficients were reversed. But they don’t care: in their words, “None of our papers actually give a damn about whether it’s plus or minus.” All right, then!

Avi Adler writes:

I hit you up on twitter, and you probably saw this already, but you may enjoy this.

I’m not actually on twitter but I do read email, so I followed the link and read this post by Steven Hayward:


Hoo-wee, the New York Times will really have to extend itself to top the boner and mother-of-all-corrections at the American Journal of Political Science. This is the journal that published a finding much beloved of liberals a few years back that purported to find scientific evidence that conservatives are more likely to exhibit traits associated with psychoticism, such as authoritarianism and tough-mindedness, and that the supposed “authoritarian” personality of conservatives might even have a genetic basis (and therefore be treatable someday?). Settle in with a cup or glass of your favorite beverage, and get ready to enjoy one of the most epic academic face plants ever.

The original article was called “Correlation not causation: the relationship between personality traits and political ideologies,” and was written by three academics at Virginia Commonwealth University. . . .

I had no recollection of this study but I forget lots of things so I decided to google my name and the name of the paper’s first author, and lo! this is what I found, a news article by Shannon Palus:

Researchers have fixed a number of papers after mistakenly reporting that people who hold conservative political beliefs are more likely to exhibit traits associated with psychoticism, such as authoritarianism and tough-mindedness. . . .

To help us make sense of the analysis, we turned to Andrew Gelman, a statistician at Columbia not involved with the work, to explain the AJPS paper to us. He said:

I don’t find this paper at all convincing, indeed I’m surprised it was accepted for publication by a leading political science journal. The causal analysis doesn’t make any sense to me, and some of the things they do are just bizarre, like declaring that correlations are “large enough for further consideration” if they are more than 0.2 for both sexes. Where does that come from? The whole thing is a mess.

He added:

It’s hard for me to care about reported effect sizes here…If the underlying analysis doesn’t make sense, who cares about the reported effect sizes?

Hey, now I remember! Oddly enough, Palus quotes one of the authors of the original paper as saying,

We only cared about the magnitude of the relationship and the source of it . . . None of our papers actually give a damn about whether it’s plus or minus.

How you can realistically expect to learn about the magnitude of a relationship and the source of it, without knowing about its sign, that one baffles me. And the author of the paper then adds to the confusion by saying,

[T]he correlations are spurious, so the direction or even magnitude is not suitable to elaborate on at all- that’s the point of all our papers and the general findings.

Now I’m even more puzzled as to how this paper got published in AJPS, which is a serious political science journal. We’re not talking Psychological Science or PPNAS here. I suspect the AJPS got snowed by all the talk of genetics. Social scientists can be such suckers sometimes!

Looking at the correction note by Brad Verhulst, Lindon Eaves, and Peter Hatemi, I see this:

Since these personality traits and their antecedents have been previously found to both positively and negatively predict liberalism, or not at all, the descriptive analyses did not appear abnormal to the authors, editors, reviewers or the general academy.

Wha??? OK, so you’re saying the data are all noise so who cares? You’ve convinced me not to care, that’s for sure!

Getting back to the original link above: I disagree with Steven Hayward’s claim that this is an “epic correction.” Embarrassing for sure, but given that it’s hard to take the original finding seriously, it’s hard for me to get very excited about the reversal either.

Good to see the error caught, in any case. I’m not at all kidding when I say that I expect more from AJPS than from Psych Sci or PPNAS.

P.S. I did some web search and noticed that Hatemi was also a coauthor of a silly paper about the politics of smell; see here for my skeptical take on that one.

Avoiding model selection in Bayesian social research

One of my favorites, from 1995.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:

– “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”

– “How can BIC select a model that does not fit the data over one that does”

– “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web [link fixed] with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

P.S. Yes, I’ve blogged this one before, also here. But I found out that not everyone knows about this paper so I’m sharing it again here.


A journalist sent me a bunch of questions regarding problems with polls. Here was my reply:

In answer to your question, no, the polls in Brexit did not fail. They were pretty good. See here and here.

The polls also successfully estimated Donald Trump’s success in the Republican primary election.

I think that poll responses are generally sincere. Polls are not perfect because they miss many people, hence pollsters need to make adjustments; see for example here and here.

I hope this helps.

We have a ways to go in communicating the replication crisis

I happened to come across this old post today with this amazing, amazing quote from a Harvard University public relations writer:

The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.

This came up in the context of a paper by Daniel Gilbert et al. defending the reputation of social psychology, a field that has recently been shredded—and rightly so—by revelations of questionable research practices, p-hacking, gardens of forking paths, and high-profile failed replications.

When I came across the above quote, I mocked it, but in retrospect I think it hadn’t disturbed me enough. The trouble was that I was associating it with Gilbert et al.: those guys don’t know a lot of statistics so it didn’t really surprise me that they could be so innumerate. I let the publicist off the hook on the grounds that he was following the lead of some Harvard professors. Harvard professors can make mistakes or even be wrong on purpose, but it’s not typically the job of a Harvard publicist to concern himself with such possibilities.

But now, on reflection, I’m disturbed. That statement about the 100% replication rate is so wrong, it’s so inane, I’m bothered that it didn’t trigger some sort of switch in the publicist’s brain.

Consider the following statements:

“Harvard physicist builds perpetual motion machine”

“Harvard biologist discovers evidence for creationism”

That wouldn’t happen, right? The P.R. guy would sniff that something’s up. This isn’t the University of Utah, right?

I’m not saying Harvard’s always right. Harvard has publicized the power pose and all sorts of silly things. But the idea of a 100% replication rate, that’s not just silly or unproven or speculative or even mistaken: it’s obviously wrong. It’s ridiculous.

But the P.R. guy didn’t realize it. If a Harvard prof told him about a perpetual motion machine or proof of creationism, the public relations officer would make a few calls before running the story. But a 100% replication rate? Sure, why not, he must’ve thought.

We have a ways to go. We’ll always have research slip-ups and publicized claims that fall through, but let’s hope it’s not much longer that people can claim 100% replication rates with a straight face. That’s just embarrassing.

P.S. I have to keep adding these postscripts . . . I wrote this post months ago, it just happens to be appearing now, at a time in which we’re talking a lot about the replication crisis.

Mathematica, now with Stan

Vincent Picaud developed a Mathematica interface to Stan.

You can find everything you need to get started by following the link above. If you have questions, comments, or suggestions, please let us know through the Stan user’s group or the GitHub issue tracker.

MathematicaStan interfaces to Stan through a CmdStan process.

Stan programs are portable across interfaces.

The Psychological Science stereotype paradox


Lee Jussim, Jarret Crawford, and Rachel Rubinstein just published a paper in Psychological Science that begins,

Are stereotypes accurate or inaccurate? We summarize evidence that stereotype accuracy is one of the largest and most replicable findings in social psychology. We address controversies in this literature, including the long-standing and continuing but unjustified emphasis on stereotype inaccuracy . . .

I haven’t read the paper in detail but I imagine that a claim that stereotypes are accurate will depend strongly on the definition of “accuracy.”

But what I really want to talk about is this paradox:

My stereotype about a Psychological Science article is that it is an exercise in noise mining, followed by hype. But this Psychological Science paper says that stereotypes are accurate. So if the article is true, then my stereotype is accurate, and the article is just hype, in which case stereotypes are not accurate, in which case the paper might actually be correct, in which case stereotypes might actually be accurate . . . now I’m getting dizzy!

P.S. Jussim has a long and interesting discussion in the comments. I should perhaps clarify that my above claim of a “paradox” was a joke! I understand about variability.

Webinar: Introduction to Bayesian Data Analysis and Stan

This post is by Eric.

We are starting a series of free webinars about Stan, Bayesian inference, decision theory, and model building. The first webinar will be held on Tuesday, October 25 at 11:00 AM EDT. You can register here.

Stan is a free and open-source probabilistic programming language and Bayesian inference engine. In this talk, we will demonstrate the use of Stan for some small problems in sports ranking, nonlinear regression, mixture modeling, and decision analysis, to illustrate the general idea that Bayesian data analysis involves model building, model fitting, and model checking. One of our major motivations in building Stan is to efficiently fit complex models to data, and Stan has indeed been used for this purpose in social, biological, and physical sciences, engineering, and business. The purpose of the present webinar is to demonstrate using simple examples how one can directly specify and fit models in Stan and make logical decisions under uncertainty.

Advice on setting up audio for your podcast

Jennifer and I were getting ready to do our podcast, and in preparation we got some advice from Enrico Bertini and the Data Stories team:

1) Multitracking. The best way is to multitrack and have each person record locally (note: this is easier if you are in different rooms/locations). Multitracking gives you a lot of freedom in the postediting phase. You can fix when voice overlaps, remove various noises and utterances, adjust volume levels etc. If you are in the same room you can still multitrack but it’s more complex.

2) Microphone. Having good (even high-end) mics makes a huge difference. When you hear the difference between a good mic and your average iPhone earbuds it’s stunning! With good mics you sound like a pro, without you sound meh … Here you have many many options.
You can use a USB mic made for podcasting and plug it into your computer (the Rode Podcaster is great; I have a Yeti also but it’s not as good).
You can buy a standalone recorder (we have a Zoom and we love it).
You can buy high-end condenser mics and plug them into a mixer.

3) Recording device. Recording on your computer is fine. We record most of our sound using our Mac and QuickTime. Very easy and straightforward. When I use the Zoom I record directly on the mic since it is also a recorder. Recording with an iPhone is not good enough.

4) Remote communication. If you are located remotely and/or have a remote guest, you can (should) keep recording locally but you still have to communicate. We have used Skype or Hangout with mixed results. When there are too many people or someone has a slow network it’s a real pain. We are still struggling with this ourselves. Hangout seems to be a bit more reliable. One good thing with Skype is that you can record within it and make sure you always have a backup. Backups and redundancy are crucial. Things do go wrong sometimes in very unexpected ways!

5) Noise. It’s important to reduce noise in your environment. In particular, turn phones off or put them in airplane mode, and avoid interruptions and ambient noise (even birds can be a problem!). Sometimes the sound coming from your headphones can also be picked up by your mic, so you need to be careful.

6) Synchronization. When you have multiple tracks you have to find a way to sync them. We have a very low-fi trick. We ask our guest to count backward 3, 2, 1 and clap, and we put our headphones close to our mics (not sure how others have solved this problem).

7) Audio postproduction. There are tons of things that can be done after the recording. We have a fantastic person working for us who is a pro. I don’t know all the details of the filter he uses. But he does cut things down when we are too verbose or make mistakes. This is priceless.

I [Bertini] think the most important thing to know is if you are planning to be in the same room or not and if you are going to have guests. The set up can change considerably according to what kind of combination you have.

Should Jonah Lehrer be a junior Gladwell? Does he have any other options?

Remember Jonah Lehrer—that science writer from a few years back whose reputation was tarnished after some plagiarism and fabrication scandals? He’s been blogging—on science!

And he’s on to some of the usual suspects: Ellen Langer’s mindfulness (see here for the skeptical take) and—hey—“an important new paper [by] Kyla Haimovitz and Carol Dweck” (see here for background).

Also a post, “Can Tetris Help Us Cope With Traumatic Memories?” and another on the recently-questioned “ego-depletion effect.”

And lots more along these lines. It’s This Week in Psychological Science in blog form. Nothing yet on himmicanes or power pose, but just give him time.

Say what you want about Malcolm Gladwell, he at least tries to weigh the evidence and he’s willing to be skeptical. Lehrer goes all-in, every time.

It’s funny: they say you can’t scam a scammer, but it looks like Lehrer’s getting scammed, over and over again. This guy seems to be the perfect patsy for every Psychological Science press release that’s ever existed.

But how could he be so clueless? Perhaps he’s auditioning for the role of press agent: he’d like to write an inspirational “business book” with someone who does one of these experiments, so he’s getting into practice and promoting himself by writing these little posts. He’d rather write them as New York Times magazine articles or whatever but that path is (currently) blocked for him so he’s putting himself out there as best he can. From this perspective, Lehrer has every incentive to believe all these Psychological Science-style claims. It’s not that he’s made the affirmative decision to believe, it’s more that he gently turns off the critical part of his brain, the same way that a sports reporter might only focus on the good things about the home team.

Lehrer’s in a tough spot, though, as he doesn’t have that much going for him. He’s a smooth writer, but there are other smooth writers out there. He can talk science, but he can’t get to any depth. He used to have excellent media contacts, but I suspect he’s burned most of those bridges. And there’s a lot of competition out there, lots of great science writers. So he’s in a tough spot. He might have to go out and get a real job at some point.

Some people are so easy to contact and some people aren’t.

I was reading Cowboys Full, James McManus’s entertaining history of poker (but way too much on the so-called World Series of Poker), and I skimmed the index to look up some of my favorite poker writers. Frank Wallace and David Spanier were both there but only got brief mentions in the text, I was disappointed to see. I guess McManus and I have different taste. Fair enough. I also looked up Patrick Marber, author of the wonderful poker-themed play, Dealer’s Choice. Marber was not in the index.

And this brings me to the subject of today’s post. Anyone who wants can reach me by email or even call me on the phone. That’s how it is with college teachers: we’re accessible, that’s part of our job. But authors, not so much. Even authors much more obscure than James McManus typically don’t make themselves easy to contact. Maybe they don’t want to be bothered, maybe it’s just tradition, I dunno. But I think they’re missing out. McManus does seem to have a twitter account, but that doesn’t work for me. I just want to send the guy an email.

People can, of course, duck emails. I tried a couple times to contact Paul Gertler about the effect of the statistical significance filter on his claimed effects of early childhood intervention, and I have it on good authority that he received my email but just chose not to respond, I assume feeling that his life would be simpler if he were not to have to worry about that particular statistical bias. And of course famous people have to guard their time, so I usually don’t get responses from the likes of Paul Krugman, Malcolm Gladwell, David Brooks, or Nate Silver. (That last one is particularly ironic given that people are always asking me for Nate’s email. I typically give them the email but warn them that Nate might not respond.)

Anyway, I have no problem at all with famous people not returning my emails—if they responded to all the emails they received from statistics professors, they’d probably have no time for anything else, and they’d be reduced to a Stallman-esque existence.

And, while I disapprove of the likes of Gertler not responding to emails of mine making critical comments on their work, hey, that’s his choice: if he doesn’t want to improve his statistics, there’s nothing much I can do about it.

But it’s too bad it’s not so easy to directly reach people like James McManus, or Thomas Mallon, or George Pelecanos. I think they’d be interested in the stories I would share with them.

P.S. In his book, McManus does go overboard in a few places, including his idealization of Barack Obama (all too consistent with the publication date of 2009) and this bit of sub-Nicholas-Wade theorizing:

[Two screenshots of the relevant passage from McManus’s book]

Aahhhh, so that’s what it was like back in the old days! Good that we have an old-timer like James McManus to remember it for us.

But that’s just a minor issue. Overall, I like the book. All of us are products of our times, so it’s no big deal if a book has a few false notes like this.

Opportunity for publishing preregistered analyses of the 2016 American National Election Study

Brendan Nyhan writes:

Have you heard about the Election Research Preacceptance Competition that Skip Lupia and I are organizing to promote preaccepted articles? Details here. A number of top journals have agreed to consider preaccepted articles that include data from the ANES. Authors who publish qualifying entries can win a $2,000 prize. We’re eager to let people know about the opportunity and to promote better scientific publishing practices.

The page in question is titled, “November 8, 2016: what really happened,” which reminded me of this election-night post of mine from eight years ago entitled, “Election 2008: what really happened.”

I could be wrong, but I’m guessing that a post such as mine would not have much of a chance in this competition, which is designed to reward “an article in which the hypotheses and design were registered before the data were publicly available.” The idea is that the proposed analyses would be performed on the 2016 American National Election Study, data from which will be released in Apr 2017. I suppose it would be possible to take a post such as mine and come up with hypotheses that could be tested using ANES data but it wouldn’t be so natural.

So, I think this project of Nyhan and Lupia has some of the strengths and weaknesses of aspects of the replication movement in science.

The strengths are that the competition’s rules are transparent and roughly equally open to all, in a way that, for example, publication in PPNAS does not seem to be. Also, of course, preregistration minimizes researcher degrees of freedom which allows p-values to be more interpretable.

The minuses are the connection to the existing system of journals; the framing as a competition; the restriction to a single relatively small dataset; and a hypothesis-testing framework which (a) points toward confirmation rather than discovery, and (b) would seem to favor narrow inquiries rather than broader investigations. Again, I’m concerned that my own “Election 2008: what really happened” story wouldn’t fit into this framework.

Overall I think this project of Nyhan and Lupia is a good idea and I’m not complaining about it at all. Sure, it’s limited, but it’s only one of many opportunities out there. Researchers who want to test specific hypotheses can enter this competition. Hypothesis testing isn’t my thing, but nothing’s stopping me or others from posting whatever we do on blogs, Arxiv, SSRN, etc. There’s room for lots of approaches, and, at the very least, this effort should encourage some researchers to use ANES more intensively than they otherwise would have.

“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Kevin Lewis sends along this article by Laura Pritschet, Derek Powell, and Zachary Horne, who write:

Some effects are statistically significant. Other effects do not reach the threshold of statistical significance and are sometimes described as “marginally significant” or as “approaching significance.” Although the concept of marginal significance is widely deployed in academic psychology, there has been very little systematic examination of psychologists’ attitudes toward these effects. Here, we report an observational study in which we investigated psychologists’ attitudes concerning marginal significance by examining their language in over 1,500 articles published in top-tier cognitive, developmental, and social psychology journals. We observed a large change over the course of four decades in psychologists’ tendency to describe a p value as marginally significant, and overall rates of use appear to differ across subfields. We discuss possible explanations for these findings, as well as their implications for psychological research.

The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time (as discussed, for example, in this recent comment thread about a 2016 Psychological Science paper by Haimovitz and Dweck), so it’s worth examining the prevalence of this error, as Pritschet et al. do.

Let me first briefly explain why categorizing based on p-values is such a bad idea. Consider, for example, this division: “really significant” for p less than .01, “significant” for p less than .05, “marginally significant” for p less than .1, and “not at all significant” otherwise. And consider some typical p-values in these ranges: say, p=.005, p=.03, p=.08, and p=.2. Now translate these two-sided p-values back into z-scores, which we can do in R via qnorm(1 - c(.005, .03, .08, .2)/2), yielding the z-scores 2.8, 2.2, 1.8, 1.3. The seemingly yawning gap between the “not at all significant” p-value of .2 and the “really significant” p-value of .005 corresponds to a difference in z-scores of only 1.5. Indeed, if you had two independent experiments with these z-scores and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4—completely consistent with noise. This is the point that Hal Stern and I made in our paper from a few years back.
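For concreteness, here is that arithmetic as a short R snippet (it simply reproduces the numbers above):

  # Convert two-sided p-values back to z-scores
  p <- c(0.005, 0.03, 0.08, 0.2)
  z <- qnorm(1 - p / 2)
  round(z, 1)                  # 2.8 2.2 1.8 1.3

  # Compare the "really significant" and "not at all significant" results,
  # treating them as independent estimates with equal standard errors (= 1)
  z[1] - z[4]                  # difference of about 1.5
  sqrt(1^2 + 1^2)              # standard error of the difference, about 1.4
  (z[1] - z[4]) / sqrt(2)      # about 1.1, well within the range of noise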

From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect—and in psychology studies, nobody’s interested in the null hypothesis. Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.

From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).

P.S. Related is this paper from a few years ago, “Erroneous analyses of interactions in neuroscience: a problem of significance,” by Sander Nieuwenhuis, Birte Forstmann, and E. J. Wagenmakers, who wrote:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience. We discuss scenarios in which the erroneous procedure is particularly beguiling.

It’s a problem.

P.S. Amusingly enough, just a couple days ago we discussed an abstract that had a “marginal significant” in it.

Is it fair to use Bayesian reasoning to convict someone of a crime?

Ethan Bolker sends along this news article from the Boston Globe:

If it doesn’t acquit, it must fit

Judges and juries are only human, and as such, their brains tend to see patterns, even if the evidence isn’t all there. In a new study, researchers first presented people with pieces of evidence (a confession, an eyewitness identification, an alibi, a motive) in separate contexts. Then, similar pieces of evidence were presented together in the context of a single criminal case. Although judgments of the probative value of each piece of evidence were uncorrelated when considered separately, their probative value became significantly correlated when considered together. In other words, perceiving one piece of evidence as confirming guilt caused other pieces of evidence to become more confirming of guilt too. For example, among people who ended up reaching a guilty verdict, the same kind of confession was considered more voluntary when considered alongside other evidence than when it had been considered in isolation.

Greenspan, R. & Scurich, N., The Interdependence of Perceived Confession Voluntariness and Case Evidence, Law and Human Behavior (forthcoming).

Bolker writes:

The tone suggests that this observation—“perceiving one piece of evidence as confirming guilt caused other pieces of evidence to become more confirming of guilt too”—reflects an inability to weigh evidence, but to me it makes Bayesian sense: each piece influences the priors for the others.

I agree. It seems like a judicial example of a familiar tension from statistical analysis: when do we want to simply be summarizing the data at hand, and when do we want to “collapse the wave function,” as it were, and perform inference for underlying parameters.
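To see why this can be coherent Bayesian updating rather than a bias, here is a toy odds-form sketch in R; the prior and likelihood ratios are made-up numbers for illustration only:

  # Hypothetical numbers: prior odds of guilt and likelihood ratios
  # for two pieces of evidence (say, a confession and an alibi)
  prior_odds    <- 0.25 / 0.75   # prior P(guilt) = 0.25
  lr_confession <- 4             # evidence favoring guilt
  lr_alibi      <- 0.5           # evidence favoring innocence
  odds_to_prob  <- function(o) o / (1 + o)

  # Each piece judged in isolation, against the same prior:
  odds_to_prob(prior_odds * lr_confession)   # about 0.57
  odds_to_prob(prior_odds * lr_alibi)        # about 0.14

  # Judged together, the posterior after one piece becomes the prior for
  # the next, so assessments of the two pieces are naturally linked:
  odds_to_prob(prior_odds * lr_confession * lr_alibi)   # 0.40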

Tenure Track Professor in Machine Learning, Aalto University, Finland

Posted by Aki.

I promise that next time I’ll post something other than a job advertisement, but before that, here’s another great opportunity to join Aalto University, where I also work.

“We are looking for a professor to either further strengthen our strong research fields, with keywords including statistical machine learning, probabilistic modelling, Bayesian inference, kernel methods, computational statistics, or complementing them with deep learning. Collaboration with other fields is welcome, with local opportunities both at Aalto and University of Helsinki. A joint appointment with the Helsinki Institute for Information Technology HIIT a joint research centre with University of Helsinki, can be negotiated.”

Naturally, I would hope we could get someone who is also interested in probabilistic modelling and Bayesian inference :)

See more details here.

Applying the “If there’s no report you can read, there’s no study” principle in real time


So, I was on the website of the New York Times and came across this story by Donna de la Cruz:

Opioids May Interfere With Parenting Instincts, Study Finds . . .

Researchers at the Perelman School of Medicine at the University of Pennsylvania scanned the brains of 47 men and women before and after they underwent treatment for opioid dependence. While in the scanner, the study subjects looked at various images of babies, and the researchers measured the brain’s response. . . . Sometimes the babies’ features were exaggerated to make them even more adorable; in others, the chubby cheeks and big eyes were reduced, making the faces less appealing. . . .

Compared with the brains of healthy people, the brains of people with opioid dependence didn’t produce strong responses to the cute baby pictures. But once the opioid-dependent people received a drug called naltrexone, which blocks the effects of opioids, their brains produced a more normal response. . . .

Interesting, and full credit to the Times for inserting the qualifier “may” into the headline.

Anyway, the article continues:

The study, among the first to look at the effects of opioid dependence and how its treatment affects social cognition, was presented last month at the European College of Neuropsychopharmacology Congress in Vienna.

OK, where’s the research paper? Where are the data? The Times article provides a link which leads to a short abstract but no paper. Here’s the key paragraph from the abstract:

Forty-seven opioid dependent patients and 25 controls underwent two functional magnetic resonance imaging sessions, approximately ten days apart, while viewing infant portraits with high and low baby schema content and rating them for cuteness. Immediately after the first session, patients received an injection of extended-release naltrexone, while controls received no treatment. The repeated-measures ANOVA revealed a marginal significant main effect of Group [F(1,42) = 3.90, p = 0.06] with greater ΔRating (i.e. RatingHigh – RatingLow) in the patient group, but no main effects of Session [F(1,42) = 0.00, p = 0.99] or Gender [F(1,42) = 2.00, p = 0.17]. Among patients, self-reported craving was significantly reduced [F(1,24) = 45.76, p < 0.001] after the injection (i.e. On-XRNTX), but there was no gender difference [F(1,24) = 3.15, p = 0.09]. Whole brain analysis of variance showed Gender by Group by Session interaction in the ventral striatum. Brain responses increased in female patients and decreased in male patients across sessions, while the pattern was reversed in the controls. We found that the behavioral effect of baby schema, indexed by “cuteness” ratings, was present across all participant categories, without significant differences between patients and controls, genders or sessions. The pattern of the brain response to baby schema includes insulae, inferior frontal gyri, MPFC and the parietal cortex, in addition to the ventral striatum, caudate and fusiform gurus reported by Glocker et al. (2009) [4] in healthy nulliparous women.

I can’t quite follow what they did, but of course the phrase “marginal significant” sets off an alarm—not because I’m a “p less than .05” purist but because it makes me think of forking paths. There’s also the issue of assigned cuteness versus rated cuteness, interactions with sex and maybe other patient-level characteristics, the difference between significant and non-significant, not to mention options in the outcome measure (what was referred to in the news article as “strong responses”), and whatever zillion degrees of freedom were available from the MRI data.

This is not to say that the conclusions of the study are wrong, just that I have no idea. Just as I have no idea about the gay-gene study, which was one of our original inspirations for the principle, “If there’s no report you can read, there’s no study.”

I googled the title of the paper, “Sustained opioid antagonism increases striatal sensitivity to baby schema in opioid dependent women,” but all I could find was that abstract and an unsigned press release from 19 Sept on a website called MedicalXpress.

Again, I have no idea if this study’s claims are correct, if they have good evidence for their claims, or if the study is useful in some way. (These are three different questions!) In any case, the topic is important and I have no problem with the Times writing about research in this area. But . . . if there’s no report you can read, there’s no study. It’s not about whether it’s peer-reviewed. In this case, there’s nothing to review. An abstract and a press release just don’t cut it.

I know nothing about this research area, and the people who did this project may be doing wonderful work. I’m sure that at some point they’ll write a paper that people can read, and at that point, there’s something to report on.

Stan case studies!


In the spirit of reproducible research, we (that is, Bob*) set up this beautiful page of Stan case studies.

Check it out.

* Bob here. Michael set the site up, I set this page up, and lots of people have contributed case studies and we’re always looking for more to publish.