
Review of The Martian


I actually read this a couple months ago after Bob recommended it to me. I don’t know why I did this, given that I hated the last book Bob recommended, but in this case I made a good choice. The Martian was excellent and was indeed hard to set down.

Recently I’ve been seeing ads for the movie so I thought this was the right time to post a review. Don’t worry, no spoilers here.

I have lots of positive things to say but they’d pretty much involve spoilers of one sort or another so you’ll just have to take my word for it that I liked it.

On the negative side: I have only two criticisms of the book. The first is that the characters have no personality at all. OK, the main character has a little bit of personality—not a lot, but enough to get by. But the other characters: no, not really. That’s fine, the book is wonderful as it is, and doesn’t need any more characterization to do what it does, but I think it would’ve been better had the author not even tried. As it is, there are about 10 minor characters whom it’s hard to keep straight—they’re all different flavors of hardworking idealists—and I think it would’ve worked better to not even try to differentiate them. As it is, it’s a mess trying to keep track of who has what name and who does what job.

My more serious criticism concerns the ending. Again, no spoilers, and the ending is not terrible—at a technical level it’s somewhat satisfying (I’m not enough of a physicist to say more than that), but at the level of construction of a story arc, it didn’t really work for me.

Here’s what I think of the ending. The Martian is structured as a series of challenges: one at a time, there is a difficult or seemingly insurmountable problem that the character or characters solve, or try to solve, in some way. A lot of the fun comes when the solution of problem A leads to problem B later on. It’s an excellent metaphor for life (although not stated that way in the book; one of the strengths of The Martian is that the whole thing is played straight, so that the reader can draw the larger conclusions for him or herself).

OK, fine. So what I think is that Andy Weir, the author of The Martian, should’ve considered the ending of the book to be a challenge, not for his astronaut hero, but for himself: how to end the book in a satisfying way? It’s a challenge. A pure “win” for the hero would just feel too easy, but just leaving him on Mars or having him float off into space on his own, that’s just cheap pathos. And, given the structure of the book, an indeterminate ending would just be a cheat.

So how to do it? How to make an ending that works, on dramatic terms? I don’t know. I’m no novelist. All I do know is that, for me, the ending that Weir chose didn’t do the job. And I conjecture that had Weir framed it to himself as a problem to be solved with ingenuity, maybe he could’ve done it. It’s not easy—the great Tom Wolfe had problems with endings too—but it’s my impression that Weir would be up for the job; he seems pretty resourceful.

Something to look forward to in his next book, I suppose.

On deck this week

Mon: Review of The Martian

Tues: Even though it’s published in a top psychology journal, she still doesn’t believe it

Wed: Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Thurs: Medical decision making under uncertainty

Fri: Unreplicable

Sat: “The frequentist case against the significance test”

Sun: Erdos bio for kids

War, Numbers and Human Losses

That’s the title of Mike Spagat’s new blog.

In his most recent post, Spagat disputes the claim that “at least 240,000 Syrians have died violently since the civil war flared up four years ago.”

I am not an expert in this area so I offer no judgment on these particular numbers, but in any case I think that the sort of open discussion offered by Spagat is useful.


Reflecting on the recent psychology replication study (see also here), journalist Megan McArdle writes an excellent column on why we fall for bogus research:

The problem is not individual research papers, or even the field of psychology. It’s the way that academic culture filters papers, and the way that the larger society gets their results. . . .

Journalists . . . easily fall into the habit (and I’m sure an enterprising reader can come up with at least one example on my part), of treating studies not as a potentially interesting result from a single and usually small group of subjects, but as a True Fact About the World. Many bad articles get written using the words “studies show,” in which some speculative finding is blown up into an incontrovertible certainty.

I’d just replace “Journalists” by “Journalists and researchers” in the above paragraph. And then there are the P.R. excesses coming from scientific journals and universities. Researchers are, unfortunately, active participants in the exaggeration process.

McArdle continues:

Psychology studies also suffer from a certain limitation of the study population. Journalists who find themselves tempted to write “studies show that people …” should try replacing that phrase with “studies show that small groups of affluent psychology majors …” and see if they still want to write the article.

Indeed. Instead of saying “men’s upper-body strength,” try saying “college students with fat arms,” and see how that sounds!

More from McArdle:

We reward people not for digging into something interesting and emerging with great questions and fresh uncertainty, but for coming away from their investigation with an outlier — something really extraordinary and unusual. When we do that, we’re selecting for stories that are too frequently, well, incredible. This is true of academics, who get rewarded with plum jobs not for building well-designed studies that offer messy and hard-to-interpret results, but for generating interesting findings.

Likewise, journalists are not rewarded for writing stories that say “Gee, everything’s complicated, it’s hard to tell what’s true, and I don’t really have a clear narrative with heroes and villains.” Readers like a neat package with a clear villain and a hero, or at least clear science that can tell them what to do. How do you get that story? That’s right, by picking out the outliers. Effectively, academia selects for outliers, and then we select for the outliers among the outliers, and then everyone’s surprised that so many “facts” about diet and human psychology turn out to be overstated, or just plain wrong. . . .

Because a big part of learning is the null results, the “maybe but maybe not,” and the “Yeah, I’m not sure either, but this doesn’t look quite right.”

Yup. None of this will be new to regular readers of this blog, but it’s good to see it explained so clearly from a journalist’s perspective.

Why is this double-y-axis graph not so bad?

Usually I (and other statisticians who think a lot about graphics) can’t stand this sort of graph that overloads the y-axis:


But this example from Isabel Scott and Nicholas Pound actually isn’t so bad at all! The left axis should have a lower bound at 0—it’s not possible for conception risk to be negative—but, other than that, the graph works well.

What’s usually the problem, then? I think the usual problem with double-y-axis graphs is that attention is drawn to the point at which the lines cross.

Here’s an example. I was searching the blog for double-y-axis graphs but couldn’t easily find any, so I googled and came across this:


Forget the context and the details—I just picked it out to have a quick example. The point is, when the y-axes are different, the lines could cross anywhere—or they don’t need to cross at all. Also you can make the graph look like whatever you want by scaling the axes.

The top graph above works because the message is that conception risk varies during the monthly cycle while political conservatism doesn’t. It’s still a bit of a cheat—the scale for conception risk just covers the data while for conservatism they use the full 1-6 scale—but, overall, they still get their message across.
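
Here’s a little matplotlib sketch of that arbitrariness, with made-up data: the same two series plotted three times on twin y-axes, changing nothing but the right-hand axis limits. The crossing point jumps around, or disappears entirely, depending only on how the second axis is scaled.

```python
# Made-up data: the same two series on twin y-axes, three times over.
# Only the right-hand axis limits change from panel to panel, yet the
# apparent crossing point of the lines moves (or vanishes).
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(2000, 2016)                 # hypothetical years
series_a = 40 + 2.0 * (x - 2000)          # e.g., some index on the left axis
series_b = 5 + 0.1 * (x - 2000)           # e.g., a rate on the right axis

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax_left, ylim_right in zip(axes, [(0, 10), (4, 7), (0, 30)]):
    ax_right = ax_left.twinx()            # second y-axis sharing the same x-axis
    ax_left.plot(x, series_a, color="tab:blue")
    ax_right.plot(x, series_b, color="tab:red")
    ax_left.set_ylim(0, 100)
    ax_right.set_ylim(*ylim_right)        # the only thing that differs across panels
    ax_left.set_title(f"right axis limits = {ylim_right}")

plt.tight_layout()
plt.show()
```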

Being polite vs. saying what we really think


We recently discussed an article by Isabel Scott and Nicholas Pound entitled, “Menstrual Cycle Phase Does Not Predict Political Conservatism,” in which Scott and Pound definitively shot down some research that was so ridiculous it never even deserved the dignity of being shot down. The trouble is, the original article, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” had been published in Psychological Science, a journal which is psychology’s version of Science and Nature and PPNAS, a “tabloid” that goes for short sloppy papers with headline-worthy claims. So, Scott and Pound went to the trouble of trying and failing to replicate; as they reported:

We found no evidence of a relationship between estimated cyclical fertility changes and conservatism, and no evidence of an interaction between relationship status and cyclical fertility in determining political attitudes.

The thing that bugged me was when Scott and Pound wrote:

Our results are therefore difficult to reconcile with those of Durante et al, particularly since we attempted the analyses using a range of approaches and exclusion criteria, including tests similar to those used by Durante et al, and our results were similar under all of them.

As I wrote earlier: Huh? Why “difficult to reconcile”? The reconciliation seems obvious to me: There’s no evidence of anything going on here. Durante et al. had a small noisy dataset and went all garden-of-forking-paths on it. And they found a statistically significant comparison in one of their interactions. No news here.

Scott and Pound continued:

Lack of statistical power does not seem a likely explanation for the discrepancy between our results and those reported in Durante et al, since even after the most restrictive exclusion criteria were applied, we retained a sample large enough to detect a moderate effect . . .

Again, I feel like they’re missing the elephant in the room: “Lack of statistical power” is exactly what was going on with the original study by Durante et al.: They were estimating tiny effects in a very noisy way. It was a kangaroo situation.

Anyway, I suspect (but am not sure) that Scott and Pound agree with me that Durante et al. were chasing noise, and that there’s no problem at all reconciling what had been found in that earlier study (bits of statistical significance in the garden of forking paths) with what was found in the replication (no pattern).

And I suspect (but am not sure) that Scott and Pound wrote the way they did in order to make a sort of minimalist argument. Instead of saying that the earlier study is consistent with pure noise, and their replication is also consistent with that, they make a weaker, innocent-sounding statement that the two studies are “difficult to reconcile,” leaving the rest of us to read between the lines.

And so here’s my question: When is it appropriate to make a minimalist argument, and when is it appropriate to say what you really think?

My earlier post had a comment from Mb, who wrote:

By not being able to explain the discrepancy with problems of the replication study or “theoretically interesting” measurement differences, they are showing that the non-replication is likely due to low power etc of the original study. It is a rhetorical device to convince those skeptical of the replication.

I replied that this makes sense. I just wonder when such rhetorical devices are a good idea. The question is whether it makes sense to say what you really think, or whether it’s better to understate your case to make a more bulletproof argument. This has come up occasionally in blog comments: I’ll say XYZ and a commenter will say I should’ve just said XY, or even just X, because that would make my case stronger. My usual reply is that I’m not trying to make a case, I’m just trying to share my understanding of the problem. But I know that’s not the only game.

P.S. Just to clarify: I think strategic communication and honesty are both valid approaches. I’m not saying my no-filter approach is best, nor do I think it’s correct to say that savviness is always the right way to go either. It’s probably good that there are people like me who speak our minds, and people like Scott and Pound who are more careful. (Or of course maybe Scott and Pound don’t agree with me on this at all; I’m just imputing to them my attitude on the scientific questions here.)

Let’s apply for some of that sweet, sweet National Sanitation Foundation funding


Paul Alper pointed me to this news article about where the bacteria and fungi hang out on airplanes. This is a topic that doesn’t interest me at all, but then I noticed this, at the very end of the article:

Note: A previous version of this article cited the National Science Foundation rather than the National Sanitation Foundation. The post has been updated.

That’s just beautiful. We should definitely try to get a National Sanitation Foundation grant for Stan. I’m sure there are many trash-related applications for which we could make a real difference.

Meet Teletherm, the hot new climate change statistic!


Peter Dodds, Lewis Mitchell, Andrew Reagan, and Christopher Danforth write:

We introduce, formalize, and explore what we believe are fundamental climatological and seasonal markers: the Summer and Winter Teletherm—the on-average hottest and coldest days of the year. We measure the Teletherms using 25 and 50 year averaging windows for 1218 stations in the contiguous United States and find strong, sometimes dramatic, shifts in Teletherm date, temperature, and extent. For climate change, Teletherm dynamics have profound implications for ecological, agricultural, and social systems, and we observe clear regional differences, as well as accelerations and reversals in the Teletherms.

Of course, the hottest and coldest days of the year are not themselves so important, but I think the idea is to have a convenient one-number summary for each season that allows us to track changes over time.
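
If you want to play with the idea yourself, the basic computation is easy enough. Here’s a rough Python sketch using simulated data (my own simplification; it ignores the smoothing and the 25- and 50-year windowing details in the paper): average the temperature by calendar day over a window of years, then take the hottest and coldest calendar days as the summer and winter teletherms.

```python
# Rough sketch of the teletherm idea (my simplification, not the authors' code):
# average temperature by calendar day over a window of years, then take the
# hottest and coldest calendar days as the summer and winter teletherms.
import numpy as np
import pandas as pd

def teletherms(daily, start_year, end_year):
    """daily: DataFrame with a 'date' column and a 'temp' column (daily mean temperature)."""
    d = daily.copy()
    d["date"] = pd.to_datetime(d["date"])
    window = d[(d["date"].dt.year >= start_year) & (d["date"].dt.year <= end_year)]
    by_day = window.groupby(window["date"].dt.dayofyear)["temp"].mean()
    return {"summer_teletherm_day": int(by_day.idxmax()),
            "winter_teletherm_day": int(by_day.idxmin())}

# Toy example: a sinusoidal seasonal cycle plus noise, standing in for station data.
rng = np.random.default_rng(0)
dates = pd.date_range("1990-01-01", "2014-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
temp = 10 - 15 * np.cos(2 * np.pi * (doy - 15) / 365.25) + rng.normal(0, 3, len(dates))
fake_station = pd.DataFrame({"date": dates, "temp": temp})

print(teletherms(fake_station, 1990, 2014))
# With this fake seasonal cycle, the hottest calendar day lands in mid-July
# and the coldest in mid-January, as you'd expect.
```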

One reason this is important is that one effect of climate change is to mess up synchrony across interacting species (for example, flowers and bugs, if you’ll forgive my lack of specific biological knowledge).

Dodds et al. put a lot of work into this project. For some reason they don’t want to display simple time series plots of how the teletherms move forward and backward over the decades, but they do have this sort of cool plot showing geographic variation in time trends.

[Figure from Dodds et al.: map showing geographic variation in teletherm time trends]

You’ll have to read their article to understand how to interpret this graph. But that’s ok, not every graph can be completely self-contained.

They also have lots more graphs and discussion here.

Irwin Shaw: “I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.”


A few weeks ago I picked up a paperback of stories by Irwin Shaw, printed in the late 1950s. I love these little pocket books—but this one was a bit too disposable: after about 50 pages the spine gave out and the pages started to fall out, which was a bit irritating because then I needed to hold the book with two hands as I was reading it, which makes it harder to read on the subway.

Anyway, Irwin Shaw (originally “Irwin Gilbert Shamforoff,” according to Wikipedia) wrote zillions of short stories for the New Yorker and other publications during the mid-twentieth century. The stories in this book (“Tip on a Dead Jockey”) were pretty good.

In his New York Times obituary, Herbert Mitgang wrote, “Stylistically, Mr. Shaw’s short stories were noted for their directness of language, the quick strokes with which he established his different characters, and a strong sense of plotting.”

Shaw’s writing struck me as being very much in the Hemingway style (The Short Happy Life of Francis Macomber, etc.) but with the stories pitched a bit lower in intensity, without the lions and the gunshots. It could be worth comparing him to John O’Hara, who published short stories at about the same time and in the same places as Shaw. O’Hara is a bit less smooth: there’s some way in which Shaw’s descriptions of scenes and characters just slide in smoothly, like Jeeves; in contrast, when O’Hara tells us something, it often seems to drop with a thud, like the suitcase falling from the hand of an exhausted traveler who enters the hotel room after a long and arduous journey.

Sorry for that last bit; I’m sure a professional writer could’ve done it better.

Anyway, to continue . . . O’Hara more than anyone else reminds me of George V. Higgins (which is backward, of course; Higgins was very openly a follower of O’Hara), while Shaw reminds me a bit of Hemingway (as noted above) and also John Updike (sorry!). Shaw and O’Hara also anticipate Updike and, for that matter, Raymond Carver, by investing day-to-day life with high drama.

Anyway, there’s something perhaps too smooth about Shaw, some way in which his writing seems to fall a bit short of “art,” and some way in which O’Hara, in his ill-fitting prose, seems more the artist.

I plan to read a bit more Shaw and see. It should be pleasant, in any case, and it’s also fun to read this sort of thing for the background, for the unquestioned assumptions that Shaw and so many others of his era believed in.

An online search turned up this 1953 interview by George Plimpton and John Phillips. The interview is a good read. It features various off-putting details (for example, Shaw liked to go to a casino and gamble—wha???) but I guess this just means that these people were men of their time. Also this bit which is a cliche but is still fun:

Interviewer: Did you like being a critic?

Shaw: Not much. It wore out the pleasure of going to the theater. There’s an almost unavoidable feeling of smugness, of self-satisfaction, of teacher’s pettishness, that sinks into a critic’s bones, and I was afraid of it. . . .

One thing this all did was lower my impressions of Norman Mailer, because he said things like this all the time. Mailer’s pseudo-Hemingway shtick is 100% anticipated by Shaw’s pseudo-Hemingway shtick, and I feel that Shaw earned it more.

But then there was this, which I just love:

Interviewer: One more question about criticism, if we may. Are you your own severest critic? Do you . . . ?

Shaw: I am forced to say that I have many fiercer critics than myself.

As a writer myself, I have to say that having any critics at all is a glorious luxury. To know that people are actually reading what you write! Of course, Shaw was writing for well-regarded mass-circulation magazines, so I guess he had no worries in that regard.

And this:

Interviewer: Do you really enjoy writing, Mr. Shaw?

Shaw: I used to enjoy it more. It’s tougher now, as one’s power dwindles.

He was only 40, for chrissake! But I guess 40 feels pretty old, when you’re 40. And the 1950s was a younger time than today, at least in the U.S. and Europe. There were young people all around, and not many old people, so I guess that 40 really felt like the downside of the curve.

And this:

Interviewer: What is thought of the writer in America, then?

Shaw: He’s a freak. People feel uncomfortable when he’s around. He has odd, inconsistent ways of making his living, and nine times out of ten he can’t earn his living by writing. He’s distrusted and maybe he’s subversive. . . . Now, attacking writers as among the most eggheaded of intellectuals is considered a good way of guaranteeing an election. I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.

Good line. I like that.

From an online mini-bio:

by the 1960s his critical reputation had suffered, as his novels were unfavorably compared to his earlier short stories. His marriage had failed by 1967, but during the next decade he produced some of his most popular novels, including Rich Man, Poor Man.

By the mid 1970s, Shaw was drinking heavily and his health was deteriorating. He and Marian reconciled and remarried in 1982, but he died in May of 1984 of prostate cancer.

I believe it. You can see the incipient sadness (not the cancer, of course) in the stories and in that interview. The guy was working so hard—writing story after story, each with its own setting and cast of characters, just writing one story after another and having to come up with a new world each time. How exhausting that’s gotta be! And then he seemed to have been working all the time when not at work: working to enjoy himself, to live a full life, to gamble and box and play tennis and be a husband and a father, to be a good friend to many, and, I assume, to have affairs: He was constructing his own ideal life from scratch, and that’s a lot of work. He didn’t have a convenient set of rules to live by, it wasn’t just about getting behind the typewriter and pounding away.

Anyway, you can’t keep this sort of lifestyle going for decades and decades. At some point you have to take a break. But Shaw’s relationship with his readers and critics does not seem to have been the sort that allowed much of a break.

After thinking about guys like this, it just makes me sad to consider brand-name authors such as James Patterson who just have a factory putting their novels out there. I find something admirable and poignant about Irwin Shaw, 40 years old, on top of the world, and knowing that, once you’re on top, there’s only one direction to go.

“Dow 36,000” guy offers an opinion on Tom Brady’s balls. The rest of us are supposed to listen?

Football season is returning so it’s time for us to return to that favorite statistical topic from the past football season: Tom Brady’s deflated balls.

Back in June, Jonathan Falk pointed me to this report.

You can click through if you’d like and take a look. I didn’t bother reading it because it had no graphs, just lots of text and some very long tables. Also I happened to notice the author list.

My response to Falk: “Kevin Hassett, huh? The Dow is at 36,000 and Tom Brady is innocent.”

Falk replied:

That part is just a bonus.

More seriously, both reports focus on unknown scenarios as to which gauge was used on which balls at which time. This is almost exactly like one of those standard philosophical setups of the anomalies of classical frequentist statistics. Y’know, the ones where you randomize between more and less imperfect measuring devices and improve your p values. (References available on request.) Neither report adds a hyperparameter for the probability of each gauge, preferring instead some sort of mishmash of permutation analysis and “cases analyzed.”

A couple days later he followed up:

OK. Now having spent a lot more time with both reports than I had before, while I stand by my previous remarks, more or less, there is one issue which actually deserves (IMO) wider dispersion.

The central difference between the original result and the Hassett et al. critique is the use of a variable which adjusts for the order in which the balls were measured at halftime. The original report considered such an adjustment but rejected it because the coefficient on the order variable was statistically insignificant. (p=0.299, fn. 49 [Gotta love the three decimal places there, no?]) Hassett, using a slightly different method, agrees that it is statistically insignificant but includes it because the effect is known, even by the authors of the original report, to be important to the physics of what went on, and its directionality (if not magnitude) is perfectly clear.

So, while I continue to believe that a full hierarchical Bayesian setup would have made the results much clearer, even more important is the statistical point that what matters is whether a variable belongs in the model, not whether its noise level covers an estimate of zero. Any Bayesian analysis would clearly have a strong prior on a positive order coefficient. Not to have such a prior would violate PV=nRT. For including a “statistically insignificant” variable, Hassett et al. should be commended.

Sure, but I still can’t get around that Dow thing. I just can’t trust anything this guy writes. I’m not saying he’s in the category of John Gribbin or Dr. Anil Potti or Stephen Jay Gould or Michael Lacour, but, still, once I lose trust in an analyst, it’s hard to motivate myself to go to the trouble to take anything he writes seriously. There are just too many ways for someone to distort an argument, if he or she has a demonstrated willingness to do so.

Matlab/Octave and Python demos for BDA3

My Bayesian Data Analysis course at Aalto University started today with a record number of 84 registered students!

In my course I have used some Matlab/Octave demos for several years. This summer Tuomas Sivula translated most of them to Python and Python notebooks.

Both the Matlab/Octave and Python demos are now available on GitHub, in the hope that they will be helpful to others, too. The Python notebooks can also be viewed directly on GitHub, with pre-generated figures displayed next to the code.

Matlab/Octave demos
Python demos


Comments on Imbens and Rubin causal inference book

Guido Imbens and Don Rubin recently came out with a book on causal inference. The book’s great (of course I would say that, as I’ve collaborated with both authors) and it’s so popular that I keep having to get new copies because people keep borrowing my copy and not returning it. Imbens and Rubin come from social science and econometrics. Meanwhile, Miguel Hernan and Jamie Robins are finishing up their own book on causal inference, which has more of a biostatistics focus. If you read both these books you’ll be in great shape.

Anyway, rather than reviewing the Imbens and Rubin book, I thought I’d just post the comments I sent on the book to the publisher, when they were asking for reviews of the manuscript, back in 2006.

Comments on table of contents and the 5 sample chapters of Causal Inference in Statistics, by Rubin and Imbens

General Comments

First off, Rubin and Imbens are the leaders in the field of causal inference. Rubin also has an excellent track record, both as a researcher and as a book author. So my overall recommendation is that you publish the book exactly as Rubin and Imbens want it to be. My suggestions are just suggestions, nothing more, and I recommend completely trusting the authors’ judgments about how the book should be. These two authors are such trailblazers in this area that I can only defer to their expertise.

My general comment about the book is that it reminds me of the Little & Rubin book on missing data, in two ways:

1. The book is conceptual, more of a “how does it work” than a “how to”. (I think that a lot of the “how to” is in our forthcoming book (Gelman & Hill) so I hope that the authors can take a look and make some appropriate cross-cites to help out the reader who wants to apply some of these methods.)

2. The book spends a lot of space on methods that I don’t think the authors would generally use or recommend in practice. I’m thinking here of the classical hypothesis-testing framework, especially in Chapters 4,5,6. If it were up to me, I would start right off with the model-based approach and just put the Fisher randomization test in an appendix for interested readers. I think most of the audience will be familiar with regression models, and, to me, that would be a logical place to start, thus moving directly into inference (likelihood or, implicitly, Bayesian inference) for causal estimands. The classical hypothesis-testing framework is fine, but to me it’s more of historical interest. Could be good in an appendix for the readers who want to see this connection. It seems like a big detour to have this right at the beginning of the book.

My other major comment is that I’d like to see more details on the worked examples, so that they have more of a real feeling and less of the flavor of numerical illustrations.

Specific Comments

Chapter 1

p.2, “potential outcomes”: Perhaps give a couple sentences explaining why you do _not_ use the term “counterfactual”. (See chapter 19, page 30, for an example where you really do use the concept of counterfactual.)

p.7, l. -2: “loose” should be “lose”

p.13: At this point, I made a note that the discussion seems to be going very slowly. I think that much of chapter 1 could be picked up in pace. Perhaps that’s just because I’m familiar with the material–but I wonder if there’s too much discussion here. I think it might go smoother if you’ve worked out some examples first.

p.14, “attributes” and “covariates”: Do you ever define these terms? “Covariate” in particular is a key concept for you, so I’d like to see some definition and discussion of the concept. How is it like a “right hand side” variable in a regression?

p.15: this is repeated from the preface.

p.16: of all other versions of causality, I don’t know that you should privilege the relatively obscure “Granger-Sims causality”.

p.18-19: combine tables 1.4-1.5 into one table.

Chapter 5

p.1: It’s funny that you personalize Fisher and Neyman (giving them first names, not just references) but nobody else in the story.

p.2: I’d recommend removing “Notice that”, “Note that”, “It is important that”, etc., from the book in all cases.

p.4: Is it important that it’s in Fresno and Youngstown? I’d remove this detail (and combine the two cities) so as not to distract the reader. Also, you should point out that these are first-graders and give the year of the experiment.

p.6, bottom: Is it really plausible that the treatment would have a constant effect c for all units? On the next page, you question whether the treatment effect should be additive in levels rather than logarithms–but why assume it’s constant at all?

More generally, I don’t like the focus on the so-called exact test–I’d rather see this done using regression modeling.

p.7: The mathematical “Definition” seems out of place and does not fit the rest of the book. Also, why focus on test statistics? In your applied work, you will be estimating parameters, not trying to cleverly design test statistics. This all just seems completely opposite to the Rubin approach to statistics.

p.8: similar question: are you really recommending that researchers spend their time “looking for statistics that have power against interesting alternatives”? Bayes would be spinning in his grave!

p.9: You talk about log of nonnegative variables. You should say positive variables, since you can’t take the log of zero. Nonnegative variables might have to be divided into 2 parts: 0/1 and the positive part. Also, wealth can be negative: lots of people have mortgage and credit card debt.

p.10: I don’t know if you should be talking about rank statistics. See p.252 of Gelman, Carlin, Stern, and Rubin.

p.11: Does the Wilcoxon really have a closed-form distribution? I think it’s an invariant distribution which has been calculated and tabulated.

p.12: typos: “Fets’s”, “an FET”. Actually, I think you’d be better off simply using the term “permutation test”.

p.13, Section 5.4.4: Again, are you really recommending that researchers spend their time choosing among test statistics? This is not how you do your applied work.

p.18, Section 5.7: This is another one of these ideas that works in very simple cases but, in general, will not work. You should say this in the book; otherwise people might actually try to use these methods in applied problems.

p.20, data analysis: all this combinatorics is so complicated. During the time it took you to write the section, you could have done the appropriate regression analysis about 10 times! The regression framework would allow you to think about interesting extensions such as interactions, rather than the approach presented in your chapter, which draws the reader into a technical morass of combinatorics.

Similarly, Section 5.9 is hugely complicated, considering that you could solve the same problems nearly trivially using regression.

p.24, section 5.10: Do you really think this approach is “excellent”???

p.26: data should be displayed as scatterplot, not table. (See p.174, 176, 52 of the forthcoming Gelman and Hill book for examples of how to display these data.) Actually, in the discussion you only use the first 6 units, so maybe best to just show these. Also, I seem to recall that the actual experiment had a paired design. This is not reflected in the table.

Tables 5.2,5.3,5.5: combine into a single table.

Table 5.6 should be a graph with treatment effect on the x-axis.

Table 5.7 should be a graph with number of simulations on the x-axis. Also, I don’t see the advantage of separating Fresno and Youngstown. It’s a distraction.

Table 5.8 should be a graph (see, e.g., p.176 of Gelman & Hill).

Chapter 7

p.1: “Bayes Rule” should be “Bayes’ rule” (or “Bayes’s rule”)

p.4: This will be clearer if you use consistent notation p() for probabilities (rather than p(), f(), and L() for different probabilities). See Gelman, Carlin, Stern, Rubin book.

p.12: Is this integral necessary? In practical calculations, we’re never actually doing an integral. Maybe it’s a bit of a distraction.

p.15, footnote: “Chapters” should be “chapters”.

p.15-…, Section 7.3.2: This is getting really complicated. Is all this matrix algebra and integration really necessary? I’m bothered by the unevenness of the mathematical level. At some points, you have these matrix calculations, at other points (see (7.24) on page 21), you’re spelling out the steps of simple multiplications and additions. What is the math you expect the readers to be using? (You should go into this in the preface.) You also have to decide whether to use an X or a dot for multiplication (compare to p.23).

p.22, simulation: Some computer code would be nice. See, e.g., Chapters 7 and 8 of the Gelman & Hill book.

p.24: “Notice that” can be removed.

p.29, “de Finetti’s theorem”: Have you discussed this? And is this over-sophisticated for the general level of the book?

p.33, “the Bernstein-Von Mises theorem”: Huh? Have I heard of this one?

p.35: “variance equal to 10,000” should be “standard deviation equal to 100”

p.36: I’d skip the Tanner reference, I don’t think it’s that useful. Also, “Markov-Chain-Monte-Carlo” should be “Markov chain Monte Carlo”. Also, at the beginning of this reference section, you might cite Chapter 7 of Gelman, Carlin, Stern, and Rubin, which develops the Rubin (1978) approach with many examples. This is also relevant for the refs in Section 7.10.

Section 7.11: Can you just remove this? It seems so hard to follow and so messy.

Tables 7.1-7.4: With care, these can be combined into a single table which would make the operations much easier to follow.

Table 7.5: What’s “Freq” doing in the middle of the table? That looks weird. Also, you can spell out the word: there’s space there. Similarly, say “linear”, not “lin”, and explain what is meant by “Cov”. A longer caption would help. And I’d combine the two cities.

Chapter 11

p.4, line 1: Why “less than alpha or greater than 1-alpha”? Might it make sense to use different thresholds at the two ends?

p.4: “highschool” should be “high school”

p.7: If you’re going to recommend histograms, you should recommend scatterplots.

p.8, first displayed equation: This is the logit. You should define it here, partly so you can use it again, partly to connect to what students already know.

p.8, typo: “the an”

p.9, bottom: this is logit^{-1}.

p.14: Here’s where you can use the “logit” notation and simplify the presentation.

p.14, “the potential outcomes are more likely to be linear in the log odds ratio…”: Where does “more likely” come from here?

p.15: use “1” rather than “one”; it’ll be easier to follow.

p.15, “Inspecting the distribution of the differences is generally useful”: Could you supply an example? That would help.

p.20: When discussing this matching, perhaps also look at Ben Hansen’s work.

p.22: You define p_x, then you use p. Are these the same? I’m confused.

p.23, sentence near the top with the word “easy”: This is confusing to me. It would be clearer if stated directly.

p.25: “It is interesting to note”

p.29 etc: These should be graphs. See, for example, p.202 of Gelman and Hill for an example of how to do this compactly. I think it is these comparisons you want, not all the numerical values.

Three pages of histograms following page 35: These should be made smaller, fit on 1 page, and oriented right side up. Also some explanation is needed.

Chapter 14

p.1: you write that “in this chapter we discuss a second approach…namely matching”. But you already discussed matching in Sections 11.4 and 11.5. I’m not saying you’re repetitive, I’m just saying that you’re introducing something here that you’ve already introduced.

p.5, near bottom: This hyper-mathematical notation looks ugly to me. Can’t you say it in words?

p.8, bottom: “It is important to realize”

p.12: A picture would be helpful here.

p.15, “it will be easier to find good matches if we are matching on only a few covariates.” Not really! If you match only on a few, you’re just ignoring all other potential covariates: not necessarily “good” at all. It should only be better, not worse, to include more covariates.

Related to this, you might mention the work of Hill and McCulloch on using BART (Bayesian additive regression trees) for matching on many variables.

p.20: perhaps mention that this is equivalent to allowing treatment interactions.

More generally, I didn’t see much discussion of interactions in this manuscript. But it’s an important topic (especially given that you’re talking about ATE, LATE, CACE, etc, all of which differ from each other only in the presence of interactions).

p.24: The goal of matching (at least as I understood from Rubin) is to match groups, not individuals. I don’t see this point emphasized here.

p.27: “Notice that”

Throughout this example: Things would be easier to follow if you round off the half-numbers. The material is tough enough without having to tangle with all these decimal places. Just round them off and say in a footnote that you did it for simplicity.

p.28: Have ATE, ATT, and ATC been defined yet?

p.32, top equations: a bit technical. Perhaps clearer to explain in words.

Later on in page 32, you have triple subscripting: Y_kappa_t_i. I suggest putting some of this upstairs as superscripts. Also, I suggest being consistent with treatment/control notation. Are these t/c or 1/0? If you’re using t/c, I’d prefer T/C, since t often is used for “time”. I’m also not thrilled with “tau” as treatment effect. I’d recommend “theta”.

p.38, footnote: Rubin is using complete-case analysis? I’m shocked! I’d think this would be a good “teaching moment” to show people how to do matching even with some missing data.

p.43, “OLS”: perhaps say “full-data” instead. The point is that they use all the data, not that they use least squares.

p.46: Should be a scatterplot, not a table. Also, round off the .5’s.

Tables 14.2-14.7: I’m not thrilled with these, but maybe they’re good for following the details. I’d think about what you could remove from the tables so that they’re still useful but not so overwhelming.

Tables 14.8-14.9: make a graph (see, for example, p.202 of Gelman and Hill)

Table 14.10: What’s the ordering here? Perhaps simpler to have all the ATEs, then all the ATTs, then ATC? Also, it looks like nothing is statistically significant (or even close)! So does this matter at all? Worth discussing, I think.

Table 14.11: a graph would be better (see, for example, p.505 of Gelman and Hill for an example of how to display and compare estimates and se’s). Also, what are the units of “time till first raise”? The tiny coefs suggest you should rescale, perhaps use years, not months, as your scale.

Chapter 19

p.2: A picture with causal arrows could help (or else say why you dislike such pictures).

p.5, typo: “units’s”

p.7, assumption 1: Say “is statistically independent of”. Don’t just use that “perpendicular lines” symbol. This isn’t a math textbook.

p.8,9: What are M and N? I must have missed their definitions.

p.9, assump 2: Say “is statistically independent of”. Don’t just use that “perpendicular lines” symbol. Also, make clear in the formatting where the assumption ends and the discussion begins.

p.12: “no-one” should be “no one”

p.16: I don’t see the intuition in all this matrix algebra.

p.17, “An alternative is a model-based approach”: I recommend doing the model-based approach first, since that’s what you actually prefer.

p.17, notation: use p() for all densities, rather than the confusing pi(), f(), and L().

p.19-20: This is getting ugly again. Maybe it’s necessary, but as a reader I certainly don’t enjoy it! The formulas on p.20, in particular, look like they could be simplified in some intuitive way. Can you also discuss what happens when sum_i Z_i = 0?

p.21, example: This seems like a SUTVA violation, since the assignment to one woman in the village is correlated with the assignment to others.

p.21, “we do not have indicators…”: Can’t you get this info?

p.22, elsewhere: to increase readability, remove the commas in the numbers, for example, write “2,385” as “2385”. Also, take the square root of the variance: se’s are more directly interpretable.

p.23,24: It would be better to express these as death rates rather than survival rates; then you don’t have to work with ugly numbers like 0.9936, and can work with cleaner things like 0.64 percent.

p.24: “Notice that”

p.26: see comment on p.23,24 above

p.32, typos: no space before/after “ITT”

p.32, section 19.9: Perhaps best to put the naive analyses earlier?

p.34: see comment on p.23,24 above

p.40: “onesided” should be “one sided”

p.42, line 4: why “however”?

p.42, since Rubin is the author, it might be more polite to avoid use of the term “Rubin causal model”. You could use the term “potential outcome notation”.

p.42: also include refs from sociology and psychology on structural equation modeling. Even if you don’t like it, say why you don’t. You could cite Sobel’s work in this area. As is, it seems odd that you’re singling out a somewhat obscure idea of Zelen.

I don’t actually know how many of these recommendations they followed. As I wrote, I defer to the authors’ expertise in this area.

In any case, I thought it might amuse you to see this mix of serious and minor comments.

On deck this week

Mon: Comments on Imbens and Rubin causal inference book

Tues: “Dow 36,000” guy offers an opinion on Tom Brady’s balls. The rest of us are supposed to listen?

Wed: Irwin Shaw: “I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.”

Thurs: Death of a statistician

Fri: Being polite vs. saying what we really think

Sat: Why is this double-y-axis graph not so bad?

Sun: “There are many studies showing . . .”

More Stan on the blog

Whoa. Stan is 3 years old. We’ve come a long way since the start. I came into the project just as a working prototype was implemented by Matt and Bob with discussions with Andrew, Ben, Michael Malecki, Jiqiang, and others. (I had been working for Andrew prior to the official start of the project, but was on the injured reserve list at the time.)

Just for kicks, I looked back at the documentation and the commit log for v1.0.0. The first version was fully functional — it was definitely more than just a prototype. I remember feeling a bit nervous about the release, but not really apprehensive. Most of our meetings were about how we could make Stan faster and what we wanted to implement next.

Fast forward 3 years and Stan is still moving along. We now have a family of projects. We’ve split the math and automatic differentiation library into its own Stan Math Library, the Stan Library is being shaped into the language and the inference algorithms, there are interfaces for the command line, R, Python, Julia, Matlab, and Stata, and there are additional supporting libraries like ShinyStan and stan-mode (syntax highlighting for emacs), with some more stuff coming down the line. We’re still talking about how to make Stan faster and what to implement next.

I’m going to start posting more regularly on the blog with Stan-related posts:
– general advice
– how to do easy stuff
– how to do expert stuff
– walk-throughs of models
– ideas that we want help on

If you have any suggestions, let me know.

BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

I think Sheldon Silver and Dean Skelos were involved too. It all went down on the George Washington Bridge, and Hillary Clinton recorded it in her personal email.

Details coming from Seymour Hersh.

P.S. This was topical back when I wrote in early June! I would’ve put it on the sister blog, which specializes in topical material, but I don’t think they like silly so much anymore.

Emails I never finished reading

This one came in today (i.e., 2 months ago):

Dear colleague,

Can I ask you a quick question? I am seeing more and more research projects (many in the economic sciences) in which researchers use dummy-codes to account for non-independence due to higher-order units. So when the researchers have data from employees in 15 different departments, they include 14 dummy-codes in the OLS regression.

I have heard and read many times that this is not the right data-analytic strategy. As far as I remember, the parameter estimates are correct, but the standard errors of the parameter estimates are too large and the dfs tend to be too high. In short, multilevel modeling is the right strategy, whereas the use of dummy codes produces incorrect inferential statistics (at least with a large number of higher-order units). Can you point me toward a publication that examines this question?
Sorry to bother you with this. I know you are probably very busy. I have looked around quite a bit and despite all my efforts I have not been able to find a relevant publication. . . .

Ummmm, yeah I’m busy! I’ll answer pretty much any question, but not if you don’t even bother to cut-and-paste my name into the salutation!
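
(That said, for anyone curious about the comparison itself, here’s a minimal sketch of the two specifications the correspondent describes, using statsmodels and simulated data. It’s just my own toy example showing how each model is set up, not an endorsement of either approach.)

```python
# Toy example (simulated data) of the two approaches in the email:
# (1) OLS with department dummy codes, i.e., 14 dummies for 15 departments;
# (2) a multilevel model with a varying intercept for department.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_dept, n_per = 15, 20
dept = np.repeat(np.arange(n_dept), n_per)
dept_effect = rng.normal(0, 1, n_dept)              # department-level variation
x = rng.normal(size=n_dept * n_per)
y = 0.5 * x + dept_effect[dept] + rng.normal(size=n_dept * n_per)
df = pd.DataFrame({"y": y, "x": x, "dept": dept})

# (1) dummy codes for the higher-order units:
ols_fit = smf.ols("y ~ x + C(dept)", data=df).fit()

# (2) multilevel (varying-intercept) model:
ml_fit = smf.mixedlm("y ~ x", data=df, groups=df["dept"]).fit()

print("dummy-code OLS:  ", ols_fit.params["x"], ols_fit.bse["x"])
print("multilevel model:", ml_fit.params["x"], ml_fit.bse["x"])
```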

P-values and statistical practice

What is a p-value in practice? The p-value is a measure of discrepancy of the fit of a model or “null hypothesis” H to data y. In theory the p-value is a continuous measure of evidence, but in practice it is typically trichotomized approximately into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at p=0.01 and 0.10.

One big practical problem with p-values is that they cannot easily be compared. The p-value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with p-values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.
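
To see just how noisy the p-value is as a statistic, here’s a small simulation sketch (the effect size and sample size are made up): the same experiment, replicated over and over with the same true effect, produces p-values ranging from “highly significant” to “nothing going on.”

```python
# Small simulation (made-up numbers): identical replications of the same
# experiment, with the same true effect, give wildly different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_effect = 50, 0.4          # assumed per-group sample size and effect (in sd units)

pvals = []
for _ in range(1000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    pvals.append(stats.ttest_ind(treated, control).pvalue)
pvals = np.array(pvals)

print("p-value percentiles (5, 25, 50, 75, 95):", np.percentile(pvals, [5, 25, 50, 75, 95]))
print("share with p < 0.01:", np.mean(pvals < 0.01))
print("share with p > 0.10:", np.mean(pvals > 0.10))
```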

I’ve discussed this paper before (it’s a discussion from 2013 in the journal Epidemiology) but with all the recent controversy about p-values and statistical significance, I think it’s worth reposting:

The casual view of the p-value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings . . . The formal view of the p-value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence the popularity of alternative if wrong interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model).

Greenland and Poole [in the article in Epidemiology that I was discussing] make two points. First, they describe how p-values approximate posterior probabilities under prior distributions that contain little information relative to the data:

This misuse [of p-values] may be lessened by recognizing correct Bayesian interpretations. For example, under weak priors, 95% confidence intervals approximate 95% posterior probability intervals, one-sided P-values approximate directional posterior probabilities, and point estimates approximate posterior medians.

I used to think this way too (see many examples in our books) but in recent years have moved to the position that I don’t trust such direct posterior probabilities. Unfortunately, I think we cannot avoid informative priors if we wish to make reasonable unconditional probability statements. To put it another way, I agree with the mathematical truth of the quotation above but I think it can mislead in practice because of serious problems with apparently noninformative or weak priors. . . .

When sample sizes are moderate or small (as is common in epidemiology and social science), posterior probabilities will depend strongly on the prior distribution.

I’ll continue to quote from this article but for readability will remove the indentation.

Good, mediocre, and bad p-values

For all their problems, p-values sometimes “work” to convey an important aspect of the relation of data to model. Other times a p-value sends a reasonable message but does not add anything beyond a simple confidence interval. In yet other situations, a p-value can actively mislead. Before going on, I will give examples of each of these three scenarios.

A p-value that worked. Several years ago I was contacted by a person who suspected fraud in a local election. Partial counts had been released throughout the voting process and he thought the proportions for the different candidates looked suspiciously stable, as if they had been rigged to aim for a particular result. Excited to possibly be at the center of an explosive news story, I took a look at the data right away. After some preliminary graphs—which indeed showed stability of the vote proportions as they evolved during election day—I set up a hypothesis test comparing the variation in the data to what would be expected from independent binomial sampling. When applied to the entire dataset (27 candidates running for six offices), the result was not statistically significant: there was no less (and, in fact, no more) variance than would be expected by chance alone. In addition, an analysis of the 27 separate chi-squared statistics revealed no particular patterns. I was left to conclude that the election results were consistent with random voting (even though, in reality, voting was certainly not random—for example, married couples are likely to vote at the same time, and the sorts of people who vote in the middle of the day will differ from those who cast their ballots in the early morning or evening), and I regretfully told my correspondent that he had no case.

In this example, we did not interpret a non-significant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, non-significance revealed the data to be compatible with the null hypothesis; thus, my correspondent could not argue that the data indicated fraud.
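
For readers who want to see the mechanics of this sort of test, here’s a rough sketch for a single candidate, with made-up partial counts (the actual election data aren’t given here): under the null hypothesis the batches behave like independent binomial samples from a common proportion, and a standard chi-squared test of homogeneity checks whether the batch-to-batch variation is about what that would predict.

```python
# Sketch of the kind of test described above, for one candidate, using
# made-up partial counts (not the actual election data).
import numpy as np
from scipy import stats

batch_totals = np.array([400, 350, 500, 450, 300, 420])      # hypothetical ballots per release
candidate_votes = np.array([168, 143, 212, 180, 128, 176])   # hypothetical votes for the candidate

# Batches x 2 table: votes for the candidate vs. votes for everyone else.
table = np.column_stack([candidate_votes, batch_totals - candidate_votes])
chi2, pvalue, dof, expected = stats.chi2_contingency(table)
print(f"chi-squared = {chi2:.2f} on {dof} df, p = {pvalue:.2f}")

# A middling p-value says the variation across batches is consistent with
# binomial sampling. Suspiciously *stable* proportions would show up as a
# chi-squared statistic far below its degrees of freedom, which you can
# check with the lower tail: stats.chi2.cdf(chi2, dof).
```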

A p-value that was reasonable but unnecessary. It is common for a research project to culminate in the estimation of one or two parameters, with publication turning on a p-value being less than a conventional level of significance. For example, in our study of the effects of redistricting in state legislatures, the key parameters were interactions in regression models for partisan bias and electoral responsiveness. Although we did not actually report p-values, we could have: what made our paper complete was that our findings of interest were more than two standard errors from zero, thus reaching the p<0.05 level. Had our significance level been much greater (for example, estimates that were four or more standard errors from zero), we would doubtless have broken up our analysis (for example, studying Democrats and Republicans separately) to broaden the set of claims that we could confidently assert. Conversely, had our regressions not reached statistical significance at the conventional level, we would have performed some sort of pooling or constraining of our model in order to arrive at some weaker assertion that reached the 5% level. (Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.) In any case, my point here is that it would have been just fine to summarize our results in this example via p-values even though we did not happen to use that formulation.

A misleading p-value. Finally, in many scenarios p-values can distract or even mislead, either a non-significant result wrongly interpreted as a confidence statement in support of the null hypothesis, or a significant p-value that is taken as proof of an effect. A notorious example of the latter is the recent paper of Bem, which reported statistically significant results from several experiments on ESP. At brief glance, it seems impressive to see multiple independent findings that are statistically significant (and combining the p-values using classical rules would yield an even stronger result), but with enough effort it is possible to find statistical significance anywhere.

The focus on p-values seems to have both weakened that study (by encouraging the researcher to present only some of his data so as to draw attention away from non-significant results) and to have led reviewers to inappropriately view a low p-value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis (ESP) rather than other, perhaps more scientifically plausible alternatives such as measurement error and selection bias.

So-called noninformative priors (and, thus, the usual Bayesian interpretation of classical confidence intervals) can be way too strong

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.

How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example. A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio.

The original analysis by Kanazawa had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 rather than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data.

Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways: first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely.

First off, a claimed 90% probability that θ>0 seems too strong. Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance alone, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.

Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to differences in the probability of a girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
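Here is a rough sketch of that shrinkage calculation. The numbers are my own illustrative choices, not the original analysis: I summarize the adjusted data as an estimate with a standard error chosen so that the two-sided p-value is about 0.2, and I use a normal prior centered at zero with a standard deviation of half a percentage point, in line with the known sex-ratio factors mentioned above. The exact posterior probability depends on the prior width, but for anything in this ballpark it lands near 60% rather than 90%.

from scipy import stats

# Illustrative numbers, not the original analysis:
y, s = 0.045, 0.035      # estimated difference in Pr(girl) and its standard error (z ~ 1.3, p ~ 0.2)
prior_sd = 0.005         # prior sd of half a percentage point, centered at zero

# Conjugate normal-normal update:
post_var = 1 / (1 / prior_sd**2 + 1 / s**2)
post_mean = post_var * y / s**2
post_sd = post_var**0.5

print(round(stats.norm.cdf(post_mean / post_sd), 2))  # ~0.57: suggestive, not close to convincing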

Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero—that is, exactly what one might expect from chance alone—would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty—5 to 1 odds—based on data that not only could occur by chance alone but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.

How I have changed

Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements. As Sander Greenland has discussed in much of his work over the years, epidemiologists and applied scientists in general have knowledge of the sizes of plausible effects and biases. . . .

The default conclusion from a noninformative prior analysis will almost invariably put too much probability on extreme values. A vague prior distribution assigns a lot of its probability to values that are never going to be plausible, and this distorts the posterior probabilities more than we tend to expect, something we probably don’t think about enough in our routine applications of standard statistical methods.
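To put a number on that, here is a quick sketch of how much mass a generic vague prior puts on effects that nobody believes. The normal(0, 10) default on a logit-scale coefficient is my own illustrative choice, not anyone's recommended prior.

from scipy import stats

prior_sd = 10.0  # an assumed "vague" normal(0, 10) prior on a logit-scale coefficient

# Share of prior mass outside the plausible range of roughly -1 to 1 on the logit scale:
p_outside = 2 * stats.norm.sf(1 / prior_sd)
print(round(p_outside, 2))  # ~0.92: over 90% of the prior mass sits on implausibly large effects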

A Psych Science reader-participation game: Name this blog post


In a discussion of yesterday’s post on studies that don’t replicate, Nick Brown did me the time-wasting disservice of pointing out a recent press release from Psychological Science which, as you might have heard, is “the highest ranked empirical journal in psychology.”

The press release is called “Blue and Seeing Blue: Sadness May Impair Color Perception” and it describes a recently published article by Christopher Thorstenson, Adam Pazda, and Andrew Elliot, which reports that “sadness impaired color perception along the blue-yellow color axis but not along the red-green color axis.”

Unfortunately the claim of interest is extremely difficult to interpret, as the authors do not seem to be aware of the principle that the difference between “significant” and “not significant” is not itself statistically significant.
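To see the principle in action, here is a toy calculation with made-up numbers (not taken from the paper): one effect clears the .05 threshold, the other does not, yet the difference between them is nowhere near statistically significant.

from scipy import stats
import math

def two_sided_p(est, se):
    return 2 * stats.norm.sf(abs(est / se))

# Hypothetical estimates and standard errors, for illustration only.
est_a, se_a = 2.5, 1.0  # "significant":     z = 2.5, p ~ 0.01
est_b, se_b = 1.0, 1.0  # "not significant": z = 1.0, p ~ 0.32

# The comparison that actually matters is A versus B:
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)

print(round(two_sided_p(est_a, se_a), 2))    # ~0.01
print(round(two_sided_p(est_b, se_b), 2))    # ~0.32
print(round(two_sided_p(diff, se_diff), 2))  # ~0.29: the difference itself is not significant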


The paper also displays other characteristic features of Psychological Science-style papers, including small samples of college students, lots of juicy researcher degrees of freedom in data-exclusion rules and in the choice of outcomes to analyze, and weak or vague theoretical support for the reported effects.

The theoretical claim was “maybe a reason these metaphors emerge was because there really was a connection between mood and perceiving colors in a different way,” which could be consistent with almost any connection between color perception and mood. And then once the results came out, we get this: “‘We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,’ says Thorstenson. ‘We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.'” This is of course completely consistent with a noise-mining exercise, in that just about any pattern can fit the theory, and then the details of whatever random thing comes up are likely to be a surprise.

It’s funny: it’s my impression that, when a scientist reports that his findings were a surprise, that’s supposed to be a positive thing. It’s not just turning the crank, it’s scientific discovery! A surprise! Like penicillin! Really, though, if something was a surprise, maybe you should take more seriously the possibility that you’re just capitalizing on chance, that you’re seeing something in one experiment (and then are motivated to find it in another). It’s the scientific surprise two-step, a dance discussed by sociologist Jeremy Freese.

As usual in such settings, I’m not saying that Thorstenson et al. are wrong in their theorizing, or that their results would not show up in a more thorough study on a larger sample. I’m just saying that they haven’t really made a convincing case, as the patterns they find could well be explainable by chance alone. Their data offer essentially no evidence in support of their theory, but the theory could still be correct, just unresolvable amid the experimental noise. And, as usual, I’ll say that I’d have no problem with this sort of result being published, just without the misplaced certainty. And, if the editors of Psychological Science think this sort of theorizing is worth publishing, I think they should also be willing to publish the same thing, even if the comparisons of interest are not statistically significant.

The contest!

OK, on to the main event. After Nick alerted me to this paper, I thought I should post something on it. But my post needed a title. Here were the titles I came up with:

“Feeling Blue and Seeing Blue: Desire for a Research Breakthrough May Impair Statistics Perception”


“Stop me before I blog again”


“The difference between ‘significant’ and ‘not significant’ is enough to get published in the #1 journal in psychology”


“They keep telling me not to use ‘Psychological Science’ as a punch line but then this sort of thing comes along”

Or simply,

“This week in Psychological Science.”

But maybe you have a better suggestion?

Winner gets a free Stan sticker.

P.S. We had another one just like this a few months ago.

P.P.S. I have nothing against Christopher Thorstenson, Adam Pazda, or Andrew Elliot. I expect they’re doing their best. It’s not their fault that (a) statistical methods are what they are, (b) statistical training is what it is, and (c) the editors of Psychological Science don’t know any better. It’s all too bad, but it’s not their fault. I laugh at these studies because I’m too exhausted to cry, that’s all. And, before you feel too sorry for these guys or for the editors of Psychological Science or think I’m picking on them, remember: if they didn’t want the attention, they didn’t need to publish this work in the highest-profile journal of their field. If you put your ideas out there, you have to expect (ideally, hope) that people will point out what you did wrong.

I’m honestly surprised that Psychological Science is still publishing this sort of thing. They’re really living up to their rep, and not in a good way. PPNAS I can expect will publish just about anything, as it’s not peer-reviewed in the usual way. But Psych Science is supposed to be a real journal, and I’d expect, or at least hope, better from them.

P.P.P.S. Lots of great suggestions in the comments, but my favorite is “Psychological Science publishes another Psychological Science-type paper.”

P.P.P.P.S. I feel bad that the whole field of psychology gets tainted by this sort of thing. The trouble is that Psychological Science is the flagship journal of the Association for Psychological Science, which I think is the main society for psychology research. The problem is not haters like me that draw attention to these papers; the problem is that this sort of work is regularly endorsed and publicized by the leading journal in the field. When the Association for Psychological Science regularly releases press releases touting this kind of noise study, it does tell us something bad about the field of psychology. Not all the work in the field, not most of the work in the field, not the most important work in the field. Psychology is important and I have a huge respect for many psychology researchers. Indeed I have a huge respect for much of the research within statistics that has been conducted by psychologists. And I say, with deep respect for the field, that it’s bad news that its leading society publicizes work that is not serious and has huge, obvious flaws. Flaws that might not have been obvious 10 or even 5 years ago, when most of us were not so aware of the problems associated with the garden of forking paths, but flaws which for the past couple of years have been widely known. They should know better; indeed I’d somehow thought they’d cleaned up their act so I was surprised to see this new paper, front and center in their leading journal.

USAs usannsynlige presidentkandidat (“The USA’s unlikely presidential candidate”).

With the current lag, this should really appear in September, but I thought I’d better post it now in case it does not remain topical.

It’s a news article by Linda May Kallestein, which begins as follows:

Sosialisten Bernie Sanders: Kan en 73 år gammel jøde, født av polske innvandrere, vokst opp under enkle kår og som vil innføre sosialdemokrati etter skandinavisk modell, ha sjanse til å bli USAs neste president?

[Translation: The socialist Bernie Sanders: Can a 73-year-old Jew, born to Polish immigrants, raised in modest circumstances, and who wants to introduce social democracy on the Scandinavian model, have a chance of becoming the USA’s next president?]

And here’s my quote:

[Screenshot of my quote as it appeared in the article, in Norwegian.]

I actually said it in English, but you get the picture. Not as exciting as the time I was quoted in Private Eye, but I’ll still take it.

The full story is on the sister blog.

To understand the replication crisis, imagine a world in which everything was published.


John Snow points me to this post by psychology researcher Lisa Feldman Barrett, who reacted to the recent news on the non-replication of many psychology studies with a contrarian, upbeat take, entitled “Psychology Is Not in Crisis.”

Here’s Barrett:

An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. . . .

But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works. . . . Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.

All this is fine. Indeed, I’ve often spoken of the fractal nature of science: at any time scale, whether it be minutes or days or years, we see a mix of forward progress and sudden shocks, realizations that much of what we’ve thought was true, isn’t. Scientific discovery is indeed both wonderful and unpredictable.

But Barrett’s article disturbs me too, for two reasons. First, yes, failure to replicate is a feature, not a bug—but only if you respect that feature, if you take the failure to replicate as an occasion to reassess your beliefs. But if you just complacently say it’s no big deal, then you’re not taking the opportunity to learn.

Here’s an example. The recent replication paper by Nosek et al. had many examples of published studies that did not replicate. One example was described in Benedict Carey’s recent New York Times article as follows:

Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

Carey got a quote from the author of that original study. To my disappointment, the author did not say something like, “Hey, it looks like we might’ve gone overboard on that original study, that’s fascinating to see that the replication did not come out as we would’ve thought.” Instead, here’s what we got:

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

“Theory-required adjustments,” huh? Unfortunately, just about anything can be interpreted as theory-required. Just ask Daryl Bem.

We can actually see what the theory says. Philosopher Deborah Mayo went to the trouble to look up Bressan’s original paper, which said the following:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men.

Nothing at all about Italians there! Apparently this bit of theory requirement wasn’t apparent until after the replication didn’t work.

What if the replication had resulted in statistically significant results in the same direction as expected from the earlier, published paper? Would Bressan have called up the Reproducibility Project and said, “Hey—if the results replicate under these different conditions, something must be wrong. My theory requires that the model won’t work with American college students!” I really really don’t think so. Rather, I think Bressan would call it a win.

And that’s my first problem with Barrett’s article. I feel like she’s taking a heads-I-win, tails-you-lose position. A successful replication is welcomed as a confirmation; an unsuccessful replication indicates new conditions required for the theory to hold. Nowhere does she consider the third option: that the original study was capitalizing on chance and in fact never represented any general pattern in any population. Or, to put it another way, that any true underlying effect is too small and too variable to be measured by the noisy instruments being used in some of those studies.

As the saying goes, when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

My second problem with Barrett’s article is at the technical level. She writes:

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. . . . Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true only under certain conditions [emphasis in the original].

At one level, there is nothing to disagree with here. I don’t really like the presentation of phenomena as “true” or “false”—pretty much everything we’re studying in psychology has some effect—but, in any case, all effects vary. The magnitude and even the direction of any effect will vary across people and across scenarios. So if we interpret the phrase “the phenomenon is true” in a reasonable way, then, yes, it will only be true under certain conditions—or, at the very least, vary in importance across conditions.

The problem comes when you look at specifics. Daryl Bem found some comparisons in his data which, when looked at in isolation, were statistically significant. These patterns did not show up in replication. Satoshi Kanazawa found a correlation between beauty and sex ratio in a certain dataset. When he chose a particular comparison, he found p less than .05. What do we learn from this? Do we learn that, in the general population, beautiful parents are more likely to have girls? No. The most we can learn is that the Journal of Theoretical Biology can be fooled into publishing patterns that come from noise. (His particular analysis was based on a survey of 3000 people. A quick calculation using prior information on sex ratios shows that you would need data on hundreds of thousands of people to estimate any effect of the sort that he was looking for; see the sketch below.) And then there was the himmicanes and hurricanes study which, ridiculous as it was, falls well within the borders of much of the theorizing done in psychology research nowadays. And so on, and so on, and so on.
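Here is the sort of quick calculation I have in mind, sketched with assumed numbers: take a plausible effect of 0.3 percentage points (in line with the known sex-ratio factors discussed earlier) and ask how many births you would need to detect it with conventional power.

from scipy import stats

effect = 0.003    # assumed plausible difference in Pr(girl): 0.3 percentage points
p_girl = 0.485    # approximate baseline probability of a girl birth
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)

# Standard two-proportion sample-size formula, assuming equal group sizes:
var = 2 * p_girl * (1 - p_girl)
n_per_group = var * ((z_alpha + z_power) / effect) ** 2

print(int(n_per_group))  # ~435,000 per group: hundreds of thousands of people, not 3000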

We could let Barrett off the hook on the last quote above because she does qualify her statement with, “If the studies were well designed and executed . . .” But there’s the rub. How do we know if a study was well designed and executed? Publication in Psychological Science or PPNAS is not enough—lots and lots of poorly designed and executed studies appear in these journals. It’s almost as if the standards for publication are not just about how well designed and executed a study is, but also about how flashy the claims are, and whether there is a “p less than .05” somewhere in the paper. It’s almost as if reviewers often can’t tell whether a study is well designed and executed. Hence the demand for replication, hence the concern about unreplicated studies, or studies that for mathematical reasons are essentially dead on arrival because the noise is so much greater than the signal.

Imagine a world in which everything was published

A close reading of Barrett’s article reveals the centrality of the condition that studies be “well designed and executed,” and lots of work by statisticians and psychology researchers in recent years (Simonsohn, Button, Nosek, Wagenmakers, etc etc) has made it clear that current practice, centered on publication thresholds (whether it be p-value or Bayes factor or whatever), won’t do so well at filtering out the poorly designed and executed studies.

To discourage or disparage or explain away failed replications is to give a sort of “incumbency advantage” to published claims, which puts a burden on the publication process that it cannot really handle.

To better understand what’s going on here, imagine a thought experiment in which everything is published, where there’s no such thing as Science or Nature or Psychological Science or JPSP or PPNAS; instead, everything’s published on Arxiv. Every experiment everyone does. And with no statistical significance threshold. In this world, nobody has ever heard of inferential statistics. All we see are data summaries, regressions, etc., but no standard errors, no posterior probabilities, no p-values.

What would we do then? Would Barrett reassure us that we shouldn’t be discouraged by failed replications, that everything already published (except, perhaps, for “a few bad eggs”) should be taken as likely to be true? I assume (hope) not. The only way this sort of reasoning can work is if you believe the existing system screens out the bad papers. But the point of various high-profile failed replications (for example, in the field of embodied cognition) is that, no, the system does not work so well. This is one reason the replication movement is so valuable, and this is one reason I’m so frustrated by people who dismiss replications or who claim that replications show that “the system works.” It only works if you take the information from the failed replications (and the accompanying statistical theory, which is the sort of thing that I work on) and do something about it!

As I wrote in an earlier discussion on this topic:

Suppose we accept this principle [that published results are to be taken as true, even if they fail to be replicated in independent studies by outsiders]. How, then, do we treat an unpublished paper? Suppose someone with a Ph.D. in biology posts a paper on Arxiv (or whatever is the biology equivalent), and it can’t be replicated? Is it ok to question the original paper, to treat it as only provisional, to label it as unreplicated? That’s ok, right? I mean, you can’t just post something on the web and automatically get the benefit of the doubt that you didn’t make any mistakes. Ph.D.’s make errors all the time (just like everyone else). . . .

Now we can engage in some salami slicing. According to Bissell (as I interpret here), if you publish an article in Cell or some top journal like that, you get the benefit of the doubt and your claims get treated as correct until there are multiple costly, failed replications. But if you post a paper on your website, all you’ve done is make a claim. Now suppose you publish in a middling journal, say, the Journal of Theoretical Biology. Does that give you the benefit of the doubt? What about Nature Neuroscience? PNAS? Plos-One? I think you get my point. A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time. Sure, approval by 3 referees or 6 referees or whatever is something, but all they did is read some words and look at some pictures.

It’s a strange view of science in which a few referee reports is enough to put something into a default-believe-it mode, but a failed replication doesn’t count for anything.

I’m a statistician so I’ll conclude with a baseball analogy

Bill James once wrote with frustration about humanist-style sportswriters, the sort of guys who’d disparage his work and say they didn’t care about the numbers, that they cared about how the athlete actually played. James’s response was that if these sportswriters really wanted to talk baseball, that would be fine—but oftentimes their arguments ended up having the form: So-and-so hit .300 in Fenway Park one year, or so-and-so won 20 games once, or whatever. His point was that these humanists were actually making their arguments using statistics. They were just using statistics in an uninformed way. Hence his dictum that the alternative to good statistics is not “no statistics,” it’s “bad statistics.”

That’s how I feel about the people who deny the value of replications. They talk about science and they don’t always want to hear my statistical arguments, but then if you ask them why we “have no choice but to accept” claims about embodied cognition or whatever, it turns out that their evidence is nothing but some theory and a bunch of p-values. Theory can be valuable but it won’t convince anybody on its own; rather, theory is often a way to interpret data. So it comes down to the p-values.

Believing a theory is correct because someone reported p less than .05 in a Psychological Science paper is like believing that a player belongs in the Hall of Fame because he hit .300 once in Fenway Park.

This is not a perfect analogy. Hitting .300 anywhere is a great accomplishment, whereas “p less than .05” can easily represent nothing more than an impressive talent for self-delusion. But I’m just trying to get at the point that ultimately it is statistical summaries and statistical models that are being used to make strong (and statistically ridiculous) claims about reality, hence statistical criticisms, and external data such as those that come from replications, are relevant.

If, like Barrett, you want to dismiss replications and say there’s no crisis in science: Fine. But then publish everything and accept that all data are telling you something. Don’t privilege something that happens to have been published once and declare it true. If you do that, and you follow up by denying the uncertainty that is revealed by failed replications (and was earlier revealed, on the theoretical level, by this sort of statistical analysis), well, then you’re offering nothing more than complacent happy talk.

P.S. Fred Hasselman writes:

I helped analyze the replication data of the Bressan & Stranieri study.

There were two replication samples:

Original effect is a level comparison after a 2x2x2 ANOVA:
F(1, 194) = 7.16, p = .008, f = 0.19
t(49) = 2.45, p = .02, Cohen’s d = 0.37

Replication 1, in-lab, with N = 263, power > 99%, Cohen’s d = .06
Replication 2, online, with N = 317, power > 99%, Cohen’s d = .09

Initially I did not have the time to read the entire article. I recently did, because I wanted to use the study as an example in a lecture.

I completely agree with the comparisons to Bem-logic.
What I ended up doing is showing the original materials and elaborating on the theory behind the hypothesis during the lecture.

After seeing the stimuli, learning about the hypothesis, but before learning about the replication studies, there was a consensus among students (99% female) that claims like the first sentence of the abstract should disqualify the study as a serious work of science:

ABSTRACT—Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extrapair mating with the former.

Think about it.
Men of higher genetic quality are poorer partners and parents.
That’s a fact you know.
And this genetic quality of men (yes, they mean attractiveness) is why women want their babies, more so than babies from their current partner (the ugly variety of men, but very sweet and good with kids).

My brain hurts.

Thankfully the conclusion is very modest:
In humans’ evolutionary past, the switch in preference from less to more sexually accessible men associated with each ovulatory episode would have been highly adaptive. Our data are consistent with the idea that, although the length of a woman’s reproductive lifetime and the extent of the potential mating network have expanded considerably over the past 50,000 years, this unconscious strategy guides women’s mating choices still.

Erratum: We meant ‘this unconscious strategy guides Italian women’s mating choices still’.