
“Derek Jeter was OK”


Tom Scocca files a bizarrely sane column summarizing the famous shortstop’s accomplishments:

Derek Jeter was an OK ballplayer. He was pretty good at playing baseball, overall, and he did it for a pretty long time. . . . You have to be good at baseball to last 20 seasons in the major leagues. . . . He was a successful batter in productive lineups for many years. . . . He was not Ted Williams or Rickey Henderson. Spectators did not come away from seeing Derek Jeter marveling at the stupendous, unimaginable feats of hitting they had seen. But he did lots and lots of damage. He got many big hits and contributed to many big rallies. Pitchers would have preferred not to have to pitch to him. . . . His considerable athletic abilities allowed him to sometimes make spectacular leaping and twisting plays on misjudged balls that better shortstops would have played routinely. People enjoyed watching him make those plays, and that enjoyment led to his winning five Gold Gloves. That misplaced acclaim, in turn, helped spur more advanced analysis of defensive play in baseball, a body of knowledge which will ensure that no one ever again will be able to play shortstop as badly as Jeter for as long as he did. And that gave fans something to argue about, which is an important part of sports.

Scocca keeps going in this vein:

Regardless, on balance, Jeter’s good hitting helped his team more than his bad fielding hurt it. The statistical ledger says so—by Wins Above Replacement, according to Baseball Reference, his glovework drops him from being the 20th most productive position player of all time to the 58th. Having the 58th most productive career among non-pitchers in major-league history is still a solid achievement.

And still more:

When [Alex] Rodriguez showed up in the Bronx, Jeter would not yield the job. It was a selfish decision and the situation hurt the team. But powerful egos, misplaced competitiveness, and unrealistic self-appraisals are common features in elite athletes. Whatever wrong Jeter may have done in the intrasquad rivalry, it was the Yankees’ fault for not managing him better.

And this:

Like most star athletes of his era, he kept his public persona intentionally blank and dull . . . Depending on their allegiances, baseball fans could imagine him to be classy or imagine him to be pissy, and the limited evidence could support either conclusion.

I love this Scocca post because its hilariousness (which is intentional, I believe) is entirely contingent on its context. Sportswriting is so full of hype (either of the “Jeter is a hero” variety or the “Jeter’s no big whoop” variety or the “Hey, look at my cool sabermetrics” variety or the “Hey, look at what a humanist I am” variety) that it just comes off (to me) as flat-out funny to see a column that just plays it completely straight, a series of declarative sentences that tell it like it is.

Of course, if all the sportswriters wrote like this, it would be boring. But as long as all the others feel they need some sort of angle, this pitch-it-down-the-middle style will work just fine. The confounding of expectations and all that.

P.S. Also this from a commenter to Scocca’s post:

He also inspired people to like baseball again after the lockout and didn’t juice.

Waic for time series

Helen Steingroever writes:

I’m currently working on a model comparison paper using WAIC, and
would like to ask you the following question about the WAIC computation:

I have data of one participant that consist of 100 sequential choices (you can think of these data as being a time series). I want to compute the WAIC for these data. Now I’m wondering how I should compute the predictive density. I think there are two possibilities:

(1) I compute the predictive density of the whole sequence (i.e., I consider the whole sequence as one data point, which means that n=1 in Equations (11) – (12) of your 2013 Stat Comput paper.)
(2) I compute the predictive density for each choice (i.e., I consider each choice as one data point, which means that n=# choices in Equations (11) – (12) of your 2013 Stat Comput paper.)

My quick thought was that WAIC is an approximation to leave-one-out cross-validation, and this computation gets more complicated with correlated data.

But I passed the question on to Aki, the real expert on this stuff. Aki wrote:

This is an interesting question, and there is no simple answer.

First we should consider what is your predictive goal:
(1) predict whole sequence for another participant
(2) predict a single choice given all other choices
(3) predict the next choice given the choices in the sequence so far?

If your predictive goal is

(1), then you should note that WAIC is based on an asymptotic argument and it is not generally accurate with n=1. Watanabe has said (personal communication) that he thinks this is not a sensible scenario for WAIC, but if (1) is really your prediction goal, then I think this might be the best you can do. It seems that when n is small, WAIC will usually underestimate the effective complexity of the model, and thus would give over-optimistic performance estimates for more complex models.

(2) WAIC should work just fine here (unless your model says that there is no dependency between the choices, i.e., having 100 separate models, each with n=1). Correlated data here means just that it is easier to predict a choice if you know the previous and following choices. This may make the difference between some models small compared to scenario (1).

(3) WAIC can’t handle this, and you would need to use a specific form of cross-validation (I think I should write a paper on this).
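To make the pointwise-versus-joint distinction in scenarios (1) and (2) concrete, here is a quick sketch (mine, not Aki's; the log-likelihood draws are simulated stand-ins for what you would extract from a fitted model of the 100 choices):

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S draws x n points) matrix of pointwise log-likelihoods,
    in the standard lppd - p_waic form: the log pointwise predictive density
    minus the variance-based effective-parameters penalty."""
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2 * (lppd - p_waic), p_waic

rng = np.random.default_rng(0)
S, n = 4000, 100
# Stand-in for posterior draws of log p(choice_i | theta_s) from a fitted model:
log_lik = rng.normal(-0.7, 0.1, size=(S, n))

# Possibility (2): each of the 100 choices is a data point, n = 100.
waic_pointwise, p_pointwise = waic(log_lik)

# Possibility (1): the whole sequence is one data point, n = 1.
waic_joint, p_joint = waic(log_lik.sum(axis=1, keepdims=True))
```

Under possibility (2), the 100 pointwise terms give the variance-based complexity penalty something to average over; under possibility (1), that penalty is estimated from the variance of a single summed term, which is exactly where the asymptotic argument breaks down. (With many data points or very small likelihoods, you would want a log-sum-exp rather than the raw exp/log for numerical stability.)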

Study published in 2011, followed by successful replication in 2003 [sic]


This one is like shooting fish in a barrel but sometimes the job just has to be done. . . .

The paper is by Daryl Bem, Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan, it’s called “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events,” and it begins like this:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded (Bem, 2011). To encourage exact replications of the experiments, all materials needed to conduct them were made available on request. We can now report a meta-analysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma . . . A Bayesian analysis yielded a Bayes Factor of 7.4 × 10^-9 . . . An analysis of p values across experiments implies that the results were not a product of “p-hacking” . . .

Actually, no.

There is a lot of selection going on here. For example, they report that 57% (or, as they quaintly put it, “56.6%”) of the experiments had been published in peer reviewed journals or conference proceedings. Think of all the unsuccessful, unpublished replications that didn’t get caught in the net. But of course almost any result that happened to be statistically significant would be published, hence a big bias. Second, they go back and forth, sometimes considering all replications, other times ruling some out as not following protocol. At one point they criticize internet experiments, which is fine, but again it’s more selection, because if the results from the internet experiments had looked good, I don’t think we’d be seeing that criticism. Similarly, we get statements like, “If we exclude the 3 experiments that were not designed to be replications of Bem’s original protocol . . .”. This would be a lot more convincing if they’d defined their protocols clearly ahead of time.

I question the authors’ claims that various replications are “exact.” Bem’s paper was published in 2011, so how can it be that experiments performed as early as 2003 are exact replications? That makes no sense. Just to get an idea of what was going on, I tried to find one of the earlier studies that was stated to be an exact replication. I looked up the paper by Savva et al. (2005), “Further testing of the precognitive habituation effect using spider stimuli.” I could not find this one but I found a related one, also on spider stimuli. In what sense is this an “exact replication” of Bem? I looked at the Bem (2011) paper, searched on “spider,” and all I could find is a reference to Savva et al.’s 2004 work.

This baffled me so I went to the paper linked above and searched on “exact replication” to see how they defined the term. Here’s what I found:

“To qualify as an exact replication, the experiment had to use Bem’s software without any procedural modifications other than translating on-screen instructions and stimulus words into a language other than English if needed.”

I’m sorry, but, no. Using the same software is not enough to qualify as an “exact replication.”

This issue is central to the paper at hand. For example, there is a discussion on page 18 on “the importance of exact replications”: “When a replication succeeds, it logically implies that every step in the replication ‘worked’ . . .”

Beyond this, the individual experiments have multiple comparisons issues, just as did the Bem (2011) paper. We see very few actual preregistrations, and my impression is that when something counts as a successful replication there is still a lot of wiggle room regarding data inclusion rules, which interactions to study, etc.

Who cares?

The ESP context makes this all look like a big joke, but the general problem of researchers creating findings out of nothing, that seems to be a big issue in social psychology and other research areas involving noisy measurements. So I think it’s worth holding a firm line on this sort of thing. I have a feeling that the authors of this paper think that if you have a p-value or Bayes factor of 10^-9 then your evidence is pretty definitive, even if some nitpickers can argue on the edges about this or that. But it doesn’t work that way. The garden of forking paths is multiplicative, and with enough options it’s not so hard to multiply up to factors of 10^-9 or whatever. And it’s not like you have to be trying to cheat; you just keep making reasonable choices given the data you see, and you can get there, no problem. Selecting ten-year-old papers and calling them “exact replications” is one way to do it.
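As a toy illustration of how selection alone can manufacture many sigmas (my simulation, not anything from the Bem et al. paper): give a thousand labs a true null effect, publish only the results that cross a significance threshold, and then meta-analyze what survives:

```python
import numpy as np

rng = np.random.default_rng(1)
n_labs, n_per = 1000, 50
se = 1.0 / np.sqrt(n_per)            # standard error of each study's estimate

# Every lab studies a true null effect; the estimates are pure noise.
estimates = rng.normal(0.0, se, size=n_labs)
z = estimates / se

# File drawer: only "promising" one-sided results get written up.
published = estimates[z > 1.64]

# Naive fixed-effect meta-analysis of the published studies:
pooled = published.mean()
pooled_se = se / np.sqrt(len(published))
pooled_z = pooled / pooled_se        # many sigmas, from nothing but selection
```

With roughly 5% of 1000 null studies surviving the filter, the pooled estimate sits many standard errors above zero, purely from the file drawer; no individual lab has to cheat for the aggregate to look overwhelming.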

P.S. I found the delightful image above by googling *bullwinkle crystal ball* but I can’t seem to track down who to give the credit to. Jay Ward, Alex Anderson, and Bill Scott, I suppose. It doesn’t seem to matter so much who actually got the screenshot and posted it on the web.

Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes

We had a discussion last month on the sister blog regarding the effects of subliminal messages on political attitudes.  It started with a Larry Bartels post entitled “Here’s how a cartoon smiley face punched a big hole in democratic theory,” with the subtitle, “Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments,” discussing the results of an experiment conducted a few years ago and recently published by Cengiz Erisen, Milton Lodge and Charles Taber. Larry wrote:

What were these powerful “irrelevant stimuli” that were outweighing the impact of subjects’ prior policy views? Before seeing each policy statement, each subject was subliminally exposed (for 39 milliseconds — well below the threshold of conscious awareness) to one of three images: a smiling cartoon face, a frowning cartoon face, or a neutral cartoon face. . . . the subliminal cartoon faces substantially altered their assessments of the policy statements . . .

I followed up with a post expressing some skepticism:

It’s clear that when the students [the participants in the experiment] were exposed to positive priming, they expressed more positive thoughts . . . But I don’t see how they make the leap to their next statement, that these cartoon faces “significantly and consistently altered [students'] thoughts and considerations on a political issue.” I don’t see a change in the number of positive and negative expressions as equivalent to a change in political attitudes or considerations.

I wrote:

Unfortunately they don’t give the data or any clear summary of the data from experiment No. 2, so I can’t evaluate it. I respect Larry Bartels, and I see that he characterized the results as the “subliminal cartoon faces substantially altered their assessments of the policy statements — and the resulting negative and positive thoughts produced substantial changes in policy attitudes.” But based on the evidence given in the paper, I can’t evaluate this claim.  I’m not saying it’s wrong. I’m just saying that I can’t express judgment on it, given the information provided.

Larry then followed up with a post saying that further information was in chapter 3 of Erisen’s Ph.D. dissertation, available online here.

And Erisen sent along a note which I said I would post. Erisen’s note is here:

As a close follower of the Monkey Cage, it is a pleasure to see some interest in affect, unconscious stimuli, and perceived (or registered) but unappreciated influences. Accordingly, I thought it was now the right time for me to contribute to the discussion.

First, I would like to begin by clarifying conceptual issues with respect to affective priming. Affective priming is not subliminal advertising, nor is it a subliminal message. Subliminal ads (or messages) were used back in the 1970s with questionable methods, and current priming studies rarely refer to these approaches.

Second, it is quite normal to be skeptical because no earlier research has attempted to address these kinds of issues in political science. When they first hear about affective influences, people may naturally consider the consequences for measuring political attitudes and political preferences. These conclusions may be especially meaningful for democratic theory, as mentioned by Larry Bartels in an earlier post.

But, fear not, this is not a stand-alone research study. Rather, it is part of an overall research program (Lodge and Taber, 2013) and there are various studies on unconscious stimuli and contextual effects. We name these “perceived but unappreciated effects” in our paper. In addition, we cite some other work on contextual cues (Berger et al., 2008), facial attractiveness (Todorov and Uleman, 2004), the “RATS” ad (Weinberger and Westen, 2008), the Willie Horton ad (Mendelberg, 2001), upbeat music or threatening images in political ads (Brader, 2006), which all provide examples of priming. There is a great deal of research in social psychology that offers other relevant examples of the social or political effects of affective primes.

Third, with respect to the outcomes, I would like to refer the reader to our path analyses (provided in the paper and in The Rationalizing Voter) that show the effects of affect-triggered thoughts on policy preferences (see below). What can be inferred from these results? We can say that, controlling for prior attitudes, affective primes not only directly affected policy preferences but also indirectly affected preferences through affect-evoked thoughts. The effects on political attitudes and preferences are significant, as we discuss in greater detail in the paper.


Fourth, these results were consistent across six experiments that I conducted for my dissertation. The priming procedure was about the same in all those studies, and the patterns across different dependent variables were quite similar.

Finally, we do not argue that voters cannot make decisions based on “enlightened preferences.” As we repeatedly state in the paper, affective cues color attitudes and preferences but this does not mean that voters’ decisions are necessarily irrational.

Both Bartels and Erisen posted path diagrams in support of their argument, so perhaps I should clarify that I’ve never understood these path diagrams. If an intervention has an effect on political attitudes, I’d like to see a comparison of the political attitudes with and without the intervention.  No amount of path diagrams will convince me until I see the direct comparison. You could argue with some justification that my ignorance in this area is unfortunate, but you should also realize that there are a lot of people like me who don’t understand those graphs—and I suspect that many of those people who do like and understand path diagrams would also like to see the direct comparisons too.  So, purely from the perspective of communication, I think it makes sense to connect the dots and not just show a big model without the intermediate steps.  Otherwise you’re essentially asking the reader to take your claims on faith.

Again, I’m not saying that Erisen is wrong in his claims, just that the evidence he’s shown me is too abstract to convince me. I realize that he knows a lot more about his experiment and his data than I do, and I’m pretty sure he is much more informed on this literature than I am, so I respect that he feels he can draw certain strong conclusions from his data. But, for me, I have to go on what information is available to me.

P.S. In his post, Larry also refers to the study by Andrew Healy, Neil Malhotra, and Cecilia Hyunjung Mo on college football games and election outcomes. That was an interesting study but, as I wrote when it came out a couple years ago, I think its implications were much less than was claimed at the time in media reports. Yes, people can be affected by irrelevant stimuli related to mood, but the magnitudes of such effects matter.

“How to disrupt the multi-billion dollar survey research industry”

David Rothschild (coauthor of the Xbox study, the Mythical Swing Voter paper, and of course the notorious Aapor note) will be speaking Friday 10 Oct in the Economics and Big Data meetup in NYC. His title: “How to disrupt the multi-billion dollar survey research industry: information aggregation using non-representative polling data.” Should be fun!

P.S. Slightly relevant to this discussion: Satvinder Dhingra wrote to me:

An AAPOR probability-based survey methodologist is a man who, when he finds non-probability Internet opt-in samples constructed to be representative of the general population work in practice, wonders if they work in theory.

My reply:

Yes, although to be fair they will say that they’re not so sure that these methods work in practice. To which my reply is that I’m not so sure that their probability samples work so well in practice either!

Some will spam you with a six-gun and some with a fountain pen


A few weeks ago the following came in the email:

Dear Professor Gelman,

I am writing you because I am a prospective doctoral student with considerable interest in your research. My name is Xian Zhao, but you can call me by my English name Alex, a student from China. My plan is to apply to doctoral programs this coming fall, and I am eager to learn as much as I can about research opportunities in the meantime.

I will be on campus next Monday, September 15th, and although I know it is short notice, I was wondering if you might have 10 minutes when you would be willing to meet with me to briefly talk about your work and any possible opportunities for me to get involved in your research. Any time that would be convenient for you would be fine with me, as meeting with you is my first priority during this campus visit.

Thanks you in advance for your consideration.

To which I’d responded as follows:

Hi, I’m meeting someone at 10am, you can come by at 9:50 and we can talk, then you can listen in on the meeting if you want.

And then I got this:

Dear Professor Gelman,

Thanks for your reply. I really appreciate your arranging to meet with me, but because of a family emergency I have to reschedule my visit. I apologize for any inconvenience this has caused you.


OK, at this point the rest of you can probably see where this story is going. But I didn’t think again about this until I received the following email yesterday:

Dear Professor Gelman,

A few weeks ago, you received an email from a Chinese student with the title “Prospective doctoral student on campus next Monday,” in which a meeting with you was requested. That email was part of our field experiment about how people react to Chinese students who present themselves by using either a Chinese name or an Anglicized name. This experiment was thoroughly reviewed and approved by the IRB (Institutional Review Board) of Kansas University. The IRB determined that a waiver of informed consent for this experiment was appropriate.

Here we will explain the purpose and expected results of this study. Many foreign students adopt Anglicized names when they come to the U.S., but little research has examined whether name selection affects how these individuals are treated. In this study, we are interested in whether the way a Chinese student presents him/herself could influence the email response rate, response speed, and the request acceptance rate from white American faculty members. The top 30 Universities in each social science and science area ranked by U.S. News & World Report were selected. We visited these department websites, including yours, and from the list of faculty we randomly chose one professor who appeared to be a U.S. citizen and White. You were either randomly assigned into the Alex condition in which the Chinese student in the email introduced him/herself as Alex or into the Xian condition in which the same Chinese student in the email presented him/herself as Xian (a Chinese name). Except for the name presented in the email, all other information was identical across these two conditions.

We predict that participants in the Alex condition will more often comply with the request to meet and respond more quickly than those in the Xian condition. But we also predict that because the prevalence of Chinese students is greater in the natural than social sciences in the U.S., faculty participants in the natural sciences will respond more favorably to Xian than faculty participants in the social sciences.

We apologize for not informing you that were participating in a study. Our institutional IRB deemed informed consent unnecessary in this case because of the minimal risk involved and because an investigation of this sort could not reasonably be done if participants knew, from the start, of our interest. We hope the email caused no more than minimal intrusion into your day, and that you quickly received the cancellation response if you agreed to meet.

Please note that no identifying information is being stored with our data from this study. We did keep a master list with your email address, and a corresponding participant number. But your response (yes or no, along with latency) was recorded in a separate file that does not contain your email address or any other identifying information about you. Please also note that we recognize there are many reasons why you may or may not have responded to the email, including scheduling conflicts, travel, etc. An individual response of “yes” or “no” to the meeting request actually tells us nothing about whether the name used by the bogus student had some influence. But in the aggregate, we can assess whether or not there is bias in favor of Chinese students who anglicize their names. We hope that this study will draw attention to how names can shape people’s reactions to others. Practically, the results may also shed light on the practices and policies of cultural adaptation.

Please know that you have the right to withdraw your response from this study at this time. If you do not want us to use your response in this study, please contact us by using the following contact information.

Thank you for taking the time to participating in this study. If you have questions now or in the future, or would like to learn the results of the study later in the semester [after November 30th], please contact one of the researchers below.

Xian Zhao, M.E.
Department of Psychology
University of Kansas
Lawrence, KS 66045

Monica Biernat, Ph.D.
Department of Psychology
University of Kansas
Lawrence, KS 66045

“Thank you for taking the time to participating in this study,” indeed. Thank you for not taking the time to proofread your damn email, pal.

I responded as follows to the email from “Xian Zhao, M.E.” and “Monica Biernat, Ph.D.”:

No problem. I know your time is as valuable as mine, so in the future whenever I get a request from a student to meet, I will forward that student to you. I hope you enjoy talking with statistics students, because you’re going to be hearing from a lot of them during the next few years!

I guess the next logical step is for me to receive an email such as:

Dear Professor Gelman,

A few weeks ago, you received an email from two scholars with the title, “Names and Attitudes Toward Foreigners: A Field Experiment,” which purportedly described an experiment which was done in which you involuntarily participated without compensation, an experiment in which you were encouraged to alter your schedule on behalf of a nonexistent student, thus decreasing by some small amount the level of trust between American faculty and foreign students, all for the purpose of somebody’s Ph.D. thesis. Really, though, this “experiment” was itself an experiment to gauge your level of irritation at this experience.

Yours, etc.

In all seriousness, this is how the world of research works: A couple of highly-paid professors from Ivy League business schools do a crap-ass study that gets lots of publicity. This filters down, and the next thing you know, some not-so-well-paid researchers in Kansas are doing the same thing. Sure, they might not land their copycat study into a top journal, but surely they can publish it somewhere. And then, with any luck, they’ll get some publicity. Hey, they already did!

Good job, Xian Zhao, M.E., and Monica Biernat, Ph.D. You got some publicity. Now could you stop wasting all of our time?

Thanks in advance.

Yours, etc.

P.S. In case you’re wondering, the above picture (from the webpage of Edward Smith, but I don’t know who actually made the image) was the very first link in a google image search on *waste of time*. Thanks, Google—you came through for me again!

P.P.S. No, no, I won’t really forward student requests to Zhao and Biernat. Not out of any concern for Z & B—perhaps handling dozens of additional student requests per week would keep them out of trouble—but because it would be a waste of the students’ time.

P.P.P.S. When I encountered the fake study by Katherine Milkman and Modupe Akinola a few years ago, I didn’t think much of it. It was a one-off and I didn’t change my pattern of interactions with students. But now that I’ve received two of these from independent sources, I’m starting to worry. Either there are lots and lots and lots of studies out there, or else the jokers who are doing these studies are each mailing this crap out to thousands and thousands of professors. But, hey, email is free, so why not, right?

P.P.P.P.S. I just got this email:

Dear Dr. Gelman,

Apologize again that this study brought you troubles and wasted your valuable time. Sincerely hope our apology can make you feel better.

Xian Zhao

I appreciate the apology but they still wasted thousands of people’s time.

On deck this week

Mon: Some will spam you with a six-gun and some with a fountain pen

Tues: Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes

Wed: Study published in 2011, followed by successful replication in 2003 [sic]

Thurs: Waic for time series

Fri: MA206 Program Director’s Memorandum

Sat: “An exact fishy test”

Sun: People used to send me ugly graphs, now I get these things

I can’t think of a good title for this one.

Andrew Lee writes:

I recently read in the MIT Technology Review about some researchers claiming to remove “bias” from the wisdom of crowds by focusing on those more “confident” in their views.

I [Lee] was puzzled by this result/claim because I always thought that people who (1) are more willing to reassess their priors and (2) are “hedgehogs” were more accurate forecasters.

I clicked through to the article and noticed this line: “tasks such as to estimate the length of the border between Switzerland and Italy, the correct answer being 734 kilometers.”

Ha! Haven’t they ever read Mandelbrot?


Estimating discontinuity in slope of a response function

Peter Ganong sends me a new paper (coauthored with Simon Jager) on the “regression kink design.” Ganong writes:

The method is a close cousin of regression discontinuity and has gotten a lot of traction recently among economists, with over 20 papers in the past few years, though less among statisticians.

We propose a simple placebo test based on constructing RK estimates at placebo policy kinks. Our placebo test substantially changes the findings from two RK papers (one of which is a revise-and-resubmit at Econometrica, by David Card, David Lee, Zhuan Pei, and Andrea Weber, and another of which is forthcoming in AEJ: Applied, by Camille Landais). If applied more broadly, I think it is likely to change the conclusions of other RK papers as well.

Regular readers will know that I have some skepticism about certain regression discontinuity practices, so I’m sympathetic to this line from Ganong and Jager’s abstract:

Statistical significance based on conventional p-values may be spurious.

I have not read this new paper in detail but, just speaking generally, I’d imagine it would be difficult to estimate a change in slope. It seems completely reasonable to me that slopes will be changing all the time—that’s just nonlinearity!—but unless the changes are huge, they’ve gotta be hard to estimate from data, and I’d think the estimates would be supersensitive to whatever else is included in the model.

The Ganong and Jager paper looks interesting to me. I hope that someone will follow it up with a more model-based approach focused on estimation and uncertainty rather than hypothesis testing and p-values. Ultimately I think there should be kinks and discontinuities all over the place.
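Here is a toy version of the nonlinearity worry (my sketch, not Ganong and Jager's actual placebo test): fit a piecewise-linear "kink" model at arbitrary placebo points along a smooth nonlinear response, and you recover substantial estimated slope changes that are pure curvature, not policy effects:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-1, 1, n)
y = np.exp(x) + rng.normal(0, 0.05, n)      # smooth response: no kink anywhere

def kink_estimate(x, y, k):
    """Slope change at k from fitting y ~ a + b*(x-k) + c*max(x-k, 0)."""
    X = np.column_stack([np.ones_like(x), x - k, np.maximum(x - k, 0.0)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[2]                          # c: the estimated kink

# Estimated "kinks" at placebo points along a perfectly smooth curve:
placebo_kinks = [kink_estimate(x, y, k) for k in (-0.5, 0.0, 0.5)]
```

Every placebo point on this smooth (convex) curve yields a clearly positive estimated slope change, which is the sense in which ordinary nonlinearity can masquerade as a kink unless the model accounts for it.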

What does CNN have in common with Carmen Reinhart, Kenneth Rogoff, and Richard Tol: They all made foolish, embarrassing errors that would never have happened had they been using R Markdown

Rachel Cunliffe shares this delight:


Had the CNN team used an integrated statistical analysis and display system such as R Markdown, nobody would’ve needed to type in the numbers by hand, and the above embarrassment never would’ve occurred.

And CNN should be embarrassed about this: it’s much worse than a simple typo, as it indicates they don’t have control over their data. Just like those Rasmussen pollsters whose numbers add up to 108%. I sure wouldn’t hire them to do a poll for me!

I was going to follow this up by saying that Carmen Reinhart and Kenneth Rogoff and Richard Tol should learn about R Markdown—but maybe that sort of software would not be so useful to them. Without the possibility of transposing or losing entire columns of numbers, they might have a lot more difficulty finding attention-grabbing claims to publish.

Ummm . . . I better clarify this. I’m not saying that Reinhart, Rogoff, and Tol did their data errors on purpose. What I’m saying is that their cut-and-paste style of data processing enabled them to make errors which resulted in dramatic claims which were published in leading journals of economics. Had they done smooth analyses of the R Markdown variety (actually, I don’t know if R Markdown was available back in 2009 or whenever they all did their work, but you get my drift), it wouldn’t have been so easy for them to get such strong results, and maybe they would’ve been a bit less certain about their claims, which in turn would’ve been a bit less publishable.

To put it another way, sloppy data handling gives researchers yet another “degree of freedom” (to use Uri Simonsohn’s term) and biases claims to be more dramatic. Think about it. There are three options:

1. If you make no data errors, fine.

2. If you make an inadvertent data error that works against your favored hypothesis, you look at the data more carefully and you find the error, going back to the correct dataset.

3. But if you make an inadvertent data error that supports your favored hypothesis (as happened to Reinhart, Rogoff, and Tol), you have no particular motivation to check, and you just go for it.

Put these together and you get a systematic bias in favor of your hypothesis.

Science is degraded by looseness in data handling, just as it is degraded by looseness in thinking. This is one reason that I agree with Dean Baker that the Excel spreadsheet error was worth talking about and was indeed part of the bigger picture.

Reproducible research is higher-quality research.

P.S. Some commenters write that, even with Markdown or some sort of integrated data-analysis and presentation program, data errors can still arise. Sure. I’ll agree with that. But I think the three errors discussed above are all examples of cases where an interruption in the data flow caused the problem, with the clearest example being the CNN poll, where, I can only assume, the numbers were calculated using one computer program, then someone read the numbers off a screen or a sheet of paper and typed them into another computer program to create the display. This would not have happened using an integrated environment.