
Cry of Alarm

Stan Liebowitz writes:

Is it possible to respond to a paper that you are not allowed to discuss?

The question above relates to some unusual behavior from a journal editor. As background, I [Liebowitz] have been engaged in a long running dispute regarding the analysis contained in an influential paper published in one of the top three economics journals, the Journal of Political Economy, in 2007. That paper was written by Harvard’s Felix Oberholzer-Gee and Kansas’s Koleman Strumpf and appeared to demonstrate that piracy did not reduce record sales. I have been suspicious of that paper for many reasons. Partly because the authors publicly claimed that they would make their download data public, but never did, and four years later they told reporters they had signed a non-disclosure agreement (while refusing to provide the agreement to those reporters). Partly because Oberholzer-Gee is a coauthor with (the admitted self-plagiarist) Bruno Frey of two virtually identical papers (one in the JPE and one in the AER) that do not cite one another. But mostly because OS have made claims that they either knew were false, or should have known were false.

Although I have been critical of OS (2007) since its publication, it was not until September of 2016 that I published a critique in one of the few economics journals willing to publish comments and replications, Econ Journal Watch (EJW). [I also have a replication of a portion of their paper not reliant on their download data, that is currently under review at a different journal.] The editors of EJW invited Oberholzer-Gee and Strumpf (OS) to submit a response to my critique, to be published concurrently with my critique, but OS instead published their defense in a different journal, Information Economics and Policy (IEP, an Elsevier journal behind a paywall).

OS’s choice of IEP was not surprising. Among other factors, the editor of the journal, Lisa George, was a student of Oberholzer-Gee (he served on her dissertation committee), had coauthored two papers with him, and listed him as one of four references on her CV. IEP clearly fast-tracked the OS paper—it was first submitted to the journal on October 13, and the final draft, dated October 26, thanked three referees and the editor. The paper was published in December, although it often takes over a year from submission to publication in IEP.[1]

I had spent years attempting to get OS to publicly answer questions about their paper, so I was delighted that OS finally publicly defended their paper. Their published defense still left many questions unanswered, however, such as why the reported mean value of their key instrument was four times as large as its true value, but at least OS were now on the record, trying to explain some of their questionable data and results.

As a critic of their work, I took their published defense as a vindication of my concerns. Although their defense was superficially plausible, and was voiced in a confident tone, it was chock full of errors. For example, in EJW I had noted that OS’s data on piracy, which was the main novelty of their analysis, exhibited unusual temporal variability. I knew that OS might claim that this variability was a byproduct from a process of matching their raw piracy data to data on album sales, so I measured the variability of their raw piracy data prior to the matching process, and included a paragraph in EJW explicitly noting that fact. Yet in IEP, OS mischaracterized my analysis and claimed that the surprisingly large temporal variability was due to the matching process. Not only was their claim about my analysis misleading, but their assertion that the matching process could have materially influenced the variability of their data was also incorrect, which was clearly revealed by visual inspection of the data and a correlation of 0.97 between the matched and unmatched series. The icing on the cake was their attempt to demonstrate the validity of their temporal data by claiming a +0.49 correlation of their weekly data with another data set they considered to be unusually reliable. In fact, the correct correlation between those data sets was ‑0.68 (my rejoinder provides the calculations, raw data, and copies of the web pages from which the data were taken). All these errors were found in just the first section of their paper, with later sections continuing in the same vein.

After I became aware of the OS paper in IEP, I contacted the IEP editor and complained that I had not been extended the courtesy of defending my article against their criticisms. Professor George seemed to understand that fair play would require at least the belated pretense of allowing me to provide a rejoinder:

I welcome a submission from you responding to the Oberholzer – Strumpf paper and indeed intended to contact you about this myself in the coming weeks.

She also seemed to be trying to inflate the impact factor of her journal:

As you might be aware, IEP contributors and readers have rather deep expertise in the area of piracy. I would thus [ask] that in your response you take care to cite relevant publications from the journal. I have found that taking care with the literature review makes the referee process proceed more smoothly.

The errors made by OS in IEP seemed so severe that I thought it likely that IEP would try to delay or reject my submission, both to protect OS and to protect the reputation of IEP’s refereeing process. Still, I had trouble envisioning the reasons IEP might give if it decided to reject my paper. I decided, therefore, to submit my rejoinder to IEP, but to avoid a decision dragging on for months or years, I emphatically told Professor George that I expected a quick decision and that I planned to withdraw the submission if I hadn’t heard within two months.

Wondering what grounds IEP might use to reject my paper indicated an apparent lack of imagination on my part. Although the referees did not find any errors in my paper, the editor told me that she was no longer interested in “continued debate on this one paper [OS, 2007]” and that such debate was “not helpful to researchers actively working in this area, or to IEP readers.” Apparently one side of the debate was useful to her readers in December, when she published the OS article, but that utility had presumably evaporated by January when it came to presenting the other side of the debate.

Since Professor George was supposedly planning to “invite” me to respond to OS’s article, she apparently feels the need to keep up that charade, and does so by redefining the meaning of the word “response.” She stated: “I want to emphasize that in rejecting your submission I did not shut the door on a response…IEP would welcome a new submission from you on the topic of piracy that introduces new evidence or connects existing research in novel ways.”

Apparently, I can provide a “response,” but I am not allowed to discuss the paper to which I am supposedly responding. That appears to be a rather Orwellian request.

I have complained to Elsevier about the incestuous and biased editorial process apparently afflicting IEP. We will see what comes of it. The bigger issue is the quality of the original OS article, the validity of which seems even more questionable than before, given the authors’ apparent inability to defend their analysis. This story is not yet over.

  1.   The other papers in that issue were first received March 2014, September 2015, February 2015, November 2015, and April 2016.

Wow. We earlier heard from Stan Liebowitz on economics and ethics here and here. The above story sounds particularly horrible but of course we’re only hearing one side of it here. So if any of the others involved in this episode (I guess that would be Oberholzer, Strumpf, or George) have anything to add, they should feel free to do so in the comments, or they could contact me directly.

P.S. I hope everyone’s liking the new blog titles. I’ve so far used the first five on the list. They work pretty well.

How important is gerrymandering? and How to most effectively use one’s political energy?

Andy Stein writes:

I think a lot of people (me included) would be interested to read an updated blog post from you on gerrymandering, even if your conclusions haven’t changed at all from your 2009 blog post [see also here]. Lots of people are talking about it now and Obama seems like he’ll be working on it this year and there’s a Tufts summer school course where they’re training quantitative PhDs to be expert witnesses. Initially, I thought it would be fun to attend, but as best I can tell from the limited reading I’ve done, it doesn’t seem like gerrymandering itself has that big of an effect. It seems to be that because Democrats like cities, even compact districts favor Republicans.

I’d also be curious to read a post from you on the most effective ways to use one’s political energy for something productive. The thing I’m trying to learn more about now is how I can help work on improving our criminal justice system on the state level, since state politics seem more manageable and less tribal than national politics.

Here’s what I wrote in 2009:

Declining competitiveness in U.S. House elections cannot be explained by gerrymandering. I’m not saying that “gerrymandering” is a good thing—I’d prefer bipartisan redistricting or some sort of impartial system—but the data do not support the idea that redistricting is some sort of incumbent protection plan or exacerbator of partisan division.

In addition, political scientists have frequently noted that Democrats and Republicans have become increasingly polarized in the Senate as well as in the House, even though Senate seats are not redistricted.

And here’s how Alan Abramowitz, Brad Alexander, and Matthew Gunning put it:

The increasing correlation among district partisanship, incumbency, and campaign spending means that the effects of these three variables tend to reinforce each other to a greater extent than in the past. The result is a pattern of reinforcing advantages that leads to extraordinarily uncompetitive elections.

I added:

I’m not saying that gerrymandering is always benign; there are certainly some places where it has been used to make districts with unnecessarily high partisan concentrations. But, in aggregate, that’s not what has happened, at least according to our research.

But that was then, etc., so it’s reasonable for Stein to ask what’s happened in the eight years since. The short answer is that I’ve not studied the problem. I’ve read some newspaper articles suggesting that a few states have major gerrymanders in the Republican party’s favor, but that’s no substitute for a systematic analysis along the lines of our 1994 paper. My guess (again, without looking at the data) is that gerrymandering in some states is currently giving a few seats to the Republicans in the House of Representatives but that it does not explain the larger pattern of polarization in Congress that we’ve seen in the past few years with party-line or near-party-line votes on health care policy, confirmations for cabinet nominees, etc.

That said, the redistricting system in the United States is inherently partisan, so it’s probably a good idea for activists to get involved on both sides so that the fight in every state is balanced.

Regarding your other question, on effective ways to use one’s political energy for something productive: I have no idea. Working on particular legislative battles can have some effect. Also, direct personal contact is supposed to make a difference: I guess that can involve directly talking with voters or political activists, or getting involved in activities and organizations that involve people talking with each other about politics. The other big thing is legislative primary election campaigns. I think that most primary elections are not seriously contested, and primaries can sometimes seem like a sideshow, but powerful incumbent politicians typically started off their careers by winning primary elections. So your primary campaign today could determine the political leaders of the future. And there’s also the indirect effect of influencing incumbent legislators who don’t want to lose in the primary.

All this counsel could apply to activists anywhere on the political spectrum. That said, I’d like to think of this as positive-sum advice in that (a) I hope that if activists on both sides are involved in redistricting, this will help keep the entire system fair, and (b) my advice regarding political participation should, if applied to both sides, keep politicians more responsive to the voters, which I think would be a net gain, even when some of these voters hold positions with which I disagree.

Lasso regression etc in Stan

Someone on the users list asked about lasso regression in Stan, and Ben replied:

In the rstanarm package we have stan_lm(), which is sort of like ridge regression, and stan_glm() with family = gaussian and prior = laplace() or prior = lasso(). The latter estimates the shrinkage as a hyperparameter while the former fixes it to a specified value. Again, there are possible differences in scaling but you should get good predictions. Also, there is prior = hs() or prior = hs_plus() that implement hierarchical shrinkage on the coefficients.
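For intuition about why a Laplace prior is the Bayesian counterpart of the lasso, here is a minimal Python sketch. It is not rstanarm or Stan code; it only covers the simplest case, assumed here for illustration: a single coefficient beta, one observation y ~ normal(beta, sigma), and a prior beta ~ Laplace(0, b). In that case the posterior mode is y soft-thresholded at sigma^2/b, which is the lasso solution with that penalty. The function names are hypothetical.

```python
import math


def neg_log_posterior(beta, y, sigma, b):
    """Negative log posterior (constants dropped) for the assumed toy model:
    y ~ normal(beta, sigma), beta ~ Laplace(0, b)."""
    return (y - beta) ** 2 / (2 * sigma ** 2) + abs(beta) / b


def soft_threshold(y, lam):
    """Shrink y toward zero by lam, and set it to zero if |y| <= lam."""
    return math.copysign(max(abs(y) - lam, 0.0), y)


def map_estimate(y, sigma, b):
    """Posterior mode in closed form: the lasso solution with penalty sigma^2 / b."""
    return soft_threshold(y, sigma ** 2 / b)


def grid_mode(y, sigma, b, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force check: minimize the negative log posterior over a grid."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return min(grid, key=lambda beta: neg_log_posterior(beta, y, sigma, b))


if __name__ == "__main__":
    for y in (2.0, 0.3, -2.0):
        print(y, map_estimate(y, sigma=1.0, b=2.0), grid_mode(y, sigma=1.0, b=2.0))
```

Note that this is a statement about the posterior mode. Full Bayesian inference, as in stan_glm(), works with the whole posterior, so the draws for a coefficient will not sit exactly at zero even when the mode does.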

We discussed horseshoe in Stan awhile ago, and there’s more to be said on this topic, including the idea of postprocessing the posterior inferences if there’s a desire to pull some coefficients all the way to zero. And informative priors on the scaling parameters: yes, these hyperparameters can be estimated from data alone, but such estimates can be unstable, and some prior information should be helpful. What we really need are a bunch of examples applying these models to real problems.

Identifying Neighborhood Effects

Dionissi Aliprantis writes:

I have just published a paper (online here) on what we can learn about neighborhood effects from the results of the Moving to Opportunity housing mobility experiment. I wanted to suggest the paper (and/or the experiment more broadly) as a topic for your blog, as I am hoping the paper can start some constructive conversations.

The article is called “Assessing the evidence on neighborhood effects from Moving to Opportunity,” and here’s the abstract:

The Moving to Opportunity (MTO) experiment randomly assigned housing vouchers that could be used in low-poverty neighborhoods. Consistent with the literature, I [Aliprantis] find that receiving an MTO voucher had no effect on outcomes like earnings, employment, and test scores. However, after studying the assumptions identifying neighborhood effects with MTO data, this paper reaches a very different interpretation of these results than found in the literature. I first specify a model in which the absence of effects from the MTO program implies an absence of neighborhood effects. I present theory and evidence against two key assumptions of this model: that poverty is the only determinant of neighborhood quality and that outcomes only change across one threshold of neighborhood quality. I then show that in a more realistic model of neighborhood effects that relaxes these assumptions, the absence of effects from the MTO program is perfectly compatible with the presence of neighborhood effects. This analysis illustrates why the implicit identification strategies used in the literature on MTO can be misleading.

I haven’t had a chance to read the paper, but I can share this horrible graph:

And Figure 4 is even worse!

But don’t judge a paper by its graphs. There could well be interesting stuff here, so feel free to discuss.


The Kangaroo with a feather effect

OK, guess the year of this quote:

Experimental social psychology today seems dominated by values that suggest the following slogan: “Social psychology ought to be and is a lot of fun.” The fun comes not from the learning, but from the doing. Clever experimentation on exotic topics with a zany manipulation seems to be the guaranteed formula for success which, in turn, appears to be defined as being able to effect a tour de force. One sometimes gets the impression that an ever-growing coterie of social psychologists is playing (largely for one another’s benefit) a game of “can you top this?” Whoever can conduct the most contrived, flamboyant, and mirth-producing experiments receives the highest score on the kudometer. There is, in short, a distinctly exhibitionistic flavor to much current experimentation, while the experimenters themselves often seem to equate notoriety with achievement.

It’s from Kenneth Ring, Journal of Experimental Social Psychology, 1967.

Except for the somewhat old-fashioned words (“zany,” “mirth”), the old-fashioned neologism (“kudometer”) and the lack of any reference to himmicanes, power pose, or “cute-o-nomics,” the above paragraph could’ve been written yesterday, or five years ago, or any time during the career of Paul Meehl.

Or, as authority figures Susan Fiske, Daniel Schacter, and Shelley Taylor would say, “Every few decades, critics declare a crisis, point out problems, and sometimes motivate solutions.”

I learned about the above Kenneth Ring quote from this recent post by Richard Morey who goes positively medieval on the recently retracted paper by psychology professor Will Hart, a case that was particularly ridiculous because it seems that the analysis in that paper was faked by the student who collected the data . . . but was not listed as a coauthor or even thanked in the paper’s acknowledgments!

In his post, Morey describes how bad this article was, as science, even if all the data had been reported correctly. In particular, he describes how the hypothesized effect sizes were much larger than could make sense based on common-sense reasoning, and how the measurements are too noisy to possibly detect reasonable-sized effects. These are problems we see over and over again; they’re central to the Type M and Type S error epidemic and the “What does not kill my statistical significance makes it stronger” fallacy. I feel kinda bad that Morey has to use, as an example, a retracted paper by a young scholar who probably doesn’t know any better . . . but I don’t feel so bad. The public record is the public record. If the author of that paper was willing to publish his paper, he should be willing to let it be criticized. Indeed, from the standpoint of the scientist (not the careerist), getting your papers criticized by complete strangers is one of the big benefits of publication. I’ve often found it difficult to get anyone to read my draft articles, and it’s a real privilege to get people like Richard Morey to notice your work and take the trouble to point out its fatal flaws.

Oh, and by the way, Morey did not find these flaws in response to that well-publicized retraction. The story actually happened in the opposite order. Here’s Morey:

When I got done reading the paper, I immediately requested the data from the author. When I heard nothing, I escalated it within the University of Alabama. After many, many months with no useful response (“We’ll get back to you!”), I sent a report to Steve Lindsay at Psychological Science, who, to his credit, acted quickly and requested the data himself. The University then told him that they were going to retract the paper…and we never even had to say why we were asking for the data in the first place. . . .

The basic problem here is not the results, but the basic implausibility of the methods combined with the results. Presumably, the graduate student did not force Hart to measure memory using four lexical decision trials per condition. If someone claims to have hit a bullseye from 500m in hurricane-force winds with a pea-shooter, and then claims years later that a previously-unmentioned assistant faked the bullseye, you’ve got a right to look at them askance.

At this point I’d like to say that Hart’s paper should never have been accepted for publication in the first place—but that misses the point, as everything will get published, if you just keep submitting it to journal after journal. If you can’t get it into Nature, go for Plos-One, and if they turn you down, there’s always Psychological Science or JPSP (but that’ll probably only work if you’re (a) already famous and (b) write something on ESP).

The real problem is that this sort of work is standard operating practice in the field of psychology, no better and no worse (except for the faked data) than the papers on himmicanes, air rage, etc., endorsed by the prestigious National Academy of Sciences. As long as this stuff is taken seriously, it’s still a crisis, folks.

Stan and BDA on actuarial syllabus!

Avi Adler writes:

I am pleased to let you know that the Casualty Actuarial Society has announced two new exams and released their initial syllabi yesterday. Specifically, 50%–70% of the Modern Actuarial Statistics II exam covers Bayesian Analysis and Markov Chain Monte Carlo. The official text we will be using is BDA3 and while we are not mandating the use of a software package, we are strongly recommending Stan:

The first MASII exam will be given in 2018 and things may change, but now that the information has been released I hope you find this of use as a response to your requests for “Stan usage in the wild.”

The insurance industry isn’t the wildest thing out there. Still, I’m happy to hear this news.

Aaron Kaufman reviews Luke Heaton’s “A Brief History of Mathematical Thought”

I got this book in the mail. It looked cool but I didn’t feel I had time to read it. A few decades ago I read this wonderful book by Morris Kline, “Mathematics: The Loss of Certainty,” so I figured I’d have a sense of what most of Heaton’s new book would cover. I would’ve just read the last chapter to see what’s been happening in math during the past 30 years, but then Aaron happened to come by my office and it seemed to make more sense for him to review the whole thing.

So here’s Aaron’s review:

Luke Heaton’s A Brief History of Mathematical Thought is a delightful, accessible, and (mostly) well-written jaunt through foundational mathematics. Its greatest strength lies in its sense of narrative: Heaton draws a curvilinear path, starting with the Babylonians and Egyptians, passing through the Greeks, Romans, and the mathematicians of the Arab world, and encompassing everything from mechanical calculus to non-Euclidean geometry and computation.

Despite his tendency to wax confusingly poetic, his conceptual explanations are brilliantly lucid and accompanied by concrete examples. In Chapter 2 Heaton describes Euclid’s Algorithm, a method for determining the ratio of lengths, and its fundamental contributions to early art and architecture. To find the ratio of length a to length b, construct a rectangle with sides of those lengths. Then, Heaton charmingly describes the following steps:

1. Draw the largest square that fits inside your rectangle.

2. Squeeze in as many of those squares as you possibly can. If you can fill the entire rectangle with your squares, you are finished. Otherwise, there will be a rectangular space that has not been covered by squares. We now take the remaining rectangle and apply step 1.

Having subdivided the rectangle completely, the size of the smallest square is the greatest common divisor of the lengths of the two sides. In such parsimonious language, Heaton discusses methods for calculating square roots, functions of π, and the Golden Ratio, as well as elucidating infinite series, the fundamental theorem of calculus, P = NP, and many other concepts.
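The two steps above can be sketched in a few lines of Python. This is only an illustration, assuming integer side lengths; the function names are hypothetical and not taken from Heaton's book.

```python
def euclid_squares(a, b):
    """Fill an a-by-b rectangle with squares, largest first.

    Returns a list of (side, count) pairs, one per round of steps 1 and 2.
    Assumes a and b are positive integers.
    """
    if a <= 0 or b <= 0:
        raise ValueError("side lengths must be positive integers")
    long_side, short_side = max(a, b), min(a, b)
    squares = []
    while short_side > 0:
        # Step 1: the largest square that fits has the shorter side's length.
        # Step 2: squeeze in as many of those squares as possible.
        count, remainder = divmod(long_side, short_side)
        squares.append((short_side, count))
        # Whatever is left over is a smaller rectangle; repeat on it.
        long_side, short_side = short_side, remainder
    return squares


def gcd_from_squares(a, b):
    """The side of the smallest square used is the greatest common divisor."""
    return euclid_squares(a, b)[-1][0]


if __name__ == "__main__":
    print(euclid_squares(21, 6))
    print(gcd_from_squares(21, 6))
```

For a 21-by-6 rectangle this gives three squares of side 6 followed by two squares of side 3, so the greatest common divisor is 3. The counts (3, then 2) are also the continued-fraction expansion of the ratio 21/6 = 3 + 1/2, which is the sense in which the procedure determines a ratio of lengths.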

In his concluding chapter titled “Lived Experience and the Nature of Fact”, which covers everything from language, truth, and society to the nature of purpose and meaning, Heaton goes a little far afield. While he is uniquely both well-read in his history and adept at providing intuition, he is clearly possessed of an awe for the beauty in the mathematical world which he struggles to impart. But none of that is to detract from the huge amount of value there is in understanding a subject in its historical context. No matter how much my dad likes the Grateful Dead, it’s impossible for me to appreciate them as much as he does, absent that particular zeitgeist. In the same way, much of the beauty of non-Euclidean space and partial derivatives can only be fully grasped relative to what came before. This is where Heaton’s work shines.

They’re not kidding when they say that math is “unreasonably effective.” I talk a lot about the substitution of programming for math (my slogan is, “In the 20th century if you wanted to do statistics you had to be a bit of a mathematician, whether you wanted to or not; in the 21st century if you want to do statistics you have to be a bit of a programmer, whether you want to or not”), but math is needed to make a lot of the programs work. I’m not talking about the silly math with “guarantees” and all that; I’m talking about the real stuff that drives NUTS and VB and EP and all the rest. Also the simpler math that I throw up on the board in my survey sampling class, the sqrt(1/n)’s and the 12^2 + 5^2 = 13^2 and all the rest.

As a frequent user of math, I have a much better sense of its role in life than I used to, back when I was a student. I guess it could be worth rereading Kline’s book, seeing if it all makes sense, and thinking about what its hypothetical updated final chapter should say.

Measurement error and the replication crisis

Alison McCook from Retraction Watch interviewed Eric Loken and me regarding our recent article, “Measurement error and the replication crisis.” We talked about why traditional statistics are often counterproductive to research in the human sciences.

Here’s the interview:

Retraction Watch: Your article focuses on the “noise” that’s present in research studies. What is “noise” and how is it created during an experiment?

Andrew Gelman: Noise is random error that interferes with our ability to observe a clear signal. It can have many forms, including sampling variability by using small samples, or unexplained error from unmeasured factors, or measurement error from poor instruments for the things you do want to measure. In everyday life we take measurement for granted – a pound of onions is a pound of onions. But in science, and maybe especially social science, we observe phenomena that vary from person to person, that are affected by multiple factors, and that aren’t transparent to measure (things like attitudes, dispositions, abilities). So our observations are much more variable.

Noise is all the variation that you don’t happen to be currently interested in. In psychology experiments, noise typically includes measurement error (for example, ask the same person the same question on two different days, and you can get two different answers, something that’s been well known in social science for many decades) and also variation among people.

RW: In your article, you “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.” What do you mean by that?

AG: We blogged about the “What does not kill my statistical significance makes it stronger” fallacy here. As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. And we also know that noisy data and small sample sizes make statistical significance even harder to attain. So if you do get statistical significance under such inauspicious conditions, it’s tempting to think of this as even stronger evidence that you’ve found something real. This reasoning is erroneous, however. Statistically speaking, a statistically significant result obtained under highly noisy conditions is more likely to be an overestimate and can even be in the wrong direction. In short: a finding from a low-noise study can be informative, while the finding at the same significance level from a high-noise study is likely to be little more than . . . noise.
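To make the overestimate and wrong-direction claims concrete, here is a minimal Python sketch. It assumes an unbiased estimate with a normal sampling distribution and a known standard error, and it computes three things for a hypothesized true effect: the power, the probability that a statistically significant estimate has the wrong sign (Type S error), and the expected factor by which a significant estimate overstates the true effect (Type M error, or exaggeration ratio). The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, type_s, exaggeration) for a positive true effect.

    Assumes estimate ~ normal(true_effect, se) and a two-sided test at
    level alpha. Everything is computed in closed form from the normal
    distribution, so no simulation is needed.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)        # significance threshold in se units
    shift = true_effect / se              # true effect in se units
    p_pos = 1 - std.cdf(z - shift)        # significant, correct sign
    p_neg = std.cdf(-z - shift)           # significant, wrong sign
    power = p_pos + p_neg
    type_s = p_neg / power
    # E[|estimate| ; significant] in se units, from truncated-normal means.
    mean_abs = (shift * p_pos + std.pdf(z - shift)) - (shift * p_neg - std.pdf(-z - shift))
    exaggeration = (mean_abs / power) / shift
    return power, type_s, exaggeration


if __name__ == "__main__":
    # Same hypothesized true effect, two different noise levels.
    print("high noise:", retrodesign(true_effect=1.0, se=5.0))
    print("low noise: ", retrodesign(true_effect=1.0, se=1.0 / 3.0))
```

In the high-noise setting the power is barely above alpha, a substantial fraction of the significant results have the wrong sign, and the significant estimates overstate the true effect many times over. In the low-noise setting the same calculation gives high power, essentially no sign errors, and an exaggeration ratio close to 1.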

RW: Which fields of research are most affected by this assumption, and the influence of noise?

AG: The human sciences feature lots of variation among people, and difficulty of accurate measurements. So psychology, education, and also much of political science, economics, and sociology can have big issues with variation and measurement error. Not always — social science also deals in aggregates — but when you get to individual data, it’s easy for researchers to be fooled by noise — especially when they’re coming to their data with a research agenda, with the goal of wanting to find something statistically significant that can get published too.

We’re not experts in medical research but, from what we’ve heard, noise is a problem there too. The workings of the human body might not differ so much from person to person, but when effects are small and measurement is variable, researchers have to be careful. Any example where the outcome is binary (life or death, or recovery from disease or not) will be tough, because yes/no data are inherently variable when there’s no in-between state to measure.

A recent example from the news was the PACE study of treatments for chronic fatigue syndrome: there’s been lots of controversy about outcome measurements, statistical significance, and specific choices made in data processing and data analysis — but at the fundamental level this is a difficult problem because measures of success are noisy and are connected only weakly to the treatments and to researchers’ understanding of the disease or condition.

RW: How do your arguments fit into discussions of replications — ie, the ongoing struggle to address why it’s so difficult to replicate previous findings?

AG: When a result comes from little more than noise mining, it’s not likely to show up in a preregistered replication. I support the idea of replication if for no other reason than the potential for replication can keep researchers honest. Consider the strategy employed by some researchers of twisting their data this way and that in order to find a “p less than .05” result which, when draped in a catchy theory, can get published in a top journal and then get publicized on NPR, Gladwell, Ted talks, etc. The threat of replication changes the cost-benefit on this research strategy. The short- and medium-term benefits (publication, publicity, jobs for students) are still there, but there’s now the medium-term risk that someone will try to replicate and fail. And the more publicity your study gets, the more likely someone will notice and try that replication. That’s what happened with “power pose.” And, long-term, enough failed replications and not too many people outside the National Academy of Sciences and your publisher’s publicity department are going to think what you’re doing is even science.

That said, in many cases we are loath to recommend pre-registered replication. This is for two reasons: First, some studies look like pure noise. What’s the point of replicating a study that is, for statistical reasons, dead on arrival? Better to just move on. Second, suppose someone is studying something for which there is an underlying effect, but his or her measurements are so noisy, or the underlying phenomenon is so variable, that it is essentially undetectable given the existing research design. In that case, we think the appropriate solution is not to run the replication, which is unlikely to produce anything interesting (even if the replication is a “success” in having a statistically significant result, that result itself is likely to be a non-replicable fluke). It’s also not a good idea to run an experiment with much larger sample size (yes, this will reduce variance but it won’t get rid of bias in research design, for example when data-gatherers or coders know what they are looking for). The best solution is to step back and rethink the study design with a focus on control of variation.

RW: Anything else you’d like to add?

AG: In many ways, we think traditional statistics, with its context-free focus on distributions and inferences and tests, has been counterproductive to research in the human sciences. Here’s the problem: A researcher does a small-N study with noisy measurements, in a setting with high variation. That’s not because the researcher’s a bad guy; there are good reasons for these choices: Small-N is faster, cheaper, and less of a burden on participants; noisy measurements are what happen if you take measurements on people and you’re not really really careful; and high variation is just the way things are for most outcomes of interest. So, the researcher does this study and, through careful analysis (what we might call p-hacking or the garden of forking paths), gets a statistically significant result. The natural attitude is then that noise was not such a problem; after all, the standard error was low enough that the observed result was detected. Thus, retroactively, the researcher decides that the study was just fine. Then, when it does not replicate, lots of scrambling and desperate explanations. But the problem (the original sin, as it were) was the high noise level. It turns out that the attainment of statistical significance cannot and should not be taken as retroactive evidence that a study’s design was efficient for research purposes. And that’s where the “What does not kill my statistical significance makes it stronger” fallacy comes back in.

P.S. Discussion in comments below reveals some ambiguity in the term “measurement error” so for convenience I’ll point to the Wikipedia definition which pretty much captures what Eric and I are talking about:

Observational error (or measurement error) is the difference between a measured value of quantity and its true value. In statistics, an error is not a “mistake”. Variability is an inherent part of things being measured and of the measurement process. Measurement errors can be divided into two components: random error and systematic error. Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measures of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (as of observation or measurement) inherent in the system. Systematic error may also refer to an error having a nonzero mean, so that its effect is not reduced when observations are averaged.

In addition, many of the points we make regarding measurement error also apply to variation. Indeed, it’s not generally possible to draw a sharp line between variation and measurement error. This can be seen from the definition on Wikipedia: “Observational error (or measurement error) is the difference between a measured value of quantity and its true value.” The demarcation between variation and measurement error depends on how the true value is defined. For example, suppose you define the true value as a person’s average daily consumption of fish over a period of a year. Then if you ask someone how much fish they ate yesterday, this could be a precise measurement of yesterday’s fish consumption but a noisy measure of their average daily consumption over the year.

P.P.S. Here is some R code related to our paper.

Storytelling as predictive model checking

I finally got around to reading Adam Begley’s biography of John Updike, and it was excellent. I’ll have more on that in a future post, but for now I just want to share the point, which I’d not known before, that almost all of Updike’s characters and even the descriptions and events in many of his stories derived from particular people he’d known and places he’d been. Having read different stories by Updike in no particular order and at different times in my own life, I’d not put that together.

Today’s post is not about Updike at all, though, but rather about a completely different style of writing, which we also see in many forms, which is storytelling as exploration, in which the author starts with a character or scenario that might be drawn from life or even “ripped from the headlines” and then uses this as a starting point to explore what might happen next. The writing is a way to map out possibilities in a way that follows some narrative or other structural logic.

There are different ways of doing this as a storyteller. You can start from the situation and work from there in a sort of random walk, or maybe I should say an autoregressive process in that the story will typically drift back to reality or to some baseline measure. (I’m reminded of my point from several years ago that the best and most classic alternative history stories have the feature in common that, in these stories, our “real world” always ends up as the deeper truth.) Or you can set up an intricate plan that links individuals to social history, as was done so brilliantly in The Rotters’ Club.

I got some insight into all this recently when reading a posthumous collection of Donald Westlake’s nonfiction writing. (That’s the book where I encountered this list, which observant readers may have noticed I’ve been using for new post titles.) Somewhere in this book, I don’t remember where, it comes out that Westlake did not plot his novels ahead of time; instead he’d just start out with an idea and then go from there, seeing where the story led him. This was a surprise to me because Westlake’s novels have such great plots, I’d’ve thought this would’ve required careful planning. Upon reflection, though, the plot-as-you-go-along scenario seemed more plausible. At the purely practical level, the guy had tons and tons of experience: he’d skied down all the runs before so he could find his way without a map. And at a more theoretical level—and that’s why I’m bringing all this up here—one could say, with reason, that the development of a story is a working-out of possibilities, and that’s why it makes sense that authors can be surprised at how their own stories come out.

Starting at the beginning and going from there: This can be a surprisingly effective strategy, especially if you’ve done it a few times before, and if you have a bit of structure. Structure can work in a direct way: from page 1, Richard Stark knows that Parker’s gonna get out alive by the end, so it’s just a matter of figuring out how he gets there (I just reread Slayground the other day, which was fun but it got me sad to realize that I no longer have enough days forthcoming in my life to reread all the books on my shelf). Or structure can work indirectly, as in Westlake’s novel Memory (posthumously published but written in the early 1960s), which brilliantly works against various expectations of how the story will develop and resolve. In either case, though, if you start with confidence that you’ll get through it and you have the technical tools, you can do it.

(Indeed, the thoughts that led to the present post arose indirectly from the following email I received the other day from Zach Horne, a man I’ve never met. Horne wrote:

I regularly read your blog and have recently started using Stan. One thing that you’ve brought up in the discussion of nhst is the idea that hypothesis testing itself is problematic. However, because I am an experimental psychologist, one thing I do (or I think I’m doing anyway) is conduct experiments with the aim of testing some hypothesis or another. Given that I am starting to use Stan and moving away from nhst, how would you recommend that experimentalists like myself discuss their findings since hypothesis testing itself may be problematic?

My reply was that this is a great question and I will blog it. I’m looking forward to my reply because I’m curious about the answer to this one too. Like Donald Westlake, I’ll start at the beginning, go from there, and see where the story ends up.)

Anyway, to return to the main thread:

If storytelling is the working out of possible conclusions following narrative logic applied to some initial scenario, then this can be seen as a predictive endeavor, in the statistical sense. Or as Bayesian reasoning: not in the canonical sense of inference about parameters or models conditional on data, but Bayesian inference for predictive quantities conditional on a model which in this case is unstated but is implicitly coded in what I was calling “narrative logic.”

In statistics, one reason we make predictions is to do predictive checks, to elucidate the implications of a model, in particular what it says (probabilistically) regarding observable outcomes, which can then be checked with existing or new data.

To put it in storytelling terms, if you tell a story and it leads to a nonsensical conclusion, this implies there’s something wrong with your narrative logic or with your initial scenario.
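For the statistical side of the analogy, a predictive check can be sketched in a few lines. This is a simplified illustration, not a full Bayesian analysis: the model parameters are held fixed rather than drawn from a posterior distribution, and the function names, the data, and the choice of test statistic are all made up for the example.

```python
import random

def predictive_p_value(y_obs, simulate, statistic, n_rep=2000, seed=0):
    """Share of datasets simulated from the model whose statistic is at
    least as large as the statistic of the observed data."""
    rng = random.Random(seed)
    t_obs = statistic(y_obs)
    exceed = sum(statistic(simulate(rng, len(y_obs))) >= t_obs
                 for _ in range(n_rep))
    return exceed / n_rep

def simulate_std_normal(rng, n):
    # Assumed model: independent draws from Normal(0, 1).
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def max_abs(y):
    # Test statistic: the most extreme observation.
    return max(abs(v) for v in y)

# Made-up data with one wild value.
y_obs = [0.3, -1.1, 0.8, 6.0, -0.4, 1.2, -0.7, 0.1]
p = predictive_p_value(y_obs, simulate_std_normal, max_abs)
print(p)
```

A value of `p` near zero says the model almost never produces anything as extreme as what was observed, which is the statistical counterpart of a story arriving at a nonsensical conclusion.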

In one of my articles with Thomas Basbøll, we discuss the idea of stories as predictive checks, there focusing on the idea that good stories are anomalous and immutable. Anomalousness is relevant in that we learn from stories to the extent they force us to grapple with the unexpected, and immutability is important so that surprising aspects of reality cannot be explained away in trivial fashions. (That’s where copyist Karl Weick went wrong: by repeatedly changing his story to suit his audiences, he removed the immutability that could’ve allowed the story to help him learn about flaws in his understanding of reality.)

P.S. Tyler Cowen illustrates the general point in this recent post.

Pizzagate update: Don’t try the same trick twice or people might notice

I’m getting a bit sick of this one already (hence image above; also see review here from Jesse Singal) but there are a couple of interesting issues that arose in recent updates.

Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good

Susan Fiske, Daniel Schacter, and Shelley Taylor write (link from Retraction Watch):

Psychology is not in crisis, contrary to popular rumor. Every few decades, critics declare a crisis, point out problems, and sometimes motivate solutions. When we were graduate students, psychology was in “crisis,” raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Researchers rose to the challenges, and psychological science soldiered on.

This decade, the crisis implicates biomedical, social, and behavioral sciences alike, and the focus is replicability. First came a few tragic and well-publicized frauds; fortunately, they are rare—though never absent from science conducted by humans—and they were caught. Now the main concern is some well-publicized failures to replicate, including some large-scale efforts to replicate multiple studies, for example in social and cognitive psychology. National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.

All this is normal science, not crisis. A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.

Annual Reviews provides an additional remedy that is also from the annals of normal science: the expert, synthetic review article. As part of the cycle of discovery, novel findings attract interest, programs of research develop, scientists build on the basic finding, and inevitably researchers discover its boundary conditions and limiting mechanisms. Expert reviewers periodically step in, assess the state of the science—including both dead ends and well-established effects—and identify new directions. Crisis or no crisis, the field develops consensus about the most valuable insights. As editors, we are impressed by the patterns of discovery affirmed in every Annual Review article.

On the plus side, I’m glad that Fiske is no longer using the term “terrorist” to describe people who have scientific disagreements with her. That was a bad move on her part. I don’t think she’s ever apologized, but if she stops doing it, that’s a start.

On the minus side, I find the sort of vague self-congratulatory happy talk in the above passage to be contrary to the spirit of scientific inquiry. Check this out:

National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.

What a touching faith in committees. W. H. Auden and Paul Meehl would be spinning in their graves.

Meanwhile, PPNAS publishes papers on himmicanes, air rage, and “People search for meaning when they approach a new decade in chronological age.” The National Academy of Sciences is supposed to be a serious organization, no? Who’s in charge there?

The plan to rely on the consensus of authority figures seems to me to have the fatal flaw that some of the authority figures endorse junk science. How exactly do Fiske, Schacter, and Taylor plan to “uphold standards” and improve graduate training, when purportedly authoritative sources continue to publish low-quality papers? What do they say if Bem’s ESP experiment makes it into a textbook? Should psychology students do power poses as part of their graduate training? A bit of embodied cognition, anyone?

I really don’t see how they can plan to do better in the future if they refuse to admit any specific failures of the past and present.

Also this:

When we were graduate students, psychology was in “crisis,” raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Researchers rose to the challenges, and psychological science soldiered on.

Yes, it “soldiered on,” through the celebrated work of Bargh, Baumeister, Bem . . . Is that considered a good thing, to soldier on in this way? Some of the messages of “measurement validity, theoretical rigor, and realistic applicability” didn’t seem to get through, even to leaders in the field. It really does seem like a crisis—not just a “crisis”—that all this work got so much respect. And did you hear, people are still wasting their time trying to replicate power pose?

And this:

A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.

Yes to this last sentence but no no no no no no no to the sentence before. Or, I should say, not always. To paraphrase a famous psychiatrist, sometimes a replication failure is just a replication failure. The key mistake in the above quote is the assumption that there was necessarily something there in the original study. Remember the time-reversal heuristic: in many of these cases, there’s no reason to believe the original published study has any value at all. Talk of “limiting conditions and contextual effects” is meaningless in the contexts of phenomena that have never been established in the first place.

If you want to move forward, you have to let go of some things. Not all broken eggs can be made into omelets. Sometimes they’re just rotten and have to be thrown out.

I was gonna write a post entitled, “Unlocking past collaboration: student use affects mood and happiness,” but it didn’t seem worth the bother

Ivan Oransky points us to this hilarious story of a retracted paper in Psychological Science. The hilarious part is not the article itself (a dry-as-dust collection of small-N experiments with open-ended data-exclusion and data-analysis rules, accompanied by the usual scattering of statistically significant p-values in the garden) or even the reason for the retraction:

HMMs in Stan? Absolutely!

I was having a conversation with Andrew that went like this yesterday:

Hey, someone’s giving a talk today on HMMs (that someone was Yang Chen, who was giving a talk based on her JASA paper Analyzing single-molecule protein transportation experiments via hierarchical hidden Markov models).

Maybe we should add some specialized discrete modules to Stan so we can fit HMMs. Word on the street is that Stan can’t fit models with discrete parameters.

Uh, we can already fit HMMs in Stan. There’s a section in the manual that explains how (section 9.6, Hidden Markov Models).

We’ve had some great traffic on the mailing list lately related to movement HMMs, which were all the rage at ISEC 2016 for modeling animal movement (e.g., whale diving behavior in the presence or absence of sonar).

But how do you marginalize out the discrete states?

The forward algorithm (don’t even need the backward part for calculating HMM likelihoods). It’s a dynamic programming algorithm that calculates HMM likelihoods in \mathcal{O}(N \, K^2), where N is the number of time points and K the number of HMM states.

So you built that in C++ somehow?

Nope, just straight Stan code. It’s hard to generalize to C++ given that the state emissions models vary by application and the transitions can involve predictors. Like many of these classes of models, Stan’s flexible enough to, say, implement all of the models described in Roland Langrock et al.’s moveHMM package for R. The doc for that package is awesome, the best intro to the models I’ve seen. I like the concreteness of resolving details in code. Langrock’s papers are nice, too, so I suspect his co-authored book on HMMs for time series is probably worth reading, too. That’s another one that’d be nice to translate to Stan.
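To give a sense of how the forward algorithm mentioned above marginalizes out the discrete states, here is a minimal sketch in Python. It is not the Stan program from the manual; the function names and the toy probabilities are made up for illustration, and the emission densities are assumed to have been evaluated already for each time point and state.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def hmm_log_likelihood(log_init, log_trans, log_emit):
    """Log likelihood of the observations with hidden states summed out.

    log_init[k]     : log probability of starting in state k
    log_trans[j][k] : log probability of moving from state j to state k
    log_emit[n][k]  : log density of observation n under state k
    Cost is O(N K^2) for N time points and K states.
    """
    K = len(log_init)
    # alpha[k] = log p(y_1..y_n, z_n = k), updated one time point at a time
    alpha = [log_init[k] + log_emit[0][k] for k in range(K)]
    for n in range(1, len(log_emit)):
        alpha = [
            log_sum_exp([alpha[j] + log_trans[j][k] for j in range(K)])
            + log_emit[n][k]
            for k in range(K)
        ]
    return log_sum_exp(alpha)

# Toy example: 2 states, 2 time points, hypothetical probabilities.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)]]
log_emit = [[math.log(0.5), math.log(0.1)],
            [math.log(0.4), math.log(0.3)]]
ll = hmm_log_likelihood(log_init, log_trans, log_emit)
print(ll)
```

Only the forward pass is needed for the likelihood; the discrete states never appear as parameters, which is why a sampler that handles only continuous parameters can still fit the model.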

More HMMs in Stan coming soon (I hope)

Case study on movement HMMs coming soon, I hope. And I’ll pull out the HMM stuff into a state-space chapter. Right now, it’s spread between the CJS ecology models (states are which animals are alive) in the latent discrete parameters chapter and HMMs in the time-series chapter. I’m also going to try to get some of the more advanced sufficient statistics, N-best, and chunking algorithms I wrote for LingPipe going in Stan along with some semi-supervised learning methods I used to work on in NLP. I’m very curious to compare the results in Stan to those derived by plugging in (penalized) maximum likelihood parameter estimates.

Getting working movement HMM code

Ben Lambert and I managed to debug the model through a mailing list back and forth, which also contains code for the movement HMMs (which also use a cool wrapped Cauchy distribution that Ben implemented as a function). The last word on that thread has the final working models with covariates.


We should be able to fit Yang Chen’s hierarchical HMMs in Stan. She assumed a univariate normal emission density that was modeled hierarchically across experiments, with a shared transition matrix involving only a few states. We wouldn’t normally perform her model selection step; see Andrew’s previous post on overfitting cluster analysis which arose out of a discussion I had with Ben Bolker at ISEC.

You can fit hidden Markov models in Stan (and thus, also in Stata! and Python! and R! and Julia! and Matlab!)

You can fit finite mixture models in Stan; see section 12 of the Stan manual.

You can fit change point models in Stan; see section 14.2 of the Stan manual.

You can fit mark-recapture models in Stan; see section 14.2 of the Stan manual.

You can fit hidden Markov models in Stan; see section 9.6 of the Stan manual. You’ll probably want to start with the subsection on Semisupervised Estimation on page 172, take a look at that Stan program, and then read forward to see how to do prediction and read backward to see the program built up in stages.

Also this: “The program is available in the Stan example model repository; see,” but I don’t know the name of the model so I don’t know how to find it in that big pile of programs.

You can fit all sorts of models in Stan; the above list just gives a sense of some of the latent discrete parameter models that can be fit in Stan. Sometimes people are under the impression that you can’t fit discrete-parameter models in Stan. Actually, you can, as long as you can sum over the possibilities in computing the target function. It’s no problem doing this in all the models listed above. And this summing makes the algorithms converge faster too.
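As an illustration of what “sum over the possibilities” means, here is a minimal Python sketch for a finite normal mixture. It is not code from the Stan manual; the function names and data values are made up. Each observation’s unknown component label is never sampled; instead its contribution to the log likelihood is the log of a weighted sum over the components.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def mixture_log_lik(ys, weights, mus, sigmas):
    """Log likelihood of a finite normal mixture with the discrete
    component label of each observation summed out."""
    total = 0.0
    for y in ys:
        # One term per possible value of the discrete label.
        terms = [math.log(w) + normal_lpdf(y, m, s)
                 for w, m, s in zip(weights, mus, sigmas)]
        mx = max(terms)
        # Stable log-sum-exp over the label's possible values.
        total += mx + math.log(sum(math.exp(t - mx) for t in terms))
    return total

# Hypothetical data and parameter values.
ys = [0.1, -0.4, 2.3, 3.1]
ll_mix = mixture_log_lik(ys, weights=[0.3, 0.7], mus=[0.0, 3.0], sigmas=[1.0, 1.0])
print(ll_mix)
```

The resulting function depends only on the continuous parameters (weights, means, scales), so it can serve as the target for a gradient-based sampler.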

P.S. Chapter and section references refer to the Stan 2.14.0 manual. Things will get moved around in later editions so when looking in the manual you should search for the topics, not go by section numbers.

P.P.S. See here for some background.

Theoretical statistics is the theory of applied statistics: how to think about what we do (My talk at the University of Michigan this Friday 3pm)

Theoretical statistics is the theory of applied statistics: how to think about what we do

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Working scientists and engineers commonly feel that philosophy is a waste of time. But theoretical and philosophical principles can guide practice, so it makes sense for us to keep our philosophical foundations up to date. Much of the history of statistics can be interpreted as a series of expansions and inclusions: formalizations of procedures and ideas which had been previously considered outside the bounds of formal statistics. In this talk we discuss several such episodes, including the successful (in my view) incorporations of hierarchical modeling and statistical graphics into Bayesian data analysis, and the bad ideas (in my view) of null hypothesis significance testing and attempts to compute the posterior probability of a model being true. I’ll discuss my own philosophy of statistics and also the holes in my current philosophical framework.

It’s happening at 3pm Friday 10 Feb, in Michigan League – Kuenzel Room, and it’s the Foundations of Belief & Decision Making Lecture, organized by the Philosophy Department.

The Mannequin

Jonathan Falk points to this article, “Examining the impact of grape consumption on brain metabolism and cognitive function in patients with mild decline in cognition: A double-blinded placebo controlled pilot study,” and writes:

Drink up! N=10, no effect on thing you’re aiming at, p value result on a few brain measurements (out of?), eminently pr-able topic…Seems like a TED talk is nigh…

In all seriousness, don’t these people know that the purpose of a pilot study is to test out the methods, not to draw statistical conclusions? Pilot studies are fine. It’s a great idea to publish pilot studies, letting people know what worked and what didn’t, give all your raw data, let it all hang out. But leave the statistical significance calculator at home.

Also this:

1.112 ± 0.005262


P.S. I’m not saying these researchers are bad guys; I assume they’re just following standard practice which is to try to get at least one statistically significant result and publication out of any experiment. We just need a better scientific communication system that’s focused more on methods, data, and understanding, and less on the promotion of spurious breakthroughs.

The “What does not kill my statistical significance makes it stronger” fallacy

As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. Lots of our best ideas don’t pan out, and even if a hypothesis seems to be supported by the data, the magic “p less than .05” can be elusive.

And we also know that noisy data and small sample sizes make statistical significance even harder to attain. In statistics jargon, noisy studies have low “power.”

Now suppose you’re working in a setting such as educational psychology where the underlying objects of study are highly variable and difficult to measure, so that high noise is inevitable. Also, it’s costly or time-consuming to collect data, so sample sizes are small. But it’s an important topic, so you bite the bullet and accept that your research will be noisy. And you conduct your study . . . and it’s a success! You find a comparison of interest that is statistically significant.

At this point, it’s natural to reason as follows: “We got statistical significance under inauspicious conditions, and that’s an impressive feat. The underlying effect must be really strong to have shown up in a setting where it was so hard to find.” The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.

This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.

What went wrong? Why is it a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S error says that a statistically significant estimate can have a high probability of getting the sign wrong.

We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward. The example was a paper reporting a correlation between certain women’s political attitudes and the time of the month.
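The dependence of Type M and Type S errors on noise can be computed directly. The sketch below assumes the estimate is normally distributed around a positive true effect with a known standard error and that significance means a two-sided test at the 5% level; the function name `design_errors` and the effect size and standard errors in the loop are hypothetical, not taken from any study discussed here.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def design_errors(true_effect, se, z_crit=1.96):
    """Return (power, Type S error rate, expected exaggeration ratio),
    assuming estimate ~ Normal(true_effect, se) with true_effect > 0."""
    a = z_crit - true_effect / se    # significance threshold, right sign
    b = -z_crit - true_effect / se   # significance threshold, wrong sign
    p_hi = 1.0 - norm_cdf(a)         # significant, correct sign
    p_lo = norm_cdf(b)               # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate|; significant], from the truncated-normal mean formulas.
    mean_abs = (true_effect * p_hi + se * norm_pdf(a)
                - true_effect * p_lo + se * norm_pdf(b))
    exaggeration = mean_abs / power / true_effect
    return power, type_s, exaggeration

# Same hypothetical true effect, increasingly noisy studies.
for se in (0.5, 1.0, 2.0, 4.0, 8.0):
    power, type_s, exaggeration = design_errors(1.0, se)
    print(se, round(power, 3), round(type_s, 3), round(exaggeration, 1))
```

Holding the true effect fixed, raising the standard error lowers the power and raises both the chance of a wrong sign and the expected overestimate among the results that reach significance.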

So, we’ve seen from statistical analysis that the “What does not kill my statistical significance makes it stronger” attitude is a fallacy: Actually, the noisier the study, the less we learn from statistical significance. And we can also see the intuition that led to the fallacy, the idea that statistical significance under challenging conditions is an impressive accomplishment. That intuition is wrong because it neglects the issue of selection, which we also call the garden of forking paths.

An example

Even experienced researchers can fall for the “What does not kill my statistical significance makes it stronger” fallacy. For example, in an exchange about potential biases in summaries of some well-studied, but relatively small, early childhood intervention programs, economist James Heckman wrote:

The effects reported for the programs I discuss survive batteries of rigorous testing procedures. They are conducted by independent analysts who did not perform or design the original experiments. The fact that samples are small works against finding any effects for the programs, much less the statistically significant and substantial effects that have been found.

Yes, the fact that samples are small works against finding any [statistically significant] effects. But no, this does not imply that effect estimates obtained from small, noisy studies are to be trusted. In addition, the phrase, “much less the statistically significant and substantial effects” is misleading, in that when samples are small and measurements are noisy, any statistically significant estimates will be necessarily “substantial,” as that’s what it takes for them to be at least two standard errors from zero.

My point here is not to pick on Heckman, any more than my point a few years ago was to pick on Kahneman and Turing. No, it’s the opposite. Here you have James Heckman, a brilliant economist who’s done celebrated research on selection bias, who’s following a natural but erroneous line of reasoning that doesn’t account for selection. He’s making the “What does not kill my statistical significance makes it stronger” fallacy.

It’s an easy fallacy to make: if a world-renowned expert on selection bias can get this wrong, we can too. Hence this post.

P.S. Regarding the discussion of the Heckman quote above: He did say, and it’s true, that the measurements are good for the academic achievement etc. These aren’t ambiguous self-reports, or arbitrarily coded things. So the small sample point is still relevant, but it’s not appropriate to label those measurements as noisy. What’s relevant for this sort of study is not that they are noisy but that they are highly variable—and these are between-student comparisons, so between-student variance goes into the error term. The point is that the fallacy can arise when the underlying phenomenon is highly variable, even if the measurements themselves are not noisy.

P.P.S. More here. Eric and I published an article on this in Science.

Long Shot

Frank Harrell doesn’t like p-values:

In my [Frank’s] opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

His attitudes are similar to mine:

I [Andrew] agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

I also agree with Harrell not to take the so-called Fisher exact test seriously (see Section 3.3 of this paper from 2003).

Research connects overpublication during national sporting events to science-journalism problems

Ivan Oransky pointed me to a delightful science-based press release, “One’s ability to make money develops before birth”:

Researchers from the Higher School of Economics have shown how the level of perinatal testosterone, the sex hormone, impacts a person’s earnings in life. Prior research confirms that many skills and successes are linked to the widely known 2D:4D ratio, also known as the digit ratio. This is the ratio of the index and ring fingers . . . research conducted by a team of scientists from HSE’s Centre for Institutional Studies (John Nye, Maria Yudkevich, Maxym Bruhanov, Ekaterina Kochergina, Ekaterina Orel, and Sergei Polyachenko) became the first study to use Russian data to show the link between the 2D:4D ratio and a person’s income. The study was published in the journal Economics and Human Biology. . . . The number of observations in the base regressions totalled nearly 700 for men and 900 for women, and the age of the subjects varied between 25 and 60. A 2D:4D ratio was made for each participant using a specialised apparatus. In addition, the respondents, whose identities remained anonymous, were asked a number of questions concerning income and salaries.

The results of the regression analysis showed a negative correlation between the income and 2D:4D ratios of women. In other words, the higher the salary, the lower the ratio. The effect was negative even when taking into account salary predictors such as gender, age, education level, job position, and the position’s economic sector. What is interesting is that this quantitative association is seen in men as well, though only after taking into account respondents’ level of education.

Savvy researchers will note (a) the challenge of taking gender into account when the analysis was performed separately for each sex, and (b) the forking-paths and difference-between-significant-and-not-significant aspects of the last sentence above. Other fun things that you can see by following the link to the original paper are that the researchers looked at three different outcome measures and also tried everything with left and right hands. Also, despite the first sentence of the press release, and despite the title of the paper (“The effects of prenatal testosterone . . .”), there are actually no measurements of prenatal testosterone involved in this research (let alone any causal identification).

Bad form to put something in the title of the paper that’s not actually being measured.

I should perhaps emphasize that I have no objection to people researching such things and publishing their results; it’s just that it’s an absolute disaster to rummage around in a pile of data, pulling out things based on their statistical significance level. To put it another way: lots of their apparently statistically significant comparisons will be noise, and lots of the things that they dismiss as not statistically significant can actually correspond to real correlations. All this is made worse by rampant selection (for example, on the three different outcome measures) and the casual slipping from finger length to (unmeasured) hormone levels to (nonidentified) causal effects. What they have is some data, and that’s fine, and I think they’d be better off just publishing their dataset and being more aware of how little they can learn from it.
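To see how much noise a search like this can turn up, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not the paper's data or method: it treats 12 comparisons (three outcomes, two hands, two sexes) as independent, uses 200 simulated people per comparison rather than the 700 to 900 in the study, draws finger ratios and incomes as unrelated noise, and uses 1.97 as an approximate two-sided 5% cutoff for the t statistic. The function names are made up.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_significant(x, y, t_cutoff=1.97):
    """Approximate two-sided 5% test of zero correlation (assumed cutoff)."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return abs(t) > t_cutoff


def share_of_studies_with_a_finding(n_comparisons=12, n_people=200,
                                    n_studies=500, seed=0):
    """Fraction of pure-noise studies with at least one 'significant' result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        found = False
        for _ in range(n_comparisons):
            # Hypothetical data: digit ratio and income are unrelated noise.
            ratio = [rng.gauss(0, 1) for _ in range(n_people)]
            income = [rng.gauss(0, 1) for _ in range(n_people)]
            if is_significant(ratio, income):
                found = True
                break  # one publishable comparison is enough
        hits += found
    return hits / n_studies


if __name__ == "__main__":
    print(share_of_studies_with_a_finding())
```

With 12 independent tests at the 5% level, the chance of at least one statistically significant result is 1 - 0.95^12, or about 46%, and the simulated share should land near that. Real outcome measures and left and right hands are correlated with each other, so the figure for an actual study would differ, but the direction of the problem is the same.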

And of course you don’t have to be John D. Rockefeller IV to realize that “One’s ability to make money develops before birth.” No finger measurements are necessary to discover this obvious social fact.

P.S. I also love the very last paragraph of the press release:

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.

All right, then.

I went to the main EurekAlert! site and found this gem: “Research connects overeating during national sporting events to medical problems,” which begins:

People who overeat during national holidays and national sporting events – like this weekend’s Super Bowl – are 10 times more likely to need emergency medical attention for food obstruction than any at other time of the year, according to a new study led by a University of Florida researcher.

Following along, I see this:

Most of the problems affected men, and most of the cases came during or just after the Thanksgiving holiday. . . . Over the study period, from 2001 to 2012, 38 people underwent an emergency procedure on the esophagus during or just after the holiday or sporting event time period (within three days of the event). Nearly 37 percent of those were due to a food impaction. Comparatively, of the 81 who had the same procedure two weeks before and two weeks after the event during the “control period,” just under 4 percent were due to food impaction. During holidays and national sporting events, the most common impacted food was turkey (50 percent), followed by chicken (29 percent) and beef (21 percent).

I just loooove how “people stuff their faces on Thanksgiving” transmutes to “overeating during national sporting events.” Tie-in to the Super Bowl, perhaps?
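It is worth checking where the “10 times” comes from. The sketch below is arithmetic only, and the event counts in it are assumptions: the press release gives percentages, not counts, so 14 of 38 and 3 of 81 are back-calculated guesses that happen to reproduce “nearly 37 percent” and “just under 4 percent.” The split of 7, 4, and 3 among turkey, chicken, and beef is likewise a guess that matches the reported 50, 29, and 21 percent.

```python
# Totals reported in the press release.
holiday_procedures = 38
control_procedures = 81

# Assumed counts, back-calculated from the reported percentages.
holiday_impactions = 14
control_impactions = 3

holiday_share = holiday_impactions / holiday_procedures
control_share = control_impactions / control_procedures
ratio = holiday_share / control_share

# Assumed breakdown of the holiday impactions by food.
foods = {"turkey": 7, "chicken": 4, "beef": 3}
food_percent = {
    name: round(100 * count / holiday_impactions)
    for name, count in foods.items()
}

print(f"holiday share: {100 * holiday_share:.1f}%")
print(f"control share: {100 * control_share:.1f}%")
print(f"ratio of shares: {ratio:.1f}")
print(food_percent)
```

If these guesses are right, the headline number is a ratio of two shares of emergency esophageal procedures, resting on roughly 14 events over 12 years in one period and 3 in the other. It is not a per-person risk of choking, and about half of the holiday events involve turkey.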

This last one indeed got picked up by some bottom-feeding news organizations, for example NBC Chicago, which took the bait and led with the headline,

You’re Up to 10 Times More Likely to Choke on Snacks During Super Bowl, Health Officials Say

Suckers! Hey, NBC Chicago: you got played! Pwned by the University of Florida public relations department. Pretty embarrassing, huh?

But it’s all OK, because:

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.

On the plus side, NPR doesn’t seem to have fallen for either of these stories. So there’s some hope yet.

Why bother?

As always, the question arises, why bother even drawing attention to this sort of sloppiness? The quick answer is that someone sent me a link and I found it amusing. The long answer is that the same research and reporting problems discussed above also arise in more consequential areas. Remember that dude who was crunching numbers and said Hillary Clinton had a 98% chance of winning the election? Or those guys who said that early childhood intervention increased future earnings by 40%? Selection bias, confusions about causality, leaps that ignore differences between data and the underlying constructs of interest, and credulous reporting that treats every published paper as a Eureka discovery: These occur in problems big and small, and when studying statistical practice, statistical reasoning, and statistical communication, it can be helpful to study the many small cases as well as the few large ones.

Death of the Party


Under the subject line, “Example of a classy response to someone pointing out an error,” Charles Jackson writes:

In their recent book, Mazur and Stein describe the discovery of an error that one of them had made in a recent paper writing: “Happily, Bartosz Naskreki spotted this error . . .” See below for full context.

That is from page 129 of Prime Numbers and the Riemann Hypothesis, Barry Mazur and William Stein.

See how easy that was?

Are you listening, himmicanes people? fat-arms-and-voting people? ovulation-and-clothing people? ovulation-and-voting people? air-rage people?
Satoshi? Daryl? John? Roy? Marc? David? Amy? Andy?
Bueller? Bueller?