
I refuse to blog about this one

Shravan points me to this article, Twitter Language Use Reflects Psychological Differences between Democrats and Republicans, which begins with the following self-parody of an abstract:

Previous research has shown that political leanings correlate with various psychological factors. While surveys and experiments provide a rich source of information for political psychology, data from social networks can offer more naturalistic and robust material for analysis. This research investigates psychological differences between individuals of different political orientations on a social networking platform, Twitter. Based on previous findings, we hypothesized that the language used by liberals emphasizes their perception of uniqueness, contains more swear words, more anxiety-related words and more feeling-related words than conservatives’ language. Conversely, we predicted that the language of conservatives emphasizes group membership and contains more references to achievement and religion than liberals’ language. We analysed Twitter timelines of 5,373 followers of three Twitter accounts of the American Democratic and 5,386 followers of three accounts of the Republican parties’ Congressional Organizations. The results support most of the predictions and previous findings, confirming that Twitter behaviour offers valid insights to offline behaviour.

and also this delightful figure:

[Figure 1 from the article]

The pie-chart machine must’ve been on the fritz that day.

I can’t actually complain about this article because it appeared in Plos-one. I have the horrible feeling that, with another gimmick or two, it could’ve become a featured article in PPNAS or Science or Nature.

Anyway, I replied to Shravan:

Stop me before I barf . . .

To which Shravan replied:

Let it all out on your blog.

But no, I don’t think this is worth blogging. There must be some football items that are more newsworthy.

A book on RStan in Japanese: Bayesian Statistical Modeling Using Stan and R (Wonderful R, Volume 2)

Wonderful, indeed, to have an RStan book in Japanese:

Google Translate produces the following from the description posted on Amazon Japan (linked from the title above):

In recent years, “statistical modeling,” in which a mathematical model built from probability distributions is fit to data in order to understand a phenomenon and make predictions, has attracted attention. Its advantage over existing approaches is that it offers both ease of interpretation and good prediction: because the interpretation is straightforward, it connects easily to the next action once the model’s values have been estimated. It is therefore regarded as a very effective technique for real-world data analysis.

Behind this are improvements in computing speed, the ready availability of large-scale data, and advances in probabilistic programming languages that make the trial and error of modeling much easier. Among these languages, this book introduces Stan, which is free software. Stan is a rapidly developing package equipped with excellent algorithms, and it can easily be used from R because the companion R package RStan is released in parallel. Stan’s expressive power is high: hierarchical models and state-space models can be written in as little as 30 lines, and the estimation is carried out automatically. Tailor-made extensions suited to the analyst’s problem are also easily possible.

Books dealing with Bayesian statistics in general either stop at rudimentary content or are full of esoteric formulas that are hard to apply to real problems. This book is a clear departure from those: it is very practical, written with real-world data analysis in mind. Once you have absorbed the concept of statistical modeling through Stan and R in this book, it will be a great help even if Stan’s grammar changes or you work with other statistical modeling tools.

I’d be happy to replace this with a proper translation if there’s a Japanese speaker out there with some free time (Masanao Yajima translated the citation for us).

Big in Japan?

I’d like to say Stan’s big in Japan, but that idiom implies it’s not so big elsewhere. I can say there’s a very active Twitter community tweeting about Stan in Japanese, which we follow occasionally using Google Translate.

Looking at the polls: Time to get down and dirty with the data


Poll aggregation is great, but one thing that we’ve been saying a lot recently (see also here) is that we can also learn a lot by breaking open a survey and looking at the numbers crawling around inside.

Here’s a new example. It comes from Alan Abramowitz, who writes:

Very strange results of new ABC/WP poll for nonwhite voters

There’s something very odd going on here.

See the table below provided by ABC News. [I’ll put the table at the end of the post.—ed.] They show that Clinton leads Trump by 89-2 among African-Americans and by 68-19 among Hispanics. But then they report that she only leads by 69-19 among all nonwhites. That makes no sense. Trump would have to have a huge lead among the other groups of nonwhite voters, mainly Asian-Americans, to produce that overall result among nonwhites.

Let’s assume that nonwhites are 28 percent of likely voters. And let’s assume that blacks are 12 percent, Hispanics are 11 percent and Asian/other are 5 percent.

According to my calculations, among the nonwhite 28 percent of the electorate, they have Clinton leading Trump by 19.3 to 5.3, a net advantage of 14 percentage points. Among the African-American 12 percent of the electorate, they have Clinton leading Trump by 10.7 to 0.2. And among the Hispanic 11 percent of the electorate, they have Clinton leading 7.5 to 2.1. Adding up the numbers for African-Americans and Hispanics, for that combined 23 percent of the electorate they have Clinton leading 18.2 to 2.3 for a lead of 15.9 percentage points. But remember, they only have Clinton leading by a net 14 percentage points among nonwhites. So in order to get to that result, Clinton must be down by a net 1.9 points among the remaining nonwhite voters. That means she would be LOSING to Trump among those other nonwhite voters by a landslide margin, something like 60 to 20!

Now my assumptions about the African-American, Hispanic and other nonwhite shares of the overall nonwhite electorate could be off a little, but probably not by much. And even if you modify those assumptions somewhat, you are still going to be left with the conclusion that Trump is far ahead of Clinton among nonwhites other than African-Americans and Hispanics.

If we flip the results for nonwhites other than African-Americans and Hispanics, giving Clinton a 60-20 lead rather than a 60-20 deficit, which would certainly be more realistic, this would make a noticeable difference in the overall results of the poll, moving the numbers from a 2 point Clinton lead among all likely voters to closer to a 5-6 point overall lead.
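As a quick check of Abramowitz’s arithmetic, here’s a minimal R sketch. The group shares are his assumptions and the percentages are the poll’s reported numbers; nothing here comes from the raw data, and the margin for the remaining nonwhite voters is implied rather than reported:

```r
## A quick check of Abramowitz's arithmetic. The group shares are his assumptions;
## the Clinton/Trump percentages are the reported toplines and crosstabs.
nonwhite_share <- 0.28; black_share <- 0.12; hispanic_share <- 0.11; other_share <- 0.05

# Each group's contribution to the overall margin, in points of the full electorate:
nonwhite_margin <- nonwhite_share * (69 - 19)   # 14.0
black_margin    <- black_share    * (89 - 2)    # 10.4
hispanic_margin <- hispanic_share * (68 - 19)   #  5.4

# Margin implied for the remaining nonwhite voters (mostly Asian-Americans):
implied_other <- (nonwhite_margin - black_margin - hispanic_margin) / other_share
implied_other   # about -37 points, i.e. Trump ahead by something like 60-20

# Flipping that group to Clinton +40 would add roughly 0.05 * 77, or about 4 points,
# to her overall margin, consistent with a 2-point lead becoming a 5-6 point lead.
```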

I responded:

What do you think happened? Maybe they used different adjustments for toplines and crosstabs?

Abramowitz said that, given the information that was currently available to him, “I have no idea what they did but I can’t come up with any way that these numbers add up.”

Just to be clear: I’m not saying these pollsters did anything wrong. I have no idea. I’ve not seen the raw data either, and I didn’t even go through all of Abramowitz’s comments in detail. My point here is just that, if we want to use and understand polls, sometimes we have to get down and dirty and try to figure out exactly what’s going on. Mike Spagat knows this, David Rothschild knows this, and so should we.

And here’s that table:
Continue reading ‘Looking at the polls: Time to get down and dirty with the data’ »

No statistically significant differences for get up and go

Politics and chance

After the New Hampshire primary Nadia Hassan wrote:

Some have noted how minor differences in how the candidates come out in these primaries can make a huge difference in the media coverage. For example, only a few thousand voters separate third and fifth and it really impacts how pundits talk about a candidate’s performance. Chance events can have a huge impact in politics and many areas. Candidates can win because of weather, or something they said, or a new news revelation. Nevertheless, I wonder if there’s a better way to handle this kind of thing when we are talking about close results in these primaries.

I replied:

Yes, but one reassuring perspective is that there’s arbitrariness in any case, as there are dozens of well qualified candidates for president and only one winner. Rather than arbitrariness, I’m more worried about systematic factors such as congress filling up with millionaires because these are the people who have the connections to allow them to run for office.

Cracks in the thin blue line


When people screw up or cheat in their research, what do their collaborators say?

The simplest case is when coauthors admit their error, as Cexun Jeffrey Cai and I did when it turned out that we’d miscoded a key variable in an analysis, invalidating the empirical claims of our award-winning paper.

On the other extreme, coauthors can hold a united front, as Neil Anderson and Deniz Ones did after some outside researchers found a data-coding error in their paper. Instead of admitting it and simply recognizing that some portion of their research was in error, Anderson and Ones destroyed their reputation by refusing to admit anything. This particular case continues to bother me because there’s no good reason for them not to want to get the right answer. Weggy was accused of plagiarism, which is serious academic misconduct, so it makes sense for him to stonewall and run out the clock until retirement. But Anderson and Ones simply made an error: Is admitting a mistake so painful as all that?

In other cases, researchers mount a vigorous defense in a more reasonable way. For example, after that Excel error was found, Reinhart and Rogoff admitted they made a mistake, and the remaining discussion turned on (a) the implications of the error for their substantive conclusions, and (b) the practice of data sharing. I think both sides had reasonable points in this discussion; in particular, yes, the data were public and always available, but the particular data file used by Reinhart and Rogoff was not accessible to outsiders. The resulting discussion moved forward in a useful way, toward a position that researchers who publish data should make their scripts and datasets available, even when they are working with public data. Here’s an example.

But what I want to talk about today is when coauthors do not take a completely united front.

In the case of disgraced primatologist Marc Hauser, collaborator Noam Chomsky escalated with: “Marc Hauser is a fine scientist with an outstanding record of accomplishment. His resignation is a serious loss for Harvard, and given the nature of the attack on him, for science generally.” On the upside, I don’t think Chomsky actually defended Hauser’s practice of trying to tell his research assistants how to code his monkey data. I’m assuming that Chomsky kept his distance from the controversial research studies, allowing him to engage in an aggressive defense on principle alone.

Another option is to just keep quiet. The famous “power pose” work of Carney, Cuddy, and Yap has been questioned on several grounds: first that their study is too small and their data are too noisy for them to have a hope of finding the effects they were looking for, second that an attempted replication of their main finding failed, and third that at least one of the test statistics in their paper was miscalculated in a way that moved the p-value from above .05 to below .05. This last sort of error has also been found in at least one other paper of Cuddy’s. Upon publication of the non-replication, all three of Carney, Cuddy, and Yap responded in a defensive way that implied a lack of understanding of the basic principles of statistical significance and replication. But after that, Carney and Yap appear to have kept quiet. [Not quite; see P.P.S. below.] Cuddy issued loud attacks on her critics but her coauthors perhaps have decided to stay out of the limelight. I’m glad they’re not going on the attack but I’m disappointed that they seem to want to hold on to their discredited claims. But that’s one strategy to follow when your work is found lacking: just stay silent and hope the storm blows over.

A final option, and the one I find most interesting, is when a researcher commits fraud or gross incompetence and does not admit it, but his or her coauthor will not sit still and accept this.

The most famous recent example was the gay-marriage-persuasion study of Michael Lacour and Don Green. When outsiders found out that the data were faked, Lacour denied it but Green pulled the plug. He told the scientific journal and the press that he had no trust in the data. Green did the right thing.

Another example is biologist Robert Trivers, who found out about problems in a paper he had coauthored—one coauthor had faked the data and another was defending the fraud. It took years until Trivers could get the journal to retract it.

My final example, which motivated me to write this post, came today in a blog comment from Randall Rose, a coauthor, with Promothesh Chatterjee and Jayati Sinha, of a social psychology study that was utterly destroyed by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, to the extent that Pashler et al. concluded that the data could not have happened as claimed in the paper and were consistent with fraud. Chatterjee and Sinha wrote horrible, Richard Tol-like defenses of their work (here’s a sample: “Although 8 coding errors were discovered in Study 3 data and this particular study has been retracted from that article, as I show in this article, the arguments being put forth by the critics are untenable”), but Rose did not join in:

I have ceased trying to defend the data in this paper, particularly Study 3, a long time ago. I am not certain what happened to generate the odd results (other than clear sloppiness in study execution, data coding, and reporting) but I am certain that the data in Study 3 should not be relied on . . .

I appreciate that. Instead of the usual the-best-defense-is-a-good-offense attitude, Rose openly admits that he did not handle the data himself and that he has no reason to vouch for the data quality or claim that the results still stand.

Wouldn’t it be great if everyone could do that?

It’s not an easy position, to be a coauthor in a study that has been found wanting, either through fraud, serious data errors, or simply a subtle statistical misunderstanding (such as that which led Satoshi Kanazawa to think that he could possibly learn anything about variation in sex ratios from a sample of size 3000). I find the behavior of Trivers, Green, and Rose in this setting to be exemplary, but I recognize the personal and professional difficulties here.

For someone like Carney or Yap, it’s a tough call. On one hand, to distance themselves from this work and abandon their claims would represent a serious hit on their careers, not to mention the pain involved in having to reassess their understanding of psychology. On the other hand, the work really is wrong, the experiment really wasn’t replicated, the data really are too noisy to learn what they were hoping to learn, and unlike Cuddy they’ve kept a lower profile so it doesn’t seem too late for them to admit error, accept the sunk cost, and move on.

P.S. See here for a discussion of a similar situation.

P.P.S. Commenter Bernoulli writes:

It is not true that Carney has remained totally silent.

Continue reading ‘Cracks in the thin blue line’ »

Trump +1 in Florida; or, a quick comment on that “5 groups analyze the same poll” exercise

Nate Cohn at the New York Times arranged a comparative study on a recent Florida pre-election poll. He sent the raw data to four groups (Charles Franklin; Patrick Ruffini; Margie Omero, Robert Green, Adam Rosenblatt; and Sam Corbett-Davies, David Rothschild, and me) and asked each of us to analyze the data however we liked to estimate the margin of support for Hillary Clinton vs. Donald Trump in the state. And then he compared this to the New York Times pollster’s estimate.

Here’s what everyone estimated:

Franklin: Clinton +3 percentage points
Ruffini: Clinton +1
Omero, Green, Rosenblatt: Clinton +4
Us: Trump +1
NYT, Siena College: Clinton +1

We did Mister P, and the big reason our estimate was different from everyone else’s was that one of the variables we adjusted for was party registration, and this particular sample of 867 respondents had more registered Democrats than you’d expect, compared to Florida voters from 2012 with an adjustment for anticipated changes in the electorate for the upcoming election.

In previous efforts we’d adjusted on stated party identification but in this case the survey was conducted based on registered voter lists, so we knew the party registration of the respondents. It was simpler to adjust for registration than stated party ID because we know the poststratification distribution for registration (based on the earlier election), whereas if we wanted to poststratify on party ID we’d need to take the extra step of estimating that distribution.
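For readers who haven’t seen poststratification before, here’s a minimal sketch of that final step. The registration shares and cell-level support numbers below are made up for illustration; in the actual analysis the cell estimates came from a multilevel regression fit in Stan, with many more cells than this:

```r
## Toy poststratification on party registration. The registration shares and the
## cell-level support numbers below are invented for illustration only.
cells <- data.frame(
  registration = c("Democrat", "Republican", "Other"),
  pop_share    = c(0.38, 0.35, 0.27),   # assumed share of the Florida electorate
  clinton_hat  = c(0.90, 0.07, 0.42),   # estimated Clinton support within each cell
  trump_hat    = c(0.05, 0.88, 0.44)    # estimated Trump support within each cell
)

# Weight each cell's estimate by its known population share, not by its share of
# the sample; that is what corrects for having too many registered Democrats.
margin <- with(cells, sum(pop_share * (clinton_hat - trump_hat)))
round(100 * margin, 1)   # poststratified Clinton-minus-Trump margin, in points
```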

Anyway, the exercise was fun and instructive. As Cohn put it:

We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results. . . . How so? Because pollsters make a series of decisions when designing their survey, from determining likely voters to adjusting their respondents to match the demographics of the electorate. These decisions are hard. They usually take place behind the scenes, and they can make a huge difference.

And he also provided some demographics on the different adjusted estimates:

[Table from the New York Times article: demographics of each team’s adjusted estimates]

I like this. You can evaluate our estimates not just based on our headline numbers but also based on the distributions we matched to.

But the differences weren’t as large as they look

Just one thing. At first I was actually surprised the results varied by so much. 5 percentage points seems like a lot!

But, come to think of it, the variation wasn’t so much. The estimates had a range of 5 percentage points, but that corresponds to a sd of about 2 percentage points. And that’s the sd on the gap between Clinton and Trump, hence the sd for either candidate’s total is more like 1 percentage point. Put it that way, and it’s not so much variation at all.
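To spell out that calculation, using the five margins listed above:

```r
margins <- c(3, 1, 4, -1, 1)   # Clinton-minus-Trump estimates from the five teams
sd(margins)                    # about 1.9 points of spread on the margin
sd(margins) / 2                # roughly 1 point on either candidate's share
```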

Andrew Gelman is not the plagiarism police because there is no such thing as the plagiarism police.


The title of this post is a line that Thomas Basbøll wrote a couple years ago.

Before I go on, let me say that the fact that I have not investigated this case in detail is not meant to imply that it’s not important or that it’s not worth investigating. It’s just not something that I had the energy to look into. Remember, people can be defined by what ticks them off.

And now here’s the story. I got the following email from someone called Summer Madison:

I think this might interest you:

http://www.econjobrumors.com/topic/new-family-ruptures-aer-nber-is-rip-off-of-obscure-paper

I replied:
Continue reading ‘Andrew Gelman is not the plagiarism police because there is no such thing as the plagiarism police.’ »

Multicollinearity causing risk and uncertainty

Alexia Gaudeul writes:

Maybe you will find this interesting / amusing / frightening, but the Journal of Risk and Uncertainty recently published a paper with a rather obvious multicollinearity problem.

The issue does not come up that often in the published literature, so I thought you might find it interesting for your blog.

The paper is:

Rohde, I. M., & Rohde, K. I. (2015). Managing social risks–tradeoffs between risks and inequalities. Journal of Risk and Uncertainty, 51(2), 103-124.

The authors report very nicely all the elements that would normally indicate to a reviewer that there is something wrong. I got the data from the authors to run my own tests, which I [Gaudeul] report here.

I haven’t looked into this in detail but I thought I’d post it because there’s this scandal in econ that’s somehow all about process and little about substance, and tomorrow I have a post on that, so I thought it was worth preceding it with an example that’s all about substance, not process.
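For readers who want to see what an “obvious” multicollinearity problem typically looks like to a reviewer, here’s a generic R sketch on simulated data; it has nothing to do with the Rohde and Rohde dataset:

```r
## Simulated example of severe multicollinearity -- not the paper's data.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
y  <- 1 + x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
summary(fit)   # inflated standard errors and unstable coefficient signs on x1, x2

# Variance inflation factor for x1, computed by hand:
1 / (1 - summary(lm(x1 ~ x2))$r.squared)   # far above the usual rule-of-thumb of 10
kappa(model.matrix(fit))                   # large condition number: another warning sign
```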

Why is the scientific replication crisis centered on psychology?

The replication crisis is a big deal. But it’s a problem in lots of scientific fields. Why is so much of the discussion about psychology research?

Why not economics, which is more controversial and gets more space in the news media? Or medicine, which has higher stakes and a regular flow of well-publicized scandals?

Here are some relevant factors that I see, within the field of psychology:

1. Sophistication: Psychology’s discourse on validity, reliability, and latent constructs is much more sophisticated than the usual treatment of measurement in statistics, economics, biology, etc. So you see Paul Meehl raising serious questions as early as the 1960s, at a time when, in other fields, we were just getting naive happy talk about how all problems would be solved with randomized experiments.

2. Overconfidence deriving from research designs: When we talk about the replication crisis in psychology, we’re mostly talking about lab experiments and surveys. Either way, you get clean identification of comparisons, hence there’s an assumption that simple textbook methods can’t go wrong. We’ve seen similar problems in economics (for example, that notorious paper on air pollution in China, which was based on a naive trust in regression discontinuity analysis, not recognizing that, when you come down to it, what they had was an observational study), but lab experiments and surveys in psychology are typically so clean that researchers sometimes can’t seem to imagine that there could be any problems with their p-values.

3. Openness. This one hurts: psychology’s bad press is in part a consequence of its open culture, which manifests in various ways. To start with, psychology is _institutionally_ open. Sure, there are some bad actors who refuse to share their data or who try to suppress dissent. Overall, though, psychology offers many channels of communication, even including the involvement of outsiders such as myself. One can compare this to economics, which is notoriously resistant to ideas coming from other fields.

And, compared to medicine, psychology is much less restricted by financial and legal considerations. Biology and medicine are big business, and there are huge financial incentives for suppressing negative results, silencing critics, and flat-out cheating. In psychology, it’s relatively easy to get your hands on the data or at least to find mistakes in published work.

4. Involvement of some of the most prominent academics. Research controversies in other fields typically seem to involve fringe elements in their professions, and when discussing science publication failures, you might just say that Andrew Wakefield had an axe to grind and the editor of the Lancet is a sucker for political controversy, or that Richard Tol has an impressive talent for getting bad work published in good journals. In the rare cases when a big shot is involved (for example, Reinhart and Rogoff) it is indeed big news. But, in psychology, the replication crisis has engulfed Susan Fiske, Roy Baumeister, John Bargh, Carol Dweck, . . . these are leaders in their field. So there’s a legitimate feeling that the replication crisis strikes at the heart of psychology, or at least social psychology; it’s hard to dismiss it as a series of isolated incidents. It was well over half a century ago that Popper took Freud to task regarding unfalsifiable theory, and that remains a concern today.

5. Finally, psychology research is often of general interest (hence all the press coverage, Ted talks, and so on) and accessible, both in its subject matter and its methods. Biomedicine is all about development and DNA and all sorts of actual science; to understand empirical economics you need to know about regression models; but the ideas and methods of psychology are right out in the open for all to see. At the same time, most of psychology is not politically controversial. If an economist makes a dramatic claim, journalists can call up experts on the left and the right and present a nuanced view. At least until recently, reporting about psychology followed the “scientist as bold discoverer” template, from Gladwell on down.

What do you get when you put it together?

The strengths and weaknesses of the field of research psychology seem to have combined to (a) encourage the publication and dissemination of lots of low-quality, unreplicable research, while (b) creating the conditions for this problem to be recognized, exposed, and discussed openly.

It makes sense for psychology researchers to be embarrassed that those papers on power pose, ESP, himmicanes, etc. were published in their top journals and promoted by leaders in their field. Just to be clear: I’m not saying there’s anything embarrassing or illegitimate about studying and publishing papers on power pose, ESP, or himmicanes. Speculation and data exploration are fine with me; indeed, they’re a necessary part of science. My problem with those papers is that they presented speculation as mature theory, that they presented data exploration as confirmatory evidence, and that they were not part of research programmes that could accommodate criticism. That’s bad news for psychology or any other field.

But psychologists can express legitimate pride in the methodological sophistication that has given them avenues to understand the replication crisis, in the openness that has allowed prominent work to be criticized, and in the collaborative culture that has facilitated replication projects. Let’s not let the breakthrough-of-the-week hype and the Ted-talking hawkers and the “replication rate is statistically indistinguishable from 100%” blowhards distract us from all the good work that has showed us how to think more seriously about statistical evidence and scientific replication.

“Crimes Against Data”: My talk at Ohio State University this Thurs; “Solving Statistics Problems Using Stan”: My talk at the University of Michigan this Fri

Crimes Against Data

Statistics has been described as the science of uncertainty. But, paradoxically, statistical methods are often used to create a sense of certainty where none should exist. The social sciences have been rocked in recent years by highly publicized claims, published in top journals, that were reported as “statistically significant” but are implausible and indeed could not be replicated by independent research teams. Can statistics dig its way out of a hole of its own construction? Yes, but it will take work.

Thursday, September 22, 2016 – 3:00pm
Location: EA 170

Solving Statistics Problems Using Stan

Stan is a free and open-source probabilistic programming language and Bayesian inference engine. In this talk, we demonstrate the use of Stan for some small fun problems and then discuss some open problems in Stan and in Bayesian computation and Bayesian inference more generally.

Friday, September 23, 2016 at 11:30 am
411 West Hall
Welcome Reception at 11:00am in the Statistics Lounge, 450 WH

What has happened down here is the winds have changed


Someone sent me this article by psychology professor Susan Fiske, scheduled to appear in the APS Observer, a magazine of the Association for Psychological Science. The article made me a little bit sad, and I was inclined to just keep my response short and sweet, but then it seemed worth the trouble to give some context.

I’ll first share the article with you, then give my take on what I see as the larger issues. The title and headings of this post allude to the fact that the replication crisis has redrawn the topography of science, especially in social psychology, and I can see that to people such as Fiske who’d adapted to the earlier lay of the land, these changes can feel catastrophic.

I will not be giving any sort of point-by-point refutation of Fiske’s piece, because it’s pretty much all about internal goings-on within the field of psychology (careers, tenure, smear tactics, people trying to protect their labs, public-speaking sponsors, career-stage vulnerability), and I don’t know anything about this, as I’m an outsider to psychology and I’ve seen very little of this sort of thing in statistics or political science. (Sure, dirty deeds get done in all academic departments but in the fields with which I’m familiar, methods critiques are pretty much out in the open and the leading figures in these fields don’t seem to have much problem with the idea that if you publish something, then others can feel free to criticize it.)

I don’t know enough about the academic politics of psychology to comment on most of what Fiske writes about, so what I’ll mostly be talking about is how her attitudes, distasteful as I find them both in substance and in expression, can be understood in light of the recent history of psychology and its replication crisis.

Here’s Fiske:

[Images of Fiske’s article]

In short, Fiske doesn’t like when people use social media to publish negative comments on published research. She’s implicitly following what I’ve sometimes called the research incumbency rule: that, once an article is published in some approved venue, it should be taken as truth. I’ve written elsewhere on my problems with this attitude—in short, (a) many published papers are clearly in error, which can often be seen just by internal examination of the claims and which becomes even clearer following unsuccessful replication, and (b) publication itself is such a crapshoot that it’s a statistical error to draw a bright line between published and unpublished work.

Clouds roll in from the north and it started to rain

To understand Fiske’s attitude, it helps to realize how fast things have changed.
As of five years ago—2011—the replication crisis was barely a cloud on the horizon.

Here’s what I see as the timeline of important events:

1960s-1970s: Paul Meehl argues that the standard paradigm of experimental psychology doesn’t work, that “a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of ‘an integrated research program,’ without ever once refuting or corroborating so much as a single strand of the network.”

Psychologists all knew who Paul Meehl was, but they pretty much ignored his warnings. For example, Robert Rosenthal wrote an influential paper on the “file drawer problem” but if anything this distracts from the larger problems of the find-statistical-significance-any-way-you-can-and-declare-victory paradigm.

1960s: Jacob Cohen studies statistical power, spreading the idea that design and data collection are central to good research in psychology, and culminating in his book, Statistical Power Analysis for the Behavioral Sciences. The research community incorporates Cohen’s methods and terminology into its practice but sidesteps the most important issue by drastically overestimating real-world effect sizes.

1971: Tversky and Kahneman write “Belief in the law of small numbers,” one of their first studies of persistent biases in human cognition. This early work focuses on researchers’ misunderstanding of uncertainty and variation (particularly but not limited to p-values and statistical significance), but they and their colleagues soon move into more general lines of inquiry and don’t fully recognize the implication of their work for research practice.

1980s-1990s: Null hypothesis significance testing becomes increasingly controversial within the world of psychology. Unfortunately this was framed more as a methods question than a research question, and I think the idea was that research protocols were just fine and all that was needed was a tweaking of the analysis. I didn’t see general airing of Meehl-like conjectures that much published research was useless.

2006: I first hear about the work of Satoshi Kanazawa, a sociologist who published a series of papers with provocative claims (“Engineers have more sons, nurses have more daughters,” etc.), each of which turns out to be based on some statistical error. I was of course already aware that statistical errors exist, but I hadn’t fully come to terms with the idea that this particular research program, and others like it, were dead on arrival because of too low a signal-to-noise ratio. It still seemed a problem with statistical analysis, to be resolved one error at a time.

2008: Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler write a controversial article, “Voodoo correlations in social neuroscience,” arguing not just that some published papers have technical problems but also that these statistical problems are distorting the research field, and that many prominent published claims in the area are not to be trusted. This is moving into Meehl territory.

2008 also saw the start of the blog Neuroskeptic, which started with the usual soft targets (prayer studies, vaccine deniers), then started to criticize science hype (“I’d like to make it clear that I’m not out to criticize the paper itself or the authors . . . I think the data from this study are valuable and interesting – to a specialist. What concerns me is the way in which this study and others like it are reported, and indeed the fact that they are reported as news at all”), but soon moved to larger criticisms of the field. I don’t know that the Neuroskeptic blog per se was such a big deal but it’s symptomatic of a larger shift of science-opinion blogging away from traditional political topics toward internal criticism.

2011: Joseph Simmons, Leif Nelson, and Uri Simonsohn publish a paper, “False-positive psychology,” in Psychological Science introducing the useful term “researcher degrees of freedom.” Later they come up with the term p-hacking, and Eric Loken and I speak of the garden of forking paths to describe the processes by which researcher degrees of freedom are employed to attain statistical significance. The paper by Simmons et al. is also notable in its punning title, not just questioning the claims of the subfield of positive psychology but also mocking it. (Correction: Uri emailed to inform me that their paper actually had nothing to do with the subfield of positive psychology and that they intended no such pun.)

That same year, Simonsohn also publishes a paper shooting down the dentist-named-Dennis paper, not a major moment in the history of psychology but important to me because that was a paper whose conclusions I’d uncritically accepted when it had come out. I too had been unaware of the fundamental weakness of so much empirical research.

2011: Daryl Bem publishes his article, “Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect,” in a top journal in psychology. Not too many people thought Bem had discovered ESP but there was a general impression that his work was basically solid, and thus this was presented as a concern for psychology research. For example, the New York Times reported:

The editor of the journal, Charles Judd, a psychologist at the University of Colorado, said the paper went through the journal’s regular review process. “Four reviewers made comments on the manuscript,” he said, “and these are very trusted people.”

In retrospect, Bem’s paper had huge, obvious multiple comparisons problems—the editor and his four reviewers just didn’t know what to look for—but back in 2011 we weren’t so good at noticing this sort of thing.

At this point, certain earlier work was seen to fit into this larger pattern, that certain methodological flaws in standard statistical practice were not merely isolated mistakes or even patterns of mistakes, but that they could be doing serious damage to the scientific process. Some relevant documents here are John Ioannidis’s 2005 paper, “Why most published research findings are false,” and Nicholas Christakis’s and James Fowler’s paper from 2007 claiming that obesity is contagious. Ioannidis’s paper is now a classic, but when it came out I don’t think most of us thought through its larger implications; the paper by Christakis and Fowler is no longer being taken seriously but back in the day it was a big deal. My point is, these events from 2005 and 2007 fit into our storyline but were not fully recognized as such at the time. It was Bem, perhaps, who kicked us all into the realization that bad work could be the rule, not the exception.

So, as of early 2011, there’s a sense that something’s wrong, but it’s not so clear to people how wrong things are, and observers (myself included) remain unaware of the ubiquity, indeed the obviousness, of fatal multiple comparisons problems in so much published research. Or, I should say, the deadly combination of weak theory being supported almost entirely by statistically significant results which themselves are the product of uncontrolled researcher degrees of freedom.

2011: Various episodes of scientific misconduct hit the news. Diederik Stapel is kicked out of the psychology department at Tilburg University and Marc Hauser leaves the psychology department at Harvard. These and other episodes bring attention to the Retraction Watch blog. I see a connection between scientific fraud, sloppiness, and plain old incompetence: in all cases I see researchers who are true believers in their hypotheses, which in turn are vague enough to support any evidence thrown at them. Recall Clarke’s Law.

2012: Gregory Francis publishes “Too good to be true,” leading off a series of papers arguing that repeated statistically significant results (that is, standard practice in published psychology papers) can be a sign of selection bias. PubPeer starts up.

2013: Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafo publish the article, “Power failure: Why small sample size undermines the reliability of neuroscience,” which closes the loop from Cohen’s power analysis to Meehl’s more general despair, with the connection being selection and overestimates of effect sizes.

Around this time, people start sending me bad papers that make extreme claims based on weak data. The first might have been the one on ovulation and voting, but then we get ovulation and clothing, fat arms and political attitudes, and all the rest. The term “Psychological-Science-style research” enters the lexicon.

Also, the replication movement gains steam and a series of high-profile failed replications come out. First there’s the entirely unsurprising lack of replication of Bem’s ESP work—Bem himself wrote a paper claiming successful replication, but his meta-analysis included various studies that were not replications at all—and then came the unsuccessful replications of embodied cognition, ego depletion, and various other respected findings from social psychology.

2015: Many different concerns with research quality and the scientific publication process converge in the “power pose” research of Dana Carney, Amy Cuddy, and Andy Yap, which received adoring media coverage but which suffered from the now-familiar problems of massive uncontrolled researcher degrees of freedom (see this discussion by Uri Simonsohn), and which failed to reappear in a replication attempt by Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber.

Meanwhile, the prestigious Proceedings of the National Academy of Sciences (PPNAS) gets into the game, publishing really bad, fatally flawed papers on media-friendly topics such as himmicanes, air rage, and “People search for meaning when they approach a new decade in chronological age.” These particular articles were all edited by “Susan T. Fiske, Princeton University.” Just when the news was finally getting out about researcher degrees of freedom, statistical significance, and the perils of low-power studies, PPNAS jumps in. Talk about bad timing.

2016: Brian Nosek and others organize a large collaborative replication project. Lots of prominent studies don’t replicate. The replication project gets lots of attention among scientists and in the news, moving psychology, and maybe scientific research, down a notch when it comes to public trust. There are some rearguard attempts to pooh-pooh the failed replication but they are not convincing.

Late 2016: We have now reached the “emperor has no clothes” phase. When seemingly solid findings in social psychology turn out not to replicate, we’re no longer surprised.

Rained real hard and it rained for a real long time

OK, that was a pretty detailed timeline. But here’s the point. Almost nothing was happening for a long time, and even after the first revelations and theoretical articles you could still ignore the crisis if you were focused on your research and other responsibilities. Remember, as late as 2011, even Daniel Kahneman was saying of priming studies that “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Then, all of a sudden, the world turned upside down.

If you’d been deeply invested in the old system, it must be pretty upsetting to think about change. Fiske is in the position of someone who owns stock in a failing enterprise, so no wonder she wants to talk it up. The analogy’s not perfect, though, because there’s no one for her to sell her shares to. What Fiske should really do is cut her losses, admit that she and her colleagues were making a lot of mistakes, and move on. She’s got tenure and she’s got the keys to PPNAS, so she could do it. Short term, though, I guess it’s a lot more comfortable for her to rant about replication terrorists and all that.

Six feet of water in the streets of Evangeline

Who is Susan Fiske and why does she think there are methodological terrorists running around? I can’t be sure about the latter point because she declines to say who these terrorists are or point to any specific acts of terror. Her article provides exactly zero evidence but instead gives some uncheckable half-anecdotes.

I first heard of Susan Fiske because her name was attached as editor to the aforementioned PPNAS articles on himmicanes, etc. So, at least in some cases, she’s a poor judge of social science research.

Or, to put it another way, she’s living in 2016 but she’s stuck in 2006-era thinking. Back 10 years ago, maybe I would’ve fallen for the himmicanes and air rage papers too. I’d like to think not, but who knows? Following Simonsohn and others, I’ve become much more skeptical about published research than I used to be. It’s taken a lot of us a lot of time to move to the position where Meehl was standing, fifty years ago.

Fiske’s own published work has some issues too. I make no statement about her research in general, as I haven’t read most of her papers. What I do know is what Nick Brown sent me:

For an assortment of reasons, I [Brown] found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

But that wasn’t the worst of it. It turns out that some of the numbers reported in that paper just couldn’t have been correct. It’s possible that the authors were doing some calculations wrong, for example by incorrectly rounding intermediate quantities. Rounding error doesn’t sound like such a big deal, but it can supply a useful set of “degrees of freedom” to allow researchers to get the results they want, out of data that aren’t readily cooperating.

There’s more at the link. The short story is that Cuddy, Norton, and Fiske made a bunch of data errors—which is too bad, but such things happen—and then when the errors were pointed out to them, they refused to reconsider anything. Their substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

Why do I go into all this detail? Is it simply mudslinging? Fiske attacks science reformers, so science reformers slam Fiske? No, that’s not the point. The issue is not Fiske’s data processing errors or her poor judgment as journal editor; rather, what’s relevant here is that she’s working within a dead paradigm. A paradigm that should’ve been dead back in the 1960s when Meehl was writing on all this, but which in the wake of Simonsohn, Button et al., Nosek et al., is certainly dead today. It’s the paradigm of the open-ended theory, of publication in top journals and promotion in the popular and business press, based on “p less than .05” results obtained using abundant researcher degrees of freedom. It’s the paradigm of the theory that, in the words of sociologist Jeremy Freese, is “more vampirical than empirical—unable to be killed by mere data.” It’s the paradigm followed by Roy Baumeister and John Bargh, two prominent social psychologists who were on the wrong end of some replication failures and just can’t handle it.

I’m not saying that none of Fiske’s work would replicate or that most of it won’t replicate or even that a third of it won’t replicate. I have no idea; I’ve done no survey. I’m saying that the approach to research demonstrated by Fiske in her response to criticism of that work of hers is a style that, ten years ago, was standard in psychology but is not so much anymore. So again, her discomfort with the modern world is understandable.

Fiske’s collaborators and former students also seem to show similar research styles, favoring flexible hypotheses, proof-by-statistical-significance, and an unserious attitude toward criticism.

And let me emphasize here that, yes, statisticians can play a useful role in this discussion. If Fiske etc. really hate statistics and research methods, that’s fine; they could try to design transparent experiments that work every time. But, no, they’re the ones justifying their claims using p-values extracted from noisy data, they’re the ones rejecting submissions from PPNAS because they’re not exciting enough, they’re the ones who seem to believe just about anything (e.g., the claim that women were changing their vote preferences by 20 percentage points based on the time of the month) if it has a “p less than .05” attached to it. If that’s the game you want to play, then methods criticism is relevant, for sure.

The river rose all day, the river rose all night

Errors feed upon themselves. Researchers who make one error can follow up with more. Once you don’t really care about your numbers, anything can happen. Here’s a particularly horrible example from some researchers whose work was questioned:

Although 8 coding errors were discovered in Study 3 data and this particular study has been retracted from that article, as I show in this article, the arguments being put forth by the critics are untenable. . . . Regarding the apparent errors in Study 3, I find that removing the target word stems SUPP and CE do not influence findings in any way.

Hahaha, pretty funny. Results are so robust to 8 coding errors! Also amusing that they retracted Study 3 but they still can’t let it go. See also here.

I’m reminded of the notorious “gremlins” paper by Richard Tol which ended up having almost as many error corrections as data points—no kidding!—but none of these corrections was enough for him to change his conclusion. It’s almost as if he’d decided on that ahead of time. And, hey, it’s fine to do purely theoretical work, but then no need to distract us with data.

Some people got lost in the flood

Look. I’m not saying these are bad people. Sure, maybe they cut corners here or there, or make some mistakes, but those are all technicalities—at least, that’s how I’m guessing they’re thinking. For Cuddy, Norton, and Fiske to step back and think that maybe almost everything they’ve been doing for years is all a mistake . . . that’s a big jump to take. Indeed, they’ll probably never take it. All the incentives fall in the other direction.

In her article that was my excuse to write this long post, Fiske expresses concerns for the careers of her friends, careers that may have been damaged by public airing of their research mistakes. Just remember that, for each of these people, there may well be three other young researchers who were doing careful, serious work but then didn’t get picked for a plum job or promotion because it was too hard to compete with other candidates who did sloppy but flashy work that got published in Psych Science or PPNAS. It goes both ways.

Some people got away alright

The other thing that’s sad here is how Fiske seems to have felt the need to compromise her own principles here. She deplores “unfiltered trash talk,” “unmoderated attacks” and “adversarial viciousness” and insists on the importance of “editorial oversight and peer review.” According to Fiske, criticisms should be “most often in private with a chance to improve (peer review), or at least in moderated exchanges (curated comments and rebuttals).” And she writes of “scientific standards, ethical norms, and mutual respect.”

But Fiske expresses these views in an unvetted attack in an unmoderated forum with no peer review or opportunity for comments or rebuttals, meanwhile referring to her unnamed adversaries as “methodological terrorists.” Sounds like unfiltered trash talk to me. But, then again, I haven’t seen Fiske on the basketball court so I really have no idea what she sounds like when she’s really trash talkin’.

I bring this up not in the spirit of gotcha, but rather to emphasize what a difficult position Fiske is in. She’s seeing her professional world collapsing—not at a personal level, I assume she’ll keep her title as the Eugene Higgins Professor of Psychology and Professor of Public Affairs at Princeton University for as long as she wants—but her work and the work of her friends and colleagues is being questioned in a way that no one could’ve imagined ten years ago. It’s scary, and it’s gotta be a lot easier for her to blame some unnamed “terrorists” than to confront the gaps in her own understanding of research methods.

To put it another way, Fiske and her friends and students followed a certain path which has given them fame, fortune, and acclaim. Question the path, and you question the legitimacy of all that came from it. And that can’t be pleasant.

The river have busted through clear down to Plaquemines

Fiske is annoyed with social media, and I can understand that. She’s sitting at the top of traditional media. She can publish an article in the APS Observer and get all this discussion without having to go through peer review; she has the power to approve articles for the prestigious Proceedings of the National Academy of Sciences; work by herself and her colleagues is featured in national newspapers, TV, radio, and even Ted talks, or so I’ve heard. Top-down media are Susan Fiske’s friend. Social media, though, she has no control over. That must be frustrating, and as a successful practitioner of traditional media myself (yes, I too have published in scholarly journals), I too can get annoyed when newcomers circumvent the traditional channels of publication. People such as Fiske and myself spend our professional lives building up a small fortune of coin in the form of publications and citations, and it’s painful to see that devalued, or to think that there’s another sort of scrip in circulation that can buy things that our old-school money cannot.

But let’s forget about careers for a moment and instead talk science.

When it comes to pointing out errors in published work, social media have been necessary. There just has been no reasonable alternative. Yes, it’s sometimes possible to publish peer-reviewed letters in journals criticizing published work, but it can be a huge amount of effort. Journals and authors often apply massive resistance to bury criticisms.

There’s also this discussion which is kinda relevant:

What do I like about blogs compared to journal articles? First, blog space is unlimited, journal space is limited, especially in high-profile high-publicity journals such as Science, Nature, and PPNAS. Second, in a blog it’s ok to express uncertainty, in journals there’s the norm of certainty. On my blog, I was able to openly discuss various ideas of age adjustment, whereas in their journal article, Case and Deaton had nothing to say but that their numbers “are not age-adjusted within the 10-y 45-54 age group.” That’s all! I don’t blame Case and Deaton for being so terse; they were following the requirements of the journal, which is to provide minimal explanation and minimal exploration. . . . over and over again, we’re seeing journal article, or journal-article-followed-by-press-interviews, as discouraging data exploration and discouraging the expression of uncertainty. . . . The norms of peer reviewed journals such as PPNAS encourage presenting work with a facade of certainty.

Again, the goal here is to do good science. It’s hard to do good science when mistakes don’t get flagged and when you’re supposed to act as if you’ve always been right all along, that any data pattern you see is consistent with theory, etc. It’s a problem for the authors of the original work, who can waste years of effort chasing leads that have already been discredited, it’s a problem for researchers who follow up on erroneous work, and it’s a problem for other researchers who want to do careful work but find it difficult to compete in a busy publishing environment with the authors of flashy, sloppy exercises in noise mining that have made “Psychological Science” (the journal, not the scientific field) into a punch line.

It’s fine to make mistakes. I’ve published work myself that I’ve had to retract, so I’m hardly in a position to slam others for sloppy data analysis and lapses in logic. And when someone points out my mistakes, I thank them. I don’t label corrections as “ad hominem smear tactics”; rather, I take advantage of this sort of unsolicited free criticism to make my work better. (See here for an example of how I adjusted my research in response to a critique which was not fully informed and kinda rude but still offered value.) I recommend Susan Fiske do the same.

Six feet of water in the streets of Evangeline

To me, the saddest part of Fiske’s note is near the end, when she writes, “Psychological science has achieved much through collaboration but also through responding to constructive adversaries . . .” Fiske emphasizes “constructive,” which is fine. We may have different definitions of what is constructive, but I hope we can all agree that it is constructive to point out mistakes in published work and to perform replication studies.

The thing that saddens me is Fiske’s characterization of critics as “adversaries.” I’m not an adversary of psychological science! I’m not even an adversary of low-quality psychological science: we often learn from our mistakes and, indeed, in many cases it seems that we can’t really learn without first making errors of different sorts. What I am an adversary of, is people not admitting error and studiously looking away from mistakes that have been pointed out to them.

If Kanazawa did his Kanazawa thing, and the power pose people did their power-pose thing, and so forth and so on, I’d say, Fine, I can see how these things were worth a shot. But when statistical design analysis shows that this research is impossible, or when replication failures show that published conclusions were mistaken, then damn right I expect you to move forward, not keep doing the same thing over and over, and insisting you were right all along. Cos that ain’t science. Or, I should say, it’s a really really inefficient way to do science, for individual researchers to devote their careers to dead ends, just cos they refuse to admit error.

We learn from our mistakes, but only if we recognize that they are mistakes. Debugging is a collaborative process. If you approve some code and I find a bug in it, I’m not an adversary, I’m a collaborator. If you try to paint me as an “adversary” in order to avoid having to correct the bug, that’s your problem.

They’re tryin’ to wash us away, they’re tryin’ to wash us away

Let me conclude with a key disagreement I have with Fiske. She prefers moderated forums where criticism is done in private. I prefer open discussion. Personally I am not a fan of Twitter, where the space limitation seems to encourage snappy, often adversarial exchanges. I like blogs, and blog comments, because we have enough space to fully explain ourselves and to give full references to what we are discussing.

Hence I am posting this on our blog, where anyone has an opportunity to respond. That’s right, anyone. Susan Fiske can respond, and so can anyone else. Including lots of people who have an interest in psychological science but don’t have the opportunity to write non-peer-reviewed articles for the APS Observer, who aren’t tenured professors at major universities, etc. This is open discussion, it’s the opposite of terrorism. And I think it’s pretty ridiculous that I even have to say such a thing which is so obvious.

P.S. More here: Why is the scientific replication crisis centered on psychology?

“Methodological terrorism”

[Image: aftermath of the Chelsea explosion]

Methodological terrorism is when you publish a paper in a peer-reviewed journal, its claim is supported by a statistically significant t statistic of 5.03, and someone looks at your numbers, figures out that the correct value is 1.8, and then posts that correction on social media.

Terrorism is when somebody blows shit up and tries to kill you.

Acupuncture paradox update

The acupuncture paradox, as we discussed earlier, is:

The scientific consensus appears to be that, to the extent that acupuncture makes people feel better, it does so by relaxing the patient; the acupuncturist might also help in other ways, for example by encouraging the patient to focus on his or her lifestyle.

But whenever I discuss the topic with any Chinese friend, they assure me that acupuncture is real. Real real. Not “yeah, it works by calming people” real or “patients respond to a doctor who actually cares about them” real. Real real. The needles, the special places to put the needles, the whole thing. I haven’t had a long discussion on this, but my impression is that Chinese people think of acupuncture as working in the same way that we understand that TVs or cars or refrigerators work: even if we don’t know the details, we trust the basic idea.

Anyway, I don’t know what to make of this. The reports of scientific studies finding no effect of acupuncture needles are plausible to me (not that I’ve read any of these studies in detail)—but if they’re so plausible, how come none of my Chinese friends seem to be convinced?

This does seem to be a paradox, as evidenced by some of the discussion in the 56 comments on the above post.

Anyway, I was reminded of this when Paul Alper pointed me to this news article from Susan Perry, entitled “Real and fake acupuncture have similar effects on hot flashes, study finds”:

Women reported improvements in the number and intensity of their hot flashes whether they received the real or the fake treatment — a strong indication that the placebo effect was at work with both. . . . And before anybody jumps on this study for being conducted by conventional physicians who are antagonistic to nonconventional medical treatments, I [Perry] will point out that the lead author is Dr. Carolyn Ee, a family physician at the University of Melbourne who is trained in — and uses — Chinese medicine, including acupuncture, with her patients.

Here’s the study, which reports:

Results: 327 women were randomly assigned to acupuncture (n = 163) or sham acupuncture (n = 164). At the end of treatment, 16% of participants in the acupuncture group and 13% in the sham group were lost to follow-up. Mean HF scores at the end of treatment were 15.36 in the acupuncture group and 15.04 in the sham group (mean difference, 0.33 [95% CI, −1.87 to 2.52]; P = 0.77). No serious adverse events were reported.
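As a quick sanity check on the numbers quoted above, here is a back-of-the-envelope sketch in R (a rough normal approximation using only the reported summary figures, not the study’s actual analysis) showing that the mean difference, confidence interval, and P value hang together:

# Rough consistency check (normal approximation) using only the reported summaries.
diff <- 0.33
se   <- (2.52 - (-1.87)) / (2 * 1.96)   # back out the standard error from the 95% CI
z    <- diff / se
2 * pnorm(-abs(z))                      # comes out around 0.77, matching the reported P value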

StanCon is coming! Sat, 1/21/2017

Save the date! The first Stan conference is going to be in NYC in January. Registration will open at the end of September.

 

When:

Saturday, January 21, 2017

9 am – 5 pm

 

Where:

Davis Auditorium, Columbia University

530 West 120th Street

4th floor (campus level), room 412

New York, NY 10027

 

Registration:

Registration will open at the end of September.

 

Early registration (on or before December 20, 2016):

– Student: $50

– Academic: $100

– Industry: $200

This will include coffee, lunch, and some swag.

 

Late Registration (December 21, 2016 and on):

– Student: $75

– Academic: $150

– Industry: $300

This will include coffee and lunch. Probably won’t get swag.

 

Contributed talks:

We’re looking for contributed talks. We will start accepting submissions at the end of September.

The contributed talks at StanCon will be based on interactive, self-contained notebooks, such as knitr or Jupyter, that will also take the place of proceedings.  For example, you might demonstrate a novel modeling technique or a simplified version of a novel application. Each submission should include the notebook and separate files containing the Stan program, data, initializations if used, and a permissive license for everything such as CC BY 4.0.
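For concreteness, here is a minimal sketch of the sort of self-contained chunk such a notebook might contain: a toy linear regression fit from R via rstan. Everything here (the model, the simulated data, the settings) is made up for illustration and is not from any actual submission.

# Toy example of a self-contained knitr/Jupyter chunk: simulate data,
# define a small Stan model inline, and fit it with rstan.
library(rstan)

stan_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
"

N <- 100
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N, 0, 0.5)   # simulated data with known parameters

fit <- stan(model_code = stan_code,
            data = list(N = N, x = x, y = y),
            iter = 1000, chains = 4)
print(fit)   # check that alpha, beta, and sigma are recovered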

 

Tentative Schedule:

8:00 – 9:00 Registration / Coffee / Breakfast

9:00 – 9:20 Opening remarks

9:20 – 10:30 Session 1

10:30 – 11:00 Coffee break

11:00 – 12:30 Session 2

12:30 – 2:00 Lunch

2:00 – 3:15 Session 3

3:15 – 3:45 Coffee break

3:45 – 5:00 Session 4

 

Sponsorship:

We are looking for some sponsorship to either defray costs or provide travel assistance. Please email stancon@mc-stan.org for more information.

 

Organizers:

Michael Betancourt (Columbia University)

Tamara Broderick (MIT)

Jonah Gabry (Columbia University)

Andrew Gelman (Columbia University)

Ben Goodrich (Columbia University)

Daniel Lee (Columbia University)

Eric Novik (Stan Group Inc)

Lizzie Wolkovich (Harvard University)

 

Hey, PPNAS . . . this one is the fish that got away.


Uri Simonsohn just turned down the chance to publish a paper that could’ve been published in a top journal (a couple years ago I’d’ve said Psychological Science but recently they’ve somewhat cleaned up their act, so let’s say PPNAS which seems to be still going strong) followed by features in NPR, major newspapers, BoingBoing, and all the rest. Ted talk too, if he’d played his cards right, maybe even a top-selling book and an appearance in the next issue of Gladwell.

Wow—what restraint. I’m impressed. I thought Nosek et al.’s “50 shades of gray” was pretty cool but this one’s much better.

Here’s Simonsohn on “Odd numbers and the horoscope”:

I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness,” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01: [Table: horoscope reading regressed on odd respondent ID, across specifications]

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

But he blew the gaff by letting us all in on the secret.

Simonsohn describes the general framework:

One popular way to p-hack hypotheses involves subgroups. Upon realizing analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or republicans, or extroverts — do. Another popular way is to get an interesting dataset first, and figure out what to test with it second.

Yup. And:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

Are you listening, economists?
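To see how easily that kind of exploration turns noise into a “finding,” here is a minimal simulation sketch in R (made-up variables and sample size, not Uri’s actual GSS analysis): generate pure noise, scan a handful of subgroups, and count how often at least one of them comes out “significant.”

# Sketch: pure noise plus a few subgroup analyses yields "significant"
# results far more often than the nominal 5%.
set.seed(123)
n_sims <- 1000
n <- 1500
hit <- logical(n_sims)
for (s in 1:n_sims) {
  odd_id    <- rbinom(n, 1, 0.5)       # "randomly assigned an odd ID"
  horoscope <- rbinom(n, 1, 0.3)       # outcome, independent of odd_id by construction
  sex       <- sample(c("F", "M"), n, replace = TRUE)
  age_group <- sample(c("young", "mid", "old"), n, replace = TRUE)
  party     <- sample(c("D", "R", "I"), n, replace = TRUE)
  subsets <- list(rep(TRUE, n),
                  sex == "F", sex == "M",
                  age_group == "young", age_group == "old",
                  party == "D", party == "R")
  pvals <- sapply(subsets, function(keep)
    summary(lm(horoscope[keep] ~ odd_id[keep]))$coefficients[2, 4])
  hit[s] <- min(pvals) < 0.05
}
mean(hit)   # rate of "at least one significant subgroup," several times the nominal 0.05

Even with only seven looks at the data, the chance of finding something publishable-looking somewhere is far above the nominal 5%, and that’s before trying alternative outcomes, cutoffs, and model specifications.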

A couple of concerns

But I do have a couple of concerns about Uri’s post.

1. Conceptual replications. Uri writes:

One big advantage is that with rich data sets we can often run conceptual replications on the same data.

To do a conceptual replication, we start from the theory behind the hypothesis, say “odd numbers prompt use of less traditional sources of information” and test new hypotheses.

Sure, but the garden of forking paths applies to replications as well. I know Uri knows this because . . . remember the dentist-named-Dennis paper? It had something like 10 different independent studies, each showing a statistically significant effect. Uri patiently debunked each of these, one at a time. Similarly, consider the embodied cognition literature.

Or, what about that paper about people with ages ending in 9? Again, it looked like a big mound of evidence, a bunch of different studies all in support of a common underlying theory—but, again, when you looked carefully at each individual analysis, there was no there there.

So, although I agree with Uri that conceptual replications can be valuable in principle, I think he needs a big red flashing WARNING sign explaining how you can think you have a mass of confirming evidence when really you don’t.

2. Uri recommends looking at how the treatment effect varies in a predicted way:

A closely related alternative is also commonly used in experimental psychology: moderation. Does the effect get smaller/larger when the theory predicts it should?

This is fine but I have two problems here. First, “the theory” is often pretty vague, as we saw for example in the ovulation-and-voting literature. Most of these theories can predict just about anything and can give a story where effects increase, decrease, or stay the same.

Second, interactions can be hard to estimate: they have bigger standard errors than main effects. So if you go in looking for an interaction, you can be disappointed, and if you find an interaction, it’s likely to be way overestimated in magnitude (type M error) and maybe in the wrong direction (type S error).

This is not to say that interactions aren’t worth studying, and it’s not to say you shouldn’t do exploratory analysis of interactions—the most important things I’ve ever found from data have been interactions that I hadn’t been looking for!—but I’m wary of a suggestion to improve weak research by looking for this sort of confirmation.
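Here is a minimal sketch of the standard-error point, using a made-up balanced 2×2 experiment: with the conventional +/- 1/2 coding, the interaction coefficient comes with twice the standard error of each main effect, so the same data carry much less information about the interaction.

# Balanced 2x2 design, +/- 0.5 coding, small made-up effects.
set.seed(456)
n_per_cell <- 50
x1 <- rep(c(-0.5, 0.5), each = 2 * n_per_cell)
x2 <- rep(rep(c(-0.5, 0.5), each = n_per_cell), 2)
y  <- 0.2 * x1 + 0.1 * x1 * x2 + rnorm(4 * n_per_cell)
fit <- lm(y ~ x1 * x2)
summary(fit)$coefficients[, "Std. Error"]
# With this coding, the x1:x2 interaction's standard error is twice that of
# x1 or x2, which is part of why underpowered studies so easily produce
# exaggerated (type M) or wrong-sign (type S) interaction estimates.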

Uri’s giving good advice if you’re studying something real, but if you’re doing junk science, I’m afraid he’s just giving people more of a chance to fool themselves (and newspaper editors, and NPR correspondents, and Gladwell, and the audience for Ted talks, and the editors of PPNAS, and so on).

3. Finally, this one I’ve discussed with Uri before: I don’t think the term “p-hacking” is broad enough. It certainly describes what Uri did here, which was to hack through the data looking for something statistically significant. But researchers can also do this without trying, just working through the data and finding things. That’s the garden of forking paths: p-values can be uninterpretable even if you only perform a single analysis on the data at hand. I won’t go through the whole argument again here; I just want to again register my opposition to the term “p-hacking” because I think it leads researchers to (incorrectly) think they’re off the hook if they have only performed a single analysis on the data they saw.

Summary

Uri writes:

Tools common in experimental psychology, conceptual replications and testing moderation, are viable solutions.

To which I reply: Only if you’re careful, and only if you’re studying something with a large and consistent effect. All the conceptual replications and testing of moderation aren’t gonna save you if you’re studying power pose, or ESP, or Bible Code, or ovulation and clothing, or whatever other flavor-of-the-month topic is hitting the tabloids.

Pro Publica Surgeon Scorecard Update

Adan Becerra writes:

In light of your previous discussions on the ProPublica surgeon scorecard, I was hoping to hear your thoughts about this article recently published in Annals of Surgery titled, “Evaluation of the ProPublica Surgeon Scorecard ‘Adjusted Complication Rate’ Measure Specifications.”​

The article is by K. Ban, M. Cohen, C. Ko, M. Friedberg, J. Stulberg, L. Zhou, B. Hall, D. Hoyt, and K. Bilimoria and begins:

The ProPublica Surgeon Scorecard is the first nationwide, multispecialty public reporting of individual surgeon outcomes. However, ProPublica’s use of a previously undescribed outcome measure (composite of in-hospital mortality or 30-day related readmission) and inclusion of only inpatients have been questioned. Our objectives were to (1) determine the proportion of cases excluded by ProPublica’s specifications, (2) assess the proportion of inpatient complications excluded from ProPublica’s measure, and (3) examine the validity of ProPublica’s outcome measure by comparing performance on the measure to well-established postoperative outcome measures.

They find:

ProPublica’s inclusion criteria resulted in elimination of 82% of all operations from assessment (range: 42% for total knee arthroplasty to 96% for laparoscopic cholecystectomy). For all ProPublica operations combined, 84% of complications occur during inpatient hospitalization (range: 61% for TURP to 88% for total hip arthroplasty), and are thus missed by the ProPublica measure. Hospital-level performance on the ProPublica measure correlated weakly with established complication measures, but correlated strongly with readmission.

And they conclude:

ProPublica’s outcome measure specifications exclude 82% of cases, miss 84% of postoperative complications, and correlate poorly with well-established postoperative outcomes. Thus, the validity of the ProPublica Surgeon Scorecard is questionable.

When this came up before, I wrote, “The more important criticisms involved data quality, and that’s something I can’t really comment on, at least without reading the report in more detail.”

And that’s still the case. I still haven’t put in any effort to follow this story. So I’ll repeat what I wrote before:

You fit a model, do the best you can, be open about your methods, then invite criticism. You can then take account of the criticisms, include more information, and do better.

So go for it, Pro Publica. Don’t stop now! Consider your published estimates as a first step in a process of continual quality improvement.

At this point, I’d like Pro Publica not to try to refute these published data-quality criticisms (unless they truly are off the mark) but rather to thank the critics and take this as an opportunity to do better. Let this be the next step in an ongoing process.

Redemption

[Image: monkey in an animal-testing lab]

I’ve spent a lot of time mocking Marc Hauser on this blog, and I still find it annoying that, according to the accounts I’ve seen, he behaved unethically toward his graduate students and lab assistants, he never apologized for manipulating data, and, perhaps most unconscionably, he wasted the lives of who knows how many monkeys in discredited experiments.

But, fine, nobody’s perfect. On the plus side—and I’m serious about this—it seems that Hauser was not kidding when, upon getting kicked out of his position as a Harvard professor, he said he’d be working with at-risk kids.

Here’s the website for his organization, Risk Eraser. I’m in no position to evaluate this work, and I’d personally think twice before hiring someone with a track record of faking data, but I looked at the people listed on his advisory board and they include Steven Pinker and Susan Carey, two professors in the Harvard Psychology Department. This is notable because Hauser dragged that department into the muck, and Pinker and Carey, as much as almost anybody, would have reason to be angry with him. But they are endorsing his new project. This suggests that these two well-respected researchers, who know Hauser much better than I do (that is, more than zero!), have some trust in him: they think his project is real and that it’s doing some good.

That’s wonderful news. Hauser is clearly a man of many talents, even if experimental psychology is not his strong point, and, no joke, no sarcasm at all, I’m so happy to see that he is using these talents productively. He’s off the academic hamster wheel and doing something real with his life. An encouraging example for Michael LaCour, Diederik Stapel, Jonah Lehrer, Team Power Pose, Team Himmicanes, and all the rest. (I think Weggy and Dr. Anil Potti are beyond recovery, though.)

We all make mistakes in various aspects of life, and let’s hope we can all turn things around in the manner of Marc Hauser: forget about trying to be a “winner” and just move forward and help others. I imagine Hauser is feeling good about this life shift as well, taking the effort he was spending trying to be a public intellectual and to stay ahead of the curve in science, and redirecting it toward helping others.

P.S. You might feel that I’m being too hard on Hauser here: even while celebrating his redemption I can’t resist getting in some digs on his data manipulation. But that’s kinda the point: even if Hauser has not fully reformed his ways, maybe especially if that is the case, his redemption is an inspiring story. In reporting this I’m deliberately not oversimplifying; I’m not trying to claim that Hauser has resolved all his issues. He still, to my knowledge, has shown neither contrition nor even understanding of why it’s scientifically and professionally wrong to lie about and hide your data. But he’s still moved on to something useful (at least, I’ll take the implicit word of Carey and Pinker on this one, I don’t know any of these people personally). And that’s inspiring. It’s good to know that redemption can be achieved even without a full accounting for one’s earlier actions.

P.P.S. Some commenters share snippets from Hauser’s Risk Eraser program and some of it sounds kinda hype-y. So perhaps I was too charitable in my quick assessment above.

Hauser may be “trying to help others” (as I wrote above), but that doesn’t mean his methods will actually help.

For example, they offer “the iWill game kit, a package of work-out routines that can help boost your students’ willpower through training.” And then they give the following reference:

Baumeister, R.F. (2012). Self-control: the moral muscle. The Psychologist.

We’ve heard of this guy before; he’s in the can’t replicate and won’t admit it club. Also this, which looks like another PPNAS-style mind-over-matter bit of p-hacking:

Job, V., Walton, G.M., Bernecker, K. & Dweck, C.S. (2013). Beliefs about willpower determine the impact of glucose on self-control. Proceedings of the National Academy of Sciences.

OK, so what’s going on here? I was giving Hauser the benefit of the doubt because of the endorsements from Pinker and Carey, who are some serious no-bullshit types.

But maybe that inference was a mistake on my part. One possibility is that Pinker and Carey fell for the Ted-talking Baumeister PPNAS hype, just like so many others in their profession. Another possibility is that they feel sorry for Hauser and agreed to be part of his program as a way to help out their old friend. This one seems hard to imagine—given what Hauser did to the reputation of their department, I’d think they’d be furious at the guy, but who knows?

Too bad. I really liked the story of redemption, but this Risk Eraser thing seems not so different from what Hauser was doing before: shaky science and fast talking, with the main difference being that this time he’s only involved in the “fast talking” part. Baumeister indeed. Why not go whole hog and start offering power pose training and ESP?

An auto-mechanic-style sign for data sharing

Yesterday’s story reminds me of that sign you used to see at the car repair shop:

[Image: auto mechanic’s labor-rates sign]

Maybe we need something similar for data access rules:

DATA RATES PER HOUR

If you want to write a press release for us        $   50.00
If you want to write a new paper using our data    $   90.00
If you might be questioning our results            $  450.00*
If you're calling from Retraction Watch            $30000.00

* Default rate unless you can convince us otherwise

Whaddya think? Anyone interested in adding a couple more jokes and formatting it poster style with some cute graphic?

Sharing data: Here’s how you do it, and here’s how you don’t

I received the following email today:

Professor Gelman,

My name is **, I am a senior at the University of ** studying **, and recently came across your paper, “What is the Probability That Your Vote Will Make a Difference?” in my Public Choice class. I am wondering if you are able to send me the actual probabilities that you calculated for all of the states, as some are mentioned in the paper, but I can’t find the actual data anywhere online.

The reason I ask is that I am trying to do some analysis on rational voter absenteeism. Specifically I want to see if there is any correlation between the probability that someone’s vote will make a difference (From your paper) and the voter turnout in each state in the 2008 election.

Thanks!

Hmmm, where are the data? I went to the page of my published papers, searched on “What is the probability” and found the file, which was called probdecisive2.pdf, then searched on my computer for that file name, found the directory, came across two seemingly relevant files, electionnight.R and nate.R, and sent this student a quick email with those two R files and all the data files that were referenced there. No big deal, it took about 5 minutes.

And then I was reminded of this item that Malte Elson pointed me to the other day, a GoFundMe website that begins:

My name is Chris Ferguson, I am a psychology professor at Stetson University in DeLand, FL. In my research, I’m studying how media affect children and young adults.

Earlier this year, another researcher from Brigham Young University published, in the journal Developmental Psychology, a three-year longitudinal study of the link between viewing relational aggression on TV and aggressive behavior. Longitudinal studies are rare in my field, so I was very excited to see this study, and eager to take a look at the data myself to check up on some of the analyses reported by the authors.

So I spoke with the Flourishing Families project staff who manage the dataset from which the study was published and which was authored by one of their scholars. They agreed to send the data file, but require I cover the expenses for the data file preparation ($300/hour, $450 in total; you can see the invoice here). Because I consider data sharing a courtesy among researchers, I contacted BYU’s Office of Research and Creative Activities and they confirmed that charging a fee for a scholarly data request is consistent with their policy.

Given I have no outside funding, I might not be able to afford the dataset of Dr. [Sarah] Coyne’s study, although it is very important for my own research. Although somewhat unconventional, I am hoping that this fundraising site will help me cover parts of the cost!

The paper in question was published in the journal Developmental Psychology. On the plus side, no public funding seems to have been involved, so I guess I can’t say that these data were collected with your tax dollars. If BYU wants to charge $300/hr for a service that I provide for free, they can go for it.

Here’s the invoice:

[Image: the invoice]

In the future, perhaps journals will require all data to be posted as a condition of publication, and then this sort of thing won’t happen anymore.

P.S. Related silliness here.