On the sister blog we’re supposed to switch to catchy headlines to match our Big Media presence. I actually think that’s a good idea. It’s not easy to write good titles, and it’s worth trying to do better in that regard.
But coming up with catchy titles is hard work, and it’s a bit of a relief to be back here, free to do whatever I want. So, just for fun, I tried to come up with the most boring (while remaining accurate) title for this post. Above, you can see what I came up with!
OK, now to the real story. Dan Wisneski sent me a copy of this document [updated version here] by David Funder, John Levine, Diane Mackie, Carolyn Morf, Carol Sansone, Simine Vazire, and Stephen West, that’s been “starting to make the rounds in social psych circles.” The paper in question addresses the now-familiar topic of unreplicable research, the sort of linkbait which often seems to get published in “Psychological Science” nowadays.
As an outsider to the field, my impression is that concern over this issue started to increase following John Ioannidis’s famous 2005 paper, “Why most published research findings are false” (duly cited in Funder’s report) and then got kicked into high gear after the 2011 publication in JPSP of Daryl Bem’s ridiculous paper on ESP.
As Mickey Kaus might say, the media treatment of Bem’s paper was just like that of any other bit of science hype, only faster: the entire cycle, from release of the preprint to announced publication to credulous news reports (after all, JPSP is a top journal, and science writers know that) to skepticism to controversy to debunking, all seemed to happen in just a few days.
To say it in a way that you will all understand: Daryl Bem’s ESP paper is the “The Rutles” of science reporting, brilliantly and concisely capturing, in deadpan style, all the tropes in just the right order, with no distracting details to get in the way. (It was, for example, convenient that there was no associated scandal and no political valence to the research. That way, people had to react to the story itself and could not simply rely on preprogrammed ideological reactions, as can happen with all those evolutionary psychology stories.)
In its stark, almost parodic series of steps, the Bem episode made a lot of people realize: Hey, this happens all the time! Maybe we can short-circuit the process next time and just jump straight to the last step of not believing the outrageous claim?
But that’s not so easy. Bem’s elusive ESP finding may be easy to discard, and maybe we can also laugh at the “dentists named Dennis” study (although I believed that one when it came out, and I remain sympathetic to the hypothesis). And what about the “think outside the box” study, or “stereotype threat”? It might be that none of these are real, but most of us don’t feel so comfortable rejecting all of them. Recall that even Brian Nosek, a central player on Team Skeptic, had his own pet hypothesis (“50 shades of gray”) which he had to abandon only after it failed to replicate.
To step back for a moment, here’s what a lot of thoughtful people were saying after the Bem debacle: In politics, they sometimes say that the scandal is not what’s illegal, but what’s legal. Similarly, the scandalous aspect of the Bem study was not that he was some sort of badass Mark Hauser who broke all the rules and just didn’t care, but rather that he followed all the rules. He did what everyone said you should do, and look where that left him! So, many thoughtful people said, instead of letting Bem twist slowly in the wind, we should figure out what went wrong so that this sort of paper can be routinely published.
That is, we need to change the rules. And that’s what this discussion is all about.
Here’s a key part of the story (from page 20 of the Funder et al. report):
The late meta-analyst John Hunter wryly offered his observations on the progress of research in many areas of psychology given that researchers often ignore considerations of effect size and statistical power. According to Hunter, a research area begins with the proposal of an interesting hypothesis and the excitement of a first demonstration study that finds a large effect size. Subsequent research tries to clarify the phenomenon by designing studies to rule out alternative explanations, thereby making the effect size smaller. This stage is followed by a generation of studies investigating mediation and moderation, which further reduce the effect size.
Or maybe there is no effect there at all. Or, perhaps closer to the truth in many cases, maybe the effect is positive in some settings and negative in others, with variation being high enough that substantially different effects will be found in different populations at different times under different experimental conditions.
Here’s what I like about the Funder et al. report: It is thoughtfully addressing important and real questions, and in my opinion its recommendations are generally going in the right direction.
Here’s what I don’t like about the report: It remains tied to what I see as an old-fashioned statistical approach based on power analysis (that is, statistical significance and “p less than .05”) which in turn relies on a conception of science that turns on the discovery of nonzero effects. As alluded to above, in the human-science settings with which I am familiar, just about nothing is zero, but effects and comparisons can be highly variable, so much so that in many cases the idea of “the effect” of something does not make much sense.
In addition, I don’t really buy the Funder committee’s acceptance of the standard paradigm in which a researcher can specify a hypothesis ahead of time and then simply test it. I do believe that pre-registration of research hypotheses is both possible and a good idea, but I think this preregistration makes most sense as the second half of a study (as in the above-noted Nosek et al. paper), following up on a more traditional exploratory (even if theory-driven) part.
I am devoting more space to what I don’t like about the report than what I do like, so before going on I should probably emphasize that, overall, I think the report is a big step forward.
Let me say this another way. The report is completely reasonable. Indeed, it gives the sort of advice that I might have recommended to practitioners, three or five or more years ago. But, given what I know now, I don’t think it goes far enough.
Here are the report’s recommendations for research practice:
1. Describe and address choice of N and consequent issues of statistical power.
2. Report effect sizes and 95% confidence intervals for reported findings.
3. Avoid “questionable research practices.”
4. Include in an appendix the verbatim wording (translated if necessary) of all independent and dependent variable instructions, manipulations and measures. If the manuscript is published, this appendix can be made available as an on-line supplement to the article.
5. Adhere to SPSP’s “Data Sharing Policy” which states that: “The corresponding author of every empirically-based publication is responsible for providing the raw data and related coding information . . .
6. Encourage, and improve the availability of publication outlets for replication studies.
7. Maintain flexibility and openness to alternative standards and methods when evaluating research.
I’m 100% with them on items 4, 5, 6, 7. I recently had an experience in which I had some difficulty commenting on a paper that had been published in a top journal, where neither the article nor the supplemental material ever gave the survey questions used in the study, nor was there a clear statement of the data-exclusion rules or the raw data themselves. This should always be available. But if it’s neither a requirement nor a norm, we can’t expect people to share this information, not because of secrecy but simply because it takes effort to put it all in there, and the #1 goal in writing a paper is typically to get it accepted, not to provide information for later researchers.
As for items 1, 2, 3, I think they represent a good start but we can do better:
1. The choice of N is important, and I’m completely in favor of design calculations; I just think they should be decoupled from “statistical power,” which is a very specific idea that is tied to statistical significance. Design calculations are relevant whether or not statistical significance is going to be part of the story.
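To make this concrete, here is a minimal sketch of the kind of design calculation I have in mind (in the spirit of the “retrodesign” idea), decoupled from any yes/no power target. All the numbers and the function name are my own illustrative choices, not from the Funder et al. report: given a hypothesized true effect and the standard error of its estimate, we can ask not just “what’s the power?” but “conditional on statistical significance, how wrong will the estimate be?”

```python
import numpy as np
from scipy import stats

def design_calc(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Design calculation for a normally distributed estimate:
    power, Type S rate (significant results with the wrong sign),
    and exaggeration ratio (average significant |estimate| / true effect)."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    estimates = rng.normal(true_effect, se, n_sims)   # simulated replications
    significant = np.abs(estimates) > z_crit * se     # the p < .05 filter
    power = significant.mean()
    sig_est = estimates[significant]
    type_s = (np.sign(sig_est) != np.sign(true_effect)).mean()
    exaggeration = np.abs(sig_est).mean() / abs(true_effect)
    return power, type_s, exaggeration

# A small true effect measured noisily: the significant estimates
# greatly exaggerate the truth, and some have the wrong sign.
power, type_s, exag = design_calc(true_effect=0.1, se=0.3)
print(f"power={power:.2f}, Type S={type_s:.2f}, exaggeration={exag:.1f}x")
```

The point of the calculation is what it reveals even when significance is not the goal: with a small effect and a noisy design, the rare significant estimate is a multiple of the truth.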
2. Effect sizes and 95% intervals are fine but they don’t really solve the key problem of the statistical significance filter. When we focus on statistically significant results, we will systematically overestimate the magnitude of effects, sometimes by a huge amount. So the corresponding effect sizes will be misleading. (Yes, in some settings an expert can see the unrealistically high effect-size estimates and scream that there is a problem, but what seems more common is people just accepting these ridiculous numbers at face value: more beautiful people being 8 percentage points more likely to have girl babies, ovulating women being 20 percentage points more likely to vote in a certain way, women at more fertile days in their menstrual period being 3 times (!) more likely to wear certain colors of clothing.) And confidence intervals can be even worse, in that the extreme endpoint of the confidence interval can well be out in never-never land. You’ll see this a lot in epidemiology studies, where the 95% interval for the risk ratio is something like [1.1, 8.5], and realistically we don’t believe it could be much higher than 1.5 or 2. The only point of the interval is that it excludes zero.
I’m not saying that interval estimates are useless. It’s important to have a sense of inferential uncertainty. But with small sample size (or, more generally, sparse data), classical confidence intervals don’t look so good. They’ll include all sorts of unreasonable values.
Funder et al. write, “Confidence intervals can be easily constructed for most types of effects, but sometimes complications arise. Some confidence intervals (e.g., for the Pearson r) are not symmetric and require a normalizing transformation (e.g., the Fisher r to z transformation); others do not have a known mathematical solution and can only be constructed empirically through repeated sampling procedures (e.g., bootstrapping)”—but all that has nothing to do with anything. The problem is much more fundamental than that.
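Here’s a toy simulation of the fundamental problem (my numbers are invented for illustration, not taken from any real study): a modest true risk ratio estimated with a large standard error, filtered through significance. The significant estimates, and especially their upper interval endpoints, land in never-never land.

```python
import numpy as np
from scipy import stats

# A modest true risk ratio, estimated noisily on the log scale
# (sparse data), then filtered through the p < .05 test.
rng = np.random.default_rng(1)
true_log_rr = np.log(1.15)   # true risk ratio of 1.15
se = 0.45                    # large standard error
n = 50_000

est = rng.normal(true_log_rr, se, n)        # simulated study estimates
sig = np.abs(est / se) > 1.96               # the "significant" studies
hi = est + 1.96 * se                        # upper 95% CI endpoint (log scale)

print(f"significant fraction: {sig.mean():.2f}")
print(f"median significant risk-ratio estimate: {np.exp(np.median(est[sig])):.1f}")
print(f"median upper CI endpoint among significant results: {np.exp(np.median(hi[sig])):.1f}")
```

With these (made-up) numbers, the true risk ratio is 1.15, but the typical significant study reports an estimate well above 2 with an interval stretching past 5: exactly the sort of [1.1, 8.5]-style interval whose only real content is that it excludes the null.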
3. Funder et al. want researchers to avoid questionable research practices and avoid “procedures that look at the results and then tweak the data post hoc to achieve statistical significance,” including:
(1) conducting multiple tests of significance on a data set without statistical correction; (2) running participants until significant results are obtained (i.e., data-peeking to determine the stopping point for data collection); (3) dropping observations, measures, items, experimental conditions, or participants after looking at the effects on the outcomes of interest; and (4) running multiple experiments with similar procedures and only reporting those yielding significant results. These practices may not be equally problematic; both (3) and (4) have particularly great potential to lead to serious inflation of the Type 1 error rate and yet not be recognized in the review process.
I agree, but I’m afraid this is close to useless advice, because the people who do these research practices don’t in general realize they’re doing it! Eric Loken and I have a whole paper on this topic, but, very briefly: what researchers are doing is making data-analysis choices (including all of (1), (2), and (3) above, and also including (4) in the sense that they are choosing when to pool and when to compare-and-contrast when they do multiple experiments) contingent on data. But, because these researchers’ decisions are contingent on data, and they only see one data set, they don’t realize their multiplicity problems. You see researcher after researcher (Bem included) insisting that their data selection and data analysis are completely theory-driven, and thus not subject to multiple comparisons problems—but when you look carefully you see lots and lots of decisions that were not prespecified.
The point here is that different data sets would lead to different analyses, hence there’s a multiplicity problem even if only one analysis was ever considered for the particular dataset that was observed. The closely related point is that I fear the Funder committee recommendations will be misleading because there are lots and lots of researchers who don’t do “questionable” analyses as described above, but still have multiplicity problems because their data manipulations are contingent on the data they saw.
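This is easy to demonstrate in simulation. The following toy example is my own construction, not from the report or my paper with Loken: the data are pure noise, and the analyst runs only one significance test per dataset, with a rule that sounds theory-driven (“the effect appeared concentrated among women, so we focused there”). Because which test gets run depends on the data, the Type 1 error rate exceeds .05 even though no individual analysis looks questionable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def one_study(n_per_cell=25):
    # Null world: no treatment effect for anyone.
    women_t = rng.normal(size=n_per_cell); women_c = rng.normal(size=n_per_cell)
    men_t = rng.normal(size=n_per_cell); men_c = rng.normal(size=n_per_cell)
    diff_women = women_t.mean() - women_c.mean()
    diff_men = men_t.mean() - men_c.mean()
    if abs(diff_women) > abs(diff_men):
        # Data-contingent choice: "the effect looks concentrated among
        # women," so test only women. Just one test is run.
        _, p = stats.ttest_ind(women_t, women_c)
    else:
        # Otherwise pool the sexes and test everyone. Still just one test.
        _, p = stats.ttest_ind(np.concatenate([women_t, men_t]),
                               np.concatenate([women_c, men_c]))
    return p

p_values = np.array([one_study() for _ in range(5000)])
print(f"Type 1 error rate at .05: {(p_values < 0.05).mean():.3f}")
```

Each simulated researcher honestly reports the single analysis that was performed; the multiplicity lives in the analyses that *would have been* performed had the data come out differently.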
Finally, here are the report’s recommendations for educational practice:
1. Encourage a culture of “getting it right” rather than “finding significant results.”
2. Teach and encourage transparency of data reporting, including “imperfect” results.
3. Improve methodological instruction on topics such as effect size, confidence intervals, statistical power, meta-analysis, replication, and the effects of questionable research practices.
4. Model sound science and support junior researchers who seek to “get it right.”
I’m happy with all these (as long as item 3 is interpreted in light of my comments above regarding the problems with effect size, confidence intervals, and statistical power as general research tools).
P.S. I agree with footnote 5 of the Funder et al. report! And I suspect Hal Stern agrees with it as well.