A few people pointed me to this editorial by D. Stephen Lindsay, the new editor of Psychological Science, a journal that in recent years has been notorious for publishing (and, even more notoriously, promoting) click-bait unreplicable dead-on-arrival noise-mining tea-leaf-reading research papers. It was getting so bad for a while that they’d be publishing multiple such studies in a single issue (see, for example, slides 15 and 16 of this presentation, or just enter Psychological Science in the search box on this blog).
This editorial seems great. Lindsay talks about replication problems and how researchers should do better. He warns about p-hacking, noise, and the difference between significance and non-significance not being itself statistically significant. In his letter he never quite says that Psychological Science itself has published papers with weak to no statistical evidence, but I guess that’s a political thing. Best in my opinion would be to (1) acknowledge current and past problems and then (2) do better in the future.
But if the Association for Psychological Science is too constrained to do (1), I’m still happy for them to do (2).
Lindsay concludes with this upbeat statement:
The editors of Psychological Science are confident that we can reduce the rate at which Type I errors are published without compromising other values (e.g., interestingness, relevance, elegance), and that is what we intend to do.
I believe in type 1 errors about as much as I believe in yoga, kings, Elvis, Zimmerman, and Beatles, but I appreciate the general sentiment. To be more precise, I might expect a decline in interestingness but an increase in relevance.
Measurement, measurement, measurement (and design): Doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements
There is one big thing I’d add to Lindsay’s statement, and that’s measurement and design.
Lindsay does talk about low power, which you get when data are noisy, but I don’t think this is enough. I worry that readers of his note will get the impression that non-replicability is a statistical problem or maybe a procedural problem to be solved by reforms such as preregistration and minimization of p-hacking. But fundamentally I think it’s more of a problem of measurement and study design, a point I’ve been making for the past year or so in this space.
One reason so many of these Psychological Science studies are so dead on arrival is that they hinge on noisy measurements in uncontrolled, between-subject designs. That puts you right here, and no amount of preregistration or fancy statistics is going to solve your problems.
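To make this concrete, here’s a minimal simulation sketch in Python (the sample sizes, effect size, and noise levels are invented for illustration, not taken from any of the studies in question): adding measurement noise on top of the underlying trait in a simple between-subject comparison drags down the power of the study, no matter how carefully the resulting data are analyzed.

```python
# A toy sketch (assumed numbers, not from any actual study): how measurement
# noise in a between-subject design dilutes a real effect and kills power.
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 50          # subjects per condition
true_effect = 0.5         # true group difference on the underlying trait (trait sd = 1)
n_sims = 5_000

for noise_sd in [0.0, 1.0, 3.0]:   # measurement noise added on top of the trait
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0, 1, n_per_group) + rng.normal(0, noise_sd, n_per_group)
        treated = rng.normal(true_effect, 1, n_per_group) + rng.normal(0, noise_sd, n_per_group)
        # Simple two-sample comparison of means with a normal approximation
        se = np.sqrt(control.var(ddof=1)/n_per_group + treated.var(ddof=1)/n_per_group)
        if abs(treated.mean() - control.mean()) / se > 1.96:
            hits += 1
    print(f"measurement noise sd = {noise_sd}: power ~ {hits/n_sims:.2f}")
```

With these invented settings, going from a clean measurement to one with three units of noise drops the power from roughly 70% to somewhere around 10%, which is the regime where the “significant” results that do get published are mostly noise.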
When people asked me if I thought the fat-arms-and-voting study or the ovulation-and-clothing study or the ovulation-and-voting study should be subject to preregistered replications, I said: sure, if you want to replicate these studies, go for it, but I wouldn’t really recommend wasting your time. The measurements are so noisy that such replications would be primarily of methodological interest, just to demonstrate that with new random data you’ll be able to find new random patterns.
So I’d love it if the official statement from the Psychological Science editor emphasized that performing more replicable studies is not just a matter of being more careful in your data analysis (although that can’t hurt) or increasing your sample size (although that, too, should only help) but also a matter of putting real effort into design and measurement. All too often I feel like I’m seeing the attitude that statistical significance is a win or a proof of correctness, and I think this pushes researchers in the direction of going the cheap route, rolling the dice, and hoping for a low p-value that can be published. But when measurements are biased, noisy, and poorly controlled, even if you happen to get that p less than .05, it won’t really be telling you anything.
And some other things
As noted above, I like Lindsay’s editorial. But there are a few places where I’d say things differently.
I’m loath to make these comments because I don’t want to dilute the major points I just made above, and I certainly don’t want to piss off Lindsay, who seems to be on my side in this general issue.
But ultimately I think I’m more effective when I just say what I think (at least, when it comes to my areas of expertise). So here goes. But, again, let me emphasize that in my pickiness here, I’m just trying to help, I’m not trying to get into any fights.
1. Garden of forking paths. Lindsay decries “p-hacking”: “practices that inflate the Type I error rate, such as (a) dropping subjects, observations, measures, or conditions that yielded inconvenient data; (b) applying poorly motivated and post hoc data transformations; (c) using questionable covariates; (d) suppressing mention of experiments that were conducted but ‘didn’t work’; and (e) using the optional-stopping strategy . . .” I agree with Lindsay that these are problems “whether these sorts of things are done innocently or nefariously.”
But I think he should go further. Eric Loken and I use the term “garden of forking paths” to refer to the many choices in data processing and analysis that can be taken, contingent on data. The key point is that even if you, the researcher, do only one analysis of existing data, your p-values will still in general be wrong if you could have done something different, had the data been different. It’s a Monty Hall kind of thing.
This upsets some people—they don’t like to be penalized, as it were, for analyses they didn’t do—but, sorry, that’s the logic of p-values. As Eric and I explain in our paper, the p-value is necessarily defined based on what you would’ve done. If you don’t want outsiders speculating on what you would’ve done, had the data been different, you can preregister or you can use other statistical methods. If you want to play the p-value game, you gotta play by the rules.
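Here’s a toy simulation of what that looks like (my own illustrative scenario, not one from the Gelman and Loken paper): the analyst runs exactly one t-test, but which outcome gets tested depends on the data, and under a true null the single reported p-value rejects at well above the nominal 5% rate.

```python
# A toy sketch: only ONE test is run, but WHICH test depends on the data,
# so the nominal 5% Type I error rate is no longer the actual rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, hits = 40, 10_000, 0

for _ in range(n_sims):
    # Two outcomes measured on the same subjects, with no true effect on either.
    y1 = rng.normal(0, 1, n)
    y2 = rng.normal(0, 1, n)
    # Data-contingent choice: the analyst "naturally" focuses on whichever
    # outcome looks further from zero, then runs a single one-sample t-test.
    y = y1 if abs(y1.mean()) > abs(y2.mean()) else y2
    if stats.ttest_1samp(y, 0).pvalue < 0.05:
        hits += 1

print(f"Nominal Type I error: 0.05, actual: {hits/n_sims:.3f}")  # comes out near 0.10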
Anyway, I think it’s important to emphasize this “forking paths” thing. Otherwise I fear that researchers will think that, because they only did a single analysis on their dataset, they haven’t p-hacked. Just a sentence would do here, something like this: “P-values can be invalidated by p-hacking or the garden of forking paths, even when only a single analysis was performed on the existing data.”
2. Moving away from “power.” I appreciate all the warnings about noisy, low-power studies. But ultimately I don’t think power is quite the right way to look at this. The trouble is that “power” is all about getting statistical significance (p less than .05), which isn’t really where it’s at. John Carlin and I discuss our preferred framework in terms of type M and S errors in our recent paper in Perspectives on Psychological Science.
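For readers who want a concrete picture, here’s a small simulation sketch of the type M (magnitude) and type S (sign) error idea, with invented numbers rather than the worked examples from the Gelman and Carlin paper: assume a small true effect measured with a large standard error, and look only at the estimates that happen to reach p < .05.

```python
# A toy sketch of type M / type S errors (illustrative numbers, not from the paper):
# a small true effect, a noisy study, and what the "significant" estimates look like.
import numpy as np

true_effect = 0.1     # assumed true effect
se = 0.5              # standard error of the estimate (a noisy study)
n_sims = 100_000
rng = np.random.default_rng(0)

est = rng.normal(true_effect, se, n_sims)        # point estimates from replicated studies
signif = np.abs(est) > 1.96 * se                 # the ones that reach p < .05

power = signif.mean()
type_s = (est[signif] * np.sign(true_effect) < 0).mean()  # wrong sign, given significance
type_m = np.abs(est[signif]).mean() / abs(true_effect)    # average exaggeration factor

print(f"power ~ {power:.2f}, type S ~ {type_s:.2f}, type M (exaggeration) ~ {type_m:.1f}")
```

With these invented numbers the study has power of only a few percent, and the estimates that do reach significance are on average roughly an order of magnitude too large and have a nontrivial chance of pointing in the wrong direction, which is exactly why statistically significant findings from noisy studies fail to replicate.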
3. Abandoning “statistical significance.” Lindsay expresses concerns about “a p value only slightly less than .05” but I feel that the implication is that the p-value maps in some direct way to evidence. To disabuse you of this attitude, I refer you to this classic example from Carl Morris.
Overall I think this is all a step in the right direction, and I’m very happy that the editor of Psychological Science has released this statement.
Next stop, PPNAS. (Ha! That’ll be the day.)