Ivan Zupic points me to this online discussion of the article, Dwork et al. 2015, The reusable holdout: Preserving validity in adaptive data analysis.
The discussants are all talking about the connection between adaptive data analysis and the garden of forking paths; for example, this from one commenter:
The idea of adaptive data analysis is that you alter your plan for analyzing the data as you learn more about it. . . . adaptive data analysis is typically how many researchers actually conduct their analyses, much to the dismay of statisticians. As such, if one could do this in a statistically valid manner, it would revolutionize statistical practice.
Just about every data analysis I’ve ever done is adaptive, and I do think most of what I do is “statistically valid,” so whassup with that? A clue is provided by my 2012 paper with Jennifer and Masanao, “Why we (usually) don’t have to worry about multiple comparisons.” If you fit a multilevel model (or a Bayesian model with informative prior distributions), then it’s perfectly “statistically valid” to look at many comparisons. The key is to aim to do all the analyses you might do, avoiding selection bias by performing all relevant comparisons, and avoiding the problems with p-values by partially pooling all your comparisons rather than just reporting a selected subset.
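To see why partial pooling protects you when you look at many comparisons, here is a minimal sketch using the simple normal-normal model with known variances and simulated data (a toy version, not the full multilevel analysis in the paper): each group’s raw estimate is shrunk toward the grand mean, which tames the extreme comparisons that selection would otherwise seize on.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated setting: 20 groups, true effects drawn from N(0, tau^2),
# each estimated with known sampling standard error sigma.
tau, sigma, J = 0.5, 1.0, 20
theta = rng.normal(0.0, tau, size=J)   # true group effects
y = rng.normal(theta, sigma)           # noisy estimates, one per group

# Partial pooling: the posterior mean shrinks each raw estimate toward
# the grand mean, with shrinkage factor sigma^2 / (sigma^2 + tau^2).
shrinkage = sigma**2 / (sigma**2 + tau**2)
theta_pooled = (1 - shrinkage) * y + shrinkage * y.mean()

# The shrunken estimates are closer to the truth on average, and the
# most extreme raw comparisons are pulled in the most.
raw_error = np.mean((y - theta) ** 2)
pooled_error = np.mean((theta_pooled - theta) ** 2)
print(raw_error, pooled_error)
```

With these settings the shrinkage factor is 0.8, so the raw estimates are pulled strongly toward the mean; the largest raw differences between groups are exactly the ones that shrink the most, which is why reporting all partially pooled comparisons does not carry the usual multiple-comparisons penalty.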
So, is this correct? The problem lies not in adapting your analysis but in ignoring other possibilities as you go along. Multiple comparisons are informative if you expose and examine all the comparisons (or as many as possible).
Thus, for instance, if Carney, Cuddy, and Yap had controlled for gender, they might have cast productive doubt on their findings, as Shravan has pointed out (http://statmodeling.stat.columbia.edu/2016/09/30/why-the-garden-of-forking-paths-criticism-of-p-values-is-not-like-a-famous-borscht-belt-comedy-bit/#comment-318708) and Carney has essentially acknowledged.
You state in your paper: “The main multiple comparisons problem is that the probability a researcher wrongly concludes that there is at least one statistically significant effect across a set of tests, even when in fact there is nothing going on, increases with each additional test.” This problem would be less severe, I take it, if the researcher looked carefully at the differences between the results, instead of seizing on the ones with the statistically significant effects.
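The growth the quoted passage describes is easy to compute: under k independent tests at level alpha, the probability of at least one false positive (the family-wise error rate) is 1 − (1 − alpha)^k.

```python
# Family-wise error rate under independence: with k tests at level alpha,
# the chance of at least one false positive is 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 3))
# -> 1 0.05, 5 0.226, 20 0.642, 100 0.994
```

With 20 tests you are already more likely than not to “find” something even when nothing is going on, which is the selection problem that examining all the comparisons together is meant to defuse.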
One day I will understand Bayesian modeling, I hope (I am working slowly, very slowly, through Data Analysis Using Regression and Multilevel/Hierarchical Models).
Diana:
You might find it useful to watch these lectures first:
Statistical Rethinking, Winter 2015, Richard McElreath (21 videos): https://www.youtube.com/playlist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z
Thank you. I look forward to watching them.
If they had looked the ambiguity of their results in the eye, so to speak, the paper would have been unpublishable, because papers demand “closure”: decisive conclusions.
That is untenable. I am not generally a “rally person,” but I would go to a rally to protest this situation.
“Stop demanding closure! Stop demanding closure!”
“Hey, stop! Do the math! Look down *every* forking path!”
“What do we want? Thorough and unflinching investigation of uncertainties, discrepancies, and ambiguities! When do we want it? Now!”
Suggestion for rally sign: Closure leads to closed-mindedness
Aki already looked at that Dwork thing; according to him the results were not encouraging.
How about a slightly more restrictive position? Something like Simonsohn, Simmons, and Nelson’s Specification Curve framework to facilitate discussion regarding “All Reasonable Specifications”: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998
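For readers unfamiliar with the framework, a specification curve estimates the effect of interest under every reasonable analysis specification and reports the whole set. A toy sketch (invented variable names; a real application would vary the full set of reasonable modeling decisions, not just control-variable subsets):

```python
import itertools
import numpy as np

# Toy specification curve: estimate the coefficient on a treatment
# variable under every combination of optional control variables.
rng = np.random.default_rng(1)
n = 200
treat = rng.normal(size=n)
controls = {name: rng.normal(size=n) for name in ("age", "gender", "income")}
y = 0.3 * treat + rng.normal(size=n)   # true effect is 0.3

estimates = {}
names = list(controls)
for r in range(len(names) + 1):
    for subset in itertools.combinations(names, r):
        cols = [treat] + [controls[c] for c in subset]
        Xmat = np.column_stack([np.ones(n)] + cols)
        beta = np.linalg.lstsq(Xmat, y, rcond=None)[0]
        estimates[subset] = beta[1]    # coefficient on treat

# 2^3 = 8 specifications; the "curve" is these estimates sorted by size.
for spec, est in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(spec, round(est, 3))
```

The point is that instead of reporting one hand-picked specification, you expose how the estimate varies across all of them, which is a disciplined version of “look down every forking path.”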
Along these lines, what do you think of John Elder’s concept of target shuffling?
https://www.cmich.edu/colleges/CHP/ihbi/Events/Documents/Target%20Shuffling%20-%20John%20Elder.pdf
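For those who don’t want to open the slides: target shuffling is essentially a permutation test applied to the whole model-search procedure, not to a single pre-chosen test. A toy sketch (invented data and helper names, assumed from the slides’ description):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 observations, 10 candidate predictors, none truly related
# to the outcome (pure noise), so any "discovery" is spurious.
n, p = 50, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def best_abs_correlation(X, y):
    """Strength of the best predictor a data-dredging search would find."""
    return max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1]))

observed = best_abs_correlation(X, y)

# Target shuffling: permute the outcome (breaking any real relationship),
# repeat the same search, and see how often shuffled data yields a result
# at least as strong as the observed one.
n_shuffles = 1000
null = np.array([best_abs_correlation(X, rng.permutation(y))
                 for _ in range(n_shuffles)])
p_value = np.mean(null >= observed)
print(observed, p_value)
```

Because the shuffled runs repeat the entire search, the resulting null distribution automatically accounts for the selection involved in picking the best predictor, which an ordinary single-test p-value would not.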
Frequentism seems like so much fun I thought I’d give it a try:
The only way to avoid selection bias is to consider every conceivable function of the data, compute its distribution from the sampling distribution, and get the p-value. If you do anything less you’re not getting the real p-values and can get wrong results. Since frequentist analysis in the wild is usually wrong, it must be because they fail to compute the p-value of every conceivable statistic.
So where do I go to pick up my Philosophy of Science Ph.D.? Is there a Statistics Nobel?
I love your comments.
Maybe an Ig Nobel?
How exactly do you “do all the analyses you might do”? I think this basically requires implicit pre-registration of your whole data-analysis procedure, making it non-adaptive in the sense of the discussed article.
Vitaly:
I said to aim to do all the analyses you might do. It’s an aim, even if we don’t get all the way there, just as in statistical analysis we aim to model the underlying process and the data-generating mechanism.
Andrew,
I certainly agree that this is a good aim to have. But how realistic is it? Quoting from your article:
“What, then, can be done? Humphreys, Sanchez, and Windt (2013) and Monogan (2013) recommend preregistration: defining the entire data-collection and data-analysis protocol ahead of time. For most of our own research projects this strategy hardly seems possible: in our many applied research projects, we have learned so much by looking at the data. Our most important hypotheses could never have been formulated ahead of time.”
What I’m saying is that, of course, we should try to account as much as we can for all the data-dependent decisions. But it seems unrealistic to expect that all “forks” will be accounted for. Should we then just allow ourselves to ignore the deviations from pre-specified analysis?
Woohoo, a second cross-validated posting of mine made Gelman’s blog! I feel like I’ve made it as a statistician. Although if my postings occur too frequently, that likely means I’m not qualified as a statistician…
Anyway, in that post, I really didn’t want to get into whether adaptive data analysis is, in general, a good idea; I was just interested in the Science paper. Perhaps I should have been more careful about my wording, as most of the discussion quickly turned to whether adaptive data analysis is a good idea, both there and here.
To clarify my views a bit more, the idea of adaptive data analysis is important but dangerous. P-hacking is a method of adaptive data analysis, and exactly the type of adaptive data analysis that the authors are attempting to turn from terrible practice into a statistically valid method. Aiming for the stars indeed.
But even though I’m against p-hacking, I’m not uniformly against adaptive data analysis! There are times when there are clear patterns in the data that were not foreseen when drawing up the analysis plans. I don’t think excluding clear patterns just because you didn’t think of them ahead of time is a good idea. The slippery slope is: “is that really a clear and obvious pattern, or just the one I want to see?”
I discuss a case of this on my own blog:
http://cliffstats.weebly.com/almost-surely-degenerate-ramblings/the-slippery-slope-of-the-more-sophisticated-analysis
And in that post, I even support thinking about things in a Bayesian manner (even if formal Bayesian methods are not used) if you want to have any hope of having a reasonable conclusion.
Sometimes it seems like too much attention is paid to statistics and research on this blog, and not enough attention is paid to the cat photos. So I just want to thank you for the excellent choice of photo for this post, which is marvelous.