Bill Harris writes:
In “Type M error can explain Weisburd’s Paradox,” you reference Button et al. (2013). While reading that article, I noticed figure 1 and the associated text describing the roughly 50% probability of failing to get a significant result in a replication of the same size as an original study that was just barely significant.
At that point, something clicked: what advice do people give for holdout samples, for those who test that way?
R’s Rattle has a default partition of 70/15/15 (percentages). http://people.duke.edu/~rnau/three.htm recommends at least a 20% holdout — 50% if you have a lot of data.
Seen in the light of Button et al. (2013) and Gelman (2016), I wonder if it’s more appropriate to have a small training sample and a larger test or validation sample. That way, one can explore the data without too much worry, knowing that a fair number of the results could be spurious but trusting the testing or validation stage to catch them. With a 70/15/15 or 80/20 split, you risk wasting test subjects: you find potentially good results and then have a large chance of rejecting them due to sampling error in the small holdout.
I’m not so sure about your intuition. Yes, if you hold out only 20%, you don’t have a lot of data for evaluating your model, and I agree with you that this is bad news. But usually people do 5-fold cross-validation, right? So, yes, you hold out 20% each time, but you do this 5 times, so ultimately your model gets evaluated on all the data.
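To be concrete about what I mean here, a toy sketch of 5-fold cross-validation in R (the data and model are made up for illustration): every observation lands in the held-out fold exactly once, so the evaluation ends up using all the data.

```r
## Toy 5-fold cross-validation on simulated data (illustrative only).
set.seed(123)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 0.3 * dat$x + rnorm(n)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold assignment
cv_mse <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit <- lm(y ~ x, data = train)
  cv_mse[i] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv_mse)  # cross-validated estimate of out-of-sample error
```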
Hmmmm, but I’m clicking on your link and it does seem that people recommend this sort of one-shot validation on a subset (see image above). And this does seem like it would be a problem.
I suppose the most direct way to check this would be to run a big simulation study trying out different proportions for the test/holdout split and seeing what performs best. A lot will depend on how much of the decision making is actually being done at the evaluation-of-the-holdout stage.
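Here’s the rough shape such a simulation could take. Everything in it—the data-generating process, the two candidate models, the rule of picking whichever model has lower holdout error—is just an assumption to make the question concrete, not a claim about the right setup:

```r
## Sketch of a simulation comparing train/holdout split proportions.
## All specifics here are assumptions for illustration.
set.seed(456)

one_run <- function(n, train_frac) {
  # True model: y depends on x1 only; x2 is pure noise.
  dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  dat$y <- 0.3 * dat$x1 + rnorm(n)
  idx <- sample(n, size = round(train_frac * n))
  train <- dat[idx, ]
  test  <- dat[-idx, ]
  # Choose between two candidate models by holdout squared error
  m1 <- lm(y ~ x1, data = train)
  m2 <- lm(y ~ x1 + x2, data = train)
  mse1 <- mean((test$y - predict(m1, newdata = test))^2)
  mse2 <- mean((test$y - predict(m2, newdata = test))^2)
  mse1 <= mse2  # TRUE if the holdout picks the correct (simpler) model
}

n_sims <- 1000
for (frac in c(0.2, 0.5, 0.8)) {
  wins <- replicate(n_sims, one_run(n = 100, train_frac = frac))
  cat(sprintf("training fraction %.1f: correct model chosen %.0f%% of the time\n",
              frac, 100 * mean(wins)))
}
```

The particular outcome measure there is just one possibility; the point is that you can vary the split and see directly how often the decisions made at the holdout stage lead you astray.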
I haven’t thought much about this question. I’m more likely to use leave-one-out cross-validation, since I use such methods not for model selection but for estimating the out-of-sample predictive properties of models I’ve already chosen. But maybe others have thought about this.
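For what it’s worth, the brute-force version of leave-one-out looks like this (again on made-up data). For real models I wouldn’t refit n times—for Bayesian models there are approximations such as the loo package—but the loop shows the idea:

```r
## Brute-force leave-one-out cross-validation (simulated data,
## illustrative only): each point is predicted from a model fit
## to all the other points.
set.seed(789)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 0.3 * dat$x + rnorm(n)

loo_err <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ x, data = dat[-i, ])
  loo_err[i] <- dat$y[i] - predict(fit, newdata = dat[i, ])
}
mean(loo_err^2)  # LOO estimate of out-of-sample predictive error
```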
I’ve felt for a while (see here and here) that users of cross-validation and out-of-sample testing often seem to forget that these methods have sampling variability of their own. The winner of a cross-validation or external validation competition is just the winner for that particular sample.
P.S. My main emotion, though, when receiving the above email was pleasure or perhaps relief to learn that at least one person is occasionally checking to see what’s new on my published articles page! I hadn’t told anyone about this new paper so it seems that he found it there just by browsing. (And actually I’m not sure of the publication status here: the article was solicited by the journal but then there’ve been some difficulties, we’ve brought in a coauthor . . . who knows what’ll happen. It turns out I really like the article, even though I only wrote it as a response to a request and I’d never heard of Weisburd’s paradox before, but if the Journal of Quantitative Criminology decides it’s too hot for them, I don’t know where I could possibly send it. This happens sometimes in statistics, that an effort in some very specific research area or sub-literature has some interesting general implications. But I can’t see a journal outside of criminology really knowing what to do with this one.)