Erikson Kaszubowski writes:

I have recently read an article by D. R. Cox (from 1977!) on significance testing where he discusses modification of analysis in the light of data. I don’t know if the article is well known, but I didn’t see it in the references of the garden of forking paths article. His argument is that relevant changes call for a separation into exploratory and confirmatory analyses, but in some cases it should not be such a big problem. The first comment, by Prof. Spjotvoll, is not so optimistic: he mentions a social science or genetics researcher actively searching for significant hypotheses and always finding (and publishing) something “interesting” in the statistical sense. In such a setting, no conclusion is possible based on p-values.

I took a look and asked: Does he mention forking paths? I see him mentioning multiple tests on the same dataset, but I think there’s a key concern, not well understood: even if only a single test is done on the data, that test can be contingent on the data, thus invalidating the p-value without any apparent p-hacking or multiple testing.

It’s a “multiple potential comparisons” problem.
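A small simulation (plain standard-library Python, with a made-up setup) makes the point concrete: even though only one test is ever run, letting the data pick which of two null-true comparisons to test inflates the Type I error rate well above the nominal 5%.

```python
import math
import random

def two_sided_p(z):
    # Two-sided p-value for a standard-normal test statistic.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
n_sims = 100_000
rejections = 0
for _ in range(n_sims):
    # Two candidate comparisons; the null is true for both.
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    # The analyst runs only ONE test -- whichever looks more promising.
    z = z1 if abs(z1) > abs(z2) else z2
    if two_sided_p(z) < 0.05:
        rejections += 1

print(rejections / n_sims)  # roughly 0.0975, not the nominal 0.05
```

With two potential comparisons the true rejection rate under the null is 1 − 0.95² ≈ 0.0975, nearly double the nominal level, and no individual analysis ever looks like multiple testing.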

Kaszubowski replies:

In Section 3, “Modification of analysis in the light of data”, D. R. Cox talks exactly about changing the method of analysis in the light of the data as something different from multiple tests on the same dataset. He claims, after enumerating four possible scenarios, that “[…] in extreme cases, choice of a null hypothesis in the light of the data makes irrelevant the hypothetical physical interpretation of the significance level” (p. 56, emphasis mine). This “hypothetical physical interpretation of the significance level” is a really strange way to say that “p_obs is the probability that H0 would be ‘rejected’ when true” (p. 50).

In the first scenario, where the whole formulation comes from exploratory work, Cox mentions that when a specific aspect of a “haphazard dataset” is tested for significance after exploration, it is sometimes possible to calculate an “allowance for selection”, that is, to correct the p-value for the fact that the hypothesis was defined after seeing the data. But “often the sequence of exploratory analysis is ill-defined” (p. 57), which again invalidates the interpretation of the obtained p-value.
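When the exploratory sequence is well-defined, say the reported hypothesis was the most striking of m candidate comparisons, a crude allowance of this sort can be computed with standard multiplicity corrections. A sketch (the m = 20 and p = 0.012 below are illustrative numbers I made up, not from Cox):

```python
# Hypothetical numbers: the reported p-value was the smallest of m = 20
# comparisons the analyst could have made while exploring the data.
p_obs, m = 0.012, 20

# Bonferroni allowance: inflate by the number of potential comparisons.
bonferroni = min(1.0, m * p_obs)     # 0.24

# Sidak allowance: exact if the m comparisons are independent.
sidak = 1 - (1 - p_obs) ** m         # about 0.215

print(bonferroni, sidak)
```

An apparently striking p = 0.012 becomes an unremarkable 0.2-ish once the selection is accounted for; and when the exploratory sequence is ill-defined, m itself is unknown, which is exactly Cox’s problem.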

He doesn’t frame it exactly in terms of “multiple potential comparisons”, but he does make it clear that a p-value obtained from a sequence of exploratory analyses cannot be interpreted in its original intended way.

Indeed. Cox doesn’t emphasize the point but it’s there.

In “Frequentist Statistics as a Theory of Inductive Inference” Cox and I attempt to delineate the kinds of cases where adjustment is needed and when not. http://www.phil.vt.edu/dmayo/personal_website/(2006)%20MayoCox%20Freq%20stats%20as%20a%20theory%20of%20inductive%20%20inference.pdf

That portion of the paper blends Cox’s taxonomies on this matter with my discussions over the years on when non-novel data, double counting and the like do and don’t matter to inference. It turns on altering error probabilities, or not.

I am wary of instantial evidence dressed up as hypothetico-deductive evidence. They are both fine but we have to be careful not to mistake one for the other. When we diverge from a strictly pre-specified evaluation (as we must, almost every time, and as path-forkers do from the very outset), we switch to instantial evidence, though it might not look that way presented with all those p-values and significant / borderline / trend-towards / failed-to-achieve language.

I borrow liberally from Peter Lipton here, who borrowed from Nelson Goodman and others. Seeing some black ravens makes you confident that all ravens are black, while seeing some bearded philosophers would not have the same effect. Why? Context and prior information, anathema to NHST fans. If an instance is more valuable after specifying the hypothesis than before, then it is more valuable to some people than to others (subjectivity, abomination to NHST fans), and worse still, the difference between these groups of people is that one group thought up an idea in the pub while the others didn’t. Seeing some green leaves also backs up the black raven hypothesis, because they are non-black non-ravens, unless you exert some common sense in defining the data to be collected, aka prior knowledge.

So, how I see it is that when we revise the analysis, we get weaker but perhaps more informative evidence, in the sense that really clear information about something irrelevant is not much use at all. So revision is OK, and it’s important to recognise that we almost certainly started down the road of revision even before we collected any data.

I suspect Cox would be quite comfortable with Bayesian adjustments for biases in his ‘allowance for selection’. Shravan Vasishth presented some really nice work in meta-analysis of linguistics studies along these lines at Bayes@Lund a couple of weeks ago, adjusting for all the “monkey business” in published studies (including his own) by eliciting priors for reporting bias and forking.

Robert:

I wrote a bit about Cox’s views on Bayesian statistics in my article, Ethics and the Statistical Use of Prior Information.

Robert:

I discussed Bayesian adjustments for bias with David Cox in 2002/3, specifically for meta-analysis of non-randomized studies based on Sander Greenland’s MBA work (as part of my thesis), and his response was roughly “such an approach would not at all be helpful to decision makers”.

On the other hand, Nelson Goodman’s blue-green emeralds puzzle, which C. S. Peirce apparently resolved on the grounds that “not all regularities require explanation” [the need for abduction in inference, and not being able to start inquiry anywhere but where one finds oneself], could be thought of as just a qualitative prior and hence not relevant here?

Keith: Peirce discusses the kind of situation wherein “predesignation” isn’t necessary for induction (which for him is severe testing), e.g., 2.740. It’s on p. 314 of EGEK (Mayo 1996) in chapter 9: “Hunting and Snooping: Understanding the N-P Predesignationist Stance” http://www.phil.vt.edu/dmayo/personal_website/EGEK%20CH%20NINE.pdf

It’s a big mistake to allege that significance tests preclude background information simply because they don’t insist on quantifying it in terms of prior probabilities of the hypotheses being appraised (unless frequency based). There’s quite a lot of discussion of the use of background in frequentist inference on my blog, for anyone interested.

http://errorstatistics.com/2013/07/23/background-knowledge-not-to-quantify-but-to-avoid-being-misled-by-subjective-beliefs-2/

Several involve direct exchanges with Gelman on this issue:

http://errorstatistics.com/2012/10/05/deconstructing-gelman-part-1-a-bayesian-wants-everybody-else-to-be-a-non-bayesian/

There’s more discussion of this issue in sections 3.7 and 4.7 of Cox and Snell’s ‘Applied Statistics’. I particularly like a quote from p. 39: “The most widespread use of significance tests is in fields where random variation is substantial and where there is appreciable danger of premature claims that effects and relationships have been established from limited data”.

The discussion of “allowance for selection” reminds me of the selective inference work that is coming out of Stanford Statistics right now (where I recently completed my doctorate). http://www.pnas.org/content/112/25/7629.full provides a nice introduction to the ideas and http://arxiv.org/pdf/1410.2597v2.pdf does a good job discussing the theoretical framework. This body of work provides a frequentist approach to resolving the garden of forking paths.

Essentially, the idea is to compute the probability of a value that extreme under the null, conditional on the selection event that led you to test that null. My favorite example is a one-sided test of whether a coin is fair after 10 flips. If it comes up heads 8 times you might test whether the probability of heads is greater than .5, but if it had come up tails more often you would have tested the other direction. Selective inference resolves the issue by looking at the distribution of the number of heads conditional on which direction you decided to test (equivalent to truncating the binomial distribution). In this case, the selective test closely matches the two-sided test (and coincides with it exactly in the symmetric continuous case, where the selection event has probability 1/2). While only some types of selection events can be accommodated, many common selection strategies can be adjusted for using this idea.
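A minimal sketch of that conditional calculation in standard-library Python, using the numbers from the example above (n = 10 flips, k = 8 heads, and a selection event of “more heads than tails”):

```python
from math import comb

n, k = 10, 8                                       # 10 flips, 8 heads observed
pmf = [comb(n, i) / 2**n for i in range(n + 1)]    # Binomial(n, 0.5) pmf

# Naive one-sided p-value, with the direction chosen after seeing the data:
naive_p = sum(pmf[k:])                             # P(X >= 8)

# Classical two-sided p-value:
two_sided_p = sum(pmf[k:]) + sum(pmf[:n - k + 1])  # P(X >= 8) + P(X <= 2)

# Selective inference: condition on the selection event "more heads than
# tails" (X >= 6), i.e. truncate the binomial to the selected direction:
selective_p = sum(pmf[k:]) / sum(pmf[6:])          # P(X >= 8 | X >= 6)

print(naive_p, two_sided_p, selective_p)
```

With these discrete counts the naive p-value is about 0.055, the two-sided is about 0.109, and the conditional one is about 0.145: close to, and here somewhat more conservative than, the two-sided test, because in the discrete case the selection event has probability slightly under 1/2.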