If you are interested in a philosophical/historical take on combination, there is this http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

I really liked your article! Indeed, the point about scientific inference being about information COMBINATION is the critical shift in perspective that we need today. It implies that decision criteria should be eschewed, and that multiple evidentiary measures (and original data) should be reported, so that results can be more easily combined.

I’m really chuffed that there’s so much activity from so many great thinkers on this issue now!

(Would you mind writing this as a comment to my post as well, for that readership, or may I copy it there? Thanks!)

You write, “Nothing magical happens when going from a p-value of 0.051 to one of 0.049.” But it’s much worse than that! Actually, nothing magical happens when going from a p-value of 0.005 (z=2.8) to one of 0.2 (z=1.3): the difference between these z-values is a mere 1.5, which is nothing remarkable at all, given that the difference between two independent random variables, each with standard error 1, will have standard error 1.4. This was the point that Hal Stern and I made in our paper, The difference between “significant” and “not significant” is not itself statistically significant. The problem is not just with an arbitrary threshold; it’s the more fundamental issue that the p-value is a very noisy statistic. Practitioners seem to think that p=0.005 is very strong evidence while p=0.2 is no evidence at all, yet it’s no surprise at all to see both of these from two different measures of the very same effect.
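The arithmetic here is easy to check for yourself; a quick sketch using scipy (the code is mine, purely illustrative; the comment above gives only the numbers):

```python
# Reproducing the Gelman & Stern arithmetic with scipy.
from scipy import stats
import math

# Two-sided p-values for the two z-scores mentioned.
p_strong = 2 * stats.norm.sf(2.8)    # ~0.005
p_weak = 2 * stats.norm.sf(1.3)      # ~0.19

# Each estimate has standard error 1, so their difference has standard
# error sqrt(2) ~ 1.4, and the z-score for the difference is unimpressive.
z_diff = (2.8 - 1.3) / math.sqrt(2)  # ~1.06
p_diff = 2 * stats.norm.sf(z_diff)   # ~0.29: the difference is itself far from "significant"
```

The point pops right out: the comparison between a “highly significant” and a “non-significant” result is itself nowhere near significant.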

The lexicographic gatekeeping function (or as I call it “decision criterion science”) is pernicious in just so many ways, and prevents a clear understanding of the true evidentiary value of each given result, and how it must be considered in combination with others.

+1

I think this is often the case, but I’m not sure it’s the crux of the problem. Even applied to relatively simple systems, significance testing combined with noisy measurements and small effects and publication bias and flexibility in analyses and hypothesizing will lead to the same problems – people using patterns in noise to fool themselves and others into thinking they’ve discovered underlying truths.

I do agree that significance testing can be especially dangerous in complex environments insofar as it leads people to think in simple terms and downplay uncertainty.

Can any of us imagine what someone like Richard Feynman would say if we talked about 95% confidence intervals being meaningful in any way?

Interesting – wonder what the percentages of those who adequately care about logic are in various disciplines.

(And of course it depends on what you mean by logic – http://andrewgelman.com/2017/09/27/value-set-act-represent-possibly-act-upon-aesthetics-ethics-logic/ )

Very off topic but Cox’s theorem doesn’t generalise predicate logic, it ‘generalises’ simple _propositional_ logic. Extending probabilistic logics to interact well with quantifiers is to some extent an unsettled question.

Thank you for your clarification of the Cox theorem.

“The question really is whether it has anything to do with data collection in scientific enterprises.” That is THE question. I agree, of course, with the frequentist interpretation of probability and recognize that the p-value may be useful, but only in specific situations, for instance to check the properties of an estimator, a situation that pertains to the statistician, not the lay user of t-tests and chi-squares.

But, while I understand and see the logic in the computation of a p-value in the framework of NHT, I do not understand the usefulness of “data more distant from the null than the observed data” in decision problems. This is the specific context I wanted to talk about. From a practical (at least medical) point of view, I don’t see how they could be justified (hence my remark on the likelihood principle). It is as if a physician considered a body temperature of 40° when in fact the patient is at 39°. Non-observed data are not relevant in the process. Do you have a real, practical example of a situation in which using a p-value would be really useful?

A ban on the p-value may be too strong an answer, but I stick to the point that the message to the community of users must be very clear. Maybe the real or major problem is the fact that the majority of consumers and producers of statistics are not statisticians themselves. As I always say to my colleagues: I do not do surgery in my kitchen as an amateur, so please do not do statistics on your own. They will go on using a recipe, and the recipe must be as clear as possible.

No, this is not true, the Cox theorem says that the only generalization of predicate logic that agrees with binary logic in the limit is probability calculus.

The frequency interpretation of probability is a totally valid mathematical construct. It corresponds to the behavior of certain types of infinite sequences of numbers. The question really is whether it has anything to do with data collection in scientific enterprises.

p values do exactly one thing, they tell you how probable a certain range of test statistics is if the data comes out of an algorithmically random mathematical sequence. There is nothing wrong with this logic. What’s wrong is *applying this logic where it’s inappropriate*.

although I agree with your diagnosis of what is wrong with people who use p values and how they cut first and ask questions later… I think it goes too far to say that we should somehow “ban” the p value. What we need to ban is science done by people who don’t care about logic.

I read your paper and this post with great interest. It’s brilliant and thought-provoking, as usual.

But, nevertheless, I must say that I am a bit flabbergasted by one point: while you remind us of the various flaws of the p-value, your proposals for the future nevertheless still include p-values in the general process of statistical learning and decision. In my view, this may considerably blur the message.

Maybe I am totally mistaken and the p-value does have some interest, but which one? I really don’t understand, and if it has an interest, what is wrong in all the papers that for so long have exposed its flaws?

My understanding of the issue is that the problem is not (only) in the .05 threshold used, or whatever value it may take, but in the definition and nature of the p-value itself, and its native defects: violation of the likelihood principle, lack of incorporation of prior information, etc. Moreover, the Cox theorem indicates that the only valid interpretation of probability must be based on the Bayesian paradigm. Then, be they merged or not into a more global view of the statistical analysis of a study, p-values still have those defects. What is the use of a defective tool, and thus what is the use of including it in the statistical reflection?

So, I fear that keeping the p-value, even when incorporated into a more flexible and fluid reflection on the data and after getting rid of any magic threshold, will not be accompanied by better reflection. I work as a biostatistician in a French medical university and I can assure you that almost all my clinician colleagues cut information first (more or less than 0.05) and then (sometimes) discuss the methodology of the study, while we should do the reverse and publish all results, be they positive or negative. They have been wrongly, empirically, but heavily taught by the literature for so long to cut anything that comes up at the 5% level. As long as there is something to cut, they will cut it first, and reflect afterwards. It will take decades to get rid of p-values if physicians remain “authorized” to use them, even without thresholds and even alongside other statistical tools.

The only way out of this mess is to completely get rid of p-values. There will shortly be the ASA meeting on “the world beyond p < 0.05.” I think that a very clear message must be given, otherwise it will be difficult to advocate a p-value-less world. Letting p-values make their way into a paper will only muddle the study’s message. My colleagues judge almost solely by the p-value, and habits are such that if p-values continue to appear in the results, they will still be the only piece of evidence relied upon. I would bet on that: I am certain to win. The reproducibility issues and the selection of only positive results will then go on.

Evidence

Likelihood

And things could always be worse:

5%

0.05

To p<0.05, or not to p<0.05?

https://errorstatistics.com/2017/08/10/thieme-on-the-theme-of-lowering-p-value-thresholds-for-slate/

She does state that the classical “error probabilistic” method is “the most natural way in which humans regularly reason in the face of error prone inferences”, and that “you could say that when people assign the probability to the result they are only reflecting an intrinsic use of probability as attaching to the method”. This was in response to my claim that when people interpret the p-value as “the probability the results occurred due to chance”, they are mistakenly interpreting the p-value as the probability of the null. I take her response to mean that she thinks people who interpret the p-value as “the probability the results occurred due to chance” are aware of the fact that the probability statement refers to the error rate of the procedure and not the null itself.

The part about “fixed truth” was embellishment on my part.

https://errorstatistics.com/2017/08/10/thieme-on-the-theme-of-lowering-p-value-thresholds-for-slate/

Modus Tollens applies to one-off situations. If we have “If A, then B,” then a single observation of not-B suffices to conclude not-A. In addition to this not working in general when probability is added to the mix, this (appropriate) mention of reliable production of particular results makes it clear that we’re not talking about one-off situations. Rather, we’re talking about exactly the kind of thing that McShane et al discuss (among many others, including Mayo in other places), namely how design, measurement, and statistical tools relate to each other, and how these jointly license inferences.

This quote also makes it look an awful lot like Mayo is granting that a single statistically significant result carries little, if any, evidential weight, since by itself it doesn’t indicate that we can, in fact, reliably produce a particular effect. This bolsters the argument from McShane et al that statistical significance should not be used as a publication filter.

I agree; I didn’t mean to imply otherwise. I don’t like dichotomization of evidence at all. I tend to take the stance that embracing uncertainty is useful, and that if a decision is to be made, it’s with respect to some posterior uncertainty (or that + a utility function). I’d rather describe a posterior distribution, or at the least some estimate + some error, and either let the reader decide for themselves or formalize a decision in the context of theory or prediction. E.g., we have had posteriors on interactions that were essentially N(0,.1), and I more or less said “if there’s an effect, we can’t tell whether it’s + or -, but regardless it appears to be tiny and negligible.”

But with that said, IF one makes a decision about some substantive hypothesis, I just meant that with the logic of probabilistic modus tollens, posterior quantities or basically non-NHST derived quantities can tell you the same thing. IF you’re going to make a decision about some substantive hypothesis, you can do so via the posterior. “If H0, then negative; effect probably positive, therefore not H0.”
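A hypothetical sketch of that last line, deciding straight from the posterior with no NHST machinery (the posterior numbers are invented for illustration, not from any study discussed here):

```python
# Decision about a directional hypothesis made directly from a posterior.
# The Normal(0.4, 0.15) posterior is a made-up example.
from scipy import stats

# Suppose the posterior for the effect theta is approximately Normal(0.4, 0.15).
posterior = stats.norm(loc=0.4, scale=0.15)

# "If H0, then theta negative; theta probably positive, therefore not H0."
p_positive = posterior.sf(0)          # P(theta > 0 | data), about 0.996
decide_against_h0 = p_positive > 0.95  # decision threshold is also illustrative
```

The same tail-probability logic works for any posterior summary; nothing about it requires a test statistic or a sampling distribution under a null.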

I posted this elsewhere recently, which is relevant to this whole discussion.

I think one can say that the VAST majority of nil-nulls are false, to the point where nil-nulls should be the exception, not the rule.

And exactly: If nil-nulls are false, then there really isn’t a point in using a nil-null hypothesis test.

True nulls can exist, in very very few cases. E.g., when a physical law is in question. Like… ESP would actually require some rethinking about foundational physical laws if it were true; given our physical constraints, there truly can be a ZERO effect, as in it truly cannot occur.

But in the social sciences, we rarely have these sorts of questions, because most of our questions don’t require rewriting a constraint due to a physical law. Otherwise, everything is connected in some way, and out to SOME decimal place, there is an effect of everything on everything else.

The only realistic cases in psych where some statistic may be truly zero is only when the measurement sucks too badly to even detect a difference in the population. Like, imagine on some latent continuum, the statistic is .000001, but our measure only permits .1 differences at its smallest. Then even if we had every single person in the population, we would see zero difference, but only because our measure sucks, not because there’s literally zero difference. In this case, conditional on our measure the true effect could be “zero”, in the sense that asymptotically and with everyone in the population, the true value is zero; but that is not true unconditional of our measure. With a more precise measure, the true value would NOT be zero, but perhaps .000001.
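A toy version of this measurement-resolution point (all numbers are invented): a latent difference of .000001 observed through an instrument whose coarsest step is 0.1 is statistically invisible even with a million observations per group.

```python
# Toy simulation: a true latent difference of 1e-6, observed through an
# instrument that only resolves steps of 0.1. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 1e-6
n = 1_000_000  # "every single person in the population", loosely

latent_a = rng.normal(0.0, 1.0, size=n)
latent_b = rng.normal(true_effect, 1.0, size=n)

# The instrument rounds everything to its coarsest step of 0.1.
obs_a = np.round(latent_a, 1)
obs_b = np.round(latent_b, 1)

# Conditional on the measure, the groups are statistically indistinguishable,
# even though the latent difference is not literally zero.
diff = obs_b.mean() - obs_a.mean()
```

The observed difference is swamped by sampling noise at this precision, which is the sense in which the effect is “zero” conditional on the measure but not unconditionally.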

“‘If H0, then theta probably negative; theta probably not negative, therefore probably not H0’; and that’s a posterior statement regardless of NHST or significance testing.”

Yes – and I think the generalization of modus tollens being used in practice is more like “if A, then probably not B; B, therefore probably not A”:

If H0, t(x) probably not greater than t*; t(x) greater than t*, therefore probably not H0.

Now, I know Mayo and other proponents of significance testing don’t actually use this reasoning. They know that, in classical statistics, we don’t get to say “probably not H0”; we can only say that we reject or fail to reject H0 according to a procedure that has a given error rate, and the strength of our inference rests upon the error rate of the procedure.

But I strongly suspect most significance tests are performed and interpreted by people who have “p<0.05, therefore probably not H0" in their heads. And I think this is because the frequency interpretation of probability is not intuitive, except in the canned examples taught in intro stats classes. I think that most people treat probability as quantifying uncertainty, and so probabilities can apply to truth statements, e.g. H0. And since probabilities can apply to truth statements, the p-value can (and does!) tell us the probability of H0. I brought this up once on Mayo's blog and my interpretation of her response is that she doesn't agree that scientists typically think this way – I think she sees the natural way of applying probabilistic thinking to scientific questions as "there is some fixed truth out there in the world, and probabilistic statements cannot be made about this truth, they can only be made about data gathered by a random procedure." And so this whole brouhaha about people misinterpreting p-values and misinterpreting the results of significance tests comes down to the critics of significance testing unfairly assuming that users and consumers of significance tests don't really understand how they work. Hopefully I'm not misrepresenting her.

Needless to say, I don't think that (most) users and consumers of significance tests really understand how they work.
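One way to see how far “p < 0.05” can be from “probably not H0” is a small simulation. The mixture of true nulls and real effects, and the effect size, are my assumptions, purely for illustration:

```python
# Simulate many studies: half have a truly null effect, half a modest real
# effect (1 standard error). Look at what fraction of "significant" results
# came from true nulls.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies = 200_000
is_null = rng.random(n_studies) < 0.5
theta = np.where(is_null, 0.0, 1.0)   # true effect, in standard-error units

z = rng.normal(theta, 1.0)            # each study's observed z-statistic
p = 2 * stats.norm.sf(np.abs(z))      # two-sided p-value

significant = p < 0.05
# Among "significant" results, the fraction that are actually true nulls
# comes out around 20-25% here -- far above the nominal 5%.
frac_null_among_sig = is_null[significant].mean()
```

Under these (made-up but not outlandish) assumptions, a just-significant result corresponds to something like a one-in-four chance of a true null, not one-in-twenty, which is exactly the gap between the p-value and the probability of H0.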

What is the null model for ‘rainy day’?

Do you include whether or not sprinklers are nearby?

If I spill a glass of water on the grass does that count as wet enough?

These might seem like trolling questions but this sort of thing can matter in the high noise, highly variable and imprecisely measured situations we’re typically talking about.

Even in super precise contexts we need more than basic logic – that’s why we have mathematics! Most of the philosophy of science I’ve read seems to have a real obsession with starting from things like ‘H: Newton’s theory’. I find it hard to take anything that starts from that point too seriously.

In my view scientific theories are not (or almost never) simple Boolean propositions and hence don’t have simple truth values as such. Some people seem to freak out at this and think everything is about to go postmodern but I think that’s more like the argument that you can’t be good without god than anything too serious.

Yes, whether it rains or not is a discrete outcome, and for this, discrete models make sense. Similarly, I think discrete models make sense for some questions of the form, is Trait A associated with Gene B. But I don’t think discrete models make sense for questions of the form, is embodied cognition a real thing.

(a) If it rains the grass will probably be wet;

(b) The numbers of non-wet and wet grass observations in my sample would rarely be seen after a rainy day;

(c) Therefore, either it probably didn’t rain or rain probably doesn’t make the grass wet

isn’t Modus tollens. But it does sound a lot like the problem we’ve been discussing.

You quote Mayo as saying, “With several tests: Prob(test T < do; Ho) = very high, yet reliably produce d ≥do, thus not-Ho." That's all fine, but in just about every problem I've seen, I know ahead of time that Ho (the null hypothesis of zero effect and zero systematic error) is false. So inferring "not-Ho" doesn't really do anything for me. As I've also said many times in various blog comments and elsewhere, I do think there are problems where Ho could be approximately true, in examples such as genetics or disease classification, examples of discrete models. But I don't think this reasoning applies in settings such as "the effects of early childhood intervention" or "the effects of power pose" or "different political attitudes at different times of the month" or "the effects of shark attacks on voting." These are settings where we have every reason to believe that there are real effects, but these effects will be highly variable, unpredictable, and difficult to measure accurately. Rejecting Ho tells me nothing in these settings.

A quibble: even in the probabilistic version the minor premise is just “not B” — the data are whatever is observed, by definition. That is, the p-value version of modus tollens goes,

Major premise: If H_0 holds then the probability of observing a realized value of X less than x is some value p close to 1.

Minor premise: X = x.

—

Therefore, H_0 is false or else something improbable has happened.

“To claim we shouldn’t test for the (statistical) significance of observed differences is tantamount to recommending we ban Modes Tollens [sic].”

I replied:

“We don’t have modus tollens. How does that exist with probabilistic statements?”

Mayo:

“it’s stat MT practically all we ever have: With several tests: Prob(test T < do; Ho) = very high, yet reliably produce d ≥do, thus not-Ho."

Me:

"But with that logic, all sorts of inferential methods could be used. If H0, then BF 0|y) = .99; therefore not H0.”

Am I wrong here?

Essentially nil-null hypothesis testing isn’t an expression of modus tollens, because modus tollens doesn’t include a probabilistic statement. “If A, then B. Not B, therefore not A” doesn’t really work when you have “If A, then probably B; probably not B, therefore probably not A”.

But if you generalized modus tollens to just be, generally, “If A, then probably B; probably not B, therefore probably not A”, then lots of inferential frameworks would permit modus tollens anyway. “If H0, then theta probably negative; theta probably not negative, therefore probably not H0”; and that’s a posterior statement regardless of NHST or significance testing.

1) I can’t fathom why some still defend having testifying (as opposed to consulting) experts hired by the opposing parties, which easily creates situations in which the majority of unbiased expert opinions get excluded because they don’t fall clearly enough on either side. The goal as I see it should be to de-bias expert testimony as much as possible and get a fair representation of what is out in the field, given that the triers of fact will not have the expertise to do so very effectively. Your Hand story seems to support my view of this problem. And court experts could help address the problem of absurd claims getting filed: the availability of court experts early on should help get more of these meritless cases thrown out early, and could discourage such filings if the plaintiff has to pay the cost of this process when complete lack of scientific merit is determined (and make such a determination more sound than a court could achieve). Each party might recommend experts to the court, but again all testifying experts need to be screened by all parties for COIs and other bias sources (such as connections to the parties).

2) Regarding Hill: Like P-values, his list takes a lot of misplaced blame for abuse in the hands of incompetent or biased users. To be sure the list is far from perfect (Mod Epi goes through the list critically) but like a chainsaw it can be used constructively (to remind one of relevant items to check) or for butchery (if taken as a set of necessary conditions, as I often see defense do). There have been many proposals to update it but none I’ve seen result in dramatic shifts, and in my view only support the idea that it was quite a nice summary for 1965 when relative risks of 10 were being contested by industries. This says to me such updates as needed are for extension to our current, far less certain RR of 2 or less era, not for philosophic subtleties. And that means modern causal analysis tools will enter into its application.

3) There is a fundamental asymmetry I’ve observed between defense and plaintiff lawyers in torts: Plaintiff lawyers search for cases they can win, which for the best means at the outset they hire consulting experts to judge whether the science could actually support causation, before entering the arduous and expensive litigation process. Defense is more reactive, having to defend in response to plaintiff claims and build a case against causation regardless of what the actual science shows (even if only to minimize final judgments or drag on the case to bankrupt plaintiffs, as I’ve seen happen). This asymmetry is incomplete: again, there are plenty of absurd claims filed but again the availability of experts for the court should help get more of these thrown out early, and discourage meritless filings if the plaintiff has to pay the cost of this process when that happens.

4) Given the inertia that has greeted proposals and mechanisms for court experts, we need effective if less ideal solutions to the current mess. The Reference Manual on Scientific Evidence is one and my impression is it has helped quite a bit, albeit in a limited sphere. I think it needs updating with coverage of cognitive biases, with guidelines to help courts detect not only bias in expert opinions, but also to help de-bias their own judgments.

That said, I was puzzled by your comment about helping with the Reference Manual on Scientific Evidence, insofar as I’m not one known to be shy about sharing my opinions and criticisms (Andrew even blogged humorously about that very fact) – so if they ask they’ll get plenty. Having been however at the 2003 San Diego conference on science and the law that fed into the latest edition and not contacted since, I won’t hold my breath for that.

Statistics should be about grasping the real uncertainties from what we observe, and you seem to be requesting some certainty about that uncertainty (e.g. uniform physical rules, natural laws, something about general principles?), which very likely is just not possible.

Unlike commonly accepted accounting practices, or Newtonian physics that is adequate for its purpose, etc., I do not think such things are in the foreseeable future for statistical practice. But as I said earlier, “That is what I see is being worked through currently in the statistical discipline,” so I might be wrong.

On the other hand, science as hard as it is, does not need to meet the additional requirements of the courts – so I think it is best to start there. If we can sort out what constitutes minimally sound statistical inference in science then maybe that can be upgraded to meet those additional requirements. And maybe not.

However, the idea of a court-appointed expert, perhaps to arbitrate between the competing experts, seems a far more promising route. I think you have already convincingly pointed out that whatever sensible stuff a statistician might write, it will be naively or purposely twisted into something very different that will do a lot of harm!

I mean, Fisher was far from perfect, but don’t forget Fisher’s major contributions to the modern synthesis of genetics and natural selection using…mathematical modelling!

He has at least one PDE (in various forms) named after him: https://en.wikipedia.org/wiki/Fisher%27s_equation

Perhaps it’s this business of thinking about variable phenomena in a language more suited to discussing gravity, the charge on an electron, etc. that’s at the root of the problem. Maybe when Fisher read Student’s paper and saw how well his empirical test of 750 randomized N=4 finger length samples fit his newly discovered curve he thought “well, it ain’t E=Mc^2 but there is an underlying law here!” Whereupon he immediately fell into reification’s trap because he had no solid theory/philosophy of modeling. I’m hoping Andrew’s new book will devote a page or two (maybe it’s already in the old book but I’m not buying it to find out since the iDataAnalysis Model X is rumored to be coming out soon and I’m cheap) to the fundamental issue of “how (to paraphrase Student) may a practical man reach a definite conclusion from all this modeling you’re going on about, Andrew?”

the experiment was designed and controlled sufficiently well to rule out any plausible cause of a signal other than a gravity wave.

They definitely do put a lot of effort into that, to the point of publishing entire separate papers about it (which is good). But shouldn’t each “ruling out” of each alternative have some kind of probability associated with it? Then plug all that into Bayes’ rule and say “assuming nothing we didn’t check for is going on, here is the probability it is a GW.”

Clearly they do not rely *only* on predictions of GR regarding inspiraling black holes, or they wouldn’t do this type of “reject background” analysis. Many other things must cause such signals that correlate across sites in their equipment as well, right?

What I inartfully attempted to convey was frustration with (a) what Daniel wrote; the attempt by courts to turn your measures of a model’s uncertainty into certainty about a competing (and usually untested) conjecture; and, (b) the increasingly common claim sometimes made by people citing Rothman/Greenland (especially when we get to the execrable – The Court: “Counsel, let’s now go through the Bradford Hill causal criteria analysis” – phase of the proceedings) that either methods don’t matter or that no method is any more likely to ferret out false claims than any other. The result in at least two cases have been courts deciding that causation can be discovered and justice thereby done merely from a credentialed scientist’s assessment of biologic plausibility alone.

What I hoped to get from you assembled scientists was (recalling the witch in Monty Python’s The Holy Grail) agreement as to whether there exists some minimal level of testing of hypothetical statistical claims made against you that would, if the test(s) were passed, cause you to say “it’s a fair cop”. As for the other points you raised:

Paragraph 3 (yours): Since the ASA statement on p-values was published there have been more than 500 published legal opinions and orders that contain “statistically significant”, “confidence interval”, “statistical power”, etc. Children are being taken from their parents, prisoners executed, fortunes won or lost all on the basis of reasoning like this from a recent federal appellate court opinion: “The p-value quantifies the statistical significance of a relationship; the smaller the p-value the greater the likelihood that associations determined in a study do not result from chance.” I’m not saying you’re responsible for this mess. I’m saying you have the street cred to help clean it up. To that end, if you’re asked to help on the next version of the Reference Manual on Scientific Evidence please consider it.

Paragraph 4 (yours): Responsibility in the context of duty (along with the other aspects of wise judgments) was supposed to be parked in the definition of “legal causation” (which was grounded in public policy theory not too distant from that of public health) but courts began to ignore it 40+ years ago when they became convinced that but-for causation was all that was needed and NHST could find it. Thus the spectacle of the administration of justice via NHST. Thus, I’m in complete agreement.

Finally, as to your fifth paragraph, (because witches), you’ll be pleased to know that the famous jurist Learned Hand made your point in a law review article 116 years ago. In his survey of the use of expert witnesses at trials he covered a number of instances of their use before juries including in “The Witches Case” 6 Howell, State Trials, 697 – “Dr. Brown, of Norwich, was desired to state his opinion of the accused persons, and he was clearly of opinion that they were witches, and he elaborated his opinion by a scientific explanation of the fits to which they were subject.” Hand was troubled by the uses to which “science” had been put but more importantly troubled by the role of experts.

Expert witnesses were supposed to testify to “uniform physical rules, natural laws, or general principles, which the jury must apply to the facts.” That meant that the jury was largely displaced from its usual role of supplying the major premise (drawn from common sense and common experience) to which the admitted testimony of witnesses was applied, but only if they believed the expert witness. That in turn drew the focus of the lawyers to the credibility/likeability of the expert rather than the soundness of the claim he/she was making. This is how we wound up with (true story) a highly credentialed, bright, quick and deadly expert witness being paid $2 million per year NOT to testify against a certain party and group of lawyers. Anyway, here’s what Hand wrote in 1901 about paid experts:

“The expert becomes a hired champion of one side… Enough has been said elsewhere as to the natural bias of one called in such matters to represent a single side and liberally paid to defend it. Human nature is too weak for that; I can only appeal to my learned brethren of the long robe to answer candidly how often they look impartially at the law of a case they have become thoroughly interested in, and what kind of experts they think they would make, as to foreign law, in their own cases…. It is obvious that my path has led to a board of experts or a single expert, not called by either side, who shall advise the jury of the general propositions applicable to the case which lie within his province.” http://www.jstor.org/stable/pdf/1322532.pdf?refreqid=excelsior%3Ae67b3a059b1c64f4d63f7b6fbbc5018b

Take nobody’s word for it (don’t rely on argument-from-consensus/authority heuristics), which entails replicating each other’s work and requiring each other to make accurate and precise predictions about observations in the future. Those are the basic requirements of science; without them you have no functioning scientific community.

2) I certainly agree about the issues with political science and econ; as I’ve said before ( https://medium.com/@davidmanheim/the-good-the-bad-and-the-appropriately-under-powered-82c335652930 ) there are questions that can’t be answered with better statistical tools, because the samples involved are limited. That’s a fundamental limitation for questions where new samples cannot be generated. I just had a closely related discussion regarding a re-analysis of data about wars: http://oefresearch.org/blog/debate-continues-peace-numbers – they essentially used p-values to conclude that the evidence they re-analyzed doesn’t reject the null hypothesis that there is no change, thereby “arguing with” Pinker. This was based on previously collected data, and the method of considering p-values for a new, more complex model to ask whether the evidence supports a previous claim seems even more conceptually broken than the usual use of p-values.

3) Agreed.

4) I’m not saying it’s actually addressed to journal editors. I managed not to say it explicitly, but I was contrasting this with the Lakens paper (which I was Nth author on, and contributed a very little bit towards), which was addressed much more to researchers. The idea that the choice of alpha should be justified is very closely related to your points about how to choose what to research – but there, we had even more (implicit) focus on how to choose your sample size and study design. The idea that one should abandon NHST has implications for study design, but you focused more closely on what to study based on interpreting previous research, and on how to analyze the data. That said, obviously, this is all conceptually related; designing good studies to maximize posterior surprise requires both interpreting previous results to choose what to study and implicitly choosing something like an SDT loss function to figure out what to look at, and how hard. What it shouldn’t involve, but currently does, is figuring out what sample size will get you p<0.05.

]]>Both pipelines identified GW170814 with a Hanford-Livingston network SNR of 15, with ranking statistic values from the two pipelines corresponding to a false-alarm rate of 1 in 140 000 years in one search [39, 40] and 1 in 27 000 years in the other search [41–45, 57], clearly identifying GW170814 as a GW signal. The difference in significance is due to the different techniques used to rank candidate events and measure the noise background in these searches, however both report a highly significant event.

https://dcc.ligo.org/LIGO-P170814/public/main

Here they come right out and say, “We rejected the background model with p-value = 1/140,000. This number is very small, therefore this event was a GW signal (from two black holes of specific size, etc., colliding).” Given that this is one of the highest-profile science projects of the last few years (if not the highest), I don’t think any abandonment is coming soon.
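As an aside, a back-of-the-envelope sketch of what that “1 in 140,000 years” figure actually is: it’s a false-alarm *rate*, not a probability, so turning it into a p-value requires an observing time (the 0.3 years below is a made-up illustration, not a number from the paper). A minimal Python sketch:

```python
from math import exp
from statistics import NormalDist

def far_to_p_value(far_per_year: float, obs_years: float) -> float:
    """Poisson probability of seeing at least one background event this
    loud, given a false-alarm rate and a total observing time."""
    return 1.0 - exp(-far_per_year * obs_years)

def p_to_sigma(p: float) -> float:
    """Two-sided Gaussian 'sigma' equivalent of a p-value."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

# Hypothetical numbers: 1 background event per 140,000 years,
# with an assumed 0.3 years of coincident observing time.
p = far_to_p_value(1.0 / 140_000, 0.3)
sigma = p_to_sigma(p)
```

Even granting the conversion, the resulting p-value only quantifies surprise under the background model; it is not the probability that the event was a GW signal, which is the inversion the quoted paraphrase slides into.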

]]>Regarding your second paragraph: I think philosophical arguments have their place, as there are many ways to try to understand the world. I have spent most of my career keeping quiet about foundations and just doing stuff, and writing books demonstrating how to do stuff. When writing Bayesian Data Analysis, and then again while writing Data Analysis Using Regression and Multilevel/Hierarchical Models, I consciously avoided arguments about what methods are better or worse, instead focusing on good practice and the derivations of such methods. Had I spent the past thirty years doing nothing but screaming about hypothesis testing, I think I’d be a lot less happy and the world would be a poorer place.

Here’s what happened: I’ve had problems with null hypothesis testing for a long time, but for a long time my way of expressing this view was simply to not use such methods except in the rare cases where they seemed appropriate to me. I also explained this position in some theoretical articles, such as my 1995 paper with Rubin on avoiding model selection, my 2000 paper with Tuerlinckx on type M and S errors, my 2003 paper on exploratory data analysis and goodness-of-fit testing, and my 2011 paper with Hill and Masanao on multiple comparisons.

Incidentally, one of the cases where I *did* find some version of null hypothesis significance testing to be useful was in exploring problems with a model used for medical imaging analysis. I felt I got something from measuring the distance of the data to a model that I’d been using. This was in my Ph.D. thesis and it motivated my later work on posterior predictive checking, which started with a focus on Bayesian p-values but has since moved toward graphical visualizations.

Anyway, my crusade against NHST in recent years came about by accident, as I happened to encounter various bad published papers, starting with those of Satoshi Kanazawa and continuing with the well-publicized work of Daryl Bem and all the rest that we’ve been hearing so much about. I started to realize that the problems with all this work were not just a bunch of individual data-processing mistakes of the Wansink variety, but a larger problem: when classical “p less than 0.05” methods are used in an attempt to extract signal from extremely noisy data, what gets extracted is something close to pure noise. I got involved in some particular disputes, and then it seemed to make sense to think harder about the general issues. Working on all of this has deepened my own understanding of these problems, and I feel that my work in this area has been a contribution, I hope in some part by motivating researchers to think more carefully about data quality rather than thinking that, just because they have a randomized experiment or a regression discontinuity or whatever, they can just push the buttons, grab statistically significant comparisons, and claim victory.
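The “close to pure noise” claim is easy to check by simulation. A minimal sketch of the significance filter (the effect size and noise level below are invented for illustration): with a tiny true effect and noisy measurements, the estimates that survive p < 0.05 are wildly exaggerated and frequently have the wrong sign, i.e., large type M and type S errors:

```python
import random
from statistics import NormalDist

random.seed(0)

true_effect, se = 0.1, 1.0            # tiny true effect, very noisy estimate
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96

# Simulate many studies and keep only the "statistically significant" ones.
significant = []
for _ in range(50_000):
    est = random.gauss(true_effect, se)   # one study's effect estimate
    if abs(est / se) > z_crit:            # passes the p < 0.05 filter
        significant.append(est)

# Type M error: average magnitude of surviving estimates vs. the truth.
exaggeration = (sum(abs(e) for e in significant) / len(significant)) / true_effect
# Type S error: fraction of surviving estimates with the wrong sign.
wrong_sign = sum(e < 0 for e in significant) / len(significant)
```

Under these assumed numbers, the surviving estimates overstate the true effect by an order of magnitude and a substantial fraction point in the wrong direction, which is exactly the sense in which the published record from such studies is mostly noise.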

“Circular philosophical arguments” have necessarily played some role in these discussions, in part because classical NHST methods *are* supported by some theory. The math isn’t wrong, but the assumptions don’t really apply—at least, not in many of the sorts of application areas where I see those methods being used—and so it’s kinda necessary to explain *why* the assumptions don’t apply: to give a sense that, yes, in some settings NHST can be theoretically supported, just not in these sorts of problems.

* I.e., those where no intercurrent events mess up your randomized comparison and make the interpretation harder. The problems are things like when a patient dies in a randomized clinical trial in which you really wanted to compare blood pressure at the end of the trial between treatment groups.

]]>The Science article on reproducibility encouraged that kind of thinking (by saying that the lower the p-value in the original study, the more reproducible the result tended to be), and now I see more and more psychologists writing the above statement as if it were a fact.

]]>