We’ve spent a lot of time during the past few years discussing the difficulty of interpreting “p less than .05” results from noisy studies. Standard practice is to just take the point estimate and confidence interval, but this is in general wrong in that it overestimates effect size (type M error) and can get the direction wrong (type S error).
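To make the type M and type S idea concrete, here is a minimal simulation sketch. The true effect (0.1) and standard error (0.2) are made-up numbers chosen only to represent a noisy study, and the function name `retrodesign_sim` is hypothetical; the point is just to look at what the estimates look like after conditioning on p < .05.

```python
import random
from statistics import NormalDist


def retrodesign_sim(true_effect, se, n_sims=100_000, alpha=0.05, seed=1):
    """Simulate replications of a study with a known true effect and standard error.

    Returns (power, type_s_rate, exaggeration_ratio). The last two are computed
    only over the replications that reach p < alpha.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # one noisy, unbiased estimate
        if abs(estimate) > z_crit * se:        # "statistically significant"
            significant.append(estimate)
    if not significant:
        raise ValueError("no significant replications; increase n_sims")
    power = len(significant) / n_sims
    # Type S: significant estimates with the wrong sign.
    type_s = sum(1 for est in significant if est * true_effect < 0) / len(significant)
    # Type M: average magnitude of significant estimates relative to the truth.
    exaggeration = (sum(abs(est) for est in significant) / len(significant)) / abs(true_effect)
    return power, type_s, exaggeration


# Assumed (illustrative) numbers: a small true effect measured with a lot of noise.
power, type_s, exaggeration = retrodesign_sim(true_effect=0.1, se=0.2)
print(f"power={power:.3f}  type S rate={type_s:.3f}  exaggeration ratio={exaggeration:.1f}")
```

Any estimate that clears the significance bar in this setup has to be at least 1.96 standard errors from zero, which is why the exaggeration ratio comes out so large when the true effect is small relative to the noise.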

So what about noisy studies where the p-value is more than .05, that is, where the confidence interval includes zero? Standard practice here is to just declare this as a null effect, but of course that’s not right either, as the estimate of 0 is surely a negatively biased estimate of the magnitude of the effect. When the confidence interval includes 0, we can typically say that the data are consistent with no effect. But that doesn’t mean the true effect is zero; it could mean that we should bring more data to bear on the question.

Jeff points us to a recent example, presented in this letter from Elizabeth Hatch, Lauren Wise, and Kenneth Rothman:

I’m not so sure. I agree that the journal should not describe this result as “indicating no effect,” but it’s not quite clear what should be said instead, beyond simply reporting the confidence interval and letting readers go from there.

The journal’s phrase is “the absence of a clear benefit.” It would be more accurate to say “absence of clear evidence of a net benefit,” but that’s a pretty minor difference.

Given that the confidence interval *does* overlap with zero, I’m not sure why Hatch et al. are so bothered by the journal’s characterization of the results. I’m guessing that they (Hatch et al.) have other evidence, not from that published study, which they feel supports the idea that this treatment should have a large effect. That is, they have prior information. But, as I said, I can’t be sure, because I don’t know this area at all.

Speaking in the abstract, I agree with Hatch et al. that 30% is not zero; on the other hand, I wouldn’t expect to see an effect as large as 30% in a replication.

Given the information above, the best estimate of the effect in the general population is somewhere between 0 and 30%. And the data are also consistent with a zero or even a negative effect (in the parameterization of the letter above, a hazard ratio of 1 or higher).

Looking forward, it seems to me that the next step is to explicitly include more information in the decision process. I agree with the journal that the confidence interval of [0.47, 1.02] indicates enough uncertainty that, given the data in that experiment alone, we shouldn’t act as if we’re sure that the treatment has a net benefit—and we certainly shouldn’t go around acting as if the effect is 30%. But the topic is important, so let’s bring more information to the table. The problem, as I see it, is not that the journal made any mistakes in conveying the evidence; rather, the problem is with the attitude that a single noisy study should be considered as dispositive.
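As a rough sketch of what "bringing more information to the table" might look like numerically, the code below backs out a standard error from the reported interval [0.47, 1.02] on the log scale and combines the hazard ratio of 0.70 quoted in this thread with a normal prior centered at no effect. The prior standard deviations are pure assumptions for illustration, not values from the study or the letter, and `posterior_log_hr` is a hypothetical helper name.

```python
import math

# Figures quoted in this thread: hazard ratio 0.70, 95% interval [0.47, 1.02].
hr, lo, hi = 0.70, 0.47, 1.02
log_hr = math.log(hr)
# Normal approximation on the log scale: the interval spans 2 * 1.96 standard errors.
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)


def posterior_log_hr(prior_sd, prior_mean=0.0):
    """Normal-normal update of the log hazard ratio (precision-weighted average)."""
    w_data = 1 / se**2
    w_prior = 1 / prior_sd**2
    post_mean = (w_data * log_hr + w_prior * prior_mean) / (w_data + w_prior)
    post_sd = math.sqrt(1 / (w_data + w_prior))
    return post_mean, post_sd


# Assumed priors: how large we think log hazard ratios in this area tend to be.
for prior_sd in (0.1, 0.2, 0.5):
    m, s = posterior_log_hr(prior_sd)
    reduction = 100 * (1 - math.exp(m))
    print(f"prior sd {prior_sd}: point estimate of risk reduction = {reduction:.0f}%")
```

Whatever prior scale is chosen, a prior centered at zero pulls the point estimate to somewhere between 0 and the raw 30%, with the amount of shrinkage depending on how much prior information there is.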

To put it another way: Had the original paper reported an effect size estimate of 30% with p=.04, I’d be skeptical: I’d say that I’d guess the 30% was an overestimate and that we should be aware that treatment effects can vary. With p=.06, I’m just very slightly more skeptical (or maybe less skeptical; see discussion here). I don’t think the experiment should be thrown away, but I see where the journal is coming from, in emphasizing that there’s no clear evidence from these data alone.

**P.S.** I don’t agree with Hatch et al. when they write, “Given the expected bias toward a null result that comes from non-adherence coupled with an intent-to-treat analysis . . .” Yes, such estimates are biased toward zero, but that makes a true population effect size of 30% even less plausible, and it remains that, with the confidence interval as reported, such a difference could have been explained by chance alone. Having a more biased or noisier estimate does not in any way increase the strength of the results. Eric Loken and I discuss this issue in our post on the “What does not kill my statistical significance makes it stronger” fallacy.

I guess that this study could be used to estimate the size of the effect of vitamin and calcium supplements on cancer. Using an equivalence test might be a way to get some information out of this study.

Our findings indicate that the effect of vitamin and calcium supplements is probably no larger than 25% (or however much the equivalence test gives you).

At least this is informative for adequately powering future studies.

JJ

I’m not familiar with Type M errors, but is there a typo here:

“overestimates effect size (type M error) and can get the direction wrong (type M error).”

http://andrewgelman.com/2004/12/29/type_1_type_2_t/

Holy moly, that was twelve and a half years ago.

Corey:

Here’s the original paper on Type M and Type S errors, from 17 years ago!

You reasonably state that at p=0.04 you are skeptical and at p=0.06 you are “very slightly more skeptical”. But the problem is the Journal’s view would jump off the cliff from “no effect” to “effect” over the same p-value difference. And that is the problem that Hatch and others are rightly highlighting, whether they mean to or not.

Howard:

It’s even worse. Just about everybody seems to think that a p-value of .01 is much different from a p-value of .10, but this difference is all in the noise too, as Hal Stern and I discussed in our paper.
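Here is the arithmetic behind that claim as a small sketch. It assumes two independent estimates with equal standard errors and effects in the same direction, which is a simplification made only for illustration.

```python
import math
from statistics import NormalDist

norm = NormalDist()


def z_from_p(p):
    """z-score corresponding to a two-sided p-value."""
    return norm.inv_cdf(1 - p / 2)


z_a = z_from_p(0.01)  # the "clearly significant" study
z_b = z_from_p(0.10)  # the "not significant" study

# Difference of two independent estimates with equal standard errors:
# the standard error of the difference is sqrt(2) times the common one.
z_diff = (z_a - z_b) / math.sqrt(2)
p_diff = 2 * (1 - norm.cdf(abs(z_diff)))
print(f"z for the difference = {z_diff:.2f}, two-sided p = {p_diff:.2f}")
```

Under these assumptions the difference between the two results is well under one standard error, so the comparison between them is itself nowhere near statistical significance.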

Somewhat related, James Heathers had an interesting Twitter poll:

https://twitter.com/jamesheathers/status/859284639600570368

This mistake occurs when you design a study to look for “an effect” (ie studies designed for NHST rather than scientific purposes). They need some kind of timeseries and/or relationship between a few variables so there is the opportunity to come up with a process that could lead to a pattern like the one observed. I assure you that all studies “looking for an effect” like this will be totally ignored in a few hundred years, they have no role in accumulation of knowledge.

“I assure you that all studies “looking for an effect” like this will be totally ignored in a few hundred years, they have no role in accumulation of knowledge.”

I hope you’re right, but am not convinced that you are.

“To put it another way: Had the original paper reported an effect size estimate of 30% with p=.04, I’d be skeptical: I’d say that I’d guess the 30% was an overestimate and that we should be aware that treatment effects can vary. With p=.06, I’m just very slightly more skeptical. I don’t think the experiment should be thrown away, but I see where the journal is coming from, in emphasizing that there’s no clear evidence from these data alone.”

Or to put it a third way: the threshold of 0.05 shouldn’t be treated as a threshold but as a point along a continuum, and there is not a shred of scientific evidence anywhere that its conventional status serves any good social purpose.

Or to put it a fourth way – we don’t actually know the p value, but only an estimate of it. Since p-values are distributed uniformly, they have large variances. Therefore *even if* we were willing to use a p-value threshold in principle, we shouldn’t get excited by a difference of p=0.04 vs p=0.06 because the estimate of p is just too noisy.

How do we only have an estimate of the p value? Don’t we know it precisely for the measured effect size under the null?

Never mind… I take it back since you’re measuring the standard error. I was thinking of the binary case in which the effect size implies the standard error.

It’s more general than that. In most experiments, you only know the sample data, not the population data. But p-values come from the population distribution (or perhaps hypothetical population data), and you only have an estimate of that.

The p value is the probability that something more extreme than the observed data would come out of a particular random number generator. The only valid interpretation of your statement is that “which particular random number generator we ‘should’ compare to” is not known.

If there is a well accepted null, then you *do* have the *precise* p value. However, if the “null” is really kind of a meaningless thing and you might choose various “nulls” then, p values are actually pretty meaningless in that context anyway. You shouldn’t calculate them.

The p-value is a property of the data (and the assumed sampling distribution if the null hypothesis were true). If you had a different set of data, you would get a different p-value. But these are not estimates of a “true p-value”, such a thing doesn’t exist.
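To see how much the p-value moves from one dataset to the next, here is a small simulation sketch. It assumes a true effect exactly two standard errors from zero (an arbitrary choice) and a normally distributed estimate; the function name `replicate_p_values` is hypothetical.

```python
import random
from statistics import NormalDist, quantiles

norm = NormalDist()


def replicate_p_values(true_z, n_reps=10_000, seed=2):
    """Two-sided p-values from repeated studies of the same true effect."""
    rng = random.Random(seed)
    ps = []
    for _ in range(n_reps):
        z = rng.gauss(true_z, 1.0)  # observed z-score in one replication
        ps.append(2 * (1 - norm.cdf(abs(z))))
    return ps


ps = replicate_p_values(true_z=2.0)
deciles = quantiles(ps, n=10)  # nine cut points: 10th, 20th, ..., 90th percentile
q10, q50, q90 = deciles[0], deciles[4], deciles[8]
print(f"10th pct p = {q10:.4f}, median p = {q50:.3f}, 90th pct p = {q90:.2f}")
```

None of these is an estimate of some underlying "true" p-value; each one is just the p-value for its own dataset, and the spread across replications is wide enough that p=0.04 versus p=0.06 carries very little information.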

In many situations there’s no real “precise” null we can all agree on for comparison purposes. For example, null = normal(0, sd) which sd should we choose? how about student_t(n,0,w) where n is dof and w is width, which n and w best represents what we think of as “null”? Perhaps the quantity is positive and we should use a gamma(a,b) distribution with mean 1… but what shape parameter?

when no precise null makes sense, even calculating the p value is kind of pointless.

“In many situations there’s no real “precise” null we can all agree on for comparison purposes. For example, null = normal(0, sd) which sd should we choose?”

Huh? I don’t understand this statement at all; most tests are set up that under the null hypothesis, t (test statistic) is asymptotically distributed as N(0,1) (or maybe the test is set up such that the test statistics is Chi-squared, but you get what I mean).

There’s no question about “what the null should look like”; if you didn’t know what the distribution of the test statistic was under the null, you’d have no test!

Unless you’re arguing about the accuracy of asymptotical approximations? In that case, this is easily checked using simulation, and typically is not much of an issue at all.
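A sketch of the kind of simulation check being described: generate data for which the null is true, compute the test statistic, and see how often it exceeds the asymptotic N(0,1) cutoff. The skewed (exponential) data and the sample sizes are arbitrary choices for illustration.

```python
import math
import random


def rejection_rate(n, n_sims=4_000, z_crit=1.96, seed=3):
    """Share of null datasets where |t| exceeds the asymptotic N(0,1) cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Skewed data whose true mean is exactly 1, so the null "mean = 1" holds.
        sample = [rng.expovariate(1.0) for _ in range(n)]
        m = sum(sample) / n
        var = sum((x - m) ** 2 for x in sample) / (n - 1)
        t = (m - 1.0) / math.sqrt(var / n)
        if abs(t) > z_crit:
            rejections += 1
    return rejections / n_sims


# Nominal rate is 0.05; compare it with what the simulation gives at each n.
rates = {n: rejection_rate(n) for n in (5, 30, 200)}
for n, rate in rates.items():
    print(f"n = {n:3d}: rejection rate under the null = {rate:.3f}")
```

This only checks the approximation for one fully specified data-generating process; it says nothing about whether that process is the right reference for the scientific question.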

> "[don't] know what the distribution of the test statistic was under the null, you’d have no test!"

I don't think there ever was a non-randomized study where that was known (some try to estimate it).

Ethical randomized studies allow variation that ruins, to some degree, the default distribution of the test statistic under the null – how does one fix that?

This leaves simulation studies without programming errors…

Many roads become smooth in asymptopia, but not those having bumps and curves from systematic errors and mis-specification.

Not to be rude, but do you mind pointing to any area of research that describes what you are talking about?

Confounders are confounders, which seems to be your first point. Of course we can talk long about that.

Your second point about randomized studies seems (to me) to be about things that get wrapped up in the error term. Or are there more specific issues you are referring to? I don’t expect you to write out the full explanation for what you mean, but I’d like to read a discussion about what you’re referring to; I’m sure it’s more nuanced than my interpretation of what you’ve stated.

“most tests are set up…”

Yes, fine this is an assumption, but it isn’t an assumption that every person involved has to accept. Suppose for example that you and I go out and collect 3 count em 3 samples of mud from some lake, and measure the cadmium content. You now have a null hypothesis that the lake contains no cadmium and there is pure measurement error in the cadmium measurement. Given some description of the measurement apparatus you can then give a p value for this null. But do I have to think that this is the null to be concerned about? No.

Another person can come along and say, “I’ve measured cadmium in 100 lakes in this state and I’ve found that cadmium content of a sample has a power law distribution with lots of near zero measurements, but long tails as most of the cadmium comes from small regions of each lake. My null hypothesis of interest is that this lake has the same distribution as previously seen in the other 100 lakes”… and gets a p value for *this* null hypothesis.

which null hypothesis should we be concerned with? The one that says *these samples have nothing but measurement error* or the one that says *samples from lakes in this region are distributed so that the mean cadmium content of the lake is virtually unrelated to the mean cadmium content of any typical small set of samples*

It’s not that the null hypothesis isn’t well defined after you precisely define it mathematically, it’s that there’s not necessarily a reason to think that whatever your definition is has any bearing on any real-world meaningful question.

Again, I’m having trouble with your point.

You’ve described two very different hypotheses to test. You’re absolutely correct that a lot of thought should go into which hypothesis is interesting to look at.

But how would this be in any way related to the distribution of the p-value of each of these different hypotheses?

My point is that there are plenty of cases where no clear null hypothesis with real world relevance is even well defined. It’s not that you can’t define a test, but more that there are thousands of plausible tests, why should anyone care about the particular one you chose to run.

Okay…but that has nothing to do with p-values. You can easily take out the random nature, and say “why should anyone care about any given fact?”.

Always a valid question, but orthogonal to the topic of the distribution of the test statistic under the null.

This is why Meehl’s “Omniscient Jones” argument is so genius. Forget all the math, someone with super-knowledge tells you A has a non-zero correlation with B. Who cares? If you collect trillions of data points and find no correlation that would be interesting. Other than that we should assume everything is correlated with everything else to some extent. NHST starts with the opposite principle, that it would be somehow surprising if any two things were correlated at all… no it isn’t.

Cliff, the thing is that p values by themselves are unassailable mathematical facts: “If you generated random numbers using my chosen RNG program “NullHypothesis(i)” you would rarely see a dataset stranger than the data D[i] as measured by test statistic t(Data), (p = 0.0122)”

but the relevance of this mathematical fact to science is an entirely different question, and yet statistical textbooks and stats 101, 102, 201, 202, 301, 302 and graduate level biostats and etc etc all basically assert the idea that these numbers are of essential relevance, that they form the structural basis for the application of scientific reasoning in the presence of uncertainty. It’s this incorrect philosophy that is at issue, not the mathematical fact that certain RNGs have certain pushforward measures on observed statistics of datasets.

The confusion between “this is a mathematical fact” and “this is a scientific fact” are at the heart of everything that is wrong with current practice.

The default acceptance that there exists a well defined “Null Hypothesis” against which every “Real Hypothesis” can be compared, and that the rejection of the Null therefore strongly suggests the “Real Hypothesis” is true… that’s a deep deep problem for many fields.

Acknowledging that there are really a bazillion possible meaningless hypotheses you could choose to compare your data to is another way of putting the “Garden Of Forking Paths” concept. And, it’s essential to understanding why you should stop calculating p values and start hypothesizing models of how your measurements *did* come about.

So, I’m simultaneously disagreeing with the mathematical statement “we don’t know the p value” and agreeing with the logical statement “there is no s̶p̶o̶o̶n̶ meaningful p value”. I think Tom Passin is incorrect in saying that we only have an estimate of the p value, but a bigger point is correct, the meaningfulness of the p value is strongly at question. And that goes even for the simple case of something like t distributed data with unknown degrees of freedom. if you want to test a null of your data arising as

student_t_rng(dof,0,scale)

Then you need to explain why it is you chose whatever particular dof you used, and that will either look like having a prior over dof, or look like “not knowing the true p value” (because there is one for every dof you could choose) and either way it comes down to the deeper question “why do you care about this number?”
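A Monte Carlo sketch of the "one p value for every dof" point. The helper `student_t_draw` is a hypothetical plain-Python stand-in for something like `student_t_rng`, with location 0 and the scale fixed at 1; the observed statistic of 2.5 is made up.

```python
import math
import random


def student_t_draw(rng, dof):
    """One Student-t draw: standard normal over sqrt(chi-square / dof)."""
    chi2 = rng.gammavariate(dof / 2, 2.0)  # chi-square with `dof` degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)


def tail_prob(observed, dof, n_draws=50_000, seed=4):
    """Monte Carlo two-sided p-value for `observed` under a Student-t null."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_draws)
        if abs(student_t_draw(rng, dof)) >= abs(observed)
    )
    return hits / n_draws


observed = 2.5  # assumed test statistic, for illustration only
p_by_dof = {dof: tail_prob(observed, dof) for dof in (1, 3, 10, 30)}
for dof, p in p_by_dof.items():
    print(f"dof = {dof:2d}: p is roughly {p:.3f}")
```

The same observed number lands on either side of .05 depending on which dof goes into the null, so the choice of dof has to be argued for rather than treated as given.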

…but none of your arguments are specific to p-values. I can readily apply the same reasoning to posterior probabilities…or just plain facts that have no uncertainty whatsoever!

+1 to Daniel’s latest post.

For those who found it TLDR: Just focus on the middle part:

“The confusion between “this is a mathematical fact” and “this is a scientific fact” are at the heart of everything that is wrong with current practice.

The default acceptance that there exists a well defined “Null Hypothesis” against which every “Real Hypothesis” can be compared, and that the rejection of the Null therefore strongly suggests the “Real Hypothesis” is true… that’s a deep deep problem for many fields.

Acknowledging that there are really a bazillion possible meaningless hypotheses you could choose to compare your data to is another way of putting the “Garden Of Forking Paths” concept. And, it’s essential to understanding why you should stop calculating p values and start hypothesizing models of how your measurements *did* come about.”

Cliff AB wrote:

Yep, the p-value is just the current arbitrary meaningless pedantic calculation, you can substitute any other such calculation and get the same results (as long as it will yield “success” infrequently enough to seem like an achievement). The problem is the choice of null hypothesis and designing studies focused on testing that choice.

Anon:

Yes, the original sin here is the attempt to transmute uncertainty into certainty.

I agree with Carlos. This was a difficulty I had with this paper, for instance: http://fooledbyrandomness.com/pvalues.pdf

I attended many talks about oncology themed trials during my forty year career. When the results did not meet expectations, the most common response was for the investigators to talk about subgroup analysis. You see when you carefully select the subjects, you can get impressive results without any fancy math.

And as soon as they start the subgroup analyses, the power takes a big hit. I wonder how many of the impressive results are down to Type S and Type M error.
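A quick sketch of the size of that hit, using a normal approximation. It assumes the true effect in the subgroup is the same as in the full sample and that the standard error scales with one over the square root of the sample size; the effect of 2.8 standard errors is chosen only so that the full study starts at about 80% power.

```python
import math
from statistics import NormalDist

norm = NormalDist()


def power(true_effect, se, alpha=0.05):
    """Probability that a two-sided z-test reaches p < alpha."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)


full_se = 1.0
true_effect = 2.8  # in units of the full-sample standard error (assumed)

power_by_fraction = {}
for fraction in (1.0, 0.5, 0.25):
    sub_se = full_se / math.sqrt(fraction)  # fewer subjects, larger standard error
    power_by_fraction[fraction] = power(true_effect, sub_se)
    print(f"{fraction:.0%} of the sample: power = {power_by_fraction[fraction]:.2f}")
```

With power that low in a subgroup, any subgroup result that does reach significance is exactly the situation in which type M and type S errors become likely.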

Jonathan argues the threshold of 0.05 shouldn’t be treated as a threshold but here it may well be a threshold for selectivity.

That is, more than .05 – less selection of what gets past.

So 30% with p=.04, I’d be skeptical but p=.06 maybe I should not be as skeptical.

(Though perhaps it’s: less than .05, only large estimates; more than .10, anything; and more than .05 but less than .10 depends on the authors’ experience/motivations)

Spent a lot of time thinking about this http://journals.sagepub.com/doi/abs/10.1111/j.1467-9280.1994.tb00281.x a couple years ago motivated by the view that confidence intervals are too difficult for most scientists.

Now, I did not think about how selective reporting issues would affect that – those _uncapitalized_ type s and m errors.

I looked at the linked article, Keith, which I like a lot, but I’m not sure I get your point. Why would you be *more* skeptical of 30% with p=.04 than you would be with p=.06? Or maybe I should ask: skeptical of what? I agree with Andrew — your skepticism that you have a substantial effect rises with the p value unambiguously (holding the effect size constant.) Why wouldn’t it? What Andrew is saying (correct me if I’m wrong) is that “statistically significant” results anywhere near the boundary suffer from the Type M and Type S errors of the significance filter (I agree) but that “statistically insignificant” results just on the other side of the border suffer from the exact same problems, only slightly more so.

> the significance filter

With the brightline .05 the significance filter defines (along with the sample size) a palpable reference class of estimates – most of which will be overestimates. We know journals select on it, authors exert all steam ahead to get published with it and many folks only pay attention if so.

With something above .05 it’s muddier (yes you could define a reference class that way but why?) so I am thinking more thought about bias correction would be better (despite how tricky that might be https://www.ncbi.nlm.nih.gov/pubmed/12933636 )

Jonathan:

Keith’s point, I think, is that when you see p=.04 it’s natural to suspect a data-driven analysis—p-hacking or the garden of forking paths—which will bias the estimate upward. If you see p=.06 this looks more like a direct analysis of the data with no selection bias.
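A toy simulation of why a reported p just under .05 invites that suspicion. It assumes there is no true effect at all and that each analysis path yields an independent z-score, which is a simplification (real forking paths are correlated); `share_significant` is a hypothetical name.

```python
import random
from statistics import NormalDist

norm = NormalDist()


def share_significant(n_paths, n_sims=20_000, seed=5):
    """Share of null studies whose best of `n_paths` analyses reaches p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Each analysis path gets an independent null z-score (a simplification).
        best_p = min(
            2 * (1 - norm.cdf(abs(rng.gauss(0.0, 1.0)))) for _ in range(n_paths)
        )
        if best_p < 0.05:
            hits += 1
    return hits / n_sims


results = {k: share_significant(k) for k in (1, 5, 20)}
for k, share in results.items():
    print(f"{k:2d} analysis paths: {share:.2f} of null studies report p < .05")
```

With a single pre-specified analysis the rate stays near the nominal 5%, which is the sense in which a p=.06 from a direct analysis can carry less selection than a p=.04 of unknown provenance.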

Keith’s point is paradoxical at first, but maybe he’s right.

Now I get it, but that seems dramatically unfair — It would mean that whenever I get a p=0.0492 (and we’ve gotten that before: http://www.stat.columbia.edu/~gelman/research/unpublished/notrump_falk_gelman_icml.pdf sorry couldn’t help myself in the current circumstances) I should cheat a little to get p OVER 0.05 to keep suspicious minds at bay!

In academic publication it is spy versus spy – reputation is all that matters :-(

More (or less?) seriously – selection bias is subtle.

For instance, I believe I was unable to convince David F Andrews (U of T) that meta-analysis was a good idea because I think he thought if you only analysed one study on its own, you avoided publication bias, whereas if you dealt with all published studies together you could not avoid it (or fix it).

Years later in a departmental seminar, I argued that selection bias was attached to an individual study unless you could argue it could never have happened (not that it just did not happen) and collecting the studies together did not cause the bias but rather just allowed an opportunity to assess it and (hopelessly try?) to adjust for it. Afterwards David told me the seminar had addressed a major concern he had with meta-analysis.

This is where pre-registration helps, un-tracked changes in outcomes etc. can’t happen.

Now if the journal reviewed the methodology and agreed to publish the paper no matter what – then the achieved p-value would not signal anything about selective reporting because it can’t happen. Also if some researchers are well enough funded, knowledgeable and committed so that their studies will almost always be published, the p-value would signal little about selection – to those with insider information about the group.

Yes dramatically unfair!

Just a small gloss on that — A former colleague argued that the reason 0.05 is used as a filter is that a result that can’t be p-hacked to get below 0.05 shows that the effect can’t possibly be there! So he acknowledges that it inflates results, but at least it filters out results that even determined p-hacking can’t reach.

Sigh…

Unless the authors were motivated to debunk the effect of Vitamin D and believed that by getting .06 they had done it.

Having done some work producing evidence summaries for clinical guidelines I would definitely say that the difference between writing “the absence of a clear benefit” and “absence of clear evidence of a net benefit” is not minor. If this was presented to a guideline committee as the only relevant study for the question it may well lead to a statement that vit D does not reduce risk and a recommendation against supplementation for cancer prevention. On the other hand, if it is presented as weak evidence of net benefit with some kind of (wide) interval summarizing the uncertainty, then that would strongly encourage the kind of decision analysis that you would advocate. Ultimately, the opposite conclusion and recommendation may or may not be reached.

Assuming that there is always some sort of synthesis with other evidence and prior beliefs, the reporting of the study abstract is unimportant. However, given that many decision makers may not do this it is worth making the correct statements in research abstracts (and attempting to correct the record when this is not the case).

You and others write about skepticism with respect to such p-values, but would it be better to switch more completely to an estimation framework? That is, would it be more informative to show a graph of the probability of getting at least various effect sizes?
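One crude way to tabulate something like that, sketched under strong assumptions: treat the log hazard ratio as normal with the standard error implied by the interval [0.47, 1.02] quoted in this thread, center it at the quoted hazard ratio of 0.70, and use a flat prior. The flat prior is an assumption made here for simplicity, and it will tend to overstate the chance of large effects compared with an informative prior.

```python
import math
from statistics import NormalDist

# Figures quoted in this thread; normal approximation on the log scale.
log_hr = math.log(0.70)
se = (math.log(1.02) - math.log(0.47)) / (2 * 1.96)
approx_posterior = NormalDist(mu=log_hr, sigma=se)  # flat-prior assumption


def prob_reduction_at_least(pct):
    """P(hazard ratio <= 1 - pct/100) under the flat-prior normal approximation."""
    return approx_posterior.cdf(math.log(1 - pct / 100))


for pct in (0, 10, 20, 30, 40, 50):
    print(f"P(risk reduction >= {pct:2d}%) = {prob_reduction_at_least(pct):.2f}")
```

The table conveys the same information as the interval, just read off as a curve instead of being reduced to whether or not it crosses 1.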

From Kenneth Rothman’s “Six Persistent Research Misconceptions” (reference 3 in the tweet): “It is unfortunate that a confidence interval, from which both an estimate of effect size and its measurement precision can be drawn, is typically used merely to judge whether it contains the null value or not, thus converting it to a significance test.”

Andrew, Nice post of great general interest. Could you flesh out your statement that

“Given the information above, the best estimate of the effect in the general population is somewhere between 0 and 30%.”

The reported confidence interval on the hazard ratio is [0.47, 1.02]. So, why do you think the truth is likely to be found only in the right half of this interval (.7 to 1.0)? More generally, how asymmetrically do you think we should interpret confidence intervals?

Jrg:

I’m not saying that I think the true population effect *is* between 0 and 30%. I’m saying that, if you have to give a point estimate based on these data, that the point estimate should be somewhere between 0 and 30%. But a point estimate is not so useful here.

I don’t understand the PS. “that makes a true population effect size of 30% even less plausible”. What do you understand by “true population effect”? Do you mean that in the broad population compliance will be even lower and given that many people will not bother taking the pills the actual effect would be lower than in the study?

One could argue that the “true effect” of the supplements is the effect when the patients do indeed take their vitamins/calcium. Of course they may have a relevant reason to stop the treatment (e.g., adverse effects) but the ITT analysis already accounts for that in a conservative way if I understand her point.

Carlos:

By “true population effect,” I mean the expected difference that we would see if the experiment were applied to the general population, which in this case would be all of the sort of women who would satisfy the entry criteria for the study. The point of my paper with Loken is not about intent-to-treat or anything like that; rather, it’s a general issue that when noise is added to a study (for example, from noncompliance), this increases standard errors and thus increases the sense in which a statistically significant estimate (or, in this case, a nearly statistically significant estimate) gives an overestimate of the magnitude of the effect size.

Thanks. I thought that you said that, conditional on their reported results (HR: 0.70, 95%CI: 0.47-1.02), the non-adherence/ITT-analysis issue made the 0.7 hazard ratio even less plausible.

Medication studies generally report results on two sub-groups of recruited subjects, Intent-to-Treat and Per-Protocol. The letter authors point out that this confidence interval is from Intent-to-Treat (ITT) data. The Intent-to-Treat principle assures that medical research reflects the reality of clinical processes: people quit, have to be removed, don’t follow instructions, follow wrong instructions given in error, etc. So they want to say that without the Per-Protocol results, we can’t say much about the impact of the drug. However, the Per-Protocol group is the same size or smaller, so the wider confidence interval will most likely block out any gains from greater efficacy in the sample.

Kenneth Rothman has long advocated confidence intervals and disparaged hypothesis testing in epidemiological research. But Bill Harris points out that he has little success. And clinical research is all about using the right words and the footnotes, as Ewan points out.

The thing that irks me with this that I’ve come across in my own work is when I’m doing multiple tests of near-equivalent hypotheses, just with different data sources or measures. Say I am testing the same theoretical point 4 times with 4 different data sources/measures: Study 1 p = .001 (boom! great result!), Study 2 p = .09 (failed replication!), Study 3 p = .11 (another failure!), Study 4 p = .03 (it replicated!). The p < .05 framework would ask me to report these as mixed support for the hypothesis. Of course, any reasonable person should see that if the p values are consistently low, even if missing the threshold, this indicates a pattern consistent with the effect existing.

“The p < .05 framework would ask me to report these as mixed support for the hypothesis."

Really? I don't think anyone with a proper understanding of p-values would ever ask you to report as such.

Have you actually had feedback from reviewers saying something like "please state you had 2 successful replications and 2 failures to replicate" in the above scenario?

Isn’t it much more likely they will tell you to “leave out” the non-replications, or alternatively that your negative results are unworthy of publication at that journal.

I’m sure that’s happened before, but I would be surprised if that’s common, and certainly not “much more likely”.

Moreover, I’ve had plenty of cases where reviewers with limited statistical background give bad advice on how to handle the statistical analysis…but that’s why you’re allowed to respond to the reviewers. I have not had a problem with issuing a well written response to a poor suggestion.

Finally, I think the proper way to handle the scenario in question is a meta-analysis. And those are very popular these days (even if they aren’t executed perfectly every time).
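As a rough sketch of pooling rather than vote counting, the code below combines the four p-values from the example above (.001, .09, .11, .03) with Stouffer's method. This is a swapped-in shortcut, not a full meta-analysis: it assumes the p-values are two-sided, the four studies are independent, all four estimates point in the same direction, and each study gets equal weight. A real meta-analysis would pool the estimates and standard errors themselves.

```python
import math
from statistics import NormalDist

norm = NormalDist()

# The four p-values from the scenario described above.
p_values = [0.001, 0.09, 0.11, 0.03]

# Convert each two-sided p-value to a z-score (same sign assumed for all four).
z_scores = [norm.inv_cdf(1 - p / 2) for p in p_values]

# Stouffer's method with equal weights: sum of z-scores over sqrt(number of studies).
combined_z = sum(z_scores) / math.sqrt(len(z_scores))
combined_p = 2 * (1 - norm.cdf(combined_z))
print(f"combined z = {combined_z:.2f}, combined two-sided p = {combined_p:.1e}")
```

Read together, the four results are far more consistent with each other than a tally of "two successes, two failures" would suggest.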

Agree, the proper way but there is some explanation here of how some useful ideas from meta-analysis can be poorly understood and mis-executed into a report of mixed support for the hypothesis (e.g. qualitative tally or vote counting). Greenland S, O’ Rourke K: Meta-Analysis. Page 652 in Modern Epidemiology, 3rd ed. Edited by Rothman KJ, Greenland S, Lash T. Lippincott Williams and Wilkins; 2008.

I’d say these results indicate a pattern consistent with 1) an effect existing, and 2) low power. I would hope that any reasonable person would ask to see the associated confidence intervals / standard errors in addition to the p-values!

Similarly skeptical of their statistical reasons for wanting a near-significant result to be viewed as better, but I think the point in this specific case is that there’s value in considering the costs of treatment and in thinking more carefully about the actual problem domain. There’s not really a plausible reason why this intervention would increase the risk of cancer, as far as I know – we just don’t have compelling evidence that it decreases it. But since supplements are basically harmless and relatively affordable, there’s probably value in noting that the benefits of treatment, while unclear, likely outweigh the costs of treatment. It at least lets us stick the treatment on a rough cost-benefit continuum of cancer-risk reduction.

In a similar vein, the conclusions of this very recent paper surprised me – every single confidence interval of incidence rate ratios in their intention-to-treat analysis of a cluster RCT included unity, yet they claimed that meaningful treatment effects were observed, e.g. from the abstract “..clear reductions were evident in the intervention arm for concussion incidence (RR=0.71, 0.48 to 1.05)”

See: http://bjsm.bmj.com/content/early/2017/05/08/bjsports-2016-097434

Exploring this further, they claim to be using “magnitude based inference”, which, as far as I can see, is only used within sports medicine, and seems to be a more permissive form of NHST – there is a commentary on the method with some responses here: https://www.ncbi.nlm.nih.gov/pubmed/25051387