memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze.

The “effect” may be due to the treatment making the mice more/less hungry, faster/slower, stressed out by being put in the maze, or aggressive (thus influencing how much food they get, or how the handler deals with them), or to it altering their sense of smell/sight/hearing, etc. It may have nothing to do with memory, or memory may be only part of the story.

To deal with this you need to eventually come up with a prediction of what the results should look like if one explanation is correct vs the others. Just knowing there is “an effect” is not helpful. I literally do not care if you detect an effect or not (I always assume there is one); this type of data/analysis gives me nothing to work with.

]]>“Currently both are sitting in file drawers, so you consider this an improvement. Is that right?” – Indeed that is right! Consider the proportion of all published studies that are actually true (Ioannidis, 2005). Including more “negative results” will up this proportion. If this policy provides extra incentive for researchers to use larger sample sizes, this proportion is increased further. While the improvement may still be somewhat small, these are achievable, impactful gains. Expecting a journal to publish all null results is a lot to ask; I think those file drawers are pretty damn full! So why not start with the “high-quality null results”?

Example:

“BioLab X is testing a new drug in mice to see if it may cure Alzheimer’s. To determine memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze. Due to operational/ethical/budgetary reasons (let’s be honest, these are expensive mice!), the sample size is restricted to 32 per arm. This sample size provides ~88% power to detect a large effect (d=0.8) and ~50% power to detect a medium effect (d=0.5).

Additional clinical research would only be justified if the true effect were substantial (i.e. d>0.5). As such, determining with certainty that d is less than 0.5 would be an important finding. The equivalence margin is therefore [-0.5, 0.5]. Under the assumption that the true effect size is 0, this sample size provides a ~52% chance of reaching a “negative result”, a ~43% chance of reaching an “inconclusive result”, and a ~5% chance of a (false) “positive result” (alpha_1=0.05, alpha_2=0.10).”
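The figures in this example can be roughly checked with a few lines of Python. This is only a sketch under a normal approximation: it treats the estimated d as Gaussian with standard error sqrt(2/n), so its output lands near, not exactly on, the t-based numbers quoted above. The function name `three_outcome_probs` and its parameters are made up for illustration.

```python
from statistics import NormalDist


def three_outcome_probs(d_true, n_per_arm, margin=0.5, alpha1=0.05, alpha2=0.10):
    """Approximate P(positive), P(negative), P(inconclusive).

    Assumption: the estimated standardized effect is normal with
    standard error sqrt(2 / n_per_arm) (a large-sample approximation).
    """
    std_normal = NormalDist()
    se = (2 / n_per_arm) ** 0.5
    z1 = std_normal.inv_cdf(1 - alpha1 / 2)  # two-sided test at alpha_1
    z2 = std_normal.inv_cdf(1 - alpha2)      # (1 - 2*alpha_2) interval
    est = NormalDist(mu=d_true, sigma=se)    # sampling distribution of d-hat

    # Positive: the (1 - alpha_1) interval excludes zero.
    p_positive = est.cdf(-z1 * se) + (1 - est.cdf(z1 * se))

    # Negative: not positive, and the (1 - 2*alpha_2) interval lies
    # entirely inside [-margin, margin].
    bound = min(margin - z2 * se, z1 * se)
    p_negative = est.cdf(bound) - est.cdf(-bound) if bound > 0 else 0.0

    return p_positive, p_negative, 1 - p_positive - p_negative


if __name__ == "__main__":
    for d in (0.0, 0.5, 0.8):
        pos, neg, inc = three_outcome_probs(d, n_per_arm=32)
        print(f"d={d}: positive={pos:.2f} negative={neg:.2f} inconclusive={inc:.2f}")
```

With d = 0.8 or d = 0.5 the "positive" probability is the approximate power; with d = 0 it is the false-positive rate, and the other two numbers are the negative/inconclusive split.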

I agree that it’s not a perfect solution to all issues with NHST. However, I think publication policies like this one can and should be some part of the solution.

]]>I agree that other information needs to be taken into account to avoid over-optimism. I suggest regarding a Bayesian posterior probability based on the study data alone and a uniform prior as an upper bound (e.g. see 2nd paragraph of https://blog.oup.com/2017/06/suspected-fake-results-in-science/ ).

One approach is to explore the effect of various pessimistic subjective priors (or ‘posterior likelihood’ distributions) as part of a sensitivity analysis to model the possible posterior probabilities of replication when a repeat study is conducted in other centres. Is your view on the over-optimistic nature of a posterior distribution based on a uniform prior down to Nosek et al’s study on ‘Estimating the reproducibility of psychological science’ or some other rationale?

]]>A Bayesian inference with uniform priors is what it is, and it can be useful as a data summary. But I generally wouldn’t want to use this posterior distribution to make decisions as it can be wildly optimistic.

]]>> p=0.10 and a narrow CI means that your results are consistent with there being “zero effect” or a “small effect”, but a large effect is pretty much out

p=0.10 means that the interval does cover zero. In his words, the results are consistent with there being zero effect. There is no “another case”.

You are right that a CI tells us more than a p-value (a p-value is one single piece of information, the CI contains two pieces of information).

But consider these two possible 95% confidence intervals for H0:b=0 with (two-sided) p=0.01:

Wide CI: 6 < b < 45

Narrow CI: 0.006 < b < 0.045

I am not sure the first one warrants a "so what?" reaction more than the second one.
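To see that the two intervals above really do share a p-value, one can back the p-value out of each interval. This is a sketch assuming a symmetric, normal-theory 95% interval; `p_from_ci` is a hypothetical helper name.

```python
from statistics import NormalDist


def p_from_ci(lower, upper, level=0.95, null_value=0.0):
    """Two-sided p-value implied by a symmetric normal-theory interval."""
    std_normal = NormalDist()
    z_level = std_normal.inv_cdf(1 - (1 - level) / 2)
    estimate = (lower + upper) / 2            # interval midpoint
    se = (upper - lower) / (2 * z_level)      # half-width divided by z
    z = (estimate - null_value) / se
    return 2 * (1 - std_normal.cdf(abs(z)))


if __name__ == "__main__":
    print(round(p_from_ci(6, 45), 3))         # wide interval
    print(round(p_from_ci(0.006, 0.045), 3))  # narrow interval
```

Both calls give the same p-value (about 0.01), because the second interval is just the first divided by 1000. The p-value cannot tell the two apart; the interval can.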

]]>Some examples on Frank Harrell’s blog

http://www.fharrell.com/2017/04/statistical-errors-in-medical-literature.html

I have loads more if you want.

]]>…this is not a 95% confidence interval or the null hypothesis used to calculate the p-value is not H0:b=0.”

This is nitpicky or I’m missing something very obvious so apologies (fighting a cold so I will get my excuses in early). Ben mentions 95% CI, then switches to p=0.10 for his example … then I used p=0.01 … does it matter? I think (hope?) the point is clear. If you are going to fixate on a particular cutoff for “significance”, you can tell us a lot more with the CI than with the (usually almost useless) fact of whether or not zero is inside it.

]]>…this is not a 95% confidence interval or the null hypothesis used to calculate the p-value is not H0:b=0.

> This is the trap that so many fall into. H0:b=0 and p=0.01 but the CI is huge … woo-hoo. So you think maybe it isn’t zero … so what?

The case when the CI is wide may be more interesting (for the same p-value). If you have H0:b=0 and p=0.01 and the CI is narrow then maybe b isn’t zero but surely it’s small… so what?

]]>This is the trap that so many fall into. H0:b=0 and p=0.01 but the CI is huge … woo-hoo. So you think maybe it isn’t zero … so what? (Yes, I know, interpreting realized CIs is tricky, but still this is an improvement over mindless p-value citing.)

But I agree with Ben and Martha that the answer to “So what?” can be “There’s more noise here than we thought.” Indeed, that may well be a result worth sharing. A CI can convey this.

]]>If a symmetrical distribution (e.g. Student’s ‘t’ or Gaussian) is used to model data, the 95% confidence intervals become 95% credibility intervals and the null hypothesis corresponds to one of the confidence limits. However, if the CI applies to proportions, then the likelihood distribution will often be asymmetrical so that the simple relationship between confidence and credibility intervals does not hold. In this situation, the likelihood distributions and credibility intervals have to be calculated ‘exactly’ from hyper-geometric distributions using binomial coefficients (by analogy with Fisher’s exact test).
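A small illustration of the asymmetry for proportions. This is a sketch, not the exact hypergeometric calculation described above: it swaps in a simple grid approximation to the posterior under a uniform prior, and compares it with the usual symmetric (Wald) interval. The function names and the example counts (2 successes out of 20) are made up.

```python
def wald_interval(k, n, z=1.959964):
    """Symmetric normal-approximation 95% interval for a proportion."""
    p = k / n
    half = z * (p * (1 - p) / n) ** 0.5
    return p - half, p + half


def uniform_prior_credible_interval(k, n, level=0.95, grid=20001):
    """Equal-tailed credible interval under a uniform prior.

    Assumption: the posterior is approximated on an evenly spaced grid
    rather than computed exactly.
    """
    thetas = [i / (grid - 1) for i in range(grid)]
    weights = [t ** k * (1 - t) ** (n - k) for t in thetas]  # binomial likelihood
    total = sum(weights)
    tail = (1 - level) / 2
    lower = upper = None
    running = 0.0
    for theta, weight in zip(thetas, weights):
        running += weight / total
        if lower is None and running >= tail:
            lower = theta
        if upper is None and running >= 1 - tail:
            upper = theta
            break
    return lower, upper


if __name__ == "__main__":
    print(wald_interval(2, 20))
    print(uniform_prior_credible_interval(2, 20))
```

For a small count like this the symmetric interval dips below zero, which is impossible for a proportion, while the credible interval stays inside (0, 1) and is visibly lopsided around the observed 0.10.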

By applying the ‘principle of uniform distributions’, it is possible to provide a proper posterior likelihood distribution that allows the scientist to specify any credibility interval of his or her choosing and then for the statistician to estimate the probability of a mean or proportion falling inside that range after making an infinite number of observations. Prior data distributions can also be incorporated as a form of meta-analysis. This would be the probability of (long term) replication; I prefer to use ‘chosen replication range’ rather than CI. This would be much more intuitively appealing to scientists (and doctors whom I teach). The lines of ‘statistical significance’ (or not) could be marked on the distribution and treated with a pinch of salt. (I also discuss some of the issues related to this approach elsewhere on Andrew’s blog: http://andrewgelman.com/2017/10/04/worry-rigged-priors/#comment-578758 ).

]]>I believe it is really hard to anticipate how journals, authors and review committees will react and evolve under such a system.

Additionally, you are disregarding some shared insights (revised proposed list below), especially item 4, as your step 5 arguably applies to any single isolated study analysis: “There is insufficient evidence to support any conclusion”

1. A p_value is just one view of what to assess about an experiment/study – that being how consistent is the data with a specific bundle of assumptions (which includes the null hypothesis). Furthermore, what to make of such an assessment as being rare, is seldom obvious or clearly spelled out. For instance, if it is from the first study – this may suggest further studies are likely not wasteful. Whereas, if they can be usually brought about in repeated studies – this may support the effect being real (replicable). On the other hand, it might be more prudent to simply take it to suggest that estimation based on the bundle of assumptions (which now also includes the alternative hypothesis) may be completely misleading (i.e. the assumed model is just too wrong). At least, that is, if estimation is considered as an essential step in answering the real scientific question.

2. Consider p_values as continuous assessments and be wary of any thresholds it may or may not be under (or targeted alpha error levels).

3. Keep in mind that p_value assessments are based on the possibly questionable assumption of zero effect and zero systematic error as well as additional ancillary assumptions.

4. Realize that the real or penultimate inference considers the ensemble of studies (completed, ongoing and future); individual studies are just pieces in that, and only jointly do they allow the assessment of real uncertainty.

5. Be aware that informative prior (beyond the ensemble of studies) information, even if informally brought in as categorical qualifications (e.g. in large well done RCTs with large effects the assumption of zero systematic error is not problematic) may be unavoidable – learning how to peer review priors so that they are not just seen as personal opinion may also be unavoidable.

6. The above considerations must be highly motivated towards discerning what experiments suggest/support as well as quantifying the uncertainties in that, as all of them can be gamed for publication and career advantage. It seems the importance of this cannot be over-estimated nor the need to repeatedly mention it in teaching and writing about statistics.

7. All of this simply cannot be entrusted to single individuals or groups no matter how well meaning they attempt to be – bias and error are unavoidable and random audits may be the only way to overcome these.

8. ???

9. ???

]]>Step 1- Calculate a (1 – alpha_1)% Confidence Interval for theta.

Step 2- If this C.I. excludes theta_0 (the null value of theta), then declare a positive result. Otherwise, if theta_0 is within the C.I., proceed to Step 3.

Step 3- Calculate a (1 – 2*alpha_2)% Confidence Interval for theta.

Step 4- If this C.I. is entirely within the equivalence margin delta, declare a negative result. Otherwise, proceed to Step 5.

Step 5- Declare an inconclusive result. There is insufficient evidence to support any conclusion.
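The five steps can be sketched as a small function. This is an illustration only: it assumes a normal-theory interval around a point estimate with a known standard error, reads the value checked in Step 2 as the null value, and reads delta as an equivalence margin given as a (low, high) pair. The function name `classify_result` and its defaults are hypothetical.

```python
from statistics import NormalDist


def classify_result(estimate, se, null_value=0.0, margin=(-0.5, 0.5),
                    alpha1=0.05, alpha2=0.10):
    """Return 'positive', 'negative' or 'inconclusive' per Steps 1-5."""
    std_normal = NormalDist()

    # Steps 1-2: (1 - alpha_1) interval; positive if it excludes the null value.
    z1 = std_normal.inv_cdf(1 - alpha1 / 2)
    if not (estimate - z1 * se <= null_value <= estimate + z1 * se):
        return "positive"

    # Steps 3-4: (1 - 2*alpha_2) interval; negative if entirely inside the margin.
    z2 = std_normal.inv_cdf(1 - alpha2)
    if margin[0] < estimate - z2 * se and estimate + z2 * se < margin[1]:
        return "negative"

    # Step 5: neither condition met.
    return "inconclusive"


if __name__ == "__main__":
    for d_hat in (0.70, 0.05, 0.30):
        print(d_hat, classify_result(d_hat, se=0.25))
```

With a standard error of 0.25 (roughly 32 per arm on the standardized scale), an estimate of 0.70 is "positive", 0.05 is "negative", and 0.30 is "inconclusive".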

What you seem to be saying is a result can be insignificant either because the estimated difference from zero is very small or the uncertainty is very large (for arbitrary, yet exact, definitions of small and large). In the former case you want to call the results “negative”, in the latter you want to call them “inconclusive”.

If this new terminology is adopted you expect people will at least publish the subset of insignificant results that are “negative”, but still leave out those that are “inconclusive”. Currently both are sitting in file drawers, so you consider this an improvement.

Is that right? If so, I don’t think this proposal actually solves the real issues with NHST. Can you give a “real life” example of how this would be used and interpreted? For example:

“BioLab X is testing a new amyloid-beta clearing drug in mice to see if it may cure Alzheimer’s. To determine memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze. To determine amyloid-beta levels, they split these mice into high/low categories based on their olfactory habituation test performance (deficits in learning smells have been previously linked to Alzheimer’s disease).”

Include whatever sample sizes, effect sizes, as desired.

]]>https://peerj.com/articles/3544/ ]]>

https://arxiv.org/abs/1710.01771 ]]>

+1

]]>“But what I have never seen made clear is what we should do with higher p values. In this new world that doesn’t believe in thresholds, is there value in p = .10?”

This is one reason I see merit in encouraging people to switch out their p-values for confidence intervals when possible. I know this gets pooh-poohed on the grounds that a 95% CI is just an inverted null hypothesis test using p<0.05, but one huge advantage to the interval is that its width tells you something. p=0.10 and a wide CI means that your results are consistent with there being "zero effect" or "a large effect", so this isn't very informative. p=0.10 and a narrow CI means that your results are consistent with there being "zero effect" or a "small effect", but a large effect is pretty much out (at least contingent on all the other assumptions that went into the analysis).

I think this is a useful distinction, but with just a p-value you can't make it. And of course it also gives perspective to those p = 0.02 results where zero is just barely outside some really wide CI.
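A quick numerical version of the wide-versus-narrow point, going from a p-value and a standard error back to an interval. This is a sketch assuming a normal-theory estimate tested against zero; `ci_from_p` is a hypothetical helper and the two standard errors are invented for illustration.

```python
from statistics import NormalDist


def ci_from_p(p_value, se, level=0.95):
    """Interval implied by a two-sided p-value (null of zero) and a standard error.

    Assumption: the estimate is positive and approximately normal.
    """
    std_normal = NormalDist()
    estimate = std_normal.inv_cdf(1 - p_value / 2) * se
    z_level = std_normal.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_level * se, estimate + z_level * se


if __name__ == "__main__":
    print(ci_from_p(0.10, se=10.0))  # noisy study: wide interval
    print(ci_from_p(0.10, se=0.1))   # precise study: narrow interval
```

Both intervals cover zero, as they must at p = 0.10, but the first also leaves room for a large effect while the second rules one out. That is the distinction the bare p-value hides.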

More broadly, I think "statistically weak" results are still useful; if the study was worth doing then the results are worth reporting, even if the conclusion is that we can't conclude anything. "Turns out there's more noise here than we thought" is a result worth sharing, not least because it can point to where improvements need to be made in measurement and design and modeling.

I don't know how many journal editors would buy into this, but maybe our new world that doesn't believe in thresholds will also be more open to the value of "unpublished" research.

]]>The problem is that people rather easily jump to that conclusion. For example, when the journal *Basic and Applied Social Psychology* made their 2015 shift to “no more significance testing”, the editorial specifically said (in answer to the question “Will manuscripts with p-values be desk-rejected automatically?”), “No. … But prior to publication, the authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘significant’ differences or lack thereof, and so on).”

As an aside I think that some researchers should reread Dr. Ioannidis’s work more carefully. It is paraphrased sometimes incorrectly and sometimes somewhat correctly, but not interpreted precisely and correctly. Either the applications are over-generalized or they are applied too narrowly. ]]>

https://www.youtube.com/channel/UCBiO111B17hlhtRY5Cg3V_Q ]]>

The quick answer is, yes, lots of little data can become big data, and I think it’s fine for people to publish and analyze what data they have, and others can build on this. It can also be valuable to publish and recognize uncertainty, to say: Here’s what seemed like a promising line of research, but the data are too noisy to learn much of anything useful.

Finally, sometimes we have real decisions to make and we can’t wait until there is any sort of near-certainty. Again, better to give the information that is available, in all its ambiguity. In a field such as empirical macroeconomics, the questions are too important and the data too sparse for us to wait for statistical significance in any form.

In addition to all of the above, there’s also the question of incentives. So much bad work and data misrepresentation is done because of the implicit requirement that claims be presented as near-certain. I’d like to remove that burden, so the Susan T. Fiskes of the world can publish speculative work without the pressure to misrepresent the results as conclusive.

]]>Of course, there are many reasons to dislike the p < .05 threshold that go beyond "it's not sufficiently conservative." But what I have never seen made clear is what we should do with higher p values. In this new world that doesn't believe in thresholds, is there value in p = .10? It's often said that in the garden of forking paths, there's always a scientific explanation for your results. Do we trust scientific reasoning about theory enough to accept results that are statistically weaker as part of this movement that is about science improvement?

]]>Also, YouTube tutorials have been tremendous in my stats education. Again, McElreath’s lecture series is a great example. The internet needs more. But they don’t need to be that polished. There are tons of low-production value screenshot only YouTube videos on classical statistics in SPSS and so forth. The grad student audience is hungry for the Bayesian analogues featuring Stan, rstanarm, brms, and so forth.

And of course, blogs. But it appears Andrew and many of the rest of y’all have that one covered. [I love this community.]

So, yes, we can attempt top-down approaches. And we should. But the current youngsters are primed and ready. Reach us with more of these.

]]>So my rough estimate is that most readers are just looking for distractions that don’t require much commitment.

]]>‘But I’m pretty sure that most of you reading this blog are sitting in your parent’s basement eating Cheetos, with one finger on the TV remote and the other on the Twitter “like” button. So I can feel free to rant away.’

;-)

]]>Maybe not – “Our recommendation is similarly twofold. First, when describing results, we recommend that the label ‘statistically significant’ simply no longer be used.”

Then later “Second, when designing studies, we propose that authors transparently specify their design choices. These include (where applicable) the alpha level,”

Now for “when studying the effects of snakebites, and ESP, and spell checking,” the alpha level should be justified and hence varied by application but what purpose does it serve other than to ignore their first recommendation?

AG> “But I do think that comparisons to a null model of absolutely zero effect and zero systematic error are rarely relevant.”

With regard to the zero systematic error rarely being relevant – I raised that in a comment perhaps too late in the editing phase “there does not seem to be anything on systematic error or confounding”

In general, I think there are a number of authors that are agreeing more than is suggested by their making distinctions out of differences they discern in each other’s papers.

Perhaps the various papers should be partially pooled towards Valentin Amrhein and Sander Greenland’s very concise letter – to better pinpoint what the important distinctions really are?

]]>Bad statistical training invites lazy assessment of ‘evidence’.

Journals get tons of submissions, and they need some easy manner of deciding on pub-worthiness (not that evidence should even factor into that equation).

The entire notion of “hypothesis testing” is so strongly embedded into psychology, it’s going to be hard to uproot. Hypothesis testing sounds more “scientific” than “making the best inference from the data we have” or “trying to recover the DGP with out-of-sample prediction as a goal”, even if the latter two are really much harder [and more useful, more informative, and arguably more scientific].

I fear the next move will be:

1) P-values? Boo hiss. Let’s use Bayes factors.

2) Oh, Bayes factors don’t condition on the data either? The priors are actually prior predictive hypotheses? That’s not what I want.

3) Bayesian posteriors! Huzzah! If the credible interval excludes zero, then H1 is supported! Oh, that has many of the same problems as p-values?

4) Prediction! Let’s use predictive utility as a goal, and based on that construct better predictive hypothesis tests. Oh, that requires more work, and about 10 people on earth understand how to do that….

5) Let’s have everyone report everything. p-value, BF, informed BF, predictive utility metrics, posterior credible intervals. Ah, but that’s confusing and hard to interpret.

The statisticians: Exactly. Data are messy. Stop thresholding. Lots of metrics needed for decisions and inferences.

]]>In sum, perhaps a new journal that adopts these standards may be the best way to get pragmatic change by offering researchers a chance to showcase their more enlightened analyses.

]]>Perhaps “abandon” is the wrong word to use then.

]]>I’ve been railing against the Benjamin proposal /primarily/ because I pretty well hate thresholds. So long as thresholds exist, people will try their damnedest to dive past them. So long as we have this “past the threshold, evidence; not past the threshold, no evidence” mentality, the career-incentives + publication practices essentially mandate that people will threshold-dive and mischaracterize evidence (i.e., p-hacking). And any inferential quantity can be ‘hacked’; I’ve shown that it can be done, of course, with p-values, CIs, credible intervals, BFs, whatever else. It’s not particularly hard, regardless of the threshold. I even wrote a script that will find subsets where some threshold is met, just to demonstrate. And of course, it’s arbitrary; the idea that one line can delineate “true” from “not true”, or “publication-worthy” from “not publication-worthy”, is silly.
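For readers who have not seen threshold-diving in action, here is a minimal sketch of the general idea. To be clear, this is not the script mentioned above; it is a made-up simulation in which there is no true effect at all, and the "analyst" simply tries every subgroup defined by some irrelevant binary covariates and keeps the smallest p-value. All names and settings are hypothetical, and the test is a plain large-sample z-test.

```python
import random
from statistics import NormalDist, mean, variance


def z_test_p(a, b):
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def smallest_subgroup_p(rng, n=200, n_covariates=10):
    """Simulate a null study, then search subgroups for the smallest p."""
    # Each row: (treated?, outcome, irrelevant binary covariates).
    rows = [
        (rng.random() < 0.5,
         rng.gauss(0, 1),
         [rng.random() < 0.5 for _ in range(n_covariates)])
        for _ in range(n)
    ]
    best = 1.0
    for j in range(n_covariates):
        for level in (False, True):
            treated = [y for t, y, c in rows if t and c[j] == level]
            control = [y for t, y, c in rows if not t and c[j] == level]
            if len(treated) < 2 or len(control) < 2:
                continue  # too few observations to estimate a variance
            best = min(best, z_test_p(treated, control))
    return best


def hit_rate(n_sims=200, threshold=0.05, seed=1):
    """Share of null studies where some subgroup clears the threshold."""
    rng = random.Random(seed)
    hits = sum(smallest_subgroup_p(rng) < threshold for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print(hit_rate())
```

Even though nothing is going on in the simulated data, the share of studies with at least one subgroup under the threshold comes out far above the nominal 5%. Swapping .05 for .005 lowers the rate but does not change the mechanism.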

Finally, my argument about the Benjamin proposal was that it’s unrealistically optimistic: “In an imperfect world where p-values are used, use .005” but then they say “.005 should not be used as a publication threshold”; this is a sticking point for me as an ECR. In the same imperfect world where people misinterpret p-values and dichotomize evidence, they will use .005 as a pub threshold instead of .05, because it’s an imperfect world. Saying one should use .005 < p < .05 as 'suggestive' and < .005 as 'significant' but "don't use p-value as pub-worthiness" is too "idealistic" for me; of COURSE people will just now trichotomize evidence, and those people are now being told "only .005 signifies evidence", therefore they will just s/.05/.005 in their pub-worthiness evaluation. TL;DR: why would people who judge pub-worthiness based on whether p is less than some evidentiary threshold now stop doing so with .005 if .005 is the new evidentiary threshold?

Even if I am on the 'justify your alpha' paper, I will say that I just don't like thresholds, period. But for me, IF we are going to use thresholds, they should be justified, and hence my contribution to the paper.

Aside from the 'where should the threshold be' question, which I personally think is moot… as I said on twitter:

Things that caused psych problems: HARKing, QRPs, threshold diving, publication bias, poor stats understanding, bad incentive structure. Incentive structure requiring novel, sexy findings, 10 papers a year, only publishable if beyond threshold. No replications permitted. Things that don't need fixing: An arbitrary threshold for rejecting a hypothesis no one believes based on a fictitious universe of events.