Comments on: Response to some comments on “Abandon Statistical Significance”

By: Anoneuoid

Anoneuoid — Sun, 08 Oct 2017 18:34:10 +0000

In reply to Harlan.

> To determine memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze.

The "effect" may be due to the treatment making the mice more/less hungry, faster/slower, stressed out by being put in the maze, or aggressive (thus influencing how much food they get, or how the handler deals with them), or to changes in their sense of smell/sight/hearing, etc. It may have nothing to do with memory, or memory may be only part of the story. To deal with this you eventually need to come up with a prediction of what the results should look like if one explanation is correct vs. the others. Just knowing there is "an effect" is not helpful. I literally do not care whether you detect an effect or not (I always assume there is one); this type of data/analysis gives me nothing to work with.

By: Harlan

Harlan — Sun, 08 Oct 2017 18:17:41 +0000

In reply to Anoneuoid.

Thanks for the feedback.
“Currently both are sitting in file drawers, so you consider this an improvement. Is that right?”– Indeed that is right! Consider the proportion of all published studies that are actually true (Ioannidis, 2005). Including more “negative results” will up this proportion. If this policy provides extra incentive for researchers to use larger sample sizes, this proportion is increased further. While the improvement may still be somewhat small, these are achievable, impactful gains. Expecting a journal to publish all null results is a lot to ask- I think those file drawers are pretty damn full! So why not start with the “high-quality null results”?

Example:
“BioLab X is testing a new drug in mice to see if it may cure Alzheimer’s. To determine memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze. Due to operational/ethical/budgetary reasons (let’s be honest, these are expensive mice!), the sample size is restricted to 32 per arm. This sample size provides ~88% power to detect a large effect (d=0.8) and ~50% power to detect a medium effect (d=0.5).

Additional clinical research would only be justified if the true effect were substantial (i.e. d>0.5). As such, determining with certainty that d is less than 0.5 would be an important finding. The equivalence margin is therefore [-0.5, 0.5]. Under the assumption that the true effect size is 0, this sample size provides ~52% chance of reaching a “negative result”, ~43% chance of reaching an “inconclusive result”, and ~5% chance of a (false) “positive result” (alpha_1=0.05, alpha_2=0.10).”
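For anyone who wants to check figures like these, here is a rough sketch using a normal approximation. It assumes the standard error of the estimated d is about sqrt(2/n) per arm and that the margin is symmetric around zero; the function names are made up for illustration and this is not the calculation from the paper. Because it uses z rather than t quantiles, the power values come out a point or two higher than the ones quoted above.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def power_two_sample(d, n_per_arm, alpha=0.05):
    """Approximate two-sided power to detect a standardized effect d."""
    se = sqrt(2.0 / n_per_arm)      # assumed standard error of the estimated d
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d / se
    # Probability that the test statistic lands in either rejection tail.
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


def outcome_probabilities(d_true, n_per_arm, margin, alpha_1=0.05, alpha_2=0.10):
    """Chance of a positive / negative / inconclusive label for a true effect d_true."""
    se = sqrt(2.0 / n_per_arm)
    estimate = NormalDist(mu=d_true, sigma=se)  # sampling distribution of d-hat
    z1 = Z.inv_cdf(1 - alpha_1 / 2)             # for the (1 - alpha_1) CI
    z2 = Z.inv_cdf(1 - alpha_2)                 # for the (1 - 2*alpha_2) CI
    # Positive: the (1 - alpha_1) CI excludes zero, i.e. |d-hat| > z1 * se.
    p_pos = estimate.cdf(-z1 * se) + (1 - estimate.cdf(z1 * se))
    # Negative: not positive, and the (1 - 2*alpha_2) CI lies inside the margin.
    bound = min(margin - z2 * se, z1 * se)
    p_neg = estimate.cdf(bound) - estimate.cdf(-bound) if bound > 0 else 0.0
    return {"positive": p_pos, "negative": p_neg, "inconclusive": 1 - p_pos - p_neg}


if __name__ == "__main__":
    print(power_two_sample(0.8, 32), power_two_sample(0.5, 32))
    print(outcome_probabilities(0.0, 32, margin=0.5))
```

Changing `n_per_arm` or `margin` shows how quickly the "negative" share grows with sample size, which is the incentive being argued for.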

I agree that it’s not a perfect solution to all issues with NHST. However, I think publication policies like this one can and should be some part of the solution.

By: Huw Llewelyn

Huw Llewelyn — Sun, 08 Oct 2017 10:26:20 +0000

In reply to Andrew.

Andrew

I agree that other information needs to be taken into account to avoid over-optimism. I suggest regarding a Bayesian posterior probability based on the study data alone and a uniform prior as an upper bound (e.g. see 2nd paragraph of https://blog.oup.com/2017/06/suspected-fake-results-in-science/ ).

One approach is to explore the effect of various pessimistic subjective priors (or ‘posterior likelihood’ distributions) as part of a sensitivity analysis to model the possible posterior probabilities of replication when a repeat study is conducted in other centres. Is your view on the over-optimistic nature of a posterior distribution based on a uniform prior down to Nosek et al’s study on ‘Estimating the reproducibility of psychological science’ or some other rationale?

By: Andrew

Andrew — Sun, 08 Oct 2017 00:19:57 +0000

In reply to Huw Llewelyn. Huw: A Bayesian inference with uniform priors is what it is, and it can be useful as a data summary. But I generally wouldn't want to use this posterior distribution to make decisions as it can be wildly optimistic.

By: Huw Llewelyn

Huw Llewelyn — Sun, 08 Oct 2017 00:10:24 +0000

In reply to Simon Gates. Thank you. You, Frank Harrell and others are preaching to the converted! However, a 'null hypothesis' of zero difference between treatment and placebo may be of little interest to a doctor who may wish to know the estimated probability of at least a 25% cure rate (for example) before starting a new treatment policy. He would therefore specify a ‘replication range’ of >25% (the interval not having an upper bound) and wish to know the probability of achieving this conditional on the existing data alone. This would require a Bayesian estimate with uniform priors. In order to plan further studies it might help to use the existing study result to calculate the power required to achieve a specified higher probability of such a ' >25% replication range’ based on the distribution suggested by the existing data and also perhaps together with a sensitivity analysis using various subjectively estimated likelihood distributions that are different to the one provided by an existing study. I think that this would be easier for doctors to understand than P values and even fixed (e.g. 95%) confidence intervals. I completely agree with Ben and everyone that P-values are inadequate.
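A minimal sketch of the kind of calculation described here, assuming a binomial outcome (cured / not cured) and a uniform Beta(1, 1) prior, so that the posterior is Beta(successes + 1, failures + 1). The function name and the trial counts in the usage lines are hypothetical, chosen only to illustrate the '>25% replication range'.

```python
from math import comb


def prob_rate_exceeds(successes, n, threshold):
    """P(true rate > threshold | data) under a uniform prior.

    The posterior is Beta(successes + 1, n - successes + 1). For integer
    parameters its upper tail equals P(Binomial(n + 1, threshold) <= successes),
    so no special functions are needed.
    """
    m = n + 1
    return sum(
        comb(m, j) * threshold**j * (1 - threshold) ** (m - j)
        for j in range(successes + 1)
    )


if __name__ == "__main__":
    # Hypothetical study: 12 cures among 30 treated patients.
    print(prob_rate_exceeds(12, 30, 0.25))
    # The same observed proportion with ten times the data.
    print(prob_rate_exceeds(120, 300, 0.25))
```

As Andrew notes elsewhere in the thread, a posterior built on a uniform prior can be optimistic, so a number like this is probably best read as an upper bound rather than as a decision rule.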

By: Ben Prytherch

Ben Prytherch — Sun, 08 Oct 2017 00:00:19 +0000

In reply to Carlos Ungil. Carlos, now that I read back over my previous posts, I forgot to add that I think 95% CIs are not only more informative than p-values, but that they should also be treated as the ultimate arbiter of Truth. Scientific papers should be accepted for publication if and only if a 95% CI for something or another is reported to exclude zero. All hail the 95% CI!

By: Carlos Ungil

Carlos Ungil — Sat, 07 Oct 2017 22:46:47 +0000

In reply to Mark Schaffer. Ben, Mark, if we all agree this is no fun! ;-)

By: Mark Schaffer

Mark Schaffer — Sat, 07 Oct 2017 21:35:31 +0000

In reply to Carlos Ungil. Ah ... my fault, I wasn't clear (I blame that cold). The "so what?" was in reference to the rejection of a point null of b=0, which we all agree is on its own usually pretty pointless and uninformative. CIs are far more informative, as the comments by Carlos and Ben (and me) illustrate.

By: Ben Prytherch

Ben Prytherch — Sat, 07 Oct 2017 21:02:35 +0000

In reply to Valentin Amrhein. Thanks Valentin, I like the homage to Cohen in your title, and the plot of difference CIs along with their p-values really drives the point home regarding how much more informative a CI is relative to a p-value. Not very many people who use p-values know what they are. And maybe not many people who use CIs really know what they are either, but I don't think it matters as much with CIs. The picture of the interval basically tells the story, while the p-value remains mysterious.

By: Ben Prytherch

Ben Prytherch — Sat, 07 Oct 2017 20:58:16 +0000

In reply to Carlos Ungil.

Thanks Carlos, that’s a good point and I agree. p=0.01 can be consistent with a precisely estimated small effect, or an imprecisely estimated large effect… but precisely estimated large effects produce p-values of 0.00000….whatever.

By: Carlos Ungil

Carlos Ungil — Sat, 07 Oct 2017 18:37:47 +0000

In reply to Mark Schaffer.

> p=0.10 and a wide CI means that your results are consistent with there being “zero effect” or “a large effect”, so this isn’t very informative.
> p=0.10 and a narrow CI means that your results are consistent with there being “zero effect” or a “small effect”, but a large effect is pretty much out
p=0.10 means that the interval does cover zero. In his words, the results are consistent with there being zero effect. There is no “another case”.

You are right that a CI tells us more than a p-value (a p-value is one single piece of information, the CI contains two pieces of information).
But consider these two possible 95% confidence intervals for H0:b=0 with (two-sided) p=0.01:

Wide CI: 6 < b < 45
Narrow CI: 0.006 < b < 0.045

I am not sure the first one warrants a "so what?" reaction more than the second one.
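To see that both intervals really do correspond to the same p-value, here is a small sketch that backs out a two-sided p-value from a confidence interval. It assumes a symmetric, normal-based interval (estimate at the midpoint, half-width equal to the critical value times the standard error); the function name is made up for illustration.

```python
from statistics import NormalDist


def p_value_from_ci(lower, upper, level=0.95, null=0.0):
    """Two-sided p-value implied by a symmetric normal-based CI."""
    z = NormalDist()
    z_crit = z.inv_cdf(0.5 + level / 2)
    estimate = (lower + upper) / 2           # midpoint of the interval
    se = (upper - lower) / (2 * z_crit)      # half-width divided by critical value
    return 2 * (1 - z.cdf(abs(estimate - null) / se))


if __name__ == "__main__":
    print(p_value_from_ci(6, 45))          # wide CI
    print(p_value_from_ci(0.006, 0.045))   # narrow CI
```

The two calls give the same p-value, about 0.01, because rescaling b changes the interval but not the ratio of estimate to standard error.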

By: Simon Gates

Simon Gates — Sat, 07 Oct 2017 16:07:27 +0000

In reply to Huw Llewelyn.

Most medical journals require reporting of confidence intervals now, so people do generally report them (though not all the time). But when it comes to interpreting the results, it’s a different story – the usual thing is dichotomisation into there’s an effect/no effect based on a significance test.

Some examples on Frank Harrell’s blog
http://www.fharrell.com/2017/04/statistical-errors-in-medical-literature.html

I have loads more if you want.

By: Mark Schaffer

Mark Schaffer — Sat, 07 Oct 2017 15:44:27 +0000

In reply to Carlos Ungil.

“> p=0.10 and a wide CI that doesn’t include zero means that …

…this is not a 95% confidence interval or the null hypothesis used to calculate the p-value is not H0:b=0.”

This is nitpicky or I’m missing something very obvious so apologies (fighting a cold so I will get my excuses in early). Ben mentions 95% CI, then switches to p=0.10 for his example … then I used p=0.01 … does it matter? I think (hope?) the point is clear. If you are going to fixate on a particular cutoff for “significance”, you can tell us a lot more with the CI than with the (usually almost useless) fact of whether or not zero is inside it.

By: Carlos Ungil

Carlos Ungil — Sat, 07 Oct 2017 14:39:33 +0000

In reply to Mark Schaffer.

> p=0.10 and a wide CI that doesn’t include zero means that …

…this is not a 95% confidence interval or the null hypothesis used to calculate the p-value is not H0:b=0.

> This is the trap that so many fall into. H0:b=0 and p=0.01 but the CI is huge … woo-hoo. So you think maybe it isn’t zero … so what?

The case when the CI is wide may be more interesting (for the same p-value). If you have H0:b=0 and p=0.01 and the CI is narrow then maybe b isn’t zero but surely it’s small… so what?

By: Mark Schaffer

Mark Schaffer — Sat, 07 Oct 2017 14:14:35 +0000

In reply to Ben Prytherch.

+1 from me too. And maybe worth rewording one of Ben’s examples to cover another case: “p=0.10 and a wide CI that doesn’t include zero means that your results are consistent with there being “small effect” or “a large effect”, so this isn’t very informative either”.

This is the trap that so many fall into. H0:b=0 and p=0.01 but the CI is huge … woo-hoo. So you think maybe it isn’t zero … so what? (Yes, I know, interpreting realized CIs is tricky, but still this is an improvement over mindless p-value citing.)

But I agree with Ben and Martha that the answer to “So what?” can be “There’s more noise here than we thought.” Indeed, that may well be a result worth sharing. A CI can convey this.

By: Huw Llewelyn

Huw Llewelyn — Fri, 06 Oct 2017 23:59:38 +0000

In reply to Huw Llewelyn. Erratum: In the first sentence of the 3rd paragraph the phrase "proper posterior likelihood distribution" should have been "proper posterior PROBABILITY distribution".

By: Huw Llewelyn

Huw Llewelyn — Fri, 06 Oct 2017 22:03:30 +0000

In reply to Ben Prytherch.

Ben. I like your suggestion of using confidence intervals. The British Medical Journal has also been campaigning for their use as you suggest for years and looks more favourably on papers that include them. The ’pooh-poohing’ suggests that critics of CIs think that the inversion is invalid. I have a special interest in this because I have shown that random sampling is an interesting special case where the prior probabilities of all its possible specified outcomes are equal (i.e. they are of necessity uniform). I explain this in my Oxford University Press blog: https://blog.oup.com/2017/06/suspected-fake-results-in-science/ . I argue that any non-uniform Bayesian prior distribution is actually a posterior distribution formed by ‘normalising’ a likelihood distribution (based on real or pseudo-data) by assuming a uniform ‘base-rate’ distribution. This posterior probability distribution then becomes a Bayesian prior to be used with new data.

If a symmetrical distribution (e.g. Student’s ‘t’ or Gaussian) is used to model data, the 95% confidence intervals become 95% credibility intervals and the null hypothesis corresponds to one of the confidence limits. However, if the CI applies to proportions, then the likelihood distribution will often be asymmetrical so that the simple relationship between confidence and credibility intervals does not hold. In this situation, the likelihood distributions and credibility intervals have to be calculated ‘exactly’ using hyper-geometric distributions based on binomial coefficients (by analogy with Fisher’s exact test).

By applying the ‘principle of uniform distributions’, it is possible to provide a proper posterior likelihood distribution that allows the scientist to specify any credibility interval of his or her choosing and then for the statistician to estimate the probability of a mean or proportion falling inside that range after making an infinite number of observations. Prior data distributions can also be incorporated as a form of meta-analysis. This would be the probability of (long term) replication; I prefer to use ‘chosen replication range’ rather than CI. This would be much more intuitively appealing to scientists (and doctors whom I teach). The lines of ‘statistical significance’ (or not) could be marked on the distribution and treated with a pinch of salt. I also discuss some of the issues related to this approach elsewhere on Andrew’s blog: http://statmodeling.stat.columbia.edu/2017/10/04/worry-rigged-priors/#comment-578758 .

By: Keith O'Rourke

Keith O'Rourke — Fri, 06 Oct 2017 14:22:37 +0000

In reply to Harlan.

So with what Anoneuoid wrote – all studies get statistically labeled as better, not too different or too uncertain to discern whether better, worse or not too different.

I believe it is really hard to anticipate how journals, authors and review committees will react and evolve under such a system.

Additionally, you are disregarding some shared insights (revised proposed list below), especially item 4, as your step 5 arguably applies to any single isolated study analysis: “There is insufficient evidence to support any conclusion.”

1. A p_value is just one view of what to assess about an experiment/study – that being how consistent is the data with a specific bundle of assumptions (which includes the null hypothesis). Furthermore, what to make of such an assessment as being rare, is seldom obvious or clearly spelled out. For instance, if it is from the first study – this may suggest further studies are likely not wasteful. Whereas, if they can be usually brought about in repeated studies – this may support the effect being real (replicable). On the other hand, it might be more prudent to simply take it to suggest that estimation based on the bundle of assumptions (which now also includes the alternative hypothesis) may be completely misleading (i.e. the assumed model is just too wrong). At least, that is, if estimation is considered as an essential step in answering the real scientific question.

2. Consider p_values as continuous assessments and be wary of any thresholds it may or may not be under (or targeted alpha error levels).

3. Keep in mind that p_value assessments are based on the possibly questionable assumption of zero effect and zero systematic error as well as additional ancillary assumptions.

4. Realize that the real or penultimate inference considers the ensemble of studies (completed, ongoing and future), individual studies are just pieces in that, which only jointly allows the assessment of real uncertainty.

5. Be aware that informative prior (beyond the ensemble of studies) information, even if informally brought in as categorical qualifications (e.g. in large well done RCTs with large effects the assumption of zero systematic error is not problematic) may be unavoidable – learning how to peer review priors so that they are not just seen as personal opinion may also be unavoidable.

6. The above considerations must be highly motivated towards discerning what experiments suggest/support as well as quantifying the uncertainties in that, as all of them can be gamed for publication and career advantage. It seems the importance of this cannot be over-estimated nor the need to repeatedly mention it in teaching and writing about statistics.

7. All of this simply cannot be entrusted to single individuals or groups no matter how well meaning they attempt to be – bias and error are unavoidable and random audits may be the only way to overcome these.

8. ???

9. ???

By: Anoneuoid

Anoneuoid — Fri, 06 Oct 2017 12:16:52 +0000

In reply to Harlan.

Step 1- Calculate a (1 – alpha_1)% Confidence Interval for theta.

Step 2- If this C.I. excludes theta, then declare a positive result. Otherwise, if theta is within the C.I., proceed to Step 3.

Step 3- Calculate a (1 – 2*alpha_2)% Confidence Interval for theta.

Step 4- If this C.I. is entirely within delta, declare a negative result. Otherwise, proceed to Step 5.

Step 5- Declare an inconclusive result. There is insufficient evidence to support any conclusion.
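As a sketch only, the five steps above could be coded roughly as follows. This assumes normal-based intervals, takes the value tested in Step 2 to be a null value of 0 (the step as quoted says "theta"), and treats delta as a symmetric margin around that null. The names are hypothetical and this is not code from the paper.

```python
from statistics import NormalDist


def classify_result(estimate, se, margin, null=0.0, alpha_1=0.05, alpha_2=0.10):
    """Label a result 'positive', 'negative' or 'inconclusive' (Steps 1-5)."""
    z = NormalDist()

    # Steps 1-2: (1 - alpha_1) CI; positive if it excludes the null value.
    z1 = z.inv_cdf(1 - alpha_1 / 2)
    lo1, hi1 = estimate - z1 * se, estimate + z1 * se
    if not (lo1 <= null <= hi1):
        return "positive"

    # Steps 3-4: (1 - 2*alpha_2) CI; negative if it sits entirely inside the margin.
    z2 = z.inv_cdf(1 - alpha_2)
    lo2, hi2 = estimate - z2 * se, estimate + z2 * se
    if null - margin < lo2 and hi2 < null + margin:
        return "negative"

    # Step 5: everything else.
    return "inconclusive"


if __name__ == "__main__":
    # Standard error of 0.25 corresponds to roughly 32 per arm on the d scale.
    for d_hat in (0.6, 0.1, 0.3):
        print(d_hat, classify_result(d_hat, se=0.25, margin=0.5))
```

With a standard error of 0.25 and a margin of 0.5, estimates of 0.6, 0.1 and 0.3 come out positive, negative and inconclusive respectively.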

What you seem to be saying is a result can be insignificant either because the estimated difference from zero is very small or the uncertainty is very large (for arbitrary, yet exact, definitions of small and large). In the former case you want to call the results “negative”, in the latter you want to call them “inconclusive”.

If this new terminology is adopted you expect people will at least publish the subset of insignificant results that are “negative”, but still leave out those that are “inconclusive”. Currently both are sitting in file drawers, so you consider this an improvement.

Is that right? If so, I don’t think this proposal actually solves the real issues with NHST. Can you give a “real life” example of how this would be used and interpreted? For example:

“BioLab X is testing a new amyloid-beta clearing drug in mice to see if it may cure Alzheimer’s. To determine memory deficit, they measure how long it takes treated and control groups to gather all the food from a familiar maze. To determine amyloid-beta levels, they split these mice into high/low categories based on their olfactory habituation test performance (deficits in learning smells have been previously linked to Alzheimer’s disease).”

Include whatever sample sizes, effect sizes, as desired.

By: Valentin Amrhein

Valentin Amrhein — Fri, 06 Oct 2017 07:36:31 +0000

In reply to Ben Prytherch.

We made a very similar point in our review on significance thresholds (see, e.g., fig. 1 for a comparison of p-values with CI, and how CI may be meaningfully interpreted even though p-values are relatively large).
https://peerj.com/articles/3544/

By: Harlan

Harlan — Fri, 06 Oct 2017 04:18:50 +0000

I’m putting out a different idea, going in the opposite direction– let’s rely even more on significance and further the dichotomization of evidence.
https://arxiv.org/abs/1710.01771

By: Martha (Smith)

Martha (Smith) — Fri, 06 Oct 2017 02:24:43 +0000

In reply to Ben Prytherch.

“‘Turns out there’s more noise here than we thought’ is a result worth sharing, not least because it can point to where improvements need to be made in measurement and design and modeling.”

By: Ben Prytherch

Ben Prytherch — Fri, 06 Oct 2017 00:56:38 +0000

In reply to Jacob.

Jacob, regarding this question:

“But what I have never seen made clear is what we should do with higher p values. In this new world that doesn’t believe in thresholds, is there value in p = .10?”

This is one reason I see merit in encouraging people to switch out their p-values for confidence intervals when possible. I know this gets pooh-poohed on the grounds that a 95% CI is just an inverted null hypothesis test using p<0.05, but one huge advantage to the interval is that its width tells you something. p=0.10 and a wide CI means that your results are consistent with there being "zero effect" or "a large effect", so this isn't very informative. p=0.10 and a narrow CI means that your results are consistent with there being "zero effect" or a "small effect", but a large effect is pretty much out (at least contingent on all the other assumptions that went into the analysis).

I think this is a useful distinction, but with just a p-value you can't make it. And of course it also gives perspective to those p = 0.02 results where zero is just barely outside some really wide CI.

More broadly, I think "statistically weak" results are still useful; if the study was worth doing then the results are worth reporting, even if the conclusion is that we can't conclude anything. "Turns out there's more noise here than we thought" is a result worth sharing, not least because it can point to where improvements need to be made in measurement and design and modeling.

I don't know how many journal editors would buy into this, but maybe our new world that doesn't believe in thresholds will also be more open to the value of "unpublished" research.

By: Peter Erwin

Peter Erwin — Thu, 05 Oct 2017 13:34:14 +0000

In reply to Stephen Martin. "Abandon statistical significance != abandon p-values." The problem is that people rather easily jump to that conclusion. For example, when the journal Basic and Applied Social Psychology made their 2015 shift to "no more significance testing", the editorial specifically said (in answer to the question "Will manuscripts with p-values be desk-rejected automatically?"), "No. ... But prior to publication, the authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about 'significant' differences or lack thereof, and so on)."

By: Sameera Daniels

Sameera Daniels — Tue, 03 Oct 2017 11:32:52 +0000

I would think that the preprint & preregistration platforms should give us better insight into the reasoning processes of researchers. Simply rereading journal articles carefully has yielded reconsideration of findings as well.
As an aside I think that some researchers should reread Dr. Ioannidis’s work more carefully. It is paraphrased incorrectly & somewhat correctly. But not interpreted precisely & correctly. Either the applications are over-generalized or applied too narrowly.

By: Martha (Smith)

Martha (Smith) — Tue, 03 Oct 2017 05:03:48 +0000

In reply to Corey. Gee, my parents never had a TV in their basement, let alone one with a remote. (Come to think of it, my house has neither a basement nor a TV.)

By: Ben Goodrich

Ben Goodrich — Tue, 03 Oct 2017 04:25:14 +0000

In reply to Solomon Kurz.

My lectures are up at
https://www.youtube.com/channel/UCBiO111B17hlhtRY5Cg3V_Q

By: Andrew

Andrew — Tue, 03 Oct 2017 01:44:19 +0000

In reply to Jacob.

Jacob:

The quick answer is, yes, lots of little data can become big data, and I think it’s fine for people to publish and analyze what data they have, and others can build on this. It can also be valuable to publish and recognize uncertainty, to say: Here’s what seemed like a promising line of research, but the data are too noisy to learn much of anything useful.

Finally, sometimes we have real decisions to make and we can’t wait until there is any sort of near-certainty. Again, better to give the information that is available, in all its ambiguity. In a field such as empirical macroeconomics, the questions are too important and the data too sparse for us to wait for statistical significance in any form.

In addition to all of the above, there’s also the question of incentives. So much bad work and data misrepresentation is done because of the implicit requirement that claims be presented as near-certain. I’d like to remove that burden, so the Susan T. Fiskes of the world can publish speculative work without the pressure to misrepresent the results as conclusive.

By: Jacob

Jacob — Tue, 03 Oct 2017 01:25:50 +0000

You raise an interesting point with regard to the fact that sometimes we might be justified to interpret p > .05 results as interesting and meriting further research (which is as far as most social scientists are willing to go in scientific outlets in describing their findings). Some of the reformism has often felt like it originated in the idea that p = roughly .05 results just aren’t compelling, but we’d prefer not to have to come up with a new, lower threshold. That’s why I wasn’t surprised by the new suggested .005 threshold.

Of course, there are many reasons to dislike the p < .05 threshold that go beyond "it's not sufficiently conservative." But what I have never seen made clear is what we should do with higher p values. In this new world that doesn't believe in thresholds, is there value in p = .10? It's often said that in the garden of forking paths, there's always a scientific explanation for your results. Do we trust scientific reasoning about theory enough to accept results that are statistically weaker as part of this movement that is about science improvement?

By: Ben Prytherch

Ben Prytherch — Tue, 03 Oct 2017 00:18:29 +0000

In reply to Solomon Kurz. +1 Solomon Kurz

By: Solomon Kurz

Solomon Kurz — Mon, 02 Oct 2017 23:42:07 +0000

In reply to Stephen Martin.

That’s one way to approach the problem. And yet, as they say, science progresses one funeral at a time. Which implies it’s the youngsters—like me—who should be the easiest to target. And you all have targeted me well. Here was one of the most effective ways I was targeted: high quality intro stats books focusing on all the cool things you can do outside of the NHST paradigm. My prime example is McElreath’s “Statistical Rethinking.” Sure, some of his code is a little spaghetti-ish, but now we have the tidyverse and brms to help with future attempts. Perhaps the 2nd edition of Gelman and Hill will qualify, too.

Also, YouTube tutorials have been tremendous in my stats education. Again, McElreath’s lecture series is a great example. The internet needs more. But they don’t need to be that polished. There are tons of low-production value screenshot only YouTube videos on classical statistics in SPSS and so forth. The grad student audience is hungry for the Bayesian analogues featuring Stan, rstanarm, brms, and so forth.

And of course, blogs. But it appears Andrew and many of the rest of y’all have that one covered. [I love this community.]

So, yes, we can attempt top-down approaches. And we should. But the current youngsters are primed and ready. Reach us with more of these.

By: Corey

Corey — Mon, 02 Oct 2017 23:25:21 +0000

In reply to Daniel Lakeland. For the record, I only sit in my parents' basement on Sundays.

By: Keith O'Rourke

Keith O'Rourke — Mon, 02 Oct 2017 21:43:44 +0000

In reply to Joachim. As I understand it many readers never read any comments (nor make any). So my rough estimate is that most readers are just looking for distractions that don't require much commitment.

By: Sameera Daniels

Sameera Daniels — Mon, 02 Oct 2017 21:00:34 +0000

In reply to Daniel Lakeland. Actually Daniel I'm eating chips & salsa. And I'd rather post here than watch numbing TV.

By: Joachim

Joachim — Mon, 02 Oct 2017 20:29:51 +0000

In reply to Daniel Lakeland. It does make one wonder how Andrew thinks we read his papers.

By: Daniel Lakeland

Daniel Lakeland — Mon, 02 Oct 2017 20:19:04 +0000

I think there’s not enough emphasis placed on the critical idea in this blog post, which is:

‘But I’m pretty sure that most of you reading this blog are sitting in your parent’s basement eating Cheetos, with one finger on the TV remote and the other on the Twitter “like” button. So I can feel free to rant away.’

;-)

By: Keith O'Rourke

Keith O'Rourke — Mon, 02 Oct 2017 20:14:25 +0000

> but they [Lakens et al.] still seem to recommend that statistical significance be used for decision rules:

Maybe not – “Our recommendation is similarly twofold. First, when describing results, we recommend that the label ‘statistically significant’ simply no longer be used.”

Then later “Second, when designing studies, we propose that authors transparently specify their design choices. These include (where applicable) the alpha level,”

Now for “when studying the effects of snakebites, and ESP, and spell checking,” the alpha level should be justified and hence varied by application but what purpose does it serve other than to ignore their first recommendation?

AG> “But I do think that comparisons to a null model of absolutely zero effect and zero systematic error are rarely relevant.”
With regard to the zero systematic error rarely being relevant – I raised that in a comment perhaps too late in the editing phase “there does not seem to be anything on systematic error or confounding”

In general, I think there are a number of authors who are agreeing more than is suggested by their making distinctions out of differences they discern in each other’s papers.

Perhaps the various papers should be partially pooled towards Valentin Amrhein and Sander Greenland’s very concise letter – to better pinpoint what the important distinctions really are?

By: Sameera Daniels

Sameera Daniels — Mon, 02 Oct 2017 20:05:34 +0000

Sander Greenland is absolutely right in positing that non-standard & standard definitions of p-values as well as other statistics terms have lent themselves to abuses & misuses in practice & in epistemics.

By: Sameera Daniels

Sameera Daniels — Mon, 02 Oct 2017 20:02:07 +0000

As a non-statistician, however, I haven’t yet found a cogent explanation for why p-values should be regarded as a continuous measure either. More broadly, the quality of insights is the real problem.

By: Sameera Daniels

Sameera Daniels — Mon, 02 Oct 2017 19:59:36 +0000

Well NHST should never have expanded its criteria to nearly all disciplines, like economics, medicine, & public health. And it’s not hard to see that it substitutes as a marketing tool & thinking more generally.

By: Stephen Martin

Stephen Martin — Mon, 02 Oct 2017 19:03:45 +0000

In reply to Bob.

Yup; I think the problem is multi-rooted.

Bad statistical training invites lazy assessment of ‘evidence’.
Journals get tons of submissions, and they need some easy manner of deciding on pub-worthiness (not that evidence should even factor into that equation).
The entire notion of “hypothesis testing” is so strongly embedded into psychology, it’s going to be hard to uproot. Hypothesis testing sounds more “scientific” than “making the best inference from the data we have” or “trying to recover the DGP with out-of-sample prediction as a goal”, even if the latter two are really much harder [and more useful, more informative, and arguably more scientific].

I fear the next move will be:
1) P-values? Boo hiss. Let’s use Bayes factors.
2) Oh, Bayes factors don’t condition on the data either? The priors are actually prior predictive hypotheses? That’s not what I want.
3) Bayesian posteriors! Huzzah! If the credible interval excludes zero, then H1 is supported! Oh, that has many of the same problems as p-values?
4) Prediction! Let’s use predictive utility as a goal, and based on that construct better predictive hypothesis tests. Oh, that requires more work, and about 10 people on earth understand how to do that….
5) Let’s have everyone report everything. p-value, BF, informed BF, predictive utility metrics, posterior credible intervals. Ah, but that’s confusing and hard to interpret.

The statisticians: Exactly. Data are messy. Stop thresholding. Lots of metrics needed for decisions and inferences.

By: Bob

Bob — Mon, 02 Oct 2017 18:47:37 +0000

I think there are political-economic obstacles to the implementation of this idea. Speaking from a social science perspective, what you’re arguing for is a kind of idealism, or change that derives from shifting people’s perspective of the world. But p-values are deeply embedded in people’s material interests: they have been claiming for decades that their research is correct because p<0.05. Of course it's still possible to change this, but there has to be a coalition built that is capable of creating substantial disincentives to the current system, such as by considering research that uses discretized p-values "second-class".

In sum, perhaps a new journal that adopts these standards may be the best way to get pragmatic change by offering researchers a chance to showcase their more enlightened analyses.

By: Jonathan (another one)

Jonathan (another one) — Mon, 02 Oct 2017 18:20:04 +0000

In reply to Jonathan (another one). Sorry... 55 years. I'll give you a standard deviation back.

By: Jonathan (another one)

Jonathan (another one) — Mon, 02 Oct 2017 18:19:25 +0000

The “late Ronald Fisher?” He’s been dead for 65 years… That’s about 5 standard deviations past the consensus of how long one can use this locution… (P<<.01)

By: Stephen Martin

Stephen Martin — Mon, 02 Oct 2017 18:07:19 +0000

In reply to Phil. Abandon statistical significance != abandon p-values.

By: Phil

Phil — Mon, 02 Oct 2017 18:04:41 +0000

> First off, we don’t recommend getting rid of p-values; we recommend treating them as one piece of evidence.

Perhaps “abandon” is the wrong word to use then.

By: Stephen Martin

Stephen Martin — Mon, 02 Oct 2017 18:01:16 +0000

I am a coauthor on the Lakens et al. paper, and I agree with everything you say here. There was some heterogeneity in opinion on that commentary. From my perspective, I wish to get away from discretizing evidence and hypothesis testing in most cases (Some thought droppings here: http://srmart.in/thought-droppings-substantive-statistical-hypotheses/). There’s rarely a case where substantive hypotheses map very well onto statistical hypotheses, and because of that, I’d rather not use statistical hypothesis testing to make decisions about substantive hypotheses. Instead, we should build strong models, and make inferences about substantive claims in a more continuous manner, free of thresholds.

I’ve been railing against the Benjamin proposal /primarily/ because I pretty well hate thresholds. So long as thresholds exist, people will try their damnedest to dive past them. So long as we have this “past the threshold, evidence; not past the threshold, no evidence” mentality, the career-incentives + publication practices essentially mandate that people will threshold-dive and mischaracterize evidence (i.e., p-hacking). And any inferential quantity can be ‘hacked’; I’ve shown that it can be done, of course, with p-values, CIs, credible intervals, BFs, whatever else. It’s not particularly hard, regardless of the threshold. I even wrote a script that will find subsets where some threshold is met, just to demonstrate. And of course, it’s arbitrary; the idea that one line can delineate “true” from “not true”, or “publication-worthy” from “not publication-worthy”, is silly.
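To make the subset-search point concrete, here is a minimal sketch of what such a demonstration could look like. It is not the script mentioned above; the function names, the binary covariates, and the sample sizes are all hypothetical, and the p-value uses a normal approximation rather than a proper t-test to keep it standard-library only. The outcome is pure noise, so any subgroup that "passes" is a false positive.

```python
import random
from statistics import NormalDist, mean, variance


def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def min_subgroup_p(rng, n=400, n_covariates=6):
    """Simulate one null dataset and return the smallest subgroup p-value.

    Treatment has no effect on the outcome by construction. Each
    hypothetical binary covariate defines two subgroups, so
    2 * n_covariates comparisons are tried in total.
    """
    rows = [
        {
            "treated": i % 2 == 0,  # alternate arms so both stay balanced
            "y": rng.gauss(0, 1),   # outcome is independent of treatment
            "flags": [rng.random() < 0.5 for _ in range(n_covariates)],
        }
        for i in range(n)
    ]
    best = 1.0
    for k in range(n_covariates):
        for value in (True, False):
            subset = [r for r in rows if r["flags"][k] == value]
            treated = [r["y"] for r in subset if r["treated"]]
            control = [r["y"] for r in subset if not r["treated"]]
            best = min(best, two_sample_p(treated, control))
    return best


def dive_rate(n_sims=300, alpha=0.05, seed=0):
    """Fraction of null datasets where at least one subgroup has p < alpha."""
    rng = random.Random(seed)
    hits = sum(min_subgroup_p(rng) < alpha for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, dive_rate(alpha=alpha))
```

Under these assumptions, the fraction of null datasets with at least one "passing" subgroup should come out well above the nominal alpha at either threshold, since a lower line only means more subsets have to be tried before one crosses it.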

Finally, my argument about the Benjamin proposal was that it’s unrealistically optimistic: “In an imperfect world where p-values are used, use .005” but then they say “.005 should not be used as a publication threshold”; this is a sticking point for me as an ECR. In the same imperfect world where people misinterpret p-values and dichotomize evidence, they will use .005 as a pub threshold instead of .05, because it’s an imperfect world. Saying one should use .005 < p < .05 as 'suggestive' and < .005 as 'significant' but "don't use p-value as pub-worthiness" is too "idealistic" for me; of COURSE people will just now trichotomize evidence, and those people are now being told "only .005 signifies evidence", therefore they will just s/.05/.005 in their pub-worthiness evaluation. TL;DR: why would people who judge pub-worthiness based on whether p is less than some evidentiary threshold now stop doing so with .005 if .005 is the new evidentiary threshold?

Even though I am on the 'justify your alpha' paper, I will say that I just don't like thresholds, period. But for me, IF we are going to use thresholds, they should be justified, and hence my contribution to the paper.

Aside from the 'where should the threshold be' question, which I personally think is moot… as I said on twitter:
Things that caused psych problems: HARKing, QRPs, threshold diving, publication bias, poor stats understandings, bad incentive structure. Incentive structure requiring novel, sexy findings, 10 papers a year, only publishable if beyond threshold. No replications permitted. Things that don't need fixing: An arbitrary threshold for rejecting a hypothesis no one believes based on a fictitious universe of events.